English Pages XII, 1015 [1026] Year 2021
Advances in Intelligent Systems and Computing 1289
Kohei Arai Supriya Kapoor Rahul Bhatia Editors
Proceedings of the Future Technologies Conference (FTC) 2020, Volume 2
Advances in Intelligent Systems and Computing Volume 1289
Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

Advisory Editors
Nikhil R. Pal, Indian Statistical Institute, Kolkata, India
Rafael Bello Perez, Faculty of Mathematics, Physics and Computing, Universidad Central de Las Villas, Santa Clara, Cuba
Emilio S. Corchado, University of Salamanca, Salamanca, Spain
Hani Hagras, School of Computer Science and Electronic Engineering, University of Essex, Colchester, UK
László T. Kóczy, Department of Automation, Széchenyi István University, Gyor, Hungary
Vladik Kreinovich, Department of Computer Science, University of Texas at El Paso, El Paso, TX, USA
Chin-Teng Lin, Department of Electrical Engineering, National Chiao Tung University, Hsinchu, Taiwan
Jie Lu, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW, Australia
Patricia Melin, Graduate Program of Computer Science, Tijuana Institute of Technology, Tijuana, Mexico
Nadia Nedjah, Department of Electronics Engineering, University of Rio de Janeiro, Rio de Janeiro, Brazil
Ngoc Thanh Nguyen, Faculty of Computer Science and Management, Wrocław University of Technology, Wrocław, Poland
Jun Wang, Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong
The series “Advances in Intelligent Systems and Computing” contains publications on theory, applications, and design methods of Intelligent Systems and Intelligent Computing. Virtually all disciplines such as engineering, natural sciences, computer and information science, ICT, economics, business, e-commerce, environment, healthcare, life science are covered. The list of topics spans all the areas of modern intelligent systems and computing such as: computational intelligence, soft computing including neural networks, fuzzy systems, evolutionary computing and the fusion of these paradigms, social intelligence, ambient intelligence, computational neuroscience, artificial life, virtual worlds and society, cognitive science and systems, Perception and Vision, DNA and immune based systems, self-organizing and adaptive systems, e-Learning and teaching, human-centered and human-centric computing, recommender systems, intelligent control, robotics and mechatronics including human-machine teaming, knowledge-based paradigms, learning paradigms, machine ethics, intelligent data analysis, knowledge management, intelligent agents, intelligent decision making and support, intelligent network security, trust management, interactive entertainment, Web intelligence and multimedia. The publications within “Advances in Intelligent Systems and Computing” are primarily proceedings of important conferences, symposia and congresses. They cover significant recent developments in the field, both of a foundational and applicable character. An important characteristic feature of the series is the short publication time and world-wide distribution. This permits a rapid and broad dissemination of research results. ** Indexing: The books of this series are submitted to ISI Proceedings, EI-Compendex, DBLP, SCOPUS, Google Scholar and Springerlink **
More information about this series at http://www.springer.com/series/11156
Editors Kohei Arai Faculty of Science and Engineering Saga University Saga, Japan
Supriya Kapoor The Science and Information (SAI) Organization Bradford, West Yorkshire, UK
Rahul Bhatia The Science and Information (SAI) Organization Bradford, West Yorkshire, UK
ISSN 2194-5357 ISSN 2194-5365 (electronic) Advances in Intelligent Systems and Computing ISBN 978-3-030-63088-1 ISBN 978-3-030-63089-8 (eBook) https://doi.org/10.1007/978-3-030-63089-8 © Springer Nature Switzerland AG 2021 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Editor’s Preface
With the aim of providing a worldwide forum where international participants can share their research knowledge and ideas, the 2020 Future Technologies Conference (FTC) was held virtually on November 5–6, 2020. FTC 2020 focused on recent technological breakthroughs in the areas of computing, electronics, AI, robotics, security and communications, and mapped out directions for future researchers and collaborations.

The anarchic spirit and energy of inquiry found in our community always help researchers to produce brilliant technological advances which continue to restructure the entire computing community. FTC sees participation from such researchers, academics and technologists from leading universities, research firms, government agencies and companies, who submit their latest research at the forefront of technology and computing.

We are pleased to have reviewed and selected a volume of high-quality papers from all submissions to the conference. We hope these papers, which have gone through a double-blind review process, provide a helpful reference for all readers and scholars. For these proceedings, we finally selected 210 full papers, including six poster papers, for publication.

We would like to express our gratitude and appreciation to all of the reviewers who helped us maintain the high quality of the manuscripts included in these conference proceedings. We would also like to extend our thanks to the members of the organizing team for their hard work. We are tremendously grateful for the contributions and support received from authors, participants, keynote speakers, program committee members, session chairs, steering committee members and others in their various roles. Their valuable support, suggestions, dedicated commitment and hard work have made FTC 2020 a success. We hope that all the participants of FTC 2020 had a wonderful and fruitful time at the conference!

Kind Regards,
Kohei Arai
Contents
A Generic Scalable Method for Scheduling Distributed Energy Resources Using Parallelized Population-Based Metaheuristics
Hatem Khalloof, Wilfried Jakob, Shadi Shahoud, Clemens Duepmeier, and Veit Hagenmeyer

A Lightweight Association Rules Based Prediction Algorithm (LWRCCAR) for Context-Aware Systems in IoT Ubiquitous, Fog, and Edge Computing Environment
Asma Abdulghani Al-Shargabi and Francois Siewe

Analytical View on Non-Invasive Measurement of Moving Charge by Position Dependent Semiconductor Qubit
Krzysztof Pomorski

Steganography Application Using Combination of Movements in a 2D Video Game Platform
Ricardo Mandujano, Juan Gutierrez-Cardenas, and Marco Sotelo Monge

Implementation of Modified Talbi’s Quantum Inspired Genetic Algorithm for Travelling Salesman Problem on an IBM Quantum Computer
A. C. Gammanpila and T. G. I. Fernando

Crowd Management of Honda Celebration of Light Using Agent-based Modelling and Simulation
Ryan Ficocelli, Andrew J. Park, Lee Patterson, Frank Doditch, Valerie Spicer, Justin Song, and Herbert H. Tsang

Machine Learning Prediction of Gamer’s Private Networks (GPN®S)
Chris Mazur, Jesse Ayers, Jack Humphrey, Gaétan Hains, and Youry Khmelevsky
Estimating Home Heating and Cooling Energy Use from Monthly Utility Data
Sai Santosh Yakkali, Yanxiao Feng, Xi Chen, Zhaoji Chen, and Julian Wang

Containers Runtimes War: A Comparative Study
Ramzi Debab and Walid Khaled Hidouci

Performance of Test-and-Set Algorithms for Naming Anonymous Processes
Layla S. Aldawsari and Tom Altman

Coded Access Architectures for Dense Memory Systems
Hardik Jain, Matthew Edwards, Ethan R. Elenberg, Ankit Singh Rawat, and Sriram Vishwanath

Parallel Direct Regularized Solver for Power Circuit Applications
Yury A. Gryazin and Rick B. Spielman

Cartesian Genetic Programming for Synthesis of Optimal Control System
Askhat Diveev

Reverse Engineering: The University Distributed Services
M. Amin Yazdi and Marius Politze

Factors Affecting Students’ Motivation for Learning at the Industrial University of Ho Chi Minh City
Nguyen Binh Phuong Duy, Liu Cam Binh, and Nguyen Thi Phuong Giang

Towards Traffic Saturation Detection Based on the Hough Transform Method
Abdoulaye Sere, Cheick Amed Diloma Gabriel Traore, Yaya Traore, and Oumarou Sie

Performance Benchmarking of NewSQL Databases with Yahoo Cloud Serving Benchmark
Irina Astrova, Arne Koschel, Nils Wellermann, and Philip Klostermeyer

Internet of Art: Exploring Mobility, AR and Connectedness in Geocaching Through a Collaborative Art Experience
Pirita Ihamäki and Katriina Heljakka

Preservers of XR Technologies and Transhumanism as Dynamical, Ludic and Complex System
Sudhanshu Kumar Semwal, Ron Jackson, Chris Liang, Jemy Nguyen, and Stephen Deetman

Interview with a Robot: How to Equip the Elderly Companion Robots with Speech?
Pierre-André Buvet, Bertrand Fache, and Abdelhadi Rouam
Composite Versions of Implicit Search Algorithms for Mobile Computing
Vitaly O. Groppen

Moore’s Law is Ending: What’s Next After FinFETs
Nishi Shah

Pervasive UX Journey: Creating Blended Spaces with Augmented Reality
Adriano Bernardo Renzi, Paulo Bezerra, Matheus Correia, Kathryn Lanna, Victor Duarte, and Hugo Reil

Fast Probabilistic Consensus with Weighted Votes
Sebastian Müller, Andreas Penzkofer, Bartosz Kuśmierz, Darcy Camargo, and William J. Buchanan

A Process Mining Approach to the Analysis of the Structure of Time Series
Julio J. Valdés, Yaimara Céspedes-González, Kenneth Tapping, and Guillermo Molero-Castillo

OptiShard: An Optimized and Secured Hierarchical Blockchain Architecture
Shyam Kantesariya and Dhrubajyoti Goswami

Qute: Query by Text Search for Time Series Data
Shima Imani, Sara Alaee, and Eamonn Keogh

Establishing a Formal Benchmarking Process for Sentiment Analysis for the Bangla Language
AKM Shahariar Azad Rabby, Aminul Islam, and Fuad Rahman

Detection of Malicious HTTP Requests Using Header and URL Features
Ashley Laughter, Safwan Omari, Piotr Szczurek, and Jason Perry

Comparison of Classifiers Models for Prediction of Intimate Partner Violence
Ashly Guerrero, Juan Gutiérrez Cárdenas, Vilma Romero, and Víctor H. Ayma

Data Consortia
Eric Bax, John Donald, Melissa Gerber, Lisa Giaffo, Tanisha Sharma, Nikki Thompson, and Kimberly Williams

StreamNet: A DAG System with Streaming Graph Computing
Zhaoming Yin, Anbang Ruan, Ming Wei, Huafeng Li, Kai Yuan, Junqing Wang, Yahui Wang, Ming Ni, and Andrew Martin
A Disaster Management System on Mapping Health Risks from Agents of Disasters and Extreme Events
Christine Diane Ramos, Wilfred Luis Clamor, Carl David Aligaya, Kristin Nicole Te, Magdiyel Reuel Espiritu, and John Paolo Gonzales

Graphing Website Relationships for Risk Prediction: Identifying Derived Threats to Users Based on Known Indicators
Philip H. Kulp and Nikki E. Robinson

FLIE: Form Labeling for Information Extraction
Ela Pustulka, Thomas Hanne, Phillip Gachnang, and Pasquale Biafora

Forecasting Time Series with Multiplicative Trend Exponential Smoothing and LSTM: COVID-19 Case Study
M. A. Machaca Arceda, P. C. Laguna Laura, and V. E. Machaca Arceda

Quick Lists: Enriched Playlist Embeddings for Future Playlist Recommendation
Brett Vintch

Data Security Management Implementation Measures for Intelligent Connected Vehicles (ICVs)
Haijun Wang, Yanan Zhang, and Chao Ma

Modeling Dependence Between Air Transportation and Economic Development of Countries
Askar Boranbayev, Seilkhan Boranbayev, Tolendi Muratov, and Askar Nurbekov

Sentiment Analysis to Support Marketing Decision Making Process: A Hybrid Model
Alaa Marshan, Georgia Kansouzidou, and Athina Ioannou

Jupyter Lab Based System for Geospatial Environmental Data Processing
Nikita A. Terlych and Ramon Antonio Rodriges Zalipynis

Collaboration-Based Automatic Data Validation Framework for Enterprise Asset Management
Kennedy Oyoo

EEG Analysis for Predicting Early Autism Spectrum Disorder Traits
Parneet Kaur Saran and Matin Pirouz

Decision Support System for House Hunting: A Case Study in Chittagong
Tanjim Mahmud, Juel Sikder, and Sultana Rokeya Naher
Blockchain in Charity: Platform for Tracking Donations
Sergey Avdoshin and Elena Pesotskaya

Data Analytics-Based Maintenance Function Performance Measurement Framework and Indicator
C. I. Okonta and R. O. Edokpia

Parallel Mapper
Mustafa Hajij, Basem Assiri, and Paul Rosen

Dimensional Analysis of Dataflow Programming
William W. Wadge and Abdulmonem I. Shennat

EnPower: Haptic Interfaces for Deafblind Individuals to Interact, Communicate, and Entertain
Nimesha Ranasinghe, Pravar Jain, David Tolley, Barry Chew, Ankit Bansal, Shienny Karwita, Yen Ching-Chiuan, and Ellen Yi-Luen Do

Adaptive Customized Forward Collision Warning System Through Driver Monitoring
Marco Stang, Martin Sommer, Daniel Baumann, Yuan Zijia, and Eric Sax

JettSen: A Mobile Sensor Fusion Platform for City Knowledge Abstraction
Andres Rico, Yasushi Sakai, and Kent Larson

No Jitter Please: Effects of Rotational and Positional Jitter on 3D Mid-Air Interaction
Anil Ufuk Batmaz, Mohammad Rajabi Seraji, Johanna Kneifel, and Wolfgang Stuerzlinger

THED: A Wrist-Worn Thermal Display to Perceive Spatial Thermal Sensations in Virtual Reality
Nicholas Soucy, Nimesha Ranasinghe, Avery Rossow, Meetha Nesam James, and Roshan Peiris

Autonomous Landing of a Quadrotor with Wireless Charging
Abdallah Aljasmi, Abdulrahman Yaghmour, Omar Almatrooshi, Rached Dhaouadi, Shayok Mukhopadhyay, and Nasser Qaddoumi

Matching Algorithms in Ride Hailing Platforms
Guantao Zhao, Yinan Sun, Ziqiu Zhu, and Amrinder Arora

LLWURP: LoRa/LoRaWAN Uniform Relay Protocol with a Single Input, Single Output Device
Olivier Flauzac, Joffrey Hérard, Florent Nolot, and Philippe Cola

State Space Modeling of Tie-Line Based Microgrid for Implementation of Robust H∞ Controller
Hessam Keshtkar and Farideh Doost Mohammadi
cARd: Mixed Reality Approach for a Total Immersive Analog Game Experience
Yuxuan Liu, Yuanchu Si, Ray Lc, and Casper Harteveld

Urban Air Pollution Monitoring by Neural Networks and Wireless Sensor Networks Based on LoRa
Vanessa Alvear-Puertas, Paul D. Rosero-Montalvo, Jaime R. Michilena-Calderón, Ricardo P. Arciniega-Rocha, and Vanessa C. Erazo-Chamorro

Dreamscape: Using AI to Create Speculative VR Environments
Rishab Jain

A Gait Analysis of a Virtual Reality Inverse Treadmill
Wil J. Norton, Jacob Sauer, and David Gerhard

Computer-Vision System for Supporting the Goniometry
Oswaldo Morales Matamoros, Paola Angélica Ruiz Araiza, Rubén Alejandro Sea Torres, Jesús Jaime Moreno Escobar, and Ricardo Tejeida Padilla

An Integrated Low-Cost Monitoring Platform to Assess Air Quality Over Large Areas
L. Brilli, F. Carotenuto, B. Gioli, A. Berton, S. Profeti, G. Gualtieri, B. P. Andreini, M. Stefanelli, F. Martelli, C. Vagnoli, and A. Zaldei

Additive Manufacturing: Comparative Study of an IoT Integrated Approach and a Conventional Solution
Harshit Shandilya, Matthias Kuchta, Ahmed Elkaseer, Tobias Müller, and Steffen G. Scholz

An Integrated IoT-Blockchain Implementation for End-to-End Supply Chain
Aamir Shahzad and Kaiwen Zhang

Intelligent Roadways: Learning-Based Battery Controller Design for Smart Traffic Microgrid
Farideh Doost Mohammadi, Hessam Keshtkar, and Benjamin Gendell

Author Index
A Generic Scalable Method for Scheduling Distributed Energy Resources Using Parallelized Population-Based Metaheuristics

Hatem Khalloof, Wilfried Jakob, Shadi Shahoud, Clemens Duepmeier, and Veit Hagenmeyer

Institute of Automation and Applied Informatics (IAI), Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
{hatem.khalloof,wilfried.jakob,shadi.shahoud,clemens.duepmeier,veit.hagenmeyer}@kit.edu
Abstract. Recent years have seen an increasing integration of distributed renewable energy resources into existing electric power grids. Due to the uncertain nature of renewable energy resources, network operators are faced with new challenges in balancing load and generation. To meet the new requirements, intelligent distributed energy resource plants can be used that act as virtual power plants and provide, e.g., demand side management or flexible generation. However, calculating an adequate schedule for the unit commitment of such distributed energy resources is a complex optimization problem that is typically too complex for standard optimization algorithms if large numbers of distributed energy resources are considered. For solving such complex optimization tasks, population-based metaheuristics, e.g. evolutionary algorithms, represent powerful alternatives. Admittedly, evolutionary algorithms require a lot of computational power to solve such problems in a timely manner. One promising solution for this performance problem is the parallelization of the usually time-consuming evaluation of alternative solutions. In the present paper, a new generic and highly scalable parallel method for unit commitment of distributed energy resources using metaheuristic algorithms is presented. It is based on microservices, container virtualization and the publish/subscribe messaging paradigm for scheduling distributed energy resources. Scalability and applicability of the proposed solution are evaluated by performing parallelized optimizations in a big data environment for three distinct distributed energy resource scheduling scenarios. Thereby, unlike all other optimization methods in the literature, to the best knowledge of the authors, the new method provides cluster or cloud parallelizability and is able to deal with a comparably large number of distributed energy resources. The application of the new proposed method results in very good performance for scaling up optimization speed.
© Springer Nature Switzerland AG 2021
K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 1–21, 2021.
https://doi.org/10.1007/978-3-030-63089-8_1
Keywords: Parallel evolutionary algorithms · Microservices · Container virtualization · Parallel computing · Scalability · Scheduling distributed energy resources · Microgrid · Cluster computing
1 Introduction
Renewable Energy Resources (RERs) have recently been widely integrated into the grid, paving the way for cleaner and more environmentally friendly energy. To facilitate the adoption and management of such RERs, a transition from the traditional centralized grid (macrogrid) to more decentralized grids (microgrids) is required [18,31]. Each microgrid encompasses a localized group of Distributed Energy Resources (DERs), where each DER represents a small- or larger-scale autonomous sub-system connected to an electricity network. DERs provide renewable energy generation and/or improve the overall power system reliability by balancing energy supply and demand in a specific part of a power network through flexible load options or storage. Typically, a DER encompasses a group of small generation units such as PVs, wind turbines and diesel generators, electrical loads (demand response), e.g. electric vehicles or flexible heating systems, and possibly storage. DERs interconnect bidirectionally with the grid through one or more Point(s) of Common Coupling (PCC) [14]. Over time, the usage of DERs in smart grids will increase dramatically, providing more clean energy generated from RERs while also maintaining and increasing power quality and system reliability. The flexibility of microgrids provides significant potential to promote and integrate more DERs and to exploit their beneficial traits. Despite being highly effective, microgrids have some limitations such as lack of system protection and customer privacy. Moreover, as the number of DERs in the grid increases, and due to the uncertainties of RERs and load, the efficient control and optimal usage of DERs by finding a proper schedule for them represents a big challenge [46].

In general, scheduling problems such as scheduling DERs are NP-hard optimization problems and are therefore typically too complex to be solved by exact optimization methods, especially for large problem sizes [7,40]. Metaheuristics such as Evolutionary Algorithms (EAs) have become one of the most robust methods for solving such complex problems by finding good local optima or even the global one. The central concept of EAs is a population of individuals representing tentative solutions. The individuals encode possible solutions and are used to explore many areas of the solution space in parallel. Each individual is evaluated by a fitness function to determine its suitability as a solution to the problem. Genetic operators, namely recombination and mutation, are iteratively applied to individuals to generate new offspring for each generation until a termination criterion has been reached [16,45]. Maintaining a population of solutions and evaluating them over and over again takes a lot of computational resources for large problem sizes. Therefore, applying EAs to large-scale and NP-hard optimization problems, such as scheduling a large number of DERs, can be time-consuming and computationally expensive. To speed up EAs, three different parallelization models,
namely the Global Model (Master-Slave Model), the Fine-Grained Model and the Coarse-Grained Model, have been introduced and investigated in [9]. In the Global Model, the evaluation step is parallelized over several computing units (called slaves). In the Fine-Grained and Coarse-Grained Models, the population is structured so that the genetic operators can be applied in parallel.

Over the last decades, various approaches and frameworks, e.g. [1,3,8,10–13,17,21,23,27,28,36–39,42], have been introduced to enable the parallel processing of EAs following the above three parallelization models. For most of these frameworks, e.g. [1,3,8,10,11,17,21,27,38,39,42], a monolithic software architecture was the classical implementation approach, which decreases the modularity, usability and maintainability of the application. Recently, Big Data technologies such as Hadoop and Spark have been applied to speed up EAs, e.g. [5,10,17,21,37,39,42]. However, most of these approaches also have a monolithic architecture, which lacks hard boundaries and tends to become complex and tightly coupled as functionality is added. This limits the ability to provide simple and practical methods for plugging in problem-specific functionality, e.g. simulators, and even for integrating existing EAs.

With the emergence of modern software technologies, namely microservices, container virtualization and the publish/subscribe messaging paradigm, the parallelization of EAs in cluster and cloud environments has become even more relevant, see e.g. [12,13,37,42]. Unlike monolithic applications, a microservices-based application contains several small, autonomous, highly cohesive and decoupled services that work together to perform a specific task. Since all services are independent of each other, each microservice can use its own technology stack, allowing great flexibility. The independence of the services allows each service to scale on demand.
Microservice applications have two main features, namely modularity and technology heterogeneity, which allow the microservices to be developed by different teams using different technologies. These advantages, combined with container runtime automation, unlock the full potential of a parallelized EA by executing it on large-scale computing clusters [23].

In the present work, a new highly scalable, generic and distributed approach to schedule DERs is introduced. The microservice and container virtualization-based framework presented in [23] is adapted to carry out the required tasks. As the simulation-based evaluation is by far the most time-consuming part, the proposed framework distributes EAs according to the Global Model (Master-Slave Model) [9], where the evaluation is distributed over several computing units. On-demand deployment of services on a high-performance distributed computing infrastructure, namely a computing cluster, is supported. To validate the functionality of the proposed parallel approach, the EA GLEAM (General Learning Evolutionary Algorithm and Method) [6] is integrated into the framework. As a test task, the creation of an hourly day-ahead schedule for a simulated microgrid is chosen. For this microgrid, three use case scenarios are defined. In the first and second scenarios, 50 DERs are considered to cover the required power for a simple load profile. In the third scenario, 100 DERs are used to supply the requested power for a more complex load profile. For evaluation of the
scalability and the performance of the new solution, the framework is deployed on a cluster with four nodes, each with 32 Intel cores (2.4 GHz).

The rest of the present paper is structured as follows. The next section reviews related work on scheduling DERs based on EAs. Section 3 introduces the extended architecture of the proposed approach. Section 4 starts with a short introduction of the EA GLEAM serving as metaheuristic, continues with a description of the defined use case scenarios, and finally presents the obtained results. Section 5 concludes with a summary and planned future work.
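The EA loop and the Global Model (Master-Slave Model) described in this section can be illustrated with a minimal sketch. It is neither the authors' framework nor GLEAM: the toy fitness function, the parameter values and all names (`fitness`, `evolve`, `n_workers`) are assumptions made for illustration, and a thread pool merely stands in for the slave computing units that a real deployment would run as separate processes or containers.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def fitness(individual):
    """Toy fitness (assumption): negative squared distance to the origin, higher is better."""
    return -sum(x * x for x in individual)


def evolve(pop_size=20, genes=5, generations=30, n_workers=4, seed=0):
    """Run a small EA in which only the evaluation step is parallelized."""
    rng = random.Random(seed)
    population = [[rng.uniform(-5, 5) for _ in range(genes)] for _ in range(pop_size)]
    best_history = []
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for _ in range(generations):  # termination criterion: fixed generation count
            # Master hands the evaluation step to the workers ("slaves").
            scores = list(pool.map(fitness, population))
            pairs = sorted(zip(scores, population), key=lambda p: p[0], reverse=True)
            ranked = [ind for _, ind in pairs]
            best_history.append(pairs[0][0])
            parents = ranked[: pop_size // 2]
            # Genetic operators stay on the master in the Global Model.
            offspring = []
            while len(offspring) < pop_size - 1:
                a, b = rng.sample(parents, 2)
                cut = rng.randrange(1, genes)
                child = a[:cut] + b[cut:]  # one-point recombination
                child[rng.randrange(genes)] += rng.gauss(0, 0.3)  # mutation
                offspring.append(child)
            population = [ranked[0]] + offspring  # elitism keeps the best individual
    return ranked[0], best_history


best, history = evolve()
```

Because the best individual is carried over unchanged, the best fitness per generation in `history` never decreases. The speed-up of this model depends on the evaluation dominating the runtime, which matches the paper's motivation that simulation-based evaluation is the most expensive step.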
2 Related Work
EAs have attracted the attention of researchers for solving several optimization problems in energy systems, namely expansion planning, e.g. [32], maintenance scheduling, e.g. [43], scheduling energy resources (unit commitment) and economic dispatch [4,22,25,26,29,30,33,34,41,44], to name a few. In recent extensive overviews, Zia et al. [46] and Alvarado-Barrios et al. [2] presented comprehensive studies of different methods and techniques used in Energy Management Systems (EMS) to optimize and schedule operations. In the following, we summarize some of the works studied in [2] and [46], focusing on the use of EAs (especially Genetic Algorithms, GAs) for scheduling DERs.

For the problem of scheduling DERs, the authors of [4,25,26,29,30,34] implemented GAs to schedule the power generation in microgrids. These studies consider several microgrids with sizes ranging from six to 12 DERs and a wide variety of generators, e.g. PVs, wind turbines, microturbines and diesel engines, as well as energy storage systems (batteries). While standard GA implementations were used in [25,29,30,34], a memory-based GA was used in [4] and an improved GA combined with simulated annealing in [26] to accelerate the search for the optimal schedule. Minimizing the operation cost was the objective function in all these works; in [29], the eco-pollutant treatment costs were additionally considered. Quan et al. [34] defined five deterministic and four stochastic case studies solved by GA and concluded that GAs can provide robust solutions for stochastic optimization problems.

All the previous works focused on developing a new optimization algorithm using non-distributed EAs to achieve better solution quality. They tested their proposed solutions with microgrids consisting of small numbers of energy resources and deployed them using a monolithic software architecture. This limits the scalability and modularity of the proposed systems, which in turn restricts the ability to handle a scalable number of DERs. Despite the generally satisfactory performance of EAs for scheduling DERs, there is no generic, parallel and scalable solution that facilitates the use of EAs for scheduling a scalable energy system on a scalable runtime environment such as a cluster and that works efficiently with other components, e.g. forecasting frameworks and simulators.
A Scalable Method for Scheduling Distributed Energy Resources using EAs
Therefore, the present work introduces a highly parallel and scalable approach using a proven and established software environment based on microservices and container virtualization with full runtime automation on big computing clusters and an easy-to-use web-based management for scheduling DERs based on distributed EAs. It provides a highly flexible environment for solving the problem of scheduling DERs for external applications e.g. EMS, and allows easy communication with other needed tools such as forecasting tools and external simulators.
3
Microservice and Container Virtualization Approach for Scheduling DERs Using Parallelized EAs
In the following, the conceptual architecture of the proposed generic distributed approach for scheduling DERs based on EAs is detailed. Section 4 then introduces GLEAM, which is used as the concrete EA for evaluating the approach.
3.1
Microservice and Container Virtualization-Based Architecture
The conceptual architecture of the proposed highly scalable metaheuristic optimization solution is derived from [23]. As shown in Fig. 1, the architecture has three main tiers, namely the User Interface (UI) Tier, the Cluster Tier and the Distributed Energy Resources (DERs) Tier. On the front-end, the UI Tier is dedicated to user interaction, e.g. input for defining optimization tasks, uploading optimization models, starting and stopping optimization tasks and presenting the obtained results. The UI Tier provides a simple web-based UI to manage the interaction with the back-end tier. On the back-end, the Cluster Tier contains two sub-layers, namely the Container Layer and the Data Layer. The Container Layer contains all the services necessary to execute a parallel EA in the framework. This includes not only the services that actually execute the EA, but also services for coordinating the execution and distributing the data. Each service is realized as a microservice running in a containerized environment. The Data Layer stores all data and acts as an intermediary for message exchange. To reflect the different properties of the data, the Data Layer is subdivided into a Persistent Storage and a Temporary Storage (in-memory database). The Persistent Storage is responsible for storing the data needed for each optimization task, such as forecasting data, and the final results for further usage. The Temporary Storage, in contrast, stores the intermediate data that are exchanged between services when performing an optimization job. It also realizes a publish/subscribe message exchange pattern to improve the decoupling among the services. The DERs Tier contains abstractions of the DERs that have to be scheduled. Each DER abstraction provides the required data, e.g. the predicted generation, consumption and market price for the period considered, and the necessary technical properties of the nonrenewable sources, e.g. diesel generators. 
This data is needed by the services within the Container Layer for creating
H. Khalloof et al.
[Figure 1: three tiers. User-Interface Tier (front-end); Cluster Tier (back-end) with the Container Layer services: Coordination Service, Container Management Service, Optimization Task Coordination Service, Evolutionary Operators Service, Splitting & Joining Service, Supporting and Learning Service, Local Search Service, DERs Service, Interpretation Service 1 to n and Calculation Service 1 to n; DERs Tier.]
Fig. 1. The conceptual architecture of the proposed architecture with detailed container layer
an optimized scheduling plan. In the following, the Container Layer with the implemented services is described in greater detail.
Container Layer. For finding the optimal scheduling plan for a group of DERs using parallel EAs, the software solution needs to perform several tasks: coordinating the execution, i.e. managing the containers; starting and managing external simulators; and executing the parallel EA, which involves generating, splitting, distributing and evaluating the chromosome lists, collecting and joining the subresults to form the final results, and applying the genetic operators. These tasks are performed by ten decoupled and cohesive microservices as shown in Fig. 1. The presented microservices are adapted from [23] for scheduling DERs using parallel EAs based on the global parallelization model. Three new services, namely the Supporting and Learning Service, the DERs Service and the Interpretation Service, are added. Some of the existing microservices are modified and renamed to reflect their extended functionalities and new tasks. The Distribution & Synchronization Service is split according to its functionalities into two services, namely the Optimization Task Coordination Service and the Splitting & Joining Service. The framework is designed with this hierarchical structure to facilitate manageability and allow extensibility. The adapted and newly added microservices are described in detail below.
Coordination Service. The Coordination Service (formerly named Optimization Job Management Service [23]) is one of the core parts of the framework. It acts as a coordinator not only for multiple jobs, but also for the whole framework. After receiving the configuration, the Coordination Service asks the Container Management Service to start the required number of instances of the Interpretation Service and the Calculation Service. As soon as the required services are
booted up, the Coordination Service calls the Evolutionary Operators Service to create the requested number of chromosomes of the initial population. At the end of an optimization job, the Coordination Service receives the aggregated result and sends it to the visualization component. The Coordination Service does not act as a master in the global model; rather, it coordinates the services during initialization and termination.
Evolutionary Operators Service. This service performs the task of the master in the global model. At first, it generates the initial population when called by the Coordination Service. Then, it calculates the fitness function to identify the individuals surviving for the next generation. Furthermore, it applies the genetic operators, namely crossover and mutation, as well as the selection operation to generate the offspring.
Optimization Task Coordination Service. The Optimization Task Coordination Service (formerly part of the Distribution & Synchronization Service [23]) coordinates one optimization task by, e.g., assigning a Task ID, selecting one of the available simulation models, and starting and stopping an optimization task. It thus acts as a coordinator between the Evolutionary Operators Service and the other services.
Splitting & Joining Service. The Splitting & Joining Service (formerly part of the Distribution & Synchronization Service [23]) receives the offspring, i.e. the chromosome list, from the Evolutionary Operators Service. Afterwards, it evenly splits the population and distributes it to the Interpretation Service instances. Once the subpopulations have been distributed successfully, the Splitting & Joining Service sends a start signal, upon which the Interpretation Service instances begin the interpretation process. 
As soon as the optimization task is finished, the Splitting & Joining Service joins the partial results to create the overall result list, which matches the original list format supported by the Evolutionary Operators Service. Finally, the overall result is sent back to the Optimization Task Coordination Service, which in turn sends it back to the Evolutionary Operators Service for applying the genetic operators, namely selection, crossover and mutation.
DERs Service. The DERs Service provides other services with dynamic and static data about the DER components. Examples of dynamic data are the actual state of batteries and the forecasting data for the generation of RERs, consumption and market prices, which change continuously according to different factors such as the weather. Static data encompasses the number and type of DER components and the technical constraints of the conventional energy resources, e.g. minimum and maximum capacity, ramping limits and minimum up and down times, to name a few. Both types of data are stored in a database where each DER can insert and update its related data automatically if it has an Energy Management System Interface (EMS-IF). Otherwise, manual insertion and update are required. The Evolutionary Operators Service and the Interpretation Service instances need such data for the generation of the initial population and for the chromosome interpretation process as described later.
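The even splitting and order-preserving joining performed by the Splitting & Joining Service can be illustrated with a minimal sketch. The function names `split_evenly` and `join_results` are hypothetical and not taken from the framework; the sketch only assumes that a chromosome list is cut into contiguous chunks whose sizes differ by at most one, so that concatenating the partial results restores the original list order.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_evenly(items: Sequence[T], n_parts: int) -> List[List[T]]:
    """Split items into n_parts contiguous chunks whose sizes differ by at most one."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    base, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for k in range(n_parts):
        # The first `extra` chunks receive one additional item.
        size = base + (1 if k < extra else 0)
        chunks.append(list(items[start:start + size]))
        start += size
    return chunks


def join_results(partial_results: Sequence[Sequence[T]]) -> List[T]:
    """Concatenate partial results in chunk order, restoring the original order."""
    return [result for part in partial_results for result in part]


# Hypothetical chromosome list distributed to four Interpretation Service instances.
chromosomes = [f"chromosome-{i}" for i in range(10)]
subpopulations = split_evenly(chromosomes, 4)
chunk_sizes = [len(s) for s in subpopulations]
rejoined = join_results(subpopulations)
```

Contiguous chunks are chosen here because they make the joining step a plain concatenation; other distribution strategies would need an explicit index to restore the list format expected by the Evolutionary Operators Service.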
Interpretation Service. As its name implies, this service is responsible for interpreting the chromosomes in the context of the optimization problem to be solved. For controlling DERs, the Evolutionary Operators Service generates scheduling operations represented by genes with relative values (e.g. in percent of the maximum providable power within a given time interval), which represent the requested power share from each DER at a specific time interval. These values must be interpreted by converting them to absolute values for evaluation (simulation) purposes. For example, for RERs, the relative generation values are multiplied by the corresponding forecasting data of the RERs to obtain the absolute values of a certain schedule. Since the interpretation process can require much computing time depending on the size of the chromosomes, the framework can deploy as many Interpretation Service instances as required, allowing a parallel interpretation for scalability.
Calculation Service. The Calculation Service (or simulator) performs the calculations required to evaluate the individuals of the distributed population. It is called by the Interpretation Service for evaluating the offspring with respect to the given evaluation criteria. It takes a list of unevaluated individuals as input and outputs the related evaluation results for each individual.
Container Management Service. The Container Management Service creates as many Interpretation Service and Calculation Service instances as needed, allowing runtime scalability. After creating and initializing the required instances successfully, the Container Management Service publishes a ready signal in order to start the processing of the optimization job.
Supporting and Learning Service. Typically, EAs generate the initial population randomly, which ensures the necessary diversity of the start population and allows for an initial breadth search. 
On the other hand, using a given solution of a similar task can speed up the search at the risk of pushing the search in the direction of these solutions. Thus, only a few prior solutions should be taken as a part of the initial population. This can significantly accelerate an EA [20]. This service supports the Evolutionary Operators Service in generating the initial population and can use some already-found solutions (i.e. scheduling plans in the case of DER scheduling) for this purpose, based on predefined selection criteria.
Local Search Service. The Local Search Service is an extension of a deployed EA to support Memetic Algorithms (MAs). This service provides the ability to use appropriate local search methods or heuristics to accelerate the evolutionary search of an EA by local improvement of the offspring.
The publish/subscribe pattern is used to realize the communication between the scalable microservices, i.e. the Interpretation Service and the Calculation Service, as well as between them and the other (non-scalable) microservices. The use of the publish/subscribe messaging paradigm ensures a seamless deployment, full decoupling among the services and an efficient and reliable data exchange among them (cf. [24]). RESTful service APIs, in turn, are useful for enabling the communication among the services, specifically the microservices
[Figure 2: pseudo-code of the parallel EA based on the Global Model, with each step annotated by the responsible service: Evolutionary Operators Service, Splitting & Joining Service, Interpretation Service, DERs Service, Calculation Service and Coordination Service.]
Fig. 2. Mapping the related microservices to the pseudo-code of the parallel EAs based on the Global Model for scheduling DERs
that are non-scalable in runtime, namely the Coordination Service, the Evolutionary Operators Service, the Optimization Task Coordination Service, the Splitting & Joining Service, the Container Management Service and the DERs Service. In Fig. 2, the pseudo-code of the parallel EAs – based on the Global Model – for scheduling DERs is mapped to the related microservices.
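The mapping of the Global Model to services can be made concrete with a small, self-contained sketch. It is not the framework's implementation: threads stand in for the containerized Interpretation and Calculation Service instances, the fitness function is a toy placeholder, and all names (`evaluate_chunk`, `split`, `run_global_model`) as well as the simple elitist selection are assumptions made only for illustration.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def evaluate_chunk(chunk):
    """Slave role: evaluate one sub-population (stand-in for interpretation and simulation)."""
    # Toy fitness to be maximized; the real criteria come from the Calculation Service.
    return [-sum((g - 0.5) ** 2 for g in chromosome) for chromosome in chunk]


def split(population, parts):
    """Contiguous chunks of near-equal size, so joining restores the original order."""
    size = -(-len(population) // parts)  # ceiling division
    return [population[i:i + size] for i in range(0, len(population), size)]


def run_global_model(pop_size=20, n_genes=5, generations=30, slaves=4, seed=1):
    rng = random.Random(seed)
    # Master role: create the initial population.
    population = [[rng.random() for _ in range(n_genes)] for _ in range(pop_size)]
    best_per_generation = []
    with ThreadPoolExecutor(max_workers=slaves) as pool:
        for _ in range(generations):
            # Split, evaluate in parallel, and join the partial results in order.
            partial = pool.map(evaluate_chunk, split(population, slaves))
            fitness = [f for part in partial for f in part]
            best_per_generation.append(max(fitness))
            # Master role: selection (best half survives unchanged).
            ranked = [c for _, c in sorted(zip(fitness, population),
                                           key=lambda fc: fc[0], reverse=True)]
            parents = ranked[: pop_size // 2]
            # Master role: uniform crossover and a single-gene Gaussian mutation.
            offspring = []
            while len(parents) + len(offspring) < pop_size:
                a, b = rng.sample(parents, 2)
                child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
                child[rng.randrange(n_genes)] += rng.gauss(0.0, 0.1)
                offspring.append(child)
            population = parents + offspring
    return best_per_generation


history = run_global_model()
```

Only the evaluation step is distributed; generation, selection and the genetic operators stay with a single master, which is the defining property of the Global Model. In CPython, threads do not speed up CPU-bound evaluation, so the sketch shows the control flow rather than the achievable performance.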
4
EA GLEAM for Scheduling DERs
The process of scheduling DERs consists of a set of scheduling operations that determine which DERs are involved in the power generation process and to what extent, in order to supply the required energy per time interval. The concrete EA GLEAM [6] is integrated into the Evolutionary Operators Service for scheduling DERs, as it has proven its suitability for general scheduling problems in several different applications, e.g. [20]. GLEAM acts as the master of the Global Model: it generates the initial population, applies the genetic operators and calculates the fitness value for each chromosome. The main feature that distinguishes GLEAM from other EAs is its flexible coding, which can be used to optimize not only time-dependent processes but also other optimization problems such as scheduling and design optimization. The coding in GLEAM is based on a set of genes that are linked together to form a linear chain which represents a chromosome. The length of the chromosomes can either be fixed or altered dynamically by evolution. In the following section, the GLEAM-based solution for chromosome representation and interpretation for scheduling DERs is described.
4.1
Solution Representation and Interpretation
Typically, a scheduling problem is broken down into several scheduling operations (e.g. one or more for each DER) which are represented by genes. In GLEAM, the structure of a gene is flexible, and the number and types of its decision parameters are defined according to the nature of the optimization problem. The genes are moved as a whole by the respective genetic operators, which corresponds directly to a change in the sequence of the planning operations. Each scheduling operation is coded by one gene that consists of a fixed gene ID, which corresponds to the unit ID of the related DER, and the following decision variables: start time, duration and power fraction, as shown in Fig. 3.
         Unit ID   Start time   Duration   Power fraction (P)
Gene 1   2         7            5          0.7
Gene 2   1         9            8          0.3
Gene 3   2         10           6          0.8
Fig. 3. A chromosome with three genes encoding a possible solution to schedule two generation units
While the start time is used to determine the start time of taking energy from this DER and the duration refers to the number of time intervals to which this setting applies, the power fraction variable determines the amount of energy in relation to the forecasted maximum that can be obtained from a DER. Since the number of required scheduling operations is not known a priori, the length of each chromosome is changed dynamically by the evolution. Mutation operators such as the duplication, deletion or insertion of individual genes or gene segments are used to alter the length of chromosomes (cf. [6,19] for a detailed discussion). Chromosome Interpretation. For the construction of an allocation matrix, the genes of a chromosome are successively treated so that a later gene overwrites matrix entries of the previous ones with the same Unit ID. This is considered as the first step of chromosome interpretation by the Interpretation Service. For each chromosome list, the first task is generating an allocation matrix where the number of rows m is equal to the number of resources, i.e. DERs in this chromosome list, and the number of columns n represents the time intervals. When the building of the allocation matrix is finished, the Interpretation Service starts the second step of interpretation, namely, converting the relative values of power fraction to absolute values by multiplying each value in the allocation matrix by the corresponding values of the actual maximum power generation supplied by
the DERs Service for the corresponding time interval. As a result, a new matrix with absolute values is produced and prepared for evaluation (simulation) by the Calculation Service.
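The two interpretation steps can be sketched with the chromosome of Fig. 3. The sketch makes several assumptions that the text does not fix: time intervals are 0-based hourly indices over a 24 h horizon, genes that extend past the horizon are clipped, and the maximum power matrix `max_power` is a made-up constant of 10 kWh per unit and interval standing in for the data supplied by the DERs Service. The names `Gene`, `build_allocation_matrix` and `to_absolute` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Gene:
    unit_id: int           # 1-based ID of the DER, as in Fig. 3
    start: int             # first time interval (assumed 0-based)
    duration: int          # number of time intervals the setting applies to
    power_fraction: float  # share of the maximum providable power


def build_allocation_matrix(chromosome: List[Gene], n_units: int,
                            n_intervals: int) -> List[List[float]]:
    """Step 1: m x n matrix of relative values; later genes overwrite earlier ones."""
    alloc = [[0.0] * n_intervals for _ in range(n_units)]
    for gene in chromosome:
        end = min(gene.start + gene.duration, n_intervals)  # clipping is an assumption
        for t in range(gene.start, end):
            alloc[gene.unit_id - 1][t] = gene.power_fraction
    return alloc


def to_absolute(alloc: List[List[float]],
                max_power: List[List[float]]) -> List[List[float]]:
    """Step 2: multiply each relative value by the maximum power of that unit and interval."""
    return [[f * p for f, p in zip(row_f, row_p)]
            for row_f, row_p in zip(alloc, max_power)]


# The three genes of Fig. 3: (unit ID, start time, duration, power fraction).
chromosome = [Gene(2, 7, 5, 0.7), Gene(1, 9, 8, 0.3), Gene(2, 10, 6, 0.8)]
alloc = build_allocation_matrix(chromosome, n_units=2, n_intervals=24)

# Hypothetical forecast: 10 kWh available from every unit in every interval.
max_power = [[10.0] * 24 for _ in range(2)]
schedule = to_absolute(alloc, max_power)
```

In this example the third gene overwrites the entries of the first gene for unit 2 in intervals 10 and 11, so unit 2 is scheduled at 0.7 in intervals 7 to 9 and at 0.8 in intervals 10 to 15.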
5
Evaluation
In this section, the performance of the proposed distributed solution with respect to scalability is discussed. First, three use case scenarios are introduced in Sect. 5.1. Afterwards, the mathematical optimization problem with objective functions and constraints is formulated. Thereafter, the GLEAM configuration and the deployment of the experiment using services on a cluster are described. The interpretation of the results will then be discussed in Sect. 5.4.
5.1
Use Case Scenarios
For evaluating the scalability and generality of the proposed approach, three DER scheduling scenarios are defined, see Fig. 4. They involve different numbers of DERs and DER mixes (only PV, PV with other generation sources or storage) with predefined generation behaviour, and two different load profiles.
Fig. 4. Use Case Scenarios used for evaluation
For defining renewable generation behaviour, the hourly real power generation data for 50 and 100 PVs provided by Ausgrid [35] is used. Each DER has an EMS which manages and coordinates this DER. The EMS has a communication interface (EMS-IF) which provides flexibility of the DER in terms of the
amount of energy that can be sold at a specific time interval with a specific price to consumers, e.g. the aggregated, more or less controllable load (house symbols) as shown in Fig. 4. In the first scenario, depicted in the upper left part of Fig. 4, 50 DERs can offer power for 24 h to cover a simple load profile (load profile A) as shown in Fig. 5. For the period between 7 and 17 o’clock, the EMSs offer the power to be sold from PVs, and outside this period from other resources such as batteries or wind turbines. In the second scenario, depicted in the upper right part of Fig. 4, the same load profile (load profile A) as in the first scenario is used. However, only a part of the DERs, namely 25 DERs, can offer power to the consumer for 24 h from PVs combined with other resources. The other 25 DERs have only PVs and can therefore offer power only for 10 h between 7 and 17 o’clock. In the third use case scenario, depicted in the lower part of Fig. 4, 100 DERs provide the requested power for 24 h to cover a more complex load profile (load profile B) as shown by the blue line in Fig. 5. The main task of the distributed GLEAM is to minimize the daily bill costs of the customer by generating the optimal hourly scheduling plan for one day ahead. Additionally, there are some constraints which have to be fulfilled.
[Figure 5: hourly consumption in kWh (axis from 0 to 9) over the hours of the day for load profile A (use cases 1 and 2) and load profile B (use case 3).]
Fig. 5. The two load profiles used for evaluation
5.2
Objective Functions and Constraints
For the present evaluation, the Cost-Effective Operation Mode [2] is considered. Equation (1) defines the cost function as a nonlinear (e.g. quadratic) function to be minimized for the above three use cases.
$$\mathit{Cost} = \sum_{i=1}^{N} \sum_{t=1}^{T} C_{i,t} \ast (P_{i,t}) = \sum_{i=1}^{N} \sum_{t=1}^{T} \left[ \alpha P_{i,t}^{2} + \beta P_{i,t} + \gamma \right] \qquad (1)$$
where $N$ is the number of DERs, $T$ is the number of time intervals, $C_{i,t}$ is the price in EUR for each kWh taken from resource $i$ in time interval $t$, $P_{i,t}$ is the scheduled power in kWh taken from resource $i$ in time interval $t$, and $\alpha$, $\beta$ and $\gamma$ are the cost function coefficients defined for each DER at every time interval $t$. Since DERs should work as much as possible by only using locally supplied power, the power balance within each DER is considered an important optimization objective. For achieving such balance, an additional objective function, namely the Daily Total Deviation (DTD) function shown in Eq. (2), is defined. It is the sum of the absolute differences between the required power and the scheduled one at every time interval $t$. To arrive at a local balance, DTD should be as low as possible.
$$\mathit{DTD} = \sum_{t=1}^{T} \left| \sum_{i=1}^{N} P_{i,t} - D_t \right| \qquad (2)$$
where $D_t$ is the power requested by the load in time interval $t$ in kWh. To guarantee that the evolutionary search process preferably finds solutions without undersupply at each hour, the Hours of Undersupply (HU) function shown in Eq. (3) is defined. It represents the number of hours of undersupply and takes an integer value between zero (the optimal case with no undersupply) and $T$ (the worst case with undersupply in all hours). The initial value of HU is zero.
$$HU = \begin{cases} HU + 1, & \text{if } D_t > \sum_{i=1}^{N} P_{i,t} \\ HU, & \text{otherwise} \end{cases} \quad : t \in (1, \dots, T) \qquad (3)$$
Due to the nonlinearity of the cost and DTD functions, the optimization problem is formulated as a nonconvex mixed-integer nonlinear optimization problem. Moreover, it is a multi-objective problem:
$$\text{Minimize } [\mathit{Cost}, \mathit{DTD}] \qquad (4)$$
subject to
$$HU = 0 \qquad (5)$$
The optimization problem defined above in Eqs. (4) and (5) is an adequate problem for our evaluation, since the scheduling of DERs is an NP-hard optimization problem [40] and is formulated as a nonconvex mixed-integer nonlinear optimization problem, which needs a lot of computational power. Moreover, the numerical solution of such an optimization problem is typically too complex for exact optimization methods [7]. Hence, EAs represent a robust and powerful alternative [15]. The EA GLEAM should minimize the cost and DTD objective functions as far as possible while satisfying the constraint on HU. The Calculation Service is responsible for computing the values of the above objective functions (criteria) and the constraint for each individual, i.e. chromosome. The weighted sum defined in Eq. (7) is used to combine the results of the criteria into a fitness value. The fitness scale is set to a range between 0 and 100,000. In GLEAM, the fitness value determines the likelihood of an individual reproducing and passing on its genetic information. This happens especially when choosing a partner and deciding whether to accept or reject the offspring
when forming the next generation. In order to handle the equality constraint HU, a Penalty Function $PF$ shown in Eq. (6), which yields a value between zero and one (no undersupply), is defined. The fitness determined from the other two criteria is multiplied by this, so that an undersupply of 5 h already reduces the fitness value to a third.
$$PF(HU) = \left( 1 - \frac{1}{T}\, HU \right) \qquad (6)$$
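The three criteria of Eqs. (1) to (3), which the Calculation Service computes for each chromosome, can be sketched as follows. The sketch is a simplification under stated assumptions: the schedule `P` is an N x T list of lists, the coefficients α, β and γ are taken as scalars shared by all DERs and intervals (the paper defines them per DER and interval), and the function names and example numbers are invented for illustration.

```python
from typing import List

Matrix = List[List[float]]


def cost(P: Matrix, alpha: float, beta: float, gamma: float) -> float:
    """Eq. (1): quadratic cost summed over all DERs i and time intervals t."""
    return sum(alpha * p * p + beta * p + gamma for row in P for p in row)


def supplied_per_interval(P: Matrix) -> List[float]:
    """Total scheduled power of all DERs in each time interval."""
    return [sum(column) for column in zip(*P)]


def daily_total_deviation(P: Matrix, D: List[float]) -> float:
    """Eq. (2): sum over t of |sum_i P[i][t] - D[t]|."""
    return sum(abs(s - d) for s, d in zip(supplied_per_interval(P), D))


def hours_of_undersupply(P: Matrix, D: List[float]) -> int:
    """Eq. (3): number of intervals in which the demand exceeds the supply."""
    return sum(1 for s, d in zip(supplied_per_interval(P), D) if d > s)


# Made-up schedule for two DERs over three time intervals (kWh).
P = [[1.0, 2.0, 0.0],
     [1.0, 1.0, 1.0]]
D = [2.0, 2.5, 2.0]  # made-up demand per interval (kWh)

total_cost = cost(P, alpha=0.5, beta=1.0, gamma=0.0)
dtd = daily_total_deviation(P, D)
hu = hours_of_undersupply(P, D)
```

Here the supply per interval is 2.0, 3.0 and 1.0 kWh, so only the third interval is undersupplied and the deviation stems from the second and third intervals.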
$$\mathit{Fitness} = (0.4 \ast \mathit{Cost} + 0.6 \ast \mathit{DTD}) \ast PF(HU) \qquad (7)$$
5.3
Deployment on a Cluster
The solution is deployed on a cluster with four computing nodes, each with 32 Intel cores (2.4 GHz), 128 GB RAM and an SSD disk, which results in 128 independent computing units. The nodes are connected to each other by a LAN with 10 GBit/s bandwidth. A modern software environment based on container automation technology guarantees a seamless deployment of the microservices on the cluster. For enabling containerization, the popular open source software Docker¹ is used. Docker performs operating-system-level virtualization to isolate the applications. This is achieved by running containers on the Docker engine, which separates the applications from the underlying host operating system. For container orchestration, Kubernetes² is chosen.
[Figure 6: layered view in which the framework services (Coordination Service, Optimization Task Coordination Service, Splitting & Joining Service, Container Management Service, Evolutionary Operators Service, DERs Service, Interpretation Service, Calculation Service) are mapped onto Nodes 1 to 4, on top of the Docker and Kubernetes layer and the cluster and OS layer.]
Fig. 6. Mapping the proposed microservice architecture to the cluster with four nodes
1 www.docker.com.
2 www.kubernetes.io.
It is used in many production environments due to its flexibility and reliability. Kubernetes defines several building blocks, among them Pods, to separate “computing loads” from each other and to provide mechanisms to deploy, maintain and scale applications. A Pod is the smallest building block in the Kubernetes object model and represents one or more running processes. The highly distributable Redis³ is deployed as an in-memory database serving as a temporary storage for intermediate results and their exchange. Redis provides the publish/subscribe messaging paradigm. The persistent database storing DER forecasting data for power generation, power consumption and market prices is implemented using the InfluxDB⁴ time series database. Figure 6 shows the technological layers and an example of how the services can be mapped to the CPUs on the four nodes. It is important to note that the required instances of the Interpretation Service and the Calculation Service are distributed over all nodes dynamically.
5.4
Results and Discussion
In the following, the efficiency of the parallel method for scheduling DERs, which is built on microservices and container virtualization, is examined. In particular, the quality of the schedules achieved using distributed EAs and the scalability in cluster environments are discussed.
Resulting Schedules. To achieve a good trade-off between exploration and exploitation, appropriate strategy parameters of the EA, namely the population size and the number of generations, must be determined. For this, we perform several tests with 120 slaves and vary the population size as follows: 120, 180, 240, 300 and 420 individuals, so that each slave can process at least one individual. The number of offspring per pairing is set to eight. To limit the effort, the number of generations is set to 420. For the first use case, an optimal schedule using 21 of the available 50 DERs, depicted in Fig. 7a, is obtained with a population size of 180 individuals. For the second use case, GLEAM needs more individuals, namely 300, to explore the search space sufficiently and to find an optimal scheduling plan, shown in Fig. 7b, which uses more DERs, namely 31 of the available 50. In comparison with the first use case, the number of scheduling operations (genes) and the corresponding number of evaluations increase significantly. This is because one half of the 50 DERs can supply power for only 10 h per day, which results in a more heterogeneous search space and a further constraint for GLEAM. Figure 7c shows how the required energy is covered by the two types of DERs considered in the second use case. As shown, the DERs with only PVs contribute a generation share between 13% (at 8 o’clock) and 54% (at 14 o’clock). For the third use case with 100 DERs and a more complex load profile, a scheduling plan, shown in Fig. 7d, is found with a population size of 240 individuals.
3 www.redis.io.
4 www.influxdata.com.
[Figure 7: (a) the optimal scheduling plan for use case 1; (b) the optimal scheduling plan for use case 2; (c) proportion of energy generated by PV in use case 2, in kWh per hour of the day, split into energy supplied from DERs with only PVs and energy supplied from DERs with PVs and other resources; (d) the optimal scheduling plan for use case 3.]
Fig. 7. The optimal scheduling plans obtained for the defined use cases. Coloured rectangles represent the amount of scheduled power taken from the DERs contributing to the schedule; the blue line is the consumption profile
Framework Scalability. In order to assess the performance of the proposed software architecture, we analyze the scalability of the framework for the three use case scenarios introduced in Sect. 5.1. The number of computing units (slaves), namely the Interpretation Service instances together with the corresponding Calculation Service instances, is varied between 1 and 120, so that a minimum of two cores is left on each node for the OS.
Table 1. The computational time of the three use cases when increasing the number of computing units (slaves)

# of computing units   Computational time in minutes
                       Use case 1   Use case 2   Use case 3
1                      780          1290         4175
8                      133          237          611
24                     67           123          342
40                     55           99           275
56                     56           97           270
72                     54           90           263
88                     55           85           255
104                    44           80           250
112                    43           72           246
120                    38           66           200
Table 1 shows the scalability results of the three use cases, where the total time for each optimization job is measured. It can be concluded that the total time needed to find an optimal solution increases with the difficulty of the optimization problem. Therefore, the scheduling process for the second and third use cases takes more time than for the first one, since GLEAM performs more evaluations. Within 420 generations, GLEAM performs 548566 evaluations for the first use case, 919908 for the second one and 799462 for the third use case. By using more computing units, the framework is able to reduce the total time from 780 to 38 min in the first use case, from 1290 to 66 min in the second use case and from 4175 to 200 min in the last use case. For each use case, the computation time of the parallel implementation decreases more slowly beyond a certain point, since the communication overhead of the increased number of computing units (slaves) outweighs the performance gained by the parallelization.
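The reductions reported in Table 1 can be expressed as speedup and parallel efficiency. The short sketch below recomputes both from the table values, using the common definitions speedup = T(1)/T(n) and efficiency = speedup/n; these definitions and the variable names are not taken from the paper.

```python
# Computational times in minutes, copied from Table 1.
units = [1, 8, 24, 40, 56, 72, 88, 104, 112, 120]
minutes = {
    "Use case 1": [780, 133, 67, 55, 56, 54, 55, 44, 43, 38],
    "Use case 2": [1290, 237, 123, 99, 97, 90, 85, 80, 72, 66],
    "Use case 3": [4175, 611, 342, 275, 270, 263, 255, 250, 246, 200],
}


def speedup_and_efficiency(times, n_units):
    """Return (units, speedup, efficiency) per row, relative to the single-unit run."""
    t1 = times[0]
    return [(n, t1 / t, t1 / (t * n)) for n, t in zip(n_units, times)]


results = {name: speedup_and_efficiency(times, units)
           for name, times in minutes.items()}

for name, rows in results.items():
    n, s, e = rows[-1]
    print(f"{name}: speedup {s:.1f} on {n} units, efficiency {e:.2f}")
```

With 120 computing units, all three use cases reach a speedup of roughly 20, which corresponds to a parallel efficiency below 0.2 and is in line with the flattening of the computation time described above.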
6
Conclusion and Future Work
In this paper, a new parallel, highly modular, flexible and scalable method for scheduling Distributed Energy Resources (DERs) based on Evolutionary Algorithms (EAs) is presented. In contrast to other optimization methods, the new proposed solution enables an efficient parallelization of EAs, full runtime automation and an easy deployment on high performance computing environments such as clusters or cloud environments. Furthermore, it provides the ability to deal with a comparably large number of DERs. Modern software technologies, namely, microservices, container virtualization and the publish/subscribe messaging paradigm are exploited to develop the desired method. The architecture clearly separates functionalities related to EAs and the ones related to
scheduling DERs. For each functionality, a microservice is designed and implemented. Furthermore, container virtualization is utilized to automatically deploy the microservices on nodes of an underlying cluster to perform their tasks. The combination of microservices and container virtualization enables an easy integration of an existing EA into the framework and facilitates the communication with other required services like simulators and forecasting tools for power generation and consumption, market price and weather. Furthermore, using the publish/subscribe messaging paradigm guarantees a seamless data exchange between the scalable services which are deployed on demand. In order to evaluate the functionalities of the proposed solution, three use case scenarios with different types and numbers of DERs are defined and studied. The scalability of the framework is demonstrated by varying the number of computing units between 1 and 120. The results show that the new distributed solution is an applicable approach for scheduling a scalable number of DERs using EAs based on the mentioned three lightweight technologies in a scalable runtime environment. As part of future work, more detailed evaluations related to the communication overhead of the solution will be undertaken. Other parallelization models for EAs, such as the Coarse-Grained Model, can also be applied and compared with the presented approach. Furthermore, a comparison with a central solver like simplex or with other distributed population-based metaheuristics will be considered.
References 1. Alba, E., Almeida, F., Blesa, M., Cotta, C., D´ıaz, M., Dorta, I., Gabarr´ o, J., Le´ on, C., Luque, G., Petit, J., et al.: Efficient parallel LAN/WAN algorithms for optimization the Mallba project. Parallel Comput. 32(5–6), 415–440 (2006) 2. Alvarado-Barrios, L., Rodr´ıguez del Nozal, A., Tapia, A., Mart´ınez-Ramos, J.L., Reina, D.G.: An evolutionary computational approach for the problem of unit commitment and economic dispatch in microgrids under several operation modes. Energies 12(11), 2143 (2019) 3. Arenas, M.G., Collet, P., Eiben, A.E., Jelasity, M., Merelo, J.J., Paechter, B., Preuß, M., Schoenauer, M.: A framework for distributed evolutionary algorithms. In: International Conference on Parallel Problem Solving from Nature, pp. 665– 675. Springer (2002) 4. Askarzadeh, A.: A memory-based genetic algorithm for optimization of power generation in a microgrid. IEEE transactions on sustainable energy 9(3), 1081–1089 (2017) 5. Barba-Gonz´ alez, C., Garc´ıa-Nieto, J., Nebro, A.J., Cordero, J., Durillo, J.J., Navas-Delgado, I., Aldana-Montes, J.F.: jmetalsp: a framework for dynamic multiobjective big data optimization. Appl. Soft Comput. 69, 737–748 (2018) 6. Blume, C., Jakob, W.: Gleam-an evolutionary algorithm for planning and control based on evolution strategy. In: GECCO Late Breaking Papers, pp. 31–38 (2002) 7. Brucker, P.: Scheduling Algorithms. Springer, Cham (2007) 8. Cahon, S., Melab, N., Talbi, E.-G.: Paradiseo: a framework for the reusable design of parallel and distributed metaheuristics. J. Heurist. 10(3), 357–380 (2004) 9. Cant´ u-Paz, E.: A survey of parallel genetic algorithms. Calculateurs paralleles, reseaux et systems repartis 10(2), 141–171 (1998)
A Scalable Method for Scheduling Distributed Energy Resources using EAs
19
10. Di Martino, S., Ferrucci, F., Maggio, V., Sarro, F.: Towards migrating genetic algorithms for test data generation to the cloud. In: Software Testing in the Cloud: Perspectives on an Emerging Discipline, pp. 113–135. IGI Global (2013) 11. Fortin, F.-F., De Rainville, F.-M., Gardner, M.-A., Parizeau, M., Gagn´e, C.: Deap: evolutionary algorithms made easy. J. Mach. Learn. Res. 13, 2171–2175 (2012) 12. Garc´ıa-Valdez, M., Merelo, J.J.: evospace-js: asynchronous pool-based execution of heterogeneous metaheuristics. In: Proceedings of the Genetic and Evolutionary Computation Conference Companion, pp. 1202–1208 (2017) 13. Merelo Guerv´ os, J.J., Mario Garc´ıa-Valdez, J.: Introducing an event-based architecture for concurrent and distributed evolutionary algorithms. In: International Conference on Parallel Problem Solving from Nature, pp. 399–410. Springer (2018) 14. Guo, Y., Fang, Y., Khargonekar, P.P.: Hierarchical architecture for distributed energy resource management. In: Stochastic Optimization for Distributed Energy Resources in Smart Grids, pp. 1–8. Springer (2017) 15. Hart, W.E., Krasnogor, N., Smith, J.E.: Recent Advances in Memetic Algorithms, vol. 166. Springer, Heidelberg (2004) 16. Holland, J.H.: Outline for a logical theory of adaptive systems. J. ACM (JACM) 9(3), 297–314 (1962) 17. Huang, D.-W., Lin, J.: Scaling populations of a genetic algorithm for job shop scheduling problems using MapReduce. In: 2010 IEEE Second International Conference on Cloud Computing Technology and Science, pp. 780–785. IEEE (2010) 18. IEEE. IEEE guide for monitoring, information exchange, and control of distributed resources interconnected with electric power systems. IEEE Std 1547.3-2007, pp. 1–160, November 2007 19. Jakob, W., Gonzalez Ordiano, J.A., Ludwig, N., Mikut, R., Hagenmeyer, V.: Towards coding strategies for forecasting-based scheduling in smart grids and the energy Lab 2.0. In: Proceedings of the Genetic and Evolutionary Computation Conference Companion, pp. 
1271–1278 (2017) 20. Jakob, W., Quinte, A., Stucky, K.-U., S¨ uß, W.: Fast multi-objective scheduling of jobs to constrained resources using a hybrid evolutionary algorithm. In: International Conference on Parallel Problem Solving from Nature, pp. 1031–1040. Springer (2008) 21. Jin, C., Vecchiola, C., Buyya, R.: MRPGA: an extension of MapReduce for parallelizing genetic algorithms. In: 2008 IEEE Fourth International Conference on eScience, pp. 214–221. IEEE (2008) 22. Kazarlis, S.S., Bakirtzis, A.G., Petridis, V.: A genetic algorithm solution to the unit commitment problem. IEEE Trans. Power Syst. 11(1), 83–92 (1996) 23. Khalloof, H., Jakob, W., Liu, J., Braun, E., Shahoud, S., Duepmeier, C., Hagenmeyer, V.: A generic distributed microservices and container based framework for metaheuristic optimization. In: Proceedings of the Genetic and Evolutionary Computation Conference Companion, pp. 1363–1370 (2018) 24. Khalloof, H., Ostheimer, P., Jakob, W., Shahoud, S., Duepmeier, C., Hagenmeyer, V.: A distributed modular scalable and generic framework for parallelizing population-based metaheuristics. In: International Conference on Parallel Processing and Applied Mathematics, pp. 432–444. Springer (2019) 25. Li, H., Zang, C., Zeng, P., Yu, H., Li, Z.: A genetic algorithm-based hybrid optimization approach for microgrid energy management. In: 2015 IEEE International Conference on Cyber Technology in Automation, Control, and Intelligent Systems (CYBER), pp. 1474–1478. IEEE (2015) 26. Liang, H.Z., Gooi, H.B.: Unit commitment in microgrids by improved genetic algorithm. In: 2010 Conference Proceedings IPEC, pp. 842–847. IEEE (2010)
20
H. Khalloof et al.
27. Merelo, J.J., Fernandes, C.M., Mora, A.M., Esparcia, A.I.: SofEA: a pool-based framework for evolutionary algorithms using couchDB. In: Proceedings of the 14th Annual Conference Companion on Genetic and Evolutionary Computation, pp. 109–116 (2012) 28. Meri, K., Arenas, M.G., Mora, A.M., Merelo, J.J., Castillo, P.A., Garc´ıa-S´ anchez, P., Laredo, J.L.J.: Cloud-based evolutionary algorithms: an algorithmic study. Nat. Comput. 12(2), 135–147 (2013) 29. Nemati, M., Bennimar, K., Tenbohlen, S., Tao, L., Mueller, H., Braun, M.: Optimization of microgrids short term operation based on an enhanced genetic algorithm. In: 2015 IEEE Eindhoven PowerTech, pp. 1–6. IEEE (2015) 30. Nemati, M., Braun, M., Tenbohlen, S.: Optimization of unit commitment and economic dispatch in microgrids based on genetic algorithm and mixed integer linear programming. Appl. Energy 210, 944–963 (2018) 31. Rajmohan Padiyar, K., Kulkarni, A.M., Dynamics and Control of Electric Transmission and Microgrids. Wiley Online Library (2019) 32. Park, J.-B., Kim, J.-H., Lee, K.Y.: Generation expansion planning in a competitive environment using a genetic algorithm. In: IEEE Power Engineering Society Summer Meeting, vol. 3, pp. 1169–1172. IEEE (2002) 33. Pereira-Neto, A., Unsihuay, C., Saavedra, O.R.: Efficient evolutionary strategy optimisation procedure to solve the nonconvex economic dispatch problem with generator constraints. IEE Proc. Gener. Trans. Distrib. 152(5), 653–660 (2005) 34. Quan, H., Srinivasan, D., Khosravi, A.: Incorporating wind power forecast uncertainties into stochastic unit commitment using neural network-based prediction intervals. IEEE Trans. Neural Netw. Learn. Syst. 26(9), 2123–2135 (2014) 35. Ratnam, E.L., Weller, S.R., Kellett, C.M., Murray, A.T.: Residential load and rooftop PV generation: an Australian distribution network dataset. Int. J. Sustain. Energy 36(8), 787–806 (2017) 36. 
Roy, G., Lee, H., Welch, J.L., Zhao, Y., Pandey, V., Thurston, D.: A distributed pool architecture for genetic algorithms. In: 2009 IEEE Congress on Evolutionary Computation, pp. 1177–1184. IEEE (2009) 37. Salza, P., Ferrucci, F.: An approach for parallel genetic algorithms in the cloud using software containers. arXiv preprint arXiv:1606.06961 (2016) 38. Salza, P., Ferrucci, F., Sarro, F.: elephant56: design and implementation of a parallel genetic algorithms framework on Hadoop MapReduce. In: Proceedings of the 2016 on Genetic and Evolutionary Computation Conference Companion, pp. 1315– 1322 (2016) 39. Sherry, D., Veeramachaneni, K., McDermott, J., O’Reilly, U.-M.: Flex-GP: genetic programming on the cloud. In: European Conference on the Applications of Evolutionary Computation, pp. 477–486. Springer (2012) 40. Tseng, C.L.: On Power System Generation Unit Commitment Problems. University of California, Berkeley (1996) 41. Valenzuela, J., Smith, A.E.: A seeded memetic algorithm for large unit commitment problems. J. Heurist. 8(2), 173–195 (2002) 42. Verma, A., Llor` a, X., Goldberg, D.E., Campbell, R.H.: Scaling genetic algorithms using MapReduce. In: 2009 Ninth International Conference on Intelligent Systems Design and Applications, pp. 13–18. IEEE (2009) ˇ sevski, A., Cepin, ˇ 43. Volkanovski, A., Mavko, B., Boˇsevski, T., Cauˇ M.: Genetic algorithm optimisation of the maintenance scheduling of generating units in a power system. Reliab. Eng. Syst. Saf. 93(6), 779–789 (2008) 44. Walters, D.C., Sheble, G.B.: Genetic algorithm solution of economic dispatch with valve point loading. IEEE Trans. Power Syst. 8(3), 1325–1332 (1993)
A Scalable Method for Scheduling Distributed Energy Resources using EAs
21
45. Whitley, D.: A genetic algorithm tutorial. Stat. Comput. 4(2), 65–85 (1994) 46. Zia, F., Elbouchikhi, E., Benbouzid, M.: Microgrids energy management systems: a critical review on methods, solutions, and prospects. Appl. Energy 222, 1033–1055 (2018)
A Lightweight Association Rules Based Prediction Algorithm (LWRCCAR) for Context-Aware Systems in IoT Ubiquitous, Fog, and Edge Computing Environment

Asma Abdulghani Al-Shargabi1(✉) and Francois Siewe2

1 Information Technology Department, Faculty of Computer, Qassim University, Buraydah, Saudi Arabia
[email protected], [email protected]
2 Software Technology Research Laboratory (STRL), De Montfort University, Leicester, UK
[email protected]
Abstract. Proactivity is a main aspect of ubiquitous context-aware systems in IoT environments. Such systems need lightweight intelligent prediction techniques, especially in fog and edge computing environments, where device capabilities are limited. On the other hand, the data that ubiquitous context-aware systems depend on for learning is big. This paper proposes a lightweight prediction algorithm to help such systems work effectively. The proposed algorithm is an improvement of the RCCAR algorithm, which uses association rules for prediction. The contribution of this paper is to minimize the number of association rules used by giving priority to associations produced from high-order itemsets over lower-order ones. The prediction is scored and formulated mathematically using the confidence measure of association rules. A real dataset is used in experiments covering many different scenarios. The proposed algorithm achieves good results with a reasonable prediction score. For future work, extensive experiments with many datasets are recommended.

Keywords: RCCAR · Prediction · Pervasive computing · Ubiquitous computing · Fog computing · Edge computing · Internet of Things · IoT · Context-aware systems · Association rules · Data mining · Big data analytics
© Springer Nature Switzerland AG 2021 K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 22–30, 2021. https://doi.org/10.1007/978-3-030-63089-8_2

1 Introduction

In the Internet of Things (IoT) era, many intelligent technologies have become available to facilitate tasks and improve human-computer interaction, especially in fog computing and edge computing [1]. As a consequence, data of huge volume, high velocity, and growing complexity is produced daily. IoT applications, with their numerous and varied sensors, wearable devices, and smartphones, are one rich source of this data [2]. Valuable knowledge is embedded implicitly in this big data.

Context-Aware Systems (CASs) are a vital component of pervasive/ubiquitous computing within the IoT environment. CASs are systems that are aware of their situation (or context) in their physical, virtual (ICT), and user environment. CASs gather context from large and varied sensor networks. To perform their functions well, CASs need prediction in order to be proactive and to ensure that the gathered context is correct, especially in real-time CASs [22]. In such environments, many algorithms have been developed for mining useful patterns to improve IoT CASs and their applications and to make them more intelligent and proactive. These algorithms include a varied collection of serial, parallel, distributed, and MapReduce-based algorithms that run on local computers, in distributed and parallel environments, and on clusters, grids, clouds, and/or data centers [1, 2].

Data mining on big IoT data has been a basic component of pervasive computing, used to discover interesting patterns for many IoT-related applications such as smart cities, urban analytics, and mining social network data to detect communities [2]. The main data mining tasks on IoT big data include (1) frequent pattern mining and (2) classification, or supervised learning. Once frequent patterns are discovered, they can be used to form association rules or to build an associative classifier for classification and/or prediction. Alternatively, other classifiers such as decision trees and random forests can be built [2].

This research employs association rules (AR) to develop predictive systems for ubiquitous, fog, and edge computing. In fog and edge computing, the amount of data used to build predictive models must be kept small. The RCCAR algorithm employs AR to build a predictive model using as little data as possible [7–9]. The proposed algorithm is an improvement of RCCAR: a more efficient, lightweight version intended for IoT fog and edge computing environments.

The rest of this paper is organized as follows: Sect. 2 introduces the related work. An overview of the RCCAR algorithm is given in Sect. 3. Section 4 presents the proposed algorithm. Section 5 describes the implementation, experiments, and findings. Finally, Sect. 6 concludes the work and discusses potential future work.
2 Related Work

Association rule mining is a technique that finds frequent patterns in a given database. Patterns are represented as rules with two main measures, rule support and rule confidence. These rules, called association rules, describe the patterns in the data. As the name suggests, they describe associations between data values, that is, which values occur together [11, 12]. Association rules are formulated as follows [12]:

$$x \Rightarrow y \quad [\text{support} = a\%,\ \text{confidence} = b\%]$$

which means that the occurrence of value x is associated with the occurrence of value y, with support a and confidence b. Rule support and confidence respectively reflect the usefulness and certainty of discovered rules. A support of a% means that x and y occur together in a% of all the transactions under analysis. A confidence of b% means that b% of the occurrences of x come with y. Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum
confidence threshold. These thresholds can be set by users or domain experts [11, 12]. Support and confidence are calculated as follows [12]:

$$\text{Support}(x \Rightarrow y) = \frac{\text{count}(x \cap y)}{n}$$

$$\text{Confidence}(x \Rightarrow y) = \frac{\text{count}(x \cap y)}{\text{count}(x)}$$

where count(x ∩ y) is the number of transactions in which x and y occur together and n is the number of transactions in the database.

AR has proved successful in different IoT applications. For instance, the authors of [1] used an AR-based algorithm for intelligence in the Internet of Vehicles (IoV). In [3], an occupational adaptive classifier based on association rules is designed using LinkedIn data to support other areas of research and industrial applications, such as career adaptability judgment, interest training, career development, and career recommendation. In [4], an association classification algorithm called PWCAC is introduced to help solve an important web security problem, phishing websites; it proposes a new method for generating the association rules and a new method for predicting the class of new data. In education, [5] applied AR to find useful knowledge for supporting the planning of admission policies. In [6], AR is applied in a method for predicting a city's overall traffic state, with the traffic system of Shanghai as a case study. The authors of [14] introduced a method to analyze users' workflow preferences using association rules. The method filters the rules by confidence and lift value to obtain the preferred workflow for a specific user. Lift is an association rule measure that reflects how strongly the occurrences of two items affect each other. The authors of [1] also developed an accurate weighted interest degree recommendation algorithm using association rules; however, this algorithm still uses all parameters in the data. In [13], an enhancement of AR is introduced that combines the popular Apriori algorithm with a probabilistic graph model to improve prediction accuracy.
An interesting contribution is proposed in [15], where a novel association rule based classification and regression algorithm is introduced. Its main contribution is reducing the time complexity of the association rules technique and thus significantly improving AR efficiency, which is one of its main drawbacks. First, a clustering algorithm is used to discretize the transaction data. Then, the discrete transaction database is transformed into a Boolean matrix based on matrix operations. A new frequent itemsets tree structure is proposed, and all frequent itemsets are obtained once the tree is constructed. Finally, the rules are reconstructed. As a result, the database is scanned only once, whereas the Apriori algorithm scans the database k times, where k is the maximum order of item combinations in the database. A similar contribution is made in [16], which improves the time complexity of the Apriori algorithm. The method transforms the transaction database into an upper triangular matrix (UTM) with support counts. With this method, the frequent itemsets can be generated by scanning the database only once, which greatly improves the efficiency of the algorithm and reduces the computational complexity. The authors of [17] introduced a new method for filtering ARs. Each association rule is given a weight that indicates its importance based on both its support and its confidence. This weighting method achieves good prediction accuracy. In [18], the authors introduced an algorithm called
MDPrefR to minimize the number of important ARs. The main idea of this algorithm is to create a reference rule that attains the best values of three AR measures (support, confidence, and Pearl), then to calculate the degree of similarity of every rule to this reference rule, and finally to remove the rules that are not similar to it. A method called PSTMiner is introduced in [19] to prune the AR model and thus improve efficiency. It uses the chi-square statistical measure to prune redundant rules. In [20], the authors modify the classic Apriori algorithm by filtering the produced rules according to a minimum support value while generating the candidate itemsets. The authors of [21] proposed an n-cross validation technique to eliminate irrelevant association rules.
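The support and confidence measures defined earlier in this section can be illustrated with a short sketch. It is only a minimal example under stated assumptions: the transactions, the item names (such as "month=Jul"), and the function names are hypothetical and are not taken from the paper or from any library.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(transactions: List[Transaction], x: Iterable[str], y: Iterable[str]) -> float:
    """Fraction of all transactions that contain every item of x and every item of y."""
    items = frozenset(x) | frozenset(y)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions: List[Transaction], x: Iterable[str], y: Iterable[str]) -> float:
    """Among the transactions that contain x, the fraction that also contain y."""
    x_items = frozenset(x)
    both = x_items | frozenset(y)
    count_x = sum(x_items <= t for t in transactions)
    count_xy = sum(both <= t for t in transactions)
    return count_xy / count_x if count_x else 0.0


# Hypothetical toy database of four transactions (illustration only).
transactions = [
    frozenset({"month=Jul", "rain=low", "tmax=high"}),
    frozenset({"month=Jul", "rain=low", "tmax=high"}),
    frozenset({"month=Jul", "rain=high", "tmax=moderate"}),
    frozenset({"month=Jan", "rain=high", "tmax=low"}),
]

s = support(transactions, {"month=Jul"}, {"tmax=high"})     # 2 of 4 transactions
c = confidence(transactions, {"month=Jul"}, {"tmax=high"})  # 2 of 3 transactions with month=Jul
print(f"support={s:.2f}, confidence={c:.2f}")
```

A rule would then be kept only if both values reach the minimum thresholds chosen by the user or domain expert.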
3 Overview of RCCAR Algorithm

RCCAR is an algorithm that uses association rules to select one of several conflicting values of the same data item reported by different sensors [7]. It is thus used primarily for resolving conflicts in sensor data. It was developed as a lightweight algorithm for context-aware systems in ubiquitous computing environments. RCCAR was later extended to handle other shortcomings of sensor data besides conflicts, such as missing and erroneous values [7, 9]. RCCAR predicts the value, or selects the correct value, using association rules. RCCAR was also improved to reduce its time complexity and make it more efficient in ubiquitous computing environments, by choosing the most important variables with a decision tree before producing the associations [8]. RCCAR relies on the fact that the occurrence of any data item (feature value) is associated with the occurrences of other data items. It offers a lightweight solution for context data prediction in ubiquitous context-aware systems: it predicts a context data item by studying the associations that link this target item to the other items that have occurred. RCCAR is introduced and formulated in [7, 8]. Its mathematical model is built on that of association rules. RCCAR uses the confidence scores of association rules produced beforehand, off-line, for the data under investigation. It uses all association rules related to the investigated context data. The predicted value is the value with the largest prediction score, which is obtained by summing the confidences of all these association rules. Associations of all orders are used to obtain the confirmation. Formula 1 describes this mechanism [7]:

$$\text{Prediction Score}(x) = \sum_{k=2}^{w} \sum_{i=1}^{d_k(x)} \text{confidence}(y_i(k) \Rightarrow x) \quad (1)$$

where x is the predicted value, y_i(k) are the values of the other variables, w is the maximum itemset size occurring in the context database for the current context values, and d_k(x) is the number of associations of order k. The index k starts at 2 because associations contain at least two elements. Only associations with x on the right-hand side are included in the summation [7].
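As a rough illustration of Formula 1, the sketch below sums the confidences of all rules, of every order, whose left-hand side is satisfied by the current context and whose right-hand side is a candidate value. It is not the authors' implementation: the `Rule` type, the function names, and the rule confidences are hypothetical and chosen only for the example.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, NamedTuple, Optional, Set


class Rule(NamedTuple):
    antecedent: FrozenSet[str]  # left-hand side items (the y values)
    consequent: str             # right-hand side value (a candidate x)
    confidence: float


def rccar_scores(rules: List[Rule], context: Set[str]) -> Dict[str, float]:
    """Sum rule confidences per candidate value over all orders (Formula 1)."""
    scores: Dict[str, float] = defaultdict(float)
    for rule in rules:
        # A rule contributes only if all of its antecedent items occur in the context.
        if rule.antecedent and rule.antecedent <= context:
            scores[rule.consequent] += rule.confidence
    return dict(scores)


def rccar_predict(rules: List[Rule], context: Set[str]) -> Optional[str]:
    """Return the candidate value with the largest prediction score, if any."""
    scores = rccar_scores(rules, context)
    return max(scores, key=scores.get) if scores else None


# Hypothetical rules mined off-line (confidences are made up for illustration).
rules = [
    Rule(frozenset({"month=Jul"}), "tmax=high", 0.6),
    Rule(frozenset({"month=Jul"}), "tmax=moderate", 0.4),
    Rule(frozenset({"rain=low"}), "tmax=high", 0.7),
    Rule(frozenset({"month=Jul", "rain=low"}), "tmax=high", 0.9),
    Rule(frozenset({"month=Jul", "rain=low"}), "tmax=moderate", 0.1),
    Rule(frozenset({"month=Jan"}), "tmax=low", 0.8),  # does not match the context
]
context = {"month=Jul", "rain=low"}

scores = rccar_scores(rules, context)  # high: 0.6 + 0.7 + 0.9, moderate: 0.4 + 0.1
predicted = rccar_predict(rules, context)
print(scores, predicted)
```

Every matching rule is visited, so the cost grows with the total number of rules across all orders.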
4 The Proposed Algorithm - LWRCCAR

The proposed algorithm is called Lightweight RCCAR (LWRCCAR). It is an improvement of the RCCAR algorithm, which is based on association rules. It aims to simplify the RCCAR mechanism by using only the association rules of the largest itemset in prediction. This should still achieve good prediction while improving the efficiency of RCCAR by simplifying the calculations and reducing the time complexity. The proposed approach does not need the associations of all orders; only the associations of the largest order are considered. A more efficient version of RCCAR can be employed in edge computing, and in ubiquitous computing environments in general, where time is a critical factor. Like RCCAR, the proposed approach uses association rules and their confidence measure to formulate the solution. The predicted value, the one with the maximum prediction score, is given by Formula 2 and Formula 3:

$$\text{Predicted value}(x) = x_j \quad (2)$$

$$\text{Prediction Score}(x_j) = \max_{j} \sum_{i=1}^{n} \text{confidence}(y_i \Rightarrow x_j) \quad (3)$$

where x is the predicted variable, x_j ranges over the values of x, n is the number of largest-order association rules related to x_j, and y_i is the antecedent of the i-th such rule.
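Formulas 2 and 3 can be sketched in the same style. The example below keeps only the matching rules of the largest order and returns the value with the highest summed confidence. It is a minimal sketch, not the authors' code: the `Rule` type, the function name, and the confidences are hypothetical, and the order of a rule is taken here as the number of items in its antecedent, which is an assumption about the paper's terminology that does not change which rules rank highest.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, NamedTuple, Optional, Set, Tuple


class Rule(NamedTuple):
    antecedent: FrozenSet[str]  # left-hand side items (the y values)
    consequent: str             # candidate value x_j on the right-hand side
    confidence: float


def lwrccar_predict(rules: List[Rule], context: Set[str]) -> Optional[Tuple[str, float]]:
    """Return (x_j, score) using only the largest-order rules that match the context."""
    matching = [r for r in rules if r.antecedent and r.antecedent <= context]
    if not matching:
        return None
    top_order = max(len(r.antecedent) for r in matching)
    scores: Dict[str, float] = defaultdict(float)
    for r in matching:
        if len(r.antecedent) == top_order:  # lower-order rules are ignored
            scores[r.consequent] += r.confidence
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical rules (confidences are made up for illustration).
rules = [
    Rule(frozenset({"month=Jul", "tmin=high", "rain=low"}), "tmax=high", 0.85),
    Rule(frozenset({"month=Jul", "tmin=high", "rain=low"}), "tmax=moderate", 0.15),
    Rule(frozenset({"month=Jul"}), "tmax=moderate", 0.9),  # lower order, not used
]
context = {"month=Jul", "tmin=high", "rain=low"}

result = lwrccar_predict(rules, context)
print(result)
```

Compared with summing over all orders, only one group of rules is scored, which is where the reduction in computation comes from.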
5 Implementation, Experiments, and Results

The proposed solution is implemented using the WEKA tool, a simple and lightweight tool for most data mining tasks. Many experiments were conducted to evaluate the proposed solution. They include prediction using the largest-order associations (as the new approach proposes), the next-order associations, and the 1st-order associations, in order to compare the results and show that there is no need to use all association rules as the original RCCAR did. The dataset used is well known and suits the nature of the approach. It consists of real historical weather data covering a long time span, has a suitable volume for prediction, and has many dimensions (features). The dataset is the Southampton monthly weather historical data (1855–1999) [10], officially recorded by the Southampton Weather Station. It contains 1744 instances. The variables are year, month, maximum temperature, minimum temperature, air frost, rainfall, and sunshine hours. The data was cleaned and preprocessed before the experiments. The experiments used three variables to predict the maximum temperature: month, minimum temperature, and rainfall. The continuous variables were transformed into nominal variables, as association rules require.

The findings are interesting, and the proposed approach achieves remarkable success. As shown in Fig. 1, using 3-order associations achieves the best results compared with lower-order associations, and this holds across different data time depths. Consequently, there is no need to use the confirmation from all associations as RCCAR did; the largest order is enough. This result improves the efficiency of RCCAR.

Fig. 1. Prediction of “temperature max degree” using different orders of associations of other variables

The prediction results with a 10-year time depth motivated a further study with smaller time depths, as depicted in Fig. 2. These experiments covered 5, 4, 3, and 2 years. The best result was obtained with a 2-year time depth. This was expected, since weather is a natural phenomenon that changes slowly over the years. Examining the results in Fig. 3, the best predictions were obtained with 2 years, 5 years, and 50 years. This is consistent with our basic knowledge of climate, according to which the climate has a natural cycle of roughly 50 years.

Fig. 2. Prediction of “temperature max degree” using small data time depth
[Chart "Prediction Using 3-order Associations": values from 0 to 1 for data time depths of 100, 50, 30, 20, 10, 5, and 2 years; legend: accepted (true value), very low, low, moderate, high, very high.]

Fig. 3. Prediction of “temperature max degree” using 3-order associations with other variables and different data time depth.
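The pre-processing step mentioned in this section, transforming continuous variables into nominal ones, could look like the sketch below. It assumes equal-width binning into five labels that mirror the legend of Fig. 3; the paper does not state the binning method or the bin boundaries, so the scheme, the function name, and the sample values are assumptions made for illustration.

```python
from typing import List, Sequence

# Labels mirror the legend of Fig. 3; the binning scheme itself is an assumption.
LABELS = ("very low", "low", "moderate", "high", "very high")


def discretize(values: Sequence[float], labels: Sequence[str] = LABELS) -> List[str]:
    """Map continuous values to nominal labels using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels)
    if width == 0:
        # All values are identical, so they all fall into the first bin.
        return [labels[0]] * len(values)
    result = []
    for v in values:
        # The maximum value would land one past the last bin, so clamp the index.
        index = min(int((v - lo) / width), len(labels) - 1)
        result.append(labels[index])
    return result


# Hypothetical monthly maximum temperatures in degrees Celsius.
tmax = [5.0, 10.0, 15.0, 20.0, 25.0]
nominal = discretize(tmax)
print(nominal)
```

Each nominal value can then be treated as an item (for example "tmax=high") when mining association rules.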
6 Conclusions and Future Work

As the results show, the proposed approach performs well compared with RCCAR. A careful examination of the results leads to further conclusions: (1) choosing an appropriate time depth (frame) for the prediction improves the results; (2) prediction with the proposed approach does not depend on the immediately preceding data, so it is expected to work well with any type of data in any context; in other words, the data need not exhibit a linear relationship between the target feature and the other features; (3) the resulting prediction model is more efficient than the original RCCAR. For future work, it is recommended to compare the proposed solution with other prediction approaches. It would also be useful to examine the solution with other types of data and in different contexts. Finally, because RCCAR was introduced for prediction in ubiquitous context-aware systems, the proposed approach is expected to work better in this environment in terms of efficiency, and to work well even in edge computing applications.
References
1. Lin, F., Zhou, Y., You, I., Lin, J., An, X., Lü, X.: Content recommendation algorithm for intelligent navigator in fog computing based IoT environment. IEEE Access 7, 53677–53686 (2019). Special Section on Collaboration for Internet of Things
2. Braun, P., Cuzzocrea, A., Leung, C.K., Pazdor, A.G.M., Souza, J., Tanbeer, S.K.: Pattern mining from big IoT data with fog computing: models, issues, and research perspectives. In: IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID) (2019)
3. Si, H., Haopeng, W., Zhou, L., Wan, J., Xiong, N., Zhang, J.: An industrial analysis technology about occupational adaptability and association rules in social networks. IEEE Trans. Ind. 3, 1698–1707 (2019). https://doi.org/10.1109/tii.2019.2926574
4. Alqahtani, M.: Phishing websites classification using association classification (PWCAC). In: IEEE International Conference on Computer and Information Sciences (ICCIS) (2019)
5. Rojanavasu, P.: Educational data analytics using association rule mining and classification. In: IEEE The 4th International Conference on Digital Arts, Media and Technology and 2nd ECTI Northern Section Conference on Electrical, Electronics, Computer and Telecommunications Engineering (2019)
6. Yuan, C., Yu, X., Li, D., Xi, Y.: Overall traffic mode prediction by VOMM approach and AR mining algorithm with large-scale data. IEEE Trans. Intell. Transp. Syst. 20(4), 1508–1516 (2019)
7. Al-Shargabi, A.A., Siewe, F.: Resolving context conflicts using association rules (RCCAR) to improve quality of context-aware. In: 8th IEEE International Conference on Computer Science & Education (ICCSE 2013), Colombo, Sri Lanka (2013)
8. Al-Shargabi, A.A., Siewe, F.: An efficient approach for realizing RCCAR to resolve context conflicts in context-aware systems. In: 2nd Conference on Control, Systems & Industrial Informatics (ICCSII 2013), Bandung, Indonesia (2013)
9. Al-Shargabi, A.A.G., Siewe, F.: A multi-layer framework for quality of context in ubiquitous context-aware systems. Int. J. Pervasive Comput. Commun. (2018)
10. http://www.southamptonweather.co.uk/sotonhist.php
11. Akulwar, P., Pardeshi, S., Kamble, A.: Survey on different data mining techniques for prediction. In: Proceedings of the Second International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC 2018) (2018)
12. Han, J., Kamber, M., Pei, J.: Data Mining Concepts and Techniques, 3rd edn. Morgan Kaufmann Publishers, Elsevier (2016)
13. Sheng, G., Hou, H., Jiang, X., Chen, Y.: A novel association rule mining method of big data for power transformers state parameters based on probabilistic graph model. IEEE Trans. Smart Grid 9(2), 695–702 (2018)
14. Ahmadon, M.A.B., Yamaguchi, S.: User workflow preference analysis based on confidence and lift value of association rule. In: IEEE 7th Global Conference on Consumer Electronics (GCCE) (2018)
15. Wang, L., Zhu, H., Huang, R.: A novel association rule prediction algorithm for classification and regression. In: IEEE Conference on Decision and Control (CDC), Miami Beach, USA (2018)
16. Cao, M., Guo, C.: Research on the improvement of association rule algorithm for power monitoring data mining. In: IEEE 10th International Symposium on Computational Intelligence and Design (2017)
17. Refai, M.H., Yusof, Y.: Group-based approach for class prediction in associative classification. In: IEEE 3rd International Conference on Information Retrieval and Knowledge Management (2016)
18. Mohammed, M., Taoufiq, G., Youssef, B., Mohamed, E.F.: A new way to select the valuable association rules. In: IEEE 7th International Conference on Knowledge and Smart Technology (KST) (2015)
19. Lakshmi, K.P., Reddy, C.R.K.: Fast rule-based prediction of data streams using associative classification mining. In: IEEE 5th International Conference on IT Convergence and Security (ICITCS) (2015)
20. Goyal, L.M., Beg, M.M.S.: An efficient filtration approach for mining association rules. In: IEEE International Conference on Computing for Sustainable Global Development (INDIACom) (2014)
21. Rameshkumar, K., Sambath, M., Ravi, S.: Relevant association rule mining from medical dataset using new irrelevant rule elimination technique. In: IEEE International Conference on Information Communication and Embedded Systems (ICICES) (2013)
22. Al-Shargabi, A.A.Q., Siewe, F., Zahary, A.: Quality of context in context-aware systems. In: International Journal of Pervasive Computing and Communications, UK (2018)
Analytical View on Non-Invasive Measurement of Moving Charge by Position Dependent Semiconductor Qubit

Krzysztof Pomorski1,2,3(✉)

1 School of Computer Science, University College Dublin, Dublin, Ireland
[email protected]
2 Wroclaw School of Information Technology, Wroclaw, Poland
3 Quantum Hardware Systems, Lodz, Poland
Abstract. Detection of a moving charge in free space is presented in the framework of single electron CMOS devices. It opens the perspective for the construction of a new type of detector for beam diagnostics in accelerators. A general phenomenological model of noise acting on a position-based qubit implemented in semiconductor quantum dots is given in the framework of a simplistic tight-binding model.
Keywords: Moving charge · Position-based qubit detector

1 Motivation for Weak Measurement of Moving Charged Particles
In nature, matter carries electric charge, and the interaction between charged particles is the foundation of atoms and molecules. Various experiments are currently conducted with charged particles, for example in the CERN and DESY accelerators. The controlled movement of charged particles such as protons, electrons, light and heavy ions or other elementary particles takes place under static and time-dependent electric and magnetic fields generated in a well-organized pattern that follows from the Maxwell equations. In particular, one uses magnetic focusing to keep the accelerator beam confined to a finite geometrical space. Moving charges generate electric and magnetic fields, which is reflected in a time-dependent electric field and a time-dependent vector potential. Such time-dependent fields can be sensed by various types of detectors. If the movement of the charged particle is traced only by the time-dependent fields generated by the observed particle, one deals with a weak non-invasive measurement. A moving particle will always exert some action on the detector. On the other hand, the detector will respond to the external time-dependent magnetic and electric fields. This detector response generates counter electric and magnetic fields that try to compensate the effect of the field trying to change the state of the detector. Therefore there is a mutual interaction between the moving charged particles and the detector. However, if the speed of the moving particles is very high, this interaction takes a very short time and only slightly deflects the trajectory of the moving charged particle under observation. Therefore one deals with a weak measurement that changes the physical state of the object under observation in a perturbative way [2].

Now we concentrate on the description of the detector of moving charged particles. One can choose various types of detectors as the measuring device, for example superconducting SQUIDs, Josephson junctions, NV-diamond sensors or single electron semiconductor devices. Because of rapid developments in cryogenic CMOS technology and the scalability of those detectors, we concentrate on single electron semiconductor devices as the most promising detectors for massive use. Position-based semiconductor qubits were described in [1,3,4,6,8,9].

© Springer Nature Switzerland AG 2021 K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 31–53, 2021. https://doi.org/10.1007/978-3-030-63089-8_3
2 Single Electron Devices as Detectors of Moving Charged Particles
Quite recently it has been proposed by Fujisawa [1] and Petta [12] to use single electron devices for classical and quantum information processing. This technology relies on a chain of coupled quantum dots that can be implemented in various semiconductors. In particular, one can use a CMOS transistor with source and drain and a channel in between that is controlled by an external polarizing gate, as depicted in Fig. 1. Recent CMOS technologies allow for the fabrication of transistors with channel lengths from 22 nm down to 3 nm. If one can place one electron in the source-channel-drain system (S-CH-D), then one can approximate the physical system by two coupled quantum dots. It is convenient to use the tight-binding formalism to describe the electron dynamics in time. In such a case, instead of the wavefunction of the electron, it is useful to use the maximally localized wavefunctions (Wannier functions) of that electron on the left and right quantum dot, denoted by $|1,0\rangle$ and $|0,1\rangle$. One obtains the following simplistic Hamiltonian of the position-based qubit:

$$H = E_p(1)\,|1,0\rangle\langle 1,0| + E_p(2)\,|0,1\rangle\langle 0,1| + t_{1\to 2}\,|1,0\rangle\langle 0,1| + t_{2\to 1}\,|0,1\rangle\langle 1,0| . \tag{1}$$

Here $E_p(1)$ or $E_p(2)$ has the meaning of the minimum of the confining potential on the left or right quantum dot. It can be recognized as the localized energy on the left or right quantum dot. The tunneling process between the left and right quantum dot, or the classical movement of the electron between them, is accounted for by the term $|t|_{1\to 2}$, which has the meaning of delocalized energy (energy participating in particle transfer between quantum dots). If the electron kinetic energy is much above the potential barrier separating the left and right quantum dot, then one can assign the meaning of kinetic energy to the term $|t|_{1\to 2}$ or $|t|_{2\to 1}$. The quantum state of the position-based qubit is a superposition of presence on the left and right quantum dot and is expressed by the formula

$$|\psi\rangle = \alpha(t)\,|1,0\rangle + \beta(t)\,|0,1\rangle , \tag{2}$$

where $|1,0\rangle = w_L(x)$ and $|0,1\rangle = w_R(x)$ are the maximally localized functions on the left and right side of the position-based qubit. In the case of the position dependent qubit we have

$$i\hbar\frac{d}{dt}|\psi\rangle = \begin{pmatrix} E_{p1} & t_{s12} \\ t_{s12}^{*} & E_{p2} \end{pmatrix}|\psi\rangle = E(t)\,|\psi\rangle . \tag{3}$$

For simplicity we consider $E_{p1} = E_{p2} = E_p$ and $t_{s12} = t_s$. We have two eigenenergies $E_1 = E_p - t_s$ and $E_2 = E_p + t_s$, and the eigenstates are as follows:

$$|E_1\rangle = \frac{1}{\sqrt{2}}\left(|1,0\rangle - |0,1\rangle\right) = \frac{1}{\sqrt{2}}\begin{pmatrix} +1 \\ -1 \end{pmatrix}, \qquad |E_2\rangle = \frac{1}{\sqrt{2}}\left(|1,0\rangle + |0,1\rangle\right) = \frac{1}{\sqrt{2}}\begin{pmatrix} +1 \\ +1 \end{pmatrix}. \tag{4}$$

In the general case we have a superposition of the energy levels $E_1$ and $E_2$ as $|\psi\rangle = c_{E1}e^{i\phi_{E1}}|E_1\rangle_t + c_{E2}e^{i\phi_{E2}}|E_2\rangle_t$, and in detail we have

$$|\psi\rangle = e^{i\phi_{E1}}\begin{pmatrix} +c_{E1}e^{\frac{E_1}{i\hbar}t} \\ -c_{E1}e^{\frac{E_1}{i\hbar}t} \end{pmatrix} + e^{i\phi_{E2}}\begin{pmatrix} +c_{E2}e^{\frac{E_2}{i\hbar}t} \\ +c_{E2}e^{\frac{E_2}{i\hbar}t} \end{pmatrix} = \begin{pmatrix} +e^{i\phi_{E1}}c_{E1}e^{\frac{E_1}{i\hbar}t} + e^{i\phi_{E2}}c_{E2}e^{\frac{E_2}{i\hbar}t} \\ -e^{i\phi_{E1}}c_{E1}e^{\frac{E_1}{i\hbar}t} + e^{i\phi_{E2}}c_{E2}e^{\frac{E_2}{i\hbar}t} \end{pmatrix} = \begin{pmatrix} \alpha(t) \\ \beta(t) \end{pmatrix}, \tag{5}$$

where $|c_{E1}|^2 + |c_{E2}|^2 = 1$ (sum of the occupancy probabilities of energy levels 1 and 2) and $|\alpha(t)|^2 + |\beta(t)|^2 = 1$ (sum of the probabilities of occupancy of the left and right side by the electron). Under the influence of a very quickly moving charge we have

$$i\hbar\frac{d}{dt}|\psi\rangle = \begin{pmatrix} E_{p1} + f_1(t) & t_{s12} + f_3(t) \\ t_{s12}^{*} + f_3(t)^{*} & E_{p2} + f_2(t) \end{pmatrix}|\psi\rangle = E(t)\,|\psi\rangle . \tag{6}$$

More exactly we have

$$i\hbar\frac{d}{dt}\begin{pmatrix} \alpha(t) \\ \beta(t) \end{pmatrix} = \begin{pmatrix} E_{peff1}(t) & t_{eff-s12}(t) \\ t_{eff-s12}(t)^{*} & E_{peff2}(t) \end{pmatrix}\begin{pmatrix} \alpha(t) \\ \beta(t) \end{pmatrix} = \begin{pmatrix} E_{p1} + f_1(t) & t_{s12} + f_3(t) \\ t_{s12}^{*} + f_3(t)^{*} & E_{p2} + f_2(t) \end{pmatrix}\begin{pmatrix} \alpha(t) \\ \beta(t) \end{pmatrix} = E(t)\begin{pmatrix} \alpha(t) \\ \beta(t) \end{pmatrix}. \tag{7}$$
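The eigenenergies and eigenstates quoted in Eq. (4) can be checked numerically. The following is a minimal sketch, assuming illustrative parameter values in arbitrary energy units; the function name `qubit_hamiltonian` and the numbers are hypothetical and are not taken from the paper.

```python
import numpy as np

def qubit_hamiltonian(Ep1, Ep2, ts12):
    """Tight-binding Hamiltonian of Eq. (3) in the basis (|1,0>, |0,1>)."""
    return np.array([[Ep1, ts12], [np.conj(ts12), Ep2]], dtype=complex)

# Illustrative, hypothetical parameter values (arbitrary energy units).
Ep, ts = 1.0, 0.25
H = qubit_hamiltonian(Ep, Ep, ts)

# eigh returns eigenvalues in ascending order and eigenvectors as columns.
energies, states = np.linalg.eigh(H)

# Eigenstates as written in Eq. (4).
E1_state = np.array([1.0, -1.0]) / np.sqrt(2)
E2_state = np.array([1.0, 1.0]) / np.sqrt(2)

print("numerical eigenenergies:", energies)
print("Ep - ts, Ep + ts       :", Ep - ts, Ep + ts)
```

For a real positive $t_s$ the antisymmetric combination $|E_1\rangle$ is the lower level, which is why it appears first in the ascending output of `eigh`.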
Single proton movement in the proximity of the position based qubit gives

$$f_1(t) = \sum_{k=1}^{N_{protons}} V_1(k)\,\delta(t - t_1(k)), \qquad f_2(t) = \sum_{k=1}^{N_{protons}} V_2(k)\,\delta(t - t_1(k)), \qquad f_3(t) = \sum_{k=1}^{N_{protons}} \left(V_3(k) + iV_4(k)\right)\delta(t - t_1(k)). \tag{8}$$
In the general case one shall have effective values of $E_{peff1}(t)$, $E_{peff2}(t)$, $t_{eff-s12}(t)$ and $t_{eff-s21}(t)$ given by the formulas

$$E_{peff1}(t) = \int_{-\infty}^{+\infty} dx\, w_L^{*}(x)\left(-\frac{\hbar^2}{2m_e}\frac{d^2}{dx^2} + V_{pol}(x) + V_p(t)\right) w_L(x),$$
$$E_{peff2}(t) = \int_{-\infty}^{+\infty} dx\, w_R^{*}(x)\left(-\frac{\hbar^2}{2m_e}\frac{d^2}{dx^2} + V_{pol}(x) + V_p(t)\right) w_R(x),$$
$$t_{eff-s12}(t) = \int_{-\infty}^{+\infty} dx\, w_R^{*}(x)\left(-\frac{\hbar^2}{2m_e}\frac{d^2}{dx^2} + V_{pol}(x) + V_p(t)\right) w_L(x),$$
$$t_{eff-s21}(t) = \int_{-\infty}^{+\infty} dx\, w_L^{*}(x)\left(-\frac{\hbar^2}{2m_e}\frac{d^2}{dx^2} + V_{pol}(x) + V_p(t)\right) w_R(x), \tag{9}$$

where $w_L(x)$ and $w_R(x)$ are the maximally localized states (Wannier functions) in the left and right quantum dots, $V_{pol}(x)$ is the qubit polarizing electrostatic potential, and $V_p(t)$ is the electrostatic potential coming from the proton moving in the accelerator beam. For simplicity let us consider 3 terms perturbing the single electron qubit Hamiltonian, all acting at the same time $t_1$,

$$f_1(t) = V_1\,\delta(t - t_1), \qquad f_2(t) = V_2\,\delta(t - t_1), \qquad f_3(t) = (V_3 + iV_4)\,\delta(t - t_1) \tag{10}$$
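and we obtain the modified Hamiltonian of the qubit as

$$i\hbar\frac{d}{dt}\begin{pmatrix} \alpha(t) \\ \beta(t) \end{pmatrix} = \begin{pmatrix} E_{p1} + V_1\delta(t - t_1) & t_{s12} + (V_3 + iV_4)\delta(t - t_1) \\ t_{s12}^{*} + (V_3 - iV_4)\delta(t - t_1) & E_{p2} + V_2\delta(t - t_1) \end{pmatrix}\begin{pmatrix} \alpha(t) \\ \beta(t) \end{pmatrix} = E(t)\begin{pmatrix} \alpha(t) \\ \beta(t) \end{pmatrix}, \tag{11}$$

which is a system of two coupled differential equations:

$$i\hbar\frac{d}{dt}\alpha(t) = \left(E_{p1} + V_1\delta(t - t_1)\right)\alpha(t) + \left(t_{s12} + (V_3 + iV_4)\delta(t - t_1)\right)\beta(t),$$
$$i\hbar\frac{d}{dt}\beta(t) = \left(E_{p2} + V_2\delta(t - t_1)\right)\beta(t) + \left(t_{s12}^{*} + (V_3 - iV_4)\delta(t - t_1)\right)\alpha(t), \tag{12}$$

that can be rewritten in discrete form as

$$i\hbar\frac{1}{2\delta t}\left(\alpha(t + \delta t) - \alpha(t - \delta t)\right) = \left(E_{p1} + V_1\delta(t - t_1)\right)\alpha(t) + \left(t_{s12} + (V_3 + iV_4)\delta(t - t_1)\right)\beta(t),$$
$$i\hbar\frac{1}{2\delta t}\left(\beta(t + \delta t) - \beta(t - \delta t)\right) = \left(E_{p2} + V_2\delta(t - t_1)\right)\beta(t) + \left(t_{s12}^{*} + (V_3 - iV_4)\delta(t - t_1)\right)\alpha(t). \tag{13}$$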
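The central-difference form of Eq. (13) can be used directly as a two-step (leapfrog) integrator away from the kick time $t_1$. The sketch below is only an illustration under stated assumptions: units with $\hbar = 1$, no kick ($V_k = 0$), $E_{p1} = E_{p2} = 0$, a real $t_s$, and a second-order Taylor step to start the recursion. The function name `leapfrog_evolve` and all numerical values are hypothetical.

```python
import numpy as np

def leapfrog_evolve(H, psi0, dt, n_steps, hbar=1.0):
    """Propagate psi with the central-difference rule of Eq. (13):
    psi(t + dt) = psi(t - dt) + (2 dt / (i hbar)) H psi(t)."""
    psi_prev = np.asarray(psi0, dtype=complex)
    k = dt / (1j * hbar)
    # Starting step (an assumption of this sketch): second-order Taylor expansion.
    psi = psi_prev + k * (H @ psi_prev) + 0.5 * k**2 * (H @ (H @ psi_prev))
    for _ in range(n_steps - 1):
        psi_prev, psi = psi, psi_prev + 2.0 * k * (H @ psi)
    return psi  # state at time n_steps * dt

# Hypothetical parameters: Ep = 0, ts = 1, electron initially on the left dot.
ts = 1.0
H = np.array([[0.0, ts], [ts, 0.0]], dtype=complex)
dt, n_steps = 1e-3, 500
psi_T = leapfrog_evolve(H, [1.0, 0.0], dt, n_steps)
T = dt * n_steps

# For this Hamiltonian the exact left-dot occupancy is cos^2(ts * T / hbar).
print(abs(psi_T[0])**2, np.cos(ts * T)**2)
```

The scheme is only conditionally stable, so the time step has to stay small compared with $\hbar$ divided by the largest energy in the Hamiltonian.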
Applying the operator $\int_{t_1 - \delta t}^{t_1 + \delta t} dt'$ to both sides of the previous equations with very small $\delta t \to 0$, we obtain two algebraic relations:

$$i\hbar\left(\alpha(t_1^{+}) - \alpha(t_1^{-})\right) = V_1\,\alpha(t_1^{+}) + (V_3 + iV_4)\,\beta(t_1^{+}),$$
$$i\hbar\left(\beta(t_1^{+}) - \beta(t_1^{-})\right) = V_2\,\beta(t_1^{+}) + (V_3 - iV_4)\,\alpha(t_1^{+}). \tag{14}$$
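A linear combination of the quantum states of the qubit after the weak measurement, which was due to the interaction of the qubit with the external passing charged particle, is thus expressed by the quantum states of the qubit before the measurement, so we obtain

$$(i\hbar - V_1)\,\alpha(t_1^{+}) - (V_3 + iV_4)\,\beta(t_1^{+}) = i\hbar\,\alpha(t_1^{-}),$$
$$(i\hbar - V_2)\,\beta(t_1^{+}) - (V_3 - iV_4)\,\alpha(t_1^{+}) = i\hbar\,\beta(t_1^{-}). \tag{15}$$

The last equations can be written in the compact form

$$\begin{pmatrix} i\hbar - V_1 & -(V_3 + iV_4) \\ -(V_3 - iV_4) & i\hbar - V_2 \end{pmatrix}\begin{pmatrix} \alpha(t_1^{+}) \\ \beta(t_1^{+}) \end{pmatrix} = i\hbar\begin{pmatrix} \alpha(t_1^{-}) \\ \beta(t_1^{-}) \end{pmatrix} \tag{16}$$

or equivalently

$$\begin{pmatrix} \alpha(t_1^{+}) \\ \beta(t_1^{+}) \end{pmatrix} = i\hbar\begin{pmatrix} i\hbar - V_1 & -(V_3 + iV_4) \\ -(V_3 - iV_4) & i\hbar - V_2 \end{pmatrix}^{-1}\begin{pmatrix} \alpha(t_1^{-}) \\ \beta(t_1^{-}) \end{pmatrix} \tag{17}$$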
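and it implies that the quantum state after the weak measurement is obtained as a linear transformation of the quantum state before the measurement, so

$$\begin{pmatrix} \alpha(t_1^{+}) \\ \beta(t_1^{+}) \end{pmatrix} = \frac{\hbar}{(\hbar + iV_1)(\hbar + iV_2) + V_3^2 + V_4^2}\begin{pmatrix} \hbar + iV_2 & -iV_3 + V_4 \\ -iV_3 - V_4 & \hbar + iV_1 \end{pmatrix}\begin{pmatrix} \alpha(t_1^{-}) \\ \beta(t_1^{-}) \end{pmatrix} \tag{18}$$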
and hence

$$\begin{pmatrix} \alpha(t_1^{+}) \\ \beta(t_1^{+}) \end{pmatrix} = \hat{M}\begin{pmatrix} \alpha(t_1^{-}) \\ \beta(t_1^{-}) \end{pmatrix} = \begin{pmatrix} M_{1,1} & M_{1,2} \\ M_{2,1} & M_{2,2} \end{pmatrix}\begin{pmatrix} \alpha(t_1^{-}) \\ \beta(t_1^{-}) \end{pmatrix} = \frac{1}{D}\begin{pmatrix} M_{r(1,1)} + iM_{i(1,1)} & M_{r(1,2)} + iM_{i(1,2)} \\ M_{r(2,1)} + iM_{i(2,1)} & M_{r(2,2)} + iM_{i(2,2)} \end{pmatrix}\begin{pmatrix} \alpha(t_1^{-}) \\ \beta(t_1^{-}) \end{pmatrix}, \tag{19}$$

where the common denominator is abbreviated as

$$D = \hbar^4 + (-V_1V_2 + V_3^2 + V_4^2)^2 + \hbar^2\left(V_1^2 + V_2^2 + 2(V_3^2 + V_4^2)\right).$$

Now we identify the diagonal parts of the matrix $\hat{M}$ as

$$M_{1,1} = \frac{1}{D}\left[\hbar^2\left(\hbar^2 + V_2^2 + V_3^2 + V_4^2\right) + i\hbar\left(-\hbar^2V_1 + V_2(-V_1V_2 + V_3^2 + V_4^2)\right)\right] = \frac{1}{D}\left[M_{r(1,1)} + iM_{i(1,1)}\right] \tag{20}$$

and

$$M_{2,2} = \frac{1}{D}\left[\hbar^2\left(\hbar^2 + V_1^2 + V_3^2 + V_4^2\right) + i\hbar\left(-\hbar^2V_2 + V_1(-V_1V_2 + V_3^2 + V_4^2)\right)\right] = \frac{1}{D}\left[M_{r(2,2)} + iM_{i(2,2)}\right]. \tag{21}$$

The non-diagonal parts of the matrix are given as

$$M_{1,2} = \frac{1}{D}\left[\hbar\left(-\hbar(V_1 + V_2)V_3 + (\hbar^2 - V_1V_2 + V_3^2)V_4 + V_4^3\right) - i\hbar\left(\hbar^2V_3 + \hbar(V_1 + V_2)V_4 + V_3(-V_1V_2 + V_3^2 + V_4^2)\right)\right] = \frac{1}{D}\left[M_{r(1,2)} + iM_{i(1,2)}\right] \tag{22}$$

and

$$M_{2,1} = \frac{1}{D}\left[-\hbar\left(\hbar(V_1 + V_2)V_3 + (\hbar^2 - V_1V_2 + V_3^2)V_4 + V_4^3\right) + i\hbar\left(-\hbar^2V_3 + \hbar(V_1 + V_2)V_4 - V_3(-V_1V_2 + V_3^2 + V_4^2)\right)\right] = \frac{1}{D}\left[M_{r(2,1)} + iM_{i(2,1)}\right]. \tag{23}$$
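The closed forms of Eqs. (18) and (20)-(23) can be cross-checked against a direct numerical inversion of the matrix in Eq. (17). This is a minimal sketch with $\hbar$ kept as an explicit parameter; the function names and the sample values of $V_1,\dots,V_4$ and $\hbar$ are hypothetical and chosen only for the check.

```python
import numpy as np

def kick_matrix(V1, V2, V3, V4, hbar=1.0):
    """M = i*hbar * A^{-1}, with A the 2x2 matrix of Eqs. (16)-(17)."""
    A = np.array([[1j * hbar - V1, -(V3 + 1j * V4)],
                  [-(V3 - 1j * V4), 1j * hbar - V2]])
    return 1j * hbar * np.linalg.inv(A)

def kick_matrix_closed_form(V1, V2, V3, V4, hbar=1.0):
    """Closed form of Eq. (18)."""
    den = (hbar + 1j * V1) * (hbar + 1j * V2) + V3**2 + V4**2
    return hbar / den * np.array([[hbar + 1j * V2, -1j * V3 + V4],
                                  [-1j * V3 - V4, hbar + 1j * V1]])

def kick_matrix_parts(V1, V2, V3, V4, hbar=1.0):
    """Real and imaginary parts of Eqs. (20)-(23), divided by the denominator D."""
    S = -V1 * V2 + V3**2 + V4**2
    D = hbar**4 + S**2 + hbar**2 * (V1**2 + V2**2 + 2 * (V3**2 + V4**2))
    common = (hbar**2 - V1 * V2 + V3**2) * V4 + V4**3
    M11 = hbar**2 * (hbar**2 + V2**2 + V3**2 + V4**2) + 1j * hbar * (-hbar**2 * V1 + V2 * S)
    M22 = hbar**2 * (hbar**2 + V1**2 + V3**2 + V4**2) + 1j * hbar * (-hbar**2 * V2 + V1 * S)
    M12 = (hbar * (-hbar * (V1 + V2) * V3 + common)
           - 1j * hbar * (hbar**2 * V3 + hbar * (V1 + V2) * V4 + V3 * S))
    M21 = (-hbar * (hbar * (V1 + V2) * V3 + common)
           + 1j * hbar * (-hbar**2 * V3 + hbar * (V1 + V2) * V4 - V3 * S))
    return np.array([[M11, M12], [M21, M22]]) / D

V = (0.3, -0.2, 0.15, 0.05)   # hypothetical kick strengths
hbar = 0.7                    # deliberately not 1, to exercise every power of hbar
M_num = kick_matrix(*V, hbar=hbar)
M_18 = kick_matrix_closed_form(*V, hbar=hbar)
M_parts = kick_matrix_parts(*V, hbar=hbar)
print(np.max(np.abs(M_num - M_18)), np.max(np.abs(M_num - M_parts)))
```

For vanishing $V_k$ the matrix reduces to the identity, as expected for a qubit that is not disturbed by any passing charge.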
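We recognize the energy transfer during proton movement in close proximity to the position dependent qubit, so the quantum state after the weak measurement is given as

$$|\psi_m\rangle = e^{i\phi_{E1m}}c_{E1m}e^{\frac{E_1}{i\hbar}t}|E_1\rangle + e^{i\phi_{E2m}}c_{E2m}e^{\frac{E_2}{i\hbar}t}|E_2\rangle = \begin{pmatrix} +e^{i\phi_{E1m}}c_{E1m}e^{\frac{E_1}{i\hbar}t} + e^{i\phi_{E2m}}c_{E2m}e^{\frac{E_2}{i\hbar}t} \\ -e^{i\phi_{E1m}}c_{E1m}e^{\frac{E_1}{i\hbar}t} + e^{i\phi_{E2m}}c_{E2m}e^{\frac{E_2}{i\hbar}t} \end{pmatrix} = \begin{pmatrix} M_{1,1} & M_{1,2} \\ M_{2,1} & M_{2,2} \end{pmatrix}\begin{pmatrix} +e^{i\phi_{E1}}c_{E1}e^{\frac{E_1}{i\hbar}t} + e^{i\phi_{E2}}c_{E2}e^{\frac{E_2}{i\hbar}t} \\ -e^{i\phi_{E1}}c_{E1}e^{\frac{E_1}{i\hbar}t} + e^{i\phi_{E2}}c_{E2}e^{\frac{E_2}{i\hbar}t} \end{pmatrix}. \tag{24}$$

We obtain the quantum state after the weak measurement in the form

$$e^{i\phi_{E1m}}c_{E1m} = \frac{e^{-\frac{E_1}{i\hbar}t}}{\sqrt{2}}\begin{pmatrix} 1 & -1 \end{pmatrix}\begin{pmatrix} M_{1,1} & M_{1,2} \\ M_{2,1} & M_{2,2} \end{pmatrix}\begin{pmatrix} +e^{i\phi_{E1}}c_{E1}e^{\frac{E_1}{i\hbar}t} + e^{i\phi_{E2}}c_{E2}e^{\frac{E_2}{i\hbar}t} \\ -e^{i\phi_{E1}}c_{E1}e^{\frac{E_1}{i\hbar}t} + e^{i\phi_{E2}}c_{E2}e^{\frac{E_2}{i\hbar}t} \end{pmatrix} \tag{25}$$

and

$$e^{i\phi_{E2m}}c_{E2m} = \frac{e^{-\frac{E_2}{i\hbar}t}}{\sqrt{2}}\begin{pmatrix} 1 & 1 \end{pmatrix}\begin{pmatrix} M_{1,1} & M_{1,2} \\ M_{2,1} & M_{2,2} \end{pmatrix}\begin{pmatrix} +e^{i\phi_{E1}}c_{E1}e^{\frac{E_1}{i\hbar}t} + e^{i\phi_{E2}}c_{E2}e^{\frac{E_2}{i\hbar}t} \\ -e^{i\phi_{E1}}c_{E1}e^{\frac{E_1}{i\hbar}t} + e^{i\phi_{E2}}c_{E2}e^{\frac{E_2}{i\hbar}t} \end{pmatrix}. \tag{26}$$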
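We have

$$e^{i\phi_{E1m}}c_{E1m} = \frac{1}{\sqrt{2}}\left(c_{E2}e^{i(\phi_{E2} + (E_1 - E_2)t/\hbar)}(M_{1,1} + M_{1,2} - M_{2,1} - M_{2,2}) + c_{E1}e^{i\phi_{E1}}(M_{1,1} - M_{1,2} - M_{2,1} + M_{2,2})\right) \tag{27}$$

and

$$e^{i\phi_{E2m}}c_{E2m} = \frac{1}{\sqrt{2}}\left(c_{E1}e^{i(\phi_{E1} + (-E_1 + E_2)t/\hbar)}(M_{1,1} - M_{1,2} + M_{2,1} - M_{2,2}) + c_{E2}e^{i\phi_{E2}}(M_{1,1} + M_{1,2} + M_{2,1} + M_{2,2})\right). \tag{28}$$

The last expressions are given in terms of the parameters of the weak measurement, with $D = \hbar^4 + (-V_1V_2 + V_3^2 + V_4^2)^2 + \hbar^2(V_1^2 + V_2^2 + 2(V_3^2 + V_4^2))$ the common denominator of Eq. (19), as

$$\begin{aligned} e^{i\phi_{E1m}}c_{E1m} = \frac{\hbar}{2D}\Big\{ & c_{E1}\hbar\left(2\hbar^2 + V_1^2 + V_2^2 + 2V_1V_3 + 2V_3(V_2 + V_3) + 2V_4^2\right)\cos[\phi_{E1}] \\ & + c_{E2}\left(\hbar(-V_1^2 + V_2^2) + 2(\hbar^2 - V_1V_2 + V_3^2)V_4 + 2V_4^3\right)\cos[\phi_{E2} + (E_1 - E_2)t/\hbar] \\ & + c_{E1}\left(\hbar^2(V_1 + V_2 - 2V_3) + (V_1 + V_2 + 2V_3)(V_1V_2 - V_3^2 - V_4^2)\right)\sin[\phi_{E1}] \\ & + c_{E2}\left((V_1 - V_2)(\hbar^2 - V_1V_2 + V_3^2) + 2\hbar(V_1 + V_2)V_4 + (V_1 - V_2)V_4^2\right)\sin[\phi_{E2} + (E_1 - E_2)t/\hbar] \\ & + i\Big[ c_{E1}\left(-\hbar^2(V_1 + V_2 - 2V_3) - (V_1 + V_2 + 2V_3)(V_1V_2 - V_3^2 - V_4^2)\right)\cos[\phi_{E1}] \\ & \quad - c_{E2}\left((V_1 - V_2)(\hbar^2 - V_1V_2 + V_3^2) + 2\hbar(V_1 + V_2)V_4 + (V_1 - V_2)V_4^2\right)\cos[\phi_{E2} + (E_1 - E_2)t/\hbar] \\ & \quad + c_{E1}\hbar\left(2\hbar^2 + V_1^2 + V_2^2 + 2V_1V_3 + 2V_3(V_2 + V_3) + 2V_4^2\right)\sin[\phi_{E1}] \\ & \quad + c_{E2}\left(\hbar(-V_1^2 + V_2^2) + 2(\hbar^2 - V_1V_2 + V_3^2)V_4 + 2V_4^3\right)\sin[\phi_{E2} + (E_1 - E_2)t/\hbar]\Big]\Big\} \end{aligned} \tag{29}$$

and consequently we have

$$\begin{aligned} e^{i\phi_{E2m}}c_{E2m} = \frac{\hbar}{2D}\Big\{ & c_{E2}\hbar\left(2\hbar^2 + V_1^2 + V_2^2 - 2(V_1 + V_2)V_3 + 2V_3^2 + 2V_4^2\right)\cos[\phi_{E2}] \\ & - c_{E1}\left(\hbar(V_1 - V_2)(V_1 + V_2) + 2(\hbar^2 - V_1V_2 + V_3^2)V_4 + 2V_4^3\right)\cos[\phi_{E1} + (-E_1 + E_2)t/\hbar] \\ & + c_{E2}\left(\hbar^2(V_1 + V_2 + 2V_3) + (V_1 + V_2 - 2V_3)(V_1V_2 - V_3^2 - V_4^2)\right)\sin[\phi_{E2}] \\ & + c_{E1}\left((V_1 - V_2)(\hbar^2 - V_1V_2 + V_3^2) - 2\hbar(V_1 + V_2)V_4 + (V_1 - V_2)V_4^2\right)\sin[\phi_{E1} + (-E_1 + E_2)t/\hbar] \\ & + i\Big[ -c_{E2}\left(\hbar^2(V_1 + V_2 + 2V_3) + (V_1 + V_2 - 2V_3)(V_1V_2 - V_3^2 - V_4^2)\right)\cos[\phi_{E2}] \\ & \quad + c_{E1}\left(-(V_1 - V_2)(\hbar^2 - V_1V_2 + V_3^2) + 2\hbar(V_1 + V_2)V_4 + (-V_1 + V_2)V_4^2\right)\cos[\phi_{E1} - (E_1 - E_2)t/\hbar] \\ & \quad + c_{E2}\hbar\left(2\hbar^2 + V_1^2 + V_2^2 - 2(V_1 + V_2)V_3 + 2V_3^2 + 2V_4^2\right)\sin[\phi_{E2}] \\ & \quad - c_{E1}\left(\hbar(V_1 - V_2)(V_1 + V_2) + 2(\hbar^2 - V_1V_2 + V_3^2)V_4 + 2V_4^3\right)\sin[\phi_{E1} - (E_1 - E_2)t/\hbar]\Big]\Big\}. \end{aligned} \tag{30}$$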
It is quite straightforward to obtain the probability of occupancy of energy $E_1$ by the electron in the position based qubit after the weak measurement (one interaction with the passing charged particle). With $D = \hbar^4 + (-V_1V_2 + V_3^2 + V_4^2)^2 + \hbar^2(V_1^2 + V_2^2 + 2(V_3^2 + V_4^2))$ it is given as

$$\begin{aligned} (c_{E1m})^2 = \frac{\hbar^2}{4D}\Big[ & c_{E1}^2\left(4\hbar^2 + (V_1 + V_2 + 2V_3)^2\right) + c_{E2}^2\left((V_1 - V_2)^2 + 4V_4^2\right) \\ & + 2c_{E1}c_{E2}\left(-(V_1 - V_2)(V_1 + V_2 + 2V_3) + 4\hbar V_4\right)\cos[\phi_{E1} - \phi_{E2} - (E_1 - E_2)t_1/\hbar] \\ & - 4c_{E1}c_{E2}\left[\hbar(V_1 - V_2) + (V_1 + V_2 + 2V_3)V_4\right]\sin[\phi_{E1} - \phi_{E2} - (E_1 - E_2)t_1/\hbar]\Big]. \end{aligned} \tag{31}$$

In similar fashion we obtain the probability of occupancy of energy level $E_2$ by the position dependent qubit, which is given as

$$\begin{aligned} (c_{E2m})^2 = \frac{\hbar^2}{4D}\Big[ & c_{E2}^2\left(4\hbar^2 + (V_1 + V_2 - 2V_3)^2\right) + c_{E1}^2\left((V_1 - V_2)^2 + 4V_4^2\right) \\ & - 2c_{E1}c_{E2}\left((V_1 - V_2)(V_1 + V_2 - 2V_3) + 4\hbar V_4\right)\cos[\phi_{E1} - \phi_{E2} - (E_1 - E_2)t_1/\hbar] \\ & + 4c_{E1}c_{E2}\left(\hbar(V_1 - V_2) - (V_1 + V_2 - 2V_3)V_4\right)\sin[\phi_{E1} - \phi_{E2} - (E_1 - E_2)t_1/\hbar]\Big]. \end{aligned} \tag{32}$$

Consequently we obtain the phase imprint on the energy eigenstate $E_1$ as the amplitude of Eq. (29) divided by the square root of the probability of Eq. (31),

$$e^{i\phi_{E1m}} = \left[e^{i\phi_{E1m}}c_{E1m}\right]_{\text{Eq. (29)}} \times \left[(c_{E1m})^2\right]_{\text{Eq. (31)}}^{-\frac{1}{2}}. \tag{33}$$
The phase imprint on the energy eigenstate $E_2$ is given in the same way by the amplitude of Eq. (30) divided by the square root of the probability of Eq. (32),

$$e^{i\phi_{E2m}} = \left[e^{i\phi_{E2m}}c_{E2m}\right]_{\text{Eq. (30)}} \times \left[(c_{E2m})^2\right]_{\text{Eq. (32)}}^{-\frac{1}{2}}. \tag{34}$$
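The occupancy probabilities of Eqs. (31) and (32) can be cross-checked by applying the matrix of Eq. (17) to a normalized superposition of $|E_1\rangle$ and $|E_2\rangle$ and projecting the result back onto the eigenstates of Eq. (4). The sketch below assumes real amplitudes $c_{E1}$, $c_{E2}$ and keeps $\hbar$ explicit; all function names and numerical values are hypothetical.

```python
import numpy as np

def kick_matrix(V1, V2, V3, V4, hbar):
    """M = i*hbar * A^{-1}, with A the 2x2 matrix of Eqs. (16)-(17)."""
    A = np.array([[1j * hbar - V1, -(V3 + 1j * V4)],
                  [-(V3 - 1j * V4), 1j * hbar - V2]])
    return 1j * hbar * np.linalg.inv(A)

def projected_probabilities(c1, c2, phi1, phi2, E1, E2, t1, V, hbar):
    """Apply the kick to psi(t1^-) and project onto |E1>, |E2> of Eq. (4)."""
    e1 = np.array([1.0, -1.0]) / np.sqrt(2)
    e2 = np.array([1.0, 1.0]) / np.sqrt(2)
    psi = (c1 * np.exp(1j * phi1) * np.exp(E1 * t1 / (1j * hbar)) * e1
           + c2 * np.exp(1j * phi2) * np.exp(E2 * t1 / (1j * hbar)) * e2)
    psi_m = kick_matrix(*V, hbar) @ psi
    return abs(np.vdot(e1, psi_m))**2, abs(np.vdot(e2, psi_m))**2

def closed_form_probabilities(c1, c2, phi1, phi2, E1, E2, t1, V, hbar):
    """Closed forms of Eqs. (31) and (32)."""
    V1, V2, V3, V4 = V
    D = (hbar**4 + (-V1 * V2 + V3**2 + V4**2)**2
         + hbar**2 * (V1**2 + V2**2 + 2 * (V3**2 + V4**2)))
    arg = phi1 - phi2 - (E1 - E2) * t1 / hbar
    wp, wm = V1 + V2 + 2 * V3, V1 + V2 - 2 * V3
    mix = (V1 - V2)**2 + 4 * V4**2
    p1 = hbar**2 / (4 * D) * (
        c1**2 * (4 * hbar**2 + wp**2) + c2**2 * mix
        + 2 * c1 * c2 * (-(V1 - V2) * wp + 4 * hbar * V4) * np.cos(arg)
        - 4 * c1 * c2 * (hbar * (V1 - V2) + wp * V4) * np.sin(arg))
    p2 = hbar**2 / (4 * D) * (
        c2**2 * (4 * hbar**2 + wm**2) + c1**2 * mix
        - 2 * c1 * c2 * ((V1 - V2) * wm + 4 * hbar * V4) * np.cos(arg)
        + 4 * c1 * c2 * (hbar * (V1 - V2) - wm * V4) * np.sin(arg))
    return p1, p2

# Hypothetical inputs: c1^2 + c2^2 = 1, arbitrary phases, energies and kick time.
args = (0.6, 0.8, 0.3, -0.4, 0.75, 1.25, 2.0, (0.3, -0.2, 0.15, 0.05), 0.7)
p_num = projected_probabilities(*args)
p_cf = closed_form_probabilities(*args)
print(p_num, p_cf)
```

Note that the two probabilities need not sum to one for finite $V_k$, because the matrix of Eq. (18) is not unitary in general.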
3 Dynamics of Two Qubit Electrostatic Entanglement Under Influence of Weak Measurement
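We have the following Hamiltonian of two interacting qubits for the isolated quantum system, where primed indices refer to the second qubit:

$$H = \begin{pmatrix} E_{p1} + E_{p1'} + \frac{q^2}{d_{1,1'}} & t_{s2} & t_{s1} & 0 \\ t_{s2}^{*} & E_{p1} + E_{p2'} + \frac{q^2}{d_{1,2'}} & 0 & t_{s1} \\ t_{s1}^{*} & 0 & E_{p2} + E_{p1'} + \frac{q^2}{d_{2,1'}} & t_{s2} \\ 0 & t_{s1}^{*} & t_{s2}^{*} & E_{p2} + E_{p2'} + \frac{q^2}{d_{2,2'}} \end{pmatrix}. \tag{35}$$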
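The block structure of Eq. (35) can be made explicit by assembling the matrix from the two single-qubit tight-binding Hamiltonians plus a diagonal Coulomb term. The sketch below assumes the basis ordering (1,1'), (1,2'), (2,1'), (2,2'); the function name, argument layout and numerical values are hypothetical.

```python
import numpy as np

def two_qubit_hamiltonian(Ep, Ep_prime, ts1, ts2, coulomb):
    """Assemble Eq. (35) as H1 (x) I + I (x) H2 + diag(q^2 / d_{k,l'}).

    Ep        = (Ep1, Ep2)       on-site energies of qubit 1
    Ep_prime  = (Ep1', Ep2')     on-site energies of qubit 2
    coulomb   = 2x2 array with entries q^2 / d_{k,l'}
    """
    H1 = np.array([[Ep[0], ts1], [np.conj(ts1), Ep[1]]], dtype=complex)
    H2 = np.array([[Ep_prime[0], ts2], [np.conj(ts2), Ep_prime[1]]], dtype=complex)
    I2 = np.eye(2)
    coulomb_diag = np.diag(np.asarray(coulomb, dtype=complex).reshape(4))
    return np.kron(H1, I2) + np.kron(I2, H2) + coulomb_diag

# Hypothetical parameter values in arbitrary energy units.
Ep, Ep_prime = (1.0, 1.1), (0.9, 1.2)
ts1, ts2 = 0.2 + 0.1j, 0.05 - 0.3j
coulomb = [[0.50, 0.35], [0.35, 0.50]]
H = two_qubit_hamiltonian(Ep, Ep_prime, ts1, ts2, coulomb)
print(np.round(H, 3))
```

Written this way, the only term that couples the two qubits is the diagonal electrostatic energy, which is the source of the entanglement discussed in this section.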
Placement of an external probing particle affecting only qubit 1 modifies this Hamiltonian to

$$H = \begin{pmatrix} E_{p1} + f_1(t) + E_{p1'} + \frac{q^2}{d_{1,1'}} & t_{s2} & t_{s1} + f_3(t) + if_4(t) & 0 \\ t_{s2}^{*} & E_{p1} + f_1(t) + E_{p2'} + \frac{q^2}{d_{1,2'}} & 0 & t_{s1} + f_3(t) + if_4(t) \\ t_{s1}^{*} + f_3(t) - if_4(t) & 0 & E_{p2} + f_2(t) + E_{p1'} + \frac{q^2}{d_{2,1'}} & t_{s2} \\ 0 & t_{s1}^{*} + f_3(t) - if_4(t) & t_{s2}^{*} & E_{p2} + f_2(t) + E_{p2'} + \frac{q^2}{d_{2,2'}} \end{pmatrix}.$$
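Let us investigate the equations of motion for the 2-qubit system under the influence of an external charged particle. Let us assume that the quantum state is given as

$$|\psi(t)\rangle = \begin{pmatrix} \gamma_1 \\ \gamma_2 \\ \gamma_3 \\ \gamma_4 \end{pmatrix}, \qquad |\gamma_1(t)|^2 + |\gamma_2(t)|^2 + |\gamma_3(t)|^2 + |\gamma_4(t)|^2 = 1. \tag{36}$$

We end up with a system of 4 equations:

$$\left[E_{p1} + f_1(t) + E_{p1'} + \frac{q^2}{d_{1,1'}}\right]\gamma_1(t) + t_{s2}\gamma_2(t) + t_{s1}\gamma_3(t) + \left[f_3(t) + if_4(t)\right]\gamma_3(t) = i\hbar\frac{d}{dt}\gamma_1(t), \tag{37}$$

$$t_{s2}^{*}\gamma_1(t) + \left[E_{p1} + f_1(t) + E_{p2'} + \frac{q^2}{d_{1,2'}}\right]\gamma_2(t) + \left[t_{s1} + f_3(t) + if_4(t)\right]\gamma_4(t) = i\hbar\frac{d}{dt}\gamma_2(t), \tag{38}$$

$$\left(t_{s1}^{*} + f_3(t) - if_4(t)\right)\gamma_1(t) + \left[E_{p2} + f_2(t) + E_{p1'} + \frac{q^2}{d_{2,1'}}\right]\gamma_3(t) + t_{s2}\gamma_4(t) = i\hbar\frac{d}{dt}\gamma_3(t), \tag{39}$$

$$\left(t_{s1}^{*} + f_3(t) - if_4(t)\right)\gamma_2(t) + t_{s2}^{*}\gamma_3(t) + \left(E_{p2} + f_2(t) + E_{p2'} + \frac{q^2}{d_{2,2'}}\right)\gamma_4(t) = i\hbar\frac{d}{dt}\gamma_4(t), \tag{40}$$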
with single proton movement in the proximity of the position based qubit generating

$$f_1(t) = \sum_{k=1}^{N_{protons}} V_1(k)\,\delta(t - t_1(k)), \qquad f_2(t) = \sum_{k=1}^{N_{protons}} V_2(k)\,\delta(t - t_1(k)), \qquad f_3(t) = \sum_{k=1}^{N_{protons}} \left(V_3(k) + iV_4(k)\right)\delta(t - t_1(k)), \tag{41}$$

that in the simplest version has the form

$$f_1(t) = V_1\,\delta(t - t_1), \qquad f_2(t) = V_2\,\delta(t - t_1), \qquad f_3(t) = (V_3 + iV_4)\,\delta(t - t_1). \tag{42}$$
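Applying the operator $\int_{t_1 - \delta t}^{t_1 + \delta t} dt'$ to both sides of Eqs. (37)-(40) with very small $\delta t \to 0$, we obtain 4 algebraic relations:

$$i\hbar\left(\gamma_1(t_1^{+}) - \gamma_1(t_1^{-})\right) = V_1\gamma_1(t_1^{+}) + [V_3 + iV_4]\gamma_3(t_1^{+}),$$
$$i\hbar\left(\gamma_2(t_1^{+}) - \gamma_2(t_1^{-})\right) = V_1\gamma_2(t_1^{+}) + [V_3 + iV_4]\gamma_4(t_1^{+}),$$
$$i\hbar\left(\gamma_3(t_1^{+}) - \gamma_3(t_1^{-})\right) = [V_3 - iV_4]\gamma_1(t_1^{+}) + V_2\gamma_3(t_1^{+}),$$
$$i\hbar\left(\gamma_4(t_1^{+}) - \gamma_4(t_1^{-})\right) = [V_3 - iV_4]\gamma_2(t_1^{+}) + V_2\gamma_4(t_1^{+}). \tag{43}$$

This is equivalent to the relation

$$\frac{1}{i\hbar}\begin{pmatrix} i\hbar - V_1 & 0 & -[V_3 + iV_4] & 0 \\ 0 & i\hbar - V_1 & 0 & -[V_3 + iV_4] \\ -[V_3 - iV_4] & 0 & i\hbar - V_2 & 0 \\ 0 & -[V_3 - iV_4] & 0 & i\hbar - V_2 \end{pmatrix}\begin{pmatrix} \gamma_1(t_1^{+}) \\ \gamma_2(t_1^{+}) \\ \gamma_3(t_1^{+}) \\ \gamma_4(t_1^{+}) \end{pmatrix} = \begin{pmatrix} \gamma_1(t_1^{-}) \\ \gamma_2(t_1^{-}) \\ \gamma_3(t_1^{-}) \\ \gamma_4(t_1^{-}) \end{pmatrix}, \tag{44}$$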
which gives the coupled quantum states of the position based qubits after the passage of the charged particle, at time $t_1^{+}$, in terms of the quantum state at time $t_1^{-}$, with $t_1^{+} = t_1^{-} + \Delta t$ and $\Delta t \to 0$. In algebraic form, with $|x_1\rangle, \dots, |x_4\rangle$ the position basis states and $|E_1\rangle, \dots, |E_4\rangle$ the energy eigenstates, we have

$$\begin{aligned} \psi(t_1^{+}) = \begin{pmatrix} \gamma_1(t_1^{+}) \\ \gamma_2(t_1^{+}) \\ \gamma_3(t_1^{+}) \\ \gamma_4(t_1^{+}) \end{pmatrix} &= i\hbar\begin{pmatrix} i\hbar - V_1 & 0 & -[V_3 + iV_4] & 0 \\ 0 & i\hbar - V_1 & 0 & -[V_3 + iV_4] \\ -[V_3 - iV_4] & 0 & i\hbar - V_2 & 0 \\ 0 & -[V_3 - iV_4] & 0 & i\hbar - V_2 \end{pmatrix}^{-1}\begin{pmatrix} \gamma_1(t_1^{-}) \\ \gamma_2(t_1^{-}) \\ \gamma_3(t_1^{-}) \\ \gamma_4(t_1^{-}) \end{pmatrix} = \hat{M}\,\psi(t_1^{-}) \\ &= \begin{pmatrix} M_{1,1} & M_{1,2} & M_{1,3} & M_{1,4} \\ M_{2,1} & M_{2,2} & M_{2,3} & M_{2,4} \\ M_{3,1} & M_{3,2} & M_{3,3} & M_{3,4} \\ M_{4,1} & M_{4,2} & M_{4,3} & M_{4,4} \end{pmatrix}\left[\gamma_1(t_1^{-})|x_1\rangle + \gamma_2(t_1^{-})|x_2\rangle + \gamma_3(t_1^{-})|x_3\rangle + \gamma_4(t_1^{-})|x_4\rangle\right] \\ &= \hat{M}\left(|E_1\rangle\langle E_1| + |E_2\rangle\langle E_2| + |E_3\rangle\langle E_3| + |E_4\rangle\langle E_4|\right)\left[\gamma_1(t_1^{-})|x_1\rangle + \gamma_2(t_1^{-})|x_2\rangle + \gamma_3(t_1^{-})|x_3\rangle + \gamma_4(t_1^{-})|x_4\rangle\right] \\ &= \hat{M}\sum_{k=1}^{4}\left(\langle E_k|x_1\rangle\gamma_1(t_1^{-}) + \langle E_k|x_2\rangle\gamma_2(t_1^{-}) + \langle E_k|x_3\rangle\gamma_3(t_1^{-}) + \langle E_k|x_4\rangle\gamma_4(t_1^{-})\right)|E_k\rangle \\ &= \begin{pmatrix} M_{1,1} & M_{1,2} & M_{1,3} & M_{1,4} \\ M_{2,1} & M_{2,2} & M_{2,3} & M_{2,4} \\ M_{3,1} & M_{3,2} & M_{3,3} & M_{3,4} \\ M_{4,1} & M_{4,2} & M_{4,3} & M_{4,4} \end{pmatrix}\begin{pmatrix} \langle E_1|x_1\rangle\gamma_1(t_1^{-}) + \langle E_1|x_2\rangle\gamma_2(t_1^{-}) + \langle E_1|x_3\rangle\gamma_3(t_1^{-}) + \langle E_1|x_4\rangle\gamma_4(t_1^{-}) \\ \langle E_2|x_1\rangle\gamma_1(t_1^{-}) + \langle E_2|x_2\rangle\gamma_2(t_1^{-}) + \langle E_2|x_3\rangle\gamma_3(t_1^{-}) + \langle E_2|x_4\rangle\gamma_4(t_1^{-}) \\ \langle E_3|x_1\rangle\gamma_1(t_1^{-}) + \langle E_3|x_2\rangle\gamma_2(t_1^{-}) + \langle E_3|x_3\rangle\gamma_3(t_1^{-}) + \langle E_3|x_4\rangle\gamma_4(t_1^{-}) \\ \langle E_4|x_1\rangle\gamma_1(t_1^{-}) + \langle E_4|x_2\rangle\gamma_2(t_1^{-}) + \langle E_4|x_3\rangle\gamma_3(t_1^{-}) + \langle E_4|x_4\rangle\gamma_4(t_1^{-}) \end{pmatrix} \end{aligned}$$
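and the matrix $\hat{M}$ can be rewritten as

$$\begin{aligned} \hat{M} &= \frac{i\hbar}{(V_3^2 + V_4^2) + (\hbar + iV_1)(\hbar + iV_2)}\begin{pmatrix} V_2 - i\hbar & 0 & -(iV_4 + V_3) & 0 \\ 0 & V_2 - i\hbar & 0 & -(iV_4 + V_3) \\ -(-iV_4 + V_3) & 0 & V_1 - i\hbar & 0 \\ 0 & -(-iV_4 + V_3) & 0 & V_1 - i\hbar \end{pmatrix} \\ &= \frac{\hbar^2(V_1 + V_2) + i\hbar(\hbar^2 - V_1V_2 + V_3^2 + V_4^2)}{\hbar^4 + \hbar^2\left(V_1^2 + V_2^2 + 2(V_3^2 + V_4^2)\right) + (-V_1V_2 + V_3^2 + V_4^2)^2}\begin{pmatrix} V_2 - i\hbar & 0 & -(iV_4 + V_3) & 0 \\ 0 & V_2 - i\hbar & 0 & -(iV_4 + V_3) \\ -(-iV_4 + V_3) & 0 & V_1 - i\hbar & 0 \\ 0 & -(-iV_4 + V_3) & 0 & V_1 - i\hbar \end{pmatrix} \\ &= \frac{1}{\hbar^4 + \hbar^2\left(V_1^2 + V_2^2 + 2(V_3^2 + V_4^2)\right) + (-V_1V_2 + V_3^2 + V_4^2)^2} \\ &\quad\times\Bigg\{\left[\hbar^2V_2(V_1 + V_2) + \hbar^2(\hbar^2 - V_1V_2 + V_3^2 + V_4^2) + i\left(-\hbar^3(V_1 + V_2) + \hbar V_2(\hbar^2 - V_1V_2 + V_3^2 + V_4^2)\right)\right]\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix} \\ &\qquad + \left[\hbar^2V_1(V_1 + V_2) + \hbar^2(\hbar^2 - V_1V_2 + V_3^2 + V_4^2) + i\left(-\hbar^3(V_1 + V_2) + \hbar V_1(\hbar^2 - V_1V_2 + V_3^2 + V_4^2)\right)\right]\begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \\ &\qquad - \left[\hbar^2(V_1 + V_2) + i\hbar(\hbar^2 - V_1V_2 + V_3^2 + V_4^2)\right]V_3\begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \\ &\qquad + \left[i\hbar^2(V_1 + V_2) - \hbar(\hbar^2 - V_1V_2 + V_3^2 + V_4^2)\right]V_4\begin{pmatrix} 0 & 0 & -1 & 0 \\ 0 & 0 & 0 & -1 \\ +1 & 0 & 0 & 0 \\ 0 & +1 & 0 & 0 \end{pmatrix}\Bigg\}. \end{aligned}$$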
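Because the probing particle acts only on qubit 1, the 4x4 matrix $\hat{M}$ obtained from Eq. (44) has the same entries as the single-qubit matrix of Eq. (17), repeated for both positions of the second electron. The following sketch checks this Kronecker-product structure numerically; it assumes the basis ordering of Eq. (36), and the function names and sample values are hypothetical.

```python
import numpy as np

def single_qubit_kick(V1, V2, V3, V4, hbar=1.0):
    """2x2 matrix M = i*hbar * A^{-1} of Eq. (17)."""
    A = np.array([[1j * hbar - V1, -(V3 + 1j * V4)],
                  [-(V3 - 1j * V4), 1j * hbar - V2]])
    return 1j * hbar * np.linalg.inv(A)

def two_qubit_kick(V1, V2, V3, V4, hbar=1.0):
    """4x4 matrix M = i*hbar * A4^{-1}, with A4 the matrix of Eq. (44) times i*hbar."""
    a, d = 1j * hbar - V1, 1j * hbar - V2
    b, c = -(V3 + 1j * V4), -(V3 - 1j * V4)
    A4 = np.array([[a, 0, b, 0],
                   [0, a, 0, b],
                   [c, 0, d, 0],
                   [0, c, 0, d]])
    return 1j * hbar * np.linalg.inv(A4)

V = (0.3, -0.2, 0.15, 0.05)   # hypothetical kick strengths
hbar = 0.7
M2 = single_qubit_kick(*V, hbar=hbar)
M4 = two_qubit_kick(*V, hbar=hbar)

# The kick acts as M2 on qubit 1 and as the identity on qubit 2.
print(np.max(np.abs(M4 - np.kron(M2, np.eye(2)))))
```

The second qubit is therefore affected only indirectly, through the electrostatic coupling in the Hamiltonian that governs the evolution before and after the kick.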
One ends up with an algebraic condition for the quantum state just after $t_1$, at $t_1^{+}$, so the relation between the quantum states at $t_1^{+}$ and $t_1^{-}$ is expressed in the algebraic way as

$$\begin{aligned} & c_{E1m}e^{i\phi_{E1m}}e^{\frac{E_1t_1^{+}}{i\hbar}}|E_1\rangle + c_{E2m}e^{i\phi_{E2m}}e^{\frac{E_2t_1^{+}}{i\hbar}}|E_2\rangle + c_{E3m}e^{i\phi_{E3m}}e^{\frac{E_3t_1^{+}}{i\hbar}}|E_3\rangle + c_{E4m}e^{i\phi_{E4m}}e^{\frac{E_4t_1^{+}}{i\hbar}}|E_4\rangle \\ &= \begin{pmatrix} \gamma_1(t_1^{+}) \\ \gamma_2(t_1^{+}) \\ \gamma_3(t_1^{+}) \\ \gamma_4(t_1^{+}) \end{pmatrix} = \hat{M}\begin{pmatrix} \gamma_1(t_1^{-}) \\ \gamma_2(t_1^{-}) \\ \gamma_3(t_1^{-}) \\ \gamma_4(t_1^{-}) \end{pmatrix} = \begin{pmatrix} M_{1,1}\gamma_1(t_1^{-}) + M_{1,3}\gamma_3(t_1^{-}) \\ M_{1,1}\gamma_2(t_1^{-}) + M_{1,3}\gamma_4(t_1^{-}) \\ M_{3,3}\gamma_3(t_1^{-}) + M_{3,1}\gamma_1(t_1^{-}) \\ M_{3,3}\gamma_4(t_1^{-}) + M_{3,1}\gamma_2(t_1^{-}) \end{pmatrix} \\ &= \hat{M}\left[c_{E1}e^{i\phi_{E1}}e^{\frac{E_1t_1^{-}}{i\hbar}}|E_1\rangle + c_{E2}e^{i\phi_{E2}}e^{\frac{E_2t_1^{-}}{i\hbar}}|E_2\rangle + c_{E3}e^{i\phi_{E3}}e^{\frac{E_3t_1^{-}}{i\hbar}}|E_3\rangle + c_{E4}e^{i\phi_{E4}}e^{\frac{E_4t_1^{-}}{i\hbar}}|E_4\rangle\right], \end{aligned}$$

where the non-zero matrix elements, written with $K = \hbar^2 - V_1V_2 + V_3^2 + V_4^2$ and $D = \hbar^4 + \hbar^2(V_1^2 + V_2^2 + 2(V_3^2 + V_4^2)) + (-V_1V_2 + V_3^2 + V_4^2)^2$, are

$$M_{1,1} = M_{2,2} = \frac{1}{D}\left[\hbar^2V_2(V_1 + V_2) + \hbar^2K + i\left(-\hbar^3(V_1 + V_2) + \hbar V_2K\right)\right],$$
$$M_{3,3} = M_{4,4} = \frac{1}{D}\left[\hbar^2V_1(V_1 + V_2) + \hbar^2K + i\left(-\hbar^3(V_1 + V_2) + \hbar V_1K\right)\right],$$
$$M_{1,3} = M_{2,4} = \frac{1}{D}\left\{-\left[\hbar^2(V_1 + V_2)V_3 - \hbar KV_4\right] - i\left[\hbar^2(V_1 + V_2)V_4 + \hbar KV_3\right]\right\},$$
$$M_{3,1} = M_{4,2} = \frac{1}{D}\left\{-\left[\hbar KV_4 + \hbar^2(V_1 + V_2)V_3\right] + i\left[\hbar^2(V_1 + V_2)V_4 - \hbar KV_3\right]\right\}.$$
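The last equation implies 4 relations:

$$c_{E1m}e^{i\phi_{E1m}}e^{\frac{E_1t_1^{+}}{i\hbar}} = \langle E_1|\hat{M}\left[c_{E1}e^{i\phi_{E1}}e^{\frac{E_1t_1^{-}}{i\hbar}}|E_1\rangle + c_{E2}e^{i\phi_{E2}}e^{\frac{E_2t_1^{-}}{i\hbar}}|E_2\rangle + c_{E3}e^{i\phi_{E3}}e^{\frac{E_3t_1^{-}}{i\hbar}}|E_3\rangle + c_{E4}e^{i\phi_{E4}}e^{\frac{E_4t_1^{-}}{i\hbar}}|E_4\rangle\right] = \langle E_1|\hat{M}\begin{pmatrix} \gamma_1(t_1^{-}) \\ \gamma_2(t_1^{-}) \\ \gamma_3(t_1^{-}) \\ \gamma_4(t_1^{-}) \end{pmatrix}, \tag{45}$$

$$c_{E2m}e^{i\phi_{E2m}}e^{\frac{E_2t_1^{+}}{i\hbar}} = \langle E_2|\hat{M}\left[c_{E1}e^{i\phi_{E1}}e^{\frac{E_1t_1^{-}}{i\hbar}}|E_1\rangle + c_{E2}e^{i\phi_{E2}}e^{\frac{E_2t_1^{-}}{i\hbar}}|E_2\rangle + c_{E3}e^{i\phi_{E3}}e^{\frac{E_3t_1^{-}}{i\hbar}}|E_3\rangle + c_{E4}e^{i\phi_{E4}}e^{\frac{E_4t_1^{-}}{i\hbar}}|E_4\rangle\right] = \langle E_2|\hat{M}\begin{pmatrix} \gamma_1(t_1^{-}) \\ \gamma_2(t_1^{-}) \\ \gamma_3(t_1^{-}) \\ \gamma_4(t_1^{-}) \end{pmatrix}, \tag{46}$$

$$c_{E3m}e^{i\phi_{E3m}}e^{\frac{E_3t_1^{+}}{i\hbar}} = \langle E_3|\hat{M}\left[c_{E1}e^{i\phi_{E1}}e^{\frac{E_1t_1^{-}}{i\hbar}}|E_1\rangle + c_{E2}e^{i\phi_{E2}}e^{\frac{E_2t_1^{-}}{i\hbar}}|E_2\rangle + c_{E3}e^{i\phi_{E3}}e^{\frac{E_3t_1^{-}}{i\hbar}}|E_3\rangle + c_{E4}e^{i\phi_{E4}}e^{\frac{E_4t_1^{-}}{i\hbar}}|E_4\rangle\right] = \langle E_3|\hat{M}\begin{pmatrix} \gamma_1(t_1^{-}) \\ \gamma_2(t_1^{-}) \\ \gamma_3(t_1^{-}) \\ \gamma_4(t_1^{-}) \end{pmatrix}, \tag{47}$$

$$c_{E4m}e^{i\phi_{E4m}}e^{\frac{E_4t_1^{+}}{i\hbar}} = \langle E_4|\hat{M}\left[c_{E1}e^{i\phi_{E1}}e^{\frac{E_1t_1^{-}}{i\hbar}}|E_1\rangle + c_{E2}e^{i\phi_{E2}}e^{\frac{E_2t_1^{-}}{i\hbar}}|E_2\rangle + c_{E3}e^{i\phi_{E3}}e^{\frac{E_3t_1^{-}}{i\hbar}}|E_3\rangle + c_{E4}e^{i\phi_{E4}}e^{\frac{E_4t_1^{-}}{i\hbar}}|E_4\rangle\right] = \langle E_4|\hat{M}\begin{pmatrix} \gamma_1(t_1^{-}) \\ \gamma_2(t_1^{-}) \\ \gamma_3(t_1^{-}) \\ \gamma_4(t_1^{-}) \end{pmatrix}. \tag{48}$$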
The probability of occupancy of the eigenenergies $E_1$, $E_2$, $E_3$ and $E_4$ for the interacting qubit system after the measurement of the charged particle passage is given by $|c_{E1m}|^2$, $|c_{E2m}|^2$, $|c_{E3m}|^2$, $|c_{E4m}|^2$, and the phase imprint on a given eigenenergy state is given by the factors $e^{i\phi_{E1m}}$, $e^{i\phi_{E2m}}$, $e^{i\phi_{E3m}}$, $e^{i\phi_{E4m}}$. Let us consider the case of two symmetric qubits whose system is depicted in Fig. 2. We have the following Hamiltonian:

$$\hat{H} = \begin{pmatrix} E_{p1} + E_{p1'} + E_{c(1,1')} & it_{2si} + t_{2sr} & it_{1si} + t_{1sr} & 0 \\ t_{2sr} - it_{2si} & E_{p1} + E_{p2'} + E_{c(1,2')} & 0 & it_{1si} + t_{1sr} \\ t_{1sr} - it_{1si} & 0 & E_{p2} + E_{p1'} + E_{c(2,1')} & it_{2si} + t_{2sr} \\ 0 & t_{1sr} - it_{1si} & t_{2sr} - it_{2si} & E_{p2} + E_{p2'} + E_{c(2,2')} \end{pmatrix} \tag{49}$$

that can be simplified by placing the 2-qubit system in a geometrical configuration giving the electrostatic energies $E_{c(1,1')} = E_{c(2,2')} = E_{c1} = \frac{q^2}{d}$ and $E_{c(2,1')} = E_{c(1,2')} = E_{c2} = \frac{q^2}{\sqrt{d^2 + (a + b)^2}}$. We set $E_{p2} = E_{p2'} = E_{p1} = E_{p1'} = E_p$ and we introduce $E_{pp} = 2E_p + \frac{q^2}{d}$ and $E_{pp1} = 2E_p + \frac{q^2}{\sqrt{d^2 + (a + b)^2}}$. Finally, the simplified Hamiltonian has the following form:

$$\hat{H} = \begin{pmatrix} E_{pp} & it_{2si} + t_{2sr} & it_{1si} + t_{1sr} & 0 \\ t_{2sr} - it_{2si} & E_{pp1} & 0 & it_{1si} + t_{1sr} \\ t_{1sr} - it_{1si} & 0 & E_{pp1} & it_{2si} + t_{2sr} \\ 0 & t_{1sr} - it_{1si} & t_{2sr} - it_{2si} & E_{pp} \end{pmatrix}. \tag{50}$$

We obtain 4 orthogonal eigenstates of the system.
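The spectrum of the simplified Hamiltonian of Eq. (50) can be obtained numerically. The sketch below also compares it with the closed form $E = \tfrac{1}{2}(E_{pp} + E_{pp1}) \pm \tfrac{1}{2}\sqrt{(E_{pp} - E_{pp1})^2 + 4(|t_1| \pm |t_2|)^2}$, which is not quoted from the text but follows from splitting Eq. (50) into two 2x2 blocks (an assumption of this example that the test verifies). Here $t_k = t_{ksr} + it_{ksi}$; the function names and numerical values are hypothetical.

```python
import numpy as np

def symmetric_two_qubit_hamiltonian(Epp, Epp1, t1, t2):
    """Simplified Hamiltonian of Eq. (50) with t_k = t_ksr + i*t_ksi."""
    return np.array([[Epp, t2, t1, 0],
                     [np.conj(t2), Epp1, 0, t1],
                     [np.conj(t1), 0, Epp1, t2],
                     [0, np.conj(t1), np.conj(t2), Epp]], dtype=complex)

def analytic_energies(Epp, Epp1, t1, t2):
    """Closed-form eigenenergies (derived for this sketch), sorted ascending."""
    mean = 0.5 * (Epp + Epp1)
    delta = Epp - Epp1
    r_minus = np.sqrt(delta**2 + 4 * (abs(t1) - abs(t2))**2)
    r_plus = np.sqrt(delta**2 + 4 * (abs(t1) + abs(t2))**2)
    return np.sort([mean - 0.5 * r_plus, mean - 0.5 * r_minus,
                    mean + 0.5 * r_minus, mean + 0.5 * r_plus])

# Hypothetical parameter values in arbitrary energy units.
Epp, Epp1 = 2.3, 2.1
t1, t2 = 0.2 + 0.1j, 0.05 - 0.3j
H = symmetric_two_qubit_hamiltonian(Epp, Epp1, t1, t2)

numeric = np.linalg.eigvalsh(H)          # ascending order
analytic = analytic_energies(Epp, Epp1, t1, t2)
print(numeric)
print(analytic)
```

The phases of $t_1$ and $t_2$ drop out of the spectrum and enter only the eigenvectors.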
Fig. 1. (Top left): scheme of position based qubit as given by [6] and act of weak measurement by external charged probe [3, 13, 14]; (Top right): act of passage of charged particle in the proximity of position based qubit and renormalization of qubit confining potential due to the external perturbation (panel labels: moving charge; potential V(x) versus x; initial potential; potential changed due to external flow of charge); (Middle): scheme of various energy levels present in qubit [4]; (Bottom): different qubit eigenenergy levels for different confining potential cases. It is worth mentioning that passing electric charge can induce quantum system transitions between many energetic levels.

Fig. 1. (continued) a: Example of various energy levels existing in the system of 2 coupled quantum dots. Localized and delocalized states can be spotted. b: Example of single-electron wavefunction distributions corresponding to transition of quantum information from eigenenergy qubit to Wannier qubit [4]. c: Effective potential of single-electron wavefunction corresponding to transition of quantum information from eigenenergy qubit to Wannier qubit [4].
⎞
(Epp−Epp1)2 +4 −2 t1si2 +t1sr2 t2si2 +t2sr2 +t1si2 +t1sr2 +t2si2 +t2sr2 −Epp +Epp1 (t1si−it1sr)(t2si−it2sr) ⎜ ⎟ ⎜ ⎟ ⎜ ⎟ ,⎟ ⎜
⎜ ⎟
⎜2 ⎟ 4 −2 2 −Epp +Epp1 +t2 +t2 t2 +t2 (Epp −Epp1 )2 +4 −2 t2 +t2 t2 +t2 t2 +t2 t2 +t2 +t2 +t2 t2 +t2sr +t2 +t2 +t2 +t2 −(Epp −Epp1 ) ⎜ ⎟ 1sr 2sr 1sr 2sr 1sr 2sr 2sr 1sr 1sr 2si 1si 2si 1si 1si 1si 2si 1si 2si 2si ⎜ ⎟ ⎜ ⎟
⎜ ⎟ ⎜ ⎟ t1si2 +t1sr2 t2si2 +t2sr2 −t1si2 −t1sr2 (t1sr+it1si) ⎜ ⎟ ⎜ ⎟ , ⎜ ⎟
⎜ ⎟
⎜ ⎟ 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ⎜ ⎟ t1si +t1sr 4 −2 t1si +t1sr −(Epp−Epp1) t1si +t1sr t2si +t2sr +t1si +t1sr +t2si +t2sr −Epp+Epp1 (Epp−Epp1) +4 −2 t2si +t2sr +t1si +t1sr +t2si +t2sr ⎜ ⎟ ⎜ ⎟
|E1 = ⎜ ⎟
⎜ ⎟ ⎜ ⎟ t1si2 +t1sr2 t2si2 +t2sr2 −t2si2 −t2sr2 (t2sr+it2si) ⎜ ⎟ ⎜ ⎟ , ⎜ ⎟
⎜ ⎟
[Eqs. (51) and (52) and the neighbouring expressions: explicit four-component column vectors for the eigenstates |E2⟩, |E3⟩ and |E4⟩, written in terms of Epp, Epp1, t1sr, t1si, t2sr and t2si.]
Fig. 2. Concept of two electrostatically interacting qubits and weak measurement of moving heavy charged particle from accelerator beam or solar wind.
with 4 energy eigenvalues

$$E_1 = \frac{1}{2}\left(-\sqrt{(E_{pp}-E_{pp1})^2 + 4\left(-2\sqrt{(t_{1si}^2+t_{1sr}^2)(t_{2si}^2+t_{2sr}^2)} + t_{1si}^2+t_{1sr}^2+t_{2si}^2+t_{2sr}^2\right)} + E_{pp}+E_{pp1}\right) \tag{53}$$

$$E_2 = \frac{1}{2}\left(+\sqrt{(E_{pp}-E_{pp1})^2 + 4\left(-2\sqrt{(t_{1si}^2+t_{1sr}^2)(t_{2si}^2+t_{2sr}^2)} + t_{1si}^2+t_{1sr}^2+t_{2si}^2+t_{2sr}^2\right)} + E_{pp}+E_{pp1}\right) \tag{54}$$

$$E_3 = \frac{1}{2}\left(-\sqrt{(E_{pp}-E_{pp1})^2 + 4\left(+2\sqrt{(t_{1si}^2+t_{1sr}^2)(t_{2si}^2+t_{2sr}^2)} + t_{1si}^2+t_{1sr}^2+t_{2si}^2+t_{2sr}^2\right)} + E_{pp}+E_{pp1}\right) \tag{55}$$

$$E_4 = \frac{1}{2}\left(+\sqrt{(E_{pp}-E_{pp1})^2 + 4\left(+2\sqrt{(t_{1si}^2+t_{1sr}^2)(t_{2si}^2+t_{2sr}^2)} + t_{1si}^2+t_{1sr}^2+t_{2si}^2+t_{2sr}^2\right)} + E_{pp}+E_{pp1}\right) \tag{56}$$
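As a quick numerical check of Eqs. (53)-(56), the following minimal sketch evaluates the four eigenenergies. It assumes that the inner product term sits under a square root, so that the bracket reduces to (|t1| ∓ |t2|)² with |tk| = sqrt(tksr² + tksi²). The function name `eigenenergies` and its argument order are illustrative choices, not part of the original text.

```python
import math

def eigenenergies(Epp, Epp1, t1sr, t1si, t2sr, t2si):
    """Evaluate E1..E4 of Eqs. (53)-(56) (hypothetical helper).

    Assumes the bracket t1^2 + t2^2 -/+ 2*sqrt(t1^2 * t2^2) equals
    (|t1| -/+ |t2|)^2, where |tk| is the modulus of the complex hopping term.
    """
    t1 = math.hypot(t1sr, t1si)          # |t1|
    t2 = math.hypot(t2sr, t2si)          # |t2|
    mean = Epp + Epp1
    root_minus = math.sqrt((Epp - Epp1) ** 2 + 4.0 * (t1 - t2) ** 2)
    root_plus = math.sqrt((Epp - Epp1) ** 2 + 4.0 * (t1 + t2) ** 2)
    E1 = 0.5 * (mean - root_minus)       # Eq. (53)
    E2 = 0.5 * (mean + root_minus)       # Eq. (54)
    E3 = 0.5 * (mean - root_plus)        # Eq. (55)
    E4 = 0.5 * (mean + root_plus)        # Eq. (56)
    return E1, E2, E3, E4

if __name__ == "__main__":
    print(eigenenergies(1.0, 0.5, 0.3, 0.4, 0.0, 0.2))
```

Under this reading the pairs (E1, E2) and (E3, E4) each sum to Epp + Epp1, and the two pairs coincide when one of the hopping terms vanishes.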
4 Conclusion

I have shown the effect of the transport of a charged particle, such as a proton in an accelerator beam, acting on a single electron in a position-based qubit placed in the proximity of the accelerator. One can conclude that the beam of moving charged particles changes the occupancy of the energy levels in the position-based electrostatic qubit and induces a phase imprint across the qubit. In the most general case one can expect that the two-level system represented by the qubit will change its initial occupancy from 2 into N energy levels, with a phase imprint made on each eigenenergy level. However, the conducted considerations are valid under the assumption that the perturbing factor expressed by the moving charge in the accelerator beam is weak. They are also valid for the case of a floating potential, that is, a potential polarizing the qubit state. Therefore the presented picture can be considered a phenomenological model of noise for the electrostatic qubit that allows a qualitative and quantitative assessment of the effect of noise on the two kinds of decoherence times commonly known as T1 and T2. These results were presented at the seminar [5]. In this way one can account for the very complicated electromagnetic environment in which a position-based electrostatic semiconductor qubit is placed. In particular, one can trace the decay of quantum information encoded in the qubit. One also expects that, in the case of two electrostatically interacting qubits, the passage of external charged particles changes the quantum entanglement between the qubits and the anticorrelation function characterising the two interacting qubits. Part of this work was presented in [3,6,11]. The results can be extended in a straightforward way to more complicated structures by the mathematical framework given in [7,10,11]. Particular attention shall be paid to the structures depicted in Fig. 3.

Fig. 3. Zoo of position dependent qubit topologies that could be used for beam diagnostics.
References
1. Fujisawa, T., Hayashi, T., Hirayama, Y.: Electron counting of single-electron tunneling current. Appl. Phys. Lett. 84, 2343 (2004). https://doi.org/10.1063/1.1691491
2. Bednorz, A., Franke, K., Belzig, W.: Noninvasiveness and time symmetry of weak measurements. New J. Phys. 15 (2013). https://iopscience.iop.org/article/10.1088/1367-2630/15/2/023043
3. Pomorski, K., Giounanlis, P., Blokhina, E., Leipold, D., Staszewski, R.: Analytic view on coupled single electron lines. Semicond. Sci. Technol. 34(12), 125015 (2019). https://doi.org/10.1088/1361-6641/ab4f40/meta
4. Pomorski, K., Giounanlis, P., Blokhina, E., Leipold, D., Peczkowski, P., Staszewski, R.: From two types of electrostatic position-dependent semiconductor qubits to quantum universal gates and hybrid semiconductor-superconducting quantum computer. In: Proceedings of SPIE, vol. 11054, Superconductivity and Particle Accelerators 2018, 110540M (2019). https://www.spiedigitallibrary.org/conference-proceedings-of-spie/11054/110540M/From-two-types-of-electrostatic-position-dependent-semiconductor-qubits-to/10.1117/12.2525217.short
5. Pomorski, K.: Detection of moving charge by position dependent qubits. CERN 2020 - Dublin UCD Webinar, 23 January 2020. https://indico.cern.ch/event/876476/contributions/3693568/
6. Pomorski, K., Peczkowski, P., Staszewski, R.: Analytical solutions for N interacting electron system confined in graph of coupled electrostatic semiconductor and superconducting quantum dots in tight-binding model. Cryogenics 109, 103117 (2020)
7. Pomorski, K., Staszewski, R.: Towards quantum internet and non-local communication in position-based qubits. In: AIP Conference Proceedings, vol. 2241, p. 020030 (2020). https://doi.org/10.1063/5.0011369
8. Bashir, I., Asker, M., Cetintepe, C., Leipold, D., Esmailiyan, A., Wang, H., Siriburanon, T., Giounanlis, P., Blokhina, E., Pomorski, K., Staszewski, R.B.: Mixed-signal control core for a fully integrated semiconductor quantum computer system-on-chip. In: ESSCIRC 2019 - IEEE 45th European Solid State Circuits Conference (ESSCIRC) (2019). https://ieeexplore.ieee.org/abstract/document/8902885/
9. Giounanlis, P., Blokhina, E., Pomorski, K., Leipold, D., Staszewski, R.: Modeling of semiconductor electrostatic qubits realized through coupled quantum dots. IEEE Open Access (2019). https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8681511
10. Pomorski, K.: Analytic view on N body interaction in electrostatic quantum gates and decoherence effects in tight-binding model. arXiv:1912.01205 (2019). https://arxiv.org/abs/1912.01205
11. Pomorski, K., Staszewski, R.: Analytical solutions for N-electron interacting system confined in graph of coupled electrostatic semiconductor and superconducting quantum dots in tight-binding model with focus on quantum information processing. arXiv:1907.02094 (2019). https://arxiv.org/abs/1907.03180
12. Mills, A.R., Zajac, D.M., Gullans, M.J., Schupp, F.J., Hazard, T.M., Petta, J.R.: Shuttling a single charge across a one-dimensional array of silicon quantum dots. Nat. Commun. 10, 1063 (2019)
13. Staszewski, R.B., et al.: Position-based CMOS charge qubits for scalable quantum processors at 4K. In: 2020 IEEE International Symposium on Circuits and Systems (ISCAS), Sevilla, pp. 1–5 (2020). https://doi.org/10.1109/ISCAS45731.2020.9180789
14. Likharev, K.K.: Single-electron devices and their applications. Proc. IEEE 87(4), 606–632 (1999). https://doi.org/10.1109/5.752518
Steganography Application Using Combination of Movements in a 2D Video Game Platform

Ricardo Mandujano, Juan Gutierrez-Cardenas, and Marco Sotelo Monge
Universidad de Lima, Lima, Peru
[email protected], [email protected], [email protected]
Abstract. Steganography is the art of hiding information within a harmless medium such as digital images, video, or audio. Its purpose is to embed and transmit a message without raising the suspicion of a third party or attacker who wishes to obtain that secret information. This research proposes a steganographic methodology that uses a 2D platform video game as the cover object. The experimental model uses combinations of horizontal and vertical movements of the enemies, applying base-5 (quinary) numbering, where each character of the message is assigned a quinary value. In the proposed improvement, the video game is set with 20 enemies per level along the map. The concealment is divided into 3 phases: choice of the message, allocation of quinary values, and generation of the videogame level. Finally, the limitations found during experimentation are presented.

Keywords: 2D videogames · Steganography · Information hiding
© Springer Nature Switzerland AG 2021
K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 54–69, 2021. https://doi.org/10.1007/978-3-030-63089-8_4

1 Introduction

Steganography is a technique that has been used since ancient times. It allows two parties to transmit a hidden message between them. The message is usually hidden in an entity known as a cover object. The cover object can be any digital medium, such as a picture, a video, or even a videogame [6]. A problem with this technique is the embedding capacity of the cover object. Almohammad [1] suggested that the quantity of information that, for example, an image can conceal is somewhat limited. This limitation shows when the object that we want to hide has a size close to that of the cover object; in this situation, the carrier object starts to lose definition. This loss of definition could raise suspicion that the cover object is hiding a payload.

We have found that only a limited number of publications make use of video games for steganographic purposes. Video games have been shown to be a viable option because they contain several objects in which a payload can be hidden; for example, background objects, maps, and even enemy moves. In this research, we propose using the movements of a set of videogame enemies to hide a payload. We have chosen a base-5 numeric system in which each digit is mapped to a different move of an enemy, and these moves serve to conceal a message. Regarding the type of videogame, we have chosen a multiplatform videogame based on a 2D scenario.

Our research work is organized in the following manner. In Sect. 2, we compile brief information about a couple of works that apply steganography to video games. In Sect. 3, we describe the basics of steganography along with some methods used in the video game field. In Sect. 4, we describe our methodology, the algorithms, and the set of steps to hide a message by using a base-5 numeric system. We end the paper by showing the results of our proposal, along with an analysis of its embedding capacity.
2 Background

2.1 Modern Steganographic System

Steganography represents unseen communication achieved by hiding information within digital objects. Its goal is to avoid drawing attention to the transmission of a hidden message; if attention is drawn, the goal is frustrated and the steganographic system fails [4]. The general embedding process involves the following elements: the secret message M, the cover object (the digital object in which the hiding is performed), message encryption (when necessary), and a communication channel if the message is to be transmitted to a receiver [10]. The general approach applied in most steganography techniques is depicted in Fig. 1, following Joseph and Vishnukumar [5].
Fig. 1. General model for steganography techniques.
2.2 Types of Modern Steganography

For digital images, hiding is based on the frequency level of the file pixels. The LSB (least significant bit) technique is the best-known implementation, since it manipulates the least significant bits of the pixel values to hide the message. To give an idea of the storage capacity of a digital object, Fig. 2 shows a 24-bit image in which it is possible to store at least 3 bits per pixel. An image with dimensions of 800 × 600 pixels can therefore embed a total of 800 × 600 × 3 = 1,440,000 bits.
Fig. 2. Steganography representation with digital images.
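The capacity figure and the LSB idea can be illustrated with a minimal sketch. It assumes an image given as a list of (R, G, B) tuples with 8-bit channels and one hidden bit per channel; the function names `capacity_bits`, `embed_lsb` and `extract_lsb` are hypothetical and are not taken from any of the cited works.

```python
def capacity_bits(width, height, channels=3, bits_per_channel=1):
    """Number of payload bits that fit when using the LSB of every channel."""
    return width * height * channels * bits_per_channel

def embed_lsb(pixels, payload):
    """Return a copy of `pixels` with `payload` (bytes) written into the LSBs."""
    bits = [(byte >> (7 - i)) & 1 for byte in payload for i in range(8)]
    flat = [c for px in pixels for c in px]          # flatten R, G, B values
    if len(bits) > len(flat):
        raise ValueError("payload does not fit into the cover image")
    for i, bit in enumerate(bits):
        flat[i] = (flat[i] & ~1) | bit               # clear the LSB, then set it
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]

def extract_lsb(pixels, n_bytes):
    """Read `n_bytes` back from the LSBs of the flattened channel values."""
    flat = [c for px in pixels for c in px]
    out = bytearray()
    for b in range(n_bytes):
        byte = 0
        for i in range(8):
            byte = (byte << 1) | (flat[8 * b + i] & 1)
        out.append(byte)
    return bytes(out)

if __name__ == "__main__":
    cover = [(10, 20, 30)] * 8                       # tiny 8-pixel "image"
    stego = embed_lsb(cover, b"Hi")
    print(extract_lsb(stego, 2), capacity_bits(800, 600))
```

Each channel value changes by at most one intensity level, which is why LSB embedding is visually hard to notice at low payload sizes.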
The process for embedding information into the pixels of an image consists of extracting all the pixels of the image and storing them in an array. The characters to be embedded are then extracted and stored in a character array, and the same is done for the stego key, which is stored in an array of keys. The first pixel is chosen, and the characters of the array of keys are placed in the first component of that pixel; if there are more characters in the array, the remaining ones are placed in the first component of the following pixels. Table 1 describes the main characteristics of steganography and information hiding most commonly considered in the literature [11].

Table 1. Characteristics of steganography and information hiding.
Characteristic | Description
Undetectability | Represents how strongly the message is hidden in the cover object. It ensures that the information is not distorted
Capacity | Amount of data that can be camouflaged into the cover object
Robustness | Prevents intrusion with hidden information during transmission to the destination
Security | Responsible for protecting the information with an encryption key in case the intruder wants to decipher the message
Other Types of Modern Cover Objects. Font-size based steganography in text first encrypts the payload to generate a binary stream, and then chooses a factor by which the font size must be changed (known as the F-factor). In addition, there is color-shifting steganography, where the user changes the color of the characters in the false text according to the color selected and the color calculated by the program. Platform games offer various mechanics that can be used to hide data. Wolf [12] highlights that this kind of game is characterized by difficulty levels, directional movement actions, jumps, and attacks, among others.

Validation Techniques with Steganalysis. Steganalysis is intended to detect the existence of hidden information in a stego object. It represents the opposite of steganography, since it aims to break the hiding algorithm and thus uncover the secret message. A steganographic technique is said to be safe when it is resistant to various types of steganalysis attacks. Bearing this in mind, there are various techniques that try to break the security of the cover object and extract the secret message. According to Mahato et al., it is reasonable to assume that the adversary will investigate whether there is something unusual in the cover object [7]. In addition, one should assume that the adversary may know the general method but ignore the internal details or "agreements" made between the sender and the receiver regarding the communication channel.

Traffic Analysis in the Communication Channel. According to Desoky and Younis, in their research on the Chestega methodology [2], one of the most frequent techniques a third party could carry out is to analyze the traffic observed in the communication channel as well as the access patterns to the cover objects. The main objective of this type of analysis is to detect unusual activity between the sender and the receiver of the hidden message by inspecting network traffic features. This approach applies to any contemporary steganography technique regardless of the type of cover object used, and it obtains effective results at low cost.

Comparison and Contrast Attacks. Inaccurate or unusual details during gameplay (movements, patterns, visual characteristics, etc.) in video game implementations are an intuitive source of noise that can alert a third party. It is possible to find contradictions when data is used that violates the rules of the videogame itself. Taking this into account, the use of untraceable data (based on a private context) is usually considered an attack. Note that traffic analysis techniques can support comparison attacks by outlining differences between current and old traffic data to find any visible inconsistencies. However, performing this attack is usually a challenge, since the data must be very consistent in order to be properly analyzed.
3 State of the Art

Gibbs and Shashidhar introduced a method called StegoRogue [3]. In this method, the authors hide a message inside a videogame map generated in two dimensions. The maps are drawn on the terminal screen using libraries of ASCII characters, and each level follows its own set of rules and algorithms. The StegoRogue methodology uses the depth-first search algorithm with stacks as data structures, so that every playthrough of Rogue is unique. StegoRogue begins the map generation from the center and branches out in the form of a tree. Each room represents a character of the message, along with the creation of different items such as food, treasures, equipment, or usable items. The fourth root expands with only a child room. From this point, the other generated rooms take the form of trees, which are stored on a stack. When the stack contains a number of rooms greater than half the length of the message, a random number is generated.
Fig. 3. StegoRogue map: a generated 2D map with rooms that may contain the same letters but do not conceal any message [3].
This number breaks the extensions of rooms with excessive length. After generating a new room, the authors obtain the coordinates and the direction of the upper rooms by using a peek function. The areas adjacent to these coordinates are traversed anticlockwise to find an empty space for a new room. If there is no such space, a pop() function is invoked and the check for space is repeated. This procedure is repeated until the map is completed. Verification consists of checking that every character of the message has been inserted into a generated room, or that the stack is empty [3]. Figure 3 shows an example of this method on a map of 160 × 86 units. The generation of each map took at least one second. The technique can be modified to adapt better to new gaming algorithms applied to two-dimensional maps. It is also feasible to store larger messages by chaining multiple map levels, that is, by joining maps through paths such as ladders or teleports. The use of terrains and generated geographical maps can also be explored for steganographic purposes.

In the research work of Mosunov et al. [8], the authors presented a steganography model based on the classic Pac-Man game. They mentioned that the ghosts in Pac-Man could use a deterministic pattern of movements for an obfuscation algorithm. In their proposal, the authors take the random movements generated by the ghosts when the player eats a power pill. These movements are used for hiding a message M. A ghost can transmit a bit of information when it reaches an intersection with three or more possible paths. Table 2 shows the types of bits to transmit and the intersections that a ghost can take during the game. Nevertheless, during the experimentation, the authors detected a low embedding rate. The reason was that the ghosts could only be frightened for a short period (4.5 to 6 s). Additionally, a particular scenario can arise at the moment of choosing an intersection: for example, multiple ghosts could arrive concurrently at the same intersection. In that case, the algorithm might not be able to choose an adequate path to transmit one bit of information from the message. Figure 4 shows the types of paths and intersections that a ghost can follow.

Table 2. Type of variables for a message in bits and type of intersections to choose.
Type of bit | Transmission | Intersections
0-directions | 0 bit | 2 to more
1-directions | 1 bit | Unique

Fig. 4. Types of intersections and paths that the ghost choose for transmitting the bits of a concealed message. Source: Jeong et al. (2013)

Ou and Chen defined a steganography implementation based on the Tetris videogame [9]. The authors used the different shapes shown in the Tetris game, the tetriminos I, J, L, O, S, T, and Z, which resemble letters of the alphabet and are used as the digits of a base-7 numbering system. To generate tetriminos, Tetris uses two types of generators: (1) a dice-like generator, which works like throwing a seven-sided die, and (2) a shuffle-like generator, which works like a random card shuffle. The authors focused on the dice-like generator. The proposed system requires two conditions so that the generated sequence resembles one generated by the original game. The first is that, when the game starts, it should generate a new sequence of tetriminos, regardless of whether the secret message has changed. The second is that, after the sequence of tetriminos has been shown to the player, the game should not stop until the player decides to quit.
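The idea of treating the seven tetriminos as base-7 digits can be sketched as follows. This is only an illustration of the numbering step: the digit order I, J, L, O, S, T, Z = 0..6 and the function names are assumptions, and the continuation flags, length field and RSA encryption of the scheme in [9] are not modelled.

```python
TETROMINOES = "IJLOSTZ"   # assumed digit order: I=0, J=1, ..., Z=6

def to_base(n, base):
    """Digits of a non-negative integer in the given base, most significant first."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return [0]
    digits = []
    while n:
        n, r = divmod(n, base)
        digits.append(r)
    return digits[::-1]

def from_base(digits, base):
    """Inverse of to_base."""
    n = 0
    for d in digits:
        n = n * base + d
    return n

def number_to_pieces(n):
    """Map a number to a sequence of tetrimino letters via its base-7 digits."""
    return "".join(TETROMINOES[d] for d in to_base(n, 7))

def pieces_to_number(pieces):
    """Recover the number from a sequence of tetrimino letters."""
    return from_base([TETROMINOES.index(p) for p in pieces], 7)

if __name__ == "__main__":
    seq = number_to_pieces(7654321010)
    print(seq, pieces_to_number(seq))
```

Because every base-7 digit is equally usable, an encrypted (and therefore roughly uniform) payload produces piece frequencies close to those of a dice-like generator.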
Fig. 5. An example of a hidden message that contains the string 161641502277 [9].
Figure 5 shows an example of hiding the message 7654321010, which is transformed into a base-7 numerical system. The numbers 1 and 0 indicate that the message continues and ends, respectively. The number 3, denoted by P, indicates the length of the message that is missing. For security purposes, the system uses an RSA cryptographic system to encrypt the original message before applying the proposal.

ALGORITHM 1: Concealment sequence
Input: Rs, pub_key, S0
Output: ST
1: do while (game status) is false
2: Phase 1: Random Number Generation Rs
3: Rs n
4: tetromino outlet n tetrominos
5: encryption pub_key + n tetrominos
6: Random generation of tetrominos
7: ST Random sequences of tetrominos
8: if all tetrominos are shown = true
9: Play again? true
10: Start phase 2 true
11: Phase 2: Random number generation
12: Rs n
13: Stegoed sequence generation
14: ST stegoed sequence + S0
15: stegoed tetromino outlet true
16: elseif
17: tetrominos outlet n tetrominos
18: end if
19: end
20: Phase 3
21: do
22: Game preservation true
23: end
Figure 6 shows the results obtained by Ou and Chen [9] when their proposal was compared with other implementations of the Tetris game. The authors mention that the examined implementations and the proposed one all show a uniform distribution in the generation of tetriminos. This characteristic is relevant because it shows that their proposal did not show any bias or anomaly in the tetrimino generation that could raise suspicion that the game contains a payload.

Fig. 6. Probability of appearance for each piece of tetriminos in four videogames [9].

4 Methodology

4.1 Proposal

Our proposal is based on the combination of horizontal and vertical movements performed by a set of enemies that are presented to a player during a game. We employ a base-5 numerical system, where the digits from 0 to 4 are used to represent any character to be transmitted. Figure 7 shows our proposed steganographic procedure, which uses the movements of a set of enemies in a 2D platform videogame as a cover object to transmit a message to a known recipient.

Fig. 7. General workflow proposed that shows the player/enemy functionalities and the step of message hiding.

We decided to implement our steganographic proposal in a two-dimensional platform game. We tried to give our game the same characteristics and mechanics as the classic Super Mario Bros videogame from 1985. The general characteristics of the game are the following.

Player Movements. For this type of platform game, the player should be able to move in three different directions (left, right, and up/jump). Each controller input is translated into the desired action. For each input direction, a variable called the movement modifier is updated according to the actions made by the player. In this particular case, we have ruled out a downward movement because our maps do not contain additional elements, such as ladders or tunnels, that the player could use. Nevertheless, the enemies that the player faces do have this functionality implemented.

Combination of Movements in Base-5. In our proposal, we employ a combination of movements by using a base-5 numeric system, which is represented by the digits from 0 to 4. Figure 8 shows the number assigned to each movement and to a single attack from an enemy. We assign a corresponding base-5 number to each digit from 0 to 9 and to each uppercase and lowercase letter of the alphabet. For example, number 2 is represented by 3, number 4 by 10, number 5 by 11, and so on. For the uppercase letters, the coding goes from 21 for letter A and 22 for letter B through 121 for letter Z; likewise, the lowercase letters go from 122 for letter a and 123 for letter b up to 222 for letter z. Table 3 lists all the movements with their corresponding base-5 digit. To avoid overcrowding the game screen with enemies, we consider that 20 enemies per screen is visually appealing.
Fig. 8. Scheme showing the movements and attacks that an enemy can perform and that are encoded in a base-5 numeric system.
4.2 Message Structure

Table 4 shows a hidden concatenated text message, "Thereisnothingunusualinthismessage", with 34 characters, along with its transformation into base-5 and the set of actions or movements used to embed this message into the game.

4.3 Proof of Concept

Module of Message Hiding. In this part, we present the flow of processes for hiding a message based on the movements and simple attack of the enemies. Figure 9 shows the block diagram that describes our steganographic process. This diagram is divided into the following phases. In phase 1, we start with the selection of the message that we want to process. We check that its length does not exceed twenty characters, because each level is configured for that number of enemies.
Table 3. Mapping between the movement of an enemy and their corresponding base-5 transformation.
Movement type | Value in base-5 numbering
Left movement | 1
Right movement | 2
Up (jump) | 3
Down (bend over) | 4
Attack movement | 0

Table 4. Example of an embedded message in base-5 with their corresponding sequence of moves of an enemy.
34 characters | Quinary based | Sequence movements
T | Sec (110) | Left, left, attack
h | Sec (134) | Left, up, down
e | Sec (131) | Left, up, left
r | Sec (204) | Right, attack, down
e | Sec (131) | Left, up, left
i | Sec (140) | Left, down, attack
s | Sec (210) | Right, left, attack
… | … | …
s | Sec (210) | Right, left, attack
a | Sec (122) | Left, right, right
g | Sec (133) | Left, up, up
e | Sec (131) | Left, up, left
If we need to process more characters, we generate an additional level for the remaining part of the message. Phase 2 assigns the corresponding base-5 number to each character. At this point, we double-check that the generated base-5 number is no longer than three digits, which is the length needed to cover all our digits and alphabetical characters. In phase 3, we assign the sequence of actions to each enemy and position the enemies on the map, along with loading the maps and background needed for our game.
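The three phases can be sketched in a few lines. The sketch assumes that the base-5 code of a character is its 1-based position in the sequence 0-9, A-Z, a-z written in base 5, which reproduces the values quoted in the text (A = 21, Z = 121, a = 122, z = 222) and the rows of Table 4. The names `to_quinary`, `moves_for` and `split_into_levels` are illustrative, and codes are left unpadded because the text does not say whether shorter codes are padded to three digits.

```python
import string

# Assumed symbol order: digits, uppercase letters, lowercase letters (62 symbols).
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase
# Table 3: base-5 digit -> enemy action.
MOVES = {"0": "attack", "1": "left", "2": "right", "3": "up", "4": "down"}
ENEMIES_PER_LEVEL = 20

def split_into_levels(message):
    """Phase 1: one character per enemy, at most 20 enemies per level."""
    return [message[i:i + ENEMIES_PER_LEVEL]
            for i in range(0, len(message), ENEMIES_PER_LEVEL)]

def to_quinary(ch):
    """Phase 2: base-5 code of a character (raises ValueError if unsupported)."""
    value = ALPHABET.index(ch) + 1        # '0' -> 1, 'A' -> 11, 'a' -> 37
    digits = ""
    while value:
        value, r = divmod(value, 5)
        digits = str(r) + digits
    return digits                         # at most three digits ('z' -> '222')

def moves_for(ch):
    """Phase 3: sequence of actions performed by the enemy carrying `ch`."""
    return [MOVES[d] for d in to_quinary(ch)]

if __name__ == "__main__":
    for level in split_into_levels("Thereisnothingunusualinthismessage"):
        print([(c, to_quinary(c), moves_for(c)) for c in level])
```

For the 34-character example message, this yields two levels with 20 and 14 enemies.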
Fig. 9. Block diagram of our steganographic algorithm proposal based on a base-5 numeric system.
Steganographic Sequence Algorithm

ALGORITHM 2: Steganographic sequence proposal
Input: M0, J0, E0, Ec
Output: Mc
1: if game status is false {
2: Phase 1: Select message to hide
3: Input M0 message
4: Message characters M0
5: Output Mc message
6: character counter n
7: for n=0; M0[n] not null; n++ {
8: if n
L1) must be greater than one and, second, the computer processing time for the branches on L1 to LNL must be significantly larger than the processing time for the main branch (L1). The processing time for every time step in the solution of the time-dependent problem must be large enough to justify the overhead time for creating several OpenMP threads for parallel execution. In the majority of practical problems, these conditions are not met but, even in a sequential solver variant, the method produces the solution in O(N) floating-point operations, where N is the size of the solution vector. The next planned step in the development of the direct solver in SCREAMER is to parallelize the solution process over several consecutive time steps.
5 Numerical Results

In this section, we present the performance of the two described direct solvers on three problems: a circuit with a main branch and 2 branches on level two, a circuit with a main branch and 4 branches on level two, and a circuit with a main branch and four branches on each of levels two and three. Table 1 presents the computer time for all three problems. We tested these solvers on one core of a standard quad-core desktop iMac with an Intel Core i7 2.93 GHz processor and 16 GB of RAM. We can see that, even in this situation, the new direct algorithm gives between 30% and 83% speedup in computer time for different numbers of branch levels, nodes in the branches, and different time scales. Unfortunately, the overhead on every time step in the OpenMP implementation of the second solver on a multicore computer was too large in these examples to allow a significant speedup in the parallel implementation. This result demonstrates the restriction of the proposed method: for the parallel approach to be efficient, the secondary branches in the circuit must be significantly longer than the main branch. This drawback can be overcome by using temporal parallelization of the algorithm. This development is now under consideration by the authors.
Table 1. Computing times (wallclock times).
Circuit topology | Sequential algorithm | Parallel algorithm | Speedup
Main Branch + 2 L2 branches | 255 s | 139 s | 83%
Main Branch + 4 L2 branches | 0.055 s | 0.039 s | 30%
Main Branch + 4 L2 branches + 4 L3 branches | 0.16 s | 0.070 s | 56%
6 Conclusions

SCREAMER is a fully open-source circuit code that can address most electrical circuits of interest to high-voltage, pulsed-power designers. The generalized implementation of SCREAMER has no limit on the size or the number of branches on branches. A new parallel direct solver was developed. The underlying numerical algorithm is based on graph partitioning, which is used to split the problem into a series of independent subsystems that can be solved in parallel. The partitioning in this approach is performed naturally, since it is determined by the existing branch structure of the circuit. The factorization and solution steps are executed in parallel at each branch level. The results of the test problems confirm the high efficiency of the proposed algorithm.
References 1. Kiefer, M.L., Widner, M.M.: Screamer – a single-line pulsed-power design tool. In: Proceedings of the 5th IEEE Pulsed Power Conference, Arlington, VA, USA, pp. 685–688 (1985) 2. Spielman, R.B., Gryazin, Y.: Screamer V4.0 – a powerful circuit analysis code. In: Proceedings of the 20th IEEE Pulsed Power Conference, Austin, TX, USA, pp. 637–642 (2015) 3. Spielman, R.B., Gryazin, Y.: Screamer: a optimized pulsed-power circuit-analysis tool. In: Proceedings of the IEEE International Power Modulator and High Voltage Conference, San Francisco, CA, USA, pp. 269–274 (2016) 4. Davis, T.A.: Direct Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, Philadelphia, PA (2006) 5. Davis, T.A., Rajamanickam, S.: A survey of direct methods for sparse linear systems. Acta Numerica 25, 383–566 (2016) 6. Duff, I.S., Erisman, A.M., Reid, J.K.: Direct Methods for Sparse Matrices. Oxford University Press, Oxford (2017) 7. George, A.: Nested dissection of a regular finite element mesh. SIAM J. Numer. Anal. 10(2), 345–363 (1973) 8. George, A., Liu, J.W.: Computer Solution of Large Sparse Positive Definite Systems. Prentice-Hall, Englewood Cliffs (1981)
Cartesian Genetic Programming for Synthesis of Optimal Control System
Askhat Diveev
Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences, Vavilov str. 44/2, 119333 Moscow, Russia
Abstract. The problem of general synthesis of an optimal control system for a mobile robot is considered. It is necessary to find a feedback control function such that the mobile robot can reach a given terminal position from any point of a region of initial conditions. The solution of this problem is a mathematical expression for the control function. To solve the problem, Cartesian genetic programming (CGP) is used. CGP is one of the symbolic regression methods, which allow a computer to find the analytical form of a mathematical expression numerically. CGP codes a multidimensional function in the form of an integer matrix on the basis of the sets of arguments and elementary functions. Every column of this matrix is the code of one function call. To search for the optimal solution, a variation genetic algorithm is used that implements the principle of small variations of the basic solution. The algorithm searches simultaneously for the code of the mathematical expression of the feedback control function and for the value of the parameter vector that is one of the arguments of this function. An example of a numerical solution of the control system synthesis problem for a mobile robot is presented. The concept of a space of machine-made functions is introduced. In this space functions cannot take infinite values, and every function can be represented by a Taylor series with a finite number of terms.
Keywords: Cartesian genetic programming · Symbolic regression · Control system synthesis
1 Introduction
© Springer Nature Switzerland AG 2021. K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 205–222, 2021. https://doi.org/10.1007/978-3-030-63089-8_13
Genetic programming [1] opened up the possibility of searching for mathematical expressions by computer. Previously, a researcher first fixed a mathematical expression up to the values of its parameters, and the computer then searched for optimal values of these parameters according to a given criterion. It is strange that such a great invention has remained overlooked by the mathematical community. Now any analytical problem whose solution has to be obtained in the form of a mathematical expression can be solved by computer. Nonlinear and differential equations, integrals, inverse functions and other problems can now be solved using genetic programming. Such tasks include the problem of control system synthesis. In this problem it is necessary to find a mathematical expression for the control function. In the general case this control function must have a special property: if it is substituted into the right-hand side of the differential equations of the mathematical model of the control object, then a stable equilibrium point appears in the solution space of these differential equations. All solutions of the differential equations from some region of initial conditions tend to this attraction point. Genetic programming was the first method that could be used to search for a mathematical expression. Now there are more than ten such methods, and all of them can be grouped as methods of symbolic regression. They include grammatical evolution [2], analytic programming [3], Cartesian genetic programming [4], inductive genetic programming [5], the network operator method [6], the method of parse-matrix evolution [7], binary genetic programming [8] and others. All these methods code a mathematical expression in the form of a special record and search for the optimal solution in the set of codes, as a rule by a genetic algorithm. To apply a genetic algorithm, a special crossover operator was developed for each method of symbolic regression. Any method of symbolic regression can be modified by applying the principle of small variations of the basic solution [9]; such methods then have the word “variation” in their name [10,11]. This principle is needed in problems with huge search spaces in order to narrow them down. In this work the control system synthesis problem for a mobile robot is solved. This problem is very important in control theory.
A solution of this problem makes it possible to build a control system and to make the control object stable. The complexity of the problem is that a single function, called the control function, has to be found for all initial conditions from some domain. This control function computes the values of the controls from the current state of the control object. As a result, the object has to be moved from any state of some region to a set point in the state space. As a rule, this problem is solved by technical or semi-analytical methods [12–14]. These methods require a study of the mathematical model of the control object and the construction of controllers for each control channel. We apply symbolic regression methods to solve the control synthesis problem. Previously we used the network operator method [15], genetic programming, analytic programming and others. These methods make it possible to construct numerical algorithms for solving the synthesis problem without considering the analytical form of the mathematical model of the control object. Here we apply variation Cartesian genetic programming, which has not been used before for the synthesis problem. In the problem considered, a control system is sought that makes a mobile robot stable at some point of the state space: the robot moves to this stable point from any point of the state space. Unlike genetic programming, Cartesian genetic programming does not change the length of the code after a crossover operation. The code of a mathematical expression in Cartesian genetic programming is an integer matrix.
Every column of this matrix is a call of a function from the basic set of elementary functions. Variation Cartesian genetic programming uses the code of one basic solution, and the other possible solutions are coded by sets of variation vectors. In this work we present the formal statement of the control synthesis problem and then describe Cartesian genetic programming as a method of symbolic regression, together with its variation modification. Functions found by symbolic regression methods must never produce an infinite value, either during the search or when they are used in a control object; therefore we introduce a new mathematical space, called the space of machine-made functions, and present some results of theoretical studies of this space. In the applied part of the work the control synthesis problem for a mobile robot is solved by variation Cartesian genetic programming.
2 The Problem of Control System Synthesis
Consider the statement of the control system synthesis problem. A mathematical model of the control object is given

$$\dot{x} = f(x, u), \tag{1}$$

where $x$ is a state vector, $u$ is a control vector, $x = [x_1 \ldots x_n]^T$, $u = [u_1 \ldots u_m]^T$, $m \le n$, $f(x, u) = [f_1(x, u) \ldots f_n(x, u)]^T$. Bounds are set on the values of the control components

$$u_i^- \le u_i \le u_i^+, \quad i = 1, \ldots, m, \tag{2}$$

where $u_i^-$ and $u_i^+$ are the lower and upper bounds of control component $i$, $i = 1, \ldots, m$. The set of initial conditions is given

$$X_0 = \{x^{0,1}, \ldots, x^{0,K}\}. \tag{3}$$

A terminal position is given

$$x^f = [x_1^f \ldots x_n^f]^T. \tag{4}$$

It is necessary to find a control in the form of a function of the state vector

$$u = h(x^f - x). \tag{5}$$

A quality criterion of control is given

$$J = \max\{t_{f,1}, \ldots, t_{f,K}\} + a_1 \sum_{i=1}^{K} \left\| x^f - x(t_{f,i}, x^{0,i}) \right\| \to \min, \tag{6}$$

where $a_1$ is a weight coefficient, $t_{f,i}$ is the time of reaching the terminal position (4) from the initial condition $x^{0,i}$ of the set of initial conditions (3), $i \in \{1, \ldots, K\}$,

$$t_{f,i} = \begin{cases} t, & \text{if } t < t^+ \text{ and } \left\| x(t, x^{0,i}) - x^f \right\| \le \varepsilon, \\ t^+, & \text{otherwise}, \end{cases} \tag{7}$$

$t^+$ and $\varepsilon$ are given positive values, $x(t, x^{0,i})$ is a solution of the system

$$\dot{x} = f(x, h(x^f - x)) \tag{8}$$

for the initial conditions $x(0) = x^{0,i}$, $i \in \{1, \ldots, K\}$, and

$$\left\| x^f - x \right\| = \sqrt{\sum_{i=1}^{n} (x_i^f - x_i)^2}. \tag{9}$$
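As a rough illustration of how criterion (6)–(9) might be evaluated numerically, the sketch below integrates the closed-loop system (8) with explicit Euler steps and accumulates the two terms of (6). The function names, the step size `dt` and the use of the Euler method are assumptions made for this example only; the paper does not specify an integration scheme, and the control bounds (2) are omitted for brevity.

```python
import math

def norm(a, b):
    """Euclidean distance (9) between two state vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def terminal_time(f, h, x0, xf, t_plus, eps, dt=0.01):
    """Return (t_f, x(t_f)) following (7): the first time t < t_plus at which
    the state is within eps of xf, or t_plus if that never happens."""
    x, t = list(x0), 0.0
    while t < t_plus:
        if norm(x, xf) <= eps:
            return t, x
        u = h([xfi - xi for xfi, xi in zip(xf, x)])   # feedback control (5)
        dx = f(x, u)                                   # right-hand side of (8)
        x = [xi + dt * dxi for xi, dxi in zip(x, dx)]  # explicit Euler step (assumed)
        t += dt
    return t_plus, x

def criterion(f, h, X0, xf, t_plus, eps, a1=1.0, dt=0.01):
    """Criterion (6): worst terminal time plus weighted sum of terminal misses."""
    results = [terminal_time(f, h, x0, xf, t_plus, eps, dt) for x0 in X0]
    worst_time = max(t for t, _ in results)
    total_miss = sum(norm(xf, x) for _, x in results)
    return worst_time + a1 * total_miss
```

A candidate control function `h` that drives every initial condition close to the terminal position quickly gets a small value of `criterion`, while one that never arrives is charged the full time limit `t_plus` plus its terminal miss.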
3 Cartesian Genetic Programming
To solve the synthesis problem (1)–(9), Cartesian genetic programming (CGP) is applied. CGP codes a mathematical expression in the form of a set of integer vectors

$$G = (g^1, \ldots, g^M), \tag{10}$$

$$g^i = [g_1^i \ldots g_R^i]^T, \tag{11}$$

where $g_1^i$ is the number of a function and $g_j^i$ is the number of an argument, $j = 2, \ldots, R$. To code a mathematical expression it is necessary to determine the basic set of elementary functions and the set of arguments of the mathematical expression. Let the basic set include $k_1$ functions with one argument, $k_2$ functions with two arguments, and $k_3$ functions with three arguments. Then the basic set of elementary functions is

$$F = (f_1(z), \ldots, f_{k_1}(z), f_{k_1+1}(z_1, z_2), \ldots, f_{k_1+k_2}(z_1, z_2), f_{k_1+k_2+1}(z_1, z_2, z_3), \ldots, f_{k_1+k_2+k_3}(z_1, z_2, z_3)). \tag{12}$$

The set of arguments is

$$F_0 = (x_1, \ldots, x_n, q_1, \ldots, q_p), \tag{13}$$

where $x_i$ is a variable, $i = 1, \ldots, n$, and $q_j$ is a constant parameter of the mathematical expression, $j = 1, \ldots, p$. Sometimes the constants 0 and 1 are also included in the set $F_0$.
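If only functions with at most three arguments are used to code a mathematical expression, then $R = 4$, and the components of vector (11) satisfy

$$g_1^i \in \{1, \ldots, k_1 + k_2 + k_3\}, \quad g_j^i \in \{1, \ldots, n + p + i - 1\}, \quad j = 2, \ldots, R, \tag{14}$$

so that an argument number $j \le n + p$ refers to element $j$ of the set of arguments (13), and a number $j > n + p$ refers to the result of the earlier vector $g^{j-n-p}$. To calculate a mathematical expression from a code of Cartesian genetic programming (10), the vector of results is defined

$$y = [y_1 \ldots y_M]^T, \tag{15}$$

where

$$y_i = \begin{cases} f_{g_1^i}(g_2^i), & \text{if } g_1^i \le k_1, \\ f_{g_1^i}(g_2^i, g_3^i), & \text{if } k_1 < g_1^i \le k_1 + k_2, \\ f_{g_1^i}(g_2^i, g_3^i, g_4^i), & \text{if } k_1 + k_2 < g_1^i \le k_1 + k_2 + k_3, \end{cases} \tag{16}$$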
where $i = 1, \ldots, M$. Let us consider an example of coding the following mathematical expression

$$y_1 = \exp(q_1 x_1)(\sin(q_2 x_2) + \cos(q_3 x_3)). \tag{17}$$

For this example the basic sets are

$$F = (f_1(z) = z,\ f_2(z) = -z,\ f_3(z) = \cos(z),\ f_4(z) = \sin(z),\ f_5(z) = \exp(z),\ f_6(z_1, z_2) = z_1 + z_2,\ f_7(z_1, z_2) = z_1 z_2), \tag{18}$$

$$F_0 = (x_1, x_2, x_3, q_1, q_2, q_3). \tag{19}$$

To code the expression $q_1 x_1$, the multiplication function is found in the set of functions (18). It is function number 7, $f_7(z_1, z_2) = z_1 z_2$. Then the numbers of the elements in the set of arguments (19) are found. The parameter $q_1$ is element number 4, and the variable $x_1$ is element number 1. As a result, the code of $q_1 x_1$ is $g^1 = [7\ 4\ 1\ 2]^T$, where the fourth component is not used by a function of two arguments. The code of the mathematical expression (17), written as a matrix whose columns are the vectors $g^i$, has the following form:

$$G_1 = \begin{pmatrix} 7 & 7 & 7 & 5 & 4 & 3 & 6 & 7 \\ 4 & 5 & 6 & 7 & 8 & 9 & 11 & 10 \\ 1 & 2 & 3 & 5 & 1 & 3 & 12 & 13 \\ 2 & 3 & 4 & 6 & 2 & 4 & 5 & 6 \end{pmatrix}. \tag{20}$$

To search for the optimal mathematical expression, a genetic algorithm is used. Together with the structure of the mathematical expression, optimal values of the parameters are sought. The parameters are also arguments of the mathematical expression. Let us consider the steps of the algorithm. First, the set of coded possible solutions is generated randomly

$$S = \{G_1, \ldots, G_H\}, \tag{21}$$
where $G_i = (g^{i,1}, \ldots, g^{i,M})$, $i = 1, \ldots, H$. For each structure of a mathematical expression a vector of parameters is also generated randomly

$$q_i^j = \xi (q_i^+ - q_i^-) + q_i^-, \quad i = 1, \ldots, p, \quad j = 1, \ldots, H, \tag{22}$$

where $\xi$ is a random number from the interval $[0; 1]$, $q_i^-$, $q_i^+$ are the lower and upper bounds of the parameters, and $p$ is the dimension of the vector of parameters. Each possible solution is estimated by a goal function

$$C = \{c_1 = J(G_1, q^1), \ldots, c_H = J(G_H, q^H)\}, \tag{23}$$

where $J(G_i, q^i)$ is the goal function and $q^i$ is a vector of parameters, $i \in \{1, \ldots, H\}$. The best solution $G_{i^-}$ is found

$$c_{i^-} = \min\{c_1, \ldots, c_H\}. \tag{24}$$

For the crossover operation two possible solutions $(G_\alpha, q^\alpha)$, $(G_\beta, q^\beta)$ are selected randomly, $\alpha, \beta \in \{1, \ldots, H\}$. The probability of performing a crossover operation is calculated

$$P_c = \max\left\{ \frac{c_{i^-}}{c_\alpha}, \frac{c_{i^-}}{c_\beta} \right\}. \tag{25}$$

A random number $\xi$ from 0 to 1 is generated. If it is less than $P_c$, the crossover is performed. Two crossover points are found randomly

$$r_1 \in \{1, \ldots, M\}, \quad r_2 \in \{1, \ldots, p\}. \tag{26}$$

One point is for the structural part and the other is for the parametric part. After the crossover operation, four new possible solutions are obtained

$$\begin{aligned} G_{H+1} &= (g^{\alpha,1}, \ldots, g^{\alpha,r_1}, g^{\beta,r_1+1}, \ldots, g^{\beta,M}), & q^{H+1} &= [q_1^\alpha, \ldots, q_{r_2}^\alpha, q_{r_2+1}^\beta \ldots q_p^\beta]^T, \\ G_{H+2} &= (g^{\beta,1}, \ldots, g^{\beta,r_1}, g^{\alpha,r_1+1}, \ldots, g^{\alpha,M}), & q^{H+2} &= [q_1^\beta, \ldots, q_{r_2}^\beta, q_{r_2+1}^\alpha \ldots q_p^\alpha]^T, \\ G_{H+3} &= G_\alpha, & q^{H+3} &= [q_1^\alpha, \ldots, q_{r_2}^\alpha, q_{r_2+1}^\beta \ldots q_p^\beta]^T, \\ G_{H+4} &= G_\beta, & q^{H+4} &= [q_1^\beta, \ldots, q_{r_2}^\beta, q_{r_2+1}^\alpha \ldots q_p^\alpha]^T. \end{aligned} \tag{27}$$

Two offspring are obtained by crossing both the structural and the parametric parts; the other two keep the structural parts of their parents, and only their parametric parts are crossed. After that a mutation operation is performed with a given probability $P_\mu$. A random number $\xi$ is generated from the interval $[0; 1]$, and if it is less than $P_\mu$, then the mutation is performed. Mutation points are found for the structural and parametric parts

$$\mu_1 \in \{1, \ldots, M\}, \quad \mu_2 \in \{1, \ldots, p\}. \tag{28}$$

New values are generated for the new solution at the points $\mu_1$ and $\mu_2$,

$$g_1^{H+1,\mu_1} \in \{1, \ldots, k_1 + k_2 + k_3\}, \quad g_i^{H+1,\mu_1} \in \{1, \ldots, |F_0| + \mu_1 - 1\},\ i = 2, \ldots, 4, \quad q_{\mu_2}^{H+1} = \xi (q_{\mu_2}^+ - q_{\mu_2}^-) + q_{\mu_2}^-. \tag{29}$$

After that the first new possible solution is estimated by the given criterion

$$f_{H+1} = J(G_{H+1}, q^{H+1}). \tag{30}$$

Then the worst solution in the population is found

$$f_{j^+} = \max\{f_1, \ldots, f_H\}, \tag{31}$$

where $f_j = c_j$ are the goal function values (23). If the first new solution is better than the worst solution in the population,

$$f_{H+1} < f_{j^+}, \tag{32}$$

then the first new solution replaces the worst solution in the population

$$G_{j^+} \leftarrow G_{H+1}, \quad q^{j^+} \leftarrow q^{H+1}. \tag{33}$$

Steps (30)–(33) are repeated for the other new possible solutions $(G_{H+2}, q^{H+2})$, $(G_{H+3}, q^{H+3})$, $(G_{H+4}, q^{H+4})$. Let us consider an example of the crossover operation for the structural part of a possible solution. Let the first selected parent be (20). Let the second selected parent be

$$G_2 = \begin{pmatrix} 7 & 7 & 7 & 5 & 7 & 4 & 7 & 3 \\ 1 & 2 & 3 & 9 & 8 & 11 & 12 & 13 \\ 4 & 5 & 6 & 2 & 10 & 2 & 7 & 4 \\ 2 & 1 & 2 & 3 & 1 & 3 & 3 & 2 \end{pmatrix}. \tag{34}$$

The code of the second parent corresponds to the mathematical expression

$$y_2 = \cos(q_1 x_1 \sin(q_2 x_2 \exp(q_3 x_3))). \tag{35}$$
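The evaluation rule (15)–(16) can be checked on the two parent codes above with a small interpreter. The sketch below is only an illustration: the names `FUNCS`, `evaluate`, `G1` and `G2` are chosen for this example, columns are stored as Python lists `[function, arg1, arg2, arg3]`, and it is assumed that the value of the last column is the value of the coded expression.

```python
import math

# Basic function set (18): number -> (number of arguments, callable).
FUNCS = {
    1: (1, lambda z: z),
    2: (1, lambda z: -z),
    3: (1, math.cos),
    4: (1, math.sin),
    5: (1, math.exp),
    6: (2, lambda a, b: a + b),
    7: (2, lambda a, b: a * b),
}

def evaluate(code, args):
    """Evaluate a CGP code given as a list of columns.

    `values` starts as the argument set F0 and grows by one result per
    column, so argument number j (1-based) refers to values[j - 1].
    """
    values = list(args)
    for func, *arg_numbers in code:
        arity, fn = FUNCS[func]
        # Only the first `arity` argument numbers are used; the rest are ignored.
        values.append(fn(*(values[j - 1] for j in arg_numbers[:arity])))
    return values[-1]  # assumption: the last column is the output

# Columns of codes (20) and (34).
G1 = [[7, 4, 1, 2], [7, 5, 2, 3], [7, 6, 3, 4], [5, 7, 5, 6],
      [4, 8, 1, 2], [3, 9, 3, 4], [6, 11, 12, 5], [7, 10, 13, 6]]
G2 = [[7, 1, 4, 2], [7, 2, 5, 1], [7, 3, 6, 2], [5, 9, 2, 3],
      [7, 8, 10, 1], [4, 11, 2, 3], [7, 12, 7, 3], [3, 13, 4, 2]]
```

Calling `evaluate(G1, (x1, x2, x3, q1, q2, q3))` reproduces expression (17), and the same call with `G2` reproduces expression (35). Because unused argument numbers are simply skipped, any admissible integer can sit in those positions without changing the result.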
Assume that the crossover point is $r = 5$. Then two new codes are obtained

$$G_3 = \begin{pmatrix} 7 & 7 & 7 & 5 & 4 & 4 & 7 & 3 \\ 4 & 5 & 6 & 7 & 8 & 11 & 12 & 13 \\ 1 & 2 & 3 & 5 & 1 & 2 & 7 & 4 \\ 2 & 3 & 4 & 6 & 2 & 3 & 3 & 2 \end{pmatrix}, \tag{36}$$

$$G_4 = \begin{pmatrix} 7 & 7 & 7 & 5 & 7 & 3 & 6 & 7 \\ 1 & 2 & 3 & 9 & 8 & 9 & 11 & 10 \\ 4 & 5 & 6 & 2 & 10 & 3 & 12 & 13 \\ 2 & 1 & 2 & 3 & 1 & 4 & 5 & 6 \end{pmatrix}. \tag{37}$$

These new codes correspond to the following mathematical expressions

$$y_3 = \cos(q_1 x_1 \sin(\sin(q_2 x_2))), \tag{38}$$

$$y_4 = \exp(q_3 x_3)(q_2 x_2 \exp(q_3 x_3) + \cos(\exp(q_3 x_3))). \tag{39}$$
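The crossover step (25)–(27) can be sketched in a few lines. The function names and the list-of-columns representation are assumptions of this illustration; `crossover_probability` additionally assumes strictly positive goal values, which the paper does not state explicitly.

```python
def crossover_probability(c_best, c_alpha, c_beta):
    """Probability (25) of crossing two selected parents (goal values assumed > 0)."""
    return max(c_best / c_alpha, c_best / c_beta)

def crossover(G_a, q_a, G_b, q_b, r1, r2):
    """Return the four offspring of (27) as (code, parameters) pairs.

    r1 is the crossover point of the structural part (columns),
    r2 is the crossover point of the parametric part.
    """
    copy = lambda code: [list(col) for col in code]  # avoid sharing columns
    G_ab = copy(G_a[:r1] + G_b[r1:])
    G_ba = copy(G_b[:r1] + G_a[r1:])
    q_ab = q_a[:r2] + q_b[r2:]
    q_ba = q_b[:r2] + q_a[r2:]
    return [(G_ab, q_ab), (G_ba, q_ba), (copy(G_a), list(q_ab)), (copy(G_b), list(q_ba))]

# Parents (20) and (34), stored column by column.
G1 = [[7, 4, 1, 2], [7, 5, 2, 3], [7, 6, 3, 4], [5, 7, 5, 6],
      [4, 8, 1, 2], [3, 9, 3, 4], [6, 11, 12, 5], [7, 10, 13, 6]]
G2 = [[7, 1, 4, 2], [7, 2, 5, 1], [7, 3, 6, 2], [5, 9, 2, 3],
      [7, 8, 10, 1], [4, 11, 2, 3], [7, 12, 7, 3], [3, 13, 4, 2]]

# Parameter vectors here are placeholders, not values from the paper.
offspring = crossover(G1, [1.0, 2.0, 3.0], G2, [4.0, 5.0, 6.0], r1=5, r2=1)
```

With `r1 = 5` the first two offspring codes are the codes (36) and (37). The code length never changes, which is the property that distinguishes this crossover from the subtree exchange of ordinary genetic programming.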
4 Variation Cartesian Genetic Programming
In tasks of searching for complex mathematical expressions, it is advisable to limit the search so that the optimal solution is sought in the neighborhood of some given basic solution. This possibility is provided by the principle of small variations of the basic solution. It can be used in any task where optimal solutions are sought in a non-numeric space in which it is hard to calculate the distance between two possible solutions. According to this principle, it is necessary to define a small variation of a code of Cartesian genetic programming. Let a small variation be a change of one element of the code. Then to record a small variation it is enough to use an integer vector with three components

$$w = [w_1\ w_2\ w_3]^T, \tag{40}$$

where $w_1$ is the number of a column in the code, $w_2$ is the number of a row in column $w_1$, and $w_3$ is the new value of the element $g_{w_2}^{w_1}$. Let us consider an example. Let the vector of variations be

$$w = [4\ 1\ 4]^T. \tag{41}$$

Let us apply this small variation to the code (37). The following code is obtained

$$w \circ G_4 = \begin{pmatrix} 7 & 7 & 7 & \mathbf{(4)} & 7 & 3 & 6 & 7 \\ 1 & 2 & 3 & 9 & 8 & 9 & 11 & 10 \\ 4 & 5 & 6 & 2 & 10 & 3 & 12 & 13 \\ 2 & 1 & 2 & 3 & 1 & 4 & 5 & 6 \end{pmatrix}. \tag{42}$$

Here, the new element value is enclosed in parentheses and shown in bold. The new code corresponds to the following mathematical expression:

$$y_5 = \exp(q_3 x_3)(q_2 x_2 \exp(q_3 x_3) + \cos(\cos(q_3 x_3))). \tag{43}$$
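Applying a small variation amounts to overwriting one matrix element. The following sketch shows this on the code (37); the helper names are hypothetical, and `is_admissible` simply restates the index ranges of (14) for a code stored as a list of columns.

```python
def apply_variation(code, w):
    """Return a copy of `code` with the element in column w1, row w2 set to w3.

    Columns and rows are numbered from 1, as in (40).
    """
    column, row, value = w
    new_code = [list(col) for col in code]
    new_code[column - 1][row - 1] = value
    return new_code

def is_admissible(w, n_arguments, n_functions):
    """Check the new value against the ranges of (14).

    Row 1 holds a function number; rows 2..4 hold argument numbers, which in
    column w1 may refer to F0 or to the results of the w1 - 1 earlier columns.
    """
    column, row, value = w
    upper = n_functions if row == 1 else n_arguments + column - 1
    return 1 <= value <= upper

# Code (37), stored column by column.
G4 = [[7, 1, 4, 2], [7, 2, 5, 1], [7, 3, 6, 2], [5, 9, 2, 3],
      [7, 8, 10, 1], [3, 9, 3, 4], [6, 11, 12, 5], [7, 10, 13, 6]]

varied = apply_variation(G4, [4, 1, 4])  # variation vector (41)
```

Only the function number of the fourth column changes; every other element of the code, and the original list `G4` itself, is left untouched.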
To search for the optimal solution in the space of small variations of the basic possible solution, one basic solution is set. This basic solution can be written by a specialist who knows the problem well and can write down an approximately good solution. Note that engineers and developers of control systems almost always know approximately the structure of the needed control system, or know which structures will not work. The set of possible solutions is given in the form of ordered sets of variation vectors

$$W = \{W_1, \ldots, W_H\}, \tag{44}$$

where

$$W_i = (w^{i,1}, \ldots, w^{i,d}), \tag{45}$$

and $d$ is a given number of small variations of the basic solution used to obtain one possible solution. The elements of the variation vectors are generated randomly

$$w_1^i \in \{1, \ldots, M\}, \quad w_2^i \in \{1, \ldots, 4\}, \quad w_3^i \in \begin{cases} \{1, \ldots, k_1 + k_2 + k_3\}, & \text{if } w_2^i = 1, \\ \{1, \ldots, |F_0| + w_1^i - 1\}, & \text{otherwise}, \end{cases} \tag{46}$$

where $i = 1, \ldots, H$. Each possible solution $G_i$ is obtained by small variations of the basic solution

$$G_i = w^{i,d} \circ \ldots \circ w^{i,1} \circ G_0, \tag{47}$$

where $G_0$ is the code of the basic solution. Crossover and mutation operations are performed on the sets of variation vectors. For a crossover operation two possible solutions are selected from the set (44)

$$W_\alpha = (w^{\alpha,1}, \ldots, w^{\alpha,d}), \quad W_\beta = (w^{\beta,1}, \ldots, w^{\beta,d}), \tag{48}$$

where $\alpha, \beta \in \{1, \ldots, H\}$. A crossover point is defined randomly

$$r_1 \in \{1, \ldots, d\}. \tag{49}$$
Two new possible solutions are obtained by exchanging the tails of the selected parents

$$W_{H+1} = (w^{\alpha,1}, \ldots, w^{\alpha,r_1}, w^{\beta,r_1+1}, \ldots, w^{\beta,d}), \quad W_{H+2} = (w^{\beta,1}, \ldots, w^{\beta,r_1}, w^{\alpha,r_1+1}, \ldots, w^{\alpha,d}). \tag{50}$$

Let us consider an example of the crossover operation. Let the code (37) be the code of the basic solution. Assume that the following two possible solutions, written as matrices whose columns are variation vectors, were selected as parents

$$W_\alpha = \begin{pmatrix} 1 & 4 & 6 & 8 \\ 1 & 2 & 1 & 2 \\ 6 & 1 & 5 & 4 \end{pmatrix}, \quad W_\beta = \begin{pmatrix} 2 & 7 & 3 & 4 \\ 1 & 1 & 1 & 1 \\ 5 & 4 & 5 & 3 \end{pmatrix}. \tag{51}$$

These sets of variation vectors correspond to the following mathematical expressions:

$$y_\alpha = q_1 (x_2 q_2 \exp(x_1) + \exp(x_3 q_3)), \quad y_\beta = \cos(\exp(x_3)) \sin(\cos(\exp(x_3))). \tag{52}$$

Let the crossover point be $r = 2$. Two new possible solutions are obtained:

$$W_{H+1} = \begin{pmatrix} 1 & 4 & 3 & 4 \\ 1 & 2 & 1 & 1 \\ 6 & 1 & 5 & 3 \end{pmatrix}, \quad W_{H+2} = \begin{pmatrix} 2 & 7 & 6 & 8 \\ 1 & 1 & 1 & 2 \\ 5 & 4 & 5 & 4 \end{pmatrix}. \tag{53}$$

As a result, the following two mathematical expressions are obtained:

$$y_{H+1} = \cos(x_1)(x_2 q_2 \cos(x_1) + \cos(\exp(x_3))), \quad y_{H+2} = q_1 \sin(\exp(x_2) \exp(x_3 q_3)). \tag{54}$$

The mutation operation is performed by generating a new variation vector at the mutation point.
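The operations of this section act on lists of variation vectors rather than on codes. A minimal sketch, with hypothetical function names and codes stored as lists of columns, is:

```python
def build_code(G0, W):
    """Apply the variation vectors of W to the basic code G0 as in (47).

    The first vector of W is applied first; each vector [w1, w2, w3] sets
    the element in column w1, row w2 (both 1-based) to w3.
    """
    code = [list(col) for col in G0]
    for column, row, value in W:
        code[column - 1][row - 1] = value
    return code

def crossover_variations(W_a, W_b, r):
    """Exchange the tails of two ordered sets of variation vectors, as in (50)."""
    return W_a[:r] + W_b[r:], W_b[:r] + W_a[r:]

# Parents (51), one variation vector per list entry.
W_alpha = [[1, 1, 6], [4, 2, 1], [6, 1, 5], [8, 2, 4]]
W_beta = [[2, 1, 5], [7, 1, 4], [3, 1, 5], [4, 1, 3]]

child_1, child_2 = crossover_variations(W_alpha, W_beta, r=2)
```

For the parents (51) and `r = 2` the two children are the sets (53). A possible solution is turned into an evaluable code only when needed, by calling `build_code` with the basic solution, so the population stores `d` small vectors per individual instead of a full matrix.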
5 Space of Machine-Made Functions
When the control system synthesis problem is solved, it must be assumed that the synthesized function will be implemented on the on-board processor of a robot or of another automatic device. This means that the synthesized functions can be realized by a computer and that their absolute values never reach infinity. Let us consider such a space of functions. This space is a subspace of the real space

$$R_\#^n \subseteq R^n, \tag{55}$$

where $R_\#^n$ is the machine-made space. The space $R_\#^n$ possesses the following properties. For any vector $x = [x_1 \ldots x_n]^T \in R_\#^n$ of dimension $n$ the following conditions are satisfied:

1) the components are bounded,
$$|x_i| \le B^+ < \infty, \quad i = 1, \ldots, n; \tag{56}$$

2) there exists a small positive value $\delta^- > 0$ such that
$$\text{if } |x_i| < \delta^-, \text{ then } x_i = 0, \quad i = 1, \ldots, n; \tag{57}$$

3)
$$\text{if } x(t) \in R_\#^n, \text{ then } \dot{x}(t) \in R_\#^n; \tag{58}$$

4) there exists a value of satisfactory accuracy $\tilde{\Delta} > \delta^-$ such that
$$\text{if } |\alpha| < \tilde{\Delta}, \text{ then } x_i \pm \alpha = x_i, \quad i = 1, \ldots, n. \tag{59}$$
Usually, in problems with differential equations, the value of satisfactory accuracy is half of the integration step. The derivative of a function in the machine-made space $R_\#$ is calculated by the relation

$$\frac{\partial f(z)}{\partial z} = \frac{f(z + \delta^-) - f(z)}{\delta^-}. \tag{60}$$

Consider an example.

\begin{align}
\frac{\partial \sin(z)}{\partial z} &= \frac{\sin(z + \delta^-) - \sin(z)}{\delta^-} \tag{61} \\
&= \frac{\sin(z)\cos(\delta^-) + \sin(\delta^-)\cos(z) - \sin(z)}{\delta^-} \tag{62} \\
&= \frac{\sin(z) + \delta^- \cos(z) - \sin(z)}{\delta^-} = \cos(z). \tag{63}
\end{align}

Here the following equation is used

$$\cos(\delta^-) = 1 - 0.5(\delta^-)^2 = 1, \tag{64}$$

according to Eq. (59); in the same way $\sin(\delta^-)$ is replaced by $\delta^-$. In the introduced space $R_\#$, the machine-made functions are written as the usual functions of mathematical analysis, but with the condition that their values are never equal to infinity. For example,

$$z^{-1} = \begin{cases} 1/z, & \text{if } |z| > \delta^-, \\ \operatorname{sgn}(z) B^+, & \text{otherwise}. \end{cases} \tag{65}$$

If it is necessary to emphasize in the notation that a function is machine-made, a special subscript can be used, for example $\sin_\#(z)$, $\exp_\#(z)$, etc.

Theorem 1. Any machine-made function can be represented by a Taylor series with a finite number of terms.

Proof. Assume that $f(z)$ is a machine-made function. Then for a point $z = a$ the Taylor series has the following form:

$$\sum_{k=0}^{L} \frac{f^{(k)}(a)}{k!} (z - a)^k = f(a) + f'(a)(z - a) + \frac{f''(a)}{2!}(z - a)^2 + \ldots + \frac{f^{(L)}(a)}{L!}(z - a)^L.$$

The value of each derivative is bounded, $|f^{(k)}(a)| \le B^+$, while the denominator $k!$ grows. For some term of the Taylor series the inequality

$$\frac{B^+}{k!} < \delta^-$$

holds. According to property (57), all subsequent terms of the series are zero.
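The definitions of this section are easy to try out numerically. In the sketch below the constants `B_PLUS` and `DELTA` are arbitrary illustrative choices for $B^+$ and $\delta^-$ (the paper gives no numerical values), and the function names are hypothetical.

```python
import math

B_PLUS = 1e8   # assumed bound B+ on absolute values, property (56)
DELTA = 1e-8   # assumed threshold delta-, property (57)

def sgn(z):
    """Sign of z: -1, 0 or 1."""
    return (z > 0) - (z < 0)

def clip_mm(value):
    """Force a computed value to satisfy properties (56) and (57)."""
    if abs(value) < DELTA:
        return 0.0
    return max(-B_PLUS, min(B_PLUS, value))

def inv_mm(z):
    """Machine-made inverse (65): never returns infinity."""
    return 1.0 / z if abs(z) > DELTA else sgn(z) * B_PLUS

def diff_mm(f, z):
    """Derivative (60) as a forward difference with step delta-."""
    return (f(z + DELTA) - f(z)) / DELTA

def taylor_cutoff(bound=B_PLUS, delta=DELTA):
    """Smallest k with bound / k! < delta, the index used in the proof of Theorem 1."""
    k = 0
    while bound / math.factorial(k) >= delta:
        k += 1
    return k
```

With these particular constants `taylor_cutoff()` returns 19, since $10^8 / 18! \approx 1.6 \cdot 10^{-8}$ is still above the threshold while $10^8 / 19! \approx 8.2 \cdot 10^{-10}$ is below it. The cut-off index depends only on the ratio $B^+ / \delta^-$, so a different choice of constants shifts it but keeps it finite.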
6 Synthesis of Control System of Mobile Robot
Let us consider the synthesis problem for a mobile robot with two tracks. The mathematical model of the robot is [16]

$$\dot{x}_1 = 0.5(u_1 + u_2)\cos(x_3), \quad \dot{x}_2 = 0.5(u_1 + u_2)\sin(x_3), \quad \dot{x}_3 = 0.5(u_1 - u_2). \tag{66}$$

For the model (66) the set of initial conditions is given

$$X_0 = \{x^{0,k} = [x_1^{0,k}(i_1)\ x_2^{0,k}(i_2)\ x_3^{0,k}(i_3)]^T : k = i_1 + (i_2 - 1) i_1^+ + (i_3 - 1) i_1^+ i_2^+\}, \tag{67}$$

where

$$x_1^{0,k}(i_1) = x_1^+ - (i_1 - 1)\delta_1, \quad x_2^{0,k}(i_2) = x_2^+ - (i_2 - 1)\delta_2, \quad x_3^{0,k}(i_3) = x_3^+ - (i_3 - 1)\delta_3, \tag{68}$$

$i_1 = 1, \ldots, i_1^+$, $i_2 = 1, \ldots, i_2^+$, $i_3 = 1, \ldots, i_3^+$, $x_1^+ = 2$, $x_2^+ = 2$, $x_3^+ = \pi/4$, $\delta_1 = 1$, $\delta_2 = 1$, $\delta_3 = \pi/4$, $i_1^+ = 5$, $i_2^+ = 5$, $i_3^+ = 3$. In the synthesis problem $5 \cdot 5 \cdot 3 = 75$ initial conditions are considered. The terminal position is given

$$x^f = [0\ 0\ 0]^T. \tag{69}$$

The quality criterion of control is given

$$J_e = t_f + \frac{1}{72} \sum_{i=1}^{72} \sqrt{\sum_{j=1}^{3} (x_j(t_{f,i}, x^{0,i}) - x_j^f)^2} \to \min, \tag{70}$$

where

$$t_f = \max\{t_{f,i} : i = 1, \ldots, 72\}, \tag{71}$$

and $t_{f,i}$ is determined by Eqs. (7)–(9). To solve the synthesis problem, variation Cartesian genetic programming is used. As the basic solution for the problem the following control function was selected

$$u_1 = q_1(x_1^f - x_1) + q_2(x_2^f - x_2) + q_3(x_3^f - x_3) + q_4 x_3, \quad u_2 = q_1(x_1^f - x_1) + q_2(x_2^f - x_2) + q_3(x_3^f - x_3) + q_4 x_3, \tag{72}$$

where $q = [q_1\ q_2\ q_3\ q_4]^T$ is a vector of parameters. The set of arguments has the following form

$$F_0 = \{x_1^f - x_1,\ x_2^f - x_2,\ x_3^f - x_3,\ x_3,\ q_1,\ q_2,\ q_3,\ q_4,\ 0,\ 1\}. \tag{73}$$
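The model (66) and the grid of initial conditions (67)–(68) translate directly into code. The sketch below uses hypothetical function names and an explicit Euler step, which is an assumption of this example rather than the integration scheme of the paper.

```python
import math

def robot_rhs(x, u):
    """Right-hand side of the two-track robot model (66)."""
    v = 0.5 * (u[0] + u[1])  # forward speed
    return [v * math.cos(x[2]), v * math.sin(x[2]), 0.5 * (u[0] - u[1])]

def euler_step(x, u, dt):
    """One explicit Euler step of (66) (assumed integration scheme)."""
    return [xi + dt * dxi for xi, dxi in zip(x, robot_rhs(x, u))]

def initial_conditions(x_plus=(2.0, 2.0, math.pi / 4),
                       delta=(1.0, 1.0, math.pi / 4),
                       counts=(5, 5, 3)):
    """Grid (67)-(68) as a dict that maps the index k of (67) to the state x^{0,k}."""
    X0 = {}
    for i3 in range(1, counts[2] + 1):
        for i2 in range(1, counts[1] + 1):
            for i1 in range(1, counts[0] + 1):
                k = i1 + (i2 - 1) * counts[0] + (i3 - 1) * counts[0] * counts[1]
                X0[k] = [x_plus[0] - (i1 - 1) * delta[0],
                         x_plus[1] - (i2 - 1) * delta[1],
                         x_plus[2] - (i3 - 1) * delta[2]]
    return X0
```

With the default values the grid contains 75 states: $x_1$ and $x_2$ each take the values 2, 1, 0, −1, −2, and $x_3$ takes the values $\pi/4$, 0, $-\pi/4$. Keying the dictionary by $k$ makes it easy to look up a state by the index used in the text, for example `initial_conditions()[21]` for $x^{0,21}$.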
The code of Cartesian genetic programming has length $M = 24$. The code of the basic solution, written as a matrix whose columns are the vectors $g^i$, has the following form:

$$G(u_i) = \left( \begin{array}{*{24}{r}} 30 & 30 & 30 & 30 & 29 & 29 & 1 & 29 & 29 & 1 & 29 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 & 11 & 15 & 16 & 17 & 18 & 19 & 20 & 21 & 22 & 23 & 20 & 21 & 22 & 23 & 24 & 25 & 26 & 27 & 28 & 29 \\ 5 & 6 & 7 & 8 & 12 & 13 & 14 & 4 & 1 & 9 & 9 & 7 & 9 & 11 & 13 & 15 & 17 & 18 & 20 & 21 & 23 & 21 & 23 & 25 \\ 1 & 2 & 3 & 4 & 5 & 1 & 4 & 5 & 6 & 8 & 3 & 5 & 18 & 19 & 10 & 12 & 14 & 16 & 21 & 22 & 24 & 22 & 24 & 26 \end{array} \right). \tag{74}$$

Random admissible argument numbers are placed in the unused positions of the code of the basic solution. In the search forty computer-realized functions were used: $k_1 = 28$ functions with one argument, $k_2 = 8$ functions with two arguments, and $k_3 = 4$ functions with three arguments.

$$\begin{aligned} &f_1(z) = z, \quad f_2(z) = z^2, \quad f_3(z) = -z, \quad f_4(z) = \operatorname{sgn}(z)\sqrt{|z|}, \quad f_5(z) = z^{-1}, \\ &f_6(z) = \exp(z), \quad f_7(z) = \log(|z|), \quad f_8(z) = \tanh(0.5z), \\ &f_9(z) = \begin{cases} 1, & \text{if } z \ge 0, \\ 0, & \text{otherwise}, \end{cases} \quad f_{10}(z) = \operatorname{sgn}(z), \quad f_{11}(z) = \cos(z), \\ &f_{12}(z) = \sin(z), \quad f_{13}(z) = \arctan(z), \quad f_{14}(z) = z^3, \end{aligned} \tag{75}$$

$$\begin{aligned} &f_{15}(z) = \sqrt[3]{z}, \quad f_{16}(z) = \begin{cases} z, & \text{if } |z| < 1, \\ \operatorname{sgn}(z), & \text{otherwise}, \end{cases} \quad f_{17}(z) = \operatorname{sgn}(z)\log(|z| + 1), \\ &f_{18}(z) = \operatorname{sgn}(z)(\exp(|z|) - 1), \quad f_{19}(z) = \operatorname{sgn}(z)\exp(-|z|), \quad f_{20}(z) = 0.5z, \quad f_{21}(z) = 2z, \\ &f_{22}(z) = 1 - \exp(-|z|), \quad f_{23}(z) = z - z^3, \quad f_{24}(z) = (1 + \exp(-z))^{-1}, \\ &f_{25}(z) = \begin{cases} 1, & \text{if } z > 0, \\ 0, & \text{otherwise}, \end{cases} \quad f_{26}(z) = \begin{cases} 0, & \text{if } |z| < \varepsilon, \\ \operatorname{sgn}(z), & \text{otherwise}, \end{cases} \\ &f_{27}(z) = \operatorname{sgn}(z)\left(1 - \sqrt{1 - z^2}\right), \quad f_{28}(z) = z(1 - \exp(-z^2)), \end{aligned} \tag{76}$$

$$\begin{aligned} &f_{29}(z_1, z_2) = z_1 + z_2, \quad f_{30}(z_1, z_2) = z_1 z_2, \quad f_{31}(z_1, z_2) = \max\{z_1, z_2\}, \quad f_{32}(z_1, z_2) = \min\{z_1, z_2\}, \\ &f_{33}(z_1, z_2) = z_1 + z_2 - z_1 z_2, \quad f_{34}(z_1, z_2) = \operatorname{sgn}(z_1 + z_2)\sqrt{z_1^2 + z_2^2}, \\ &f_{35}(z_1, z_2) = \operatorname{sgn}(z_1 + z_2)(|z_1| + |z_2|), \quad f_{36}(z_1, z_2) = \operatorname{sgn}(z_1 + z_2)|z_1||z_2|, \end{aligned} \tag{77}$$

$$\begin{aligned} &f_{37}(z_1, z_2, z_3) = \begin{cases} z_2, & \text{if } z_1 > 0, \\ z_3, & \text{otherwise}, \end{cases} \quad f_{38}(z_1, z_2, z_3) = \begin{cases} z_3, & \text{if } z_1 > z_2, \\ -z_3, & \text{otherwise}, \end{cases} \\ &f_{39}(z_1, z_2, z_3) = \begin{cases} z_2 + z_3, & \text{if } z_1 > 0, \\ z_2 - z_3, & \text{otherwise}, \end{cases} \quad f_{40}(z_1, z_2, z_3) = \max\{z_1, z_2, z_3\}. \end{aligned} \tag{78}$$
As a result of the calculations, variation Cartesian genetic programming obtained the following mathematical expressions for the control functions:

$$u_i = \begin{cases} u_i^+, & \text{if } \tilde{u}_i > u_i^+, \\ u_i^-, & \text{if } \tilde{u}_i < u_i^-, \\ \tilde{u}_i, & \text{otherwise}, \end{cases} \quad i = 1, 2, \tag{79}$$

where

$$\tilde{u}_1 = \operatorname{sgn}(q_3 \arctan(3 q_2 (A - x_3) + q_1 (x_1^f - x_1)) - x_3) \times (\exp|q_3 \arctan(3 q_2 (A - x_3) + q_1 (x_1^f - x_1)) - x_3| - 1), \tag{80}$$

$$\tilde{u}_2 = \sqrt[3]{-2 q_2 (A - x_3) + \operatorname{sgn}(x_1^f - x_1)} - 2 q_2 (A - x_3) + \operatorname{sgn}(x_1^f - x_1) + q_1 (x_1^f - x_1), \tag{81}$$

$A = \arctan(q_1 (x_1^f - x_1) q_2 (x_2^f - x_2))$, $q_1 = 3.61914$, $q_2 = 3.85645$, $q_3 = 3.36719$. To find these mathematical expressions the system (66) was integrated more than 2.5 million times. The code of Cartesian genetic programming for the obtained mathematical expressions has the following form:

$$G(u_i) = \left( \begin{array}{*{24}{r}} 30 & 30 & 9 & 5 & 30 & 13 & 3 & 30 & 29 & 29 & 29 & 29 & 29 & 30 & 15 & 3 & 39 & 13 & 29 & 29 & 29 & 1 & 30 & 18 \\ 1 & 2 & 11 & 11 & 11 & 15 & 4 & 16 & 18 & 19 & 11 & 21 & 20 & 14 & 22 & 24 & 25 & 26 & 27 & 28 & 29 & 30 & 31 & 32 \\ 5 & 6 & 7 & 11 & 12 & 6 & 2 & 17 & 6 & 19 & 19 & 20 & 9 & 23 & 13 & 9 & 7 & 18 & 17 & 24 & 23 & 10 & 23 & 11 \\ 6 & 0 & 3 & 4 & 5 & 1 & 5 & 3 & 4 & 5 & 6 & 8 & 16 & 18 & 19 & 10 & 10 & 14 & 26 & 21 & 22 & 24 & 22 & 24 \end{array} \right). \tag{82}$$

Formulas (80), (81) for the controls $\tilde{u}_i$, $i = 1, 2$, correspond to the last two columns. Figures 1 and 2 show the optimal trajectories of movement to the terminal point (69) from 16 different initial conditions.
Fig. 1. Optimal trajectories from eight initial conditions.
Fig. 2. Optimal trajectories from eight initial conditions.
The initial conditions for the solutions in Fig. 1 were $x^{0,1} = [2\ 2\ \pi/4]^T$, $x^{0,5} = [-2\ 2\ \pi/4]^T$, $x^{0,21} = [2\ {-2}\ \pi/4]^T$, $x^{0,25} = [-2\ {-2}\ \pi/4]^T$, $x^{0,51} = [2\ 2\ {-\pi/4}]^T$, $x^{0,55} = [-2\ 2\ {-\pi/4}]^T$, $x^{0,71} = [2\ {-2}\ {-\pi/4}]^T$, $x^{0,75} = [-2\ {-2}\ {-\pi/4}]^T$. The initial conditions for the solutions in Fig. 2 were $x^{0,26} = [2\ 2\ 0]^T$, $x^{0,27} = [1\ 2\ 0]^T$, $x^{0,29} = [-1\ 2\ 0]^T$, $x^{0,30} = [-2\ 2\ 0]^T$, $x^{0,46} = [2\ {-2}\ 0]^T$, $x^{0,47} = [1\ {-2}\ 0]^T$, $x^{0,49} = [-1\ {-2}\ 0]^T$, $x^{0,50} = [-2\ {-2}\ 0]^T$. Figures 1 and 2 show that all trajectories reach the terminal point with sufficient accuracy from the different given initial conditions.
7 Conclusions
This work is devoted to the automation of control system synthesis. The problem of numerical synthesis of a control system is formulated. In this problem a set of initial conditions and one terminal position are given, and it is necessary to find a feedback control function such that the control object can reach the terminal position from all initial conditions. For this purpose it is proposed to use a new method of symbolic regression, variation Cartesian genetic programming. The form in which Cartesian genetic programming codes a mathematical expression is presented. The code is an integer matrix, and every column of this matrix is a call of an elementary function. To search for the optimal solution more effectively, it is proposed to use the principle of small variations of the basic solution. Examples of crossover operations for Cartesian genetic programming and for its variation modification are presented. The space of machine-made functions is introduced. In this space functions are never equal to infinity. Four properties of the space of machine-made functions are formulated. It is shown that all such functions can be represented by a Taylor series with a finite number of terms. In the computational experiment the synthesis problem of a stabilization system for a mobile robot is considered. In the problem seventy-five initial conditions were given. The set of elementary functions includes forty elements: twenty-eight functions with one argument, eight functions with two arguments, and four functions with three arguments. Variation Cartesian genetic programming found a nonlinear control function that solves this problem. Further, it is necessary to study different symbolic regression methods and to compare them on problems where a solution is required in the form of a mathematical expression. It is also necessary to continue studies of the new mathematical space of machine-made functions. Acknowledgments. This work was performed with partial support from the Russian Science Foundation (project No 19-11-00258).
References
1. Koza, J.R.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA, USA (1992)
2. Ryan, C., Collins, J., O'Neill, M.: Grammatical evolution: evolving programs for an arbitrary language. In: Banzhaf, W., Poli, R., Schoenauer, M., Fogarty, T.C. (eds.) Genetic Programming. EuroGP 1998. Lecture Notes in Computer Science, vol. 1391, pp. 83–96. Springer, Berlin, Heidelberg (1998). https://doi.org/10.1007/BFb0055930
3. Zelinka, I., Oplatkova, Z., Nolle, L.: Analytic programming symbolic regression by means of arbitrary evolutionary algorithm. Int. J. Simul. Syst. Sci. Technol. 6(9), 44–56 (2005)
4. Miller, J.F., Thomson, P.: Cartesian genetic programming. In: Proceedings of the European Conference on Genetic Programming (EuroGP 2000), vol. 1802, pp. 121–132. Milan, Italy (2000). https://doi.org/10.1007/978-3-540-46239-2_9
5. Nikolaev, N., Iba, H.: Adaptive Learning of Polynomial Networks. In: Goldberg, D., Koza, J. (eds.) Inductive Genetic Programming. Genetic and Evolutionary Computation, pp. 25–80 (2006). https://doi.org/10.1007/0-387-31240-4_2
6. Diveev, A.I., Sofronova, E.A.: The network operator method for search of the most suitable mathematical equation. Chapter in the book Bio-Inspired Comput. Algorithm Appl. 47(3), 7061–7066 (2012). https://doi.org/10.5772/36071
7. Luo, Ch., Zhang, S.-L.: Parse-matrix evolution for symbolic regression. Eng. Appl. Artif. Intell. 25(6), 1182–1193 (2012). https://doi.org/10.1016/j.engappai.2012.05.015
8. Diveev, A., Sofronova, E.: Automation of synthesized optimal control problem solution for mobile robot by genetic programming. Adv. Intell. Syst. Comput. 1038, 1054–1072 (2019). https://doi.org/10.1007/978-3-030-29513-4_77
9. Diveev, A.: Small variations of basic solution method for non-numerical optimization. IFAC-Papers-Online 38, 28–33 (2015). https://doi.org/10.1016/j.ifacol.2015.11.054
10. Diveev, A., Ibadulla, S., Konyrbaev, N., Shmalko, E.: Variational genetic programming for optimal control system synthesis of mobile robots. IFAC-Papers-Online 48, 106–111 (2015). https://doi.org/10.1016/j.ifacol.2015.12.018
11. Diveev, A., Ibadulla, S., Konyrbaev, N., Shmalko, E.: Variational analytic programming for synthesis of optimal control for flying robot. IFAC-Papers-Online 48, 75–80 (2015). https://doi.org/10.1016/j.ifacol.2015.12.013
12. Mizhidon, A.D.: On a problem of analytic design of an optimal controller. Autom. Remote Control 72(11), 2315–2327 (2011). https://doi.org/10.1134/S0005117911110063
13. Podvalny, S.L., Vasiljev, E.M.: Analytical synthesis of aggregated regulators for unmanned aerial vehicles. J. Math. Sci. 239(2), 135–145 (2019). https://doi.org/10.1007/s10958-019-04295-w
14. Prajna, S., Parrilo, P.A., Rantzer, A.: Nonlinear control synthesis by convex optimization. IEEE Trans. Autom. Control 49(2), 304–314 (2004). https://doi.org/10.1109/TAC.2003.823000
15. Diveev, A.I.: A numerical method for network operator for synthesis of a control system with uncertain initial values. J. Comput. Syst. Sci. Int. 51(2), 228–243 (2012). https://doi.org/10.1134/S1064230712010066
16. Šuster, P., Jadlovská, A.: Tracking trajectory of the mobile robot Khepera II using approaches of artificial intelligence. Acta Electrotechnica et Informatica 11(1), 38–43 (2011). https://doi.org/10.2478/v10198-011-0006-y
Reverse Engineering: The University Distributed Services
M. Amin Yazdi and Marius Politze
IT Center, RWTH Aachen University, Aachen, Germany
{yazdi,politze}@itc.rwth-aachen.de
Abstract. In response to the growing demand for web services, there is a rapid increase in distributed systems. Accordingly, software architects design components in a modular fashion to allow for higher flexibility and scalability. In such an infrastructure, a variety of microservices continuously evolve to respond to the needs of every application. These microservices asynchronously provide reusable modules for other services. To gain valuable insights into the actual software or dynamic user behaviors within distributed systems, the data mining and process mining disciplines provide many powerful data-driven analysis techniques. However, gaining reliable insights into the overall architecture of a heterogeneous distributed system has proved to be challenging and tedious. In this paper, on the one hand, we present a novel approach that enables domain experts to reverse engineer the architecture of the distributed system and monitor its status. On the other hand, it allows the analysis and extraction of new insights about dynamic usage patterns within a distributed environment. With the help of two case studies under real-life conditions, we have assessed our methodology and demonstrated the validity of our approach to discover new insights and bottlenecks in the system.
Keywords: Distributed services · Reverse engineering · Software systems · Data science · Process mining
1 Introduction
Continuous digitalization of services leads us toward heterogeneous and decentralized systems. Distributed systems benefit from the flexibility, extendibility, reusability, and maintainability of microservices. On the other hand, they also pose continuous challenges such as waste of resources, complicated coordination, and the tedious effort required to extract comprehensive data for analysis or troubleshooting. Due to the high demand for analysis of data-flow within distributed systems, we investigated and rolled out our methodology within the IT Center services at RWTH Aachen University. The University IT infrastructure supports the development of various decentralized but interconnected microservices [20]. However, it is often challenging to comprehend the intercommunication of different microservices. Usually, to respond to the business processes, microservices and applications need to intercommunicate systematically, resulting in an interdependency. In this diverse environment, some microservices are built for particular purposes, yet are also partially used by various applications. These applications utilize the interlinks between the services to adhere to the concept of reusability of software modules. While users interact with the system continuously and concurrently, simultaneous agile development on every single service poses significant threats to a seamless process workflow [27]. Some of the microservices are highly integrated, governed, and tested, while at the same time, there are ad-hoc microservices that are continuously expanding without a clear overview of the overarching architectural constraints and possible dependencies. Despite these requirements and challenges, there is often no apparent architectural oversight to identify which microservices are operating in which applications and to inform essential strategies around the development, maintenance, and consolidation of the services.
© Springer Nature Switzerland AG 2021. K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 223–238, 2021. https://doi.org/10.1007/978-3-030-63089-8_14
In this study, we focus on enabling the analysis of real-life operational processes, structure, and dynamic behavior of users in an interactive Distributed System (DS). In order to analyze such systems, various modeling techniques help to demonstrate the intercommunication of different components. Here, process mining and data science help us to better understand the operational aspects of the system. Hence, we propose and demonstrate a novel method to obtain operational processes in a real-life scenario by enabling reverse engineering of the software architecture in distributed services. Furthermore, we confirm the validity of the technique by discovering and analyzing the dynamic behavior patterns within the System Under Study (SUS). In our view, every reverse engineering discovery and modeling technique has to address several criteria.
In such methods, one has to address the concerns and constraints of the method used to obtain, collect, process, and analyze the data. Accordingly, it is essential to elaborate on the infrastructure where the data is collected and on how the method handles multi-threaded data. Furthermore, the methodology used to correlate information and the target model have to be discussed. Due to the complex nature of distributed services, data are often stored at multiple locations and in different databases, making it challenging to centralize the data and elicit useful information from the mesh of microservices that are interconnected and running concurrently. The overall methodology of this study is as follows: i) We describe and analyze the SUS. ii) We justify the approach used for gathering data. iii) We elaborate on data preprocessing and on preparing the data for further analysis. iv) Finally, we examine the reliability of our approach in two case studies. The rest of this paper is organized as follows: Sect. 2 introduces the related work and studies concerning data analysis in distributed settings. Section 3 describes the methodology used to empower data analysis and reverse engineer the SUS. In Sect. 4, we discuss the assessment of our approach in two different case studies with the help of process mining and data mining techniques. Section 5 focuses on the challenges experienced during the implementation of our methodology and on suggestions for future improvements. Finally, in Sect. 6, we highlight the benefits and the contributions of our work.
Table 1. Bird’s-eye view on different reverse engineering approaches to analyze distributed systems.

| Author | Information Source | Application Layer | Correlation of Events | Sequence Order | Communication Type | Performance Indicator | Granularity | Environment | Target Model |
|---|---|---|---|---|---|---|---|---|---|
| Ackermann et al. [1] | Network Packets | TCP/IP | Network Packets | Dependent | Single Thread | No | Components | Real | UML |
| Beschastnikh et al. [2] | Instrumentation | Java | Comm. Channels | Dependent | Multi Thread | No | Components | Test | CFSM |
| Briand et al. [3] | Instrumentation | Java | Comm. Channels | Dependent | Single Thread | No | Control-Flow | Test | UML |
| Van Hoorn et al. [25] | Instrumentation | Java | No Correlation | Independent | Multi Thread | No | Components | Real | Monitor Log |
| Leemans et al. [11] | Instrumentation | Independent | Comm. Channels | Dependent | Single Thread | Yes | External Interface | Real | Process Model |
| Moe et al. [12] | Network Interceptor | CORBA | No Correlation | Dependent | Multi Thread | Yes | Varied | Real | LQN |
| Salah et al. [18] | Provided Log | Java | Comm. Channels | Dependent | Multi Thread | No | Components | Test | UML |
| Yazdi et al. | Instrumentation | Independent | OAuth2 Tokens | Dependent | Multi Thread | Yes | Hierarchical | Real | Process & Data Model |

2 Related Work
The attempt to analyze software behavior in DS using reverse engineering is not new, and several works exist in the field. In this section, we discuss the literature that aims to understand the behavior of DS, the analysis techniques, and their challenges.
Analysis of Distributed Systems: To understand the behavior of a system, it is not enough to look into the source code, because the intercommunication between different components and the dynamic bindings within a system cannot be observed there. Hence, in Table 1, we compare and give a bird’s-eye view of different approaches used to analyze DS. To facilitate the comparison of the proposed approaches and to position our work within the state of the art, we use multiple criteria: Communication Type, Correlation of Distributed Events, Application Layer, Environment, Information Source, Granularity, Performance Analysis, and Target Model.
Information Source: The strategy used to retrieve dynamic data. Similar to our approach, most of the suggested methods [3,11,25] use some sort of instrumentation or alteration of existing code to generate the traces of information required for analysis. In [2], the authors were able to adopt previously generated logs to acquire the necessary data. The authors of [1,12] monitor and intercept the low-level network packets between clients and servers as a source of information.
Application Layer: The constraint that ties a proposed approach to a particular programming language or platform when reverse engineering a DS. Most of the related studies focus on the instrumentation of the Java programming language [2,3,18,25]. Furthermore, [1] relies on the TCP/IP application layer, and [12] targets the CORBA translation layer and acts as middleware.
Correlation of Distributed Events: The methodology used to correlate executed events in a DS. The authors of [2,3,11,18] use extra communication channels to inspect and indicate the correlation of events. Instead, the authors of [1] focus on the network packets transmitted between the sender and the receiver.
Sequence Order: The order of execution of events within a process. For the analysis of traces, all authors except [25] have suggested techniques that reverse engineer the run-time software execution with respect to the order of the events occurring during a process. Therefore, a trace variant directly influences the target model.
Communication Type: Whether the tasks of a process are executed in a single-threaded or multi-threaded fashion. It refers to the capacity of the approach to interpret multiple software execution processes simultaneously. In [1,3,11], the proposed technique is capable of analyzing single-threaded processes, while others [2,12,18,25] facilitate the evaluation of multi-threaded software execution.
Performance Indicator: The possibility of using a technique to analyze software run-time performance in a DS environment. Only a few existing techniques [11,12] facilitate the analysis of software execution performance and hence help to find bottlenecks in a system.
Granularity: The level of detail of the information acquired by an approach. In [1,2,12,18], the authors focus on the behavior captured at the software component level. On the other hand, the authors of [3] focus on deep, low-level software control-flow information. In [11], the authors attempt to capture the user requests and trace the software execution cycle that responds to those requests.
Environment: The environment for which a method is built. To replicate a reverse engineering approach for a DS, it is essential that the method works in real-life settings. However, the authors of [2,3,18] have only examined their proposed solutions in a lab/test environment. This is especially critical in an expanding DS, as the scalability of a solution depends on also being able to handle real-life software execution processes without further readjustments.
Target Model: The model that an approach can produce. The strategies introduced in [1,3,18] aim at generating UML sequence diagrams. In [2], the authors use Communicating Finite State Machines (CFSM) as their target model. The authors of [12] use the Layered Queuing Networks (LQN) model to study trade-offs in software architecture and to predict the effects of architectural changes before the actual implementation. Similar to the work done in [11], we focus on process models to precisely express the mapping of traces to the expected software behavior. Additionally, in Sect. 4.2, we explore the possibility of eliciting data mining models using our method.
Despite the various studies in the area of reverse engineering, many of them focus on a testing environment and on the generation of UML sequence diagrams. In a real-life scenario, the suggested approach should have a minimal impact on the performance of the system. Additionally, the proposed approaches generate data at a fixed granularity for analysis purposes: some are too detailed, and some are very high level. In a huge distributed system environment like ours, it is essential to propose a solution that can acquire the right level of information with respect to the requirements at hand. Hence, the current approaches
lack scalability and the ability to adapt the level of data. Moreover, every microservice in a distributed environment may be written in a different programming language, so one should focus on solutions that are language independent. As the SUS is under continuous implementation and operation, our solution should be non-intrusive to business logic. To summarize, the main challenge for analyzing distributed systems is to discover the interactivity aspect of a DS and the intercommunication between different system components.
Data and Process Science. As explained in [8,15,16], data science includes, but is not limited to, data extraction, preparation, transformation, presentation, prediction, and visualization of massive amounts of structured or unstructured data that are static or streaming. It aims to turn data into real value by analyzing large data sets and obtaining insights from them. A comprehensive description of data science is given by Van der Aalst [24]: “Data science is an interdisciplinary field, ..., it includes data extraction, data preparation, data exploration, data transformation, storage and retrieval, computing infrastructures, various types of mining and learning, presentation of explanations and predictions, and the exploitation of results taking into account ethical, social, legal, and business aspects”. As a sub-discipline of data science, Van der Aalst has introduced process mining and suggested three types of techniques [21,24]. Discovery techniques, such as Fuzzy miner, Inductive miner, etc. [4,22,23], are usually used as a starting point for extracting a process model from raw event logs. Conformance checking compares an existing process model with an event log of the same process to check whether the extracted model complies with reality and vice versa. This gives opportunities to diagnose deviations and to determine the effects of non-compliant events and bottlenecks.
Enhancement aims to improve the actual process performance by extending the former model [7]. Additionally, data and process mining enable us to monitor services in real time and allow for early detection of anomalies within systems. To our knowledge, the majority of techniques do not address the issue of correlation in distributed systems in a scalable fashion, and they focus on discovering a control-flow model rather than the interactivity of software components. Therefore, we argue that a reverse engineering technique is needed to discover precise models, to gain insights into the overall architecture of the DS, and to reflect a real-life scenario of the system’s dynamic behavior.
3 Proposed Approach
In the following section, we describe the methodology implemented at the university. To achieve an optimal solution and to enable the capturing of relevant data, it is first essential to gain sufficient knowledge of the current distributed infrastructure at the university and to anticipate possible challenges. Then, we identify the sweet spots within the current paradigm of the SUS and apply the changes necessary to collect the desired data. According to the challenges discussed in Sect. 2, we
expect to encounter added noise and outliers among the gathered data when dealing with distributed services. Therefore, the acquired data has to go through an evaluation step to reach the data quality required by the data analysis measures.
3.1 Methodology
Hence, on the one hand, it is essential to begin by investigating all available data sources; on the other hand, the data must be scoped to avoid being overwhelmed by a flood of data. In our case, we needed to discover where in the underlying infrastructure acquiring data is the most efficient, reliable, and least expensive. We began by asking stakeholders and software architects to assist us in gaining a mutual understanding of the constraints and the expected data structure within the available services. With every iteration of the methodology evaluation, the findings were discussed with domain experts to revise the method development path and its validity. Besides acquiring sufficient data to study the dynamic system behaviors, we aim for an approach that does not create additional work for developers and has the least impact on the efficiency of the SUS. We attempted to achieve this by minimal instrumentation of a dedicated token service using the OAuth2 workflow.
Our university web services demand independent service operators that are joined together to form the system landscape. Such an IT architecture guarantees scalability, maintainability, and reusability of the software components. Thus, the task of responding to a user request can be processed utilizing multiple web services. As an example, the software components for archiving a file can be reused by other applications in the system to initiate an archiving process. Due to the scattered data-flow within such a distributed but interconnected workflow, analyzing the dynamic behavior of components during usage and user activities becomes very challenging and nearly impossible, especially when dealing with legacy systems.

Table 2. Sample of aggregated data from distributed services using multiple data sources.

| Timestamp | Access token | User hashId | Method call | Microservice | Role |
|---|---|---|---|---|---|
| ... | ... | ... | ... | ... | ... |
| 2019-12-16 08:59:55 | Izo9VZ9vz | dhcwb4MII | GetInfo | userinfo.rwth | Student |
| 2019-12-16 08:59:56 | Izo9VZ9vz | dhcwb4MII | GetNotifications | editpns.rwth | Student |
| ... | ... | ... | ... | ... | ... |
| 2019-12-16 09:00:23 | fW6O7i85R | LOrv8dNur | GetWeeks | campus.rwth | Student |
| ... | ... | ... | ... | ... | ... |
| 2019-12-16 09:00:32 | CF12boZC3 | nfW6O7i85 | GetReport | campus.rwth | Student-employee |
| 2019-12-16 09:00:32 | CF12boZC3 | nfW6O7i85 | IsAuthorized | editpns.rwth | Student-employee |
| 2019-12-16 09:01:07 | nCkCF12bo | nfW6O7i85 | GetAllFiles | simplearchive.rwth | Student-employee |
| 2019-12-16 09:03:15 | eoeBGbLfl | nfW6O7i85 | GetSchema | metadataadmin.rwth | Student-employee |
| ... | ... | ... | ... | ... | ... |
| 2019-12-16 09:15:18 | mcnneXNOe | 6XE8m5KaE | GetReport | campus.rwth | Student |
| 2019-12-16 09:15:19 | mcnneXNOe | 6XE8m5KaE | GetNotifications | editpns.rwth | Student |
| ... | ... | ... | ... | ... | ... |
| 2019-12-16 09:34:46 | skPXZjMwR | 9PtIBdGFS | GetPicturesForUser | picturemanagement.rwth | Employee |
| ... | ... | ... | ... | ... | ... |
Table 2 shows a sample of the aggregated data collected by our approach. The timestamp is the time when a request was executed and processed. The access token is the unique token that is used during the authorization process. The user hashId is a hashed and anonymized unique identifier of the user. The method call column records the software components that were triggered. The microservice is a unique identifier for every single microservice involved in the process. Furthermore, the role column identifies the relationship between the user and the organization, as provided by the identity management service. By using process mining techniques, we can discover real-life processes and identify units for improvement. Traces are created by logging the execution threads (event logs) between different node instances. These traces represent the life-cycle of various processes and hence allow us to discover the behavior of a system. In this system, a user request creates a sequence of event logs that may scatter to multiple nodes depending on the nature of the request. Each sequence is collected individually as a trace, and each trace corresponds to one execution process instance that contributes to the overall model. These traces can be service requests by external users or internal communication between different service components.

[Figure 1: diagram connecting the Token Service, the Distributed System, and Offline Data Analysis through the steps 1) Initialize, 2) Request access, 3) Consent access, 4) Issue token, 5) Request data, 6) Verify token, 7) Respond with data, 8) Log export.]
Fig. 1. The schematic representation of the OAuth2 workflow in distributed services at RWTH Aachen University and our high-level strategy for data acquisition.
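The trace construction described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes that the access token serves as the case identifier, reuses four rows from Table 2, and the names `LOG`, `build_traces`, and `TRACES` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Rows copied from Table 2, deliberately out of order:
# (timestamp, access token, user hashId, method call, microservice, role)
LOG = [
    ("2019-12-16 09:00:32", "CF12boZC3", "nfW6O7i85", "IsAuthorized", "editpns.rwth", "Student-employee"),
    ("2019-12-16 08:59:56", "Izo9VZ9vz", "dhcwb4MII", "GetNotifications", "editpns.rwth", "Student"),
    ("2019-12-16 08:59:55", "Izo9VZ9vz", "dhcwb4MII", "GetInfo", "userinfo.rwth", "Student"),
    ("2019-12-16 09:00:32", "CF12boZC3", "nfW6O7i85", "GetReport", "campus.rwth", "Student-employee"),
]


def build_traces(rows):
    """Group events by access token (assumed case id) and order them by time."""
    traces = defaultdict(list)
    for ts, token, _user, method, service, _role in rows:
        when = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        traces[token].append((when, service, method))
    # sorted() is stable, so events with equal timestamps keep their input order
    return {token: sorted(events, key=lambda e: e[0]) for token, events in traces.items()}


TRACES = build_traces(LOG)
for token, events in TRACES.items():
    print(token, [f"{service}:{method}" for _, service, method in events])
```

With one-second timestamps, events of the same token can tie (as in the two 09:00:32 rows), so a real log would likely need a finer clock or a sequence counter to fix their order.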
Information Source: The OAuth2 workflow implemented by the token service facilitates a secure channel to authorize web services and handles users’ access authorization without supplying the users’ credentials [13]. Figure 1 shows the steps of the authorization process using the OAuth2 service and the information retrieval strategy.
1. The workflow is initialized by a user accessing an application that requires access to one of the microservices in the context of that user.
2. The application redirects the user to the web interface of the token service.
3. The user consents that the application accesses the microservice in their name.
4. The token service issues an access token to the application as a representation of the authorization.
5. On each request to the microservices, the application uses the token to convey the authorization.
6. For each successive request, the microservice checks the validity of the token against the token service.
7. The application, and therefore the user, receives a response with the requested data from the microservice.
8. This validation workflow allows the token service to intercept and record all requests to microservices that require authorization [14] and thus to export instrumented logs for further analysis.
By employing the token service to collect and aggregate information about processed requests, and by exporting the logs to offline data analysis tools, we can monitor the resources and services that are involved in software and user workflows. We assess our approach according to the criteria introduced in Sect. 2.
Application Layer: By instrumenting the token service, we can target the intercommunication of software components via different interfaces regardless of the programming language of the other services. Hence, this is a language-independent method that allows for ease of use, maintainability, and scalability throughout the SUS.
Correlation of Distributed Events: Recall that we are interested in reverse engineering the dynamic behavior of the software execution life-cycle per user request. Therefore, tracing the sequences of events occurring between different software components can yield promising results. To accomplish this, we focus on a non-intrusive approach that captures enough data across all services without interfering with legacy code or modifying source code manually. Referring to Table 2, we argue that the access token generated by the token service, together with the user id, is a reliable candidate for the correlation of events in the SUS, as the token service is a common component across all services.
Sequence Order: Given that several user requests may be executed simultaneously, it is essential to capture the events in the order in which they occur. Additionally, this enables conformance checking: the sequence of events that was executed to fulfill a user’s request can be compared to the expected sequence order.
Communication Type: To respond to a single resource request, multiple threads of activities involving several microservices may be executed. Nevertheless, all activities that require authorization and token validation (step 6 in Fig. 1) are captured and stored. Therefore, all instances of a token being used by multiple services can be captured and analyzed, regardless of the simultaneous association of several microservices with a single user request.
Performance: Performance indicators are essential for tracking how the SUS behaves and for discovering bottlenecks in high-level as well as low-level processes. The token service generates a timestamp per log entry; hence, we use the timestamps to analyze the performance of the SUS in responding to user requests and to discover bottlenecks.
Data Granularity: Referring to the sample data shown in Table 2, the logging system captures high-level data, such as the microservice in demand, as well as low-level data, such as the method calls (software components) involved in responding to a user request. Moreover, the database can capture as much data as required to add more contextual analysis, such as the role of a user, allowing for organizational mining (see Sect. 4.2). However, the data set is limited to requests that pass through the authorization life-cycle.
Environment: Instead of a controlled environment, our technique focuses on capturing real-life data-flow in a distributed environment that is generated by actual user interactions.
Target Model: Following our methodology, we aim to create process models (e.g., Sect. 4.1) and data models (e.g., Sect. 4.2) with precise and clear semantics.
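To make the role of the token service as the single logging point more concrete, the following is a rough in-memory sketch of steps 4, 6, and 8 of Fig. 1. It is an assumption-laden illustration, not the RWTH token service: the class `TokenService`, its methods, and the choice not to log rejected tokens are all hypothetical, and the field names merely mirror the columns of Table 2.

```python
import secrets
from datetime import datetime, timezone


class TokenService:
    """Hypothetical in-memory token service used only for illustration."""

    def __init__(self):
        self._tokens = {}  # access token -> user hash id
        self.log = []      # one entry per successful verification

    def issue(self, user_hash_id):
        """Step 4: issue an access token once the user has consented."""
        token = secrets.token_urlsafe(8)
        self._tokens[token] = user_hash_id
        return token

    def verify(self, token, microservice, method):
        """Step 6: a microservice validates the token; the check is logged."""
        user = self._tokens.get(token)
        if user is None:
            return False  # assumption: rejected tokens are not logged
        self.log.append({
            "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
            "access_token": token,
            "user_hash_id": user,
            "method_call": method,
            "microservice": microservice,
        })
        return True

    def export_log(self):
        """Step 8: hand a copy of the collected entries to offline analysis."""
        return list(self.log)


service = TokenService()
token = service.issue("dhcwb4MII")
service.verify(token, "userinfo.rwth", "GetInfo")
service.verify(token, "editpns.rwth", "GetNotifications")
service.verify("not-a-token", "campus.rwth", "GetReport")  # rejected
for entry in service.export_log():
    print(entry["microservice"], entry["method_call"])
```

Because every microservice has to call `verify`, the services themselves need no logging code, which is the non-intrusive property the text argues for.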
[Figure 2: process map over the microservices editpns, managedas, campus, eduroam, userinfo, metadataadmin, portal, feedback.rwthapp, userdetails, picturemanagement, metadata, sendpns, and simplearchive (all in the rwth domain); nodes and edges are annotated with case frequencies in panel (a) and with durations in panel (b).]
(a) Absolute frequency of service usage. Darker node or edge color represents higher case frequency.
(b) Performance analysis and bottleneck discovery. Thicker and darker edges represent higher delays in a process.
Fig. 2. Discovered business process model of SUS using fuzzy miner at service level.
3.2 Data Preparation
Often the complexity and challenges of eliciting quality data from raw data are underestimated. Due to the presence of noise, outliers, undetected variations, redundancies, and even missing values in data gathered in an interconnected and heterogeneous setting, data preparation becomes a time-consuming, error-prone, and iterative task [5]. Hence, we established a standard procedure for offline data preprocessing to maintain the data quality required by the data analysis objectives. Initially, the data and its attributes are filtered, anonymized, and cleaned based on the scope and the research questions. Later, noise and outliers are detected using Z-score analysis [19] and excluded from the sample data. Finally, the reliability of the obtained data can only be confirmed if the sample data adheres to data completeness, integrity, accuracy, and consistency standards.
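As an illustration of the Z-score step, the sketch below removes values that lie more than a chosen number of standard deviations from the mean. The text does not state which attribute is tested or which threshold is used, so the delay values, the threshold of 3, and the function name `drop_outliers` are assumptions made only for this example.

```python
from statistics import mean, pstdev


def drop_outliers(values, threshold=3.0):
    """Keep the values whose absolute Z-score is at most `threshold`."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return list(values)  # all values identical, nothing to exclude
    return [v for v in values if abs(v - mu) / sigma <= threshold]


# Hypothetical response delays in seconds; the last value is an obvious outlier.
delays = [0.2, 0.3, 0.25, 0.22, 0.28, 0.31, 0.27, 0.24, 0.26, 0.29, 0.23, 60.0]
cleaned = drop_outliers(delays)
print(len(delays), "->", len(cleaned))
```

Note that in a sample of n values the largest possible population Z-score is the square root of n - 1, so a threshold of 3 can only flag anything once the sample has more than ten values.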
4 Experimental Evaluations
This section discusses two case studies that evaluate the validity of our approach to reverse engineer the architecture of the SUS with the help of process mining and data mining. We employed process mining for the business process model discovery task and data science for the organizational mining task.
4.1 Business Process Model
In this case study, we investigate and discover the process model of the interconnected microservices by applying process mining to the aggregated data produced by our approach. We hypothesized that our approach could support process mining in extracting business process models from the distributed system. Hence, we used process mining to answer the following research questions:
1. To what extent can process mining support reverse engineering of the dynamic behavior of the SUS to draw insights?
2. What are the main bottlenecks in the architecture of the system that should be addressed to improve the overall user experience?
Log Extraction: As described in Sect. 2, process mining requires logs that include timestamps, case ids, and activities. Our approach aggregates logs in this form. Every user interaction on the client side triggers a set of activities and involves several resources to respond accordingly. Our data sample for one week contains 102,981 cases, 776,370 events, and 13 activities.
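The three log attributes and the summary figures reported above can be illustrated on a handful of rows from Table 2. The mapping used here (access token as case id, microservice as activity) is an assumption for illustration; the text does not state which columns were chosen, and `EVENTS`, `summarize`, and `SUMMARY` are hypothetical names.

```python
# (case id, activity, timestamp) triples taken from Table 2.
# Assumed mapping: case id = access token, activity = microservice.
EVENTS = [
    ("Izo9VZ9vz", "userinfo.rwth", "2019-12-16 08:59:55"),
    ("Izo9VZ9vz", "editpns.rwth", "2019-12-16 08:59:56"),
    ("fW6O7i85R", "campus.rwth", "2019-12-16 09:00:23"),
    ("CF12boZC3", "campus.rwth", "2019-12-16 09:00:32"),
    ("CF12boZC3", "editpns.rwth", "2019-12-16 09:00:32"),
]


def summarize(events):
    """Count the cases, events, and distinct activities of an event log."""
    cases = {case for case, _, _ in events}
    activities = {activity for _, activity, _ in events}
    return {"cases": len(cases), "events": len(events), "activities": len(activities)}


SUMMARY = summarize(EVENTS)
print(SUMMARY)
```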
Log Analysis: We used the Disco tool [9] for our process mining task and model analysis because of its usability, reliability, and performance. Disco utilizes the Fuzzy miner [24] to generate simplified process models while highlighting the most frequent activities. Each case represents a complete process containing one or several services involved in the execution life-cycle. The nodes and the edges of the model represent the activities and the execution paths, respectively. The numbers at the nodes show activity frequencies, and the numbers at the arcs show the frequencies of the corresponding execution paths.
Analysis Results: Figure 2 shows the model derived by applying the Fuzzy miner to the data sample illustrated in Table 2. By looking at the discovered process model, we gain new insights into the actual behavior of the system, and it is possible to identify which services work as clusters and which behave as stand-alone services. Moreover, it is clear that editpns (the notification service) and userinfo (the authorization service) act as middleware microservices; if either of them fails, it may cause significant interruptions in the whole system. Additionally, from Fig. 2a, we can note that these are the most frequently used services, which can help a system architect to draw the attention of the development team to the services with the highest demand. Overall, besides providing insights for the system architecture team, we plan to apply this method to further evaluate the software development process, observing how newly added functionality affects the rest of the system. Furthermore, by focusing on a specific use case scenario, we can identify the most frequent start and end points of traces within the whole distributed environment. Additionally, Fig. 2b illustrates the bottlenecks in the SUS.
It is clear that editpns and userinfo services are having a significant impact on application performance throughout the SUS while processing a huge number of requests. Now, given our collected dataset, it is possible to drill down and focus on methods triggered within each service to identify the primary source of delays in user response at the software components level. As an example, by using the created hierarchical model shown in Fig. 3, we drilled down and further investigated the editpns service to recognize the possible source of the issue in this service inefficiency. By employing the discovered model, we can immediately identify which other software components from the other three services are contributing to the problem of the bottleneck in the editpns. As illustrated in Fig. 3, the method call Canteens/GetCoffeeBars from the portal, User/GetInfo from userinfo, and Exams/GetReport from the campus are the candidates for further investigation as they appear to be contributing to inefficiencies in processing requests of Notifications/Devices/IsAuthorized method in the editpns service. Overall, by utilizing this reverse engineering method besides answering our research questions, several unknown architectural issues were discovered across the whole SUS. These were unnecessary method calls, method-loops or interservice-loops, unnecessary authorization checks, and code inefficiencies within software components.
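The frequency and performance annotations described above can be approximated with a directly-follows count. This is only a minimal sketch and not the Fuzzy miner used by Disco; the function name, the trace layout, and the sample timestamps are assumptions for illustration.

```python
from collections import defaultdict


def directly_follows(traces):
    """traces: {case_id: [(activity, seconds), ...]}, each list sorted by time.

    Returns node frequencies, arc frequencies, and the mean time between
    the two activities of each arc.
    """
    node_freq = defaultdict(int)
    arc_freq = defaultdict(int)
    arc_time = defaultdict(float)
    for events in traces.values():
        for activity, _ in events:
            node_freq[activity] += 1
        # Pair every event with its direct successor in the same case.
        for (a, t_a), (b, t_b) in zip(events, events[1:]):
            arc_freq[(a, b)] += 1
            arc_time[(a, b)] += t_b - t_a
    arc_mean = {arc: arc_time[arc] / n for arc, n in arc_freq.items()}
    return dict(node_freq), dict(arc_freq), arc_mean


# Invented sample: two cases with timestamps in seconds.
sample = {
    "c1": [("User/GetInfo", 0.0), ("Notifications/Devices/IsAuthorized", 2.0)],
    "c2": [("User/GetInfo", 10.0), ("Notifications/Devices/IsAuthorized", 16.0),
           ("Exams/GetReport", 17.0)],
}

nodes, arcs, mean_gap = directly_follows(sample)
# The arc with the largest mean gap is a candidate bottleneck.
slowest = max(mean_gap, key=mean_gap.get)
print(nodes, arcs, mean_gap, slowest)
```

Because only the timestamps of incoming calls are available, the gap between two consecutive events is a rough proxy for delay rather than a measured processing time.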
M. A. Yazdi and M. Politze
4.2 Organizational Mining
To further support the reliability and extensibility of our methodology, we extended our data set with organizational aspects (user roles). The objective of organizational mining is to use data mining techniques to identify and predict the role of a user from the set of activities that the user executes. This analysis investigates the feasibility and accuracy of data mining in extracting users’ roles at the university services by evaluating the dataset captured by our technique. Currently, our systems record three types of roles, namely student, employee, and student-employee. The student-employee role identifies individuals who have both the student and the employee affiliation.
[Figure 3: hierarchical process map of the services editpns.rwth, portal.rwth, userinfo.rwth, and campus.rwth. Method nodes: Canteens/GetCoffeeBars, Canteens/GetWeekMenu, Canteens/GetCanteens, Notifications/Devices/IsAuthorized, Notifications/GetNotifications, Apps/RWTHApp/GetBlockedSites, Events/Timetable/GetWeeks, User/GetInfo, Exams/GetReport. Edge labels give durations ranging from instant to 14.6 mins.]
Fig. 3. Performance analysis at software components level. Each colored group represents a service and illustrates the hierarchical representation.

Table 3. Descriptive matrix generated by transforming the sample dataset shown in Table 2.

Role              GetInfo  GetNotifications  GetWeeks  GetReport  IsAuthorized  GetAllFiles  GetSchema  ...
Student           1        1                 0         0          0             0            0          ...
Student           0        0                 1         0          0             0            0          ...
Student-employee  0        0                 0         1          1             1            1          ...
Student           0        1                 0         1          0             0            0          ...
Employee          0        0                 0         0          1             0            0          ...
...               ...      ...               ...       ...        ...           ...          ...        ...
Student-employee  2        0                 0         1          3             1            0          ...
...               ...      ...               ...       ...        ...           ...          ...        ...
Data Selection: As described in Sect. 3, our methodology can be extended with other attributes such as users’ roles. The sample data shown in Table 2 includes user interactions along with the services that use the token service. Heuristic observation suggests a direct link between a user’s role and the set of activities executed.
Data Preprocessing: The selected data has an uneven distribution of groups, which can bias machine learning results. It contains 82.3% entries associated with the student role, 16.1% with employees, and 1.6% with student-employees. This reflects the actual distribution of user groups at the University. To overcome the problem of the imbalanced dataset, it is essential to apply balancing techniques. To strengthen the minority classes, one might use the SMOTE method [6] to oversample the data set. Alternatively, undersampling reduces the number of majority-class instances to obtain an even class distribution.
Data Conversion: As shown in Table 3, the data is transformed into a vectorized descriptive matrix in which every row is a unique user and every column represents an executed software component. The matrix holds aggregated data: repeated interactions increase the counter of the corresponding activity for the respective user and hence add weight to the descriptive vector.
Data Mining: We applied several machine learning techniques to identify the best-performing classification algorithm for our objectives. For our analysis, we used 3-fold cross-validation [26], so that in each fold 2/3 of the data serves as the training set and the remaining 1/3 as the validation set. The analysis was executed with RapidMiner Studio [10].
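The data conversion and balancing steps can be sketched as follows. The function names and the sample events are hypothetical, and the balancing shown is plain random oversampling, a simpler stand-in for SMOTE, which would synthesize new rows instead of copying existing ones.

```python
import random
from collections import Counter


def descriptive_matrix(events, roles):
    """events: list of (user_id, component); roles: {user_id: role}.

    Returns (columns, rows); each row is [role, count_1, ..., count_n],
    where repeated interactions increase the count for that component.
    """
    counts = {}
    for user, component in events:
        counts.setdefault(user, Counter())[component] += 1
    columns = sorted({component for _, component in events})
    rows = [[roles[u]] + [counts[u][c] for c in columns] for u in sorted(counts)]
    return columns, rows


def random_oversample(rows, seed=0):
    """Copy randomly chosen minority-class rows until all classes are equal in size."""
    rng = random.Random(seed)
    by_role = {}
    for row in rows:
        by_role.setdefault(row[0], []).append(row)
    target = max(len(group) for group in by_role.values())
    balanced = []
    for group in by_role.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


# Invented sample interactions for three users.
events = [
    ("u1", "GetInfo"), ("u1", "GetNotifications"),
    ("u2", "GetWeeks"),
    ("u3", "GetReport"), ("u3", "GetReport"), ("u3", "IsAuthorized"),
]
roles = {"u1": "Student", "u2": "Student", "u3": "Student-employee"}

columns, rows = descriptive_matrix(events, roles)
balanced = random_oversample(rows)
print(columns)
print(rows)
print(Counter(row[0] for row in balanced))
```

Oversampling should only be applied to the training portion of each fold; otherwise copies of the same row can appear in both training and validation data.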
Analysis Result: For the analysis, we used several algorithms, namely Random Forest, Support Vector Machine (SVM), Decision Tree, Deep Learning, and Naive Bayes. The resulting accuracy indicates the presence of noise and the need for further data preprocessing. Nevertheless, as shown in Fig. 4, role mining yields an average model accuracy of 85% for the Naive Bayes and SVM algorithms after oversampling with the SMOTE technique, and the overlapping error bars indicate that the data sets do not vary significantly. Therefore, the findings support the validity of the data collection approach used in our method and indicate that further outlier and noise analysis should yield a more satisfying result. Further investigation may be needed to determine whether falsely classified instances are correlated among algorithms and whether approaches like boosting can be used to combine multiple classifiers effectively.
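How an average accuracy with an error bar arises from 3-fold cross-validation can be sketched as below. The majority-class baseline is a hypothetical stand-in for the classifiers named above, and the labels are invented; the sketch only illustrates the fold mechanics, not the RapidMiner setup.

```python
import statistics


def k_fold_indices(n, k=3):
    """Yield (train, validation) index lists; each fold validates exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, validation in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation


def majority_baseline(train_labels):
    """Stand-in classifier: always predicts the most common training label."""
    return statistics.mode(train_labels)


def cross_validate(labels, k=3):
    """Return (mean accuracy, standard deviation) over the k folds."""
    accuracies = []
    for train, validation in k_fold_indices(len(labels), k):
        prediction = majority_baseline([labels[i] for i in train])
        hits = sum(labels[i] == prediction for i in validation)
        accuracies.append(hits / len(validation))
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Invented, imbalanced label list: 7 students and 2 employees.
labels = ["student"] * 7 + ["employee"] * 2
mean_accuracy, spread = cross_validate(labels)
print(mean_accuracy, spread)
```

On the unbalanced class distribution reported above, a baseline of this kind would already reach roughly 82% accuracy simply by predicting the student role, which is one reason to balance the classes before interpreting accuracy figures.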
Fig. 4. Model accuracy for role mining using RapidMiner studio.
5 Challenges and Future Work
Besides the difficulty of instrumenting the services to record and log user interactions at the software component level, we went through a number of iterations to arrive at a suitable data collection, convergence, and preprocessing strategy. Using our methodology, we obtained a high-quality dataset that is valuable for further data analysis projects. Despite the promising results of our process mining and data mining case studies, there is uncertainty about the convergence of recorded data from all involved services. This is due to multidimensional data sources and to the fact that not all software components require an authorization token to execute. Since the logging system is based on token validation, which is not required for every resource request, the dataset may be incomplete for specific data analysis scenarios. On further investigation of our recorded dataset, we encountered some incomplete or missing features, and we had to use data emulation methods to replace the missing values with reasonably realistic ones. Moreover, as privacy and ethics can become a concern in our approach, we used the SHA-512 algorithm to hash the user names into unique IDs before exporting the data. However, frequency analysis may still make it possible to deduce the behavior of individual users [17]. Furthermore, data collection is an ongoing process, meaning that there are streams of data that need to be evaluated on the fly for specific analysis goals rather than analyzed offline. Currently, we only capture the timestamps of incoming calls, which may not be sufficient for accurate performance analysis. Hence, we would like to extend our methodology with each software component’s processing time to obtain the performance of each method individually. However, this may produce an overwhelming amount of data that is hard to analyze, so further event log abstraction methods may need to be employed.
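The pseudonymization step can be illustrated with a minimal sketch based on Python's standard `hashlib` and `hmac` modules. The function names, the sample user names, and the keyed variant are assumptions for illustration; the paper states only that SHA-512 was used to turn user names into unique IDs.

```python
import hashlib
import hmac


def pseudonymize(user_name: str) -> str:
    """Map a user name to a fixed-length hexadecimal id with plain SHA-512."""
    return hashlib.sha512(user_name.encode("utf-8")).hexdigest()


def pseudonymize_keyed(user_name: str, secret_key: bytes) -> str:
    """Keyed variant (HMAC-SHA-512), an assumption not described in the paper.

    Without the key, ids cannot be recomputed by hashing guessed user names.
    """
    return hmac.new(secret_key, user_name.encode("utf-8"), hashlib.sha512).hexdigest()


# Invented user names and key.
plain_id = pseudonymize("ab123456")
keyed_id = pseudonymize_keyed("ab123456", b"example-secret-key")
print(plain_id[:16], keyed_id[:16])
```

Either variant maps the same user to the same ID in every export, so the frequency analysis mentioned above remains possible; hashing hides the name, not the behavior pattern.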
In the future, we would like to systematically characterize the existing offline data preprocessing methods and to integrate and automate this step within the overall architecture. We suggest further investigation of data stream analysis and of ways to extend this methodology to handle streams of data through continuous automated data preprocessing and preparation. Furthermore, one can enhance this method with social network analysis or metadata analysis to support research data provenance and to enable research data reusability.
6 Conclusion
In this paper, we described a novel and reliable reverse engineering method for obtaining real-life logs from distributed systems within the university services. Our approach allowed us to analyze dynamic system behavior with respect to users’ requests and the real-life operational processes that fulfill those requests. Furthermore, we described how this technique is useful for analyzing software components as well as user behavior. To assess the validity of the data collected by this technique, we conducted two case studies to demonstrate and evaluate our approach. The results of our evaluations suggest that our approach can deliver a promising basis for other data-driven projects. Although further improvements are necessary, as described in Sect. 5, the approach was able to discover hierarchical process models throughout the system. In addition, we showed that it supports performance analysis and the discovery of bottlenecks at the service level as well as the component level. The data generated by our approach therefore supports both the analysis of software components and the discovery of user behavior within interconnected distributed services. Overall, the contribution of this work is two-fold. On the one hand, it enables domain experts to reverse engineer the architecture of the distributed system and monitor its status. On the other hand, it allows the analysis and extraction of new insights about usage patterns within a distributed environment.
References
1. Ackermann, C., Lindvall, M., Cleaveland, R.: Recovering views of inter-system interaction behaviors. In: 2009 16th Working Conference on Reverse Engineering, pp. 53–61. IEEE (2009)
2. Beschastnikh, I., Brun, Y., Ernst, M.D., Krishnamurthy, A.: Inferring models of concurrent systems from logs of their behavior with CSight. In: Proceedings of the 36th International Conference on Software Engineering, pp. 468–479 (2014)
3. Briand, L.C., Labiche, Y., Leduc, J.: Toward the reverse engineering of UML sequence diagrams for distributed Java software. IEEE Trans. Softw. Eng. 32(9), 642–663 (2006)
4. Buijs, J.C.A.M., van Dongen, B.F., van der Aalst, W.M.P.: On the role of fitness, precision, generalization and simplicity in process discovery. In: OTM Confederated International Conferences “On the Move to Meaningful Internet Systems”, pp. 305–322. Springer (2012)
5. Cai, L., Zhu, Y.: The challenges of data quality and data quality assessment in the big data era. Data Sci. J. 14 (2015)
6. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
7. Cheng, L., van Dongen, B.F., van der Aalst, W.M.P.: Efficient event correlation over distributed systems. In: Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 1–10. IEEE Press (2017)
8. Donoho, D.: 50 years of data science. 337:2015 (2015). http://courses.csail.mit.edu/18
9. Günther, C.W., Rozinat, A.: Disco: discover your processes. BPM (Demos) 940, 40–44 (2012)
10. Hofmann, M., Klinkenberg, R.: RapidMiner: Data Mining Use Cases and Business Analytics Applications. CRC Press, Boca Raton (2013)
11. Leemans, M., van der Aalst, W.M.P.: Process mining in software systems: discovering real-life business transactions and process models from distributed systems. In: 2015 ACM/IEEE 18th International Conference on Model Driven Engineering Languages and Systems (MODELS), pp. 44–53. IEEE (2015)
12. Moe, J., Carr, D.A.: Using execution trace data to improve distributed systems. Softw.: Pract. Exp. 32(9), 889–906 (2002)
13. Politze, M.: Extending OAuth2 to join local services into a federative SOA. In: EUNIS 23rd Annual Congress; Shaping the Digital Future of Universities, pp. 124–132 (2017)
14. Politze, M., Decker, B.: Extending the OAuth2 workflow to audit data usage for users and service providers in a cooperative scenario. In: DFN-Forum Kommunikationstechnologien, pp. 41–50 (2017)
15. Press, G.: A very short history of data science. Forbes.com (2013)
16. Provost, F., Fawcett, T.: Data science and its relationship to big data and data-driven decision making. Big Data 1(1), 51–59 (2013)
17. Rafiei, M., von Waldthausen, L., van der Aalst, W.M.P.: Ensuring confidentiality in process mining. In: SIMPDA, pp. 3–17 (2018)
18. Salah, M., Mancoridis, S.: Toward an environment for comprehending distributed systems. In: WCRE, pp. 238–247 (2003)
19. Shiffler, R.E.: Maximum z scores and outliers. Am. Stat. 42(1), 79–80 (1988)
20. Valdez, A.C., Yazdi, M.A., Ziefle, M., et al.: Orchestrating collaboration-using visual collaboration suggestion for steering of research clusters. Proc. Manuf. 3, 363–370 (2015)
21. van der Aalst, W.: Process Mining: Discovery, Conformance and Enhancement of Business Processes, vol. 2. Springer, Heidelberg (2011)
22. van der Aalst, W., Weijters, T., Maruster, L.: Workflow mining: discovering process models from event logs. IEEE Trans. Knowl. Data Eng. 16(9), 1128–1142 (2004)
23. van der Aalst, W.M.P., De Medeiros, A.K.A., Weijters, A.J.M.M.: Genetic process mining. In: International Conference on Application and Theory of Petri Nets, pp. 48–69. Springer (2005)
24. van der Aalst, W.M.P.: Process Mining: Data Science in Action. Springer, Heidelberg (2016)
25. van Hoorn, A., Rohr, M., Hasselbring, W., Waller, J., Ehlers, J., Frey, S., Kieselhorst, D.: Continuous monitoring of software services: design and application of the Kieker framework (2009)
26. Wiens, T.S., Dale, B.C., Boyce, M.S., Kershaw, G.P.: Three way k-fold cross-validation of resource selection functions. Ecol. Modell. 212(3–4), 244–255 (2008)
27. Yazdi, M.A.: Enabling operational support in the research data life cycle. In: Proceedings of the First International Conference on Process Mining, pp. 1–10. CEUR (2019)
Factors Affecting Students’ Motivation for Learning at the Industrial University of Ho Chi Minh City
Nguyen Binh Phuong Duy, Liu Cam Binh, and Nguyen Thi Phuong Giang
Industrial University of HCM City-IUH, Ho Chi Minh City, Vietnam
{nguyenbinhphuongduy, nguyenthiphuonggiang}@iuh.edu.vn, [email protected]
Abstract. This paper examines factors that affect students’ motivation for learning, including teacher behavior, teaching method, learning environment, and learning goal orientation, using regression analysis of data collected from one hundred ninety-six students at the Industrial University of Ho Chi Minh City, Vietnam. In addition, the researcher investigates whether students’ gender, age, and field of study have any influence on student motivation through Independent Sample T-tests and One-Way ANOVA tests. The results show that learning goal orientation has the most influence on student motivation, whereas there is no difference between groups by gender, age, or field of study. Furthermore, the independent variables explain about 30.22% of the variance of student motivation, implying that many other factors influencing student motivation remain for further study.
Keywords: Motivation for learning · Learning goal orientation · Teacher behavior · Teaching methodology
1 Literature Review
1.1 Motivation and Motivation for Learning
Motivation is a psychological term often used in the field of education and is understood as effort and commitment towards goals. Motivation is not explained as the result of a process starting from any “motive”. A “motive” is the reason for doing a specific thing temporarily, with an ambiguous and relatively superficial goal, while “motivation” is the reason for doing something over the long term and is broader than a motive. Motive and motivation may coincide in a given time period, but in education it is more appropriate to use the concept of motivation [19]. Motivation is what drives us to take action; it is our inspiration to accomplish something. Indeed, motivation has long been viewed as the primary cause of personal behavior. Motivation is defined as the act or process of motivating: stimulation and interaction created to encourage effort in certain individuals. In summary, motivation is something (such as a need or desire) that drives the action of an individual [12].
© Springer Nature Switzerland AG 2021 K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 239–262, 2021. https://doi.org/10.1007/978-3-030-63089-8_15
Complex concepts of motivation often emphasize direct stimulation of the individual: inner self-effort or encouragement from the external environment [9, 14]. The difficulty of defining motivation has been attributed to too many “philosophical orientation on human nature and on what can be known about people”, and motivation has been described as “a set of energies originating from both inside and outside of an individual to initiate a related action, which is oriented, intensity and duration”. According to these authors, motivation is difficult to study because the research must focus on human nature. As mentioned, some researchers believe that motivation is the only factor that directly affects students’ academic success, and that all other factors affect success through motivation [15]. Specifically, students’ motivation to study reflects their level of orientation, concentration, and effort in learning the contents of a subject [13]. Learning makes more sense when people participate for their own personal gain rather than to satisfy an external need [3, 7, 11]. It has been argued that intrinsic motivation leads learners to seek out or accept clear and complex learning experiences, which give them an opportunity to challenge their view of the world and thus promote their abstract thinking. [5] shows that “tangible rewards can weaken the internal motivation of individuals”. In short, motivation is a fundamental factor when considering student learning. Instructors can help increase and develop learning motivation, helping students achieve optimal performance in the classroom. In addition, creating a favorable learning environment, setting clear learning goals, and lecturing with enthusiasm can help students find joy and excitement in learning [17].
1.2 Related Literature
Five factors improving student motivation (the Williams brothers’ study)
The objective of the study is to examine how the student, the faculty, the content, the teaching methods/processes, and the learning environment affect learning motivation, and to find the best way to increase it. The authors use a qualitative method but do not describe the specific research design. In the results section, the authors answer the question “What is the best way to promote student learning?”. They argue that it is important to consider these five components, each of which can increase or hinder student motivation [18]. For each component, the authors also propose ways to increase rather than hinder students’ learning motivation throughout the learning process. The paper broadly evaluates the factors affecting learning motivation. During their studies in the lecture hall, students interact with teachers, friends, and the learning environment, and these are the factors to consider in order to increase their motivation. The authors use a qualitative method with very convincing arguments. The difficulty of such research lies in the need to read and refer to many documents and to have long-term experience in education in order to make profound arguments that persuade readers.
Research by Klein, Noe and Wang, “Motivation and learning outcomes” (2006)
In this study, the authors consider the factors affecting learning outcomes through an intermediate variable, learning motivation. The purpose of the study is to examine how learning goal orientation (LGO), delivery mode, and the perception of barriers and enablers affect learning motivation and learning outcomes. The study used a survey conducted directly on a website, with the participation of 600 students in many training courses. Quantitative research is the main method used. The authors derived their model from the in-depth research of Colquitt, LePine, and Noe and from the “input-process-output” (IPO) learning model of Brown and Ford. Accordingly, the “training motivation” theory recognizes that learning motivation has a direct effect on learning outcomes. In addition, individual characteristics and contextual factors are considered to have direct and indirect effects on motivation and learning outcomes. The IPO model links delivery mode to learning outcomes through active learning, which includes learning motivation. It also indicates that the delivery mode can have different effects on motivation and on subsequent learning outcomes. Learning motivation is influenced by learner characteristics, instructional characteristics, and perceived barriers and enablers [10] (Fig. 1).
[Figure 1: diagram with the boxes Learner Characteristics (Learning goal orientation), Perceived Barriers/Enablers, Instructional Characteristics (Delivery Mode), Motivation to Learn, and Course Outcomes.]
Fig. 1. Model of factors affecting learning outcomes of Klein et al.
The results of the study showed a significant positive relationship between the three factors (learner characteristics, perceived barriers/enablers, and instructional characteristics) and learning motivation and final learning outcomes. Because the research relies on learners’ personal perceptions and evaluations, the use of an online survey makes the study less reliable. The study used many scales as well as confusing questions, and the online format meant that the researchers lost control over the respondents. The research applies to the field of education rather than to an organizational training context, which is reasonable considering the authors’ research objectives and context. In fact, the authors also point out that the results can be generalized to non-academic settings. In addition, the study does not examine whether factors other than these three create or increase motivation; specifically, among learner characteristics, the authors consider only one factor, learning goal orientation.
Research “Factors affecting the learning motivation of students of Bahauddin Zakariya University, Multan (Pakistan)”
The objective of the study is to explore factors affecting learning motivation. The study surveyed a sample of 300 people selected by stratified sampling and used descriptive statistics, correlation, analysis of variance, and reliability analysis. The results show that effective teaching methods, an appropriate learning environment, and active learning can increase students’ motivation for learning. In particular, encouraging a dynamic learning environment, for example by creating opportunities for debate or discussion, fostering collaborative learning, and working in small groups, can amplify students’ learning motivation. In contrast, pressure from a large volume of lessons, outdated teaching methods, and large class sizes reduce students’ interest and their motivation to study [16]. The research is more exploratory than a test of the impact of factors, although the author still uses many quantitative methods in the analysis. The study does not elaborate on the related theories; the author presents the problem only briefly, which makes the issues relatively difficult for readers to grasp.
2 Research Model and Hypotheses
Faculty behavior is often the factor of greatest interest in research on learning motivation. Lecturers are directly involved in imparting knowledge to students, so their behavior can support or hinder students’ learning motivation. Lecturers who have good qualifications and pedagogical skills and who take an interest in their students contribute to increased learning motivation [16, 18]. Many studies show that students’ personal factors, such as clear learning goals, contribute to learning motivation [9, 10, 16, 18]. Previous studies also agree that teaching methods and the learning environment have a positive impact on learning motivation [10, 16, 18]. In addition, according to [10], the mode of information delivery also affects learning motivation; in particular, the authors compare courses taught directly at the educational institution with distance learning. Teaching content is also thought to have an impact on learning motivation [18]. Students with clear academic goals often choose a higher education institution based on its field of expertise, more specifically on the content that the university will convey to them. Students who choose a university for training in economics rather than technology do so because of their interests and their future career orientation. Nowadays, with the development of information technology and the improvement of university enrollment procedures, information about the content of teaching and training is easy to obtain. Table 1 summarizes the factors affecting students’ learning motivation from the five studies reviewed. The four factors covered by most of the studies in the table are faculty behavior, student learning goal orientation, teaching methods, and learning environment.
Based on the studies of [10, 11, 17–19], the author proposes a model of four factors affecting learning motivation: teacher behavior, student learning goals, learning environment, and teaching methods.
Table 1. Summary of factors affecting students’ learning motivation

Factor                              [16]  [17]  [18]  [10]  [9]
Faculty behavior                    x     x     x
Student learning goal orientation   x     x     x     x     x
Teaching methods                    x     x     x     x
Learning environment                x     x     x     x
Mode of information delivery                          x
The important components that influence students’ motivation are the student, the instructor, the teaching content, the methods/processes, and the environment. For example, students must have a significant interest in education and perceive the value that learning brings. Lecturers must be well trained, follow the educational process, support students and meet their reasonable needs, and above all be able to inspire. The content conveyed must be accurate, timely, and appropriate to students’ present and future needs. Teaching methods or processes must be creative, interesting, and rewarding, and must provide tools that students can apply in real life. The learning environment must be safe and positive and must promote both individual work and teamwork [6].
2.1 Lecturer Factor (Faculty Behavior)
In university teaching, the teacher is the subject of teaching activities and plays a leading role in the teaching process. Instructors organize, control, and direct students’ activities, ensuring that students fulfill the prescribed requirements fully and with high quality, in accordance with the purpose of university teaching [4]. A teacher’s passion has a significant impact on the energy of the class; it enhances the perceived value of the work and appeals to students, making them want to know more. Motivation plays an important role in teachers’ pedagogy: lecturers need to think about ways to stimulate their students’ motivation. Teachers can empower and support their students. A quality learning environment, in which students have full support from the school and the teacher, helps students develop and become more interested in the subject [17]. Students draw much of their motivation to learn from their instructors. However, education is not about cultivating traits that make a teacher popular with students; teachers need to focus on many factors, such as sound professional knowledge, clear classroom assessment, and effective teaching methods, which in turn encourage students. In addition, lecturers need timely training to keep up with new teaching trends [18].
H1: Faculty behavior has a positive impact on student learning motivation.
2.2 Student Factors (Orientation of Student’s Learning Goals)
On the one hand, students are the object of teaching activities; on the other hand, in university teaching they are active, independent, and creative subjects of cognitive activities of a research nature, who acquire the knowledge and skills relevant to their future career [4]. Therefore, through their learning activities, students must become aware of the issues defined by the content of university teaching. The characteristics of students or learners are expressed through their research orientation and learning style. Students who take courses are often guided by the desire for a degree or another reward. Competition is also a trait among students: when members compete, taking the highest position in the class becomes more meaningful to them. The characteristics of learners therefore affect learning motivation [9]. Among learner characteristics, learning goal orientation (LGO) is a factor of particular interest, because many studies show that LGO strongly influences learning and the allocation of effort in learning [8]. This is consistent with the study of [10], in which the authors analyze only learning goal orientation among the characteristics of learners.
H2: Student learning goal orientation has a positive impact on student learning motivation.
2.3 Classroom Learning Environment
According to Hinde-McLeod & Reynolds (2007), cited in [2], “creating the right learning environment can support the development of students in the classroom”. It is a place where students enjoy their studies and develop themselves. Similarly, [18] considers the environment an important component in increasing students’ learning motivation.
H3: The classroom learning environment has a positive impact on student motivation.
2.4 Teaching Methods
Dang Vu Hoat and Ha Thi Duc [4] state in “Theory of university teaching” that “teaching methods are stipulated by the content of teaching; in other words, the content of teaching governs the choice of teaching method in university”. The appropriate selection and use of teaching methods makes the course content an important part of university students’ experience, thereby motivating them to master their basic knowledge, their specialized knowledge, and their future career orientation. According to Alderman (1990), cited in [18], the teaching method is the way teachers approach students. The two basic approaches to supporting and increasing motivation in the classroom are (1) creating a classroom structure and teaching methods that provide a more motivating learning environment, encourage enthusiastic participation, and bring out students’ optimal learning capacity; and (2) helping students develop tools that allow them to understand and regulate themselves.
H4: Teaching method has a positive impact on student motivation.
3 Research Methods
3.1 Qualitative Research
The research process uses both qualitative and quantitative methods. Qualitative research aims to adjust the scales of previous studies to suit the object and scope of this study, namely students of the Industrial University of Ho Chi Minh City. Based on previous studies and the theoretical basis, the paper proposes four factors affecting the dependent variable, the “learning motivation” of students; in total, 30 observed variables represent the five factors (the four independent factors and the dependent variable). Implementation steps: (1) prepare a discussion outline for the group discussion, and (2) collect information from the research subjects. The discussion outline consists of two parts: an introduction and screening of the participants, and the discussion itself. The procedure was to build the group discussion outline, conduct the group discussion according to that outline, and summarize the results. The results of the group discussion showed that the 30 observed variables representing the initial five factors are highly representative, and that the participants fully understood the previously constructed factors affecting learning motivation. However, the final scale has some adjustments.
3.2 Quantitative Research
Quantitative research uses a convenience sampling method. Data were collected through a survey of students studying at the Industrial University of Ho Chi Minh City, using direct interviews at the places where the students study, under specific guidance from the interviewer. The quantitative study uses exploratory factor analysis (EFA) and multiple linear regression (MLR) analysis. For EFA, the minimum sample size should be 50, preferably 100, and the minimum ratio of observations to measured variables should be 5:1, preferably 10:1 (Hair et al., 2006, cited in [13]). In this study, the total number of observed variables is 27, so the minimum sample size is 27 × 5 = 135 and the preferred sample size is 27 × 10 = 270. For MLR analysis, the sample size is often calculated with the empirical formula n ≥ 50 + 8p, where n is the minimum sample size needed and p is the number of independent variables in the model. According to Green (1991), quoted in [13], “this formula is suitable if p < 7”. Applying this formula with p = 4 gives a minimum sample size of n = 82. To satisfy both requirements, the minimum sample size for this study is 135 and the preferred size is 270. In this study, a sample size of 200 was selected.
3.3 Data Analysis Method
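The sample-size rules of Sect. 3.2 can be checked with a few lines of arithmetic. The sketch below is only an illustration; the function names are hypothetical and not part of the study, while the inputs (27 observed variables, 4 independent variables) are the figures reported there.

```python
def efa_sample_sizes(n_observed_variables: int) -> tuple[int, int]:
    """Return (minimum, preferred) sample sizes from the 5:1 and 10:1 ratios."""
    return 5 * n_observed_variables, 10 * n_observed_variables


def mlr_min_sample_size(n_predictors: int) -> int:
    """Return the minimum n from the rule of thumb n >= 50 + 8p."""
    return 50 + 8 * n_predictors


# Figures reported in the study: 27 observed variables, 4 independent variables.
efa_min, efa_preferred = efa_sample_sizes(27)
mlr_min = mlr_min_sample_size(4)

# The study must satisfy both rules, so the larger minimum applies.
required_min = max(efa_min, mlr_min)

print(efa_min, efa_preferred, mlr_min, required_min)  # 135 270 82 135
```

With these rules, the 196 usable questionnaires reported in Sect. 3.3 lie above the minimum of 135 but below the preferred 270.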
To conduct the data analysis, 220 questionnaires were used for the survey. After collection, the questionnaires were checked and unsatisfactory ones were eliminated; the data were then coded, entered, and cleaned. The data analysis was performed in SPSS 20 on the 196 satisfactory questionnaires.
Sample analysis. Descriptive analysis is used to describe properties of the research sample such as gender, age, and field of study. The main method used in this part is the statistical analysis of frequencies and percentages.
Testing and evaluating the scale. To assess the scales of the research concepts, we need to check their reliability and validity. We use Cronbach’s Alpha reliability coefficient and the item-total correlation to eliminate observed variables that do not contribute to describing the concept being measured; the “Cronbach’s Alpha if Item Deleted” coefficient to judge whether eliminating an observed variable would improve the reliability of the scale; and exploratory factor analysis (EFA) to check the validity of the scales.
Cronbach’s Alpha analysis: scales with a Cronbach’s Alpha coefficient greater than 0.6 are considered reliable. In addition, the corrected item-total correlation of each measurement variable is considered satisfactory if it is greater than or equal to 0.3. If the item-total correlation of a measurement variable is greater than 0.3 but much smaller than that of the remaining variables, we may still consider removing the variable. Thus, in Cronbach’s Alpha analysis, we remove scales with small coefficients (α < 0.6) and observed variables with small corrected item-total correlations (< 0.3), because these observed variables are unsuitable or not meaningful for the scale.
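To make the two reliability criteria concrete, the following minimal sketch computes Cronbach’s Alpha and the corrected item-total correlation in plain Python. The study itself used SPSS 20; the function names and the small 5-point Likert data set below are hypothetical and serve only to illustrate the formulas.

```python
from statistics import mean, variance


def cronbach_alpha(items):
    """items[j][i] is the score of respondent i on item j."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # total score per respondent
    item_variance_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def corrected_item_total(items, j):
    """Correlation of item j with the sum of the remaining items."""
    rest = [sum(scores) - scores[j] for scores in zip(*items)]
    return pearson(items[j], rest)


# Hypothetical answers of six respondents to a three-item scale.
scale = [
    [4, 5, 3, 4, 2, 5],
    [4, 4, 3, 5, 2, 4],
    [3, 5, 2, 4, 1, 5],
]

alpha = cronbach_alpha(scale)
item_total = [corrected_item_total(scale, j) for j in range(len(scale))]

print(f"alpha = {alpha:.3f}, reliable: {alpha > 0.6}")
for j, r in enumerate(item_total, start=1):
    print(f"item {j}: corrected item-total r = {r:.3f}, keep: {r >= 0.3}")
```

An item would be a candidate for removal when its corrected item-total correlation falls below 0.3, and a whole scale when its alpha falls below 0.6, matching the thresholds stated above.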
Exploratory Factor Analysis (EFA): After unreliable variables are removed through Cronbach’s Alpha analysis, EFA is used to assess convergent validity and discriminant validity and, at the same time, to reduce the number of estimated parameters for each group of variables. For the scale to converge, the factor loading of each variable on its factor must be greater than or equal to 0.5 (0.4 ≤ factor loading < 0.5 is considered important; factor loading > 0.5 is considered to be of practical significance). To achieve discriminant validity, the difference between the loadings of a variable on two factors must be greater than or equal to 0.3 (λiA − λiB ≥ 0.3). However, we need to consider the content value before deciding whether to remove a measurement variable [13]. The number of factors is determined from the eigenvalue, which represents the variance explained by each factor. Factors are retained up to the last factor whose eigenvalue is at least 1 (≥ 1); factors with eigenvalues smaller than 1 are excluded from the model. Variance-explained criterion: the total extracted variance must be 50% or more. In this study, we use the principal components extraction method with Varimax rotation and stop at factors with eigenvalues greater than or equal to 1.
Regression analysis. After the EFA, the extracted factors are entered into a linear regression. The regression analysis aims to confirm the appropriateness of the research model and to test the hypotheses, determining the degree to which each factor affects the dependent variable. The multiple linear regression (MLR) model for the study is

$$DL = \beta_0 + \beta_1 \, GV + \beta_2 \, SV + \beta_3 \, MT + \beta_4 \, PP \qquad (1)$$

where DL is learning motivation, GV is lecturer behavior, SV is student’s learning goals orientation, MT is the learning environment, and PP is teaching methods.
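The two factor-retention criteria described above (eigenvalue ≥ 1 and total extracted variance ≥ 50%) can be sketched as follows. The eigenvalues are hypothetical, since the study reports only the last retained eigenvalue, and the sketch assumes principal components extracted from a correlation matrix, where the eigenvalues sum to the number of variables.

```python
def retained_factors(eigenvalues, min_eigenvalue=1.0):
    """Keep factors whose eigenvalue is at least min_eigenvalue."""
    ordered = sorted(eigenvalues, reverse=True)
    return [ev for ev in ordered if ev >= min_eigenvalue]


def variance_explained_pct(retained, n_variables):
    """Share of total variance explained by the retained factors, in percent.

    Assumes a correlation matrix, whose eigenvalues sum to n_variables.
    """
    return 100.0 * sum(retained) / n_variables


# Hypothetical eigenvalues for six standardized variables (they sum to 6).
eigenvalues = [2.4, 1.3, 0.9, 0.6, 0.5, 0.3]

retained = retained_factors(eigenvalues)
pct = variance_explained_pct(retained, n_variables=len(eigenvalues))

print(f"factors retained: {len(retained)}")
print(f"variance explained: {pct:.2f}% (criterion met: {pct >= 50})")
```

In this toy case two factors are retained and together explain a little under 62% of the variance, so both criteria are satisfied.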
4 Data and Results
4.1 Sample Characteristics of the Survey
The variables used to describe the sample are gender, year of study, and major. Regarding gender, the sample is fairly balanced: men account for 46.9% of respondents, slightly fewer than women. Fourth-year students account for nearly 60% of the sample; these students have almost completed their programs and have studied and interacted with many lecturers and many different teaching methods. The banking and finance group accounts for the highest proportion, more than 50% of the sample. Students from other disciplines account for 8.7%; most of them are first- and second-year students who have not yet been assigned a major under the regulations of the Industrial University of Ho Chi Minh City (see also Appendix A.1).
4.2 Test and Evaluate the Scale
Cronbach’s Alpha test. The scales used in the study are tested for reliability with Cronbach’s Alpha. The results show that the initial scales are reliable (Cronbach’s Alpha coefficient of 0.60 or higher). All five scales have high reliability values, in the range 0.7–0.90, as shown in Table 2.

Table 2. Summary of Cronbach’s Alpha coefficients for the 5 scales

No.  Scale                                   Number of observed variables  Cronbach’s Alpha
1    Teacher behavior                        7                             0.856
2    Orientation of student learning goals   6                             0.789
3    Study environment                       3                             0.765
4    Teaching methods                        7                             0.874
5    Motivation for learning                 4                             0.763
Exploratory Factor Analysis (EFA). After the Cronbach’s Alpha analysis, the four independent variables and the one dependent variable of the research model, with all 27 observed variables, were retained unchanged for EFA. The independent variables are analyzed together, and the dependent variable “learning motivation” is analyzed separately. In the factor analysis, the author used a principal extraction method with orthogonal rotation, stopping at factors with eigenvalues greater than 1.
EFA analysis of independent variables. The independent variables included in the EFA are faculty behavior (7 observed variables), student’s learning goals orientation (6 observed variables), learning environment (3 observed variables), and teaching methods (7 observed variables). The “Principal Axis Factoring” extraction method with “Varimax” rotation is used. The results are presented in Appendix A.2. Specifically, the KMO and Bartlett tests give KMO = 0.835 > 0.5 and sig < 0.05, so the data are suitable for factor analysis. The extracted variance is 57.409% (> 50%), indicating that the four extracted factors explain 57.409% of the variation in the data, so the scale is satisfactory. Extraction stops at factor 4, with eigenvalue = 1.518. Factors are extracted together with their observed variables and the corresponding factor loadings (only loadings > 0.3 are shown). All observed variables have a loading > 0.5 on their factor, so they measure the concepts we intend to measure. The Rotated Component Matrix shows how the observed variables converge into factor groups. The components student’s learning goals orientation (SV2, SV6, SV4, SV3, SV5 and SV1) and learning environment (MT1, MT3, MT2) all converge on their intended factors, as stated in the summary of the scale. The observed variable PP1 converges with the first factor group (GV3, GV5, GV4, GV7, GV2, GV1 and GV6), so respondents appear to regard “the use of the group discussion method” as highly dependent on the lecturer; in subsequent analyses PP1 is used to measure the faculty behavior factor. The variable PP2, “using modern teaching methods”, loads on both the “teacher behavior” factor and the “teaching method” factor.
However, the loading of PP2 on factor 1 is lower than its loading on factor 2 (0.346 < 0.5, versus 0.669), so we keep PP2 in factor 2. Removing the observed variable PP1 from the “teaching method” scale increases that scale’s Cronbach’s Alpha to 0.883, while adding PP1 to the “faculty behavior” scale, as the factor analysis suggests, increases that scale’s Cronbach’s Alpha to 0.868.
EFA analysis of the dependent variable “learning motivation”. The dependent variable is measured by four observed variables, DL1, DL2, DL3, and DL4. The results are summarized in Appendix A.2. As for the independent variables, the KMO and Bartlett tests show KMO = 0.766 (> 0.5) and a sig value of 0.000 for the Chi-square statistic, so the data are suitable for factor analysis. One factor is extracted for learning motivation, which is consistent with the original theory and scale. The variance extracted is 58.433% (> 50%) and the eigenvalue is 2.337 (> 1), which is satisfactory. All observed variables have factor loadings > 0.5.
Adjusting the scale: After the EFA, the observed variable PP1 converges on the “lecturer behavior” scale. Thus the measurement variables of the two scales “teacher behavior” and “teaching method” are adjusted. The remaining three scales, “orientation of students’ learning goals”, “learning environment”, and “learning motivation”, are left unchanged for the regression analysis. The factor analysis of the independent variables extracted four factors, consistent with the theoretical basis and the originally proposed research model. Therefore, the research model after testing is the same as the originally proposed model, shown in Fig. 2.
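The treatment of the cross-loading variable PP2 can be illustrated with a small check of the convergent (loading ≥ 0.5) and discriminant (loading gap ≥ 0.3) rules from Sect. 3.3. The helper below is a hypothetical sketch, not the SPSS procedure; the PP2 loadings are those reported in the text, and the second item is invented for contrast.

```python
def check_item(loadings, min_loading=0.5, min_gap=0.3):
    """Assign an item to its strongest factor and apply both loading rules.

    loadings maps a factor label to the item's rotated loading on it.
    """
    ranked = sorted(loadings.items(), key=lambda kv: abs(kv[1]), reverse=True)
    top_factor, top_loading = ranked[0]
    second_loading = abs(ranked[1][1]) if len(ranked) > 1 else 0.0
    return {
        "factor": top_factor,
        "convergent": abs(top_loading) >= min_loading,
        "discriminant": abs(top_loading) - second_loading >= min_gap,
    }


# PP2 as reported: 0.346 on factor 1 and 0.669 on factor 2.
pp2 = check_item({"factor 1": 0.346, "factor 2": 0.669})

# Hypothetical item that loads almost equally on two factors.
ambiguous = check_item({"factor 1": 0.52, "factor 2": 0.45})

print("PP2:", pp2)
print("ambiguous item:", ambiguous)
```

For PP2 the gap is 0.669 − 0.346 = 0.323, which is just above 0.3, so keeping PP2 on factor 2 is consistent with the stated rule.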
[Figure: faculty behavior (H1+), orientation of student’s learning goals (H2+), classroom learning environment (H3+), and teaching methods (H4+) each point to motivation for learning.]
Fig. 2. Research model
4.3 Regression Analysis
Correlation analysis. Building a correlation coefficient matrix lets us examine the linear correlations between the variables in the model: between the independent variables and the dependent variable, and among the independent variables themselves. If the independent variables are strongly correlated with each other, we must pay attention to multicollinearity. In addition, an independent variable can be removed if it is not correlated with the dependent variable. The results show significant correlations between the independent variables and the dependent variable (sig < 0.05). The variable SV, “orientation of learning goals”, has the highest correlation with the dependent variable DL, “learning motivation” (0.479). The independent variables are also significantly correlated with each other, in particular GV and PP, with a relatively high correlation of 0.494, so we consider multicollinearity in the regression analysis (Appendix A.3).
Regression analysis. The independent variables (GV, SV, MT, PP) and the dependent variable (DL) are entered into the model by the Enter method (all at once), because the hypotheses are that lecturer behavior, student’s learning goals orientation, learning environment, and teaching methods each positively affect students’ learning motivation. The regression results are presented in Tables 3(a), 3(b) and 3(c). The coefficient of determination is R² = 0.308 (≠ 0). R² tends to be an optimistic estimate of how well the model fits the data when there is more than one explanatory variable. Using the adjusted coefficient of determination, R²adj = 0.294, to describe the fit of the model is therefore safer and more accurate.
Table 3. (a) Model summary table. (b) ANOVA table. (c) Regression weight table

(a)
Model  R      R²     R²adj  Std. error of the estimate
1      0.555  0.308  0.294  0.53925

(b)
Model 1     Sum of squares  DF   Mean square  F       Sig
Regression  24.771          4    6.193        21.297  0.000
Residual    55.541          191  0.291
Total       80.312          195

(c) Correlations: Cor, Partial, Part. Collinearity: T (tolerance), VIF.
          B       SE     Standardized β  t       Sig    Cor    Partial  Part    T      VIF
Constant  1.151   0.312                  3.688   0.000
GV        -0.005  0.075  -0.005          -0.068  0.946  0.237  -0.005   -0.004  0.690  1.449
SV        0.417   0.066  0.395           6.272   0.000  0.479  0.413    0.377   0.912  1.096
MT        0.052   0.056  0.063           0.930   0.354  0.232  0.067    0.056   0.780  1.283
PP        0.192   0.052  0.265           3.670   0.000  0.398  0.257    0.221   0.692  1.444
In the ANOVA table (Table 3(b)), the F-test shows a significance level of sig = 0.000 < 0.05. Thus the regression model is appropriate: the independent variables in the model explain nearly 30.22% of the variance of the learning motivation variable, and the remaining 69.78% is due to other factors not included in the model. Table 3(c) shows that the variables SV and PP influence students’ learning motivation, because their significance levels are less than 0.05. These variables have a positive impact on the dependent variable DL, since their Beta coefficients are positive. Comparing the two, βSV is greater than βPP, so SV has a much stronger impact on DL than PP. Both SV and PP have a variance inflation factor VIF < 2, which meets the requirement. As such, setting clear learning goals increases student motivation; similarly, effective teaching methods increase student motivation. The variables GV and MT also have VIF < 2, but they are not statistically significant, with significance levels of 0.946 and 0.354 (> 0.05), so we consider removing these two variables from the model (Fig. 2). GV has a negative effect on DL (βGV = −0.005), while MT has a positive effect on DL (βMT = 0.063). However, the correlation coefficients (Cor) in Table 3(c) show that both GV and MT are positively related to DL, with correlation coefficients of 0.237 and 0.232, respectively.
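The summary statistics in Table 3(a) follow directly from the sums of squares and degrees of freedom in Table 3(b). The sketch below recomputes them as a consistency check; it uses only the published figures and standard formulas, and the variable names are illustrative.

```python
from math import sqrt

# Sums of squares and degrees of freedom as reported in Table 3(b).
ss_regression, df_regression = 24.771, 4
ss_residual, df_residual = 55.541, 191
ss_total = ss_regression + ss_residual   # 80.312
df_total = df_regression + df_residual   # 195

# Coefficient of determination and its adjusted version.
r2 = ss_regression / ss_total
adj_r2 = 1 - (1 - r2) * df_total / df_residual

# Mean squares, F statistic, and standard error of the estimate.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_stat = ms_regression / ms_residual
std_error = sqrt(ms_residual)

print(f"R2 = {r2:.3f}, adjusted R2 = {adj_r2:.3f}")
print(f"F = {f_stat:.2f}, std. error of the estimate = {std_error:.5f}")
```

The recomputed values agree with Table 3(a) and 3(b) to rounding (R² = 0.308, R²adj = 0.294, F ≈ 21.30, standard error ≈ 0.53925).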
In the regression weight table, the VIF values of GV and MT are both less than 2. The correlation coefficient between GV and MT is 0.420 (Appendix A.3), which is quite high compared with the correlations among the other variables. In Table 3(c), both the partial correlation Pcor(GV, DL) (the correlation between GV and DL when the linear effect of the other independent variables on both DL and GV is removed) and the semi-partial correlation Scor(GV, DL) (the correlation between GV and DL when the linear effect of the other independent variables on GV alone is removed) are negative. This means that the part of GV that explains DL is already accounted for by the remaining variables. In this case, we cannot conclude that lecturer behavior has no impact on learning motivation, because faculty behavior is reflected in the remaining independent variables. Consider also the correlation coefficients between GV and the other independent variables: cor(GV, SV) = 0.214, cor(GV, MT) = 0.420, cor(GV, PP) = 0.494. The correlation between GV and the other independent variables is relatively high, which leads to multicollinearity even though the VIF of GV is satisfactory (VIF = 1.449 < 2). After the GV variable is removed, the regression gives R² = 0.308, so adding GV to the model does not increase the coefficient of determination R². In contrast, R² decreases to 0.305 after the variable MT is removed. On the basis of these arguments, we remove the GV variable from the model and run a second regression. The results are shown in Appendix A.3. After GV is removed, the MT variable has a significance level of sig = 0.342 > 0.05 in the regression weight table. Thus, we still consider removing MT from the model because it is not statistically significant.
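The VIF values discussed here are the reciprocals of the tolerances (column T) in Table 3(c), and 1 − T is the share of a predictor’s variance explained by the other predictors. A minimal sketch using the reported tolerances follows; small differences from the published VIFs come from the tolerances being rounded to three decimals.

```python
# Tolerances (column T) and VIFs as reported in Table 3(c).
tolerance = {"GV": 0.690, "SV": 0.912, "MT": 0.780, "PP": 0.692}
reported_vif = {"GV": 1.449, "SV": 1.096, "MT": 1.283, "PP": 1.444}

# VIF is the reciprocal of tolerance.
vif = {name: 1.0 / t for name, t in tolerance.items()}

# 1 - tolerance is the R2 of each predictor regressed on the other predictors.
shared_variance = {name: 1.0 - t for name, t in tolerance.items()}

for name in tolerance:
    print(
        f"{name}: VIF = {vif[name]:.3f} (reported {reported_vif[name]}), "
        f"variance shared with other predictors = {shared_variance[name]:.1%}, "
        f"below threshold of 2: {vif[name] < 2}"
    )
```

For GV, about 31% of its variance is shared with SV, MT, and PP, which is consistent with the argument that its effect is largely carried by the other predictors even though its VIF stays below 2.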
Thus, we consider the regression model with only the two variables SV and PP. In the regression results after removing GV and MT, R² decreased from 0.308 to 0.304, but R²adj increased from 0.294 to 0.298. Thus SV and PP explain nearly 30% of the variance of the DL variable. The significance level of the F-test is satisfactory (Sig = 0.000 < 0.05). The results also show that the independent variables SV and PP do affect DL, with significance levels of sig < 0.05. SV has the stronger impact on DL (βSV = 0.399 versus βPP = 0.284), and both variables affect DL in the same direction because their Beta coefficients are positive. Both variables have a variance inflation factor VIF < 2, which meets the requirement.
4.4 Test the Hypotheses of the Model
The research hypotheses are tested in this section.
Hypothesis H1: Teacher behavior has a positive impact on student motivation. In the Pearson correlation analysis, GV has a significant correlation with the dependent variable “learning motivation”. As analyzed above for the regression weights (Table 3(c)), both the partial correlation Pcor(GV, DL) (the correlation between GV and DL when the linear effect of the other independent variables on both DL and GV is removed) and the semi-partial correlation Scor(GV, DL) (the correlation between GV and DL when the linear effect of the other independent variables on GV alone is removed) are negative. This means that the part of GV that explains DL is already accounted for by the remaining variables. In this case, we cannot conclude that lecturer behavior has no impact on learning motivation, because faculty behavior is reflected in the remaining independent variables. Therefore, hypothesis H1 is still accepted: lecturers’ behaviors contribute to increasing students’ learning motivation.
Hypothesis H2: Student learning goals orientation has a positive impact on student learning motivation. The regression results show a Beta coefficient of 0.399 with significance level Sig = 0.000 < 0.05, so hypothesis H2 is accepted. It can be concluded that students who have clear learning goals are more motivated to learn.
Hypothesis H3: The classroom learning environment has a positive impact on student motivation. Although Table 3(c) shows that the MT variable is not statistically significant in the regression analysis, MT still has a significant Pearson correlation with DL. As analyzed above, the explanatory power of MT is absorbed by the remaining variables, since the correlation coefficients between MT and the remaining variables are relatively high. The author therefore decided to accept hypothesis H3.
Hypothesis H4: Teaching method has a positive impact on student motivation. Similar to hypothesis H2, the regression results show that the Beta coefficient of the PP variable is 0.284 with significance level sig = 0.000 < 0.05, so hypothesis H4 is accepted. Thus, effective teaching methods can increase students’ motivation for learning.
4.5 Test the Difference in Learning Motivation According to Some Personal Characteristics of Students
Test the difference in learning motivation by student’s gender. The author used the independent-samples t-test to check whether there is a significant difference in motivation between the male and female groups. The significance level of the Levene test is 0.084 (> 0.05), so the variances of learning motivation for male and female students do not differ significantly. We therefore read the “Equal variances assumed” row of the t-test, where Sig. = 0.486 (> 0.05); the “Equal variances not assumed” row gives a similar value, 0.491. There is thus no significant difference in the mean value between the male and female groups, and we conclude that there is no difference in learning motivation between male and female students at the Industrial University of Ho Chi Minh City, Vietnam.
Test the difference in learning motivation according to students’ school year. The school year variable has 4 groups, coded “1” for first-year students, “2” for second-year students, “3” for third-year students and “4” for final-year students. To test the differences in learning motivation among the 4 groups, the author uses the One-Way ANOVA test. The “between groups” significance level is sig = 0.734 (> 0.05), so we conclude that there is no significant difference in learning motivation among students of different academic years.
Test the difference in learning motivation by students’ majors. The majors of the students are classified into 5 groups, coded “1” for the economics group, “2” for the business administration group, “3” for the banking and finance group, “4” for the accounting and auditing group, and “5” for other groups. As with the “student year” variable, the author uses the One-Way ANOVA test to see whether learning motivation differs among students of different disciplines. The “between groups” value is Sig. = 0.318 (> 0.05), so there is no significant difference in learning motivation among students from different disciplines (see also Appendix A.4).
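The gender comparison can be reproduced from the group statistics reported in Appendix A.3 (Table 12). The sketch below computes the t statistic and degrees of freedom for both the pooled-variance test and the Welch test in plain Python; p-values are omitted because they require the t distribution, and the function names are illustrative rather than taken from the study.

```python
from math import sqrt


def pooled_t(n1, m1, s1, n2, m2, s2):
    """Student's t-test with equal variances assumed."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
    std_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / std_error, df


def welch_t(n1, m1, s1, n2, m2, s2):
    """Welch's t-test with equal variances not assumed."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    t = (m1 - m2) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df


# Group statistics reported in Table 12: n, mean, standard deviation.
male = (92, 3.4647, 0.70280)
female = (104, 3.5288, 0.58429)

t_pooled, df_pooled = pooled_t(*male, *female)
t_welch, df_welch = welch_t(*male, *female)

print(f"equal variances assumed:     t = {t_pooled:.3f}, df = {df_pooled}")
print(f"equal variances not assumed: t = {t_welch:.3f}, df = {df_welch:.2f}")
```

Both statistics are close to the reported values (t ≈ −0.70 with 194 degrees of freedom, and t ≈ −0.69 with about 177.6), and both lead to the same conclusion of no significant gender difference.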
5 Conclusion
The regression results show that two factors really affect the learning motivation of students of the Industrial University of Ho Chi Minh City: “student’s learning goals orientation” and “teaching methods”. However, the other two factors, “teacher behavior” and “classroom learning environment”, still have an impact on learning motivation, as the author argues in the hypothesis testing section.
Comparison of research results: To discuss the results more clearly, the author compared them with the research of Nguyen Dinh Tho et al. (2010), cited in [13]. That study examines the impact of faculty capacity on the learning motivation of economics students and measures the concept of “faculty competence” with a scale different from the one used here. However, the components of faculty capacity, such as “teaching capacity”, “subject organization” and “classroom interaction”, are relatively consistent with the “teacher behavior” and “teaching method” scales in this research. In that study, the “teaching capacity” variable also has a negative Beta coefficient and is not statistically significant. This suggests that employing lecturers with good professional competence, broad knowledge, and good expressive ability will not by itself increase students’ motivation for learning. Increasing students’ motivation is not easy because it depends on many factors. The two independent variables found to influence learning motivation explain only nearly 30.22% of the variance of this dependent variable, while the remaining 69.78% is due to other factors not considered in this study. In summary, the author found two factors that really affect learning motivation. The first is that students themselves have clear learning goals, which contributes to increasing motivation.
The second is that lecturers’ development of good, student-centered teaching methods contributes to increased motivation. Learning goals, as mentioned above, belong to each student’s internal motivation. Many students who enter the university lecture hall have already determined what job they will do in the future; they like to explore issues related to their subjects and disciplines, they enjoy academic challenges, and, more importantly, they are proactive in their learning. However, the frequency statistics of the variables for the concept of “goal-oriented learning” show that nearly 50% of students could not identify their goals for their university studies. Can students answer why they chose an economics school rather than a technical school? Although the learning goal depends on the individual student, many external factors can change it for better or worse. When a university has a specific curriculum in which every subject comes with a detailed outline, interested students find it easier to explore the subjects and disciplines and to decide which university they prefer. According to Amabile et al. (1994, p. 950) [1], curiosity and attention are factors that influence students’ internal motivation. In addition, the learning environment in this research has a significant correlation with students’ learning goal orientation. Creating a dynamic, highly competitive learning environment in the classroom helps students orient their learning goals, because goal-oriented students enjoy learning in environments that require high levels of competence; they are ready to take on challenging assignments and questions and are not afraid of challenges in learning.
Teaching method: this factor depends heavily on lecturers, on support from the school, and on the cooperation of learners. Lecturers cannot use learner-centered teaching methods when a class has too many students, nor can they use modern tools if the school’s facilities are poor. The group discussion method will not be effective without cooperation from students. The faculty factor has a significant correlation with the teaching method (correlation analysis shows that cor(GV, PP) = 0.494). Lecturers communicate directly with students during university teaching and are the ones who apply specific teaching methods, so their pedagogical knowledge is very important. It is therefore important to provide pedagogical training for lecturers who do not come from the pedagogy sector. Modern higher education is gradually moving away from teacher-centered teaching and placing students at the center; the role of lecturers does not diminish as a result, and they even play a particularly important role in shaping students’ learning process.
This study used a convenience sampling method; to obtain the 196 questionnaires, the author surveyed 12 different classes. However, the sample size is small and unevenly distributed across study facilities and classes, so other studies may choose a stratified sampling approach and increase the sample size. In addition, the four independent factors explain only 30.22% of the variance of the dependent variable, so many other factors affect student motivation. Later studies on learning motivation could make more use of in-depth exploratory research, rather than relying mainly on the theoretical basis, in order to identify additional factors affecting learning motivation.
A Appendix: Results from SPSS (Tables 4, 5, 6, 7, 8, 9, 10, 11, 12, 13 and 14)
A.1 Characteristics of the Survey Sample

Table 4. Sample distribution by gender

Gender  Number of respondents  Percent
Male    92                     46.9
Female  104                    53.1
Total   196                    100.0

Table 5. Sample distribution by year of study

Year of study  Number of respondents  Percent  Cumulative percent
1              6                      3.1      3.1
2              52                     26.5     29.6
3              22                     11.2     40.8
4              116                    59.2     100.0
Total          196                    100.0

Table 6. Sample distribution by major group

Major group              Number of respondents  Percent
Economics                8                      4.1
Business administration  31                     15.8
Banking and finance      104                    53.1
Accounting and auditing  36                     18.4
Other                    17                     8.7
Total                    196                    100.0
A.2 EFA Results

Table 7. Summary of EFA results for the dependent variable

No.  Observed variable  Factor 1
1    DL4                0.781
2    DL1                0.776
3    DL3                0.756
4    DL2                0.744
KMO                     0.766
Bartlett’s test (Sig.)  0.000
Variance extracted      58.433%
Eigenvalue              2.337

Table 8. Summary of EFA results for the independent variables

No.  Observed variable  Factor 1  Factor 2  Factor 3  Factor 4
1    GV3                0.801
2    GV5                0.773
3    GV4                0.741
4    GV7                0.695
5    GV2                0.675
6    GV1                0.629
7    PP1                0.599
8    GV6                0.597
9    PP6                          0.853
10   PP5                          0.852
11   PP3                          0.766
12   PP4                          0.682
13   PP2                0.346     0.669
14   PP7                          0.665
15   SV2                                    0.782
16   SV6                                    0.760
17   SV4                                    0.742
18   SV3                                    0.645
19   SV5                                    0.640
20   SV1                                    0.543
21   MT1                                              0.812
22   MT3                                              0.765
23   MT2                                              0.744
Cronbach’s Alpha (second run)  0.868  0.883  0.789  0.765
KMO                            0.835
Bartlett’s test (Sig.)         0.000
Variance extracted             57.409%
Eigenvalue                     1.518
A.3 Regression Analysis
Table 9. Correlation coefficient matrix (Pcor: Pearson correlation; Sig: significance)
             DL      GV      SV      MT      PP
DL   Pcor    1.000   0.237   0.479   0.232   0.398
     Sig     -       0.001   0.000   0.001   0.000
GV   Pcor    0.237   1.000   0.214   0.420   0.494
     Sig     0.001   -       0.003   0.000   0.000
SV   Pcor    0.479   0.214   1.000   0.174   0.279
     Sig     0.000   0.001   -       0.015   0.000
MT   Pcor    0.232   0.420   0.174   1.000   0.385
     Sig     0.001   0.000   0.015   -       0.000
PP   Pcor    0.398   0.494   0.279   0.385   1.000
     Sig     0.000   0.000   0.000   0.000   -
Table 10. Regression weights after removing the GV variable
                                                      Correlations              Multicollinearity
             B       SE      Beta (std.)   t       Sig     Cor     Partial   Part    Tolerance   VIF
(Constant)   1.141   0.278                 4.102   0.000
SV           0.416   0.066   0.395         6.300   0.000   0.479   0.414     0.378   0.917       1.091
MT           0.051   0.054   0.062         0.952   0.342   0.232   0.069     0.057   0.847       1.181
PP           0.190   0.048   0.264         3.941   0.000   0.398   0.274     0.237   0.806       1.241
Table 11. Regression analysis results (after removing variables GV and MT)
Model summary
Model   R       R²      Adjusted R²   Std. error of the estimate
1       0.552   0.305   0.298         0.53772

ANOVA
Model 1      Sum of squares   DF    Mean square   F        Sig
Regression   24.508           2     12.254        42.381   0.000
Residual     55.804           193   0.289
Total        80.312           195

Coefficients
                                                      Correlations              Multicollinearity
             B       SE      Beta (std.)   t       Sig     Cor     Partial   Part    Tolerance   VIF
(Constant)   1.261   0.250                 5.044   0.000
SV           0.421   0.066   0.399         6.376   0.000   0.479   0.417     0.383   0.920       1.086
PP           0.207   0.046   0.284         4.535   0.000   0.396   0.310     0.272   0.920       1.086
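The model summary in Table 11 can be cross-checked from its own ANOVA figures. The short sketch below recomputes R², adjusted R² and F from the reported sums of squares; the variable names are illustrative, and the sample size n = 196 and the two predictors (SV, PP) are taken from the tables above.

```python
# Values copied from the ANOVA part of Table 11.
ss_regression = 24.508   # regression sum of squares
ss_residual = 55.804     # residual sum of squares
n = 196                  # sample size (Table 4)
k = 2                    # predictors kept in the model: SV and PP

ss_total = ss_regression + ss_residual
r_squared = ss_regression / ss_total
# Adjusted R² penalises the number of predictors.
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
# F is the ratio of the regression and residual mean squares.
f_stat = (ss_regression / k) / (ss_residual / (n - k - 1))

print(f"SS total    = {ss_total:.3f}")
print(f"R squared   = {r_squared:.3f}")
print(f"Adjusted R2 = {adj_r_squared:.3f}")
print(f"F           = {f_stat:.2f}")
```

Up to rounding, these values agree with the R², adjusted R² and F reported in Table 11.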
Table 12. Independent-samples t-test results by student gender
Group statistics (learning motivation)
Gender   N     Mean     Std. deviation   Std. error of the mean
Male     92    3.4647   0.70280          0.07327
Female   104   3.5288   0.58429          0.05729

Independent-samples test (learning motivation)
                              Levene's test of        t-test for equality of means
                              equality of variances
                              F       Sig.            t        DF        Sig. (2-tailed)
Equal variances assumed       3.022   0.084           -0.698   194       0.486
Equal variances not assumed                           -0.690   177.630   0.491
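The two t statistics in Table 12 follow from the group statistics alone. The sketch below recomputes the pooled-variance (equal variances assumed) and Welch (equal variances not assumed) versions from the reported means and standard deviations; the function names are illustrative and not part of the original SPSS output.

```python
import math

# Group statistics copied from Table 12.
n1, mean1, sd1 = 92, 3.4647, 0.70280    # male
n2, mean2, sd2 = 104, 3.5288, 0.58429   # female


def pooled_t(n1, m1, s1, n2, m2, s2):
    """Student t statistic and degrees of freedom, equal variances assumed."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, df


def welch_t(n1, m1, s1, n2, m2, s2):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df


t_pooled, df_pooled = pooled_t(n1, mean1, sd1, n2, mean2, sd2)
t_welch, df_welch = welch_t(n1, mean1, sd1, n2, mean2, sd2)
print(f"pooled: t = {t_pooled:.3f}, df = {df_pooled}")
print(f"Welch:  t = {t_welch:.3f}, df = {df_welch:.2f}")
```

Small differences in the third decimal place are expected because the table reports rounded means and standard deviations.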
Table 13. One-way ANOVA results by student year
Test of homogeneity of variances
Levene statistic   df1   df2   Sig.
0.517              3     192   0.671

ANOVA (learning motivation)
                 Sum of squares   DF    Mean square   F       Sig.
Between groups   0.532            3     0.177         0.427   0.734
Within groups    79.78            192   0.416
Total            80.312           195
Table 14. One-way ANOVA results by student major
Test of homogeneity of variances
Levene statistic   df1   df2   Sig.
1.093              4     191   0.361

ANOVA (learning motivation)
                 Sum of squares   DF    Mean square   F       Sig.
Between groups   1.949            4     0.487         1.187   0.318
Within groups    78.364           191   0.410
Total            80.312           195
A.4 Check for the Other Special
This study uses a 5-point Likert scale to rate the level of agreement from low to high (1: Strongly disagree, 2: Disagree, 3: Neutral, 4: Agree, and 5: Totally agree).
Scale of lecturer behavior: The scale of lecturer behavior is based on the research of Gorham and Christophel (1992). In that study, the authors gave 10 questions representing 10 observed variables for lecturer behavior. They argue that these 10 factors help explain the lecturer behavior variable, which can increase or hinder student learning motivation. Specifically, lecturers must have professional knowledge and pedagogical skills (voice, communication ability, …), must be responsible (for example, promptly answering students' questions), must be fair in examination and evaluation, and must avoid all negative behaviors in learning. In addition, lecturers also need humor in each lecture and should take an interest in the interests of students, … The 10 items are coded from GV1 to GV10 as follows:
– GV1. Lecturers are competent and knowledgeable
– GV2. Lecturers have a sense of humor
– GV3. Be an effective, inspirational speaker in the classroom
– GV4. Speak clearly, explain in detail
– GV5. Interested in the benefits as well as the problems that students encounter
– GV6. Ready to help students outside working hours
– GV7. Be a responsible person (answer student questions, be fair in assessments)
– GV8. The voice is not boring
– GV9. Lecturers are people with good character (no negative behavior in learning)
– GV10. Instructors are experienced people
The scale of student's learning goal orientation
The scale of learning goal orientation is based on research by Klein, Noe and Wang (2006) and on the original study by Vandewalle (1997). To specifically assess the learning goal orientation factor (LGO, presented in the review of related studies), Klein et al. used 5 of the 6 question items. In the study of Vandewalle (1997), the author also adjusted the wording to match research in the field of education. An example of a question in Vandewalle's original study (1997) is "the opportunity to do challenging work is important to me". In this study, we add one question item from the original study; the items are adjusted and coded from SV1 to SV6 as follows:
– SV1. I often read industry-related documents to improve my skills
– SV2. I am ready to be given challenging assignments and questions, which will help me learn a lot
– SV3. I often look for opportunities to develop new skills and knowledge
– SV4. I like to face many challenges and difficulties in studying where I will learn new skills
– SV5. In my opinion, developing the ability to learn is important and I am willing to take risks to do it
– SV6. I like to study in environments that require a high level of ability and talent
Scale for classroom learning environment
The scale of the learning environment uses three question items from the study of Ullah et al. (2013). The authors use these 3 question items to explain the learning environment factor. According to the authors, this scale is assessed by the size of the class, the healthy competition between students in the class and their active participation in lectures. The questionnaire is coded from MT1 to MT3 as follows:
– MT1. Suitable class sizes
– MT2. Competition among students in class
– MT3. The active participation of students in lectures in class
The scale of teaching methods
To measure teaching methods, this study uses 7 question items from the studies of Ullah et al. (2013) and Tootoonchi et al. (2002). The authors take the view that a teacher-centered approach is no longer appropriate, although this method is still popular in most educational institutions. The new method focuses on learners, while teachers still play a leading role, guiding the entire learning process for students. The new method also places more emphasis on classroom discussion, not merely discussion among learners but also direct discussion between lecturers and learners. With this method, students are more active in their learning. Lecturers provide more materials, such as their curriculum, lectures or reference materials, to expand knowledge so that students study more at home. In addition, the authors offer a number of teaching methods such as using practical case studies in lectures, combining field visits with the subject and introducing scientific research papers related to the subject for student reference. The questions that represent the "teaching method" scale are coded from PP1 to PP7 as follows:
– PP1. Frequent use of classroom discussion methods
– PP2. Modern teaching methods (learner centered)
– PP3. Regularly provide learning materials for students
– PP4. Use real-world case studies in lectures
– PP5. Incorporate field trips on the course
– PP6. Use documentaries relevant to the subject
– PP7. Use scientific papers related to the subject
Learning motivation scale
The observed variables used to measure the dependent variable "Learning motivation" of students are based on the scale of Cole et al. (2004). Cole et al. (2004) used 4 of the 8 question items in the study of Noe and Schmitt (1986) to measure learning motivation. Examples of questions from the original research are "I will try to learn if possible from the course" and "I will put considerable effort in the course". Nguyen Dinh Tho (2013, p. 504) gave a learning motivation scale based on the 4 questions in the study of Cole et al. (2004). In this study, the observed variables are coded from DL1 to DL4:
– DL1. I spend a lot of time studying in college
– DL2. Investing in this curriculum is my number one priority
– DL3. I study hard in this curriculum
– DL4. Overall, my motivation for studying at university is very good
References
1. Amabile, T., Hill, K., Hennessey, B., Tighe, E.: The work preference inventory: assessing intrinsic and extrinsic motivational orientations. J. Pers. Soc. Psychol. 66(5), 950–967 (1994)
2. Banjecvic, K., Nastasic, A.: Methodological approach: Students assessment of academic institution as basic for successful achievement of their satisfaction. Center for Quality: Faculty of Mechanical Engineering, University of Kragujevac (2010)
3. Boud, D.: Assessment and the promotion of academic values. Stud. High. Educ. 15(1), 101–111 (1990)
4. Hoạt, Đ.V., Đức, H.T.: Lý luận dạy học đại học. Đại học Sư Phạm, Hồ Chí Minh (2013)
5. Deci, E.: Intrinsic motivation, extrinsic reinforcement and inequity. J. Pers. Soc. Psychol. 22(1), 113–120 (1972)
6. D'Souza, K., Maheshwari, S.: Factors influencing student performance in the introductory management science course. Acad. Educ. Leadersh. J. 14(3), 99–120 (2010)
7. Elton, L.: Student motivation and achievement. Stud. High. Educ. 13(2), 215–221 (1988)
8. Fisher, S., Ford, J.: Differential effects of learner effort and goal orientation on two learner outcomes. Pers. Psychol. 51(2), 397–419 (1998)
9. Kinman, G., Kinman, R.: The role of motivation to learn in management education. J. Workplace Learn. 13(4), 132–144 (2001)
10. Klein, H., Noe, R., Wang, C.: Motivation to learn and course outcomes: the impact of delivery mode, learning goal orientation, and perceived barriers and enablers. Pers. Psychol. 59(3), 665–702 (2006)
11. Kroll, M.: Motivational orientations, views about the purpose of education, and intellectual styles. Psychol. Sch. 25, 338–343 (1988)
12. Merriam-Webster: Merriam-Webster's collegiate dictionary, 10th edn. Houghton-Mifflin (1997)
13. Thọ, N.Đ.: Phương pháp nghiên cứu khoa học trong kinh doanh. NXB Lao động - xã hội, Hồ Chí Minh (2013)
14. Pinder, C.: Work Motivation and Organizational Behaviour, 2nd edn. Psychology Press, Upper Saddle River (2008)
15. Tucker, C., Zayco, R.: Teacher–child variables as predictors of academic engagement among low-income African American children. Psychol. Sch. 39(4), 477–488 (2002)
16. Ullah, M., Sagheer, A., Sattar, T., Khan, S.: Factors influencing students motivation to learn in Bahauddin Zakariya University, Multan (Pakistan). Int. J. Hum. Resour. Stud. 3(2), 90 (2013)
17. Valerio, K.: Intrinsic motivation in the classroom. J. Student Engagement: Educ. Matters 2(1), 30–35 (2012)
18. Williams, K., Williams, C.: Five key ingredients for improving student motivation. Res. High. Educ. J. 12, 104–122 (2011)
19. Zu, P.: From motive to motivation: motivating Chinese elective students. Int. J. Arts Sci. 7(6), 455–470 (2014)
Towards Traffic Saturation Detection Based on the Hough Transform Method
Abdoulaye Sere(B), Cheick Amed Diloma Gabriel Traore, Yaya Traore, and Oumarou Sie
Réseau des Enseignants chercheurs et Chercheurs en Informatique du Faso (RECIF), Ouagadougou, Burkina Faso
[email protected], [email protected], [email protected], [email protected]
Abstract. The principal aim of this paper is to address the problem of traffic saturation using only GPS on board vehicles and servers in a control station, without deploying cameras or other equipment on roads. GPS covers a large region. The challenge is to reduce the equipment that has to be deployed on board vehicles and on roads. This paper deals with the application of the Hough transform method to the automatic detection and monitoring of traffic saturation in order to warn drivers about saturated roads. The Hough transform method establishes a relation between an image space and a parameter space in 2D. On a map, some main roads are identified by their coordinates in the accumulator. The Hough transform method is applied to the Cartesian coordinates derived from the GPS coordinates of vehicles to determine the number of votes around the coordinates of the main roads in the accumulator. The number of votes around the coordinates of a main road indicates whether the road is saturated, and drivers then receive a voice message telling them to change direction. Looking forward, the method could also serve as a building block in the fight against Covid-19, by detecting the presence of a crowd in a street based on GPS.
Keywords: Hough transform · Traffic saturation · Covid-19
1 Introduction
Nowadays, road traffic management is a challenge for cities in every country. Various technologies have been deployed on roads or on board vehicles to improve traffic, facilitate driving, avoid traffic saturation and reduce accidents. Traffic management requires deploying systems in vehicles and on roads, and creating interactions between these two kinds of systems in order to facilitate traffic. Traffic signs comprise all the information displayed on roads to facilitate driving, particularly traffic lights. Traffic sign detection contributes to reducing accidents. Several works on traffic light detection have been published. According to statistical reports in Japan [15], the most serious accidents are caused by drivers disregarding traffic lights. Drivers can automatically receive appropriate information from traffic light detection to improve driving, because traffic lights may be saturated in cities [15]. To this end, a system with a camera operating on board vehicles that detects traffic lights has been proposed by Hiroki Moizumi and others [15]. Moreover, traffic sign detection and interpretation are becoming more and more important in the context of autonomous or intelligent vehicles [16]. For instance, Altaf Alam and others use vision-based traffic light detection for the navigation of autonomous vehicles [1]. In [7], Yun-Chung and others propose a vision-based traffic light detection system at intersections because of traffic light saturation. Several works use the Hough transform to detect traffic signs [13]. For instance, in [10], García-Garrido and others have studied how the Hough transform can be used for real-time traffic sign detection on roads.
© Springer Nature Switzerland AG 2021. K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 263–270, 2021. https://doi.org/10.1007/978-3-030-63089-8_16
In cities, the number of roads and the distances increase with the number of vehicles. Our study focuses on traffic saturation detection using GPS and the Hough transform method, in order to reduce the number of cameras on roads. Through the various definitions of the Hough transform, the recognition of straight lines, circles and arbitrary shapes is possible, which gives the method a basic role in computer vision for object detection.
Various concepts for modeling roads have been established. Scientists have worked on different definitions for drawing digital straight lines. The Bresenham algorithms [6] are well known for drawing digital lines and digital circles. Reveillès introduced the definition of the analytical straight line [17,18]. Extensions of the analytical straight line have been proposed by Andres et al. [2–4], leading to naive, standard and supercover hyperplanes in dimension n. Particular analytical straight lines based on hexagonal or octagonal grids have been studied by SERE et al. [21].
Many works have also concerned the detection of digital straight lines. In 1962, Paul Hough [12] introduced the Hough transform method, which transforms a point in an image space into a straight line in a parameter space. The method was later improved to handle the detection of circles, ellipses and arbitrary shapes, as proposed by Duda in 1972 [9] and Ballard [5]. In 1985, Henri Maître [14] proposed unified definitions of the Hough transform method, and in [11] A.S. Hassanein and S. Mohammad also present a survey of the Hough transform method. Moreover, Martine Dexet et al. [8] extended the initial Hough transform definition using the dual of a pixel and the dual of a voxel. The standard Hough transform is another method, based on the equation p = x cos a + y sin a, that transforms a point in an image space into a sine curve in a parameter space. Several recent works extend the standard Hough transform based on the dual of geometric shapes such as squares, circles, hexagons, rectangles and octagons, as proposed by SERE et al. [19–21].
This paper essentially uses two basic concepts, the analytical straight line and the Hough transform method, and focuses on introducing the Hough transform method into techniques for handling traffic saturation. An automatic voice message is sent to drivers who are about to enter critical roads. This paper is organized as follows: Sect. 2 describes the basic concepts related to the analytical straight line and the Hough transform method, Sect. 3 presents our method, Sect. 4 gives illustrations, and Sects. 5 and 6 contain a short discussion and the conclusion.
2 Preliminaries
An analytical straight line [17,18] with parameters $(a, b, \mu)$ and thickness $w$ is defined as the set of integer points $(x, y)$ satisfying
$$\mu \le ax + by < \mu + w, \qquad (a, b, \mu, w) \in \mathbb{Z}^4, \qquad \gcd(a, b) = 1.$$
An analytical digital straight line is:
– thin if $w < \max(|a|, |b|)$
– thick if $w > |a| + |b|$
Let $I \subset \mathbb{R}^2$ be an image space. Let $l$ be the number of columns in an image and $h$ the number of rows. Suppose that the point $(x, y) \in I$. The Standard Hough Transform of $(x, y)$ is defined as the set of points
$$\left\{ (\theta, r) \in [0, \pi] \times \left[-\sqrt{l^2 + h^2},\ \sqrt{l^2 + h^2}\right] \ /\ r = x \cos\theta + y \sin\theta \right\}.$$
The Standard Hough Transform has been extended by SERE et al. [20] to take the dual of a square (a pixel) into account. Let $O \subset I$ be a square; then
$$\mathrm{Dual}(O) = \bigcup_{p \in O} \mathrm{Dual}(p),$$
where $p$ is a continuous point.
SERE et al. [19] have established the dual of a rectangle to detect digital straight lines, which will be used in this paper.
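The two definitions above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names and the sampling of θ are assumptions made for illustration, and the label "intermediate" is only a placeholder for thicknesses between the two bounds, which the text does not name.

```python
import math


def on_analytical_line(x, y, a, b, mu, w):
    """True if the integer point (x, y) satisfies mu <= a*x + b*y < mu + w."""
    return mu <= a * x + b * y < mu + w


def line_kind(a, b, w):
    """Classify a line by its thickness w, following the two cases in the text."""
    if w < max(abs(a), abs(b)):
        return "thin"
    if w > abs(a) + abs(b):
        return "thick"
    return "intermediate"  # placeholder label, not defined in the text


def hough_curve(x, y, n_theta=181):
    """Sample the standard Hough transform of (x, y) for theta in [0, pi].

    Returns (theta, r) pairs with r = x*cos(theta) + y*sin(theta).
    """
    curve = []
    for k in range(n_theta):
        theta = math.pi * k / (n_theta - 1)
        curve.append((theta, x * math.cos(theta) + y * math.sin(theta)))
    return curve


if __name__ == "__main__":
    print(on_analytical_line(1, 1, a=1, b=-1, mu=0, w=1))
    print(line_kind(a=1, b=2, w=1), line_kind(a=1, b=2, w=4))
    print(hough_curve(3, 4)[0])
```

Each sampled value of r stays within the bound $\sqrt{x^2 + y^2}$, which is why the parameter space can be limited to the interval given in the definition.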
3 Method Description
Our method consists of creating a list of roads on a map. Let $R_1, R_2, \ldots, R_{n-1}, R_n$ be $n$ selected main roads in an area. We are interested in following traffic saturation on these roads. Following the definition of the standard Hough transform, each road is represented by a straight line in the image space whose equation has the general form $y_i = a \cos(x_i) + b \sin(x_i)$, where the couple $(x_i, y_i)$ is fixed. Each straight line $R_i$ is thus associated with a point $(x_i, y_i)$, which is the center of a cell in the accumulator. Let $V_1, V_2, \ldots, V_{m-1}, V_m$ be $m$ vehicles, represented respectively by the coordinates $V_1(a_1, b_1), V_2(a_2, b_2), \ldots, V_{m-1}(a_{m-1}, b_{m-1}), V_m(a_m, b_m)$ in an image space at a moment $t$. These coordinates change over time according to the new positions of the vehicles. The standard Hough transform of a point $V_i(a_i, b_i)$ is a sine curve defined by $y = a_i \cos(x) + b_i \sin(x)$. Here, the standard Hough transform based on the dual of a rectangle [19] is used: it is a surface and adds 1 to the value of certain cells.
The thresholds $\alpha_i$, $\beta_i$ are used with the saturation function $S_i(u)$ defined for each road $R_i(x_i, y_i)$, where $u$ is the number of votes for the road:
$$S_i : \mathbb{N} \to [0, 1], \qquad u \mapsto \frac{u}{\gamma_i},$$
where $\gamma_i$ is the maximal authorized number of vehicles for the road $R_i$. Thus, $S_i(u)$ defines the state of the road $R_i(x_i, y_i)$, "on saturation", "almost on saturation" or "not on saturation":
– if $S_i(u) > \beta_i$, the road $R_i(x_i, y_i)$ is "on saturation" and a voice message is sent to vehicles;
– if $\alpha_i < S_i(u) \le \beta_i$, the road $R_i(x_i, y_i)$ is "almost on saturation" and is open to only a few more vehicles, which receive a warning;
– if $S_i(u) \le \alpha_i$, the road $R_i(x_i, y_i)$ is "not on saturation" and is open to more vehicles.
The value of $S_i(u)$ is also displayed on traffic signs to alert drivers about the state of each main road.
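The three saturation states can be sketched in a few lines of Python. This is a hypothetical illustration: the function names are assumptions, and clamping S to 1 when the number of votes exceeds γ is an added assumption so that the value stays in [0, 1] as the definition requires.

```python
def saturation(u, gamma):
    """Saturation level S(u) = u / gamma for a road with capacity gamma.

    Clamped to 1.0 (an assumption) so the result stays within [0, 1].
    """
    if gamma <= 0:
        raise ValueError("gamma must be a positive number of vehicles")
    return min(u / gamma, 1.0)


def road_state(u, gamma, alpha, beta):
    """Map the number of votes u to one of the three states of the method."""
    s = saturation(u, gamma)
    if s > beta:
        return "on saturation"          # voice message sent to vehicles
    if s > alpha:
        return "almost on saturation"   # a few vehicles admitted, with a warning
    return "not on saturation"          # road open to more vehicles


if __name__ == "__main__":
    # Example thresholds: alpha = 0.5 and beta = 0.8, capacity of 100 vehicles.
    for votes in (30, 60, 90):
        print(votes, road_state(votes, gamma=100, alpha=0.5, beta=0.8))
```

Keeping the thresholds as parameters allows a different pair (α, β) and a different capacity γ for each road, as in the definition above.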
4 Numerical Simulation
We consider a digital image that contains at least one main road and some secondary roads (see Fig. 1). The Canny filter is used for edge detection: the edges in Fig. 2 are the result of applying the Canny filter to the image in Fig. 1.
Fig. 1. A main road in a map
Fig. 2. Results with Canny filter
The dual of rectangles [19] makes it possible to detect, for instance, main roads and secondary roads, as illustrated in Figs. 3, 4 and 5.
A road can be represented by an analytical straight line. Because of the thickness, the dual of an analytical straight line leads to several couples that have the maximal vote in the accumulator. For instance, the couples are:
– for the main road in Fig. 3: (392.0, 0.0), (392.0, 45.0), (392.0, 100.0), (392.0, 120.0), (392.0, 195.0), (392.0, 205.0), (392.0, 215.0), (392.0, 265.0), (392.0, 295.0), (392.0, 345.0), (392.0, 370.0), (392.0, 420.0), (392.0, 445.0), (413.0, 0.0), (413.0, 45.0), (413.0, 100.0), (413.0, 120.0), (413.0, 195.0), (413.0, 205.0), (413.0, 215.0), (413.0, 265.0), (413.0, 295.0), (413.0, 345.0), (413.0, 370.0), (413.0, 420.0), (413.0, 445.0)
– for the secondary road in Fig. 4: (686.0, 100.0), (0.0, 100.0), (406.0, 100.0), (133.0, 100.0), (392.0, 100.0), (805.0, 100.0), (497.0, 100.0), (756.0, 100.0), (413.0, 100.0), (812.0, 100.0), (693.0, 100.0)
– for the secondary road in Fig. 5: (686.0, 265.0), (0.0, 265.0), (406.0, 265.0), (133.0, 265.0), (392.0, 265.0), (805.0, 265.0), (497.0, 265.0), (756.0, 265.0), (413.0, 265.0), (812.0, 265.0), (693.0, 265.0)
Fig. 3. A main road
Fig. 4. A secondary road
Fig. 5. A secondary road
Fig. 6. The accumulator
The accumulator is shown in Fig. 6. It is the result of the Hough transform method applied to the image in Fig. 2. The brightest points are those that have received the most votes.
Generally, for any map, the coordinates of the main roads are known in advance and never change: the Hough transform establishes these coordinates only once. For instance, after the change of scale, the couple (392.0, 0.0) corresponds to the point (56, 0) in the accumulator for the main road. That leads to the equation $0 = x \cos(56) + y \sin(56)$ for a continuous straight line in the main road. Finally, the point (413.0, 215.0) is associated with the couple (59.0, 43.0), which leads to $43 = x \cos(59) + y \sin(59)$ for a continuous straight line in the same main road.
Each vehicle has a GPS position, which is transformed into a Cartesian coordinate $(k, l)$. Let (392, 0), (392, 345) and (413, 420) be the Cartesian coordinates of three vehicles running on the main road. The standard Hough transform of a vehicle position $(k, l)$ is the set of points (see Sect. 2) defined by
$$\left\{ (\theta, r) \in [0, \pi] \times \left[-\sqrt{l^2 + h^2},\ \sqrt{l^2 + h^2}\right] \ /\ r = k \cos\theta + l \sin\theta \right\},$$
where, inside the interval bounds, $l$ and $h$ denote the image dimensions of Sect. 2. For each point we have:
– for the point (392, 0): $p = 392 \cos(a) + 0 \sin(a)$. This sine curve passes through the point (392, 0), corresponding to the point (56, 0) in the accumulator. So the number of votes around the point (56, 0) is read and incremented by 1.
– for the point (392, 345): $p = 392 \cos(a) + 345 \sin(a)$. This sine curve passes through the point (392, 0), corresponding to the point (56, 0) in the accumulator. The number of votes around the point (56, 0) is also read and incremented by 1.
– for the point (413, 420): the sine curve is defined by $p = 413 \cos(a) + 420 \sin(a)$.
Suppose that $\alpha_i = 0.5$ and $\beta_i = 0.8$. Before the number of votes is incremented, the resulting value is compared with $\alpha_i$ and $\beta_i$ to decide whether a voice message must be sent to vehicles.
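The voting step can be sketched as follows, under stated assumptions: this is a simplified point-based vote rather than the rectangle-dual transform of [19], the angle is taken in degrees, the tolerance `tol` stands in for the neighborhood "around" a road cell, and the road parameters in the example (θ = 0°, r = 392, i.e. the vertical line x = 392) are chosen for illustration and are not the accumulator coordinates reported above.

```python
import math


def votes_for_road(vehicles, theta_deg, r, tol):
    """Count vehicles whose Hough curve passes within tol of the cell (theta, r).

    A vehicle at (x, y) votes for the road when
    |x*cos(theta) + y*sin(theta) - r| <= tol.
    """
    theta = math.radians(theta_deg)
    count = 0
    for x, y in vehicles:
        if abs(x * math.cos(theta) + y * math.sin(theta) - r) <= tol:
            count += 1
    return count


if __name__ == "__main__":
    # Cartesian positions of the three vehicles used in the text.
    vehicles = [(392, 0), (392, 345), (413, 420)]
    # Hypothetical road cell: the vertical line x = 392 (theta = 0 degrees).
    print(votes_for_road(vehicles, theta_deg=0, r=392, tol=1))
    # A wider tolerance models a thick road that also covers x = 413.
    print(votes_for_road(vehicles, theta_deg=0, r=392, tol=21))
```

The resulting count plays the role of u in the saturation function of Sect. 3; since the road cells are fixed, each vehicle only needs one evaluation per road instead of a full accumulator update.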
5 Short Discussion
This solution only requires installing GPS on board vehicles and setting up servers to monitor the main roads. It is not necessary to place multiple cameras on the main roads in order to count vehicles, so there are no camera maintenance costs. Some questions remain open, and future work should focus on them. For instance:
– How should the transformation between the GPS coordinates of vehicles and Cartesian coordinates in 2D or 3D be established?
– The number of main roads is smaller than the number of vehicles. Instead of having a single server control all the processing, several servers can be used for scalability. In this case, which protocol should be used between the server and the client that instantly receives a voice message?
– Alternatively, the system on board vehicles could read the coordinates of the main roads, the values of αi and βi and the road states through traffic signs, carry out all the processing itself and update the main road states on the server.
6 Conclusion
The detection of traffic saturation is based on the Hough transform method, which makes it possible to obtain the coordinates of the main road, represented by an analytical straight line. We have considered Cartesian coordinates (corresponding to GPS coordinates) for vehicles. The Hough transform is applied to these coordinates to determine the number of votes in the accumulator. A function is used to define the level of saturation on the main road. Future work will concern an effective implementation that takes the remaining questions into account. As a perspective, the method could be used to detect crowds by considering people walking in a street with their cellphones, and to alert them in the context of Covid-19 prevention.
References
1. Altaf, A., Zainul, A.J.: A vision-based system for traffic light detection. In: Malik, H., Srivastava, S., Sood, Y., Ahmad, A. (eds.) Applications of Artificial Intelligence Techniques in Engineering. Advances in Intelligent Systems and Computing, Springer, p. 698 (2019)
2. Andres, E.: Discrete circles, rings and spheres. Comput. Graph. 18(5), 695–706 (1994)
3. Andres, E.: Discrete linear objects in dimension n: the standard model. Graph. Models 65, 92–211 (2003)
4. Andres, E., Jacob, M.-A.: The discrete analytical hyperspheres. IEEE Trans. Vis. Comput. Graph. 3, 75–86 (1997)
5. Ballard, D.: Generalizing the Hough transform to detect arbitrary shapes. Pattern Recogn. 13(2), 111–122 (1981)
6. Bresenham, J.E.: Algorithm for computer control of a digital plotter. IBM Syst. J. 4(1), 25–30 (1965)
7. Chung, Y.C., Wang, J.M., Chen, S.W.: A vision-based traffic light detection system at intersections. J. Taiwan Normal Univ. Math. Sci. Technol. 47(1), 67–86 (2002)
8. Dexet, M., Andres, E.: A generalized preimage for the digital analytical hyperplane recognition. Discrete Appl. Math. 157(3), 476–489 (2009)
9. Duda, R.O., Hart, P.E.: Use of the Hough transform to detect lines and curves in pictures. Commun. ACM 15(1), 11–15 (1972)
10. Martín-Gorostiz, E., García-Garrido, M.Á., Sotelo, M.Á.: Fast road sign detection using Hough transform for assisted driving of road vehicles. In: Moreno Díaz, R., Pichler, F., Quesada Arencibia, A. (eds.) Computer Aided Systems Theory – EUROCAST 2005. Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, p. 3643 (2005)
11. Hassanein, A.S., Sameer, M., Mohammad, S., Ragab, M.E.: A survey on Hough transform, theory, techniques and applications. In: International Journal of Computer Science Issues (2015)
12. Hough, P.-V.-C.: Method and means for recognizing complex patterns. United States Patent 3069654, 47–64 (1962)
13. Huang, S., Gao, C., Meng, S., Li, Q., Chen, C., Zhang, C.: Circular road sign detection and recognition based on Hough transform. In: 2012 5th International Congress on Image and Signal Processing, pp. 1214–1218 (2012)
14. Maitre, H.: A review on Hough transform. Traitement du signal 2(4), 305–317 (1985)
15. Moizumi, H., Sugaya, Y., Omachi, M., Omachi, S.: Traffic light detection considering color saturation using in-vehicle stereo camera. J. Inf. Process. 24(2), 349–357 (2016)
16. Ozcelik, Z., Tastimur, C., Karakose, M., Akin, E.: A vision based traffic light detection and recognition approach for intelligent vehicles. In: 2017 International Conference on Computer Science and Engineering (UBMK), pp. 424–429 (2017)
17. Reveilles, J.-P.: Structures des droites discrètes. In: Journées mathématique et informatique. Marseille-Luminy (1989)
18. Reveilles, J.-P.: Géométrie discrète, calcul en nombres entiers et algorithmique. Traitement des images, Université Louis Pasteur (France) (1991)
19. Sere, A., Ouedraogo, F.T., Zerbo, B.: An improvement of the standard Hough transform method based on geometric shapes. pp. 1–8. Future of Information and Communication Conference (FICC), Singapore, 5-6 April (2018)
20. Sere, A., Sie, O., Andres, E.: Extended standard Hough transform for analytical line recognition. Int. J. Adv. Comput. Sci. Appl. 4(3), 256–266 (2013)
21. Sere, A., Traore, Y., Ouedraogo, F.T.: Towards new analytical straight line definitions and detection with the Hough transform method. Int. J. Eng. Trends Technol. (IJETT) 62(2), 66–73 (2018)
Performance Benchmarking of NewSQL Databases with Yahoo Cloud Serving Benchmark
Irina Astrova1(&), Arne Koschel2, Nils Wellermann2, and Philip Klostermeyer2
1 Department of Software Science, School of IT, Tallinn University of Technology, Akadeemia Tee 21, 12618 Tallinn, Estonia [email protected]
2 Faculty IV, Department of Computer Science, University of Applied Sciences and Arts Hannover, Ricklinger Stadtweg 120, 30459 Hannover, Germany [email protected]
Abstract. Selecting a NewSQL database product is an important process. Like any other successful database management system technology, the product selected today begins to define the legacy of the future. There are many different parameters that can be used to evaluate the NewSQL database alternatives, and there is no single most-correct process for conducting such an evaluation. Not only are there many possible pertinent evaluation criteria, but there also is typically a degree of uncertainty about the requirements and characteristics of the intended application environment. This paper focuses on three NewSQL databases (viz., VoltDB, MemSQL and NuoDB), with emphasis on evaluating their performance. The evaluation is done based on Yahoo Cloud Serving Benchmark.
Keywords: Database management system (DBMS) · SQL · NoSQL · NewSQL · VoltDB · MemSQL · NuoDB · Yahoo Cloud Serving Benchmark (YCSB)
1 Introduction
In times of big data and ever-growing data volumes, database management systems (DBMSs) have to adapt to new requirements. NoSQL databases are one approach to doing so, but they are not ACID-compliant. By contrast, SQL databases are ACID-compliant, but they cope poorly with big data. As a result, NewSQL databases have appeared. Nevertheless, all three bring their own upsides and downsides [2], which are summarized in Table 1.
Table 1. Comparison of SQL, NoSQL and NewSQL databases.
Feature                 SQL   NoSQL   NewSQL
ACID-compliant          Yes   No      Yes
SQL support             Yes   No      Yes
Structured              Yes   No      No
Horizontally scalable   No    Yes     Yes
Highly available        No    Yes     Yes
© Springer Nature Switzerland AG 2021. K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 271–281, 2021. https://doi.org/10.1007/978-3-030-63089-8_17
SQL databases show their strengths in being ACID-compliant, supporting SQL statements and being standardized overall, but they lack horizontal scaling and high availability when data volumes start to grow rapidly, especially in modern settings where the term big data appears much more often. NoSQL databases tackle the rapidly growing amounts of information that need to be stored by easing the process of scaling databases and clusters while remaining highly available. At the same time, they trade away ACID compliance (Atomicity, Consistency, Isolation, Durability) and standardized SQL support, e.g., by using their own proprietary domain-specific languages or other universal language constructs like XML or JSON. Simply trying to provide everything at the same time and at the same level unfortunately runs into the CAP theorem, which states that only two of the following three properties can be achieved at once: Consistency, Availability and Partition tolerance. Traditional relational databases choose consistency over availability because they are ACID-compliant, whereas NoSQL databases choose availability over consistency (also referred to as BASE compliance, from Basically Available, Soft state, Eventually consistent). NewSQL tries to bridge the gap between the two by being ACID-compliant, providing consistency and supporting standardized SQL, while also providing high availability by partly sacrificing partition tolerance. Most NewSQL databases achieve these features by making heavy use of the physical memory of the database server and storing large parts of the database in it. This gives high-speed access, although it can conflict with the durability aspect of ACID compliance, e.g., in case of a system failure in which a committed transaction could be lost and durability could no longer be guaranteed [3].
2 Candidates

As candidates, we selected three NewSQL databases (viz., VoltDB, MemSQL and NuoDB) because of their popularity and availability. What these databases have in common is their architecture and distribution. Furthermore, the databases scale horizontally rather than vertically.

2.1 VoltDB
VoltDB [4] is a commercial DBMS; the current version is v9.2.2. It originated from H-Store and was designed by Michael Stonebraker, Sam Madden and Daniel Abadi. In addition to its commercial version, it also has a free community edition. VoltDB was designed from the ground up to avoid legacy database problems. It is an in-memory DBMS that is scalable and ACID-compliant. Furthermore, VoltDB is
Performance Benchmarking of NewSQL Databases with YCSB
designed to be a relational database that works mainly with online transaction processing (OLTP) queries. It is also designed to be a distributed database with sharding and data replication. More specifically, VoltDB uses a shared-nothing architecture (see Fig. 1). This is a distributed computing architecture in which nodes do not share any memory or storage, so each node can work independently [5]. Based on this structure, VoltDB combats the drawbacks of SQL databases. It works in memory to eliminate the delays and contention associated with traditional disk-based systems. Durability features such as snapshots, command logging and database replication protect the data against loss due to failures.
Fig. 1. Shared-nothing architecture [5].
Throughput is maximized by partitioning both the data and the transactions that access that data (see Fig. 2) so parallel processing of multiple transactions is possible. All the partitioning is done automatically by VoltDB. Scaling is easily done due to the shared-nothing architecture by simply adding more nodes or more partitions to each node.
Fig. 2. Partitioning tables [4].
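The partitioning idea above can be illustrated with a minimal Python sketch that routes rows to partitions by hashing a partitioning column. The names `partition_for` and `partition_rows`, the CRC32 hash and the partition count are assumptions made for this illustration only; they do not reproduce VoltDB's actual partitioning scheme.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # assumed value, chosen only for illustration


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a partitioning-column value to a partition id (hypothetical scheme)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_rows(rows, key_column, num_partitions=NUM_PARTITIONS):
    """Group rows by the partition that owns their partitioning key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_for(row[key_column], num_partitions)].append(row)
    return dict(partitions)


# Made-up rows; every row lands in exactly one partition.
rows = [{"user": f"user{i}", "balance": i * 10} for i in range(8)]
layout = partition_rows(rows, "user")
```

Because the mapping depends only on the key, a transaction that touches a single key can be sent straight to the partition that owns it.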
Furthermore, each partition runs its transactions to completion, one at a time, because it is single-threaded. Figure 3 shows the serialized processing of stored procedures. Here the stored procedures are single-partitioned, meaning that they operate on data within a
single partition, so multiple procedures can be executed in parallel. When a procedure does require data from more than one partition, one node acts as a coordinator: it hands out the work to the other nodes, gathers all results and completes the task. Because processing within a partition is serialized, transactional consistency is preserved without any locking or latching. Distribution is used to scale easily for performance and volume, and clustering provides durability and availability. With replication, the database is protected against node failures, and the replicated data can also be read to increase availability.
Fig. 3. Serialized processing [4].
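The serialized processing described above can be sketched as follows. This is a simplified, assumption-laden model (the `Partition` class and the `deposit` and `total` procedures are hypothetical), not VoltDB code: each partition drains its own queue one procedure at a time, so no locks are needed, and a coordinator combines partial results for multi-partition work.

```python
from collections import deque


class Partition:
    """One single-threaded partition: procedures run one at a time, to completion."""

    def __init__(self):
        self.data = {}
        self.queue = deque()

    def submit(self, procedure, *args):
        self.queue.append((procedure, args))

    def run_all(self):
        results = []
        while self.queue:
            procedure, args = self.queue.popleft()
            # Only one procedure touches self.data at a time, so no locking is needed.
            results.append(procedure(self.data, *args))
        return results


def deposit(data, account, amount):
    """Hypothetical single-partition procedure."""
    data[account] = data.get(account, 0) + amount
    return data[account]


def total(data):
    """Hypothetical per-partition fragment of a multi-partition procedure."""
    return sum(data.values())


partitions = [Partition(), Partition()]
partitions[0].submit(deposit, "alice", 50)
partitions[0].submit(deposit, "alice", 25)
partitions[1].submit(deposit, "bob", 10)
single_results = [p.run_all() for p in partitions]

# Multi-partition work: a coordinator fans out and combines the partial results.
for p in partitions:
    p.submit(total)
grand_total = sum(p.run_all()[0] for p in partitions)
```

In a real system each partition would run on its own thread or core; the sketch runs them in sequence only to stay short.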
In summary, VoltDB is a fully ACID-compliant NewSQL database. It provides good performance without sacrificing the reliability and integrity of a fully transactional DBMS. Schema definition and queries use standard SQL syntax.

2.2 MemSQL
MemSQL [6] was developed by MemSQL Inc. It comes in both commercial and community editions. The current version of MemSQL is v7.0.9. MemSQL is a distributed, relational database that handles both transactions and real-time analytics at scale. Querying is done through standard SQL drivers and syntax, thereby leveraging a broad ecosystem of drivers and applications. Like VoltDB, MemSQL uses a shared-nothing architecture, but a two-tiered one consisting of aggregator and leaf nodes (see Fig. 4). Aggregator nodes receive SQL queries and split them up across leaf nodes; the results are then aggregated and returned to the client. Leaf nodes store the data and process the incoming queries from aggregators. Communication is done through SQL syntax over the network. Thus, here we have a distributed architecture that uses hash partitioning to distribute data uniformly across leaf nodes. To achieve consistency, it uses multi-threading with multi-version concurrency control (MVCC) and lock-free data structures. This means that reads never block writes and vice versa [7].
Fig. 4. Two-tiered shared-nothing architecture [7].
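The two-tiered flow described above (an aggregator splits work across leaves and combines their partial results) can be sketched in a few lines of Python. The classes `Leaf` and `Aggregator`, the CRC32-based hash partitioning and the five-leaf setup are assumptions for illustration; real aggregators exchange SQL with the leaves rather than calling methods.

```python
import zlib


class Leaf:
    """Stores a shard of the rows and answers partial queries."""

    def __init__(self):
        self.rows = []

    def insert(self, row):
        self.rows.append(row)

    def partial_sum(self, column):
        return sum(row[column] for row in self.rows)


class Aggregator:
    """Routes inserts by hash of the shard key and merges partial results."""

    def __init__(self, leaves):
        self.leaves = leaves

    def insert(self, row, shard_key):
        index = zlib.crc32(str(row[shard_key]).encode("utf-8")) % len(self.leaves)
        self.leaves[index].insert(row)

    def query_sum(self, column):
        # Scatter the query to every leaf, then gather and combine the results.
        return sum(leaf.partial_sum(column) for leaf in self.leaves)


aggregator = Aggregator([Leaf() for _ in range(5)])
for i in range(100):
    aggregator.insert({"id": i, "amount": 2}, shard_key="id")
total_amount = aggregator.query_sum("amount")
```

The design point is that the aggregator holds no data itself, which is why adding aggregators helps with many clients while adding leaves helps with capacity.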
For fast concurrent operational workloads, MemSQL uses an in-memory row store. Analytical workloads are also feasible because of an on-disk column store. It uses snapshots and a write-ahead log so that no data is lost on system failures. Furthermore, sharding is done automatically, as in VoltDB, so there is no overhead for manual sharding. MemSQL is also ACID-compliant, which has positive effects on reliability and availability. Scalability is achieved through distribution based on the shared-nothing architecture. MemSQL scales linearly by adding nodes to the cluster, i.e., it scales horizontally. Nodes can be added online by running the rebalance partitions command after adding a node. The ratio of aggregator nodes to leaf nodes is important: a standard ratio of leaf nodes to aggregator nodes is 5:1. If there is a need to serve many clients, we need to add more aggregator nodes. In contrast, to meet larger capacity requirements, we need more leaf nodes.

2.3 NuoDB
NuoDB (formerly known as NimbusDB) [8] is a commercial database product from NuoDB, Inc. It is available as a free version (with limited functionality), a community version or a commercial enterprise version. The current version is v4.0.3.1. NuoDB was created for on-demand scale-out while still retaining ACID compliance in its transactions. It is designed as a distributed peer-to-peer application, which also keeps in mind the needs of attached distributed client applications [9]. NuoDB uses a patented two-tier architecture to provide its unique scale-out feature [10]. This allows NuoDB to split the synchronization of on-disk and in-memory data across different instances, which can then be scaled out freely. This sets NuoDB apart from traditional shared-disk or shared-nothing architectures and allows NuoDB to scale throughput, durability and storage independently of each other.

Figure 5 shows the scalable architecture of NuoDB. The first layer handles Transaction Processing (TP) and takes care of the atomicity, consistency and isolation parts of ACID compliance while being entirely in-memory, with a shared cache between nodes. In the community version, this layer can have up to three so-called Transaction Engines (TEs). These are instance nodes that handle all incoming SQL queries of applications using a built-in SQL parser, and that use an SQL optimizer to handle transactions with the second layer of the architecture. The caching of TEs enables this layer to quickly search its in-memory cache for results for incoming SQL connections and to communicate via a peer-to-peer architecture with other TEs, in case the first TE found no results for the incoming SQL query in its
own cache. Redundant data can also be held in the cache of different TEs, in case of a critical failure.
Fig. 5. Scalable architecture [10].
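The lookup order described for the transaction layer (own cache first, then peer TEs, then the storage layer) can be sketched as below. This is only a conceptual model under our own assumptions: the classes `TransactionEngine` and `StorageManager`, the key-value interface and the returned source labels are hypothetical and do not correspond to NuoDB's internal API.

```python
class StorageManager:
    """Stand-in for the storage layer: holds the persisted records."""

    def __init__(self, records):
        self.records = dict(records)

    def fetch(self, key):
        return self.records.get(key)


class TransactionEngine:
    """Stand-in for a TE with an in-memory cache and peer TEs."""

    def __init__(self, name, storage):
        self.name = name
        self.cache = {}
        self.peers = []
        self.storage = storage

    def lookup(self, key):
        # 1) Own in-memory cache.
        if key in self.cache:
            return self.cache[key], f"cache:{self.name}"
        # 2) Caches of peer TEs.
        for peer in self.peers:
            if key in peer.cache:
                self.cache[key] = peer.cache[key]
                return self.cache[key], f"peer:{peer.name}"
        # 3) Fall back to the storage layer.
        value = self.storage.fetch(key)
        if value is not None:
            self.cache[key] = value
        return value, "storage"


storage = StorageManager({"k1": "v1", "k2": "v2"})
te1 = TransactionEngine("te1", storage)
te2 = TransactionEngine("te2", storage)
te1.peers, te2.peers = [te2], [te1]

first = te1.lookup("k1")   # te1 has nothing cached yet
second = te2.lookup("k1")  # te2 finds the value in te1's cache
third = te2.lookup("k1")   # te2 now serves it from its own cache
```

The sketch ignores invalidation and versioning entirely; it only shows why repeated reads can be served without touching the storage layer.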
To ensure the consistency of data on this layer, NuoDB uses MVCC, like MemSQL. In general, a DBMS ensures consistency in one of two ways. Either it locks the resources used in a transaction until the transaction is finished, after which other transactions can access them again, or, in the case of MVCC, every action on a resource in a transaction uses a frozen copy of the resource in the state it had when it was requested. Every transaction produces new versions of the resource, and TEs can hold multiple versions of a resource simultaneously. A new version is called pending until its transaction commits successfully.

The second layer is the Storage Management. This layer has so-called Storage Managers (SMs), comparable to the TEs at the first layer. They are attached to the actual data storage and persist the data from TEs or answer their queries, in case an incoming SQL query could not be resolved among the TEs at the first layer. NuoDB can use table partitioning and storage groups to partition data onto different SMs (see Fig. 6). By default, every SM replicates the whole database. Activating partitioning and assigning storage groups to SMs partitions the total database into subsets that are persisted on specific SMs and mirrored on other SMs in the same storage group. The scalable architecture of NuoDB makes scaling the database fairly easy by simply adding TEs or SMs to the peer-to-peer network of the database, which also improves its availability while still providing ACID compliance and SQL support.
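The MVCC behaviour described for the transaction layer (frozen snapshots for readers, pending versions until commit) can be sketched as follows. This is a deliberately simplified model under our own assumptions: the `VersionedStore` class and its logical clock are hypothetical, and write-write conflict detection is omitted.

```python
class VersionedStore:
    """Toy multi-version store: readers see the snapshot taken at begin()."""

    def __init__(self):
        self.versions = {}  # key -> list of (commit_timestamp, value), ascending
        self.clock = 0      # logical commit counter

    def begin(self):
        return {"snapshot": self.clock, "pending": {}}

    def read(self, txn, key):
        # A transaction sees its own pending writes first.
        if key in txn["pending"]:
            return txn["pending"][key]
        # Otherwise it sees the newest version committed before its snapshot.
        for commit_ts, value in reversed(self.versions.get(key, [])):
            if commit_ts <= txn["snapshot"]:
                return value
        return None

    def write(self, txn, key, value):
        txn["pending"][key] = value  # stays pending until commit

    def commit(self, txn):
        self.clock += 1
        for key, value in txn["pending"].items():
            self.versions.setdefault(key, []).append((self.clock, value))
        txn["pending"] = {}


store = VersionedStore()
setup = store.begin()
store.write(setup, "x", 1)
store.commit(setup)

reader = store.begin()
writer = store.begin()
store.write(writer, "x", 2)
before_commit = store.read(reader, "x")  # pending version is invisible
store.commit(writer)
after_commit = store.read(reader, "x")   # reader keeps its frozen snapshot
fresh = store.read(store.begin(), "x")   # a new transaction sees the new version
```

Because the reader never waits for the writer and the writer never waits for the reader, the sketch also illustrates why reads and writes do not block each other under MVCC.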
3 Performance Benchmarking

Performance is typically one of the top evaluation criteria for any DBMS. Because the NewSQL database products are still relatively new, many users expect that the vendors will continue over time to improve functionality, usability, reliability and support. They are willing to accept products that may not entirely meet their wish-list specifications in these areas. However, they typically do want to get the best performance possible.
Fig. 6. Partitioning of tables using transaction processing (TP) and storage management (SG) [9].
The user’s level of satisfaction with an application is largely determined by that application’s performance, and the DBMS performance can in turn be a significant factor in determining the application performance. Since application performance is so important, many users would like to be able to predict the performance of a NewSQL database product on their applications before implementing the applications or even purchasing the product. Predicting the DBMS performance involves determining how well the NewSQL database will perform given the particular characteristics of the applications and their environment.

A typical approach to predicting the DBMS performance is to use a benchmark, which is representative of the application workloads without being a replica of the application itself. A benchmark provides a well-defined set of stress tests (transferable between different databases) that produce comparable key performance measures. Benchmarking is often done using widely accepted tools that help to create a standardized and similar connection to different databases and to run the same tests with the same workload on each of them [11]. We benchmarked NuoDB, VoltDB and MemSQL with a tool called Yahoo Cloud Serving Benchmark.

3.1 Yahoo Cloud Serving Benchmark
The goal of the Yahoo Cloud Serving Benchmark (YCSB) project was to develop a framework and common set of workloads for evaluating the performance of different “key-value” and “cloud” serving stores [12]. YCSB is an open-source benchmarking tool for evaluating key performance measures of different databases. YCSB uses a generic JDBC (Java Database Connectivity) connector to connect to the APIs (Application Programming Interfaces) of different databases, executes equivalent code and evaluates the relative performance of those databases. YCSB includes a set of built-in workloads (A–F), which can be executed against a database target. Each of these workloads tests different aspects of the underlying DBMS (e.g., only reading, writing or modifying data) and is designed to simulate real-life situations for that DBMS [13].

• A: Heavy updates, with a mix of 50/50 reads and writes;
• B: Mostly reads, with a mix of 95/5 reads and writes;
• C: Reads only;
• D: Inserts of records and reading them right after that;
• E: Reads within short ranges;
• F: Reads, updates and writing those changes back.

3.2 Benchmark Setup
Our benchmark setup included an Ubuntu Server machine (v18.04.3 LTS) with an Intel i9-9940X CPU, 64 GB of Corsair DDR4 RAM at 3000 MHz and a 1 TB Samsung 970 Evo Plus SSD. For each NewSQL database, the same default YCSB settings were used (see Fig. 7). All the databases were run right out of the box after installing the software. NuoDB and MemSQL used JDBC connectors (v21.0.0 and v8.0.18, respectively), whereas VoltDB used the connector shipped with YCSB. We repeated the execution of each workload on each NewSQL database ten times and took the average of the measured runtime and throughput to evaluate the database performance.
Fig. 7. Default settings of Yahoo Cloud Serving Benchmark [1].
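The averaging step described above can be sketched as follows. YCSB reports runtime and throughput for each run itself; the hypothetical helper `summarize_runs` only illustrates how repeated runs could be combined, and the numbers are made up for the example, not measurements from our benchmark.

```python
from statistics import mean


def summarize_runs(runs):
    """Average runtime and throughput over repeated runs.

    runs: list of (operations, runtime_ms) tuples, one per execution.
    """
    runtimes = [runtime_ms for _, runtime_ms in runs]
    # Throughput of one run = operations divided by runtime in seconds.
    throughputs = [ops / (runtime_ms / 1000.0) for ops, runtime_ms in runs]
    return {
        "avg_runtime_ms": mean(runtimes),
        "avg_throughput_ops_s": mean(throughputs),
    }


# Made-up example values, NOT results from the paper.
example_runs = [(1000, 400.0), (1000, 500.0)]
summary = summarize_runs(example_runs)
```

Note that averaging per-run throughputs is not the same as dividing the total operations by the total runtime; which of the two is reported should be stated alongside the results.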
3.3 Benchmarking Results
First, we looked at the overall runtime of each benchmark (see Fig. 8). The first thing we noticed was the very high bars for NuoDB during loading and running the workloads. Loading the workload data took NuoDB about 14 times longer than MemSQL and about 24 times longer than VoltDB. The most extreme values, however, occurred while running workload E, which took NuoDB over 18 min on average, about 8 times longer than MemSQL and VoltDB. One important observation here was that NuoDB needed much more time to ingest new data with inserts or updates. Comparing MemSQL with VoltDB, we observed that MemSQL needed slightly more time than VoltDB for the inserts performed while loading the workload data. However, MemSQL was overall slightly faster in queries and updates. Only when running workload C could NuoDB show some of its real potential, narrowly beating VoltDB and positioning itself between VoltDB and MemSQL.

Fig. 8. Average runtime of each workload on each NewSQL database in milliseconds.

In Fig. 9, we can see that our NuoDB installation had a serious bottleneck with inserts and updates, averaging only about 2,200 operations per second for the workloads that required those operations. At this point, we decided to conduct a second run of tests exclusively for NuoDB, with settings different from the standard ones with which NuoDB had been installed. The standard settings allowed NuoDB to use only 2 GB of RAM with one TE and one SM. Being limited by the functionality of the community version, we could enable only two more TEs, giving a total of three TEs and increasing the RAM usage up to 25 GB. We were surprised to see no better throughput after running all the workloads again ten times and taking the average; the results were very much like those we had observed in the first place.

Fig. 9. Average throughput of each workload on each NewSQL database in operations per second.
4 Conclusion and Future Work

This paper compared three NewSQL databases (viz., VoltDB, MemSQL and NuoDB) against three main criteria: architecture, scalability and performance. The performance comparison was supported by experimental data obtained by benchmarking the NewSQL databases with the Yahoo Cloud Serving Benchmark tool. The benchmarking tests were based on different workloads: some used more reads, whereas others used more writes. We saw that MemSQL performed best in most workloads, closely followed by VoltDB. Only in loading data was MemSQL not as fast as VoltDB. NuoDB had some trouble ingesting new data, so we have some doubts about the benchmarking results for NuoDB; in the related work [14], NuoDB was the best. We recall that the benchmarking tests were based on default (standard) settings and were not adjusted for a specific use case. For future work, different settings, especially for NuoDB, and some specific use cases could be considered as well. The results of the benchmarking tests also depend on how closely the selected benchmark (viz., YCSB) matches the intended application workloads.

Acknowledgment. Irina Astrova’s work was supported by the Estonian Ministry of Education and Research institutional research grant IUT33-13.
References

1. Hurwitz, J., Nugent, A., Halper, F., Kaufman, M.: Big Data for Dummies, 1st edn. Wiley, Hoboken (2013)
2. Padhy, R.P.: Google Spanner: A NewSQL Journey or beginning of the end of the NoSQL era. https://medium.com/rabiprasadpadhy. Accessed 06 June 2020
3. Coronel, C., Morris, S.: Database Systems: Design, Implementation and Management, 13th edn. Cengage Learning Inc, Boston (2018)
4. VoltDB, Inc.: VoltDB Documentation. https://docs.voltdb.com. Accessed 06 June 2020
5. Kacsuk, P., Podhorszki, N.: Dataflow parallel database systems and LOGFLOW. In: Proceedings of the 6th Euromicro Workshop on Parallel and Distributed Processing, pp. 382–388 (1998)
6. MemSQL, Inc.: MemSQL Documentation. https://docs.memsql.com/v7.0/introduction/documentation-overview/. Accessed 06 June 2020
7. Oliveira, J., Bernardino, J.: NewSQL databases - MemSQL and VoltDB experimental evaluation. In: Proceedings of the 9th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, pp. 276–281 (2017)
8. NuoDB, Inc.: NuoDB homepage. https://www.nuodb.com. Accessed 06 June 2020
9. NuoDB, Inc.: NuoDB whitepaper. http://go.nuodb.com/white-paper.html. Accessed 06 June 2020
10. NuoDB, Inc.: NuoDB for financial services. https://www.nuodb.com/nuodb-financialservices. Accessed 06 June 2020
11. Scalzo, B.: Database Benchmarking and Stress Testing, 1st edn. Apress Media LLC, New York (2018)
12. Verizon Media: Yahoo Cloud Serving Benchmark. https://research.yahoo.com/news/yahoocloud-serving-benchmark. Accessed 06 June 2020
13. Cooper, B.F., et al.: Core workloads of Yahoo Cloud Serving Benchmark. https://github.com/brianfrankcooper/YCSB/wiki/Core-Workloads/. Accessed 06 June 2020
14. Kaur, K., Sachdeva, M.: Performance evaluation of NewSQL databases. In: Proceedings of International Conference on Inventive Systems and Control (ICISC), pp. 1–5 (2017)
Internet of Art: Exploring Mobility, AR and Connectedness in Geocaching Through a Collaborative Art Experience

Pirita Ihamäki1 and Katriina Heljakka2(✉)

1 Prizztech Ltd., Siltapuistokatu 14, 281010 Pori, Finland
[email protected]
2 University of Turku, Siltapuistokatu 14, 28101 Pori, Finland
[email protected]
Abstract. This paper views the Internet as a platform that may be used to co-create art experiences and to support player collaboration. Our research presents an ethnographic study of preschoolers testing an Augmented Reality mobile application that includes connected artworks as part of a Geocaching letterbox. Driven by belief in the power of play to support exploring, learning and development, and the conviction that competencies with digital technologies will be necessary to ensure future literacy skills, play with technologies has become an integral part of educational provision for young children in developed nations. New cultural services, such as location-based, urban experiences, should convince players, educators, parents and the city’s policymakers of their value for playful engagement. Therefore, this paper aims to present a suggestion for how connectedness and collaboration can be supported through the Internet of Art in the context of playing on an augmented geocaching trail enhanced with artworks.

Keywords: Internet of Art · Augmented Reality app · Collaborative art · Geocaching · Connectedness
© Springer Nature Switzerland AG 2021 K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 282–299, 2021. https://doi.org/10.1007/978-3-030-63089-8_18

1 Introduction: New Ways to Experience Art

The latest mobile technologies have revolutionized the way people experience their environment. This research explored the opportunities of using the Sigrid-Secrets Mobile Augmented Reality (AR) application to enhance the user experience of art in the context of geocaching [1]. Chevalier and Kiefer (2018) define AR as real-time computationally mediated perception: mediated because there is the potential for the “Augmented” in AR to be a transformation of the environment as opposed to an overlay, as we typically see in functional AR systems (for example, mapping apps, location-based artworks, etc.) [2]. New AR technology invites new forms of perception and situated experience, made possible through mediation that no longer nuances one reality (real or virtual) over another, instead approaching them as one environment, as one relational system. Despite being at an early stage of development, AR technology is becoming available to the masses in ubiquitous forms like mobile gaming technology, and these new platforms are providing new ways of creatively altering our perception of the environment in more detailed, nuanced, multisensory, timely and perceptually believable ways. This is happening on top of a rising base level of pervasive technology, whose data merge both physical and mental constructs, such as physical artworks extended with digital dimensions by, for example, AR technology [3].

In this study, physical artworks are used as a point of departure in the creation of interactive pieces that are fundamentally augmented reality art within the geocaching game. In the context of expressive media, such as physical artworks used in connection with the location-based game of geocaching, a player needs to use technology to access the AR elements embedded in the artworks. In the Sigrid-Secrets game adventure, which simultaneously functions as an example of a connected and collaborative art experience, the players open an app designed for this purpose on their smartphones and bring physical artworks alive by using augmented reality technology. Current AR technology invites new forms of perception and situated experience, made possible through mediation that no longer nuances one reality (real or virtual) over another, instead approaching them as one environment, as one relational system [2]. Today, the interactive and immaterial nature of several digital art projects means that curators need to establish a connection between physical and virtual spaces, emphasizing the participatory nature of the artworks and the activated role of the audience. As Paul (2008) notes, curators engaging with digital media frequently mediate between the artist and the institution, between the artwork and the audience, and between the artworks and critics, creating new collaborative models of production and presentation [4].
We see the co-creating of art as a practice that may take place in this new, experiential realm through the creation of gamified, interactive artworks, using an augmented reality dimension created, used and shared within the geocaching game platform. Expressive media in combination with physical artworks integrated into a multiplatform, location-based game with both physical and digital dimensions, such as geocaching, gives players the possibility to combine the use of mobile technology with their own artistic creations. In order to extend their experiences of the physical artworks integrated into this location-based experience, users also need to download the Sigrid-Secrets AR mobile application. With the help of this application, operated through mobile technology, they can interact with physical artworks in novel ways and share their experiences of the artworks with other users. This type of connectedness between the areas of art and mobile gaming, between physical and digital realms, and between users as viewers, co-creators and social players presents a new way of spectating and sharing art experiences. Users of the Sigrid-Secrets app are simultaneously co-creators of and collaborators in an art experience as they operate their smartphones to ‘enliven’ physical artworks by using augmented reality technology and share pictures with Sigrid-Secrets augmented reality effects on social media channels.

This paper has two goals. First, it attempts to formulate a definition for the Internet of Art. Second, it presents a case example of a collaborative art experience, situated in the context of an urban geocaching trail in Finland: Sigrid-Secrets. In order to explore the experiences of players of this ‘artified’ game as an example of what we see as one dimension of the Internet of Art, we conducted a study with groups of 5–6-year-old preschool children and collected research data from the Geocaching.com website dedicated to Sigrid-Secrets, on which players of different ages have left comments about their experiences of playing the ‘artified’ geocaching trail. In the paper, we focus on analyzing two sets of data: the experiences of preschool children (n = 20) and their four teachers, who test-played the game, and a selection of comments made by individual geocachers and groups of geocachers, who have taken advantage of the communication opportunities and resources enabled by physical artworks connected to the communications infrastructure of the geocaching platform.
2 From Connectedness to Connected Arts

In this part of the paper, we review research related to Internet connectedness and connected arts, as well as their relationship with gaming. There is still a large grey area between the phenomenon of the Internet of Art and gameplay. However, as suggested in the paper, there are possibilities to connect these realms through the use of existing gaming platforms and mobile technologies. One proposed way of connecting art to mobile games is through the location-based game of Geocaching, which offers a platform for players to discover urban, cultural phenomena like the arts by experiencing physical artworks and augmenting them through their smartphones.

Internet connectedness is a multidimensional conceptualization of the importance of networked technology in a person’s everyday life. Connectedness suggests a relationship between a person and the Internet that is not captured or described adequately by traditional use measures, particularly measures based on time, such as hours of use [5]. Internet connectedness implies dynamic and ecological relations between individuals and the Internet, which are embedded in a larger communication environment composed of multi-level relationships among individuals, institutions, organizations and various storytellers, such as artists, including all available communication through media forms [6]. Loges and Jung (2001) have proposed that Internet connectedness is composed of three dimensions: 1) history and context, 2) scope and intensity and 3) centrality in a person’s life. The history and context dimension refers to the time over which the Internet has been integrated into the user’s everyday life, as well as the places in which one has access to the Internet [7]. Based on the theory of connectedness, as formulated by Loges and Jung, we will next discuss the dimensions of Internet connectedness in the practice of geocaching.
The context of connectedness [7] in geocaching leans simultaneously on the physical and the virtual: in geocaching, players use their smartphones and a GPS system to find geocaches situated in physical environments. Players may use the Internet in a broad range of ways, including tasks such as sharing photoplay (i.e., playful photography of toys, e.g., Travel Bugs, or of the players themselves) on social media during a geocaching trip, or leaving messages for other players by ‘logging’ (writing comments about caches for other geocachers) on dedicated websites.

The scope and intensity dimensions of connectedness [7] include the range of personal goals one seeks to attain through the Internet connection, the range of online applications one uses (such as browsing geocaches, chatting with other players or interacting with a local group of geocachers on Facebook) and the amount of time one spends on these activities. By adding considerations of goals and activities to the usual measure of the media relationship engendered by Internet use, the geocaching activity may additionally be seen through the lens of gamified interaction. This implies that playing geocaching is a goal-oriented activity that carries with it rules and goals for the interaction with the Geocaching platform and other players of the game.

The centrality dimension of Internet connectedness [7] refers to a person’s subjective evaluation of the Internet’s impact on playing the geocaching game in his or her personal life, and the extent to which a person would miss Internet access (with geocachers’ applications, chat conversations, Facebook groups, etc.) if it were no longer available.

2.1 From Digital Art to the Internet of Art
Rush (1999) has categorized Digital Art into five areas: Computer Art, Digitally Altered Photography, Art of the Worldwide Web, Interactive Digital Art and Virtual Reality. Digital art will evolve along with the development of digital technology [9]. Rachel Greene (2004) has written about Internet Art by exploring the evolution of web-based art as an independent genre [10]. Bolter and Grusin (2000) point out that on the mobile and computer screen everything is made visible to us through a window, which suggests that the digital connection can be communicated with real terms of engagement [11]. This type of interaction with art has not always been viewed as unproblematic. For example, Anne-Marie Schleiner (1999) notes that interactive artworks such as those in the online exhibition Cracking the Maze have faced criticism too: to describe an interactive artwork as ‘too game-like’ is a common pejorative. This statement illustrates the art world’s early attitude to digital games and their spreading influence. Schleiner (1999) states, “Like “hacktivist” Electronic Disturbance Theater’s net.art attacks on government websites on behalf of the Zapatistas, game hacking and distribution of game hacks online are art strategies that offer the possibility for artists to participate in cultural intervention outside of a closed art world sphere” [12]. Nevertheless, in recent years, digital and Internet-based art has attracted major interest from museums and galleries. Examples include Electronic Superhighway (2016–1966) at London’s Whitechapel Gallery in 2016, as well as Art in the Age of the Internet, 1989 to Today at the Institute of Contemporary Art, Boston, and I Was Raised on the Internet at the Museum of Contemporary Art Chicago, both in 2018 [13]. According to our vision, the Internet is a tool that can be used to produce art and to support artistic collaboration.
When we talk about a combination of the Internet and emerging technologies such as near-field communications, real-time localization and embedded sensors that let us transform everyday objects into smart objects (in this case, physical artworks) that can understand and react to their environment, we are presented with new opportunities to see the viewing and spectatorship of art turn into more participatory experiences through artworks, from which new types of interactivity arise. Such artworks are an example of the building blocks of the Internet of Things (IoT) that enable novel computing applications [8]. In our research, we suggest that the digital arts of today, when connected to networks, belong to the area of the Internet of Art, which has its roots in the development around the Internet of Things. The Internet of Things (IoT) is a concept referring to the connectivity of any device with the Internet. IoT could be considered a giant network of connected people or things (in this case, physical artworks), in which the connections are between things and things, people and things, or people and people [14]. Another recent term describing the phenomenon is the Internet of Intelligence, which points to the collection and analysis of public information, giving decision makers more options, more insight and more strategic power. This can also be referred to as “Open Source Intelligence”, which refers to the way in which Internet intelligence might be able to contribute to, for example, data validation [15]. Early examples of art situated in the context of the Internet of Art include work by Sean Clark (2017), who has developed connected artworks called the Internet of Art Things (IoAT). He has extended the idea of the Internet of Things to connect artworks and messages together. He talks about the “aesthetic of connectedness” and has worked on the Internet of Art Things infrastructure using open standards and modular technologies that will be around for a while [16].

Definition of the Internet of Art

We define the Internet of Art as having four dimensions:

1) Physical artworks, which extend to the digital art space (digital versions of artworks, which reside on the Internet);
2) Physical artworks, which include a digital dimension and which are connected to other artworks;
3) Physical artworks, which include digital interactive artworks (involving users in interacting with the artwork and with other users);
4) Digital artworks, which are connected to other artworks through the Internet and belong to the ecosystem of the Internet of Art.
We understand the Internet of Art as always being connected through the Internet. It usually connects artworks that can be physical artworks made in 3D as well as digital artworks (see Fig. 1).
2.2 Our Case Study: The Sigrid-Secrets Mobile Art Experience
In this case study, the example of the Internet of Art represents the third dimension: physical artworks that include a digital dimension and invite users to interact with the artwork and with other users to create collaborative creative experiences. Our case example, the Sigrid-Secrets geocaching trail, includes six small physical artworks hidden in an urban park in a coastal town in Finland. Every artwork includes a digital art ‘layer’, which consists of AR animations and participatory, fun learning exercises. The artworks are connected through a narrative, which at the same time leads the users to find the physical artworks by playing the geocaching game. To unlock the digital dimensions of the artworks, the players need to use the Sigrid-Secrets app: when a player finds one of the physical artworks and scans it with the application, the augmented reality artworks ‘come alive’ on the screen of the mobile device. For example, the player can see through the application how the physical artworks change color in their digital version, and how
Fig. 1. Internet of Art: four dimensions.
video content related to the physical artworks (such as animations) becomes available. Additionally, players can share their experiences of the mobile art experience by writing messages in the mobile application’s guest book, and can share pictures with the main character of the story, Sigrid (a little doll), on their preferred social media channels (see footnote 1). The case study described in this paper investigates both the Internet of Art and the Internet of Play, and in this way connects two areas of interest for the authors: visual art and location-based games (see Fig. 2).
3 Method
Häkkinen et al. (2003) suggested a multi-method approach that is process-oriented and takes into account different contextual aspects [17]. Our case study uses this approach to provide a holistic and complementary description of the Internet of Art’s possibilities for providing gamified user experiences through the geocaching game. Our study builds on two data sets. First, we use group interviews with players who tested the Sigrid-Secrets application and in this way experienced the Internet of Art on the geocaching trail. Second, we build on the knowledge
1
The fictive story behind Sigrid in the game is connected both to the cultural history of the city and to facts about the city in its current form. In the 19th century, the daughter of a famous businessman died at the age of 11, and her father built a mausoleum for her. The local museum holds a collection of her toys, which inspired us to create the semi-fictional story behind the mobile art experience created through geocaching. The Sigrid story continues in a similar geocaching experience set up in a neighboring coastal city. We have connected the geocaching trails of the two towns, with their 12 physical artworks and narratives, by using the geocaching platform for playing with the Internet of Art.
Fig. 2. Areas of interest for this study: The Internet of Play and the Internet of Art.
gained from the test group interviews by comparing the participants’ opinions of their experiences on the geocaching trail with testimonials of geocachers who have reported their experiences on the Geocaching.com website. By analyzing the experiences of the test players of Sigrid-Secrets and comparing them with geocachers’ general attitudes towards the trail and potential art experiences, we aim to answer the following research questions:
RQ1 (targeting the test group): What kind of promises related to art experiences do the Sigrid-Secrets app and geocaching fulfill for the player, as reflected on the Sigrid-Secrets’ own geocache website (i.e., on the websites and in the text of the Sigrid-Secrets app)?
RQ2 (targeting the geocachers): What are the geocachers’ opinions and observations of art experiences related to the Sigrid-Secrets geocache?
To analyze the viewpoints of the test group (20 preschool children and 4 teachers) and of geocachers who have experienced the Sigrid-Secrets geocache on their own, we collected two kinds of research materials. The first consists of group interviews conducted with the preschoolers after they had used the Sigrid-Secrets app, together with teachers’ observations of the preschool children experiencing the Internet of Art through the Sigrid-Secrets geocache. The second consists of material collected from the Geocaching.com website, focusing on geocachers’ comments on their experiences of Sigrid-Secrets. We selected these materials in order to understand the Internet of Art through the use of the Sigrid-Secrets app and through testimonials on the Sigrid-Secrets page on Geocaching.com. Both are online services and offer players a possibility to engage with Internet of Art-related experiences.
To analyze the research materials, we used content analysis. The goal of this method is to provide knowledge and understanding of the phenomenon under study [18]. It allows researchers to make a close reading of the data through a systematic classification process of coding and identifying themes and patterns. Researchers immerse themselves in the data to allow new insights to emerge [19]. The method is also described as inductive category development [20]. Using two sets of research materials enriches the holistic view of geocachers’ perspectives on the Internet of Art: their experiences help us to contextualize and elucidate geocachers’ motives in relation to the Internet of Art played through the geocaching game. The test group interviews inform us about the players’ evaluations of the Sigrid-Secrets app and their experience of the Internet of Art. Further, these interviews enable the articulation of aspects of experiencing the Internet of Art through the geocaching game and help us to contextualize and elucidate individual attitudes and behavior, based on personal motives and perceptions, in relation to using the Sigrid-Secrets mobile app on the geocaching trail. Next, we describe the material employed in our case study in more detail to illustrate the possibilities for the Internet of Art when connected to a geocaching trail.
3.1 Group-Testing the Mobile and Connected Art Experience
In our case study, we recruited 20 preschoolers and their teachers to test-play Sigrid-Secrets. The two authors, as researchers, guided the test-playing of the geocaching trail by walking the trail together and visiting the six artworks placed on the trail. The researchers used their own smartphones to run the Sigrid-Secrets app in order to unleash the augmented reality features composed for each of the artworks. We were mainly interested in two areas of inquiry: a) the reception of and reaction to the AR dimension of our mobile art experience, and b) the nature of connectedness, that is, its context, intensity, and centrality in our mobile art experience tied to geocaching. The test-players were separated into two groups, who walked the geocaching trail with the authors as moderators. All test-players were familiar with the technology: they had used iPads in preschool, and almost all of them had their own mobile phone. The test-playing of the trail was carried out in an explorative and non-task-oriented way, with an interest in testing the Sigrid-Secrets AR application with preschool children in its real context of a park situated in an urban milieu. In testing, we used the ‘think-aloud protocol’ technique by asking the play-testers to tell what they were doing, what they expected to happen next, and whether something unexpected happened while exploring the geocaching trail. The two researchers moderated the tests together with the preschool teachers. The instructions given to test users were predefined and written on paper. Moderators were given detailed instructions concerning the interaction with test users: how to give instructions, and when and how to prompt in problem situations. All moderators participated in data gathering, but only the researchers participated in the qualitative analysis of the gathered research materials.
3.2 Geocachers’ Experiences of the Sigrid-Secrets Geocache
In order to use a comparative approach, we collected a secondary data set from the testimonials given by geocachers on Sigrid-Secrets’ own page on the Geocaching.com website. At the time of research (March 2019), some 400 geocachers had left comments there about their experiences of Sigrid-Secrets. When analyzing this data, we established categories for the comments and made a close reading of all comments in which geocachers had commented on the art experiences or on particularly memorable experiences related to Sigrid-Secrets. These geocacher experiences are briefly described in the following. Geocachers play the geocaching game during special occasions when it fits their leisure time. For example, one geocacher said: “We have visited the cache with children when celebrating [… - a local event]. Thanks for the clear and comfortable cache”. Geocachers who appreciated the experience have given “stars”, shared their special moments at the geocache with peer players, and sent greetings to the creators of the geocache. Further, we evaluated geocachers’ connectedness to the Internet of Art. There are different types of geocaches, and the one in this case study is of a cache type called letterboxing. Letterboxing is an outdoor hobby that combines elements of orienteering, art, and puzzle solving. Individual letterboxes contain a notebook (logbook) and a rubber stamp, as the Sigrid-Secrets cache does as well. Some geocachers have their own stamps or marks, which they leave in logbooks. The scope and intensity dimension of connectedness includes the range of personal goals that geocachers seek to attain through an Internet connection, for example using the Sigrid-Secrets geocache website to share pictures, posted under their nicknames, of the geocache site or the artworks with others who find the Sigrid-Secrets geocache.
Geocachers attribute special meanings to some places, and they want to “leave their mark” in the logbook found in the cache at the end of the trail, like the one in Fig. 3.
Fig. 3. A Geocacher’s message at the Sigrid-Secrets geocache.
4 Results: Player Experiences of the Sigrid-Secrets Mobile and Connected Art Experience
4.1 Play-Testers’ Reception and Reactions to the AR Dimension
The results show that, overall, the preschool children as play-testers were engaged and connected with the geocaching game, which we introduced as the “Sigrid-Secrets Adventure Game”. In the game, new technology creates opportunities for artists and researchers to create an interactive public Internet of Art installation that merges physical material with digital content (here, Augmented Reality features), allowing social engagement and participation. In the Sigrid-Secrets Adventure Game, Augmented Reality (AR) techniques were used to transform and augment the users’ visual and auditory perceptions of the location (the special locations of a series of physical but digitally connected artworks). With the aim of creating the illusion that the physical artworks actually expand into the mixed reality environment in the player’s surroundings, the players engaged with augmented reality animations through a mobile device. The augmented reality effect was perceived as quite realistic. The participants reported that the digital content of the artworks made them excited and even caused some children to feel tension. The findings of our study show that by far the most engaging aspects of gameplay were connected to the treasure hunt game mechanics of the art experience. For the most part, participants found the treasure hunting mechanic satisfying and enjoyed the process of physically moving around the environment, hunting for Internet of Art-connected physical artworks. Exploring the urban environment to find artworks was considered a playful experience in itself. However, moments of surprise were experienced when using the Sigrid-Secrets mobile application, which made the artworks “come to life”. Overall, participants used the multimodal features of the Sigrid-Secrets geocaching trail, using the interface of the Sigrid-Secrets app as a guide when searching for physical artworks.
The participants reported having multisensory (augmented reality) experiences through visual and sound effects. For example, in the first artwork the augmented reality dimension shows a series of letters of the alphabet one by one, and those letters form the second name of the character, Sigrid. This feature made the test-players collectively ‘speak out loud’ and say those letters together. In this way, the participants solved the task together and received a pleasurable experience that made them connect with the Sigrid-Secrets adventure. Moreover, the results show that the urban location of the geocaching trail contributed to the overall game experience because the historical backstory (the narrative of Sigrid-Secrets) fits the physical context and the location of the geocaching trail.
4.2 Play-Testers’ Evaluation of the Connectedness
By investigating preschool children’s perspectives on the Internet of Art through the “artified” geocaching trail it is possible to evaluate the context perspective of connectedness: During their test-playing of Sigrid-Secrets, the preschool children undertook an outdoor learning excursion, and by doing so, engaged with the history of their home town by visiting the geocaching trail and experiencing the AR features delivered by the Sigrid-Secrets Augmented Reality app.
One of the participants described that “Sigrid-Secrets geocache trail was as an adventure and the most interesting thing about it was to search for the geocache under the ‘tree house’”. Another participant (a teacher) described how one of the children is “walking to kindergarten through this park on an everyday basis and has seen one artwork before, but now she can take her mom on this adventure and show her the secrets”. What connects the test-players to their environment is the geocaching game, because geocaches are situated in the physical environment of places that people use every day. However, when one plays geocaching, the real environment merges with the digital and, in our case, the augmented game-world. The test-players’ participation in the treasure hunt that geocaching represents also enables experiences of connectedness to the world of art through the Internet. Preschool children’s perspectives on the Internet of Art through the geocaching trail can also be evaluated through the perspective of intensity. In our study, we were interested in the test-players’ aesthetic responses to the physical artworks and their use of the Sigrid-Secrets app, in order to understand what kinds of experiences the Internet of Art comprises regarding sensation, perception, emotions, and self-reflection. As aesthetic experiences are highly individual, we observed significant variations in the children’s responses to the same artworks. The intensity perspective was examined with the test-players after the geocaching round: the participants were asked to draw their most memorable experience of the geocaching trail and to explain the drawings to the researchers. One of the drawings illustrated the location of the “Sigrid music box artwork” (Figs. 4 and 5).
Fig. 4. The “Sigrid music box artwork”.
Fig. 5. A preschooler’s drawing of the “music box artwork”.
Fig. 6. Using the Sigrid-Secrets app makes the physical artworks come “alive”.
In this example, the Sigrid-Secrets app shows an animated video in which the Sigrid doll plays an old music box. Participants considered this the most exciting Internet of Art-related artwork. The augmented reality experience was, according to the play-testers, something that amazed the children and showed how the “artworks come alive” (Fig. 6). In the “dancing doll artwork”, a preschooler remembered the doll making a ballet split (Fig. 7); this artwork includes an AR feature in which the doll can be seen dancing and making a split. Fig. 8, in turn, illustrates the drawing of one participant who described the “music box artwork” and the final geocache location as the most memorable experiences. Moreover, Fig. 9 illustrates the context and special location of the geocache “under the tree house”.
Fig. 7. A preschooler’s drawing of the “dancing doll” making a ballet split.
Fig. 8. A preschooler’s drawing of the “music box artwork” and the location of the final cache.
By looking at preschool children’s perspectives on the Internet of Art through “artified” experiences on the geocaching trail, it is possible to evaluate the centrality perspective of Internet connectedness, which refers to a person’s subjective evaluation of its impact on his or her personal life, and the extent to which a player would miss the Internet of Art if it were no longer available. The centrality of the connectedness of Sigrid-Secrets is perceived as relevant: according to our study, the “artified” game experience is meaningful for its test-players because, by playing the game, they reported learning more about their everyday environment. For example, there is an artwork in which Sigrid looks at an old cotton factory depicted in the physical artwork. When looking at this artwork through the augmented reality app, the picture comes alive and a historical film plays in which people row a boat across a local river. When watching this film, some of the participating preschoolers knew straight away that nowadays there is a bridge in this particular location. By making the drawings, the test-players started to remember their adventure on the Sigrid-Secrets geocaching trail. What all participants found particularly interesting was finding the actual geocache at the end of the trail, “under a tree house” (see Fig. 8 for drawings illustrating the “tree house” and Fig. 9 for the actual geocache location under a “tree house”). Some of the children wanted to take their parents on this adventure, because they were so excited about the physical artworks coming alive through the Sigrid-Secrets app. They also realized that using the Internet of Art dimension through the AR app is essential for the artworks to “come alive”: without playing the geocaching game in combination with the app, nobody knows the narrative behind the digitally connected artworks or, ultimately, what kind of Internet of Art experiences can be achieved through the app.
Fig. 9. Participants find the geocache under the “tree house”.
4.3 Geocachers’ Evaluations of the Art Experiences on the Sigrid-Secrets Geocaching Trail
In order to compare the results of the test-playing preschoolers with geocachers’ evaluations of playing the game on their own, we turned to the testimonials of players who had written comments on the Sigrid-Secrets page on the Geocaching.com website. For example, geocachers have seen the connectedness of the Sigrid-Secrets Internet of Art experience with its “sister-trail” in a neighboring city. One geocacher has written: “Friday’s bicycle trip ended with this favorite geocache. Letterboxes [like this one…] are nice. The route followed the same logic as [the other city], so the journey goes fast. Thank you for letterboxing, for which I give favourite points [the highest score]”. Geocachers’ perspectives on the Internet of Art through the geocaching trail can be evaluated from the perspective of context, as geocachers travel and make plans about what kind of experiences they want to find by playing the game. One geocacher describes the Sigrid-Secrets geocache in the following way: “Spending the weekend in [the city] and I started to look for the geocache during the evening. Thanks for the story, arts, and geocache”. Another geocacher wrote: “The context of this geocache is the story [that can be accessed] through the artworks situated in central area of [this city] and [the other city] areas”. Furthermore, geocachers’ perspectives on the Internet of Art through the geocaching trail can be evaluated through the intensity perspective: geocachers’ aesthetic responses to physical art comprise several types of experience, from sensation and perception to emotion and self-reflection. As with the play-testing preschoolers, the aesthetic experiences related to Sigrid-Secrets were considered highly individual. For example, one geocacher described the experience in the following way: “I found out about Sigrid-Secrets virtually some time ago when laying on the couch at home. Today I finally stopped over this geocache to admire the arts.
I could find the surprisingly peaceful, hidden place [for the geocache] I could log, which is in the middle of the city center in a lively place [in the city]. Thank you for this hybrid letterboxing geocache”. Geocachers see the intensity perspective of the Internet of Art on the geocaching trail in relation to their own goals in geocaching, for example, finding different types of geocaches. One geocacher describes: “The goal of our summer trip is to find as many letterboxing geocaches during each day. Now it’s five days back, so I had to drive from [one city] to [another city] because there are no others in between. Nice hiding place for this geocache”. Geocachers’ perspectives on the Internet of Art through the geocaching trail can also be evaluated from the centrality perspective, where Internet connectedness refers to a person’s subjective evaluation of the impact of the experience on his or her personal life, and the extent to which a person would miss the connection to the Internet of Art if it were no longer available. For the geocaching community, the mobile aspect of the location-based play experience seems to be the most central, and the narrative of the art experience of secondary interest. For example, one geocacher describes the centrality aspect of the experience in the following way: “Vacationing with a caravan, together with GeoMuggle [non-playing] father, junior Geomuggle and geodog [playing family members]. In the morning, we took the dog out. We have read the story but walked
straight to the final cache place. The weather started to be a bit too hot for our geodog. Thank you for the geocache”. Another geocacher describes the experience as follows: “Spending the weekend in [the city] and we visited this geocache in the evening. Thanks for the story and geocache”. A third geocacher writes: “Thanks for introducing the city [X], it was a nice letterbox that we finally found.”
5 Discussion
Experiences related to the Internet of Art present quite a new area of academic discussion. The study presented in this paper demonstrates how urban and augmented art experiences have been connected with the platform of the geocaching game. To our knowledge, the presentation of art together with the game mechanics of the geocaching game represents a novel combination of cultural experiences; as far as we know, artworks have not been connected with geocaches in this way before. The case is also unique because the artworks are physically located in an urban park environment, as part of a cityscape in Finland. As part of a geocaching trail, the connected artworks form an essential dimension of the play experience of Sigrid-Secrets, but other visitors to the park can also find the artworks without playing the game. Groups of test-players taken on a guided geocaching tour can, on the one hand, find the artworks because of their link to the story communicated on the Geocaching.com website. On the other hand, they can also use the Sigrid-Secrets app, which connects the artworks through the digitally mediated story of the game. The app enables the unleashing of the augmented reality features of the artworks. In future stages, players will be able to share their experiences of the artworks, for example by narrating the story of Sigrid-Secrets further and by taking augmented reality photographs with the Sigrid character, which can be shared with other users through social media channels. The aspect of social sharing is key to understanding the Internet of Art as connected art, in which users share their experiences and make their own interpretations of the character Sigrid. This means that artworks connected to the Internet of Art are never “finished”, because of the logic of player participation and sharing.
Therefore, Internet of Art-related experiences like Sigrid’s story are ‘endless’, as they invite users to co-create the narratives and online representations of artworks. In the view that theory and practice can each lead to developments of the other [21], a collaboration process that forces us to reposition our thinking can lead to new insights (creative and novel uses) for the Internet of Art in the technology space [22], produce positive outcomes of integrated cross-disciplinary knowledge [23], and identify requirements to support the geocaching game environment [24, 25]. The participants involved (players as collaborators with different backgrounds) share the belief that accessing and building on knowledge and understanding of the capacities of the Internet of Art and its associated constraints will allow creative exploitation of the Internet of Art in envisioned novel applications and approaches [26], as this case study has shown.
6 Conclusion and Future Research
According to the study at hand, the Internet of Art presents its players with connected and collectively creative experiences, in which the players have experienced physical artworks through a geocaching trail. The preschool children who test-played Sigrid-Secrets with their teachers reported valuing three aspects of this Internet of Art experience: 1) the excitement of the ‘treasure hunt’ mechanics of the location-based, urban geocaching game, 2) the Augmented Reality features of the digitally connected physical artworks, and 3) finding the actual geocache in the hidden spot under the ‘tree house’. For the geocachers, in turn, according to testimonials written on Sigrid-Secrets’ own website, the most valued aspects of the Internet of Art-related experience were reported to result, first, from the excitement of engaging in letterbox-type geocaching through mobility and exploration of the park (the sequence of artworks located in an urban context functions as a driver to explore the city area with its sights), and second, from finding the surprisingly well-hidden cache at the end of the trail. The Sigrid-Secrets geocaching trail includes six artworks. These artworks are connected through a fictional story. They are also connected with a “sister-trail” located in a city some 50 km from the first trail. Some geocachers have detected the connectedness between the two trails, like one geocacher who wrote: “The trail follows the same logic as the other trail [in another city], that is why the trip advances quickly….”. Connecting different geocaching trails through a common story and theme also represents new possibilities to employ the Internet of Art. The geocaching game is based on a location-based mechanic, which also means that locations are given special meanings within the game.
In future stages, we are building a new geocaching trail with a coastal theme, where Sigrid will be adventuring in seaside areas (see footnote 2). The possibility of connecting the Internet of Art to different locations adds to the context perspective: places that have special meanings for geocachers. Special locations, in turn, present opportunities for emotional engagement with places: geocachers connect meanings and experiences with particular places. By playing Sigrid-Secrets with the help of the app, geocachers are also presented with the unique possibility to extend their experiences of physical artworks virtually and collectively. The Internet of Art experienced through a geocaching trail enhances the trail in many ways: the fictional story of Sigrid-Secrets depicted in the physical artworks extends to the digital realm, in which the artworks “come alive” through videos and animations. People attach special meanings to places, and this also happens on the Sigrid-Secrets geocaching trail when people share their experiences and continue the art experiences by, for example, taking augmented reality photographs on the trail. This new AR technology offers players of the urban experience wider possibilities for creative exploration of the Internet of Art: the players can collaborate by continuing the Sigrid-Secrets story with their creative input. The Sigrid-Secrets app also shows the expansions of the participants’
2
In Finnish the name for this coastal trail is Sigrid-Secrets Merellinen Pori.
environment with AR enhancements, which become a shared social experience bridging public and personal playscapes. In sum, the case example of the Internet of Art (Sigrid-Secrets) presented in this paper includes modes of connectedness (physical, digital, and social), modes of context (the geocaching trail), modes of intensity (levels of engagement with the geocaching trail), and modes of centrality (computational relationships between sensing and mediation, user participation, and the public environment; in this case study, the geocaching trail is located in a public park). These modes may provide a helpful point of departure for future exploration of mobility, AR, and connectedness in the design and development of cultural experiences built on combinations of location-based gaming systems and augmented art experiences.
Acknowledgments. We wish to express our gratitude to the preschool children and their teachers for participating in our study. This study was conducted in affiliation with Pori Laboratory of Play (PLoP).
References
1. Stephen, C., Plowman, L.: Digital play. In: Brooker, E., Blaise, M., Edwards, S. (eds.) Sage Handbook of Play and Learning in Early Childhood, pp. 330–341. SAGE Publications Ltd, London, UK (2014)
2. Chevalier, C., Kiefer, C.: What does augmented mean as a medium of expression for computational artists? Leonardo Music J. 51(4), 263–267 (2018)
3. Genevro, R., Hruska, J., Padilla, D.: Augmented reality: peeling layers of space out of thin air. In: The New York Architecture Diary (2011). http://architecturediary.org/newyork/events/6961. Accessed 16 Jan 2016
4. Paul, C.: Curatorial Models for Digital Art, New Media in the White Cube and Beyond. Berkeley: University of California Press (2008)
5. Jung, J.-Y., Qiu, J.L., Kim, Y.-C.: Internet connectedness and inequality: beyond the “Divide”. In: Communication Research 28, pp. 507–535 (2001). Accessed 16 Jan 2016
6. Ball-Rokeach, S.J., Reardon, K.: Monologue, dialogue, telelog: comparing an emergent form of communication with traditional forms. In: Hawkins, R.P., Wiemann, J.M., Pingree, S. (eds.) Advancing Communication Science: Merging Mass and Interpersonal Process, pp. 135–161. Sage, Newbury Park (1988)
7. Loges, W.E., Jung, J.-Y.: Exploring the digital divide: internet connectedness and age. Commun. Res. 28(4), 509–537 (2001). https://doi.org/10.1177/009365001028004007
8. Kortuem, G., Kawsar, F., Fitton, D., Sundramoorthy, V.: Smart objects as building blocks for the Internet of Things. IEEE Internet Comput. 14(1), 44–51 (2010). https://doi.org/10.1109/mic.2009.143
9. Rush, M.: New Media in Late 20th Century Art (World of Art). London, New York, Thames and Hudson (1999)
10. Greene, R.: Internet Art (World of Art). London, New York, Thames and Hudson (2004)
11. Bolter, J.D., Grusin, R.: Remediation, Understanding New Media. Cambridge, MA: MIT Press (2000). Accessed 16 Jan 2016
12. Schleiner, A.-M.: “Parasitic Interventions: Game Patches and Hacker Art” (1999). http://www.opensorcery.net/patch.html. Accessed 16 Jan 2016
Internet of Art
299
13. Driscoll, M.P.: Art on the Internet and the Digital Public Sphere, 1994–2003, A Dissertation Doctor of Philosophy in Art History, University of California, Los Angeles (2018), Published by ProQuest LLC. Art on the Internet and the Digital Public Sphere. Accessed 16 Jan 2016 14. Morgan, J.: A simple explanation of “The Internet of Things”, In: Forbes (2014). A simple explanation of “The Internet of Things”. Accessed 16 Jan 2016 15. Bright Start Investigations, Why You Need Internet Intelligence Today. https:// brightstarinvestigations.com/why-you-need-internet-intelligence-today/. Accessed 12 March 2020 16. Clark, S.: Internet of Art Things/IOAT/ IOT ART, In Interact Digital Art, Artworks, Commissions & Collaborations (2017). http://interactdigitalarts.uk/artthings. Accessed 16 Jan 2016 17. Häkkinen, P., Järvelä, S., Mäkitalo, K.: Sharing perspectives in virtual interaction: review of methods of analysis. In: (Eds.) Wasson, B., Ludvigsen, S., and Hoppe, U.: Designing for Change in Networked Learning Environments, Proceedings of the International Conference on Computer-Support for Collaborative Learning. pp. 395–404. Kluwer Academic Publishers: Dortrecht (2003) 18. Barbara, L., Downe-Wamboldt, B,L.: Content analysis: method, applications and issues. In: Health Care for Women International,vol. 13, pp. 313–321 (1992) 19. Kondraki, N.L., Wellman, N.S., Amundson, D.R., Sundramoorthy, V.: Content analysis: review of methods and their applications in nutrition education. J. Nutr. Educ. Behav. 34(4), 224–230 (2002) 20. Mayring, P.: Qualitative content analysis. In Qualitative social Research 1(2), (2000). http:// dx.doi.org/10.107169/fqs-1.2.1089. Accessed 16 Jan 2019 21. Edmonds, E., Candy, L.: Relating theory, practice and evaluation in practitioner research. In: Leonardo 43(5), The MIT Press, pp. 470–476 (2010) 22. Woolford, K., Blackwell, A.F., Norman, S.J., Chevalier, C.: Crafting a critical technical practice. In: Leonardo 43(2), The MIT Press, pp. 
202–203 (2010). https://doi.org/10.1162/ leon.2010.42.2.202 23. Edmonds, E., Leggett, M.: How artists fit into research processes. In: Leonardo 43(2), The MIT Press, pp. 194–195 (2010) 24. Jones., S.: A systems basis for new media pedagogy. In: Leonardo 44(1), The MIT Press, pp. 88–89 (2011) 25. Candy, L., Edmonds, E.: Modeling co-creativity in art and technology, In: Proceedings of 4th conference on Creativity & Cognition, ACM, pp. 134–141 (2002) 26. Koh, R.K.C., Duh, H.B-L., Gu, J.: An integrated design flow in user interface and interaction for enhancing mobile AR gaming experiences. In: IEEE International Symposium on Mixed and Augmented Reality, Arts, Media & Humanities Proceedings, 13–16 October 2010, Seoul; Korea, pp. 47-52 (2010). An integrated design flow in user interface and interaction for enhancing mobile AR gaming experiences. Accessed 16 Jan 2016
Preservers of XR Technologies and Transhumanism as Dynamical, Ludic and Complex System
Sudhanshu Kumar Semwal, Ron Jackson, Chris Liang, Jemy Nguyen, and Stephen Deetman
Department of Computer Science, University of Colorado, Colorado Springs, USA
{ssemwal,rjackso7,cliang,jnguyen2,sdeetman}@uccs.edu
Abstract. Augmented, Mixed and Virtual Reality technologies, collectively XR technologies, are starting to mature. By incorporating deep-learning algorithms, these systems enable a new form of interaction, that of autonomous drones, which brings new forms of support and perhaps renewed interest in transhumanism, where technology can assist and enhance the quality of human life. This could lead to positive outcomes in elderly care, pandemic control based on localized situations, and possibly a better understanding of human conditions across the globe. XR technologies are therefore gaining tremendous possibilities, and an infusion of technology options, to go further in our quest to understand who we are and what we are about. When technology improves, new forms of interaction emerge, leading to discussions around positive Transhumanism, a somewhat limited idea in which technology has a positive impact on human life by solving problems and issues that could not be solved before. This paper looks at what transhumanism, also called Human-Imitation, is and how it relates to XR technologies. One practical application is learning. Our work is about shifting the conversation from memorization to more experiential learning. Such systems have appeared in the literature as ludic systems and embodied interactions. New XR technologies will make this more feasible.
Keywords: Augmented Reality · Virtual Reality · Augmented environment · Mixed Reality · XR technology · Complex systems · Ludic systems
© Springer Nature Switzerland AG 2021. K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 300–309, 2021. https://doi.org/10.1007/978-3-030-63089-8_19
1 Introduction
XR (Virtual Reality (VR), Augmented Reality (AR), Mixed Reality (MR)) and drones have the power to put computation all around us. In the future, as stealthier drones become the norm, these technologies will also carry less social weight, allowing them to become more pervasive for the benefit of all: hospitals, the elderly, and many others would benefit directly from such an infusion of technology. The emergence of new behaviors and patterns suggests that human abilities will increase. Mobile phones, for example, give us a device with relatively low encumbrance and low social weight, instead of carrying a whole notebook. New capabilities such as banking while on a hike, or even while paragliding, are available today. These were not possible with the bulkier desktop designs of the 1980s. Self-driving cars, drones, private satellites, digital content delivery, purchases, and many other forms of digital improvement are adding more capabilities to our arsenal of tools as we move towards a more digital world with more options. This raises the question of how technology will be perceived around us, and how far we can evolve beyond our current perceived physical and mental limitations through digital technology, leading to the idea of Computational Transhumanism [1]. Today, transhumanism remains prevalent compared with our past, particularly in the medical field and in lifespan adjustments. Examples in the medical field include prosthetics and complex machine-based interventions. Injuries within the body can possibly be mended by metal prosthetics replacing bones, and more. So transhumanism, the blending of technology around humans, does exist today.
2 Previous Work
In some societies, transhumanism and cyborg research translates to the idea of uploading the mind and consciousness to a server, where it can be accessed by a robot or other living entity in the future. This already exists in rudimentary form today, as the internet provides a way to permanently store our lives in pictures and videos for future generations to see. Transhumanism is such a gray area that no single definition may apply, because it varies from person to person. The obvious technology being used by universities and school systems to combat COVID-19 is the conferencing system, whose concepts have been around since the late eighties and nineties. Social distancing is possible through socializing virtually. Another area is haptic interfaces [2], which provide tactile stimulation and are an example of ideas that dominated thinking during the 1960s. Gesture understanding [3] in game spaces provides a way to communicate non-verbally and can provide tele-presence. Some may feel that transhumanism research is unethical, as it crosses the boundary of what makes a human. One way to capture events around a person is to use cameras [4]. This research relates to transhumanism in a couple of ways, the first being that if humans do get uploaded to a server in the future, their motion patterns can be downloaded as an avatar. Bar-Yam [5] asks how humans have evolved from primitive societies, where the goal was survival, to modern-day societies. He explores the idea of individualistic versus collective behavior, and how humans and human civilizations are a truly living organism that grows and evolves over time [5].
The interesting thing about humans and how they interact with one another is that, because human brains have evolved to the point of super complexity (compared to other animals), we are able to manipulate how a society evolves and grows depending on certain conditions, perhaps through some global control and rules. Obviously, evolution is a key factor as well. Yet as societies bring non-survival elements to the forefront, the important point is that humans are starting to act more on what they are thinking and less on what instinct is telling them, which separates us from other species, although this may have to be studied further. New interactions are a large part of our study, and the differences between random, coherent, and correlated behavior in a group of people and technology can be the focus of transhumanism research. In coherent behavior, there is some semblance of order, even within what appears to be chaos. People waiting in a crowded airport seem chaotic or random at the group level, yet each has a well-defined individual goal, perhaps catching a plane, and thus has no random or chaotic intent. Finally, there is correlated behavior [5]. This type of behavior is best described by a corporate structure, where tiers of people determine the chain of command.
3 Transhumanism Framed as Complex Systems
Transhumanism is about pushing the boundaries of humanity, and with that come radical ideas such as uploading consciousness into a robot to allow a person to continue to live, or adding technology to the body to connect ourselves to the “grid” in a sense. Many of us may shudder at the thought that the possibilities are limited only by the human mind and by how far scientists are willing to go to manipulate the human body and mind. Here, we would like to mention that post-humans (life after we die, the posthuman condition in [1]) are not our focus, and neither is the discussion of whether robots will take over humans and our world [31]. We do not yet have any tools to understand posthuman conditions scientifically, although the possibility of paranormal communication is found in every culture and in storytelling. Similarly, tools are not available to enter the worlds accessible to animal minds. Transhumanism, though, is a reality, and our work is focused on that. Also, when we frame the idea of transhumanism through Complex Systems research, we realize that there is no future for our human form as a transhumanistic robotic entity. Even if all our consciousness is loaded into a robotic mind, that entity will simply be different, not us. Two properties of complex systems are relevant here: (a) local interactions create global phenomena, and (b) sensitivity to initial conditions. Property (b) clearly allows us to claim that such a transhuman will not be the same as the human it is replacing. That leaves us with the possibility of emergence when we consider transhumanistic research along with what we can offer. This is also consistent with Melanie Mitchell’s ideas on complex systems. We think a case can be made that (a) and (b) above are rooted in the works of Darwin and can in fact help us understand the true nature of the human as a natural living dynamical system.
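Sensitivity to initial conditions can be illustrated with a minimal sketch. The logistic map used here is a standard textbook dynamical system and is not part of the work described in this paper; the function names, the parameter r = 4 and the size of the perturbation are assumptions chosen only for illustration.

```python
def logistic_trajectory(x0, r=4.0, steps=50):
    """Iterate the logistic map x -> r * x * (1 - x), starting from x0."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def max_divergence(x0, delta=1e-9, r=4.0, steps=50):
    """Largest gap between two trajectories whose starting points differ by delta."""
    a = logistic_trajectory(x0, r, steps)
    b = logistic_trajectory(x0 + delta, r, steps)
    return max(abs(p - q) for p, q in zip(a, b))


if __name__ == "__main__":
    # Two "copies" of the same system, started almost identically (assumed values).
    gap = max_divergence(0.2)
    print(f"largest gap over 50 steps: {gap:.3f}")
```

Under these assumptions, the two trajectories stay close for the first few iterations and then separate completely. This is the sense in which a copy that starts from an almost identical state need not remain the same system.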
Thus, the goal of our work is to expand human knowledge and, once again, to convey to the HCI community that transhumanism, just like computer science or anything man-built, is a tool that offers possibilities of new emergence, helps eliminate some of our sufferings, and increases the quality of our lives. Another example of transhumanism is the nested-marker technique for augmented reality [7], which is extremely popular. It is used daily to apply different filters to the face. Stephen Hawking is an example of what digital technology can deliver for a human. The combination of new drone-based XR technologies can therefore bring about capabilities that are as yet unimagined. Restraining the technology does not make sense, as perhaps both the best and the worst part of transhumanism is that we might break away from our biological limitations. The question then arises whether, without some biological influences, we would still have a sense for survival. As a downside, transhumanism may give people a false sense of hope. The question asked in [6] is whether a character’s personality plays a role. The paper reports that affinity towards a virtual character is a complex interaction between the character’s appearance and personality. Personality in virtual characters does indeed create a positive reaction in users. In 3D graphics and robotics, the uncanny valley phenomenon can be reduced by giving a virtual character its own personality, slightly different from that of humans.
4 Dynamical Systems and Sensitivity to Initial Condition
Transhumanism exists today across the entire world. Some parts of the world are more developed and others less so. Yet for every demographic, place or country, it can be argued that technology has an impact on possibly improving the quality of life. As an example, space technology provides communication with the outside world, and remote sensing helps teach people to grow more food than was possible before. For those parts of the world, technologies of this sort, and many others, have stimulated a great advance in the human condition in a relatively short period of time. These technologies may not have been initially developed with the intention of improving the worldwide human condition, but that is the nature of new technologies. Some believe that technology has been, and will remain, a double-edged sword. One can explore what it will mean to be human, or even post-human, as we move into a future where technology provides different options for moving forward. Yet we contend that there are no similarities among the human, the transhuman, and the post-human: they will be different because all three are dynamical systems of different complexities, still bound by sensitivity to initial conditions most of the time. The paper by Zibrek et al. [6] describes a large-scale (over 1000 participants) experiment in a virtual environment (VE) created with Unreal Engine 4.9 and containing a virtual human character (avatar) created with Autodesk 3ds Max 2015. The paper is an experimental study of user perception and acceptance of avatars based on their appearance and personality or behavior. This supports our thoughts around sensitivity to initial conditions, as the research nudges us in that direction.
The authors intended the experiment to be a virtual-reality-based investigation of the uncanny valley effect, which suggests that users have a negative response as the representation of characters becomes near photo-realistic. The avatar’s appearance was altered with five different render styles (Realistic, Toon CG, Toon Shaded, Creepy, and Zombie). The avatar’s animations reflected three different personality pairs: Agreeable/Non-Agreeable, Extraverted/Introverted and Emotionally Stable/Neurotic. The experiment requires users to first experience VR waiting and training rooms to become familiar with the VE and to receive instructions on how to proceed through the scenario. The user then proceeds to the experiment room, where they observe the avatar in one randomly assigned rendering and personality. After exiting the room, the user answers questions (categories include Empathy, Realism, Affinity and Co-presence) designed to explore the effect of render style on the user’s perception of the avatar. The authors conclude that user affinity toward an avatar is based upon a complex relationship between the avatar’s realism and behavior. However, they generally conclude that, contrary to the uncanny valley effect, avatar realism is viewed as a positive avatar characteristic in a VE. Another paper, by Gonzalez-Franco et al. [8], describes a VE experiment with 20 participants, built with a Fakespace Labs Wide5 HMD, VR software, position sensors on the hands and head, and balance data from a video-game Wii Fit Balance Board under the feet. The main point of the experiment is to investigate the user’s body ownership and agency over an avatar, based on how the avatar tracks and reflects the user’s body movements. Agency is the user’s sense that they are causing the avatar’s motions. In the VE, the user faces a virtual mirror containing the avatar. The experiment has two modes, both of which each participant experiences: synchronous and asynchronous. In the synchronous mode, the avatar reflects the upper-body motions of the user. In the asynchronous mode, the avatar’s motions are independent of the user. There are many examples of work in the HCI/VR/XR/MR areas of Computer Science that move us towards transhumanism, not just as robot mimicry but as extensions of our own capabilities using technology: mirror reflections and sense of presence [8]; drone-based vision [10] and interactions such as following a person [17]; creating new Augmented Reality environments using drones [20, 21]; fragrance rendering [11]; thermal taste [12]; techniques for exploring Virtual Environments [13]; two comprehensive books, on Virtual Reality [14] and Artificial Reality by Myron Krueger [15]; work on Schizophrenia [18]; and Complex Systems [31]. Possible uses of such technology to enhance both the human condition and the human organism in a very open and comprehensive manner will advance the research agenda of transhumanism.
For the practitioners of Transhumanism to succeed, they must establish a better understanding and control of this technology/human intersection and its boundaries. A fascinating journey may still await all of us, as technologies are connected and marching forward in the hope of enhancing human life, which appears to be the primary goal of transhumanism research.
5 Our Implementation of Avatars Facial Models
In one of our implementations of Virtual Worlds, designed for the Schizophrenia study and, in the future, for Autism, Unity and Oculus Rift™ provide an integrated capability to build applications that generate 3D Virtual Reality (VR) content. A city environment, head sounds and the social interactions of a person afflicted with schizophrenia are being modeled. One aspect of our applications is to create NPCs (non-player characters) that enrich the environment with facial expressions and different body poses corresponding to positive and negative verbal responses. Our first avatar is based on the existing Megan model in Unity3D, extended with facial expressions. This is perhaps a state-of-the-art 3D XR technology example (Fig. 1). One could contrast these facial features with our second avatar interpretation, where the focus is on creating faces in Unity3D™ that embed some of Ekman’s research and other recent motion capture examples (see Fig. 2).
Fig. 1. Avatar called Megan in Unity3D (Second author [19]).
Some Transhumanism adherents are seeking ways to extend human capabilities. There is also a desire to create a social environment that enhances human social interaction, or at least the parts we find most difficult, and to replace those parts with technologically feasible interaction. These technical advances could make our lives, and our human pursuits, much more efficient and effective in many ways.
Fig. 2. Avatar with a variety of facial expressions generated, including hair [20].
6 Ludic Systems and Concept of Preservers of the Work
Our work involving Line-Storm has been an attempt to provide a ludic system for use by the creative worker. A ludic system is one that is used for its own sake, and not for some other end [22]. One group of participants responded differently to the experimental condition than did the rest of the participants. There was also a group who responded differently to the control condition than did the rest of the participants, and we have termed these Preservers [23–30]. We find such usage in Heidegger’s work, which relates to embodied interactions such as those offered by XR technologies. In addition, we find a pathway: the idea that the cause-and-effect relationships that science generally craves can still be studied under the subtle science of complexity, where cause and effect are themselves dynamic in space and time. The idea is to study the behavior of the preservers of a system. Here, preservers are the participants who use the system for many hours and are intuitively connected to the usage of that system, thing or place. In the subgroup of preservers, cause and effect can be observed as a consistent pattern. Our work [16] on the Ludic system develops this idea further. The idea of preservers is based on the interests and learning, in an embedded and embodied environment, of a set of participants who are preservers (fans) of that system. We see this in all walks of life: politics, branding, pedagogy, philosophy, and lastly technology.
7 Conclusions and Further Research
Bostrom [1] discusses values such as technological freedom, diversity, pragmatism, caring about sentience, and saving lives. Inherent in that discussion could also be ideas around the absolute necessity of the worldwide pervasiveness of human freedom, the right of human self-determination, and the rights of the individual over the state. This is so important because the same freedom appears to be compromised in the day and age we live in: our personal data is now public, and everything on the internet has become permanently available. When did we give up on the concept of our data being our property? This discussion, which is intimately related to transhumanism or Human-Imitation as well, is gaining momentum in the international arena, for example around the right to be forgotten [9]. The place where technology, especially XR technology, will make significant contributions is in shifting the conversation about learning from memorization to embodied interactions. Deep-learning algorithms could memorize events, places, things and much more for us, so that we can become preservers of our own learning styles. Facts, outcomes and past results could be available thanks to these deep-learning machines, so that the perception of things [23–31], and the ability to make decisions in an instant knowing the past, will create an opportunity to dive into experiences. Learning by experimenting, with perception enhanced through XR (AR, VR, MR, Ambience/Tangible interfaces, and drones), would provide new possibilities on multiple scales. Learning will return to the idea that learning systems are, by themselves, the way they are; yet when a participant engages, the experience and knowledge of that participant will be enhanced, and the quality of that interaction will be better for some, the so-called preservers [23–30].
In other words, learning will once again become more experiential than it is today, and more interactive in the moment. Perhaps the trend towards Ludic systems will grow as the pendulum of learning swings from memorization to experiences for the participants/preservers and, of course, for the designers of these systems.
Acknowledgments. This paper is based on term reports and final exam questions for the VR and HCI (CS 6770) class offered during Spring 2020 at the University of Colorado Colorado Springs. The authors acknowledge the comments by reviewers. Special thanks to the second reviewer of our paper, who mentioned that Human-Imitation is synonymous with Trans-Humanism. We have borrowed that term into our paper as well. Thank you.
References
1. Bostrom, N.: Transhumanism values. https://www.nickbostrom.com/ethics/values.html
2. Dennis, M., Ramesh, R.: Second Skin Motion Capture paper with actuated feedback for motor learning. In: IEEE VR Conference, VR 2010, pp. 289–290, Waltham, MA, USA, March 20–24 (2010)
3. Michael, H., Paul, V., Joseph, J.L.: Breaking the status quo: improving 3D gesture recognition with spatially convenient input devices. In: IEEE VR Conference, VR 2010, pp. 59–66, Waltham, MA, USA, March 20–24 (2010)
4. Lee, J., Chai, J., Reitsma, P.S., Hodgins, J.K., Pollard, N.S.: Interactive control of avatars with human-motion data. In: SIGGRAPH 2002 Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques, pp. 491–500 (2002)
5. Bar-Yam, Y.: Complexity Rising: From Human Beings to Human Civilization, A Complexity Profile. In: Encyclopedia of Life Support Systems (EOLSS), under the auspices of UNESCO. EOLSS Publishers, Oxford, UK (2002)
6. Zibrek, K., Kokkinara, E., McDonnell, R.: The Effects of Realistic Appearance of Virtual Characters in Immersive Environments: Does the Character Personality Play a Role? IEEE Trans. Vis. Comput. Graph. 24(4), 1681–1690 (2018)
7. Tateno, K.: A nested marker for augmented reality
8. Gonzalez-Franco, M., Perez-Marcos, D., Spanlang, B., Slater, M.: The contribution of real-time mirror reflections of motor actions on virtual body ownership in an immersive virtual environment. In: 2010 IEEE Virtual Reality Conference (VR), pp. 111–114. IEEE (2010)
9. Right to be forgotten: https://www.siia.net/blog/index/Post/71354/Europe-Does-Better-onthe-Right-To-Be-Forgotten
10. Erat, O., Isop, W.A., Kalkofen, D., Schmalstieg, D.: Drone-augmented human vision: exocentric control for drones exploring hidden areas. IEEE Trans. Vis. Comput. Graph. 24(4), 1477–1485 (2018)
11. Hasegawa, K., Qui, L., Hiroyuki Shinoda, M.: Ultrasound fragrance rendering. IEEE Trans. Vis. Comput. Graph. 24(4), 1437–1446 (2018)
12. Kasun Karunayaka, N., Johari, S., Hariri, H., Camelia, K.S., Bielawski, A.D.C.: New thermal taste actuation technology for future multisensory virtual reality and internet. IEEE Trans. Vis. Comput. Graph. 24(4), 1496–1505 (2018)
13. Sitzmann, V., Serrano, A., Pavel, A., Agrawala, M., Gutierrez, D., Masia, B., Wetzstein, G.: Saliency in VR: how do people explore virtual environments. IEEE Trans. Vis. Comput. Graph. 24(4), 1437–1446 (2018)
14. Grigore, B., Philoppe, C.: Virtual Reality Technology. Chapters 2 and 3 (1994)
15. Myron, K.: Artificial Reality II. Chapters 1–3, ISBN-13: 978-0201522600 (1984)
16. Hans, C., Sudhanshu, K.S.: Line-Storm ludic system: an interactive augmented stylus and writing pad for creative soundscape. In: WSCG 2020 Conference, International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision, May 19–21 (2020)
17. Eric, V., Sudhanshu, K.S.: Autonomous Drone following a Person in Indoor Physical Spaces. MS Thesis, University of Colorado Colorado Springs, Spring (2019)
18. Ron, J.: A VR environment simulating what a Schizophrenia person will hear. Term Project report, CS 6770: HCI & VR course, Faculty: Dr. SK Semwal, Spring 2020, pp. 1–56 (2020)
19. Chris, L.: Simulating Micro-facial Expressions. Term Project report, CS 6770: HCI & VR course, Faculty: Dr. SK Semwal, Spring 2020, pp. 1–56 (2020)
20. Guillem, B.T., Sudhanshu, K.S.: Autonomous Parking Spot Detection System for Mobile Phones using Drones and Deep Learning, manuscript under review. In: International Symposium of Mixed and Augmented Reality, Brazil, pp. 1–11 (2020)
21. Damia, F.E., Sudhanshu, K.S.: A new Augmented Reality Paradigm using Drones, manuscript under review. In: International Symposium of Mixed and Augmented Reality, Brazil, pp. 1–10 (2020)
22. Gaver, B.: Designing for Homo Ludens, Still. In: (Re)Searching the Digital Bauhaus, pp. 163–178. Springer-Verlag London Ltd. (2009)
23. Heidegger, M.: Being and Time (J. MacQuarrie, Trans.). New York, NY: Harper Perennial (1962)
24. Heidegger, M.: Memorial Address. In: Heidegger, M., Discourse on Thinking (J. M. Anderson, E. H. Freund, Trans.), pp. 43–57. New York: Harper & Row (1966)
25. Heidegger, M.: On the essence of truth. In: Heidegger, M., Basic Writings (D. F. Krell, Trans.), pp. 111–138. San Francisco: HarperSanFrancisco (1993)
26. Heidegger, M.: The origin of the work of art. In: Heidegger, M., Basic Writings (D. F. Krell, Trans.), pp. 139–212. San Francisco: HarperSanFrancisco (1993)
27. Heidegger, M.: The question concerning technology. In: Heidegger, M., Basic Writings, pp. 307–344. HarperCollins (1993)
28. Heidegger, M.: The question concerning technology. In: Heidegger, M., Basic Writings: Ten Key Essays, plus the Introduction to Being and Time, pp. 307–344. HarperCollins (1993)
29. Heidegger, M.: Concealment and Forgetting. In: Heidegger, M., Parmenides (A. Schuwer, R. Rojcewicz, Trans.), pp. 77–86. Indiana University Press (1998). https://books.google.com/books?id=frwxZ3GWduYC
30. Rivlin, R., Gravelle, K.: Deciphering the Senses: The Expanding World of Human Perception. Chapter 1, pp. 9–28. Simon & Schuster, Inc., New York (1984)
31. Sudhanshu, K.S.: Complexity issues in virtual environments. In: 8th International Conference of Artificial Reality and Tele-Existence (ICAT 1998), Distinguished Invited presentation, pp. 27–32, December 21–23, Tokyo (1998)
Interview with a Robot: How to Equip the Elderly Companion Robots with Speech?
Pierre-André Buvet¹, Bertrand Fache², and Abdelhadi Rouam³
¹ Sorbonne Paris Nord University, 99 Avenue J.B. Clément, 93430 Villetaneuse, France
[email protected]
² Teamnet, 10 Rue Mercœur, 75011 Paris, France
[email protected]
³ Ontomantics, 959 Rue de La Bergeresse, 45160 Olivet, France
[email protected]
Abstract. In this paper, we present the UKKO dialogue system, which is intended to be implemented in humanoid robots, namely robots designed to help the elderly living in semi-hospital environments. First, we explain UKKO’s approach in comparison with other man-machine dialogue systems. UKKO takes advantage of work carried out in the field of linguistic intelligence. The system includes, inter alia, a natural language understanding module, a natural language generation module and a dialogue manager module. These modules use linguistic resources. Second, we discuss the analysis of language facts on which the man-machine interaction model relies. This analysis takes into account the following principles. Firstly, any statement is interpreted as a propositional content. Secondly, the minimal unit of any dialogue is a pair of successive statements. Thirdly, any dialogue is structured by rules which govern turn-taking and the way statement pairs are linked. Finally, we lay out the main technological aspects of the system. We present the UKKO modular architecture and describe its principal components.
Keywords: Dialogue system · Robotics · Conversational analysis · Natural language understanding · Natural language generation
© Springer Nature Switzerland AG 2021. K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 310–326, 2021. https://doi.org/10.1007/978-3-030-63089-8_20
1 Introduction
The present paper is about an intelligent dialogue system, called UKKO, developed to be implemented in personal assistant robots. After describing UKKO’s features in comparison with other dialogue systems, we present the analysis of language facts that allowed us to build a man-machine dialogue model. Thereafter, the main technological aspects are presented. Simulating human speech is a key challenge for robotics as it is productive. When they play the role of life companions, humanoid robots communicate with human beings. From this theoretical perspective, we are developing, in an experimental way, a French dialogue system which makes it possible, on the one hand, to simulate the human linguistic capacity to decode and produce messages and, on the other hand, to have elaborate verbal interactions with users. In the future, this system will be implemented in humanoid robots designed to accompany the elderly in their daily life. To achieve this objective, we rely on the understanding of human speech. Linguistic studies put forward distinctions such as language vs speech, prefabricated vs fabricated and competence vs performance. These oppositions make it possible to present language as a communication system based on a finite number of rules from which an infinite number of messages are produced. In all languages, rules are identified, described and explained in different models, which agree upon the existence of invariants that structure human speech, although they partially differ in their nature and operating mode. In our context, we use a model based on the works of Z.S. Harris [1] and M. Gross [2], and recent works of R. Martin [3]. This model led to the creation of formalized representations of language facts, which are exploited by tools developed for the processing of non-structured information [4]. The work we present in this paper concerns version 3 of our dialogue system. Unlike version 2, version 3 includes a conversational component. The new version, like the previous one, associates an incoming message, formulated by the user, with an outgoing message, formulated by the machine. In this way, the system is able to respond orally to an information request. In the current version, oral messages follow each other in a conversational form. This includes, besides the messages related to an information request, messages related to an action request and messages specific to the genre of the discussion. Moreover, both the user and the machine are able to initiate the conversation. After explaining our methodological approach, we present the data model used to develop the dialogue system. Thereafter, we explain how the system works.
2 Methodological Approach
Robots need five language competences in order to be able to speak: the lexical competence, the grammatical competence, the semantic competence, the pragmatic competence and the dialogical competence. These skills come into play in order to exchange information in the form of written or oral messages. This kind of exchange requires at least two participants: the speaker and the hearer. The first conveys a message by encoding the information to send, and the second receives and decodes the message in order to identify the information¹. The speaker and the hearer are two interlocutors whose roles are interchangeable, since the speaker becomes a hearer and the hearer becomes a speaker. This happens particularly when the communication takes place in a dialogue, namely when the two interlocutors interact orally with each other. The interaction is characterized by a succession of messages which are alternately sent by the interlocutors. Four mechanisms are used during the information exchange: 1) the interpretation of incoming messages; 2) the formulation of outgoing messages; 3) the linking of incoming messages to outgoing messages; 4) the coordination of the conversational flow. Interpretation and formulation mechanisms exploit all the language competences
1 A brief presentation of the communication diagram [5].
P.-A. Buvet et al.
mentioned above while the connection and coordination mechanisms exploit at least some of them. The five skills are interdependent in the sense that none can function apart from the other four. This interdependence shows the complexity of language [6]. Studies of language facts generally underestimate the interweaving of these competences, as they tend to represent them separately, from the point of view of their properties and operating mode. We particularly notice it in language didactics and in the analytic approaches of theoretical linguistics. This interweaving is also underestimated in natural language processing works, since they analyze language facts at distinct levels: the morphological level, the syntactic level, the semantic level and the pragmatic level [7]. The lexical competence is a linguistic knowledge which includes, on the one hand, the acquisition of the vocabulary of a language and, on the other hand, the appropriate use of this vocabulary in discourse situations. The grammatical competence is another linguistic knowledge which includes two parts: the acquisition of the morphosyntactic rules which govern statement patterns and the mastery of these rules in order to create well-formed statements [8]. These two competences are interdependent², since it is necessary to know in which grammatical conditions a word can be used correctly. For instance, in French, one sense of pleuvoir (to rain) is tied to the impersonal construction (il pleut sur Nantes) (it is raining over Nantes), whereas another sense of this verb is tied to a statement in which the verb is indirectly transitive, since the subject must be a nominal group (les coups pleuvent sur le pauvre Bill) (the blows rain on poor Bill). The other way round, mastering a morphosyntactic rule requires knowing its lexical scope. For example, in French, the passive voice is not applied in the same way to all verbs.
Thus, the verb voler (to steal) can be used in the passive voice, unlike subtiliser (to pilfer). The semantic competence has many aspects as it concerns all the dimensions of language: morphology, lexicon, syntax, phraseology, text, dialogue, etc. The interdependence between the lexical competence, the grammatical competence and the semantic competence is highlighted, inter alia, in the distinction between the lexical meaning (the meaning of the verb pleuvoir (to rain) as a meteorological phenomenon in il pleut sur Nantes (it is raining over Nantes)) and the grammatical meaning (the meaning of the verb pleuvoir (to rain) as a frequentative aspect marker in les coups pleuvent sur le pauvre Bill (the blows rain on poor Bill)) [10]. The pragmatic competence concerns the integration of what is extralinguistic in language production and the other way round. In the first case, the communication situation is taken into account in the message analysis (the sequence j’ai mal à la tête (I have a headache) is interpreted differently when it is used after a long sequence of work or in a drugstore) [11]. In the second case, the message has an impact on the communication situation, as in a formal declaration of marriage [12]. By definition, the pragmatic competence and the other competences are interdependent. The dialogical competence is a linguistic knowledge which allows a human being to interact with
2 This analysis is highlighted in the lexicon-grammar theory (lexique-grammaire) [9].
another human being. It includes the other four competences as it is linked up with them, yet has its own characteristics, as the mastery of turn-taking [13]. A robot with speech skills is equipped with a man-machine interface that allows it, to a certain extent, to interact with a human being. In other words, the robot is able to take the place of a human being and consequently has the capacity to interact with a human following the principle of turn-taking. From this point of view, the sequence of messages between a robot and a human is more or less long, as a conversation between two persons. In order to be able to communicate both in oral and written forms, a robot must be equipped with a man-machine interface that masters the mechanisms 1, 2, and 3, used during the information exchange. This happens when the robot responds to information requests only [14]. A conversation between a robot and a human being relies on the mastery of mechanism 4. The latter is composed of rules which govern the succession of incoming messages and outgoing messages. The four language mechanisms underlying information exchange between a robot and a human are generally supported by a dialogue system made up of many modules3. The incoming messages are interpreted by a module which analyzes their content. A matching module associates the incoming messages representations with the outgoing messages representations. The outgoing messages are produced by a module. A dialogue manager module takes over the message flow coordination. There are different types of dialogue systems. Two parameters allow to draw a distinction between the dialogue systems types: the data processing and the systems architecture. Data processing is based on many approaches. Static methods classify the incoming messages through identifying key words and associating them with the pre-registered outgoing messages. 
The dialogue systems based on such methods are inefficient since their results depend entirely on information prepared beforehand. Some of these systems improve their performance by using partially built messages which need to be completed by key words [15]. Semi-dynamic methods are based on information retrieval principles. Outgoing messages are stored and indexed in advance and then associated with incoming messages through similarity measures [16]. The dialogue systems based on these methods are more efficient than the previous ones because no prerequisites are needed during the analysis of the incoming messages. These systems are better suited to question answering than to simulating dialogues. Dynamic methods are divided into two types. Firstly, there are methods based exclusively on artificial intelligence, which use deep learning algorithms. The aim is to train a system so that it automatically produces outgoing messages through the analysis of incoming messages. This is done with a set of incoming messages appropriately associated with outgoing messages. Such systems lead to good results, especially when they deal with very constrained domains. However, their results depend on the quality of the data used to train the system [17]. Secondly, there are methods based on linguistic intelligence [18]. These methods rely on the analysis of both the operating mode and the properties of language.
3 For now, we ignore the speech-to-text and speech synthesis modules, which are used at the input and the output of the system.
They make it possible to develop IT tools which exploit formal descriptions. These descriptions are made up of the analysis results related to the processing of non-structured information. The two approaches are complementary. Linguistic intelligence provides skills that aim at structuring the calculations performed by algorithms. Artificial intelligence provides its computational power to devices based on linguistic intelligence. The supervised method used in deep learning models illustrates this point of view. Two modular architectures can be used: circular architecture and tubular architecture. In the first architecture, the information processing chain passes successively through different modules, one of which serves as an input and another as an output. The activation of these modules is done in a linear way, according to the information processing. In the second architecture, the modules work in parallel. The information processing chain is managed by a principal module, which manages the conversation flow and is linked to all the other modules. The other modules are activated according to the needs identified by the main module. The dialogue system presented here is based on linguistic intelligence and on a tubular architecture.
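To make the semi-dynamic, retrieval-based method described above more concrete, the following minimal Python sketch indexes a few outgoing messages in advance and selects the one most similar to an incoming message. It is an illustration only, not part of UKKO: the class name `RetrievalResponder`, the bag-of-words tokenization, the choice of cosine similarity and the sample messages are all assumptions introduced here.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Turn a message into a lowercase bag of words (assumed preprocessing)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bags of words (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class RetrievalResponder:
    """Hypothetical responder: outgoing messages are indexed once, in advance."""

    def __init__(self, outgoing_messages):
        self.index = [(message, tokenize(message)) for message in outgoing_messages]

    def respond(self, incoming: str) -> str:
        """Return the stored outgoing message most similar to the incoming one."""
        query = tokenize(incoming)
        best_message, _ = max(self.index, key=lambda entry: cosine(query, entry[1]))
        return best_message


# Invented sample messages, for illustration only.
responder = RetrievalResponder(["Dinner is served at seven.", "The light is now on."])
print(responder.respond("What time is dinner?"))
```

Such a sketch needs no prepared keyword list, which matches the advantage claimed for semi-dynamic methods, but it only ever returns a stored message, which is why this family of methods suits question answering better than open dialogue.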
3 Modeling
The design of information exchange focuses on either the elementary nature or the global nature of the information. In the first case, it is about sending and receiving a message. In the second case, the messages follow one another as in a dialogue. The first design is based on the notion of statement whereas the second uses the notion of conversation. It is appropriate to study the notion of statement first. Semantics plays an important role in the four mechanisms underlying information exchange (cf. supra). From this point of view, the notion of propositional content is essential since it helps to explain how a statement is encoded and decoded. A statement has two dimensions: a semantic dimension and a morphosyntactic dimension. The first dimension emerges in a propositional content; the second concerns the form of the statement. The concept of predicate, borrowed from logic, is generally used in order to define a proposition. This concept has two distinct meanings depending on whether it is based on Aristotelian or non-Aristotelian logic. According to the first meaning, the predicate is an attribute, a property. The proposition is then considered as the application of a predicate to a subject, as shown by this binary representation: Proposition => Subject + Predicate [19]. According to the second meaning, the predicate is a function, in the algebraic sense of the term. As a result, the proposition takes the form of the following representation: Proposition => Predicate(Argument) [20]. Consequently, predicates correspond to functions whose variables are arguments. In other words, propositions correspond to the projection of arguments on predicates. Here, we adopt the second meaning of the term predicate and the conception of the proposition it involves. That is, the proposition stems from a predicate-argument structure [21].
A propositional content is defined as the occurrence of a predicate-argument structure in which the predicate is an oriented relationship between entities assimilated to its arguments. For instance, the predicate-argument structure TRANSFERT_FINANCIER(ETRE_HUMAIN1/ORGANISME1, PRODUIT2/CONTRIBUTION2, ETRE_HUMAIN3/ORGANISME3) (FINANCIAL_TRANSFER(HUMAN_BEING1/ORGANISM1, PRODUCT2/CONTRIBUTION2, HUMAN_BEING3/ORGANISM3)) is a representation of an oriented relationship in language, symbolized by the predicate TRANSFERT_FINANCIER (FINANCIAL_TRANSFER), that involves three entities respectively symbolized by ETRE_HUMAIN (HUMAN_BEING) or ORGANISME (ORGANISM) for arguments 1 and 3, and PRODUIT (PRODUCT) or CONTRIBUTION (CONTRIBUTION) for argument 2. Predicate-argument structures represent a linguistic knowledge shared by the members of a language community [22]. They are logico-semantic in nature. Statements such as Monsieur Dupont, achetez une Peugeot à monsieur Durand!; La société Martin vend une part de ses actifs à la société Dubois; Monsieur Lefevre paie sa taxe immobilière à l’Etat (Mister Dupont, buy a Peugeot from Mister Durand!; The Martin company sells part of its assets to the Dubois company; Mister Lefevre pays his property tax to the State) are related to this predicate-argument structure. Although they differ in terms of their propositional content, these statements share the same predicate-argument structure. The communication situation brings out these differences. It is determined by different parameters related to the speaker and to the enunciation conditions, including the time and the place of enunciation [23].
The functional representation of the predicate-argument structure also applies to propositional contents, except that the predicate belongs to a particular type and the arguments are instantiated: ACHAT(Monsieur Dupont1, Peugeot2, Monsieur Durand3) (PURCHASE(Mister Dupont1, Peugeot2, Mister Durand3)); VENTE(société Martin1, actif2, société Dubois3) (SALE(Martin company1, asset2, Dubois company3)); PAIEMENT(Monsieur Lefevre1, taxe immobilière2, Etat3) (PAYMENT(Mister Lefevre1, property tax2, State3)). The predicate TRANSFERT_FINANCIER (FINANCIAL_TRANSFER) includes the semantic types ACHAT (PURCHASE), VENTE (SALE) and PAIEMENT (PAYMENT), and its arguments ETRE_HUMAIN1 (HUMAN_BEING1), ETRE_HUMAIN3 (HUMAN_BEING3), ORGANISME1 (ORGANISM1), ORGANISME3 (ORGANISM3), PRODUIT2 (PRODUCT2) and CONTRIBUTION2 (CONTRIBUTION2) are instantiated by different denominations and representations. The same propositional content can produce statements that differ in form: Monsieur Dupont, achetez une Peugeot à monsieur Durand!; Monsieur Dupont achète-t-il une Peugeot à monsieur Durand?; Monsieur Dupont achèterait une Peugeot à monsieur Durand (Mister Dupont, buy a Peugeot from Mister Durand!; Is Mister Dupont buying a Peugeot from Mister Durand?; Mister Dupont would buy a Peugeot from Mister Durand).
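The relation between a predicate-argument structure and the propositional contents that instantiate it can be sketched in a few lines of Python. This is only an illustrative data model under stated assumptions: the names `Argument`, `PropositionalContent` and `instantiate`, and the dictionary layout, are hypothetical and do not come from the UKKO implementation; only the predicate, its semantic types and its argument classes are taken from the example above.

```python
from dataclasses import dataclass

# Semantic types subsumed by the general predicate (taken from the example above).
PREDICATE_TYPES = {"FINANCIAL_TRANSFER": {"PURCHASE", "SALE", "PAYMENT"}}

# Admissible semantic classes for argument positions 1, 2 and 3.
ARGUMENT_CLASSES = {
    "FINANCIAL_TRANSFER": (
        {"HUMAN_BEING", "ORGANISM"},
        {"PRODUCT", "CONTRIBUTION"},
        {"HUMAN_BEING", "ORGANISM"},
    )
}


@dataclass(frozen=True)
class Argument:
    form: str            # denomination, e.g. "Mister Dupont"
    semantic_class: str  # e.g. "HUMAN_BEING"


@dataclass(frozen=True)
class PropositionalContent:
    predicate: str
    arguments: tuple

    def __str__(self) -> str:
        # Same notation as in the text: each argument carries its position index.
        args = ", ".join(f"{a.form}{i}" for i, a in enumerate(self.arguments, 1))
        return f"{self.predicate}({args})"


def instantiate(structure: str, predicate_type: str, *arguments: Argument):
    """Build a propositional content if it fits the predicate-argument structure."""
    if predicate_type not in PREDICATE_TYPES[structure]:
        raise ValueError(f"{predicate_type} is not a type of {structure}")
    slots = ARGUMENT_CLASSES[structure]
    if len(arguments) != len(slots):
        raise ValueError(f"{structure} expects {len(slots)} arguments")
    for position, (argument, allowed) in enumerate(zip(arguments, slots), 1):
        if argument.semantic_class not in allowed:
            raise ValueError(f"argument {position} cannot be {argument.semantic_class}")
    return PropositionalContent(predicate_type, tuple(arguments))


purchase = instantiate(
    "FINANCIAL_TRANSFER",
    "PURCHASE",
    Argument("Mister Dupont", "HUMAN_BEING"),
    Argument("Peugeot", "PRODUCT"),
    Argument("Mister Durand", "HUMAN_BEING"),
)
print(purchase)
```

In this sketch the shared structure acts as a filter: any statement whose arguments fall outside the admissible classes is rejected before a propositional content is built.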
The injunctive, interrogative and hypothetical natures of these three statements, respectively, set them apart from purely informative statements such as Monsieur Dupont achète une Peugeot à monsieur Durand (Mister Dupont buys a Peugeot from Mister Durand). This semantic nuance shows the speaker’s communicative intention. Message decoding is a process whose purpose is to interpret the propositional content related to a statement by taking into account the communication situation as well as the speaker’s communicative intention. The decoding analysis in the present work associates statements with the logico-semantic representation of their propositional contents. It processes both their explicit and implicit dimensions. Message encoding is a process which takes a propositional content, a communication situation and the speaker’s communicative intention as its starting point, and the statement form as its arrival point. The encoding analysis shows the rules that make it possible to generate a well-formed statement from a propositional content. Four of the five language skills come into play during the decoding and the encoding processes: lexical competence, grammatical competence, semantic competence and pragmatic competence. During decoding as well as encoding, the enunciation can concern verbal interactions [24], such as conversations. Therefore, it is not only about interpreting and formulating messages, but also about ensuring consistency in order to promote information exchange. Turn-taking is established between two speakers. It relies on a dialogic relationship which is highlighted when their utterances meet. The parameters used to analyze a dialogue are: 1) “the conversational objective”; 2) “the flow”; 3) “the topic”; 4) “background” [25]. Parameter 1 indicates the purpose of the dialogue. The objective must be shared by the two interlocutors of the conversation.
Otherwise, it will generate misunderstandings and incomprehension. The role of parameter 2 is to identify different steps of the dialogue, particularly the ones tied to social conventions such as how to start and end a conversation. Parameter 3 allows to specify the topic of the conversation. Parameter 4 corresponds to the context. The topic and the background do not belong to the same category. The first is related to an area of knowledge while the second is about the communication situation. Unlike the context, the topic must be shared by the interlocutors. It is shared when the extralinguistic information is used to interpret statements. It is not necessarily shared when it is about specifying what triggered the conversation. Its initiator, when speaking first, has potentially a personal objective, a dialogical intention, which will be revealed progressively through information exchange. As a result, the dialogical intention does not cover necessarily the conversational purpose4. The minimal unit of dialogue is a pair of utterances formulated by interlocutors who speak one at a time by alternating turns [26]. The connection between the two utterances is based on their complementarity, that is, the utterance content must be built on the previous one. In a dialogic unit, interlocutors are engaged in each utterance. Both the first utterance and the second utterance are successively encoded and decoded. The discussions flow progresses half as fast as the information exchanges flow which involves, from the speaker’s point of view, the message formulation (outgoing message) and, from the hearer’s point of view, the message interpretation (incoming
4 Socrates' maieutics is a good illustration of this aspect of the dialogue; cf. Plato’s Theaetetus.
message). The interaction core is the continuity carried out between the outgoing message and the incoming message. Fig. 1 summarizes these different aspects of the dialogue5. The chain of utterances pairs is carried out in a cooperative way even in situations of confrontation and competition between participants. A conversation is based on the mutual involvement of the two interlocutors as “the interaction is a […] complex process of coordination of actions [as…] based on a co-presence relationship, participants of the interaction make the meaning of their actions and their understanding of what is going on mutually intelligible to each other” [27]. On the basis of this dialogical contract, the conversation gradually progresses since each utterance is a reply to the previous one. A reply is sometimes attributed to the speaker. It involves generally a break and two utterances [28], especially when it is related to the phatic function of the communication [29]. The speaker has to make sure that the information is sent to the hearer. Utterances more or less conventional take into account this aspect of the conversation, as Vous m’avez compris? (have you understood me?). Conversely, a hearer when becoming a speaker, is supposed sometimes to ask for the clarification of the previous utterance [30]. The understanding of the previous reply is the starting point of a reply. Its arrival point is the formulation of its statement. Therefore, at this level, the mechanism that links incoming messages to outgoing messages takes place. The outgoing message must be adapted to the incoming message in order to ensure the interactive nature of utterances which follow one another. For instance, an utterance about a choice to make must be completed by an utterance which demonstrates that a person makes a choice among the suggested alternatives, hesitates in his choice or refuses to make a choice. 
From this point of view, the semantic competence is necessary in order to build relationships between messages. The chain of utterances pairs and the pairs of utterances are the only elements available in the conversational analysis [26]. The study conducted on pairs of utterances has shown invariants, especially in situations related to the discussion start and end. The conversation is governed by the principles of coherence and thematic progression [31]. Finally, the conversation uses discursive strategies with respect to the communicative intention of the speaker who starts the conversation [32]. These parameters are processed at the level of the mechanism that takes over the conversational flow coordination. The different aspects of modeling are taken into account, with different degrees, in the dialogue system presented in the next section.
5 We notice that the representation of pairs does not cover all the possible situations. For example, the same speaker can produce two utterances in a row with a break in the middle. Also, a pair can be made up of an utterance and a break. On the other hand, the final utterance is not always an utterance of the speaker who initiated the dialogue.
[Figure 1: a chain of utterance pairs, from the initial pair (Mi, Mi+1) through an intermediate pair (Mn, Mn+1) to the final pair (Mf-1, Mf).]
Fig. 1. The utterance chain
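The utterance chain of Fig. 1 can be approximated by a small Python sketch that groups a sequence of utterances into initial, intermediate and final pairs. Several points are assumptions made for illustration: pairs are taken to be non-overlapping, a missing second utterance is modeled as `None` to stand for a break, and the names `Utterance`, `Pair` and `chain_pairs` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Utterance:
    speaker: str
    text: str


@dataclass
class Pair:
    first: Utterance
    second: Optional[Utterance]  # None stands for a break (no reply)
    role: str                    # "initial", "intermediate" or "final"


def chain_pairs(utterances: List[Utterance]) -> List[Pair]:
    """Group successive utterances two by two and label each pair's role."""
    pairs = []
    for start in range(0, len(utterances), 2):
        first = utterances[start]
        second = utterances[start + 1] if start + 1 < len(utterances) else None
        pairs.append(Pair(first, second, "intermediate"))
    if pairs:
        pairs[0].role = "initial"
    if len(pairs) > 1:
        pairs[-1].role = "final"
    return pairs


# Invented dialogue, for illustration only.
dialogue = [
    Utterance("user", "Hello."),
    Utterance("robot", "Hello, how are you?"),
    Utterance("user", "What do we eat tonight?"),
    Utterance("robot", "Soup and fish."),
    Utterance("user", "Thank you."),
    Utterance("robot", "You are welcome."),
]
for pair in chain_pairs(dialogue):
    print(pair.role, "|", pair.first.text, "|", pair.second.text if pair.second else "(break)")
```

As noted in the footnote on the representation of pairs, real conversations include cases this grouping does not capture, such as two utterances in a row by the same speaker.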
4 Technological Aspects
Version 3 of the dialogue system is made up of six modules: 1) a dialogue manager, which plays a significant role in the system; 2) a speech-to-text module; 3) a speech synthesis module; 4) a natural language understanding (NLU) module; 5) a natural language generation (NLG) module; 6) a module that associates the NLU output with the NLG input. Fig. 2 shows the architecture of the dialogue system. The system is based on a tubular architecture. The dialogue manager is the main module. This is why it takes over the information processing chain by using the other modules according to the tasks to be carried out and the function of each module. The information is processed through connections established between the main module and the other modules. The information is processed at two levels: the elementary level and the global level. At the elementary level, the statement pairs are analyzed as the minimal units of dialogue. At the global level, the chain of statement pairs is analyzed. Therefore, in module 1, the entire dialogue is sequenced in the form of statement pairs. Two situations are possible, depending on whether the dialogue initiator is a human being or a robot. In the first situation, the main tasks carried out are the following: task a) the user’s statement is received by module 1 in the form of an audio file and transferred to module 2, which transforms it into a text file; task b) the text file is sent to module 4 in order to be associated with a symbolic representation; task c) this representation is sent to module 6 in order to be associated with another representation; task d) the new representation is sent to module 5, which provides a new text file; task e) this text file is transmitted to module 3 in order to be converted
[Figure 2: the dialogue management module (1) is connected to the speech-to-text (2), speech synthesis (3), NLU (4), NLG (5) and NLU-NLG link (6) modules.]
Fig. 2. The architecture of the dialogue system
into an audio file, activated by module 1. The five tasks make it possible to complete the initial utterance with a new utterance, formulated by the robot. Module 1 then manages the chain of statement pairs on the basis of the same principles, while ensuring the coherence of the dialogue until the end. In the second situation, tasks a), b) and c) are not used to process the initial statement. A symbolic representation is sent directly to module 5, so that tasks d) and e) provide an initial statement generated by the robot. Then, from the next utterance, formulated by the user, the statement pairs and their chain are managed in the same way by module 1. The operating mode of the natural language understanding and natural language generation modules relies on a large amount of linguistic resources. The natural language understanding system incorporated in the dialogue system exploits three types of linguistic resources: 1) electronic dictionaries; 2) local grammars; 3) ontologies. Type 1 resources are lexical words associated with standardized linguistic meta-information [33]. For instance, the electronic dictionary related to body parts lists all the lexical items⁶ included in this concept and associates each item with a normalized description that states its grammatical category as well as its semantic class. Type 2 resources are thorough morphosyntactic descriptions related to propositional contents [34]. As formal representations of contextual elements, local grammars correspond to the simplest class of grammars according to Chomsky’s hierarchy [35]. They are characterized by their combinatorial power. Local grammars rely on electronic dictionaries in order to process lexical data. They are implemented in the form of
6 Lexical items are simple words (for instance, hand) or compound words (for instance, forearms), and common words (for instance, head).
automata that explore texts in order to identify information. Type 3 resources make it possible to explicitly describe an area of extralinguistic knowledge by indicating the related vocabulary [36]. Ontologies complement electronic dictionaries and local grammars. They contribute to the processing of implicit messages in discourse⁷. The three types of resources make it possible to reproduce the lexical, grammatical, semantic and pragmatic competences of a human being, from the point of view of information decoding. The joint use of the three types of resources in this module corroborates the interdependence of these language skills. The function of the natural language understanding module is to associate a symbolic representation with an incoming message formulated by a human interlocutor. The message representation relies on the formalism that describes propositional contents in terms of predicate and argument (cf. supra). It takes into account the explicit and the implicit aspects of messages. For instance, the incoming message J’ai soif (I am thirsty) is represented as follows: SENSATION_SOIF(INTERLOCUTEUR_HUMAIN) & INJONCTION(INTERLOCUTEUR_HUMAIN, INTERLOCUTEUR_ROBOT, DON(INTERLOCUTEUR_ROBOT, INTERLOCUTEUR_ROBOT, BOISSON)), that is, FEELING_THIRST(HUMAN_INTERLOCUTOR) & INJUNCTION(HUMAN_INTERLOCUTOR, ROBOT_INTERLOCUTOR, DONATION(ROBOT_INTERLOCUTOR, ROBOT_INTERLOCUTOR, DRINK)). The first part of the representation refers to the explicit message whereas the second part captures the implicit content of the message. The linguistic resources exploited in the natural language generation module are stored in a database. The different properties of statements are formally described in the database tables. The table dedicated to predicates is the mother table, since predicates play a central role in the representation of propositional contents (cf. supra). It includes three types of columns. Firstly, there are columns directly related to the nature of predicates: form and grammatical category.
Secondly, there are columns which refer to tables related to the other properties of statements. These daughter tables represent the characteristics of the predicates: distribution, construction and actualization⁸. Thirdly, there are columns which specify the symbolic representations used at the beginning of the generation of outgoing messages and the pragmatic particularities of the messages. These columns make it possible to identify the descriptions of the predicates used. Some of the daughter tables, such as CONSTRUCTION and RECONSTRUCTION, act as mother tables. A construction corresponds to the syntactic form of a propositional content. The construction depends on the form of the predicate which constitutes the propositional content. For instance, the verbal predicate manger (to eat) has X0 V (X1) as its standard construction, that is, a subject followed by a verb and an optional object complement: Tom mange un gâteau/Tom mange (Tom eats a cake/Tom eats). Other syntactic forms are possible: Tom, il mange un gâteau/Tom, il mange/Le gâteau,
7 Inferences are calculated by rules which make use of ontologies according to the communication situation. For instance, when the human speaker says J’ai soif (I am thirsty), the robot interlocutor must understand il faut me servir à boire (you need to get me a drink).
8 The actualization here concerns the specification of verbal tenses and noun determiners, which must fit the communication situation.
Tom le mange/C’est Tom qui mange un gâteau/C’est un gâteau que Tom mange/… (Tom, he eats a cake/Tom, he eats/The cake, Tom eats it/It is Tom who eats a cake/It is a cake that Tom eats/…). These variants of the standard construction are reconstructions. They are described in the eponymous table⁹. The table DISTRIBUTION specifies the argument domain of the predicate, that is, the arguments with which it combines in order to constitute a propositional content. It is both a daughter table and a mother table since it refers to the table ARGUMENT. The latter describes all the possible forms of an argument, which is also semantically described in the table DISTRIBUTION. There are also tables devoted to other aspects of statement formulation, like the table MORPHOLOGY, which handles variable words with respect to their context (conjugation of verbs, inflection of nouns and adjectives). Other tables are used to process compound words by means of their conjugation or their inflection. They also handle the ontological and pragmatic aspects of the enunciation (for instance, the family relationships that concern the user). Finally, there are tables dedicated to morphological and syntactic adjustment rules which make it possible, as a last resort, to revise wrong formulations, since some rules have a far-reaching impact. For instance, the wrong statement *La météo est au belle (*The weather is at beautiful) is due to a subject-adjective agreement rule. This error is corrected by the following adjustment rule: est au belle (is at beautiful) → est au beau (is beautiful), which generates the following correct statement: La météo est au beau (the weather is beautiful). The relationships between tables model the interdependence between the lexical, grammatical, semantic and pragmatic competences (cf. supra). The role of the natural language generation module is to produce an outgoing text message from a symbolic representation.
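As a rough illustration of how a standard construction and its reconstructions could be stored and instantiated, the Python sketch below fills templates for the predicate manger with the forms given above. It is a simplification under stated assumptions: the template strings, the labels such as `cleft_subject` and the function `realize` are hypothetical, the pronoun il is hard-coded for a masculine subject, and conjugation and determiners are supplied ready-made instead of being computed.

```python
# Templates for the construction X0 V (X1) and some of its reconstructions.
# The labels are illustrative names, not those of the UKKO tables.
CONSTRUCTIONS = {
    "standard": "{x0} {v} {x1}",
    "dislocated_subject": "{x0}, il {v} {x1}",
    "cleft_subject": "C'est {x0} qui {v} {x1}",
    "cleft_object": "C'est {x1} que {x0} {v}",
}

# Reconstructions that make no sense without an object complement.
REQUIRES_OBJECT = {"cleft_object"}


def realize(name: str, x0: str, v: str, x1: str = "") -> str:
    """Insert already-inflected forms into the chosen (re)construction."""
    if not x1 and name in REQUIRES_OBJECT:
        raise ValueError(f"{name} needs an object complement")
    sentence = CONSTRUCTIONS[name].format(x0=x0, v=v, x1=x1)
    # Collapse the extra space left when the optional object is absent.
    return " ".join(sentence.split())


for name in CONSTRUCTIONS:
    print(realize(name, "Tom", "mange", "un gâteau"))
print(realize("standard", "Tom", "mange"))
```

Choosing among several templates for the same propositional content is one simple way to obtain varied outgoing messages from a single symbolic representation.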
The formulation of a statement is carried out by an algorithm which relies on the descriptions of language facts stored in the database. The principles which have guided the development of the algorithm are the following [14]:
1) the predicative principle: it determines whether the symbolic representation concerns a simple or a complex predication¹⁰;
2) the distributional principle: it specifies the semantic properties of the predicate and its arguments;
3) the pragmatic principle: the ontological and pragmatic information is used to determine the number of possible answers;
4) the lexical principle: it specifies one of the lexical units related to the semantic categories of the predicates and the arguments;
5) the syntactic principle: it selects the construction, or one of the reconstructions, related to the predicative forms;
6) the instantiation principle: the linguistic forms that result from principle 4 are inserted in the positions saturated by predicates and arguments in the construction or the reconstruction;
7) the actualization principle: the predicative tense and the determiners related to arguments are specified in order to conjugate verbs, inflect nouns and adjectives, and attribute a gender, a number and, if necessary, a person to determiners and pronouns.
These principles are applied in an ordered way. They are completed by morphological adjustment rules and syntactic rules in order to generate well-formed statements. The huge diversity of outgoing messages allows robots to simulate a human conversation. This diversity is due to the combinatorial power of the natural language generation system, which is based on the lexico-syntactic variety of predicates and arguments.

The role of the module that associates the output of the natural language understanding module with the input of the natural language generation module is to ensure the coherence of statement pairs, so that the outgoing message is consistent with the incoming message. It manages the interface between the decoding of one piece of information and the encoding of another, so that the second is either an extension of the first, as it completes it, or a break mark, as it respects the conventions in use. It handles inferences and paraphrasing as well. This module plays an essential role at the elementary level.

The dialogue module operates at the global level. Its function is to manage the conversations between a robot and a human being from start to finish. By using conversational scenarios created in relation to a topic¹¹, this module takes into account human-initiated as well as robot-initiated dialogues. The length of a conversation is variable. It is sometimes limited to three utterances (typically an information request sent by the user, followed by a response and a concluding utterance generated by the machine). It usually consists of about ten utterances. Currently, the robot is able to respond to two types of requests: information requests (for example, Que mange-t-on ce soir? (What do we eat tonight?)) and action requests (for example, Allume la lumière! (Turn on the light!)). These requests and the responses are integrated in scenarios that include Exclusively Conversational Items (ECI) formulated by the robot¹².

⁹ Two predicative forms which share the same construction do not always share the same reconstructions. Two columns, one in the table of predicates, the other in the table of reconstructions, reflect these constraints.
¹⁰ Unlike the simple predication, the complex predication includes a predicate-argument structure in its argument domain.
¹¹ For example, the topic ROOM leads to scenarios related to the room cleaning, the movement to the room, the heating of the room, the room luminosity, etc.
¹² A human-machine dialogue is not strictly comparable with a human-human dialogue.

The ECIs fall into two categories: items that are used to manage a conversation, on the one hand, and items that are used to ensure that the conversation runs smoothly, on the other hand. The first category includes conversational items that help to start a conversation (for instance, greetings such as bonjour (good morning) or comment allez-vous? (how are you?)) or to close it (for example, j’espère que ma réponse vous a satisfait (I hope you are satisfied with my answer) or je suis à votre disposition lorsque vous aurez besoin de moi (I am at your disposal when you need me)). The second category concerns utterances which make it possible to set up a conversation (for example, que puis-je faire pour vous aider? (What can I do to help you?)) or to pursue it (for example, souhaitez-vous parler d’autre chose? (do you want to talk about something else?)), utterances that correspond to polite formulas (for example, je vous remercie de l’avoir fait (thank you for doing it)), utterances which check the correct transmission of the information (for example, est-ce que vous m’avez bien compris? (do you understand me correctly?) or suis-je assez clair? (am I clear enough?)) and utterances requiring clarifications. The latter can include a restatement of the previous utterance either in direct speech (Vous avez dit: « je veux être raccompagné dans ma chambre » ? (You said: “I want to be escorted back to my room”)) or in indirect speech (Avez-vous demandé à être raccompagné dans votre chambre? (Have you asked to be escorted back to your room?)). In the second situation, a reformulation by means of a paraphrase is also possible (for example, Voulez-vous que je vous emmène jusqu’à votre chambre? (Do you want me to take you to your room?)). The utterances related to a repetition request fall under the second category as well (for example, Pourriez-vous répéter ce que vous avez dit, s’il vous plait, je n’ai pas bien compris? (Could you please repeat what you have said, I have not quite understood)).

Scenarios implemented in the dialogue module are organized by topic. Each topic is handled by several scenarios. A scenario is made up of incoming messages and outgoing messages. The incoming messages are utterances formulated by a human interlocutor. The outgoing audio messages are the utterances generated by the robot in reaction to those of the human interlocutor. The incoming and outgoing messages are processed by the other modules of the dialogue system, cf. supra. As a result, utterances produced by a robot inside the same scenario take different forms. This is due to the lexico-syntactic variety inherent to statements generated by the natural language generation module. The combinatorial power of the dialogue system generates about 30000 different conversational forms that fall under the same topic¹³. The form of a scenario changes according to the initiator of the conversation, that is, whether it is a human being or a robot. There is an asymmetry between the utterances generated by the robot and those produced by the human being, since the latter, in such a context, is not bound by the social conventions which characterize a human-human conversation. Typically, utterances related to the opening or the closing of a conversation are often omitted in the human-initiated conversation. Likewise, in general, the human interlocutor does not use polite formulas.
Such utterances are necessarily produced by the robot since its conversation simulates a human being conversation, and, as a result, relies on the conversation flow conventions. However, it is possible that a human being talks to a humanoid robot as he would do to a human being as long as it is being humanized, since it is designed to be a life companion [37].
¹³ The formula to get this result is the following: ACF = [(AST)(max(AUS/2, AVU) -1)] with ACF standing for the average of conversational forms, AST for the average of scenarios in a topic, AUS for the average of utterances in a scenario, AVU for the average of variants in an utterance. In this case, AST = 8, AUS = 5, AVU = 6 and ACF = 32767. The fact that only what the robot says is calculated explains the division of ACF by 2 (1 utterance is produced by the robot, the other by its human counterpart).

5 Conclusion
The combinatorial power is defined as the linguistic capacity to produce a huge number of varied statements in order to convey the same semantic content [40]. Lexical variety, syntactic variety and actualization variety are the basic properties of the combinatorial power. These varieties allow a speaker to formulate a broad range of statements from the same propositional content, in accordance with his communicative intent and the communication context. As part of the development of the UKKO system, the combinatorial power and its properties are modeled in such a way that a robot has the capacity to simulate a conversation with a human being.
6 Perspectives
The linguistic resources are central to the dialogue system presented in this paper. Its functioning depends on the quality of the descriptions of language facts, especially in the NLU and NLG modules. This is all the more true since, in these modules, the resources are not of the same nature. The natural language understanding module relies on an efficient semantic analysis engine which exploits exhaustive descriptions of the French language, among other languages, in the form of local grammars and electronic dictionaries. The natural language generation module relies on an algorithm which exploits descriptions stored in a database. It is mainly these descriptions that must be expanded. Languages have the particularity of producing an infinite number of utterances from a finite number of linguistic units [38]. The combinatorial power explains this discursive productivity [39]. This is partly due to the interdependence of the main linguistic skills, especially the semantic, lexical and grammatical competences. This interdependence is modeled in the database in such a way that the syntactico-semantic behavior of each lexical unit acting as a predicate is described. This description is used in the natural language understanding module (NLU). In order to improve its performance, the dialogue system must be able to process new conversational topics rapidly. From this point of view, it is necessary to improve the descriptions of predicates in the database. We are also developing, as part of the extended intelligence, a learning system which will rely on the existing descriptions to automatically identify new predicates and their syntactico-semantic behavior. In this way, the database will be enriched with new descriptions.
References 1. Harris, Z.S.: Notes du cours de syntaxe. Seuil (1976) 2. Gross, M.: Les bases empiriques de la notion de prédicat sémantique. Langages. 15, 7–52 (1981). https://doi.org/10.3406/lgge.1981.1875 3. Martin, R.: Linguistique de l’universel: réflexions sur les universaux du langage, les concepts universels, la notion de langue universelle. Académie des inscriptions (2016) 4. Silberztein, M.: Formalizing natural languages: the Nooj approach. ISTE Ltd ; John Wiley & Sons, Inc, London, UK : Hoboken, NJ, USA (2016) 5. Schéma de Jakobson, https://fr.wikipedia.org/wiki/Sch%C3%A9ma_de_Jakobson 6. Mufwene, S.: L’émergence de la complexité langagière du point de vue de l’évolution du langage. In: La Clé des Langues. ENS de LYON/DGESCO, Lyon (2010) 7. Pierrel, J.-M. (ed.): Ingénierie des langues. Hermès Science publications, Paris (2000) 8. Parisse, C.: La morphosyntaxe : Qu’est ce qu’est ? - Application au cas de la langue française ? Rééducation Orthophonique. (2009) 9. Lexique-grammaire, https://fr.wikipedia.org/wiki/Lexique-grammaire
10. Blanco, X.: Valeurs grammaticales et structures prédicat-argument. Langages. 176, 50 (2009). https://doi.org/10.3917/lang.176.0050 11. Bracops, M.: Introduction à la pragmatique : les théories fondatrices : actes de langage, pragmatique cognitive, pragmatique intégrée. De Boeck Supérieur, Louvain-la-Neuve (2010). https://doi.org/10.3917/dbu.braco.2010.01 12. Austin, J.L.: Quand dire, c’est faire. Éditions du Seuil, Paris (1970) 13. Kerbrat-Orecchioni, C.: Les interactions verbales. Armand Colin, Paris (1990) 14. Buvet, P.-A., Fache, B., Rouam, A.: How does a robot speak? About the man-machine verbal interaction. In: Kuc, T.-Y., Manzoor, S., Tiddi, I., Masoumeh, M., Bastianelli, E., Gyrard, A. (eds.) The 3rd International Workshop on the Applications of Knowledge Representation and Semantic Technologies in Robotics. CEUR, Macau (2019) 15. Gouritin, T.: L’arnaque chatbots durera-t-elle encore longtemps?, https://www.frenchweb.fr/larnaque-chatbots-durera-t-elle-encore-longtemps/305697 (2018) 16. Bateman, J., Hovy, H.: Computers and text generation. In: Butler, C. (ed.) Computers and written texts. B. Blackwell, Oxford, UK ; Cambridge, USA (1991) 17. Cambrai, T.: L’intelligence artificielle. Albin Michel (2017) 18. Buvet, P.-A.: Linguistique et intelligence. In: Linguistique et…. Peter Lang (in press) 19. Ducrot, O., Todorov, T.: Dictionnaire encyclopédique des sciences du langage. Seuil, Paris (1989) 20. Blanché, R.: La logique et son histoire, d’Aristote à Russell. Armand Colin (1970) 21. Blanco, X., Buvet, P.-A.: Présentation. In: Les représentations des structures prédicat-arguments. pp. 3–6. Larousse, Paris (2009) 22. Mejri, S.: Le prédicat et les trois fonctions primaires. In: Souza Silva Costa, D. and Bençal, D.R. (eds.) Nos caminhos do léxico. Editora UFMS, Campo Grande do Sul (2016) 23. Charaudeau, P.: Langage et discours: éléments de sémiolinguistique; (théorie et pratique). Hachette, Paris (1983) 24. Sacks, H.: Lectures on conversation. Blackwell (1992) 25. Caelen, J.: Stratégies de dialogue. In: Modèles formels de l’interaction (2003) 26. Harvey, S.: Perspectives de recherche, problèmes d’épistémologie en sciences sociales. (1984) 27. de Fornel, M., Léon, J.: L’analyse de conversation, de l’ethnomethodologie à la linguistique interactionnelle. Histoire Épistémologie Langage 22, 131–155 (2000). https://doi.org/10.3406/hel.2000.2770 28. Sacks, H., Schegloff, E.A., Jefferson, G.: A simplest systematics for the organization of turn-taking for conversation. Language 50, 696–735 (1974). https://doi.org/10.2307/412243 29. Jakobson, R.: Essais de linguistique générale: les fondations du langage. (1963) 30. Duncan, S.: Some signals and rules for taking speaking turns in conversations. J. Pers. Soc. Psychol. 23, 283–292 (1972). https://doi.org/10.1037/h0033031 31. Charaudeau, P.: Grammaire du sens et de l’expression. Hachette (1992) 32. Vanderveken, D.: Les actes de discours. Mardaga (1988) 33. Buvet, P.-A., Grezka, A.: Les dictionnaires électroniques du modèle des classes d’objets. Langages. 176, 63 (2009). https://doi.org/10.3917/lang.176.0063 34. Buvet, P.-A.: Comment parle un robot? A propos des interactions verbales homme-machine. 23, 30–58 (2019) 35. Chomsky, N., Miller, G.A.: L’analyse formelle des langues naturelles. Mouton, Paris (1968) 36. Oberle, D., Eberhart, A., Staab, S., Volz, R.: Developing and managing software components in an ontology-based application server. In: Jacobsen, H.-A. (ed.) Middleware 2004. LNCS, vol. 3231, pp. 459–477. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30229-2_24
37. Tisseron, S.: Le jour où mon robot m’aimera: vers l’empathie artificielle. (2015) 38. Mejri, S.: De l’inarticulé dans le langage. In: Les cahiers du dictionnaire. pp. 25–58 (2019) 39. Danon-Boileau, L., Diatkine, R.: Le sujet de l’enonciation: psychanalyse et linguistique. Ophrys, Paris (2007) 40. Buvet, P.-A.: La puissance combinatoire : le sens entre lexique et grammaire, (in press)
Composite Versions of Implicit Search Algorithms for Mobile Computing
Vitaly O. Groppen
North-Caucasian Institute of Mining and Metallurgy (State Technological University), Vladikavkaz, Russia
Abstract. We study the effectiveness of two versions of backtracking and two versions of B&B algorithms when solving a knapsack problem. The criterion of effectiveness is the running time needed to find a globally optimal solution. In each pair of investigated algorithms, one is a composite version of the other: it combines the traditional branching strategy, the B&B method of choosing the direction of movement along a search tree, and the cutting off of unpromising search directions used in dynamic programming. The experiments performed allowed us to identify areas where the use of composite algorithms in mobile devices is preferable.
Keywords: B&B algorithm · Backtracking algorithm · Knapsack problem · Dynamic programming · Globally optimal solution · Solution search time
1 Introduction
Today, all computers can be divided into three classes: supercomputers, desktop computers, and mobile gadgets. The latter, in comparison with the first two, have low power consumption, relatively small RAM, low processor speed and wide distribution. The last property plays a key role: the field of application of mobile devices is constantly expanding. The class of the most complex problems that mankind has faced over the past five centuries includes the search for globally optimal solutions of problems reducible to mathematical models with discrete variables. No efficient solution algorithms are known for such problems today. As a rule, the search for globally optimal solutions of extremal problems with discrete variables is based on methods developed in the second half of the last century, simultaneously with the proliferation of electronic computers. That period added three new classes of algorithms: branch and bound (B&B) algorithms, proposed by A.H. Land and A.G. Doig in 1960 [1]; backtracking algorithms [2] (the term “backtrack” was coined by D. H. Lehmer in the 1950s [3]); and branch and cut procedures, one of which is dynamic programming, developed in 1954 by R. Bellman [4]. Since these procedures require a significant expenditure of computer resources (processor time and RAM), their use on mobile computing devices significantly reduces the dimension of the problems that can be solved. The purpose of this study is to expand the capabilities of mobile devices through a new family of algorithms created by combining the above procedures [5, 6].
© Springer Nature Switzerland AG 2021 K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 327–339, 2021. https://doi.org/10.1007/978-3-030-63089-8_21
2 Designations and Assumptions
Below, we study the effectiveness of several modifications of the previously listed implicit enumeration methods [5, 6] as applied to the knapsack problem with a vector Z of Boolean variables [7]:

$$
\begin{cases}
R = \sum_{i=1}^{n} C_i z_i \to \max; \\
\sum_{i=1}^{n} b_i z_i \le a; \\
\forall i:\ z_i = 1, 0,
\end{cases}
\tag{1}
$$

where $Z = \{z_1, z_2, \ldots, z_n\}$ is a vector of Boolean variables, whereas $C_i$ and $b_i$ (for all $i$) and $a$ are non-negative constants. The search for the solution of problem (1) by each modification of these algorithms is illustrated below by constructing the search tree G(X, U), where X is the set of vertices and U is the set of arcs. Further, the upper bound $\Delta(x_k)$ for the value of the goal function R, provided that the values of the variables in the basis correspond to the vertex $x_k$ of the search tree, is determined by the expression:

$$
\Delta(x_k) = \Delta(Z') =
\begin{cases}
S_1 + S_2, & \text{if } S_3 \le a; \\
-\infty, & \text{otherwise},
\end{cases}
\tag{2}
$$

where the values of $S_1$, $S_2$ and $S_3$ are defined as follows:

$$ S_1 = \sum_{i \in I_1(x_k)} C_i z_i, \tag{3} $$

$$ S_2 = \sum_{j \in I(Z) \setminus I_1(x_k)} C_j, \tag{4} $$

$$ S_3 = \sum_{i \in I_1(x_k)} b_i z_i. \tag{5} $$

$I_1(x_k)$ denotes the set of indices of the Boolean variables in the basis, whose values are determined by the position $x_k \in X$ in the search tree; $I(Z)$ is the set of indices of all the Boolean variables of the problem being solved. The lower bound $\delta(x_k)$ of the value of the goal function of problem (1) corresponding to the vertex $x_k$ of the search tree is determined similarly:

$$
\delta(x_k) =
\begin{cases}
S_1, & \text{if } S_3 \le a; \\
-\infty, & \text{if } S_3 > a.
\end{cases}
\tag{6}
$$
Another characteristic of the vertex $x_k$ of the search tree G(X, U) is the resource $\lambda(x_k)$:

$$ \lambda(x_k) = a - \sum_{i \in I_1(x_k)} b_i z_i. \tag{7} $$
Thus, each vertex $x_k \in X$ of the search tree G(X, U), corresponding to a vector of variables of problem (1) with h of them in the basis, is also associated with the vector $V(x_k) = \{\Delta(x_k), \delta(x_k), \lambda(x_k)\}$. Using this vector, we can formulate the rule for cutting off subsets of “unpromising” vectors of variables of problem (1). If there are two vertices $x_k$ and $x_p$ in the search tree for which the following conditions hold:

$$ \Delta(x_k) < \Delta(x_p), \tag{8} $$

$$
\begin{cases}
I_1(x_k) = I_1(x_p); \\
\lambda(x_k) \le \lambda(x_p); \\
\delta(x_k) \le \delta(x_p),
\end{cases}
\tag{9}
$$

then the subset of the vectors of variables corresponding to the vertex $x_k$ can be excluded from further consideration.

A search for a solution to problem (1) with n variables in which the generated search tree G(X, U) has a number of vertices |X| close to its lower bound $2n$ is defined below as a search in a friendly environment. Conversely, if |X| is close to its upper bound $2(2^n - 1)$, the search is defined as a search in a hostile environment. Further, we assume that:
a) the time $t_o(n)$ of computing a single bound for the problem with n variables is, in the first approximation, directly proportional to n:

$$ t_o(n) = k \cdot n, \tag{10} $$

where k is a coefficient;
b) the time spent on comparing bounds and on choosing the direction of movement along the search tree can be neglected.

All numerical examples below illustrating the work of the analyzed algorithms correspond to the following knapsack problem:

$$
\begin{cases}
R = 4z_1 + 3z_2 + 5z_3 + 9z_4 \to \max; \\
3z_1 + 8z_2 + 5z_3 + 3z_4 \le 10; \\
z_i = 1, 0;\ i = 1, 2, \ldots, 4.
\end{cases}
\tag{11}
$$

Further, we seek solutions on a set of search trees, applying the following definitions:
1. A “hanging” vertex at any iteration is a vertex of the constructed search tree that has no outgoing arcs.
2. The root vertex of the search tree is considered a “hanging” vertex at the first iteration.
3. The “gray” vertices of the i-th tier of the search tree correspond to $z_i = 1$; the “white” vertices correspond to $z_i = 0$.
4. “h” is the number of variables simultaneously loaded into the basis.
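The vector V(x_k) and the cut-off rule (8)–(9) can be illustrated on problem (11). The following is a minimal Python sketch, not the author's implementation: the function names `vertex_vector` and `is_cut_off` are hypothetical, and placing the first three variables in the basis (h = 3) is an assumption made only for illustration.

```python
from itertools import product
from math import inf

# Data of problem (11): maximise sum(C[i]*z[i]) subject to sum(b[i]*z[i]) <= a.
C = [4, 3, 5, 9]
b = [3, 8, 5, 3]
a = 10

def vertex_vector(z):
    """Return V = (upper bound, lower bound, resource) per (2)-(7) for a
    partial assignment z of the first len(z) variables (the basis)."""
    s1 = sum(c * v for c, v in zip(C, z))   # (3): value already collected
    s2 = sum(C[len(z):])                    # (4): optimistic value of the free variables
    s3 = sum(w * v for w, v in zip(b, z))   # (5): capacity already used
    if s3 > a:                              # infeasible vertex
        return (-inf, -inf, a - s3)
    return (s1 + s2, s1, a - s3)

def is_cut_off(vk, vp):
    """Conditions (8)-(9) for two vertices that share the same basis."""
    return vk[0] < vp[0] and vk[2] <= vp[2] and vk[1] <= vp[1]

# Assumed illustration: all value combinations of the first three variables.
vectors = {z: vertex_vector(z) for z in product((1, 0), repeat=3)}
feasible = {z: v for z, v in vectors.items() if v[0] > -inf}
cut = sorted(zk for zk, vk in feasible.items()
             if any(is_cut_off(vk, vp) for vp in feasible.values()))

for z, v in feasible.items():
    print(z, v, "cut off" if z in cut else "kept")
```

Among the feasible combinations, only (0, 1, 0) with V = (12, 3, 2) is cut off: for instance, (1, 0, 0) with V = (13, 4, 7) has a larger upper bound, a larger lower bound and a larger resource.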
3 Implicit Search Methods
The following are descriptions of two groups of implicit enumeration methods that guarantee a globally optimal solution to problem (1): backtracking and B&B methods. Each method is represented by its “classic” algorithm and by a composite modification of this algorithm. The search for the solution of problem (1) by any modification of these algorithms is illustrated below by constructing the search tree G(X, U) and moving along it.

3.1 Backtracking Algorithms
Below are descriptions of two backtracking algorithms: the “classical” one, which implements backtracking as applied to problem (1), and a composite version of this procedure.

Algorithm 1
Step 1. R = −∞.
Step 2. i = 1.
Step 3. z_i = 1.
Step 4. The upper bound Δ of the record is calculated according to (2)–(5).
Step 5. If Δ > R, go to step 6; otherwise go to step 8.
Step 6. If i = n, go to step 7; otherwise go to step 12.
Step 7. Set R = Δ and copy the vector z into the vector q.
Step 8. If z_i = 1, go to step 9; otherwise go to step 11.
Step 9. z_i = 0, go to step 4.
Step 10. If i = 0, go to step 13; otherwise go to step 8.
Step 11. i = i − 1, go to step 10.
Step 12. i = i + 1, go to step 3.
Step 13. The end of the algorithm. Print R and the optimal vector q.
Obviously, if $N_1$ bounds Δ were calculated during the search for a solution of problem (1) by Algorithm 1, then, taking into account (10), the running time is equal to

$$ T_1 = k \cdot n \cdot N_1. \tag{12} $$

Keeping in mind that the value of $N_1$ is in the range

$$ 2n \le N_1 \le 2(2^n - 1), \tag{13} $$

we can define similar boundaries for the running time $T_1$:

$$ 2n^2 \le T_1 / k < n \cdot 2^{n+1}. \tag{14} $$
Example 1: The search for the solution of system (11) by this algorithm is illustrated by the tree G1(X1, U1) presented in Fig. 1. The numbers near the vertices are the corresponding upper bounds determined according to (2). The maximum value is R = 14, the optimal vector of variables is Z = {0, 0, 1, 1}, and the corresponding vertex has a bold black outline. In this case N1 = |X1| = 18 (the root vertex “s” is ignored), whence, based on (12), it follows that T1 = 72k.
Fig. 1. The search tree G1(X1, U1) constructed by Algorithm 1 for the solution of problem (11).
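The steps of Algorithm 1 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: recursion replaces the explicit step numbering and jumps, the function name `backtracking` and the counter of computed bounds are hypothetical, and the upper bound follows (2)–(5).

```python
from math import inf

def backtracking(C, b, a):
    """Depth-first search with the upper bound (2).
    Returns (record R, optimal vector q, number of bounds computed)."""
    n = len(C)
    z = [0] * n
    state = {"R": -inf, "q": None, "bounds": 0}

    def upper_bound(i):
        # Variables 0..i are in the basis, the remaining ones are free.
        if sum(b[j] * z[j] for j in range(i + 1)) > a:
            return -inf
        return sum(C[j] * z[j] for j in range(i + 1)) + sum(C[i + 1:])

    def descend(i):
        for value in (1, 0):            # try z_i = 1 first, then z_i = 0
            z[i] = value
            state["bounds"] += 1
            delta = upper_bound(i)
            if delta > state["R"]:      # the branch may still improve the record
                if i == n - 1:          # all variables fixed: new record
                    state["R"], state["q"] = delta, z.copy()
                else:
                    descend(i + 1)

    descend(0)
    return state["R"], state["q"], state["bounds"]

# Problem (11)
R, q, n_bounds = backtracking([4, 3, 5, 9], [3, 8, 5, 3], 10)
print(R, q, n_bounds)
```

For problem (11) this sketch returns R = 14 with q = [0, 0, 1, 1] after computing 18 bounds, which agrees with N1 = 18 in Example 1.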
Algorithm 2, presented below, is a composite one: it includes local enumeration, the above Algorithm 1, the strategy for choosing the direction of descent along the search tree inherent in B&B methods, and the technique of cutting off hopeless search directions on the search tree used in dynamic programming [1–3]. An aggregated description of this algorithm is given below, provided that the value of the parameter h is known a priori (1 ≤ h ≤ n).
Example 2: Fig. 2 shows the search tree G2(X2, U2) constructed by Algorithm 2 when solving problem (11) with h = 3; for each k-th vector constructed in step 2 (k = 1, …, 4), there is another vector $V(x_k) = \{\Delta(x_k), \delta(x_k), \lambda(x_k)\}$. The rectangles contain all combinations of values of the first three variables of problem (11). The crossed-out vector is (0, 1, 0), which satisfies conditions (8) and (9). The running time T2 of the solution of problem (11), taking into account (12), is 36k; the time gain in comparison with Algorithm 1 is $\eta_{1,2} = T_1/T_2 = 2.0$.

3.2 B&B Algorithms
Below are descriptions of two branch and bound algorithms: the “classical” one (Algorithm 3), which implements branching and bound calculation as applied to problem (1), and a composite version of this procedure (Algorithm 4).
Fig. 2. The search tree G2(X2, U2) constructed by Algorithm 2 for the solution of problem (11).
Algorithm 3
Step 1. From the set of hanging vertices X1 ⊆ X of the constructed search tree G(X, U), select the vertex $x_j$ with the best upper bound. At the first iteration, this vertex is the root vertex of the tree.
Step 2. If the selected vertex satisfies the equality I1 = I, go to step 5; otherwise go to the next step.
Step 3. Branch from the vertex $x_j$ selected at step 1 of the latest iteration. The new set of hanging vertices of the tree is again denoted X1.
Step 4. For each new vertex $x_k \in X_1$ belonging to the “bush” built at the previous step, compute the bound $\Delta(x_k)$. Go to step 1.
Step 5. The algorithm is complete. The vector of variables corresponding to the vertex selected at step 1 of the latest iteration is optimal.
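A best-first reading of Algorithm 3 can be sketched in Python as follows. The sketch is an assumption-laden illustration rather than the author's code: the set of hanging vertices is kept in a heap, the function name `branch_and_bound` is hypothetical, and ties between equal bounds are broken by insertion order.

```python
import heapq
from math import inf

def branch_and_bound(C, b, a):
    """Best-first B&B with the upper bound (2).
    Returns (optimal value, optimal vector, number of bounds computed)."""
    n = len(C)

    def upper_bound(z):
        if sum(w * v for w, v in zip(b, z)) > a:
            return -inf
        return sum(c * v for c, v in zip(C, z)) + sum(C[len(z):])

    # Heap of hanging vertices keyed by the negated bound (heapq is a min-heap);
    # the counter breaks ties in insertion order.
    hanging = [(-upper_bound(()), 0, ())]
    counter = 0
    bounds = 0                                      # root vertex is not counted
    while hanging:
        neg_delta, _, z = heapq.heappop(hanging)    # step 1: best hanging vertex
        if len(z) == n:                             # step 2: all variables fixed
            return -neg_delta, z, bounds
        for value in (1, 0):                        # steps 3-4: branch and bound
            child = z + (value,)
            counter += 1
            bounds += 1
            heapq.heappush(hanging, (-upper_bound(child), counter, child))
    return -inf, None, bounds

# Problem (11)
R, z_opt, n_bounds = branch_and_bound([4, 3, 5, 9], [3, 8, 5, 3], 10)
print(R, z_opt, n_bounds)
```

On problem (11) the sketch stops at the vertex (0, 0, 1, 1) with R = 14 after generating 16 vertices.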
Example 3: Solving problem (11) by Algorithm 3 leads to the construction of the search tree G3(X3, U3) shown in Fig. 3. In this case N3 = |X3| = 16 (the root vertex “s” is ignored), whence, based on (12), it follows that T3 = 64k. When problem (1) is solved by the composite implementation of the branch and bound method, each vertex $x_k$ of the search tree is associated with the above-described vector $V(x_k) = \{\Delta(x_k), \delta(x_k), \lambda(x_k)\}$. A complete description of this procedure is presented below.
Fig. 3. The search tree G3(X3, U3) constructed by Algorithm 3 for the solution of problem (11).
Algorithm 4
Step 1. From the set of non-erased hanging vertices $X_4^T \subseteq X_4$ of the constructed search tree G4(X4, U4), select the vertex $x_j$ with the “best” first component of the vector $V(x_j)$. At the first iteration, this vertex is the root vertex of the tree.
Step 2. If the equality I1 = I holds for the selected vertex, go to step 8; otherwise go to the next step.
Step 3. Create a “bush” whose root is the vertex $x_j$ selected at step 1 of the latest iteration. The new set of hanging vertices of the tree is again denoted $X_4^T \subseteq X_4$.
Step 4. Calculate the vector $V(x_j)$ for each hanging vertex $x_j$ of the “bush” built at the previous step. The first component of this vector is determined by the procedure used for bound calculation in branch and bound methods, the second component is $\delta(x_j)$, and the remaining component is calculated according to (7).
Step 5. If the set of vertices $X_4^T$ contains a vertex $x_k$ for which conditions (8) and (9) hold, the vertex $x_k$ is crossed out.
Step 6. If there are two vertices $x_j \in X_4^T$ and $x_q \in X_4^T$ for which the following conditions hold: a) $I_1(x_j) = I_1(x_q)$; b) the value $\delta(x_j)$ is “better than” the value $\Delta(x_q)$, then the vertex $x_q$ is crossed out.
Step 7. Go to step 1.
Step 8. The algorithm is complete. The vector of variables corresponding to the vertex selected at step 1 of the latest iteration is optimal.
Example 4: Fig. 4 presents the search tree G4(X4, U4) built by Algorithm 4 applied to problem (11). The number of vertices of the constructed tree G4(X4, U4) is |X4| = 14; the crossed-out vertex corresponds to the application of step 5 of Algorithm 4.
Fig. 4. The search tree G4(X4, U4) constructed by Algorithm 4 for the solution of problem (11).
The running time T4 of the solution of problem (11), taking into account (12), is 56k; the time gain in comparison with Algorithm 3 is $\eta_{3,4} = T_3/T_4 = 1.14$. It should be noted that the cut-off procedure implemented in the sixth step of Algorithm 4 is absent in branch and bound methods as well as in dynamic programming. It is made possible only by the combination of both approaches.
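One possible reading of Algorithm 4 is sketched below in Python. Several points are assumptions rather than statements of the source: in step 5 a hanging vertex is compared with every vertex of the same tier already in the tree (including vertices that have been branched), infeasible vertices are simply treated as crossed out, “better than” in step 6 is read as a strict inequality, and all names are hypothetical.

```python
from math import inf

C, b, a = [4, 3, 5, 9], [3, 8, 5, 3], 10    # problem (11)
n = len(C)

def vertex_vector(z):
    """V = (upper bound, lower bound, resource) for a partial assignment z."""
    s1 = sum(c * v for c, v in zip(C, z))
    s3 = sum(w * v for w, v in zip(b, z))
    if s3 > a:
        return (-inf, -inf, a - s3)
    return (s1 + sum(C[len(z):]), s1, a - s3)

def composite_bnb():
    tree = {}         # every generated vertex (root excluded) -> V
    hanging = [()]    # hanging vertices that are not crossed out
    crossed = []      # crossed-out vertices, in order of removal
    while True:
        # Step 1: hanging vertex with the best upper bound (the root at first).
        x = max(hanging, key=lambda z: vertex_vector(z)[0])
        if len(x) == n:                                   # step 2 -> step 8
            return tree[x][0], x, tree, crossed
        hanging.remove(x)
        for value in (1, 0):                              # steps 3-4: build the bush
            child = x + (value,)
            tree[child] = vertex_vector(child)
            hanging.append(child)
        for k in list(hanging):
            vk = tree[k]
            same_tier = [p for p in tree if len(p) == len(k) and p != k]
            # Step 5: conditions (8)-(9) against any vertex of the same tier.
            dominated = any(vk[0] < tree[p][0] and vk[2] <= tree[p][2]
                            and vk[1] <= tree[p][1] for p in same_tier)
            # Step 6: a hanging vertex of the same tier whose lower bound
            # already exceeds the upper bound of k.
            outbid = any(tree[j][1] > vk[0] for j in same_tier if j in hanging)
            if dominated or outbid:
                hanging.remove(k)
                crossed.append(k)

R, z_opt, tree, crossed = composite_bnb()
print(R, z_opt, len(tree), crossed)
```

Under these assumptions the sketch generates 14 vertices for problem (11), finds R = 14 at (0, 0, 1, 1), and crosses out the vertex (0, 1) with V = (17, 3, 2) by rule (8)–(9) before it is branched, which is the saving relative to the 16 vertices of Algorithm 3.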
4 A Priori Analysis of the Effectiveness of the Proposed Approaches
It is easy to show that, taking into account (12) and (13), the minimum $T_{1\min}$ and maximum $T_{1\max}$ search times for solving problem (1) by Algorithm 1 depend on the value n as follows:

$$ T_{1\min} = 2n^2 k; \tag{16} $$

$$ T_{1\max} = 2nk(2^n - 1). \tag{17} $$
The expression predicting the upper bound on the time needed by Algorithm 2 to find a globally optimal solution of problem (1) for a given value of h is presented below:

$$ T_{2\max} = \left[ 2^h n + \left( 2^{n+1} - 2^{h+1} \right)(n - h) \right] k. \tag{18} $$
This permits us, keeping in mind that the value of h is an integer, to determine the optimal value of h in Algorithm 2 for n in the range 2–25:

$$ h = 1 + \mathrm{entire}(n/9). \tag{19} $$

The latter allows us to evaluate the gain in time $\eta_{12}$ obtained when applying the composite version of the backtracking algorithm in a “hostile” environment:

$$ \eta_{12} = T_{1\max} / T_{2\max}. \tag{20} $$

The dependence obtained on the basis of (20) is presented below in Fig. 5.
Fig. 5. The dependence of the gain in time η12 on the number of variables “n” of the knapsack problem, created according to (17)–(20). (Vertical axis: 10∙η12, from 0 to 100; horizontal axis: n, from 2 to 25.)
A similar analysis comparing the effectiveness of Algorithms 1 and 2 in a friendly environment favors the former. Since all calculations above were made under the assumption that the time for the bounds comparing can be neglected, the results can be extended to branch and bound methods for cases of small n. The statement and results of experimental verification of these provisions are given below.
5 Statement of Experiments
Two series of experiments were carried out. The first series investigated the effectiveness of the composite version of the backtracking algorithm on the following mobile devices: a Microsoft Lumia 950 smartphone with 3 GB RAM, 192 GB external memory, the Windows 10 Mobile operating system and a Qualcomm Snapdragon 808 MSM8992 processor with a clock speed of 1.8 GHz; and a Samsung N150 laptop with the following parameters: processor Intel Atom N450 1.66 GHz; chipset Intel NM10; GPU Intel GMA 3150 (up to 256 MB); RAM 1 GB DDR2; operating system Windows XP.
Composite Versions of Implicit Search Algorithms
The order of the experiments was determined by the following procedure:
Algorithm 5
Step 1. n = 2.
Step 2. All integer coefficients and constants of problem (1) for the given number of variables n are generated by the Monte Carlo method in the range 0–100.
Step 3. The problem obtained in the previous step is solved sequentially by software implementations of Algorithms 1 and 2. For each i-th algorithm, the number of vertices |Xi| of the constructed search tree Gi(Xi, Ui) and the running time Ti are recorded (i = 1, 2).
Step 4. n = n + 1.
Step 5. If n
1/2. Then as the population size grows, the probability that the outcome of a majority vote is “correct” converges to one. While voting consensus protocols have their limitations, they have been successfully applied not only in decision making but also in a wide range of engineering and economic applications, and have led to the emerging science of sociophysics [3].
We continue the works of [11] and [2] and propose several adaptions (Sect. 8) of the fast probabilistic consensus protocol (FPC) that decrease the failure rate by at least one order of magnitude; e.g., see Fig. 5. The main contribution is the adaption of the protocol to a setting allowing defense against Sybil attacks. In FPC nodes need to be able to query a sufficiently large proportion of the network directly, which requires that nodes have global identities (node IDs) with which they can be addressed. In a decentralized and permissionless setting a malicious actor may gain a disproportionately large influence on the voting by creating a large number of pseudonymous identities. In the blockchain environment, mechanisms such as proof-of-work and (delegated) proof-of-stake can act as a Sybil mitigation mechanism in the sense that the voting power is proportional to the work invested or the value staked [14].
For the IOTA protocol, [12] introduces mana as a Sybil defense, where mana is delegated to nodes and proportional to the active amount of IOTA in the network. While in the remainder of the paper we will always refer to mana, the protocol can be implemented using any good or resource that can be verified via resource testing or recurring costs and fees, e.g., [10]. In Sect. 3 we propose a weighted voting consensus protocol that is fair in the sense that the voting power is proportional to the nodes’ reputation. In general, values in (crypto-)currency systems are not distributed equally; [8] investigates the heterogeneous distribution of the wealth across Bitcoin addresses and finds that it follows certain power laws. Power laws satisfy a universality phenomenon; they appear in numerous different fields of applications and have, in particular, also been utilised to model wealth in economic models [7]. In this paper we consider a Zipf law to model the proportional wealth of nodes in the IOTA network: the nth largest value y(n) satisfies
$$y(n) = C n^{-s}, \tag{1}$$
where $C^{-1} = \sum_{n=1}^{N} n^{-s}$, N is the number of nodes, and s is the Zipf parameter. Figure 1 shows the distribution of IOTA for the top 100 richest addresses¹ together with a fitted Zipf distribution. Since (1) only depends on two parameters, s and N, this provides a convenient model to investigate the performance of FPC in a wide range of network situations. For instance, networks where nodes are equal may be modelled by choosing s = 0, while more centralized networks can be considered for s > 1. We refer to Sect. 4 for more details on the Zipf law.
¹ https://thetangle.org.
S. Müller et al.
Fig. 1. Distribution of relative IOTA value on the top 100 addresses with a fitted Zipf distribution with s = 0.9.
Outline
The rest of the paper is organized as follows. After giving an introduction to the original version of FPC in Sect. 2, we summarize results on the fairness of this protocol in Sect. 3. In Sect. 4 we propose modelling the weight distribution using a Zipf law, we highlight the skewness of this distribution in Sect. 5, and in Sect. 6 we discuss how the properties of the Zipf law influence the message complexity of the protocol. After defining the threat model in Sect. 7 we propose several improvements of the Vanilla FPC in Sect. 8. In Sect. 9, we outline a protection mechanism against the most severe attack strategies. The quorum size is an important parameter of FPC that dominates its performance; we give in Sect. 10 a heuristic to choose a quorum size for a given security level. Section 11 presents simulation results that show the performance of the protocol in a Byzantine infrastructure for different degrees of centralization of the weights. We conclude in Sect. 12 with a discussion.
Fast Probabilistic Consensus with Weighted Votes
2 Vanilla FPC
We present here only the key elements of the proposed protocol and refer the interested reader to [11] and [2] for more details. In order to define FPC we have to introduce some notation. We assume the network to have N nodes indexed by 1, 2, ..., N and that every node is able to query any other node.² Every node i has an opinion or state. We write $s_i(t)$ for the opinion of node i at time t. Opinions take values in {0, 1}. Every node i has an initial opinion $s_i(0)$. At each (discrete) time step each node chooses k random nodes $C_i = C_i(t)$, queries their opinions and calculates
$$\eta_i(t+1) = \frac{1}{k_i(t)} \sum_{j \in C_i} s_j(t),$$
where $k_i(t) \le k$ is the number of replies received by node i at time t and $s_j(t) = 0$ if the reply from j is not received in due time. Note that the neighbors $C_i$ of a node i are chosen using sampling with replacement and hence repetitions are possible.
As in [2] we consider a basic version of the FPC introduced in [11], choosing some parameters by default. Specifically, we remove the cooling phase of FPC and the randomness of the initial threshold τ. Let $U_t$, t = 1, 2, ..., be i.i.d. random variables with law Unif([β, 1 − β]) for some parameter β ∈ [0, 1/2]. The update rule for the opinion of a node i is then given by
$$s_i(1) = \begin{cases} 1, & \text{if } \eta_i(1) \ge \tau, \\ 0, & \text{otherwise,} \end{cases}$$
and for t ≥ 1:
$$s_i(t+1) = \begin{cases} 1, & \text{if } \eta_i(t+1) > U_t, \\ 0, & \text{if } \eta_i(t+1) < U_t, \\ s_i(t), & \text{otherwise.} \end{cases}$$
Note that if β = 0.5, FPC reduces to a standard majority consensus. The random variables $U_t$ are the same for all nodes; we refer to [2] for a more detailed discussion on the use of decentralized random number generators.
We introduce a local termination rule to reduce the communication complexity of the protocol. Every node keeps a counter variable cnt that is incremented by 1 if there is no change in its opinion and that is set to 0 if there is a change of opinion. Once the counter reaches a certain threshold l, i.e., cnt ≥ l, the node considers the current state as final. The node will therefore no longer send any queries but will still answer incoming queries. In the absence of autonomous termination the algorithm is halted after maxIt iterations.
² This assumption is only made for the sake of a better presentation; a node does not need to know every other node in the network. While the theoretical results in [11] are proven under this assumption, simulation studies [2] indicate that it is sufficient if every node knows about half of the other nodes. Moreover, it seems to be a reasonable assumption that large mana nodes are known to every participant in the network.
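The update and termination rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' simulator: it assumes no adversary, that every reply arrives in time (so $k_i(t) = k$), and that finalized nodes keep answering with their final opinion. The function name `run_fpc` and the seed are hypothetical; the default parameters are borrowed from Table 1.

```python
import random


def run_fpc(initial, k=20, tau=0.66, beta=0.3, l=10, max_it=50, seed=1):
    """Simulate Vanilla FPC without adversary; return (opinions, rounds used)."""
    rng = random.Random(seed)
    s = list(initial)
    n = len(s)
    cnt = [0] * n  # consecutive rounds without a change of opinion
    for t in range(max_it):
        # Round 1 uses the fixed threshold tau; later rounds share one U_t.
        u = tau if t == 0 else rng.uniform(beta, 1 - beta)
        nxt = list(s)
        for i in range(n):
            if cnt[i] >= l:
                continue  # finalized: no more queries, opinion stays fixed
            # Sample k peers with replacement and average their opinions.
            eta = sum(s[rng.randrange(n)] for _ in range(k)) / k
            if t == 0:
                new = 1 if eta >= tau else 0
            elif eta > u:
                new = 1
            elif eta < u:
                new = 0
            else:
                new = s[i]
            cnt[i] = cnt[i] + 1 if new == s[i] else 0
            nxt[i] = new
        s = nxt  # synchronous update
        if all(c >= l for c in cnt):
            break
    return s, t + 1


if __name__ == "__main__":
    opinions, rounds = run_fpc([1] * 180 + [0] * 20)
    print(sum(opinions), "nodes hold opinion 1 after", rounds, "rounds")
```

With a clear initial majority and honest nodes only, the network is expected to settle on the majority opinion after slightly more than l rounds.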
3 Fairness
Introducing mana as a weighting factor may naturally have an influence on the mana distribution and may lead to degenerate cases. In order to avoid this phenomenon we want to ensure that no node can increase its importance by splitting up into several nodes, nor achieve better performance by pooling together with other nodes. We consider a network of N nodes whose mana is described by $\{m_1, \ldots, m_N\}$ with $\sum_{i=1}^{N} m_i = 1$. In the sampling of the queries a node j is now chosen with probability
$$p_j = \frac{f(m_j)}{\sum_{i=1}^{N} f(m_i)}.$$
Each opinion is weighted by $g_j = g(m_j)$, resulting in the value
$$\eta_i(t+1) = \frac{1}{\sum_{j \in C_i} g_j} \sum_{j \in C_i} g_j s_j(t).$$
The other parts of the protocol remain unchanged. We denote by $y_i$ the number of times a node i is chosen. As the sampling is described by a multinomial distribution we can calculate the expected value of a query as
$$E\eta(t+1) = \sum_{i=1}^{N} s_i(t) v_i,$$
where
$$v_i = \sum_{y \in \mathbb{N}^N :\, \sum_{n} y_n = k} \frac{k!}{y_1! \cdots y_N!} \, \frac{y_i g_i}{\sum_{n=1}^{N} y_n g_n} \prod_{j=1}^{N} p_j^{y_j}$$
is called the voting power of node i. The voting power measures the influence of the node i. We would like the voting power to be proportional to the mana.
Definition 1. A voting scheme (f, g) is fair if the voting power is not sensitive to splitting/merging of mana, i.e., if a node i splits into nodes $i_1$ and $i_2$ with a mana splitting ratio x ∈ (0, 1), then
$$v_i(m_i) = v_{i_1}(x m_i) + v_{i_2}((1 - x) m_i). \tag{2}$$
In the case where g ≡ 1, i.e., the η is an unweighted mean, the existence of a voting scheme that is fair for all possible choices of k and mana distributions is shown in [9]: Lemma 1. For g ≡ 1 the voting scheme (f, g) is fair if and only if f is the identity function f = id. For this reason we fix from now on g ≡ 1 and f = id.
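To illustrate Lemma 1 numerically, note that for g ≡ 1 the value η is the plain sample mean, so the multinomial expectation $E[y_i] = k p_i$ gives $v_i = p_i = f(m_i)/\sum_j f(m_j)$ independently of k. The sketch below uses this simplification; the helper names and the three-node mana vector are hypothetical and chosen only for illustration.

```python
def voting_power(mana, f):
    """Voting power for g = 1: v_i = f(m_i) / sum_j f(m_j)."""
    weights = [f(m) for m in mana]
    total = sum(weights)
    return [w / total for w in weights]


def split_gain(mana, i, x, f):
    """Change in total voting power when node i splits its mana as x : (1 - x)."""
    before = voting_power(mana, f)[i]
    split = mana[:i] + mana[i + 1:] + [x * mana[i], (1 - x) * mana[i]]
    after = voting_power(split, f)
    return after[-1] + after[-2] - before


mana = [0.5, 0.3, 0.2]  # hypothetical mana distribution summing to 1

gain_identity = split_gain(mana, 0, 0.5, lambda m: m)       # f = id
gain_sqrt = split_gain(mana, 0, 0.5, lambda m: m ** 0.5)    # concave f
gain_square = split_gain(mana, 0, 0.5, lambda m: m * m)     # convex f
```

In this example only f = id leaves the voting power unchanged under splitting; the concave choice rewards splitting and the convex choice rewards merging.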
4 Zipf’s Law and Mana Distribution
One of the most intriguing phenomena in probability theory is that of universality; many seemingly unrelated probability distributions, which may involve large numbers of unknown parameters, can end up converging to a universal law that only depends on few parameters. Probably the most famous example of this universality phenomenon is the central limit theorem. Analogous universality phenomena also show up in empirical distributions, i.e., distributions of statistics from a large population of real-world objects. Examples include Benford’s law, Zipf’s law, and the Pareto distribution³; we refer to [15] for more details. These laws govern the asymptotic distribution of many statistics which
1. take values as positive numbers;
2. range over many different orders of magnitude;
3. arise from a complicated combination of largely independent factors; and
4. have not been artificially rounded, truncated, or otherwise constrained in size.
Out of the three above laws, the Zipf law is the appropriate variant for modelling the mana distribution. The Zipf law is defined as follows: the nth largest value of the statistic X should obey an approximate power law, i.e., it should be approximately $C n^{-s}$ for the first few n = 1, 2, 3, ... and some parameters C, s > 0. The Zipf law is used in various applications. For instance, Zipf’s law and the closely related Pareto distribution can be used to mathematically test various models of real-world systems (e.g., formation of astronomical objects, accumulation of wealth and population growth of countries). An important point is that Zipf’s law does not in general apply to the entire range of X, but only to the upper tail region where X is significantly higher than the median; in other words, it is a law for the (upper) outliers of X.
The Zipf law tends to break down if one of the hypotheses 1)–4) is dropped. For instance, if the statistic X concentrates around its mean and does not range over many orders of magnitude, then the normal distribution tends to be a much better model. If instead the samples of the statistic are highly correlated with each other, then other laws can arise, for example the Tracy-Widom law.
Zipf’s law is most easily observed by plotting the data on a log-log graph, with the axes being log(rank order) and log(value). The data conforms to a Zipf law to the extent that the plot is linear, and the value of s can be found using linear regression. For instance, Fig. 1 shows the distribution of IOTA for the top 100 richest addresses. Due to the universality phenomenon, the plausibility of hypotheses 1)–4) above and Fig. 1, we assume a Zipf law for the mana distribution. In Sect. 12 we give more details on the validity of the model.
³ It is interesting to note here that these three distributions are highly compatible with each other.
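The log-log regression mentioned above can be sketched as follows. The data here is synthetic (an exact Zipf distribution generated from Eq. (1)), not the IOTA address data of Fig. 1, and the function names are hypothetical. The estimate of s is the negative least-squares slope of log(value) against log(rank).

```python
import math


def zipf_weights(n_nodes, s):
    """Normalized Zipf values y(n) = C * n**(-s) for ranks 1..n_nodes."""
    raw = [rank ** (-s) for rank in range(1, n_nodes + 1)]
    total = sum(raw)
    return [r / total for r in raw]


def fit_zipf_exponent(values):
    """Estimate s as minus the slope of log(value) versus log(rank)."""
    ordered = sorted(values, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(v) for v in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return -cov / var


estimated_s = fit_zipf_exponent(zipf_weights(100, 0.9))
```

On exact Zipf data the fit recovers the parameter up to rounding error; on real data the fit should be restricted to the upper tail, as noted in the text.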
5 Skewness of Mana Distribution
For s > 0 the majority of the nodes would have a mana value less than the average and hence, in the case of an increasing function f, these nodes would be queried less often than in a homogeneous distribution. As a consequence the initial opinion of small mana nodes may become negligible. We define the γ-effective number of nodes $N_{\gamma\text{-eff}}$ as the number of nodes whose proportional mana is more than or equal to γ/N:
$$N_{\gamma\text{-eff}} = \sum_{i=1}^{N} \mathbf{1}\{m_i \ge \gamma/N\},$$
where 1 is the standard indicator function. Figure 2 shows the relative proportion of effective nodes $n_{\gamma\text{-eff}} = N_{\gamma\text{-eff}}/N$ as a function of s. We show the figure for N = 1000, although the distribution hardly changes when changing N. Note that for γ = 1 and s → 0 a large proportion of the nodes would have less than a proportion 1/N of the mana and hence $n_{\gamma\text{-eff}}$ approaches, as s → 0, a value strictly less than 1. Note that for values of s 1 the effective number of nodes can be very small. This is also reflected in the distribution of IOTA. The top 100 addresses shown in Fig. 1 own 60% of the total funds, although there are more than 100,000 addresses in total (see Footnote 1).
Fig. 2. Proportion of effective number of nodes.
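The proportion of effective nodes can be computed directly from the definition. The following sketch assumes an exact Zipf mana distribution as in Eq. (1); the function names are hypothetical.

```python
def zipf_mana(n_nodes, s):
    """Mana proportions following a Zipf law with parameter s."""
    raw = [rank ** (-float(s)) for rank in range(1, n_nodes + 1)]
    total = sum(raw)
    return [r / total for r in raw]


def effective_fraction(mana, gamma=1.0):
    """Fraction of nodes holding at least gamma/N of the mana."""
    n = len(mana)
    return sum(1 for m in mana if m >= gamma / n) / n


fractions = {s: effective_fraction(zipf_mana(1000, s)) for s in (0.5, 1.0, 2.0)}
```

For N = 1000 and γ = 1 the fraction shrinks quickly as s grows, in line with the qualitative behaviour described for Fig. 2.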
6 Message Complexity
Let us start with the following back-of-the-envelope calculation. Denote by h(N) the mana rank of a given node. At every round this node is queried on average
$$\frac{h(N)^{-s}}{\sum_{n=1}^{N} n^{-s}} \cdot N \tag{3}$$
times. Now, if s < 1 this becomes asymptotically $\Theta(N^{s} h(N)^{-s})$, if s = 1 we obtain $\Theta(\frac{N}{\log N} h(N)^{-1})$, and if s > 1 this is $\Theta(N h(N)^{-s})$. In particular, the highest mana node, i.e., h(N) = 1, is queried $\Theta(N^{s})$, $\Theta(\frac{N}{\log N})$, or $\Theta(N)$ times, and might eventually be overrun by queries. Nodes whose rank is Θ(N) have to answer only Θ(1) queries. This is in contrast to the case s = 0 where every node has the same mana and every node is queried on average a constant number of times. The high mana nodes are therefore incentivized to gossip their opinions and not to answer each query separately. Since not all nodes can gossip their opinions (in this case every node would have to send Ω(N) messages) we have to find a threshold that determines which nodes gossip their opinions. If we assume that high mana nodes have higher throughput than lower mana nodes, a reasonable threshold is log(N), i.e., only the Θ(log(N)) highest mana nodes gossip their opinions, leading to Θ(log N) messages for each node in the gossip layer. In this case the expected number of queries received by the highest mana node that is not allowed to gossip its opinions is $\Theta((\frac{N}{\log N})^{s})$ if s < 1, $\Theta(\frac{N}{(\log N)^{2}})$ if s = 1, and $\Theta(\frac{N}{(\log N)^{s}})$ if s > 1. In this case, nodes of rank between Θ(log N) and Θ(N) are the critical nodes with respect to message complexity.
Another natural possibility would be to choose the threshold such that every node has to send the same amount of messages. In other words, the maximal number of queries a node has to answer should equal the number of messages that are gossiped. For s < 1 this leads to the following equation
$$N^{s} h(N)^{-s} = h(N) \tag{4}$$
and hence we obtain that a threshold of order $N^{\frac{s}{s+1}}$ leads to $\Theta(N^{\frac{s}{s+1}})$ messages for every node to send. For s > 1 one obtains similarly a threshold of $N^{\frac{1}{1+s}}$ leading to $\Theta(N^{\frac{1}{1+s}})$ messages. In the worst case, i.e., s = 1, the message complexity for each node in the network is $O(\sqrt{N})$.
We want to close this section with the remark that, as mentioned in Sect. 4, Zipf’s law mostly does not apply to the entire range of the observations, but only to the upper tail region. Adjustments of the above threshold and more precise message complexity calculations have to be performed in consideration of the real-world situation of the mana distribution. Moreover, the optimal choice of this threshold also has to depend on the structure of the network and the performance of the different nodes.
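The quantities in (3) and (4) are easy to evaluate numerically. The sketch below computes the expected number of queries per round for every rank and the order of the balanced gossip threshold; the function names are hypothetical, and the threshold formula ignores constants and logarithmic factors, as the Θ-notation in the text does.

```python
def expected_queries(n_nodes, s):
    """Expected queries per round for ranks 1..N according to Eq. (3)."""
    raw = [rank ** (-float(s)) for rank in range(1, n_nodes + 1)]
    normalizer = sum(raw)
    return [n_nodes * r / normalizer for r in raw]


def balanced_gossip_threshold(n_nodes, s):
    """Order of the rank h solving Eq. (4): N^(s/(s+1)) for s <= 1, N^(1/(1+s)) otherwise."""
    exponent = s / (s + 1) if s <= 1 else 1 / (1 + s)
    return n_nodes ** exponent


load = expected_queries(1000, 0.9)
print("rank 1:", round(load[0], 1), "rank N:", round(load[-1], 3))
print("threshold for N = 10000, s = 1:", balanced_gossip_threshold(10000, 1))
```

Summing the load over all ranks gives N, so a skewed distribution only redistributes the queries towards the high mana nodes.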
7 Threat Model
We consider the “worst-case” scenario where adversarial nodes can exchange information freely between themselves and can agree on a common strategy. In fact, we assume that all Byzantine nodes are controlled by a single adversary. We assume that such an adversary holds a proportion q of the mana and thus has a voting power vq = q. In order to make results more comparable we assume that the adversary distributes the mana equally between its nodes such that each node holds 1/N of the total mana. Figure 3 shows an exemplary distribution of mana between all nodes. Nodes are indexed such that the malicious nodes have the highest indexes, while honest nodes are indexed by their mana rank.
Fig. 3. Mana distribution with s = 1, N = 100 and q = 0.2.
We assume an “omniscient adversary”, who is aware of all opinions and queries of the honest nodes. However, we assume that the adversary has neither influence on nor prior knowledge of the random threshold. The adversary can take several approaches in influencing the opinions in the network. In a cautious strategy the adversary sends the same opinion to all enquiring nodes, while in a berserk strategy, different opinions can be sent to different nodes; we refer to [2,11] for more details. While the latter is more powerful it may also be easily detectable, e.g., see [12]. The adversary may also behave semi-cautiously by not responding to individual nodes.
7.1 Communication Model
We have to make assumptions on the communication model of the FPC. We assume the communication between two nodes to satisfy authentication, i.e., senders and receivers are who they claim to be, and data integrity, i.e., data is not changed from source to destination. Nodes can also send a message on a gossip layer; these messages are then available to all participating nodes. All messages are signed with the private key of the sending node. As we consider omniscient adversaries we do not assume confidentiality. For the communication of the opinions between nodes we assume a synchronous model. However, we want to stress that similar performance is obtained in a probabilistic synchronous model, in which for every ε > 0 and δ > 0.5, a majority proportion δ of the messages is delivered within a bounded (and known) time, which depends on ε and δ, with probability at least 1 − ε. Due to its random nature, FPC still performs well in situations where not all queries are answered in due time. Moreover, the gossiping feature of high mana nodes makes it possible to detect whether high mana nodes are eclipsed or are encountering communication problems.
7.2 Failures
In the case of heterogeneous mana distributions there are different possibilities to generalize the standard failures of consensus protocols: namely integration failure, agreement failure and termination failure. In this paper we consider only agreement failure since in the IOTA use case this failure turns out to be the most severe. In the strictest sense an agreement failure occurs if not all nodes decide on the same opinion. We will consider the α-agreement failure; such a failure occurs if at least a proportion of α nodes differ in their final decision.
7.3 Adversary Strategies
While [11] studies robustness of FPC against all kinds of adversary strategies, [2] proposes several concrete strategies in order to perform numerical simulations. In particular, [2] introduced the cautious inverse voting strategy (IVS) and the berserk maximal variance strategy (MVS). It was shown that, as analytically predicted in [11], the efficacy of the attacks is reduced when a random threshold is applied. The studies also show that the berserk attack is more severe; however, in the presence of the random threshold the difference to IVS is not significant. Moreover, in Sect. 9 we propose efficient ways to detect berserk behavior. The simpler dynamics of the IVS may also make the protocol easier to approach from an analytical viewpoint. For these reasons, we consider in this paper only a cautious strategy that is an adaption of the IVS to the setting of mana.
ManaIVS. We consider the cautious strategy where the adversary transmits at time t + 1 the opinion of the mana-weighted minority of the honest nodes of step t. More formally, the adversary chooses
$$\arg\min_{j \in \{0,1\}} \sum_{i=1}^{N} m_i \mathbf{1}\{s_i(t) = j\} \tag{5}$$
as its opinion at time t + 1. We call this strategy the mana weighted inverse vote strategy (manaIVS).
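A minimal sketch of the choice in (5) is given below. It assumes that the sum runs over the honest nodes only and that ties are broken in favour of opinion 0; both the function name and the tie-breaking rule are assumptions made for illustration.

```python
def mana_ivs_opinion(honest_mana, honest_opinions):
    """Opinion held by the mana-weighted minority of the honest nodes."""
    weight = [0.0, 0.0]
    for m, s in zip(honest_mana, honest_opinions):
        weight[s] += m
    # arg min over j in {0, 1}; a tie returns 0 (assumed tie-breaking rule).
    return min((0, 1), key=lambda j: weight[j])


# One large node holds opinion 1, two smaller nodes hold opinion 0.
adversary_opinion = mana_ivs_opinion([0.40, 0.20, 0.15], [1, 0, 0])
```

In this example opinion 0 is held by most honest nodes but by the smaller share of mana (0.35 against 0.40), so the adversary supports opinion 0.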
8 Improvements of FPC
We suggest several improvements of the Vanilla FPC described in [11].
Fixed Threshold for Last Rounds. In the original version of FPC, nodes query at random (possibly including themselves) and finalize after having the same opinion for l consecutive rounds [11]. We analyzed various situations in which the Vanilla FPC encountered failures. One key finding was that the randomness of the threshold sometimes has a negative side effect. In fact, due to its random nature it will from time to time show abnormal behavior.⁴ In order to counteract this effect we can fix the threshold to a given value, e.g., τ = 0.5, for the last $l_2$ rounds. The initial $l - l_2$ rounds enable the original task of FPC, namely to create an honest super majority even in the presence of an adversary. Once a super majority is formed, a simple majority rule is sufficient for the network to finalize on the same opinion, while the likelihood of nodes switching due to unusual behavior of the threshold is decreased significantly.
Bias Towards Own Opinion. In Sect. 3 we showed that with the introduction of mana as a Sybil protection we can adapt the FPC protocol in a fair manner by querying nodes with probability proportional to their mana. However, this can lead to agreement failures if a high mana node over-queries the adversary in round l. Part of the network would then finalize the opinion, while the mana-weighted majority of nodes could still switch their opinion. In an extreme situation it is possible that a node that holds the majority of the funds adjusts its opinion according to a minority of the funds, which is undesirable. In order to prevent this we propose the following adaption. Each node biases the received mean opinion η towards its current own opinion. More specifically, a node j can calculate its η-value for the current round by
$$\eta_j(t+1) = m_j s_j(t) + (1 - m_j)\,\eta_j^{*}(t+1),$$
where $m_j$ is j’s proportion of mana and $\eta_j^{*}(t+1)$ is the mean opinion from querying nodes without self-query.
⁴ This is a common phenomenon for stochastic processes in random media; e.g., see [6].
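The two adjustments above can be sketched as small helper functions. The names are hypothetical, and the way the fixed threshold is tied to the node's counter cnt (the last $l_2$ of the l consecutive rounds needed for finalization) is an assumed reading of "the last $l_2$ rounds", not a detail confirmed by the text.

```python
def biased_eta(own_mana, own_opinion, eta_star):
    """Mean opinion biased towards the node's own opinion by its mana share."""
    return own_mana * own_opinion + (1 - own_mana) * eta_star


def round_threshold(cnt, l, l2, random_threshold, fixed=0.5):
    """Shared random threshold first, fixed threshold for the last l2 rounds.

    Assumption: 'last l2 rounds' refers to the final l2 of the l consecutive
    unchanged rounds a node needs before it finalizes.
    """
    return fixed if cnt >= l - l2 else random_threshold


# A node holding 60% of the mana with opinion 1 sees eta* = 0.3 from its peers.
eta = biased_eta(0.6, 1, 0.3)
```

With the bias, the node in this example obtains η = 0.72 and keeps its opinion under a threshold of 0.5, whereas the unbiased value 0.3 would have made it switch.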
Fixed Number of Effective Queries. As discussed in Sect. 3, in order to facilitate a fair quorum (thereby preventing game-ability) we select for a given vote a node at random with a probability proportional to its mana. If a node is selected m times it is given m votes (all of which would carry the same opinion). However, this can lead to a quorum with a number of distinct nodes $k_{\text{diff}} < k$, in particular in scenarios where N is low or s is large. Furthermore, if there is a fixed bandwidth reserved to ensure the correct functioning of the voting layer, individual nodes could regularly under-utilize this bandwidth since the communication overhead is proportional to $k_{\text{diff}}$. We can alleviate this deficit by increasing k dynamically to keep $k_{\text{diff}}$ constant, thereby automatically increasing the effective quorum size k. Through this approach the protocol can adapt dynamically to a network with fewer nodes or different mana distributions.
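One possible way to keep $k_{\text{diff}}$ constant is to continue drawing nodes until the desired number of distinct nodes has been hit. The sketch below follows this idea; it is an assumption about how the dynamic increase of k could be realized, and the function name and the `max_draws` safety cap are hypothetical.

```python
import random


def sample_quorum(mana, k_diff, rng, max_draws=100_000):
    """Draw nodes proportionally to mana until k_diff distinct nodes are hit.

    Returns a dict mapping node index to its number of votes; the effective
    quorum size k is the sum of the votes.
    """
    nodes = range(len(mana))
    votes = {}
    draws = 0
    while len(votes) < k_diff and draws < max_draws:
        j = rng.choices(nodes, weights=mana)[0]
        votes[j] = votes.get(j, 0) + 1
        draws += 1
    return votes


# Zipf-like mana with s = 2 on 50 nodes (illustrative values only).
raw = [rank ** -2.0 for rank in range(1, 51)]
mana = [r / sum(raw) for r in raw]
quorum = sample_quorum(mana, k_diff=10, rng=random.Random(7))
effective_k = sum(quorum.values())
```

The number of messages sent is proportional to the number of distinct nodes in `quorum`, while `effective_k` grows automatically when the mana is concentrated on few nodes.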
9 Berserk Detection
Since berserk strategies are the most severe attacks, e.g., [2,11], the security of the protocol can be improved if berserk nodes can be identified and removed from the network. We therefore propose in this section a mechanism that makes it possible to detect berserk behavior. This mechanism is based on a “justification of opinion” where nodes exchange information about the opinions received in the previous rounds. As the set of queried nodes changes from round to round, this information does not necessarily allow a direct detection of berserk behavior, but berserk behavior is detectable indirectly with a certain probability. Upon discovering malicious behavior, nodes can gossip the proofs of this behavior, such that all other honest nodes can ignore the berserk node afterwards.
9.1 The Berserk Detection Protocol
We allow a node to ask a queried node for a list of opinions received during the previous round of FPC voting. We call such a list a vote list and write v-list. A node may request it in several ways. For example, the full response message to the request of a v-list and the opinions could be comprised of the opinion in the current round and the received opinions from the previous round. We do not require nodes to apply this procedure for every member of the quorum or every round. For instance, each node could request the list with a certain probability or if it has the necessary bandwidth capacity available. Furthermore, we can set an upper bound on this probability on the protocol level so that spamming of requests for v-lists can be detected. We denote by $p_B$ the probability that an arbitrary query request includes a request for a v-list.
A more formal understanding of the approach is the following: assume that in the last round a node y received k votes, submitted by nodes $z_1, \ldots, z_k$. If a node x asks y for a v-list, then y sends the votes submitted by $z_1, \ldots, z_k$ along with the identities of $z_1, \ldots, z_k$ but without their signatures. This reduces the message size. Node x compares the opinions in the v-list submitted by y with other received v-lists. If x detects a node that sent different opinions, it will ask the corresponding nodes for the associated signatures in order to construct a proof of the malicious behaviour. Having collected the proof, the honest node gossips the evidence to the network and the adversarial node will be dropped by all honest nodes after they have verified the proof. Note that a single piece of evidence for berserk behaviour is sufficient and that further evidence does not yield any additional benefit.
9.2 Expected Number of Rounds Before Detection
To test how reliable this detection method is and what the communication overhead would be, we carry out the following back-of-the-envelope calculations for s = 0 and s > 0. We are interested in the probability of detecting a berserk adversary since the inverse of this probability equals the estimated number of rounds that are required to detect malicious behaviour of a given node. Let us start with s = 0 and consider the following scenario. Among N nodes there is a single berserk node B. In the previous round, the adversarial node is (in expectation) queried k times. To see this note that in the case of s = 0, nodes are queried with uniform probability and every node has to receive on average the same number of queries. Furthermore, the berserk node sends f replies with opinion 0 to the group of nodes $G_0$ and (k − f) replies with opinion 1 to the group of nodes $G_1$. The probability that a node x receives v-lists that allow for the detection of the berserk node is in this case bounded below by
$$P(x \text{ receives v-lists from } G_0 \text{ and } G_1) \ge 2 \binom{k}{2} p_B^{2} \, \frac{f}{N} \cdot \frac{k-f}{N-1} \cdot \frac{N-k}{N-2} \cdots \frac{N-2k+3}{N-k+1} = \gamma_0.$$
The probability that some node detects the berserk behaviour satisfies
$$P(\text{some node detects malicious node}) \ge 1 - (1 - \gamma_0)^{N-1}.$$
For example, in a system with N = 1000, k = 20, $p_B$ = 0.1 and f = k − f = 10 the detection probability is bounded below by 0.23. Assuming that the full FPC voting (i.e., a voting cycle) for a conflict takes about 15 rounds, berserk nodes can be detected within one FPC voting cycle with high probability.
Precise calculations are more difficult to obtain for s > 0 and we give rough bounds instead. Let us assume that B holds the mana proportion $m_B$. In the case of mana, i.e., s > 0, it is not the number of nodes that are querying the berserk node that is essential, but their mana. The probability that any given honest node queries the berserk node is at least $m_B$, which implies that the average sum of mana of honest nodes that query the berserk node is at least $m_Q = m_B(1 - m_B)$. We assume that we can split up these nodes into two groups $G_0$ and $G_1$ of equal mana weight, i.e., $m_{G_0} = m_{G_1}$. The berserk node answers 0 to the nodes in $G_0$ and 1 to the ones in $G_1$. Then the probability that an honest node x queries and requests a v-list from a node from the group $G_i$ (i = 0, 1) is at least $p_B m_Q / 2$. Moreover,
$$P(x \text{ receives v-lists from } G_0 \text{ and } G_1) \ge 2 \left( \frac{p_B m_Q}{2} \right)^{2} = \gamma_1.$$
Similarly to above,
$$P(\text{some node detects malicious node}) \ge 1 - (1 - \gamma_1)^{N-1}.$$
For instance, if N = 1000, $p_B$ = 0.1 and $m_B$ = 0.2 the detection probability is greater than 0.12. Note that the above bound holds already for k = 2. Hence, higher values of k will lead to detection probabilities close to 1.
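The two numerical examples can be checked by evaluating the bounds $\gamma_0$ and $\gamma_1$ directly. The function names below are hypothetical; the formulas are those stated in this section.

```python
from math import comb


def gamma0(n, k, f, p_b):
    """Lower bound for s = 0 on the probability that one node can detect B."""
    g = 2 * comb(k, 2) * p_b ** 2 * (f / n) * ((k - f) / (n - 1))
    # Remaining factors (N-k)/(N-2), ..., (N-2k+3)/(N-k+1).
    for j in range(k - 2):
        g *= (n - k - j) / (n - 2 - j)
    return g


def gamma1(m_b, p_b):
    """Rough lower bound for s > 0 with m_Q = m_B * (1 - m_B)."""
    m_q = m_b * (1 - m_b)
    return 2 * (p_b * m_q / 2) ** 2


def detection_probability(gamma, n):
    """Lower bound on the probability that some node detects B in one round."""
    return 1 - (1 - gamma) ** (n - 1)


p_uniform = detection_probability(gamma0(1000, 20, 10, 0.1), 1000)
p_mana = detection_probability(gamma1(0.2, 0.1), 1000)
print(round(p_uniform, 3), round(p_mana, 3))
```

Both values are per-round bounds, so over a voting cycle of about 15 rounds the chance that a berserk node goes unnoticed becomes small.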
10 Heuristic for Choosing the Quorum Size
An important parameter that dominates the performance is the quorum size k. It may be chosen as large as the network capacity allows, in a dynamic fashion, or as small as security allows in order to be sustainable. Previous results, e.g., [11] and [5], show that an increase of k decreases the failure rates exponentially. Let us give here some heuristic probabilistic bounds on what kind of values of k may be reasonable. Here we consider only the Vanilla FPC but note that the same behaviour occurs for the changed protocol. The case s = 0 can be treated analytically as follows. One disadvantage of majority voting is that even if there is already a predominant opinion present in the network, e.g., opinion 1 if p > τ, a node may by bad chance pick too many nodes of the minority opinion. Let p be the average opinion in the network and τ the threshold with which a node decides whether to choose the opinion 1 or 0 for the next round. More specifically, if more than τk nodes respond with 1 the node selects 1, and 0 otherwise. The number of received 1 opinions follows a binomial distribution B(k, p). Hence, the probability for a node to receive opinions that result in an η-value leading to the opinion 0 is given by
$$P_{0,k}(\tau) = P(Y \le \lfloor \tau k \rfloor) = \sum_{m=0}^{\lfloor \tau k \rfloor} \binom{k}{m} p^{m} (1-p)^{k-m},$$
where Y ∼ B(k, p). As we are interested in the exponential decay of the latter probability as k → ∞ we use a standard large deviation estimate, e.g., [6], to obtain for τ < p:
$$P_{0,k}(\tau) \approx e^{-k I(\tau)}, \tag{6}$$
with rate function
$$I(\tau) = \tau \log\frac{\tau}{p} + (1 - \tau) \log\frac{1-\tau}{1-p}. \tag{7}$$
Fig. 4. Probability for a node to choose the opinion 0 for τ = 0.5 in the mana setting.
This shows an exponential decay of P0,k (τ ) in k and that the rate of decay depends on the “distance” between p and τ . An exact calculation in the mana setting of these probabilities is more difficult to obtain. We consider the situation where the top mana holders have opinion 1 and the remaining nodes have opinion 0 such that a proportion p of the mana has opinion 1. Figure 4 shows estimates, obtained by Monte-Carlo simulations, of the probability that the highest mana node will switch to opinion 0.
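For s = 0 the exact binomial probability and the large deviation estimate (6) can be compared numerically. The sketch below uses hypothetical function names and the illustrative values p = 0.66 and τ = 0.5; for τ < p the quantity $e^{-kI(\tau)}$ is also the classical Chernoff upper bound on the binomial tail.

```python
from math import comb, exp, floor, log


def p_switch_exact(k, p, tau):
    """P(Y <= floor(tau * k)) for Y ~ B(k, p)."""
    m_max = floor(tau * k)
    return sum(comb(k, m) * p ** m * (1 - p) ** (k - m) for m in range(m_max + 1))


def rate(tau, p):
    """Rate function I(tau) from Eq. (7)."""
    return tau * log(tau / p) + (1 - tau) * log((1 - tau) / (1 - p))


p, tau = 0.66, 0.5
for k in (20, 40, 80):
    print(k, p_switch_exact(k, p, tau), exp(-k * rate(tau, p)))
```

Doubling k roughly squares the estimate, which is the exponential decay in k that the heuristic relies on.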
11 Simulation Results
We perform simulation studies with the parameters given in Table 1 and study the 1%-agreement failure. In order to make the study of the protocol numerically feasible we choose the system parameters such that a high agreement failure rate is allowed to occur. However, as we will show, the parameters can be adapted such that a significantly lower failure rate can be achieved. The source code of the simulations is open source and available online.⁵ The initial opinion is assigned as follows. The highest mana nodes that together hold more than $p_0$ of the mana are assigned opinion 1 and the remaining nodes opinion 0. More formally, let
$$J := \min\Big\{ j : \sum_{i=1}^{j} m_i > p_0 \Big\},$$
then $s_i(0) = 1$ for all i ≤ J and $s_i(0) = 0$ for i > J.
⁵ https://github.com/IOTAledger/fpc-sim.
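The assignment of the initial opinions can be written down directly from the definition of J. This is an illustrative sketch and not taken from the linked repository; the function name is hypothetical and the mana list is assumed to be sorted in decreasing order.

```python
def initial_opinions(mana, p0):
    """Opinion 1 for the smallest set of top nodes holding more than p0 of the mana."""
    opinions = [0] * len(mana)
    cumulative = 0.0
    for i, m in enumerate(mana):
        opinions[i] = 1
        cumulative += m
        if cumulative > p0:
            break  # i + 1 equals J
    return opinions


start = initial_opinions([0.4, 0.3, 0.2, 0.1], p0=0.66)
```

In this small example the two largest nodes together hold 0.7 > 0.66 of the mana, so J = 2 and only they start with opinion 1.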
Table 1. Default Simulation Parameters
N
Parameter
Value
Number of nodes
1000
p0
Initial average opinion
0.66
τ
Threshold in first round
0.66
β
Lower random threshold bound 0.3
k
Quorum size
l
Final consecutive round
10
maxIt Max termination round
50
20
q
Proportion of adversarial mana 0.25
α
Minimum proportion of mana for agreement failure
0.01
We investigate a network with a relatively small quorum size, k = 20, and a homogeneous mana distribution (s = 0). The adversary is assumed to hold a large proportion of the mana with q = 0.25. Figure 5 shows the agreement failure rate with N. We observe that the improvements from Sect. 8 improve the performance of the protocol significantly for the lower range of N. For large values of N the improvements are still of one order of magnitude.

Figure 6 shows the agreement failure rate with the adversary's mana proportion q. First, we can see that for the vanilla version the protocol performance remains approximately the same for small values of s; for s = 2, however, we observe a deterioration in performance. This effect may be explained by the skewness of the Zipf law, leading to a more centralized situation where the opinions of high mana nodes are susceptible to the sampling effects described in Sect. 8. We can also observe that the improvements enable the protocol to withstand a higher amount q of adversarial mana and that for most values of q the improvement is at least one order of magnitude. As we increase s we observe an agreement failure rate that is several orders of magnitude smaller than without the improvements.

Figure 7 shows the failure rate with the quorum size k. As discussed in Sect. 10, the probability for a node to select the minority opinion in a given round decreases exponentially with k, and this trend is also well reflected in the agreement failure rate, apart from small values of k. We show that the improvement of the failure rate becomes increasingly pronounced as the quorum size is raised. In Vanilla FPC the improvement decreases in the query size. It is interesting to note that for small query sizes (k ≤ 60) the centralized situation, s > 1, is more stable against attacks, but for larger k the centralized situations become more vulnerable than the less centralized ones. The improved FPC clearly performs better, and the improvement of the agreement rate is more important as s increases.
Fig. 5. Agreement failure rates with N , for s = 0. The improvements from Sect. 8 are applied individually.
Fig. 6. Agreement failure rates with q for three different mana distributions.
Finally, for s = 2 no failures are found in 10⁶ simulations for the improved algorithm, i.e., the failure rate is less than 10⁻⁶. This is in agreement with the performance increase observed in Fig. 6.
Fig. 7. Agreement failure rates with k.
We want to highlight that the experimental study above is only the first step towards a precise understanding of the protocol. There are not only numerous parameters of the protocol itself, different ways to distribute the initial opinions, and other types of failures to consider, but also many possible attack strategies that were not studied in this paper. We refer to [2] for a more complete simulation study on the Vanilla FPC and would like to encourage research in the direction of [2] for the FPC with weighted votes.
12 Discussions
A main assumption in the paper is that every node has a complete list of all other nodes. This assumption was made for the sake of simplicity. We want to stress that in [2] it was shown, for s = 0, that in general it is sufficient that every node knows about 50% of the other nodes. These results transfer to the setting s > 0 in the sense that a node should know about nodes that hold at least 50% of the mana. In many applications it is reasonable to assume that all large mana nodes are publicly known and that this assumption is therefore satisfied.

Another simplification that we applied in the presentation of our results is that we assumed that the mana of every node is known and that every node has the same perception of mana. However, such a consensus on mana is not necessary. Generally, it is sufficient if different perceptions of mana are sufficiently close. The influence of such differences on the consensus protocol clearly depends on the choice of the parameter s and may be controlled by adjusting the protocol parameters. However, a detailed study of the above effects is beyond the scope of the paper and should be pursued in future work.

For the implementation of FPC in the Coordicide version of IOTA, [12], it is important to note that the protocol, due to its random nature, is likely to perform well even in situations where the Zipf law is partially or even completely violated.

The fairness results in Sect. 3 concern the Vanilla FPC. Similar calculations for the adapted versions are more difficult to obtain and beyond the scope of this paper. In particular, the sampling is no longer a sampling with replacement, but the sampling is repeated until k different nodes are sampled; we refer to [13] for a first treatment of the difference between these two sampling methods. The introduced bias of a node towards its own opinion likely increases its voting power with respect to its own opinion but does not influence its voting power towards other nodes. Due to this fact, and because linear weights are the most natural choice, we propose this voting scheme also for the adapted version.

Acknowledgment. We are grateful to all members of the coordicide team for countless valuable discussions and comments on earlier versions of the manuscript.
References

1. Barborak, M., Dahbura, A., Malek, M.: The consensus problem in fault-tolerant computing. ACM Comput. Surv. 25(2), 171–220 (1993)
2. Capossele, A., Mueller, S., Penzkofer, A.: Robustness and efficiency of leaderless probabilistic consensus protocols within Byzantine infrastructures (2019)
3. Castellano, C., Fortunato, S., Loreto, V.: Statistical physics of social dynamics. Rev. Mod. Phys. 81(2), 591 (2009)
4. Condorcet, J.A.N.: Essai sur l'application de l'analyse à la probabilité des décisions rendues à la pluralité des voix. De l'Imprimerie Royal (1785)
5. Cruise, J., Ganesh, A.: Probabilistic consensus via polling and majority rules. Queueing Syst. 78(2), 99–120 (2014)
6. den Hollander, F.: Large Deviations. Fields Institute Monographs. 14. American Mathematical Society, Providence, RI (2000)
7. Jones, C.I.: Pareto and Piketty: The macroeconomics of top income and wealth inequality. J. Econ. Perspect. 29(1), 29–46 (2015)
8. Kondor, D., Pósfai, M., Csabai, I., Vattay, G.: Do the rich get richer? An empirical analysis of the bitcoin transaction network. PloS one 9, e86197 (2014)
9. Müller, S., Penzkofer, A., Camargo, D., Saa, O.: On fairness in voting consensus protocols (2020)
10. Levine, B.N., Shields, C., Margolin, N.B.: A survey of solutions to the sybil attack (2005)
11. Popov, S., Buchanan, W.J.: FPC-BI: Fast Probabilistic Consensus within Byzantine Infrastructures (2019). https://arxiv.org/abs/1905.10895
12. Popov, S., Moog, H., Camargo, D., Capossele, A., Dimitrov, V., Gal, A., Greve, A., Kusmierz, B., Mueller, S., Penzkofer, A., Saa, O., William, S., Wolfgang W., Vidal, A.: The coordicide, Luigi Vigneri (2020)
13. Raj, D., Khamis, S.H.: Some remarks on sampling with replacement. Ann. Math. Statist. 29(2), 550–557 (1958)
14. Sarwar, S., Marco-Gisbert, H.: Assessing blockchain consensus and security mechanisms against the 51% attack. Appl. Sci. 9, 1788 (2019)
15. Tao, T.: Benford's law, Zipf's law, and the Pareto distribution. https://terrytao.wordpress.com/2009/07/03/benfords-law-zipfs-law-and-the-pareto-distribution/
A Process Mining Approach to the Analysis of the Structure of Time Series

Julio J. Valdés¹, Yaimara Céspedes-González², Kenneth Tapping³, and Guillermo Molero-Castillo⁴

¹ National Research Council Canada, Digital Technologies Research Centre, M50, 1200 Montreal Road, Ottawa K1A0R6, Canada
² Faculty of Accounting and Administration, Universidad Veracruzana, Veracruz, Mexico
³ Herzberg Institute for Astrophysics, National Research Council Canada, Victoria, Canada
⁴ Engineering Faculty, Universidad Nacional Autónoma de México, Mexico City, Mexico
Abstract. This paper presents a discussion of the potential of Process Mining for the analysis of general processes involving the time variation of magnitudes given by real-valued variables. These scenarios are common in a broad variety of domains, like natural and life sciences, engineering, and many others beyond business processes, where, in general, complex systems are observed and monitored using sensor data, producing time-series information. Two approaches are presented to construct event logs for such types of problems and one of them is applied to a real-world case (monitoring the F10.7 cm electromagnetic flux produced by the Sun). The results obtained with the Fuzzy Miner and the Multi-Objective Evolutionary Tree Miner algorithms successfully exposed the differences in the internal structure of the F10.7 cm series between Solar cycles. For this application, Process Mining proved to be a valuable tool for analyzing the rhythm of solar activity and how it is changing. The approach introduced here is general and could be used in the analysis of a broad variety of time-dependent information from many domains.

Keywords: Process Mining · Machine learning · Graph and trace clustering · Fuzzy models · Evolutionary multi-objective models
© Springer Nature Switzerland AG 2021
K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 379–392, 2021. https://doi.org/10.1007/978-3-030-63089-8_25

1 Introduction

Process mining (PM) is a set of techniques originally developed within the process management domain, mainly oriented to model, analyze, optimize, and automate business processes [1,2]. It is an already well-established research discipline
that combines machine learning and data mining with process modeling and analysis [3,4]. PM methods work with data consisting of event logs. Typically, an event log consists of a set of instances of a process, where each instance consists of ordered events. The event log has properties such as activity and time as well as additional ones like resource or cost [2]. Table 1 shows an example of an event log.

Table 1. Example of an event log.

| Case ID | Event ID | dd-MM-yyyy:HH.mm | Activity | Resource | Costs |
|---|---|---|---|---|---|
| 1 | 35654423 | 30-12-2010:11.02 | register request | Pete | 50 |
| 1 | 35654424 | 31-12-2010:10.06 | examine thoroughly | Sue | 400 |
| 1 | 35654425 | 05-01-2011:15.12 | check ticket | Mike | 100 |
| 1 | 35654426 | 06-01-2011:11.18 | decide | Sara | 200 |
| ··· | ··· | ··· | ··· | ··· | ··· |
In particular, certain activities are performed at certain times by certain subjects (resources). Besides these key elements, an event log may provide additional information about the process (e.g. cost). Thus, one of the most interesting possibilities offered by process mining is to discover process models from event logs, which guarantees that the discovered model describes the actual behavior recorded [2,3]. Models are suitable representations because they allow communicating complex knowledge in a more intuitive way [5–7].

Process mining has become a very useful tool for the analysis of systems of events and has been used very successfully in domains like business, management, health, and social studies, among others. PM algorithms cover a wide variety of problems and approaches [1,9–12]. However, despite its great potential, applications in non-business domains, like natural and life sciences, engineering, and many others, are comparatively few [7,8]. The purpose of this paper is to start filling this gap by introducing process mining to the analysis of time series from processes in any domain producing continuous magnitudes (e.g. sensor data). The objective is to uncover patterns of behavior within the dynamics of a general system and to characterize the changes associated with these patterns using process mining.

The paper is organized as follows: Section 2 presents process mining, discusses its applications to the study of natural processes, and presents approaches for constructing process logs from time series of real-valued magnitudes. Section 3 describes two process mining techniques used in the application example (Fuzzy Miner and the Evolutionary Multi-Objective Tree Miner). Section 4 presents the application example (monitoring the Sun's F10.7 cm flux) and the results obtained with the process mining modeling, and Sect. 5 summarizes the paper.
2 Process Mining
The aim of process mining is to discover, monitor, and improve processes by extracting knowledge from event logs. As described above, these are collections of cases containing sequences of certain events [6,13]. Typical scenarios described by sequences of events are database systems, transaction logs in a trading system, and message logs, among others [4,7,14]. Process mining provides a set of techniques that allows the analysis of event logs in three main directions [2]: i) discovery, ii) conformance and iii) enhancement. Discovery techniques produce a process model. Conformance techniques compare a process model with the event log of the same process; they verify whether the process model agrees with the reality recorded in the event log. Enhancement techniques extend or improve an existing process model by using event log information.

The minimum requirements for process mining are that any event can be related to both an activity and a case and that events within a case are ordered. At the same time, events can be characterized by various attributes (timestamp, resource or performer, activity name, and other data). Different techniques use these attributes for specific analyses [2,3,8]. The learned model can cover different approaches: i) control-flow, which describes the order of execution of the activities within the process; ii) organizational, which discovers the actors that participate in the process and how they are related; iii) case, which focuses on the values of the data elements to characterize the process; and iv) time, which allows time analysis.

2.1 Extraction of Process Logs from Time Series
A time series is a sequence of values of a certain magnitude recorded at specific times. If T is an index set (e.g. time), a time series is given by Y = {Y_t : t ∈ T}. In the case of real-valued magnitudes, Y_t ∈ R. Thus, from a process mining perspective, where the series is seen as a process, it is necessary to represent it as an event log and therefore to identify its elements (cases, activities, resources, and time). This could be done in several ways. Cases will be defined here as segments of the series of a certain length, with or without overlapping (they will be unique). Since activities and resources are discrete entities, a discretization process is required for time series of continuous magnitudes, where Y_t ∈ R.

Change is an essential aspect of a time series and it could be seen as the activity that the series, as a process, performs at a given time. Since time series values experience these changes, they could be seen as the entities that perform the changes, that is, as the resources of the process. In a simple approach, resources could be interpreted as the intensity levels of the discretized series. If C_p = {C_1, ..., C_n}, n ∈ N+, C_i ∈ R for all i ∈ [1, n], with C_i < C_{i+1} for i ∈ [1, n), is a sequence of cut points, it induces a partition of the range of Y into p = n − 1 categories. They could be considered as the state of the series at time t, L_k = {Y_t ∈ (C_k, C_{k+1}]}, and collectively as the resources of the process.

In this simple approach, the activity is the change experienced by a time series value and it is defined by characterizing qualitatively and quantitatively the difference (Y_{t+1} − Y_t). Three classes of change are considered: Increasing (I), Decreasing (D), and Constant (C, no change). If min and max are functions returning the minimum and maximum of the time series respectively, R = |max(Y_t) − min(Y_t)| is the range of the time series, and α ∈ (0, 1), β ∈ (0, 1) are constants, then the activity performed by a resource at time t is defined as Act(L_k) = C iff |Y_{t+1} − Y_t| ≤ αR. Otherwise, if Y_{t+1} > Y_t, Act(L_k) = I_d, where d = round(|Y_{t+1} − Y_t|/(βR)) and round(x) returns the closest integer to x. Finally, if Y_{t+1} < Y_t, Act(L_k) = D_d. The scenarios are shown in Fig. 1.
Fig. 1. Simple approach for defining resources and activities in a time series.
This simple approach is the one used in the paper for illustrating the analysis of a continuous time series (Sect. 4). Note that other approaches for defining an event log from a time series are possible. For instance, resources and activities could be defined in terms of category levels with respect to the mean of the series, by creating the cut points C_p = {C_1, ..., C_n} covering intervals given by a certain fraction γ of the standard deviation σ(Y) of Y. In the same way, the constants α ∈ (0, 1), β ∈ (0, 1) for defining activity levels could act upon σ(Y) instead of on the range.
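The simple range-based construction of Sect. 2.1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function name, the resource labels R1, R2, ..., the clamping of values that fall outside the cut points, and the use of fixed-length non-overlapping cases are all choices made here for the example.

```python
from bisect import bisect_left


def build_event_log(series, cut_points, alpha, beta, case_length):
    """Turn a real-valued series into event-log rows.

    Each row is (case_id, time_index, activity, resource).
    """
    value_range = abs(max(series) - min(series))   # R in the text
    n_levels = len(cut_points) - 1                 # p = n - 1 categories
    events = []
    for t in range(len(series) - 1):
        current, following = series[t], series[t + 1]

        # Resource: level k with C_k < Y_t <= C_{k+1}. Values outside the
        # cut points are clamped to the first or last level (an assumption).
        k = min(max(bisect_left(cut_points, current), 1), n_levels)
        resource = f"R{k}"

        # Activity: Constant, or Increasing/Decreasing of order d.
        diff = following - current
        if abs(diff) <= alpha * value_range:
            activity = "C"
        else:
            # Note: Python's round() sends exact .5 ties to the even integer.
            d = round(abs(diff) / (beta * value_range))
            activity = f"I{d}" if diff > 0 else f"D{d}"

        case_id = t // case_length + 1             # non-overlapping segments
        events.append((case_id, t, activity, resource))
    return events


log = build_event_log([0, 1, 1, 5, 2, 2], cut_points=[0, 2, 4, 6],
                      alpha=0.05, beta=0.2, case_length=3)
for row in log:
    print(row)
```

Each printed row corresponds to one line of an event log in the sense of Table 1, with the time index playing the role of the timestamp.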
3 Process Mining Techniques

3.1 Fuzzy Models
Fuzzy miner is a process discovery algorithm capable of handling unstructured processes and numerous activities, providing a simplified process visualization [5,15]. The algorithm uses correlation and significance metrics to simplify the process model and to build a graph where [5]: i) the most significant behavior is conserved, ii) the less significant and most correlated behavior is grouped, and iii) the less significant and less correlated behavior is not considered in the model. Measurements of significance and correlation are modifiable in order to get the desired result. To this end, a set of metrics has been developed [15]: i) unary significance, ii) binary significance, and iii) binary correlation. The algorithm initially creates an early process model where the importance of model nodes (i.e. event classes) is determined by the unary significance and the edges are characterized by the binary significance and correlation. Later, three transformations are applied to the model to simplify it [5,15]: conflict resolution, edge filtering, and aggregation and abstraction.

Conflict Resolution. In this first transformation, conflict relations are identified, classified, and resolved. There is a conflict relation when two nodes in the model are connected in both directions. A conflict relation can be classified as one of three situations: length-two-loop, exception, or concurrency. To resolve the conflict, the relative significance of the relations is determined (for more detail see [5]). When the relative significances of a conflict relation between A and B are known, i.e. rel(A, B) and rel(B, A), it is possible to resolve the conflict as follows. If both rel(A, B) and rel(B, A) exceed a threshold value, then it is inferred that A and B form a length-two-loop and both relations remain in the model. When at least one of these two values is below the threshold value, the offset ofs(A, B) = |rel(A, B) − rel(B, A)| is determined. If the offset exceeds a specified ratio threshold, it is assumed that the less significant relation is an exception and it is removed from the process model. Otherwise, when the offset is below the specified ratio threshold, it is concluded that A and B are concurrent and both relations are removed from the model.

Edge Filtering. In this transformation, each edge is evaluated by its utility util(A, B), a weighted sum of its significance and correlation (for more detail see [5]). The incoming and outgoing edges of each node are filtered. The edge cutoff parameter co ∈ [0, 1] allows configuring which edges are preserved. For each node, the utility values are normalized to [0, 1], where 1 is assigned to the strongest edge. All edges whose normalized utility exceeds the cutoff value are added to the model.

Node Aggregation and Abstraction. In this last transformation, the main idea is to preserve highly correlated clusters of less-significant nodes and to take away less-significant isolated nodes. The removal of nodes is configured through the node cutoff parameter. Nodes whose unary significance is below this parameter can either be aggregated or abstracted.
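The conflict resolution and edge filtering rules described above can be sketched as follows. This is only an illustration of the decision logic, not the Fuzzy Miner implementation of [5,15]: the function and parameter names are hypothetical, the weighted sum is written with a single assumed weight `utility_ratio`, and the normalization is assumed to map the weakest edge to 0 and the strongest to 1.

```python
def resolve_conflict(rel_ab, rel_ba, preserve_threshold, ratio_threshold):
    """Classify the conflict between the edges A->B and B->A.

    Returns (label, kept_edges), where kept_edges is the set of edges
    that stay in the model.
    """
    if rel_ab > preserve_threshold and rel_ba > preserve_threshold:
        return "length-two-loop", {"A->B", "B->A"}
    offset = abs(rel_ab - rel_ba)
    if offset > ratio_threshold:
        # The less significant relation is treated as an exception.
        stronger = "A->B" if rel_ab > rel_ba else "B->A"
        return "exception", {stronger}
    return "concurrency", set()


def edge_utility(significance, correlation, utility_ratio):
    """Weighted sum of significance and correlation (weight is an assumption)."""
    return utility_ratio * significance + (1 - utility_ratio) * correlation


def filter_edges(utilities, cutoff):
    """Keep the edges of one node whose normalized utility reaches the cutoff.

    `utilities` maps a neighbour name to util(neighbour, node) for the
    incoming (or outgoing) edges of a single node.
    """
    lowest, highest = min(utilities.values()), max(utilities.values())
    if highest == lowest:
        return set(utilities)          # all edges are equally strong
    return {name for name, u in utilities.items()
            if (u - lowest) / (highest - lowest) >= cutoff}


print(resolve_conflict(0.8, 0.3, preserve_threshold=0.6, ratio_threshold=0.2))
print(filter_edges({"X": 0.9, "Y": 0.5, "Z": 0.1}, cutoff=0.4))
```

The comparison in `filter_edges` is non-strict so that the strongest edge of a node is always preserved, even for a cutoff of 1; this is a design choice of the sketch.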
3.2 Evolutionary Multi-objective Pareto Process Trees
None of the standard techniques for learning process models guarantees the production of syntactically correct models. Moreover, they do not provide insights into the trade-offs between the different quality measures. The Evolutionary Tree Miner algorithm (ETMd) [16,17] is capable of learning sound process models that balance four established quality measures: i) simplicity, ii) replay fitness, iii) precision, and iv) generalization. Simplicity is about reducing the size of a process model by removing nodes that do not improve or compromise behavior in order to give preference to simpler, rather than complex models (Occam's Razor principle). The replay fitness measure quantifies the fraction of the event log supported by the process model; the precision measure quantifies how much of the behavior described by the process model is not observed in the event log; and generalization evaluates how well the process model explains the behavior of the system, and not only the particular event log describing the observed behavior [18].

ETMd is an implementation of a genetic programming evolutionary algorithm, which evolves trees, each one representing a process model. It works by generating an initial population of candidate solutions that are evaluated (using the aforementioned four quality measures) and processed with evolutionary operators (selection, crossover, and mutation) in cycles that produce successive generations (with/without elitism), until a termination criterion is met (number of generations surpassed, lack of improvement of the best solution, performance measures exceeding a given threshold, among others). Common selection mechanisms are roulette-wheel and tournament selection. In the ETMd algorithm, provisions such as prioritizing smaller over larger solutions are made to prevent the bloat phenomenon common in genetic programming scenarios.

The evaluation of candidate solutions is not based on weighted averages of the individual objective functions, an approach that suffers from several disadvantages. Instead, a four-dimensional Pareto front is maintained and updated at every generation, ensuring a true multi-objective optimization process that gradually eliminates the dominated solutions in favor of the non-dominated ones. At the end of the evolution, the user examines the resulting pairwise Pareto fronts related to the model quality measures and makes a choice. An important element differentiating ETMd from other approaches to process mining is that, even though models are evaluated using the event log data, they are the result of a generative process involving many candidate solutions with multiple objectives, where the best solutions balance these objectives.
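To make the notion of a Pareto front over the four quality measures concrete, the following Python sketch filters a set of candidate models down to the non-dominated ones. It is a generic illustration and not the ETMd code: the function names are hypothetical, each candidate is represented only by a tuple (simplicity, replay fitness, precision, generalization), and all four measures are assumed to be scaled so that larger values are better.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(candidates):
    """Return the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Hypothetical quality vectors of three candidate process trees.
population = [
    (0.9, 0.8, 0.7, 0.6),
    (0.8, 0.9, 0.7, 0.6),
    (0.7, 0.7, 0.6, 0.5),   # dominated by both candidates above
]
print(pareto_front(population))
```

Unlike a weighted average, this comparison never trades one measure against another, so the first two candidates are both retained although neither is better in every measure.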
3.3 Social Networks
Process mining generally focuses on discovering and analyzing the process model [3]. However, when the event log contains information about the resource it is possible to construct and analyze social networks. When the event log contains time information, it is possible to infer causal relations between activities and also social relations between resources [1,9]. Different metrics allow the identification of relations between resources within the process: i) handover of work, ii) subcontracting, iii) working together, iv) similar task, and v) reassignment. Of these, handover of work and subcontracting are based on causality. Their objective is to identify how work flows between resources within a process, and they were the ones used in this paper. There is a handover of work within a process instance from resource i to resource j when there are two successive activities where the first is performed by i and the second by j. When a resource j performs an activity between two activities performed by resource i, it is said that the work was subcontracted from i to j. In both metrics, it is possible to consider direct and indirect successions using a causality fall factor, which takes into account the number of activities between an activity completed by i and another completed by j [1,9].
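The two causality-based metrics can be illustrated with a short Python sketch that counts handover of work and subcontracting relations. It is a simplified illustration with hypothetical function names: each case is given only as the ordered list of resources that executed its events, and only direct succession is considered, so the causality fall factor for indirect succession is left out.

```python
from collections import Counter


def handover_of_work(cases):
    """Count direct handovers (i, j): an event by i directly followed by one by j."""
    counts = Counter()
    for resources in cases:
        for first, second in zip(resources, resources[1:]):
            counts[(first, second)] += 1
    return counts


def subcontracting(cases):
    """Count (i, j) where j performs one activity between two activities of i."""
    counts = Counter()
    for resources in cases:
        for before, middle, after in zip(resources, resources[1:], resources[2:]):
            if before == after and before != middle:
                counts[(before, middle)] += 1
    return counts


# Hypothetical cases: ordered resources of the events of two process instances.
cases = [["R1", "R2", "R1", "R3"], ["R2", "R2", "R3"]]
print(handover_of_work(cases))
print(subcontracting(cases))
```

The resulting counts can be read as weighted directed edges of a social network whose nodes are the resources.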
4 Application Example: The 10.7 cm Solar Radio Flux Series
The Sun's structure and behavior are largely controlled by magnetic fields. The level of magnetic activity follows an 11-year (really a 22-year) cycle. This rhythm pervades the Sun and modulates physical processes taking place at different locations throughout the Sun. The result is variations in the Sun's energy output and other emissions, for example, the ultraviolet emissions that heat the Earth's atmosphere and change the ionosphere. Monitoring the Sun is extremely important because of the impact that it has on Earth, ultimately affecting human activities (both on Earth and in space). Geomagnetic storms caused by solar flares and coronal mass ejections are responsible for disrupting communications, satellites, the power grid, and many other systems, with an economic impact measured in millions.

One of the most useful solar activity indices is the 10.7 cm solar radio flux (F10.7), which has been measured by the National Research Council of Canada since 1947 [19]. This index consists of measurements of the total solar radio emission at 10.7 cm wavelength (a frequency of 2800 MHz). It comprises contributions from the solar disc plus emission from all the activity centers on it. At least three emission mechanisms are involved. The most important are thermal free-free emission from plasma concentrations trapped in the chromosphere and lower corona by magnetic fields, and thermal gyroresonance, where those magnetic fields are strong enough for the electron gyrofrequency (f_g (MHz) = 2.8 B (Gauss)) to be higher than about a third of the observing frequency. This requirement is often met in the strong magnetic fields overlying sunspots. The third contribution is gyrosynchrotron (non-thermal) emission, driven by electrons accelerated by flares or other reconnection processes [20]. The main use of this index is to reflect the changes in the general level of magnetic activity, that is, evolutionary changes in the active structures, which have characteristic time-scales ranging from hours to weeks. The non-thermal emissions may vary dramatically over seconds to hours [21].

Three flux values are distributed for each measurement: the Observed Flux, which is the value as measured, the Adjusted Flux, which is the value corrected for the annual variations in the Earth-Sun distance, and the URSI Series-D Flux, which is 0.9 times the Adjusted Flux.
The flux data were obtained from the Canadian Space Weather Centre [22]. For a more detailed discussion of the F10.7 solar radio flux activity index, see [19]. The F10.7 solar radio flux data used in this paper are daily local noon values of the Adjusted Flux, smoothed using 27-point adjacent-averaging. This corresponds to the size of the synoptic map (≈ 27 days), covering a single solar rotation. The daily averaged and smoothed F10.7 flux for the time period from 1947 to October 2018 is shown in Fig. 2, which includes the discretization levels used for creating the event logs (Sect. 2.1), as well as the time periods corresponding to the different solar cycles (denoted as Cxx, where xx indicates the cycle number).

Fig. 2. F10.7 solar radio flux series (1947–2018): F10.7 adjusted flux versus time. Top labels indicate solar cycles [19–24]. Horizontal lines indicate intensity levels categorized into classes (7).
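The 27-point adjacent-averaging mentioned above can be sketched as a centred moving average. This is a minimal illustration with a hypothetical function name; in particular, the treatment of the first and last points, where fewer than 27 neighbours are available, is an assumption made here (the window is simply truncated at the ends of the series).

```python
def adjacent_average(values, window=27):
    """Centred moving average; the window shrinks near the ends of the series."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        segment = values[lo:hi]
        smoothed.append(sum(segment) / len(segment))
    return smoothed


# Small example with a 3-point window to keep the numbers readable.
print(adjacent_average([1, 2, 3, 4, 5], window=3))
```

With daily values and `window=27`, each smoothed point averages the 13 days before and after it together with the day itself, which matches one solar rotation.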
The flux is expressed in solar flux units (1 sfu = 10⁻²² W m⁻² Hz⁻¹). In the figure, we can see the declining phase of solar cycle 23 and cycle 24, which is nearing its minimum. The flux record spans more than the full length of solar cycles 23 and 24.

4.1 Fuzzy Models
Fuzzy models were computed for the first and the last two solar cycles contained in the F10.7 record (Cycles 19, 23, and 24). They are shown in Fig. 3 (top row), together with the models obtained with the Evolutionary MO techniques and the F10.7 series for comparison. Each fuzzy model is a graph that provides a simplified process visualization and describes the precedence relations among event classes. The yellow squares represent significant activities and each node is labeled with the event class name and the significance value. The edges that link nodes express their significance with thickness and darkness proportional to the strength of the connection.

The fuzzy models clearly reveal differences in the structure of the subprocesses associated with the cycles. Solar Cycle 19 consists of only six classes of events, while Cycles 23 and 24 consist of 15 and 10 classes of events, respectively, related in a much more complex manner. Taking into account the number of classes of events, the model for Cycle 19 seems much simpler and more balanced, i.e. with less abrupt jumps. The model of Cycle 23 involves the largest number of event types, related to a large number of high spikes characteristic of this cycle. Finally, Cycle 24 involves 10 classes of events, of which those indicating small changes appear with the highest frequency ({D1, I1, D2, I2, D3}). This situation explains why the cycle appears flatter than the two previous cycles. However, there is an abrupt peak (I6) with high frequency, indicating sudden, higher intensity variations, easily identified in the F10.7 behavior during Cycle 24.

Fig. 3. F10.7 solar radio flux series (1947–2018). Fuzzy and Evolutionary Multi-Objective Pareto Models corresponding to Solar cycles 19, 23 and 24 (each row contains networks of the same type). Left hand side: Cycle 19. Right hand side: Cycles 23 and 24. Panels: Fuzzy Models (C19, C23, C24), Evolutionary Multi-Objective Pareto Models (C19, C23, C24), and the F10.7 adjusted flux versus time.

4.2 Evolutionary Multi-objective Models
The ETMd algorithm was applied to the activities sub-logs of the F10.7 series for Cycles {19, 23, 24} with the following parameters: population size = 20, elite count = 5, number of generations = 1000, cross-over rate = 0.25, random tree creation rate = 0.25, random node addition = 1, random node removal rate = 1, random node mutation = 1 and useless node removal = 1. No solutions were filtered from the Pareto front based on preset thresholds for the quality measures. Upon termination, for each case, a trade-off solution was chosen as the one on the Pareto front closest to the overall optimum given by the vector determined by the best values of the individual quality measures (Sect. 3.2). The resulting process trees are shown in Fig. 3 (left to right for Cycles {19, 23, 24}). As with the fuzzy models, the Pareto trade-off models expose immediate differences in the underlying dynamics of the sub-processes for the starting and the ending solar cycles along the F10.7 flux record. However, when all quality measures are considered simultaneously, Cycle 19 exhibits a longer sequence of elements (7), with a larger number of constant and d = 1 order increasing/decreasing changes and loops, compared to Cycles 23 and 24. On the other hand, Cycle 23 is the one with the deepest tree and with jumps which are either small or more towards the extreme (d = {1, 3, 7}). Cycle 24 has a simpler activity change schema and slightly shorter sequences. It is structurally more similar to Cycle 23 than to Cycle 19, a relation that coincides with the one exhibited by their fuzzy model counterparts. These findings provide more insight into the changes in the Sun's behavior during the last cycles [21].
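The choice of the trade-off solution described above can be sketched in Python as follows. This is an illustration only: the function name is hypothetical, each solution on the front is represented by its tuple of quality measures with larger values assumed to be better, and the Euclidean distance is an assumed choice, since the text does not state which distance was used.

```python
from math import dist


def closest_to_ideal(front):
    """Pick the front member nearest to the vector of per-measure best values."""
    ideal = tuple(max(values) for values in zip(*front))
    return min(front, key=lambda point: dist(point, ideal))


# Hypothetical Pareto front of (simplicity, fitness, precision, generalization).
front = [
    (1.0, 0.6, 0.7, 0.5),
    (0.9, 0.9, 0.8, 0.8),
    (0.6, 1.0, 0.9, 0.9),
]
print(closest_to_ideal(front))
```

The ideal vector is usually not attained by any single model, which is why the nearest member of the front is taken as the compromise.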
Social Networks
The social networks constructed from the F10.7 process log for solar cycles {19, 23, 24} are shown in Fig. 4, corresponding (top to bottom) to the handover of work, similar tasks and the subcontracting models respectively. Recall that resources are the radiation intensity levels at which the different types of changes (activities) take place. The handover of work analysis indicates how activities are passed from one resource to another, and it depends on two parameters: The first one indicates whether to consider multiple transfers within one instance. The second parameter
Process Mining for Time Series
Fig. 4. F10.7 solar radio flux series (1947–2018). Social Networks models corresponding to Solar cycles 19, 23 and 24 (each row contains networks of the same type). Left hand side: Cycle 19. Right hand side: Cycles 23 and 24.
J. J. Valdés et al.
indicates whether to consider direct succession. In the representation, node sizes indicate the frequency with which resources have executed activities per process instance. The network for Cycle 19 involves six resources, with R3 being the one with the greatest participation. It receives work from R2, R4 and R6 and it gives work to R2 and R4. The opposite happens with R1, which has less participation, with only two edges: one for giving work to R2 and the other for receiving work from R2. On the other hand, for Cycle 23, seven resources are involved. The central nodes are R2 and R3 with 9 edges each. R2 receives work from {R1, R3, R4, R5, R6} and it gives work to {R1, R3, R5, R6}. R3 receives work from {R1, R3, R4, R5, R6} and it gives work to {R1, R3, R5, R6}. The opposite happens with R5 and R7, which have fewer edges. R5 is related to R1 and R3, while R7 is related to R2 and R4. Altogether, the dynamics are very different from those exhibited by Cycle 19. The network of Cycle 24 involves five resources. Node R2 is the one with the highest number of relations. It gives work to four other nodes and also receives work from them. R3 is the node with the fewest edges, as it only relates to R2 and R4, to which it gives work and from which it receives work. As was seen with other techniques, the structure and behavior are more similar to Cycle 23 than to Cycle 19.
Subcontracting Social Network. This type of network provides insight about resources performing an activity in between two other activities performed by other resources. It depends on two parameters: the first one establishes whether to consider multiple subcontracting relations and the second one allows the consideration of indirect subcontracting relationships. Node sizes differ because they are proportional to the number of contracting and subcontracting relationships. These networks are also shown in Fig. 4 (Subcontracting Models). Cycle 19 has six nodes in total. Nodes {R2, R3, R4, R5} have the same behavior (two incoming and two outgoing edges), whereas nodes R1 and R6 have only two edges, one incoming and one outgoing. Specifically, R1 subcontracts and is subcontracted only by R2, and R6 subcontracts and is subcontracted only by R5. In contrast, Cycle 23 consists of seven nodes with a very different structure, of which R2 is the node with the most incoming and outgoing connections. The two incoming edges indicate that R2 has been subcontracted twice, and the four outgoing edges indicate that R2 has subcontracted four other nodes {R1, R3, R5, R6}. Interestingly, nodes {R5, R6} only perform subcontracted work, whereas node R7 is only subcontracted by R3. Finally, Cycle 24 consists of five nodes, all of which have the same behavior (four incoming edges and four outgoing edges), which indicates that all of them subcontract and are subcontracted equally. Although different, its structure is more similar to Cycle 23 than to Cycle 19.
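To make the handover-of-work idea discussed in this section concrete, the following minimal Python sketch counts direct handovers between resources in a toy event log. The log contents, the function name `handover_of_work`, and the restriction to direct succession are assumptions made for illustration only; this is not the implementation used to produce the networks in Fig. 4.

```python
from collections import Counter

def handover_of_work(log, multiple_transfers=True):
    """Count direct handovers (r1 -> r2) between consecutive events of each case.

    log: dict mapping a case id to an ordered list of (activity, resource) pairs.
    multiple_transfers: if False, a given handover is counted at most once per case.
    """
    counts = Counter()
    for events in log.values():
        resources = [resource for _, resource in events]
        pairs = list(zip(resources, resources[1:]))
        if not multiple_transfers:
            pairs = set(pairs)
        counts.update(pairs)
    return counts

# Hypothetical toy log: resources are intensity levels, activities are change types.
toy_log = {
    "case1": [("I1", "R2"), ("D1", "R3"), ("I1", "R2"), ("D1", "R3")],
    "case2": [("I2", "R2"), ("D2", "R4")],
}
handovers = handover_of_work(toy_log)
handovers_once = handover_of_work(toy_log, multiple_transfers=False)
```

The `multiple_transfers` flag plays the role of the first parameter mentioned above (whether repeated transfers within one instance are counted); indirect succession is deliberately left out of this sketch.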
5 Conclusions
Process mining was discussed in the context of the analysis of continuous, real-valued time-varying magnitudes like time series and the monitoring with sensor
data. They are important in a broad variety of domains, like natural and life sciences, engineering, and many others. Approaches were presented that describe the variations of continuous magnitudes as event logs where the intensity levels of the time series are interpreted as the resources involved and the type and magnitude of their variation are mapped to activities. This representation allows the application of process mining techniques to problems like monitoring with sensor data and other types of time-varying phenomena. In particular, the Fuzzy Miner (FM) and the Multi-Objective Evolutionary Tree Miner (ETMd) process mining algorithms were applied to a time series of the F10.7 flux index of Solar activity. These techniques successfully constructed models for the process that exposed the differences in the internal structure of the time series between Solar cycles and provided a better understanding of the changing dynamics of the physical system (the Sun). In this application, Process Mining proved to be a valuable tool for analyzing the rhythm of solar activity and how it is changing. The results obtained are promising and further studies should extend the range of application domains, the dimensionality of the time series, the data types of the variables describing the time-dependent processes, as well as comparison with other data mining procedures.
Acknowledgment. The F10.7 data are provided by the National Research Council of Canada and Natural Resources Canada. Y. Céspedes-González acknowledges the support of the University of Veracruz/Faculty of Accounting and Administration/Ph.D. program in Administrative Sciences.
References
1. van der Aalst, W.M., Reijers, H.A., Song, M.: Discovering social networks from event logs. Comput. Support. Coop. Work (CSCW) 14(6), 549–593 (2005). https://doi.org/10.1007/s10606-005-9005-9
2. van der Aalst, W.M.: Process Mining: Data Science in Action, 467 p. Springer, Heidelberg (2016)
3. van der Aalst, W.M., Adriansyah, A., Alves de Medeiros, A.K., Arcieri, F., Baier, T., Blickle, T., et al.: Process mining manifesto. In: Business Process Management Workshops, vol. 99, pp. 169–194 (2012). https://doi.org/10.1007/978-3-642-28108-2_19
4. Geyer, J., Nakladal, J., Baldauf, F., Veit, F.: Process mining and robotic process automation: a perfect match. In: Proceedings of 16th International Conference on Business Process Management, pp. 124–131 (2018)
5. Günther, C.W., van der Aalst, W.M.: Fuzzy mining–adaptive process simplification based on multi-perspective metrics. Lecture Notes in Computer Science, vol. 4714, pp. 328–343 (2007). https://doi.org/10.1007/978-3-540-75183-0_24
6. R'bigui, H., Cho, C.: The state-of-the-art of business process mining challenges. Int. J. Bus. Process Integr. Manag. 8(4), 285–303 (2017). https://doi.org/10.1504/IJBPIM.2017.088819
7. Kouzari, E., Stamelos, I.: Process mining applied on library information systems: a case study. ScienceDirect 40(3–4), 245–254 (2018)
8. Pika, A., Wynn, M., Budiono, S., ter Hofstede, A., van der Aalst, W.M., Reijers, H.: Towards privacy-preserving process mining in healthcare. Lecture Notes in Business Information Processing, vol. 362, pp. 483–495 (2019). https://doi.org/10.1007/978-3-030-37453-2_39
9. van der Aalst, W.M., Song, M.: Mining social networks: uncovering interaction patterns in business processes. In: Proceedings of International Conference on Business Process Management, pp. 244–260. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-25970-1_16
10. van Dongen, S.: A cluster algorithm for graphs. National Research Institute for Mathematics and Computer Science in the Netherlands, Technical report INS-R0010 (2000)
11. Hompes, B.F., Buijs, J.C., van der Aalst, W.M., Dixit, P.M., Buurman, J.: Detecting change in processes using comparative trace clustering. In: Proceedings of 5th International Symposium on Data-Driven Process Discovery and Analysis (SIMPDA), pp. 95–108 (2015)
12. Hompes, B.F., Buijs, J.C., van der Aalst, W.M., Dixit, P.M., Buurman, J.: Discovering deviating cases and process variants using trace clustering. In: Proceedings of 27th Benelux Conference on Artificial Intelligence (BNAIC), Belgium (2015)
13. Céspedes-González, Y., Valdés, J.J., Molero-Castillo, G., Arieta-Melgarejo, P.: Design of an analysis guide for user-centered process mining projects. In: Advances in Information and Communication, vol. 69, pp. 667–682 (2019). https://doi.org/10.1007/978-3-030-12388-8_47
14. Molero-Castillo, G., Jasso-Villazul, J., Torres-Vargas, A., Velázquez-Mena, A.: Towards the processes discovery in the medical treatment of mexican-origin women diagnosed with breast cancer. In: Advances in Information and Communication, vol. 69, pp. 826–838 (2019). https://doi.org/10.1007/978-3-030-12388-8_56
15. Günther, C.W.: Process mining in flexible environments. Ph.D. thesis, School for Operations Management and Logistics, Eindhoven University of Technology (2009)
16. Buijs, J.C.: Flexible evolutionary algorithms for mining structured process models. Ph.D. thesis, Technical University of Eindhoven (2014)
17. Buijs, J.C., van Dongen, B.F., van der Aalst, W.M.: A genetic algorithm for discovering process trees. In: Proceedings of IEEE Congress on Evolutionary Computation, pp. 1–8 (2012). https://doi.org/10.1109/CEC.2012.6256458
18. Buijs, J.C., van Dongen, B.F., van der Aalst, W.M.: Quality dimensions in process discovery: the importance of fitness, precision, generalization and simplicity. Int. J. Coop. Inf. Syst. 23(1), 1–39 (2014). https://doi.org/10.1142/S0218843014400012
19. Tapping, K.F.: The 10.7 cm solar radio flux (F10.7). Space Weather 11, 394–406 (2013). https://doi.org/10.1002/swe.20064
20. Tapping, K.F., DeTracey, B.: The origin of the 10.7 cm solar flux. Solar Phys. 127, 321–332 (1990)
21. Tapping, K.F., Valdés, J.: Did the sun change its behaviour during the decline of cycle 23 and into cycle 24? Solar Phys. 272, 337–347 (2011). https://doi.org/10.1007/s11207-011-9827-1
22. Natural Resources Canada (NRCan): Space Weather Canada (2019). https://spaceweather.gc.ca/index-en.php
OptiShard: An Optimized and Secured Hierarchical Blockchain Architecture
Shyam Kantesariya and Dhrubajyoti Goswami
Department of Computer Science and Software Engineering, Concordia University, Montreal, Canada
{s_kante,goswami}@encs.concordia.ca
Abstract. Blockchain has become an emerging decentralized computing technology for transaction-based systems due to its peer-to-peer consensus protocol over an open network consisting of untrusted parties. Bitcoin and other major alt-coins based on a monolithic blockchain architecture exhibit significant performance overhead, which in turn makes them highly non-scalable. Imposing hierarchy in blockchain can improve performance; however, it requires additional security and fault-tolerance measures to ensure the correctness of transaction validation. This paper presents a hierarchical blockchain architecture named OptiShard, which addresses the issues of performance, fault tolerance, and security in the presence of faulty and malicious nodes. The hierarchy comes as a result of dividing the network nodes into multiple disjoint shards. The majority of transactions are distributed among these shards in a non-overlapped fashion (i.e., one-to-one mapping of transactions to shards). The model of OptiShard provides a theoretical measure to determine the optimal shard size based on two parameters: performance and correctness of transaction validation in the presence of malicious or faulty nodes. The theoretical measure guarantees a majority of good shards by allowing the right shard size to be chosen, and it forms the basis of the network and workload sharding protocols discussed in this paper. OptiShard also provides a mechanism for identifying faulty shards through the overlapping of a small fraction of transactions across all the shards so that all faulty transactions can be discarded. Experimental results exhibit the impact of sharding the network on performance and conform to the theoretical results.
Keywords: Sharding blockchain · Scalability · Secured blockchain · Fault tolerance · Decentralized consensus
1 Introduction
1.1 Background
Blockchain is an immutable data structure that represents an ordered chain of blocks, which are timestamped and connected in such a way that each block has a reference to its previous block [1]. Nodes participating in the network maintain this chain of blocks, which is called the ledger. Thus, a blockchain network consists of a distributed ledger agreed upon at all times by consensus among a majority of the nodes [1]. If admission to the
© Springer Nature Switzerland AG 2021 K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 393–411, 2021. https://doi.org/10.1007/978-3-030-63089-8_26
S. Kantesariya and D. Goswami
network is governed by an authority, then we categorize it as a private blockchain; otherwise, as a public blockchain [2]. The unit of blockchain is a block, which records the Merkle root [3] of committed transactions, the hash value of the previous block, a timestamp, and proof of its eligibility to be the next block in the chain. The entire blockchain network goes through consensus to add a new block to the chain. As every block stores the hash value of its previous block, any effort by malicious nodes to alter any committed transaction will lead to changes in all subsequent blocks, which makes a malicious attack nearly impossible [4].
Bitcoin is one of the prominent applications of the blockchain architecture. Bitcoin follows the Nakamoto consensus protocol [4] and supports a monolithic architecture. As per the Nakamoto protocol, every node competes to generate a new block by consuming some CPU cycles required to solve a mathematical problem, which involves creating a valid hash value for the new block; this is also called Proof-of-Work (PoW) [4]. Once a node solves the PoW, it gets the privilege to create the next block in the chain. PoW uses a randomized approach to select the node that generates the new block. The transactions in a block are considered committed only when at least 51% of nodes accept the block as the next block. Though the Nakamoto consensus protocol has proved its practical correctness and applicability in the Bitcoin network, it has certain major drawbacks to be addressed. Firstly, it consumes a considerable amount of energy. Secondly, the monolithic architecture itself is not scalable due to the considerable costs involved in computation, communication, and consensus. Lastly, if a node or group of nodes can manage 51% of the computing power, then collectively they have the ability to alter a committed block, which is known as a 51% attack [5]. Bitcoin has other major security threats, for example the Sybil attack [6], which are beyond the scope of this paper.
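The tamper-evidence property of hash-linked blocks described above can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions: blocks carry only an index, the previous hash, and a transaction list (no Merkle root, timestamp, or PoW), and the helper names `block_hash`, `build_chain`, and `first_broken_link` are hypothetical.

```python
import hashlib
import json

GENESIS_PREV = "0" * 64  # assumed placeholder for the first block's predecessor

def block_hash(block):
    """SHA-256 over a canonical JSON encoding of the block fields."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def build_chain(batches):
    """Link blocks so that each one stores the hash of its predecessor."""
    chain, prev = [], GENESIS_PREV
    for index, txs in enumerate(batches):
        block = {"index": index, "prev_hash": prev, "transactions": txs}
        chain.append(block)
        prev = block_hash(block)
    return chain

def first_broken_link(chain):
    """Return the index of the first block whose prev_hash does not match, else None."""
    prev = GENESIS_PREV
    for block in chain:
        if block["prev_hash"] != prev:
            return block["index"]
        prev = block_hash(block)
    return None

chain = build_chain([["a->b:5"], ["b->c:2"], ["c->a:1"]])
intact = first_broken_link(chain)          # chain is consistent here
chain[0]["transactions"] = ["a->b:500"]    # tamper with a committed transaction
broken = first_broken_link(chain)          # block 1 no longer matches block 0
```

Altering block 0 changes its hash, so the reference stored in block 1 no longer matches; an attacker would have to recompute every subsequent block to hide the change.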
The current Bitcoin network processes on average about 7 transactions per second, compared to mainstream transaction processing platforms like VISA, which has the ability to process up to about 56,000 transactions per second [7]. Hence it is necessary to reengineer the blockchain architecture and its consensus protocol so that it becomes scalable enough to compete with the market leaders and at the same time runs at an economic operational cost [8]; this is a necessary step for blockchain to be adopted for enterprise use cases. In the recent past, various approaches have been explored to address the challenges of scalability and energy inefficiency of the Bitcoin system. The Proof-of-Stake (PoS) protocol was adopted by Peercoin [9] to address the energy inefficiency of PoW. Instead of burning CPU cycles as in PoW, miners put some currency at stake, which determines their chances of creating a new block. Besides PoS, other proposed consensus protocols include Delegated PoS [10], Proof-of-Burn (PoB) [11] and Proof-of-Personhood [12]. The Practical Byzantine Fault Tolerant (PBFT) protocol [13] has been proposed to replace the Nakamoto consensus protocol to increase throughput and reduce transaction commit latency. However, the system still does not scale well if all the transactions need to be validated by all the participants. Thus, sharding the network nodes and distributing the transactions among network shards, so that every shard validates only a subset of transactions, becomes a feasible alternative. Various approaches have been proposed to scale blockchain by sharding the network [14–19]. These are further elaborated in the next section on related works.
Though sharding improves scalability, it poses challenges to the security of the blockchain network in the presence of malicious nodes, because a network shard may produce a corrupted result if a majority of the nodes within the shard are malicious (i.e., a malicious shard). Moreover, depending on the number of malicious nodes and the shard size, it is also possible that a majority of shards are malicious, which will produce a corrupted end result. The existing approaches on sharding either assume that all the shards are non-malicious or otherwise take a probabilistic approach in determining the possibility of a malicious shard.
1.2 Contributions
This research addresses some of the issues on sharding that are not addressed by other related works, more specifically the problem of determining an optimal shard size while considering both performance and reliability in the presence of malicious and faulty nodes. A novel sharding scheme for a hierarchical blockchain architecture, named OptiShard, is proposed. The sharding scheme determines an optimal shard size based on two parameters: performance and correctness of transaction validation. The consensus protocol within each shard is a centralized variant of PBFT. The requirements of PBFT are used to calculate an allowable shard size that guarantees that a majority of the shards are good, based on a predefined majority. An optimal shard size can be determined by taking into account the other parameter, i.e., performance. The majority of transactions are distributed among shards in a non-overlapped fashion, i.e., one-to-one mapping of a transaction to a shard. However, a small fraction of transactions is mapped in an overlapped fashion, i.e., one-to-all mapping of a transaction to shards; this is to identify malicious/faulty shards by comparing the validation results of overlapped transactions across shards. OptiShard is a hybrid of decentralized and peer-to-peer architectures. Peer-to-peer links across the network shards enable efficient exchange of information without broadcasting. Though the current protocols use PoS instead of PoW to handle energy inefficiency, they can be modified to incorporate any other technique, if needed, without affecting the sharding scheme. These are elaborated in Sect. 3.
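As a rough illustration of how overlapped transactions could expose faulty shards, the sketch below compares each shard's validation result on the overlapped transactions with the majority result and flags any shard that disagrees. The data layout and the function name `find_faulty_shards` are assumptions for illustration; the protocol actually used by OptiShard is the one described later in the paper, not this sketch.

```python
from collections import Counter

def find_faulty_shards(results):
    """Flag shards that disagree with the majority on any overlapped transaction.

    results: dict mapping shard id -> dict of {tx_id: 0 or 1} validation results.
    A shard that reports no result for an overlapped transaction is also flagged.
    Assumes an odd number of shards so that a binary vote cannot tie.
    """
    faulty = set()
    tx_ids = set().union(*(r.keys() for r in results.values()))
    for tx in tx_ids:
        votes = Counter(r[tx] for r in results.values() if tx in r)
        majority_value, _ = votes.most_common(1)[0]
        faulty.update(s for s, r in results.items() if r.get(tx) != majority_value)
    return faulty

# Hypothetical results for two overlapped transactions validated by three shards.
overlap_results = {
    "S1": {"T1": 1, "T2": 0},
    "S2": {"T1": 1, "T2": 0},
    "S3": {"T1": 0, "T2": 0},
}
suspects = find_faulty_shards(overlap_results)
```

The comparison is only meaningful when a majority of shards are good, which is exactly what the choice of shard size is meant to guarantee.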
2 Related Works
RSCoin [14] proposes a hierarchical architecture in the presence of a trusted central authority in a private environment. Network nodes are divided into shards and each transaction is assigned to a specific shard based on a hash function. It follows a two-phase commit protocol. In phase one, a transaction is validated at shard level by a majority of the nodes within the shard and committed locally. Subsequently, all shards send their transactions to a central authority for the second phase of approval. A transaction is considered committed in a final block if approved by the central authority. RSCoin assumes that each network shard is non-malicious, i.e., composed of a majority of non-malicious nodes, and hence there is no mechanism to detect a malicious shard.
Chainspace [15] improves RSCoin by introducing a more general distributed ledger for smart contract processing. It uses the PBFT protocol for consensus within a shard. It
detects malicious shards by conducting a major audit and an expensive replay of all the transactions processed by a shard.
ELASTICO [16] proposes a PoW-based network sharding approach for a public blockchain. It randomly distributes the network nodes into multiple shards, each processing a disjoint set of transactions. Individual shards run the PBFT protocol for consensus within a shard. One of the shards collects the validation results from all other shards, commits them into a final block, and then broadcasts the block header to the network. At the beginning of each epoch, shards are reformed. By sharding the network nodes and distributing transactions, it achieves parallelism in transaction validation. Each shard can be malicious with a certain probability and hence the final result can be corrupt with a certain probability, however small.
OmniLedger [17] shards the network using the distributed randomness generation protocol RandHound [20], combined with a VRF-based leader election algorithm [21]. In addition to sharding the network nodes, it also shards the ledger, which requires each node to store only a portion of the ledger. Due to the sharding of the ledger, a two-phase commit protocol is used to handle transactions across multiple shards. Moreover, it proposes a dual-layer architecture to quickly process transactions carrying micro payments.
RapidChain [18] randomly distributes nodes among shards and uses offline PoW to increase the throughput. It follows a hypergeometric distribution of nodes to calculate the failure probability of the epoch. Monoxide [19] deterministically distributes the nodes among $2^k$ shards, for a predefined value of k. It proposes an economic incentive model combined with PoW that motivates non-malicious nodes to control the overall network.
In comparison to the previous approaches, our approach is based on a semi-private network, where some of the network nodes (i.e., those belonging to a governing authority) are trusted, while the others can be malicious. Also, any node can be faulty (byzantine failure). It defines a sharding scheme which is not based on a probabilistic model. As long as there is an upper bound, f (say, $f < \frac{n}{3}$), on the total number of malicious and faulty nodes in the network, it provides the allowable shard sizes for which a majority of shards are guaranteed to be good irrespective of how the nodes are distributed among the shards. Combined with a performance model, an optimal shard size is determined. Each shard processes its own set of transactions. However, a small fraction of transactions is overlapped across all the shards to enable efficient identification of any malicious/faulty shards based on majority consensus from good shards. Table 1 summarizes some of the comparisons, which are further elaborated in the following.
3 Proposed Architecture
A hierarchical blockchain architecture, named OptiShard, is proposed. The architecture is a hybrid of decentralization and peer-to-peer. Network nodes are sharded into disjoint partitions called committees and workloads (transactions) are distributed among these committees (Fig. 1). Each committee is headed by committee leaders and committee leaders are headed by network leaders. For achieving majority consensus, there is an odd number of network and committee leaders.
Table 1. Comparison with other hierarchical approaches.
Fig. 1. Network hierarchy and consensus protocols.
There are two levels of consensus: intra-committee consensus among all committee members of a committee and inter-committee consensus among all network leaders. The PBFT protocol and its variant are used at both levels of consensus. Committee leaders do not execute any consensus among themselves and this consensus is handled by the network leaders. Committee members (i.e., worker nodes) can also communicate with their selected peers in other committees using peer-to-peer links; this is required during the transaction commit phase.
3.1 Terminologies and Assumptions
OptiShard is based on a semi-private blockchain network consisting of N physical nodes. Each node maintains the complete blockchain ledger. The following discussion uses financial transactions for illustration purposes; however, the proposed approach is applicable to any other blockchain application that involves transaction processing. By assumption, network leaders are deployed by a trusted party in the semi-private domain (e.g., a governing authority in a financial institution) and hence they are non-malicious; however, the network leaders can be faulty (byzantine failure). The rest of the network nodes, including committee leaders, are any nodes from the untrusted (e.g., public) network and they can be malicious or faulty. The identities of network leaders are known to the entire network and any node can establish a point-to-point communication with a network leader. Every node participating in the mining process reserves some currency as a bet with the network leaders, which is called the reserved stake for the current epoch. A higher stake usually implies a higher probability that a node is non-malicious. PoS is used for leader selection of a committee, which also ensures a higher probability that a committee leader is non-malicious.
We assume an asynchronous environment where each node is represented as a state machine. A node can be either malicious or faulty. The fault model is byzantine. In addition to producing incorrect results, a malicious node may also engage in other undesired activities, for example: not participating in the protocols, or altering the contents or signatures in messages received from other nodes. For our discussion, the end results of transaction validation by a malicious node and a faulty node are similar; hence we will treat them equally and call them faulty nodes, unless otherwise stated explicitly. A good node is a non-faulty node. As per PBFT, the maximum number of allowable faulty nodes in a committee of $C_n$ nodes is $f_c$, where $f_c \le \frac{C_n - 1}{3}$. By assumption of the protocol, at most another $\frac{C_n - 1}{3}$ good nodes may be unreachable. So, the protocol requires a $\frac{C_n - 1}{3} + 1$ majority for reaching consensus within a committee. A committee is good if $f_c$ is within the allowable maximum; otherwise it is a faulty committee. A good committee reaches consensus within a definite time. Network leaders also need consensus among themselves, e.g., during inter-committee consensus, and so it is assumed that less than $\frac{1}{3}$ of them can be faulty, as required by PBFT.
The sharding scheme requires another input parameter: the value of "majority of good shards" for reaching a valid inter-committee consensus. The sharding scheme guarantees that the number of good shards is at least this predefined majority. In our following discussion, a 51% majority is considered. Committee members can communicate only with their assigned peers across other committees and not with members within the same committee.
A change of state event, say X → Y, is called an epoch. The network nodes need to go through a consensus for committing each epoch. Assuming a network of N worker nodes and only 1 committee in the extreme scenario, PBFT requires that at most $\frac{N - 1}{3}$ nodes can be faulty so that a valid consensus can be reached. Hence the requirement for
validly committing an epoch is that there is an upper bound $f = \frac{N - 1}{3}$ on the total number of faulty nodes in the network. Each epoch is timestamped with an increasing unique identifier generated and agreed upon by the network leaders. The same identifier is used by all the nodes to uniquely identify communication messages specific to an epoch. Hence every message contains an epoch id, and a good recipient node discards the message if the epoch id is smaller than that of the latest epoch the node has committed.
In evaluating a transaction T, each node applies a deterministic function $\beta$. $\beta(T)$ returns 0 or 1 depending on whether the transaction is invalid or valid, respectively. $\beta(T)$ on a malicious node or on a non-crash byzantine node produces inconsistent results, i.e., for a valid transaction it can return 0 or 1, and the same holds for an invalid transaction. $\beta(T)$ on a node that has stopped due to a crash (fail-stop failure) produces no output and hence the end result is the same as for an unreachable node. Transactions remain anonymous to nodes. It is assumed that from the instant a malicious or faulty node produces an incorrect result during an epoch, it will produce incorrect results for all the remaining transactions till the end of the epoch. Any good node can become faulty during epoch processing by producing a succession of incorrect results till the end of the epoch; however, the total number of faulty nodes remains within the upper bound f set at the start of the epoch.
In the presence of a key issuing authority, each physical node is assigned a unique (secure key, public key) pair as an identity required for message authentication. The originator of a message signs the message with its signature, generated from its secure key, the message content, and the epoch id. At a receiver node, authentication of a message M, $\phi(M)$, returns success only if the signature of the originator can be authenticated by its public key.
In the case of a consensus message, the message contains the result of the consensus and the signatures of all the consenting nodes. No adversary can alter the content of a message and regenerate the signatures unless it has access to the secure keys of the signing nodes and the originator. A good recipient node accepts the message M only if $\phi(M)$ returns success; otherwise it reports to the network leaders that the sender could be faulty.
3.2 Network Sharding
The blockchain network of N nodes, excluding the network leaders, is partitioned into C disjoint shards called committees, each of size $C_n = \frac{N}{C}$. Based on our previous discussion, PBFT requires a $\frac{C_n - 1}{3} + 1$ majority for reaching consensus within a committee, and hence for validity of consensus it requires that the number of faulty nodes within the committee satisfies $f_c \le \frac{C_n - 1}{3}$. However, for reaching inter-committee consensus at the network leaders' level (Subsect. 3.8), a 51% majority is required and hence at least 51% of the committees must be good in order to reach a valid consensus. Consider the presence of f faulty nodes in the network such that $f \le \frac{N - 1}{3}$. It has been established in one of our previous works [22] that the number of good committees remains a majority provided:

$$f < \left(\frac{C_n - 1}{3} + 1\right)\frac{C + 1}{2} \tag{1}$$
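A small numerical check can make inequality (1) easier to read. The Python sketch below tests, for a given number of worker nodes and faulty nodes, which candidate committee counts satisfy (1). It rests on assumptions not stated in the text: C is odd, N is divisible by C, and the fractions are evaluated with integer (floor) division; the function names are hypothetical.

```python
def max_faulty_per_committee(cn):
    """Largest integer f_c with f_c <= (cn - 1) / 3 (PBFT bound)."""
    return (cn - 1) // 3

def good_majority_guaranteed(n, c, f):
    """Evaluate inequality (1) for n worker nodes, c committees and f faulty nodes.

    Assumptions of this sketch: n is divisible by c, c is odd, and both
    fractions in (1) are taken with floor division.
    """
    cn = n // c
    faulty_needed_per_committee = max_faulty_per_committee(cn) + 1
    committees_needed_for_majority = (c + 1) // 2
    return f < faulty_needed_per_committee * committees_needed_for_majority

def allowable_committee_counts(n, f, candidates):
    """Candidate values of C that divide n and satisfy inequality (1)."""
    return [c for c in candidates
            if n % c == 0 and good_majority_guaranteed(n, c, f)]

# Hypothetical network: 105 worker nodes, at most 20 faulty nodes.
allowed = allowable_committee_counts(105, 20, [1, 3, 5, 7, 15, 21, 35])
```

In this toy setting the allowable values are not a contiguous range (7 and 35 committees fail while 15 and 21 pass), which suggests checking each candidate explicitly rather than assuming monotonic behaviour.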
Based on (1), the goal is to choose a C such that the number of good committees remains a majority. Another factor in choosing the proper value of C is performance, which is discussed in Subsect. 3.8. By combining (1) with the results of Subsect. 3.8, an optimal value of C can be determined.

As the network begins state transition, network leaders agree on a suitable value of C based on the previous discussion. Nodes are sorted in non-increasing order of their stake values and the topmost M nodes are nominated to become committee leaders, where M is a multiple of C such that each committee is headed by an odd number, greater than 1, of committee leaders. The remaining nodes are sharded into C disjoint committees. Nodes are assigned to committees in a round robin fashion. Committee members and leaders of each committee are sorted based on their stake values, and nodes across the committees at the same rank in the sorted order are designated as peers of each other. Thus, each member node is associated with at least C − 1 peers, at least one peer from each other committee. This is to enable peer-to-peer inter-committee communication prior to committing the final block during the commit phase.

Network leaders follow PBFT among themselves for reaching consensus on the committee details discussed in the previous paragraph and the epoch id. Once consensus is reached, each network leader sends the following information to each committee leader: committee member details for its assigned committee, epoch id, and peer-to-peer links of each committee member. Subsequently, the epoch id and peer details are forwarded to member nodes by the respective committee leaders. The previous discussion is summarized in the following steps:

Network Sharding Protocol
1. Each node registers its stake value with the network leaders.
2. Each network leader performs the following steps:
   2.1. Determine C (based on Eqs. (1) and (3))
   2.2. Sort all nodes in non-increasing order of stake values
   2.3. Nominate a multiple of C top ranked nodes as committee leaders
   2.4. Shard remaining nodes into C committees
   2.5. Assign an odd number of committee leaders to a committee
   2.6. For each node, set at least C − 1 peers across the committees
   2.7. Generate epoch identity
   2.8. Follow PBFT with other network leaders to reach consensus on the results of steps 2.1 to 2.7
   2.9. Multicast to each committee leader: committee member details, epoch id, and peer information for each committee member
3. Each committee leader multicasts the epoch id and peer details to its committee members.
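Steps 2.2 to 2.6 of the network sharding protocol can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Committee` class, the function name, the tie-breaking by node id, and the round-robin distribution of leaders are all assumptions. When the members do not divide evenly among committees, the lowest ranks in this sketch end up with fewer than C − 1 peers.

```python
from dataclasses import dataclass, field


@dataclass
class Committee:
    leaders: list = field(default_factory=list)
    members: list = field(default_factory=list)


def shard_network(stakes, n_committees, leaders_per_committee=3):
    """stakes maps node id -> stake value. Returns (committees, peers)."""
    if leaders_per_committee < 3 or leaders_per_committee % 2 == 0:
        raise ValueError("each committee needs an odd number (> 1) of leaders")
    # Step 2.2: non-increasing stake order; ties broken by node id (assumption).
    ranked = sorted(stakes, key=lambda node: (-stakes[node], node))
    n_leaders = n_committees * leaders_per_committee     # M, a multiple of C
    leaders, members = ranked[:n_leaders], ranked[n_leaders:]
    committees = [Committee() for _ in range(n_committees)]
    # Steps 2.3 to 2.5: dealing nodes out round robin keeps every
    # per-committee list sorted by stake.
    for i, node in enumerate(leaders):
        committees[i % n_committees].leaders.append(node)
    for i, node in enumerate(members):
        committees[i % n_committees].members.append(node)
    # Step 2.6: nodes at the same rank in different committees are peers.
    peers = {}
    for role in ("leaders", "members"):
        for c_idx, committee in enumerate(committees):
            for rank, node in enumerate(getattr(committee, role)):
                peers[node] = [
                    getattr(other, role)[rank]
                    for o_idx, other in enumerate(committees)
                    if o_idx != c_idx and rank < len(getattr(other, role))
                ]
    return committees, peers
```

With 15 nodes, 3 committees, and 3 leaders per committee, the nine highest-stake nodes become leaders and each of the six remaining members is linked to two peers, one in each other committee.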
OptiShard: An Optimized and Secured Hierarchical Blockchain Architecture
3.3 Workload Sharding
Every transaction gets a timestamp when it is generated. Users can submit their transactions to any node in the network. Every node forwards its transactions to network leaders, who are responsible for deciding which transactions are processed in each epoch. A transaction, when submitted to network leaders, remains in their pending transaction pool until processed. Network leaders generate a unique transaction id for each transaction by appending its timestamp value to the user id and then linearize all pending transactions in non-decreasing order of transaction ids. Epoch workload (W) is decided by considering all pending transactions up to a predefined limit. Epoch workload is divided into C transaction shards in such a way that a predefined small fraction of transactions is common across all the shards. All the remaining transactions are distributed among the C transaction shards in a round robin fashion so that each transaction is mapped to exactly one shard. Transactions common to all the shards are called overlapped transactions; these are used to identify any faulty shards (Subsect. 3.5). All transactions of the same user are batched together in one shard to avoid any double spend attack [4]. Network leaders assign each transaction shard to exactly one committee and multicast the shard details to the respective committee leaders. Each committee leader subsequently broadcasts its transaction shard details to the rest of the committee members. Figure 2 shows a workload of nine transactions generated by four different users, distributed among three transaction shards. Transactions T1 and T2 are overlapped transactions and are therefore common across all the transaction shards. The remaining transactions are distributed in round robin fashion such that all the transactions of the same user are assigned to a single shard.
Fig. 2. Workload distribution for C = 3; transactions T1 and T2 are overlapped transactions.
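The distribution described above can be sketched in a few lines. This is only an illustration under stated assumptions: a transaction id is modelled as the tuple (user id, timestamp), the first `n_overlapped` transactions in linearized order are taken as the overlapped ones, and users (rather than individual transactions) are dealt out round robin so that each user's transactions stay in one shard. The paper does not fix any of these details, and the function name is hypothetical.

```python
def shard_workload(transactions, n_shards, n_overlapped):
    """transactions: list of (user_id, timestamp) pairs.

    Returns (shards, overlapped); only network leaders would keep `overlapped`.
    """
    # Linearize by transaction id, modelled here as (user_id, timestamp).
    ordered = sorted(transactions)
    overlapped, remaining = ordered[:n_overlapped], ordered[n_overlapped:]
    # Every shard starts with a copy of the overlapped transactions.
    shards = [list(overlapped) for _ in range(n_shards)]
    shard_of_user = {}
    for tx in remaining:
        user = tx[0]
        if user not in shard_of_user:
            # Assign each newly seen user to the next shard in turn.
            shard_of_user[user] = len(shard_of_user) % n_shards
        shards[shard_of_user[user]].append(tx)
    return shards, overlapped
```

Committee leaders and members receive only their shard, so they cannot tell which entries are overlapped.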
By requirement, committee leaders and committee members have no way of identifying overlapped versus non-overlapped transactions; this information is solely maintained by the network leaders. Following is the summary:
Workload Sharding Protocol
4. Each node sends its pending transactions to network leaders
5. Network leaders perform the following steps through consensus:
   5.1. Assign a transaction id to each transaction
   5.2. Linearize all pending transactions
   5.3. Decide on a predefined number of overlapped transactions common across all transaction shards
   5.4. Distribute the remaining transactions among C shards in a round robin fashion so that each transaction goes to exactly one shard
   5.5. Assign each transaction shard to exactly one committee
   5.6. Multicast transaction shard details to respective committee leaders
6. Each committee leader broadcasts the message to committee members.

3.4 Intra-committee Consensus
In the original PBFT protocol, the message complexity of peer-to-peer messaging is $O(C_n^2)$ in a committee of size $C_n$, which is quite expensive from performance and scalability perspectives. Hence, for performance reasons, a centralized variant of the PBFT protocol is used for intra-committee consensus. Each committee member and committee leader applies the deterministic function β (discussed in Subsect. 3.1) to each of its assigned transactions, sorts its list of valid transactions in non-decreasing order of transaction ids, and calculates the Merkle root based on the sorted list. A committee member sends only the Merkle root to the committee leaders. Note that valid transactions are sorted in non-decreasing order of transaction ids to generate a consistent value of the Merkle root across all the good committee members. A committee leader accepts the Merkle root sent by a committee member only if it matches the Merkle root computed by itself. When a committee leader has accepted the Merkle root from at least $\frac{C_n - 1}{3} + 1$ committee members, it marks consensus as reached and sends the transaction ids of valid transactions and the agreed Merkle root to network leaders. Otherwise, the committee leader considers consensus not reached and reports this to network leaders. It is important to note that all the consenting nodes must have generated the same validation result, because the Merkle root value would differ even if a single transaction in a validation result mismatched. Recall that a good committee can contain up to a maximum of $\frac{C_n - 1}{3}$ faulty nodes. Hence a majority consensus in a good committee assures that it is the correct consensus. Moreover, a committee leader is chosen based on its higher stake value. Hence it is highly likely that a committee leader is not malicious, based on the trust model followed by PoS. Therefore, committee leaders also validate the transactions and compare their own Merkle root with those of the committee members.
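The leader-side check described above can be sketched as follows. The Merkle tree layout (SHA-256, last hash duplicated on odd levels), the string form of transaction ids, the function names, and the use of floor division for the threshold are assumptions made for this example; the paper does not specify them.

```python
import hashlib


def merkle_root(tx_ids):
    """Merkle root over transaction ids sorted in non-decreasing order."""
    level = [hashlib.sha256(tx.encode()).digest() for tx in sorted(tx_ids)]
    if not level:
        return hashlib.sha256(b"").hexdigest()
    while len(level) > 1:
        if len(level) % 2:                # duplicate the last hash on odd levels
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0].hex()


def leader_consensus(own_valid_ids, member_roots, committee_size):
    """Return (consensus reached?, leader's own Merkle root)."""
    own_root = merkle_root(own_valid_ids)
    threshold = (committee_size - 1) // 3 + 1
    # A member's root is accepted only if it equals the leader's own root.
    matching = sum(1 for root in member_roots if root == own_root)
    return matching >= threshold, own_root
```

Because the ids are sorted before hashing, two good nodes that validated the same transactions in a different order still produce the same root.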
The message complexity of the centralized consensus protocol is $O(C_n)$ in a committee of size $C_n$. Additionally, all committees can work in parallel to achieve intra-committee consensus. Following is the summary:
Intra-committee Consensus Protocol
7. Each committee member and each committee leader perform the following steps:
   7.1. Sort all transactions in non-decreasing order of transaction ids to linearize the transactions
   7.2. Apply β to validate each transaction (Subsect. 3.1)
   7.3. Calculate the Merkle root on the sorted list of valid transactions.
8. Each committee member sends its Merkle root to each committee leader
9. Each committee leader performs the following steps:
   9.1. If at least $\frac{C_n - 1}{3} + 1$ identical Merkle roots are received from committee members that match its own Merkle root then
      9.1.1. Mark consensus successful and send the following information to each network leader in a consensus message: transaction ids of valid transactions, agreed Merkle root
   9.2. If consensus is not reached or timed out then
      9.2.1. Consider intra-committee consensus unsuccessful and report to each network leader in a message containing the received Merkle roots and its own Merkle root.

3.5 Inter-committee Consensus
Each network leader receives consensus messages from all committee leaders. A network leader accepts a consensus message M only if φ(M) returns success (Subsect. 3.1). Once the message is authenticated, the network leader also validates the consensus result by recalculating the Merkle root against the transaction ids. These steps are required to rule out any faulty committee leaders. A committee leader is marked as faulty if its message cannot be successfully authenticated or validated. Finally, a network leader accepts the consensus result from a committee only if there is at least one successfully authenticated and validated consensus message from a committee leader of the committee. If all committee leaders of a committee are marked faulty, then the entire committee is also marked as faulty.

Subsequently, the network leader checks the validation results of all the overlapped transactions. Recall that overlapped transactions are common to all the committees. The overlapped transactions are used to identify any faulty committees as follows: considering that at least 51% of the committees are good (Eq. (1)), a 51% majority consensus among committees is required on the result of each overlapped transaction. If a committee does not agree with the majority consensus on an overlapped transaction then the committee is marked as faulty. Each network leader prepares the list of the Merkle roots from all the good committees. Subsequently, the network leaders execute the PBFT protocol among themselves for reaching consensus on the results of the Merkle roots and the good committees. If a consensus is not reached, then the results from the entire epoch are discarded; otherwise the agreed list of Merkle roots of all the good committees is broadcast to all good committee leaders. This information is subsequently forwarded to committee members by the respective committee leaders.
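The comparison of overlapped transactions can be sketched as below. The data layout (a dictionary from committee id to its validation results for the overlapped transactions), the function name, and the handling of ties are assumptions for this illustration; it also assumes every committee reports a result for every overlapped transaction.

```python
from collections import Counter


def find_faulty_committees(overlap_results, majority=0.51):
    """overlap_results: committee id -> {overlapped tx id -> validation result}.

    Returns (set of faulty committee ids, epoch aborted?).
    """
    n_committees = len(overlap_results)
    faulty = set()
    tx_ids = set().union(*(res.keys() for res in overlap_results.values()))
    for tx in tx_ids:
        votes = Counter(res.get(tx) for res in overlap_results.values())
        result, count = votes.most_common(1)[0]
        if count < majority * n_committees:
            # No 51% majority on this overlapped transaction: abort the epoch.
            return set(), True
        # Committees that disagree with the majority result are marked faulty.
        faulty |= {c for c, res in overlap_results.items() if res.get(tx) != result}
    return faulty, False
```

In the situation of Fig. 3, where one of three committees disagrees on an overlapped transaction, this returns that committee as faulty and does not abort the epoch.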
All transaction results from a faulty committee are discarded and these discarded transactions are processed in the subsequent epochs (Fig. 3). In the presence of a faulty committee, the network needs to be resharded at the start of the next epoch based on the network sharding protocol. The previous discussion is summarized in the following:
Fig. 3. (a) Intra-committee consensus result for C = 3. (b) Comparison of overlapped transactions during inter-committee consensus. (c) Discard transaction results of committee no 3.
Inter-committee Consensus Protocol
10. Each network leader performs the following steps:
    10.1. Receive intra-committee consensus messages from committee leaders
    10.2. Authenticate each message and validate the consensus result by recalculating the Merkle root against the transaction ids
    10.3. Mark a committee leader as faulty if (i) the consensus message from the committee leader cannot be authenticated or (ii) the Merkle root in the message cannot be validated or (iii) no message is received (timed out)
    10.4. If there is a valid intra-committee consensus then
        10.4.1. Accept the consensus result from the committee leader
        10.4.2. Else, mark a committee as faulty if all its committee leaders are marked as faulty or a consensus is not reached for the committee
    10.5. Compare the validation result of each overlapped transaction among all committees and consider a minimum 51% majority as the valid consensus on the result of an overlapped transaction.
    10.6. If a majority consensus cannot be reached on an overlapped transaction then
        10.6.1. Mark the epoch as Aborted
        10.6.2. Else
            10.6.2.1. Mark a committee as faulty if its result of an overlapped transaction does not match the majority consensus in step 10.5.
            10.6.2.2. Prepare the list of the Merkle roots from all the good committees.
    10.7. Follow the PBFT protocol among all network leaders on the results of step 10.6 (i.e., step 10.6.1 or 10.6.2)
    10.8. If a consensus is not reached or a consensus is reached on aborting the epoch then
        10.8.1. Abort the current epoch and discard the results of all transactions. These transactions will be processed in subsequent epochs.
        10.8.2. Else
            10.8.2.1. Discard all transaction results assigned to faulty committees. These transactions will be processed in subsequent epochs.
            10.8.2.2. Broadcast the agreed list of (Merkle root, committee id) to committee leaders of all committees.
11. Each committee leader forwards the message from step 10.8.2.2 above to the committee members

3.6 Commit Epoch
As transactions are sharded among committees, each committee member possesses only its own list of transactions. For committing the final block(s) into its ledger, it needs all transactions from all the committees that are validated during the epoch. However, a message containing all the valid transactions can be quite large and hence, for performance reasons, network leaders send only the Merkle roots from the good committees to the committee members (step 11 in the inter-committee consensus protocol). Each committee member subsequently receives the list of all valid transactions from all its peers (a peer is defined in Subsect. 3.2) via peer-to-peer exchange and validates the Merkle root computed from each peer's list against the Merkle root received from the network leaders for that peer's committee. If the Merkle root cannot be validated or the list of valid transactions is not received from a peer, then a committee member can request this information from the committee leaders, or directly from network leaders in the event of unresponsive or suspected faulty committee leaders.

Network leaders need not wait for all good committees to acknowledge before committing the epoch; the epoch can be committed by the network leaders on a timed-out event. Hence, committing of the epoch by the committee members can proceed lazily for performance reasons. This is safe because the network leaders already have the valid ledger. Moreover, the parallelism involved in peer-to-peer exchange can relieve the inefficiency of one-to-all broadcast of all valid transactions by a network leader to all committee members. Any node that does not have the up-to-date ledger can request and receive the required data from network leaders. The steps of the protocol are summarized in the following:

Lazy Epoch-Commit Protocol
12. Each committee member performs the following steps:
    12.1. Receive the list of Merkle roots from at least one committee leader (refer to step 11 in the inter-committee consensus protocol)
    12.1. Exchange its own list of valid transactions peer-to-peer with all its peers. Recall that there is at least one peer from each other committee (Subsect. 3.2)
    12.2. Repeat steps 12.2.1 to 12.2.3 until lists of valid transactions are received from all the peers from good committees
        12.2.1. Calculate the Merkle root of transactions received from a peer and compare it with the one approved by network leaders (received in step 12.1). If the Merkle root matches, then
            12.2.1.1. Consider those transactions as candidates to commit. Go back to step 12.2.
        12.2.2. If the Merkle root does not match or the list of valid transactions is not received from a peer then ask committee leader(s) to share valid transactions for the specific committee(s)
        12.2.3. If none of the committee leaders responded within a predefined timeout or if the received Merkle root(s) could not be validated, then report to network leaders and collect details of all valid transactions for the specific committee(s) from network leaders.
    12.3. After receiving transactions from all good committees, sort all valid transactions, including its own, in increasing order of transaction ids
    12.4. Prepare block(s), each consisting of a predefined maximum number of transactions, chosen from the sorted sequence
    12.5. Commit transactions by appending block(s) to the blockchain
    12.6. Acknowledge to committee leader(s)
13. Each committee leader acknowledges to network leaders after receiving commit acknowledgements from at least $\frac{C_n - 1}{3} + 1$ committee members
14. Network leaders mark the epoch as committed upon receiving at least one valid acknowledgement message from each good committee or on a predefined timeout, whichever comes first.

Figure 4 illustrates the communication diagram of an epoch's processing. Messages m1–m8 are explained as follows:
Fig. 4. Communication diagram of an epoch processing.
m1: Committee details, Epoch identity, Peer-to-Peer links, Transactions
m2: Epoch identity, Peer-to-Peer links, Transactions
m3: Merkle root of approved transactions
m4: Merkle root approved by Intra-committee consensus
m5: A list of Merkle roots approved by Inter-committee consensus
m6: Approved transactions
m7–8: Epoch commit acknowledgement

Figure 5 summarizes the sequence of the previous protocols using a flow chart.
Fig. 5. Flow diagram of an epoch processing.
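The member-side part of the lazy epoch-commit protocol (steps 12.2 to 12.4) can be sketched as follows. This is an illustration under assumptions, not the authors' code: the Merkle tree layout, the dictionary shapes, and the function names are hypothetical, and a set is used so that overlapped transactions, which appear in every committee's list, are committed only once.

```python
import hashlib


def merkle_root(tx_ids):
    """SHA-256 Merkle root over sorted ids (tree layout is an assumption)."""
    level = [hashlib.sha256(t.encode()).digest() for t in sorted(tx_ids)]
    if not level:
        return hashlib.sha256(b"").hexdigest()
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0].hex()


def prepare_blocks(own_ids, peer_lists, approved_roots, block_size):
    """peer_lists: committee id -> tx ids received from that committee's peer.
    approved_roots: committee id -> Merkle root approved by network leaders
    (other good committees only). Returns (blocks, committees to re-request).
    """
    candidates = set(own_ids)
    unresolved = []
    for committee, root in approved_roots.items():
        received = peer_lists.get(committee)
        if received is not None and merkle_root(received) == root:
            candidates.update(received)      # step 12.2.1.1: candidates to commit
        else:
            unresolved.append(committee)     # steps 12.2.2 and 12.2.3: ask leaders
    if unresolved:
        return [], unresolved
    ordered = sorted(candidates)             # step 12.3
    blocks = [ordered[i:i + block_size]      # step 12.4
              for i in range(0, len(ordered), block_size)]
    return blocks, []
```

A caller would retry the committees listed in the second return value through committee leaders, and then network leaders, before building blocks.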
3.7 Correctness Proof
Lemma: A faulty committee that produces an incorrect consensus result is identified either during intra- or inter-committee consensus and all its transactions are discarded.

Proof: A committee can be faulty due to the following reasons:

Case 1: All its committee leaders are marked as faulty. Network leaders mark a committee leader as faulty if it either sends an invalid intra-committee consensus result or is unresponsive due to a crash or network failure (line 10.3 of protocol). Note that network leaders validate the consensus result by checking the Merkle root and signatures of the consenting nodes (line 10.2 of protocol) and hence can determine for certain if the consensus result has been altered. If all committee leaders of a committee are marked as faulty then the committee is marked as faulty (line 10.4.2 of protocol) and its transactions are discarded (line 10.8.2.1).

Case 2: At least $\frac{C_n - 1}{3} + 1$ committee members are faulty.

Case 2.1. At least $\frac{C_n - 1}{3} + 1$ of these faulty committee members produce incorrect transaction results from the start of the epoch. By assumption (Subsect. 3.1), once a faulty member produces an incorrect result, it will produce incorrect results for all remaining transactions for the rest of the epoch. As a result, there are three possibilities: (1) An intra-committee consensus is not reached. In that case, network leaders are informed (line 9.2.1 of protocol) and eventually all its transactions are discarded (lines 10.4.2 and 10.8.1 of protocol). (2) Intra-committee consensus is reached with the correct consensus result. In that case, the consensus result is equivalent to that of a good committee and its transaction results are accepted. (3) An incorrect intra-committee consensus is reached with a majority of faulty nodes as consenting members. In that case, the committee will be identified as a faulty committee during inter-committee consensus due to its disagreement on the overlapped transactions with the majority of good committees (lines 10.5–10.6 of the protocol).

Case 2.2. At most $\frac{C_n - 1}{3}$ of the faulty members produce incorrect results from the start of the epoch. This case can be treated the same way as case 2.3 below.

Case 2.3. The committee was a good committee to start with. However, after processing at least one transaction, it turns faulty due to some of the good nodes turning faulty. In the worst case scenario, if at least $\frac{C_n - 1}{3} + 1$ good nodes turn faulty exactly after processing the same number of transactions, which include all the overlapped transactions, then it is possible that an incorrect consensus is reached which can go undetected through the overlapped transactions because the consensus produced correct results for all the overlapped transactions. In any other scenario, one of the three possibilities of case 2.1 arises and can be handled in a similar way to case 2.1. This concludes the proof. ∎

3.8 Theoretical Performance Analysis
Performance of an epoch's processing is dominated by the following: workload sharding (Subsect. 3.3), transaction validation, intra-committee consensus (Subsect. 3.4), and inter-committee peer-to-peer exchange prior to committing an epoch (Subsect. 3.6). Workload W is divided among C committees and committees work in parallel in validating transactions; so, transaction validation time is $O(W/C)$, which is the average time to process transactions in a committee. Intra-committee communication and the centralized consensus protocol take time linear in the committee size, i.e., $O(C_n)$. Time complexity for peer-to-peer exchange among nodes is $O(C^2)$, which dominates over the $O(C)$ time required for communication among network and committee leaders, and the time required for inter-committee consensus.
Based on the previous discussion, the estimated theoretical runtime can be written as:

$$F(C) = t_m C_n + t_g C^2 + t_t \frac{W}{C} \qquad (2)$$

where $t_m$, $t_g$, and $t_t$ are constants. Since $C_n = N/C$, we can rewrite Eq. (2) as a function of the total number of committees, C:

$$F(C) = t_m \frac{N}{C} + t_g C^2 + t_t \frac{W}{C} \qquad (3)$$

The equation in (3) gives an inverted bell curve, and the C value for the optimal performance is given at the point where the gradient $\frac{dF(C)}{dC} = 0$. Combining the results of Eqs. (3) and (1), an optimal sharding scheme can be designed by choosing a suitable value of C from two different perspectives: performance and reliability. In comparison, in the traditional monolithic blockchain architecture, the most dominating time is due to peer-to-peer validation among the nodes, which amounts to $O(N^2)$. When compared to (3), the hierarchy shows a clear performance advantage over the monolithic architecture because the number of committees C is significantly smaller than N.
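To make the optimum concrete: setting the derivative of (3) to zero gives $-(t_m N + t_t W)/C^2 + 2 t_g C = 0$, i.e. $C^* = \left(\frac{t_m N + t_t W}{2 t_g}\right)^{1/3}$. The sketch below evaluates this and rounds to the better neighbouring integer. The function names are hypothetical, and any constants passed in are placeholders; the paper does not report values for $t_m$, $t_g$, or $t_t$.

```python
def runtime(c, n_nodes, workload, t_m, t_g, t_t):
    """Estimated runtime F(C) from Eq. (3)."""
    return t_m * n_nodes / c + t_g * c ** 2 + t_t * workload / c


def optimal_committees(n_nodes, workload, t_m, t_g, t_t):
    """Integer C minimizing Eq. (3), from the stationary point dF/dC = 0."""
    c_star = ((t_m * n_nodes + t_t * workload) / (2 * t_g)) ** (1 / 3)
    # F is convex for C > 0, so the integer optimum is one of the two
    # integers surrounding the real-valued stationary point.
    lower = max(1, int(c_star))
    return min((lower, lower + 1),
               key=lambda c: runtime(c, n_nodes, workload, t_m, t_g, t_t))
```

With all three constants set to 1 purely for illustration, N = 800 and W = 98600 give an optimum of 37 committees; real constants would have to be fitted to measurements.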
4 Experimental Evaluations

We used up to 800 Amazon Web Services (AWS) EC2 micro instances with 1 GB RAM and 1 CPU core, across different regions, to execute our test scenarios. We wrote our code in Scala, chosen as a scalable JVM-based language that supports heterogeneous platforms. For intra-committee consensus, a variant of PBFT was used as explained earlier. We used individual MySQL instances on every node for structured storage. Instead of creating random dummy transactions at runtime, we kept a set of pre-populated transactions for valid comparison across different runs. Every node follows peer-to-peer communication to exchange information with its peers across different committees. Every message is signed with the private key of the sender and verified with its public key at the receiving end. We used the Java BouncyCastle implementation of SHA256 with the ECDSA signature algorithm for simplicity. Each node is assigned a private-public key pair. Participating nodes exchange their public keys during initial setup.

The following experiment is to validate the theoretical result in Eq. (3). Figure 6 shows the runtime with the number of committees increasing from 3 to 99, keeping the number of transactions constant at 98.6 K. As demonstrated by the experiment, the given configuration offers optimal performance with 30 committees. The shape of the runtime graph is in accordance with the theoretical analysis given by Eq. (3) and exhibits the impact of sharding the network.
Fig. 6. Runtime versus number of committees.
5 Conclusions

OptiShard is a hierarchical blockchain architecture which addresses the issues of performance, security, and fault tolerance in the traditional blockchain architecture. The hierarchy arises from the division of the network into multiple disjoint shards called committees. Though hierarchy can improve performance and presumably scalability, it can compromise the correctness of transaction validation because of the possible presence of both malicious and faulty nodes. The model of OptiShard focuses on this balancing act. The theoretical model of OptiShard guarantees a majority of good committees. Most of the workload (transactions) is distributed among the committees so that there is a one-to-one mapping from a transaction to a committee; the rest are overlapped (i.e., one-to-all mapping) with the objective of identifying faulty committees. Experimental results conform to the theoretical analysis and demonstrate the impact of sharding in OptiShard. The correctness of the discussed protocols, addressed in the Lemma in Subsect. 3.7, is based on certain assumptions in the current fault model. These assumptions could be removed by having a more restricted model where either the committee leader(s) are more trustworthy or network leaders do additional validations. However, the restrictions would either make the model more centralized or add extra performance overhead. So there is a balancing act, and these issues are being addressed in our ongoing research.
References
1. Xu, X., et al.: A taxonomy of blockchain-based systems for architecture design. In: IEEE ICSA, Gothenburg, Sweden (2017)
2. Difference between Public and Private Blockchain. https://www.ibm.com/blogs/blockchain/2017/05/the-difference-between-public-and-private-blockchain/. Accessed 10 June 2020
3. Merkle, R.C.: A digital signature based on a conventional encryption function. In: CRYPTO 1987, vol. 293, pp. 369–378. Springer, Heidelberg (1987)
4. Nakamoto, S.: Bitcoin: A peer-to-peer electronic cash system. Bitcoin.org (2009)
5. 51% attack. https://bitcoin.org/en/glossary/51-percent-attack. Accessed 10 June 2020
6. Douceur, J.: The sybil attack. In: The First International Workshop on Peer-to-Peer Systems, vol. 2429, pp. 251–260 (2002)
7. Croman, K., et al.: On scaling decentralized blockchains (a position paper). In: Financial Cryptography and Data Security, vol. 9604, pp. 106–125 (2016)
8. Yli-Huumo, J., Ko, D., Choi, S., Park, S., Smolander, K.: Where is current research on blockchain technology? A systematic review. PLOS ONE (2016)
9. King, S., Nadal, S.: PPCoin: Peer-to-Peer Crypto-Currency with Proof-of-Stake. Self-published paper (2012)
10. Delegated Proof of Stake. https://en.bitcoinwiki.org/wiki/DPoS. Accessed 10 June 2020
11. Proof of burn. https://en.bitcoin.it/wiki/Proof_of_burn. Accessed 10 June 2020
12. Borge, M., Kokoris-Kogias, E., Jovanovic, P.: Proof-of-personhood: redemocratizing permissionless cryptocurrencies. In: 2017 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW), Paris, France (2017)
13. Castro, M., Liskov, B.: Practical byzantine fault tolerance. In: The Third Symposium on Operating Systems Design and Implementation, New Orleans, USA (1999)
14. Danezis, G., Meiklejohn, S.: Centrally banked cryptocurrencies. In: NDSS, San Diego, CA, USA (2016)
15. Al-Bassam, M., Sonnino, A., Bano, S., Hrycyszyn, D., Danezis, G.: ChainSpace: a sharded smart contracts platform. arXiv:1708.03778 (2017)
16. Luu, L., Narayanan, V., Zheng, C., Baweja, K., Gilbert, S., Saxena, P.: A secure sharding protocol for open blockchains. In: CCS'16, Vienna, Austria (2016)
17. Eleftherios, K., Philipp, J., Linus, G., Nicolas, G., Ewa, S., Bryan, F.: OmniLedger: a secure, scale-out, decentralized ledger via sharding. In: 2018 IEEE Symposium on Security and Privacy, San Francisco, USA (2018)
18. Zamani, M., Movahedi, M., Raykova, M.: RapidChain: scaling blockchain via full sharding. In: The 2018 ACM SIGSAC Conference on Computer and Communications Security, New York, NY, USA (2018)
19. Wang, J., Wang, H.: Monoxide: scale out blockchains with asynchronous consensus zones. In: 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2019), Boston, MA, USA (2019)
20. Syta, E., et al.: Scalable bias-resistant distributed randomness. In: 2017 IEEE Symposium on Security and Privacy, San Jose, CA, USA (2017)
21. Micali, S., Rabin, M., Vadhan, S.: Verifiable random functions. In: 40th Annual Symposium on Foundations of Computer Science, New York City, NY, USA (1999)
22. Kantesariya, S., Goswami, D.: Determining optimal shard size in a hierarchical blockchain architecture. In: IEEE International Conference on Blockchain and Cryptocurrency (2020)
Qute: Query by Text Search for Time Series Data

Shima Imani, Sara Alaee, and Eamonn Keogh
Department of Computer Science and Engineering, UC Riverside, Riverside, CA, USA
{siman003,salae001}@ucr.edu, [email protected]

Abstract. Query-based similarity search is a useful exploratory tool that has been used in many areas such as music, economics, and biology to find common patterns and behaviors. Existing query-based search systems allow users to search large time series collections, but these systems are not very robust and they often fail to find similar patterns. In this work, we present Qute (Query by Text), a natural language search framework for finding similar patterns in time series. We show that Qute is expressive while having very small space and time overhead. Qute is a text-based search which leverages information retrieval features such as relevance feedback. Furthermore, Qute subsumes motif and discord/anomaly discovery. We demonstrate the utility of Qute with case studies on both animal behavior and human behavior data.

Keywords: Time series · Similarity search · Time series text search · Pattern matching
© Springer Nature Switzerland AG 2021. K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 412–427, 2021. https://doi.org/10.1007/978-3-030-63089-8_27

1 Introduction

Searching is a complex task in many cases and applications, both on the Web and in domain-specific collections [1]. Because of the ubiquity and importance of time series, there now exist tools to search a data set to find similar behaviors or events in different data sets. A data analyst may ask: Is there any instance of this behavior or event within my dataset? [2–4]. She may use a query-based search system to answer this question. Query-by-Example (QbE) and Query-by-Sketching (QbS) are popular query-based search methods [5, 6]. Query-by-Example extracts an example from one time series and searches another time series for patterns which are similar to the example. In Query-by-Sketching, a user draws an example and searches the time series for sections which closely match their sketch. These systems are often not intuitive, expecting the user to learn complex syntax and interfaces to create high-quality queries. Most query-by-content systems have limited expressiveness. There exist simple queries that existing query-by-content systems are unable to answer, regardless of the similarity measure used [7]. Often, simple patterns that can be expressed in informal English cannot be easily specified in existing query-by-content systems.

In this paper, we design a query-by-text framework for time series data. Our framework Qute (pronounced as 'ku:t', like the word cute) allows users to formulate their search queries using natural language. We will show that Qute can express any search that can be performed by Query-by-Example and Query-by-Sketching methods. This enables Qute to be more expressive than previous search algorithms for time series data.

Consider a data scientist examining a set of time series data [8] which tracks the inside and outside temperature, humidity, and sunlight data for a building. She might ask the following questions:
(Q1) What is the lowest temperature inside the building over 24 h?
(Q2) Have we ever seen a significant increase in humidity?
(Q3) What is the typical behavior of sunlight in one day?

The first question is easy to answer using existing systems, by finding the minimum value for temperature. However, existing systems cannot easily answer the second or third question. Qute can respond to each of these questions using the following queries:
(A1) INSIDE-TEMPERATURE lowest
(A2) HUMIDITY rising
(A3) SUNLIGHT typical

Figure 1 shows the result for the first two questions. The gray line shows the time series and the highlighted region shows the answer to the query. The blue highlight shows the lowest temperature in the time series over 24 h and the highlighted red region shows a significant increase in humidity.
Fig. 1. (Top) The time series of inside temperature. The highlighted blue region shows the lowest temperature inside the building over 24 h. (Bottom) The time series of humidity data. The highlighted red region shows a significant increase in humidity.
Qute responds to the third question by running SUNLIGHT typical. The typical day in this time series corresponds to the sunny day as shown in Fig. 2 highlighted in green.
Fig. 2. Time series corresponding to the sunlight of building. The highlighted green region shows the typical day of the time series.
In this work, we introduce Qute, a framework for searching time series using natural language. We organize the rest of the paper as follows: in the next section, we review related work. In Sect. 3, we introduce the necessary notation and definitions. Section 4 introduces the vocabulary for searching time series data. We explain our framework, Qute, in Sect. 5. In Sect. 6, we perform an empirical evaluation on diverse data. Section 7 shows that Qute subsumes Query-by-Example and Query-by-Sketching systems. In Sect. 8 we compare a shape-based classifier to a Qute-based classifier. Section 9 draws conclusions and suggests directions for future work.
2 Related Work
Several systems allow a user to search time series data, such as Query-by-Sketching (QbS) [6, 9] and Query-by-Example (QbE) [5, 10]. It is natural to expect that a natural language query mechanism might also be useful. Existing QbS and QbE methods can handle simple queries. QbS is only used for simple queries, because drawing a complex query is hard and “You can’t always sketch what you want” [7]. Also, the shape that we sketch must be similar to the pattern that exists in the time series. In QbE, the user finds a pattern in a time series and asks for similar patterns to be retrieved from different time series. Searching for an exact match is so brittle that it often fails to find the desired result. We will show examples of such failures in Sect. 8. Using natural language enables users to search the time series at a higher level of structure. Consider a user who is looking for the behavior “eating a grape” [11], which has a characteristic shape. If a similar shape exists in another time series, the high-level description in our system remains the same. We can describe it in Qute as:
constant followed-by rising followed-by falling followed-by constant
However, searching for this query using QbS or QbE fails to find the desired result, because of the existence of a spike in the search query and the lack of that spike in the similar patterns. Suppose a user wants to find a “noisy” subsequence. Drawing a “noisy” subsequence using QbS systems might be challenging. Moreover, Query-by-Sketching and Query-by-Example systems cannot retrieve patterns based on a combination of local and global features, for example, a pattern that is low (relative to the global data) and has a falling trend (a local feature).
3 Definitions and Notation
We begin by describing the necessary definitions and notation. The data type of interest is time series:
Definition 1 (Time series): A time series $T$ of length $n$ is a sequence of real-valued numbers $t_i$: $T = t_1, t_2, \ldots, t_n$.
Time series are often multidimensional:
Qute: Query by Text Search for Time Series Data
Definition 2 (MTS): A multidimensional time series $MTS$ consists of $k$ time series $T$ of length $n$, where $k \ge 2$: $MTS = T_1, T_2, T_3, \ldots, T_k$, where
$T_1 = t_{11}, t_{12}, t_{13}, \ldots, t_{1n}$
$T_2 = t_{21}, t_{22}, t_{23}, \ldots, t_{2n}$
...
$T_k = t_{k1}, t_{k2}, t_{k3}, \ldots, t_{kn}$
Two points $t_{ji}$ and $t_{ki}$ with $j \ne k$ co-occur at time $i$. Often, each dimension of the $MTS$ has a mnemonic name; for example, a Body Area Network might be:
BAN: ECG, Temperature, Glucose, O2Saturation
Where appropriate, we will use such mnemonic names in our examples. We are primarily interested in the behavior of local regions. A local region of a time series is called a subsequence:
Definition 3 (Subsequence): A subsequence $T_{i,m}$ of a time series $T$ is a continuous ordered subset of the values from $T$ of length $m$ starting from position $i$: $T_{i,m} = t_i, t_{i+1}, \ldots, t_{i+m-1}$, where $1 \le i \le n - m + 1$.
Definition 4 (Trivial matches): For a time series $T$ of length $n$ that contains a subsequence $T_{p,m}$, if $T_{p,m}$ has a high score under some scoring function, then its temporal neighbors also score high under the same scoring function. These high-scoring neighboring subsequences are called trivial matches.
We use the concept of an exclusion zone to eliminate these trivial matches when finding the top-k matches to a query [12].
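Definitions 3 and 4 can be illustrated with a minimal sketch. The function names and the width of the exclusion zone are assumptions made for this illustration; the paper only states that an exclusion zone is used [12], not how wide it is.

```python
def subsequence(t, i, m):
    """Return T_{i,m}: the m consecutive values of t starting at 1-based position i."""
    n = len(t)
    if not 1 <= i <= n - m + 1:
        raise ValueError("start position must satisfy 1 <= i <= n - m + 1")
    return t[i - 1 : i - 1 + m]


def is_trivial_match(p, q, exclusion):
    """Assumed rule: start positions closer than `exclusion` are trivial matches."""
    return abs(p - q) < exclusion


t = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
print(subsequence(t, 2, 3))         # values at positions 2, 3 and 4
print(is_trivial_match(10, 12, 5))  # neighbors two steps apart, zone of width 5
```

The 1-based indexing follows Definition 3, so the last valid start position is n − m + 1.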
4 Vocabulary
Our approach is to describe a collection of features (in NLP speak, “adjectives”) that can be used to create a mapping of a subsequence to a single value. The basic assumption of our proposed method is that for any given subsequence, it is possible to measure to what degree it possesses the various semantic features that a data analyst would use to describe it, such as “noisy”, “periodic” or “rising”. This “feature to word” task is not a completely solved problem; however, we can exploit the latest progress in this area [13, 14]. Without being exhaustive, we list and briefly explain a sample of words from our vocabulary, which consists of adjectives and conjunctions. We have created a website that contains all the formal definitions and code [15]. For any given adjective and time series, we create a meta-time series, which is a mapping between the time series and the corresponding adjective. These meta-time series are normalized to be between zero and one. For each feature in our vocabulary that comes in a natural pair, such as rising and falling
(X and Y), we define only one of them, X, and compute the other as Y = 1 − X. Our vocabulary contains local and global features. The local features are computed by looking only at the relevant subsequence, while the global features are computed using the whole time series.
Our local features are: rising, falling, concave, convex, linear, non-linear, constant, smooth, noisy, complex, simple, spiky, dropout, periodic, aperiodic, symmetric, asymmetric, step, no-step, high-amplitude, low-amplitude, high-volume and low-volume.
Our global features are: high, low, typical and unusual.
The website contains all the formal definitions and code for each feature [15]. Finally, a simple list of words would feel very unnatural and disjointed. We can fix this with some simple conjunctions: {followed-by, and}. The followed-by operator allows the user to query a sequence of feature words, such as rising followed-by dropout. The and operator joins a set of feature words. For example, a user searching for a spiky subsequence that also contains rising behavior can query: spiky and rising. Our vocabulary is currently handcrafted; however, thesauruses could be learned automatically from the web or from user interaction [16]. The reader may have ideas for additional words that could be added to our framework. However, as we will show, this vocabulary is expressive enough to allow us to find targeted patterns in diverse domains.
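The idea of a normalized meta-time series and its complement can be sketched as follows. The definition of rising used here (last value minus first value of each window, min-max normalized) is a stand-in chosen for illustration, not the authors' definition, which is given at [15]; all function names are hypothetical.

```python
def min_max_normalize(values):
    """Rescale values to [0, 1]; a constant input maps to 0.5 (an assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def rising_meta(t, m):
    """Illustrative 'rising' meta-time series: net change over each window of length m."""
    raw = [t[i + m - 1] - t[i] for i in range(len(t) - m + 1)]
    return min_max_normalize(raw)


def complement(meta):
    """For a natural pair (X, Y), compute Y as 1 - X."""
    return [1.0 - v for v in meta]


t = [0, 1, 2, 3, 2, 1, 0]
rising = rising_meta(t, 3)
falling = complement(rising)
print(rising)
print(falling)
```

Because each meta-time series lies between zero and one, the complement also does, so both members of a pair can be stored or derived on demand from a single index.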
5 Qute
We are now in a position to define the problem at hand.
Problem Definition: Given a time series subsequence, predict the words a human would use to describe it.
We assume that users are limited to the vocabulary and (informal) grammar explained in the previous section. To search a time series, Qute performs the following steps:
– Create a mapping: For each adjective in our vocabulary, we define a mathematical function that takes a time series and a subsequence length as input and outputs a “meta-time series” that represents that word for each subsequence. We call this meta-time
Fig. 3. (Top) 40 s of a seal time series. (Bottom) A meta-time series that represents the noisy feature. Note that the noisy feature lies between zero and one: high values correspond to noise in the raw time series, and zero corresponds to smooth regions of the time series.
series an index. Appendix 9.1 contains a worked example of defining a mapping between a word and a time series. Figure 3 shows an example of a mapping between a time series of seal behavior and the meta-time series corresponding to the word noisy, shown in red. As Fig. 3 shows, the seal time series is initially less noisy, then the noisiness increases, and then it decreases. Each feature in our vocabulary represents an intuitive property of a time series. For example, our vocabulary contains adjectives such as rising, complex and linear.
– Dictionary-set: One problem that we may encounter is the lack of agreement on the correct wording for the words in our vocabulary, which depends on the domain. For example, in the oil and gas community, flat regions are often described as low-amplitude. To solve this problem, we provide a dictionary of key-value synonym pairs for each word in our vocabulary, called the dictionary-set. For example, our dictionary-set contains the following pair:
low-amplitude: flat, level, constant, ...
For our temporal operator, the key-value synonym pairs are:
followed-by: then, next, succeeded-by
– Compute all the indexes: Given a multidimensional time series MTS and the subsequence length sub as inputs, we compute the indexes (meta-time series) corresponding to all the adjectives in our vocabulary, and we call the result the feature-set. This can be pre-computed and stored on disk (Algorithm 1).
– Query time: At run time, a user enters a search query. We begin by validating the user input, and then search for the query in the time series.
– Top-k-result: We find the best k non-overlapping matches to our query and show them to the user (Algorithm 2).
We describe these steps below in greater detail. The inputs to the algorithm Build-Mapping are a multidimensional time series MTS of size p × q and a user-defined subsequence length sub. The output of the algorithm is a p × (q − sub) array which is
normalized between zero and one. A mapping between a time series and a word in our vocabulary is stored in an array. The algorithm Build-Mapping is shown in Algorithm 1. First, an index array Idx is initialized to an empty array. Then, for each word in our feature-set, we compute the Index-feature, which is a mapping between the time series and that word. Next, we append each Index-feature to the index array Idx. In line 6, the algorithm returns the index array Idx. After building the index, we store it on disk.
Algorithm 1. Build-Mapping
Input: Multidimensional Time Series MTS, Subsequence Length sub
Output: Index Idx
1: Idx ← empty array
2: for feature in feature-set do
3:     Index-feature ← feature(MTS, sub)
4:     Idx ← append(Idx, Index-feature)
5: end for
6: return Idx
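Algorithm 1 can be sketched in a few lines. The two feature functions below (high-amplitude as the value range of a window, noisy as the summed absolute step size) are simplified placeholders rather than the definitions published at [15], and the dictionary keyed by (series name, word) is an assumed storage layout. The sketch produces len(series) − sub + 1 values per series, one per subsequence as in Definition 3.

```python
def normalize(values):
    """Min-max rescale to [0, 1]; a constant input maps to 0.5 (an assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def high_amplitude(series, sub):
    """Placeholder feature: value range inside each window."""
    n = len(series) - sub + 1
    return normalize([max(series[i:i + sub]) - min(series[i:i + sub]) for i in range(n)])


def noisy(series, sub):
    """Placeholder feature: sum of absolute successive differences in each window."""
    n = len(series) - sub + 1
    raw = [
        sum(abs(series[j + 1] - series[j]) for j in range(i, i + sub - 1))
        for i in range(n)
    ]
    return normalize(raw)


def build_mapping(mts, sub, feature_set):
    """Compute one meta-time series per (dimension, word) pair."""
    idx = {}
    for word, feature in feature_set.items():
        for name, series in mts.items():
            idx[(name, word)] = feature(series, sub)
    return idx


mts = {"TEMPERATURE": [1, 2, 4, 4, 3], "HUMIDITY": [5, 7, 5, 7, 9]}
feature_set = {"high-amplitude": high_amplitude, "noisy": noisy}
idx = build_mapping(mts, 3, feature_set)
print(idx[("HUMIDITY", "high-amplitude")])
print(idx[("TEMPERATURE", "noisy")])
```

Since the index depends only on the data and the subsequence length, it can be computed once and reused for every query, which matches the pre-computation step described above.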
At run time, the user submits a search query. The user input can contain errors or misspelled words. Since the algorithm needs a valid query, we parse the query and check its correctness. For ease of explanation, we initially ignore the followed-by term in the query. At query time, we break the sequence of characters into pieces called tokens. The query should contain the time series name and the corresponding features. In our system, the user writes the time series name in uppercase letters, followed by the corresponding features in lowercase letters. For example, we parse the query ECG periodic dropout VOLUME high into five tokens, with the uppercase tokens ECG and VOLUME indicating the time series names, and [periodic, dropout] and [high] being the corresponding features for each time series, respectively. If a token cannot be found in the feature-set, we use the dictionary-set to find the key for which it is a synonym. If the token does not exist in the dictionary-set either, we ignore it but echo a warning to the user. For example, if the user types beriodic instead of periodic, we echo a warning like: “feature beriodic does not exist.” Once the search query is ready, we compute the score for each subsequence in order to find the top-k-result. The score $score_j$ for each subsequence $j$ is the sum of all the indexes within the query:
$score_j = \sum_{i \in Query} Index_{ij}$
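The parsing and scoring steps above can be sketched as follows. The vocabulary, the synonym table, the toy index values and the wording of the first warning are illustrative assumptions; only the uppercase/lowercase convention, the synonym lookup, the “does not exist” warning and the summation come from the text.

```python
# Hypothetical vocabulary and synonym table, for illustration only.
FEATURE_SET = {"periodic", "dropout", "high", "low-amplitude"}
DICTIONARY_SET = {"low-amplitude": ["flat", "level", "constant"]}


def parse_query(query, feature_set, dictionary_set):
    """Split a query into (series, feature) pairs plus a list of warnings."""
    synonym_to_key = {
        synonym: key
        for key, synonyms in dictionary_set.items()
        for synonym in synonyms
    }
    pairs, warnings = [], []
    series = None
    for token in query.split():
        if token.isupper():              # uppercase token: time series name
            series = token
        elif series is None:             # feature before any series name
            warnings.append(f"feature {token} has no time series name.")
        elif token in feature_set:       # known feature word
            pairs.append((series, token))
        elif token in synonym_to_key:    # synonym: map it back to its key
            pairs.append((series, synonym_to_key[token]))
        else:                            # unknown word: skip it but warn
            warnings.append(f"feature {token} does not exist.")
    return pairs, warnings


def score(pairs, index):
    """score_j = sum of Index_ij over every (series, feature) pair i in the query."""
    columns = [index[pair] for pair in pairs]
    return [sum(values) for values in zip(*columns)]


pairs, warnings = parse_query(
    "ECG beriodic dropout VOLUME flat", FEATURE_SET, DICTIONARY_SET
)
# Toy index values; a real index would come from Algorithm 1.
toy_index = {
    ("ECG", "dropout"): [0.0, 0.5, 1.0],
    ("VOLUME", "low-amplitude"): [1.0, 0.5, 0.25],
}
scores = score(pairs, toy_index)
print(pairs, warnings, scores)
```

Checking the feature-set before the dictionary-set means that a word which is both a feature and a synonym of another key (such as constant) resolves to the feature itself.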
The algorithm Top-K-Results finds the best results for a query. Its inputs are a user-defined query Q and a number of matches k. Top-K-Results outputs k subsequences, corresponding to the best k matches to our query Q.
Top-K-Results is outlined in Algorithm 2. First, the algorithm initializes the array mapping-score to an empty array in line 1. In lines 2 to 4, we iterate over each word (feature) in the query Q and add the index for that specific feature to the mapping-score array. scoreSummation is the element-wise summation of mapping-score and stores the total score. After sorting scoreSummation in descending order, we select the k subsequences, top-k-result, from the time series while respecting the exclusion zone. Finally, the top-k-result (i.e., k subsequences) is returned.
Algorithm 2. Find the Top-K-Results
Input: The user-defined query Q, Number of Results k, Build-Mapping index, Subsequence Length sub, Multidimensional Time Series MTS
Output: Top-k subsequences top-k-result
1: mapping-score ← ∅, top-k-result ← ∅
2: for q in Q do
3:     mapping-score ← append(mapping-score, index(q))
4: end for
5: scoreSummation ← sum(mapping-score)    (element-wise summation)
6: id ← sort(scoreSummation, ‘descend’)
7: for i in range k do
8:     top-k-result ← append(top-k-result, MTS[id_i : id_i + sub])
9: end for
10: return top-k-result
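Algorithm 2 does not spell out how the exclusion zone is applied, so the sketch below shows one plausible reading: candidates are visited in descending score order and a candidate is skipped when it starts within `exclusion` positions of an already selected match. Defaulting the zone to the subsequence length (so that matches never overlap) is an assumption, as are the function and parameter names.

```python
def top_k_results(score_summation, k, sub, exclusion=None):
    """Return up to k (start, end) ranges with the highest scores,
    skipping candidates that fall inside the exclusion zone of a chosen match."""
    if exclusion is None:
        exclusion = sub  # assumed default: matches may not overlap
    order = sorted(
        range(len(score_summation)),
        key=lambda j: score_summation[j],
        reverse=True,
    )
    chosen = []
    for j in order:
        if all(abs(j - c) >= exclusion for c in chosen):
            chosen.append(j)
            if len(chosen) == k:
                break
    return [(j, j + sub) for j in chosen]


score_summation = [0.1, 0.9, 0.8, 0.2, 0.1, 0.7, 0.3]
print(top_k_results(score_summation, k=2, sub=3))
```

In this toy input, position 2 has the second-highest score but is a trivial match of position 1, so position 5 is returned instead.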
Now we demonstrate how Qute handles the followed-by case. If the query involves any followed-by_value terms, the index-feature is truncated by the user-defined amount value, and we pad zeros to the end of the index-feature in order to maintain its size. If the user does not provide a value, the shift value equals the subsequence length sub:
followed-by_value: Index = Index(value : end − value + 1)
We give this modified index as an input to Algorithm 2 to calculate the top-k-result best matches.
5.1 Time and Space Complexity
The time complexity of Qute is $O(n^2)$. Computing the word mapping typical takes $O(n^2)$, which is the largest time complexity among all the features [17]. The space complexity of Qute is $O(kn)$, where $k$ is the number of words. Currently, our feature-set contains 20 words.
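The followed-by handling described in Sect. 5 can be sketched as a shift with zero padding. This is one interpretation of the truncation formula: the index of the second word is shifted left by `value` positions so that position j of the shifted index describes the subsequence starting at j + value, and zeros restore the original length. The function name and the toy values are assumptions.

```python
def shift_for_followed_by(index_feature, value):
    """Drop the first `value` entries and zero-pad the end to keep the length."""
    if value < 0:
        raise ValueError("value must be non-negative")
    shifted = index_feature[value:]
    return shifted + [0.0] * (len(index_feature) - len(shifted))


# Toy indexes for the query "rising followed-by falling" with value = 2.
rising = [1.0, 0.25, 0.0, 0.5, 0.0]
falling = [0.0, 0.25, 0.75, 0.0, 0.5]

shifted_falling = shift_for_followed_by(falling, 2)
combined = [r + f for r, f in zip(rising, shifted_falling)]
print(shifted_falling)
print(combined)
```

After the shift, the element-wise sum is highest at position 0, where a rising subsequence is followed two steps later by a falling one, and the combined index can be passed to the top-k search unchanged.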
6 Experimental Evaluation
We demonstrate our framework through a series of experiments. We have provided a website that contains all data, code, and raw spreadsheets for all the experiments [15].
6.1 Bird Behavior
Poultry farms are a vital source of food for humans, and studying chicken behaviors is important for improving production levels and chicken welfare [18]. It is believed that the frequency and timing of behaviors such as pecking, preening and dustbathing can be good indicators of chicken health [19]. We consider a three-dimensional time series of a domestic chicken (Gallus domesticus) [19, 20]. This dataset was recorded by a wearable accelerometer placed on the back of the chicken. As Fig. 4 hints, this data source is complex and noisy, yet it is suggestive of some structure. In datasets of human behavior, the accelerometer is worn on the ankle or wrist, providing direct measurements of behaviors. In contrast, with the placement of the sensor on the chicken’s back, we have a more indirect measurement of the bird’s behavior.
Fig. 4. (Left) A small accelerometer (AX3, Axivity Ltd, UK) was placed on a chicken. (Right) The resulting data is noisy and complex, but does hint at some structure.
We wish to search this dataset for pecking behaviors. One might imagine that this behavior consists of a steady state, a movement to the ground, and a return to a steady state. However, based on our experience with chickens, we might expect a slightly different shape, as chickens typically draw back a little before striking the ground, like a boxer winding up a punch. In either case, the feature symmetric looks like a good candidate, but by itself it might produce false positives, as there may be other symmetric behaviors. Instead, we expand our query to SURGE flat then convex and symmetric, with our subsequence length set at 25 (1/25th of a second). Figure 5 shows the top three query results.
Fig. 5. (Left) The top three answers to our query in the chicken dataset, based on inspection of video, all are true positives; (right) 30 min of X-acceleration of chicken data time series.
Another behavior in the dataset is preening. We generate the preening query after inspecting the results from [19]. The preening query is: SURGE flat then complex and low-amplitude then flat. The results are very promising. To measure performance, we define accuracy as follows. Each separately labeled behavior is called a bag. A bag that our algorithm correctly identifies as having the behavior is called a True-bag. Accuracy is the sum of the lengths of all True-bags divided by the total length of all bags:
$Accuracy = \frac{\sum_{j \in \text{True-bag}} \text{length of labeled behavior}_j}{\sum_{i \in \text{bag}} \text{length of labeled behavior}_i}$
The accuracy of Qute for the preening query is 99.7%. Figure 6 shows the preening query’s results. The random sampling accuracy is just 8%.
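The bag-based accuracy can be computed with a short sketch. The text does not say when a bag counts as correctly identified, so the rule used here (a bag is a True-bag if at least one returned result overlaps it) is an assumption, as are the interval representation and the function names.

```python
def overlaps(a, b):
    """True if half-open intervals a = (start, end) and b = (start, end) intersect."""
    return a[0] < b[1] and b[0] < a[1]


def bag_accuracy(bags, results):
    """Sum of True-bag lengths divided by the total length of all bags."""
    total = sum(end - start for start, end in bags)
    if total == 0:
        return 0.0
    true_bag_length = sum(
        end - start
        for start, end in bags
        if any(overlaps((start, end), r) for r in results)
    )
    return true_bag_length / total


# Three labeled behaviors (bags) and two query results, as (start, end) positions.
bags = [(0, 100), (200, 250), (400, 450)]
results = [(50, 75), (410, 435)]
print(bag_accuracy(bags, results))
```

Weighting by bag length means that missing a long labeled behavior costs more than missing a short one.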
Fig. 6. (Top) A binary vector (green) represents the preening behavior. This ground truth was obtained by human annotators who watched contemporaneously recorded videos. (Bottom) The top-140 results of the preening query for the time series shown in Fig. 5.
We also compare our results, by measuring the accuracy of searching preening behavior for random sampling and Query-by-Example. We use Euclidean distance as a similarity measure. The following table shows the results.
Methods            Accuracy
Random-Sampling    8%
Query-by-Example   18.3%
Qute               97.7%
As the results show, the accuracy of Qute for the preening behavior is much higher than that of the shape-based methods, which use the ED distance measure and may not generalize well.
6.2 On the Expressiveness of Our Vocabulary
The earlier case studies suggest that our initial vocabulary, as defined in Sect. 4, is expressive enough for some cases. In this section, we compare a Qute-based nearest neighbor classifier to the shape-based results published in the UCR Archive [21]. The datasets in the archive were explicitly collected to compare shape-based distance measures. They consist of arrays of subsequence data, and many of these subsequences were sampled from larger time series with a shape-based extraction algorithm. Qute, in contrast, is designed for searching time series. For this set of experiments, we designed a nearest neighbor classifier that builds on top of Qute. To measure feature-based accuracy, we used K features to project the exemplars into K-dimensional space.
We made class predictions for both Qute and shape-based methods using a one-nearest neighbor classifier. All the results are available at [15]. Table 1 shows some of our best results: examples where nearest neighbor classification using Qute matched or outperformed shape-based classification.
Table 1. Selected highlights comparing feature-based and shape-based classification.
UCR Archive Dataset     Features Used                          Shape-based Accuracy   Qute Accuracy
InsectEPGSmallTrain     exponential-decay                      0.66                   0.66
GunPointMaleVsFemale    convex, normalized-complex             0.97                   0.99
PigArtPressure          unnormalized-complex, high-amplitude   0.13                   0.51
SemgHandMovementCh2     high-amplitude, exponential-decay      0.36                   0.71
HouseTwenty             normalized-complex                     0.66                   0.76
These experiments suggest a useful way to add new features or optimize current ones. For example, a future researcher may argue that our definition of the exponential-decay feature is flawed. If their replacement definition could further improve the accuracy on the SemgHandMovementCh2 dataset, this would lend weight to their argument.
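The feature-based classifier of Sect. 6.2 can be sketched as a one-nearest neighbor search in K-dimensional feature space. How Qute reduces a whole exemplar to a single value per feature is not specified in the text, so the two feature functions below (amplitude and net rise) are hypothetical stand-ins, and Euclidean distance between feature vectors is an assumption.

```python
import math


def amplitude(exemplar):
    """Hypothetical feature: value range of the exemplar."""
    return max(exemplar) - min(exemplar)


def net_rise(exemplar):
    """Hypothetical feature: last value minus first value."""
    return exemplar[-1] - exemplar[0]


def project(exemplar, features):
    """Map an exemplar to a point in K-dimensional feature space."""
    return [feature(exemplar) for feature in features]


def one_nn_predict(train, labels, query, features):
    """Return the label of the training exemplar closest to the query."""
    q = project(query, features)
    best_label, best_dist = None, math.inf
    for exemplar, label in zip(train, labels):
        dist = math.dist(q, project(exemplar, features))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


features = [amplitude, net_rise]  # K = 2
train = [[0, 1, 2, 3], [3, 2, 1, 0], [0, 5, 0, 5]]
labels = ["up", "down", "zigzag"]
print(one_nn_predict(train, labels, [1, 2, 2, 4], features))
print(one_nn_predict(train, labels, [4, 3, 1, 0], features))
```

Swapping a feature function for an alternative definition and re-running the classifier is one way to carry out the kind of comparison suggested above.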
7 Subsuming Query-by-Example and Query-by-Sketching
We also introduce a feature word, shape. Adding shape allows Qute to include the existing QbS and QbE systems for time series [6, 9, 22]. In essence, we add the user query of a QbE/QbS method to our system as a shape word. This feature makes Qute more expressive than QbS and QbE.
8 Converting Query-by-Example to Natural Language Query
As mentioned in the previous section, Qute subsumes QbE and QbS. In this section, we show another ability of Qute: converting a QbE query into a natural language query, and then using the generated text to search the time series for similar patterns. Consider the following example. A real pattern extracted from [11] is embedded into a smoothed random walk. The real pattern comes from a collection of human trajectories for picking up food and feeding it to a mannequin, gathered to support robotic feeding. The embedded pattern is “eating a carrot”. Suppose the user finds the “eating a carrot” pattern in the time series, as shown in Fig. 7. Qute allows the user to select a desired query from the time series. A sorted list of the feature words is then displayed to the user. For the example shown in Fig. 7, Qute generated the following query: simple, smooth, concave, symmetric
Fig. 7. An example of a dataset with an “eating a carrot” pattern embedded in unstructured data.
Now the user can search for the “eating a carrot” pattern in different time series using the text generated by Qute. As shown in Fig. 8, each time series consists of an “eating a carrot” pattern embedded in a random walk. In this example, Qute and QbE are both able to find all instances of the “eating a carrot” pattern.
Fig. 8. Each time series consists of an “eating a carrot” pattern embedded in random walk data. Qute and QbE are both successful in finding the “eating a carrot” behavior.
For the time series shown in Fig. 9, QbE fails to find any instance of the “eating a carrot” pattern, while Qute is able to find all the instances. Figure 10 shows a comparison between the user-selected query (blue) from Fig. 9 and the results of QbE (red) and Qute (green). QbE is unable to find the green pattern because the distance between the green pattern and the user-selected query (blue) is higher than the distance between the red pattern and the user-selected query. The instances of the “eating a carrot” pattern share a similar higher-level structure: reaching toward the carrot and putting it in the mouth is the same in all behaviors. Using QbE to search for the pattern means including all the details of the pattern and comparing them to the instances. For example, QbE
Fig. 9. Qute is able to find the correct pattern; however, QbE finds random walk data instead of “eating a carrot” behaviors. (Color figure online)
compares all the points, so it captures a pattern (red) that is completely random. Qute, however, uses higher-level context, such as the smoothness or symmetry of the behavior, and so is able to find all the instances of the “eating a carrot” pattern.
Fig. 10. The comparison between the original query and the results of Qute and QbE. Qute finds the correct behavior; however, QbE finds random data. (Color figure online)
We also compared our results with the current state of the art, qetch, a Query-by-Sketching system [6]. In qetch, “users freely sketch patterns on a scale-less canvas to query time series data without specifying query length or amplitude” [6]. We sketched patterns similar to “eating a carrot” using qetch, as shown in Fig. 11. qetch annotates its results as good, fair, or poor based on the distance measure. Naturally, we would hope that the desired “eating a carrot” pattern is labeled as a good match. However, as Fig. 11 shows, this is not the case. Moreover, the good matches belong to the random data. QbE and QbS methods such as qetch are shape-based methods. These methods cannot find the “eating a carrot” behavior because, as noted in Sect. 2, shape-based methods use ED or DTW distance measures, which may not generalize well.
Fig. 11. (Top) Some examples of sketching the “eating a carrot” pattern. (Middle) The zoomed-in area that shows the first match of our sketches with the time series. The best match is random data. (Bottom) The fair and poor matches output by qetch.
9 Discussion and Conclusions
We have introduced Qute, a natural language search system for time series data. We have shown that our system allows the user to compose queries at a higher conceptual level. For example, instead of the indirection caused by a user thinking “I need to find a sudden drop in temperature”, and then having to navigate the idiosyncrasies of a particular system to specify this, she can simply type (or vocalize) TEMPERATURE decrease. This more natural mode of interaction does not come with any decrease in expressiveness. We have demonstrated that our system is more expressive than existing Query-by-Sketching and Query-by-Example systems [5, 6]. In future work, we intend to improve the mappings between the data and the words by making the features perceptually uniform [23]. We also intend to perform additional user studies in different fields, including economics, statistics and biology. This will enable us to expand the vocabulary and the dictionary-set of our system.
9.1 Appendix: Mapping Between a Word and Time Series
Suppose the user wants to find spiky sections in a time series. Can we concretely quantify this concept? Even for a simple feature such as spiky, “the best method to identify spikes in time series is not known” [24]. Our first idea is to capture spikiness by looking for a high residual L1 norm between an input time series and a smoothed version of the input. This definition has many flaws. In some cases, it will give a high score to a subsequence that one might consider noisy rather than spiky. Alternatively, we can replace the L1 norm with the maximum residual value.
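The two candidate definitions of spiky can be compared on toy data. The smoother used here (a centered moving average of width 3 that shrinks at the boundaries) and the function names are assumptions; the appendix does not say which smoother the authors used.

```python
def moving_average(t, width=3):
    """Centered moving average; windows shrink at the boundaries."""
    half = width // 2
    smoothed = []
    for i in range(len(t)):
        window = t[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def residuals(t, width=3):
    """Absolute difference between the input and its smoothed version."""
    return [abs(a - b) for a, b in zip(t, moving_average(t, width))]


def spiky_l1(t, width=3):
    """First idea: L1 norm of the residual."""
    return sum(residuals(t, width))


def spiky_max(t, width=3):
    """Alternative: maximum residual value."""
    return max(residuals(t, width))


spike = [0, 0, 0, 0, 9, 0, 0, 0, 0]   # one isolated spike
noise = [0, 3, 0, 3, 0, 3, 0, 3, 0]   # persistent oscillation, no spike

print(spiky_l1(spike), spiky_l1(noise))
print(spiky_max(spike), spiky_max(noise))
```

On this input, the L1 version ranks the noisy subsequence above the spiky one, which is the flaw described above, while the maximum residual ranks the isolated spike higher.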
For each feature that we implement, we refine its definition over many iterations and test it on different data sets until it produces acceptable results. As [24] points out, even for something like a spike, there may be different definitions that are best for epidemiologists vs. economists, etc. The formal definitions and code for each feature are relegated to [15].
References
1. Koolen, M., Kamps, J., Bogers, T., Belkin, N., Kelly, D., Yilmaz, E.: Report on the second workshop on supporting complex search tasks. In: ACM SIGIR Forum, vol. 51, pp. 58–66. ACM (2017)
2. Schäfer, P., Leser, U.: Fast and accurate time series classification with weasel. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 637–646. ACM (2017)
3. Roitman, H., Yogev, S., Tsimerman, Y., Peres, Y.: Towards discovery-oriented patient similarity search. In: ACM SIGIR Workshop on Health Search & Discovery, p. 15
4. Sarker, H., Tyburski, M., Rahman, Md.M., Hovsepian, K., Sharmin, M., Epstein, D.H., Preston, K.L., Furr-Holden, C.D., Milam, A., Nahum-Shani, I., et al.: Finding significant stress episodes in a discontinuous time series of rapidly varying mobile sensor data. In: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 4489–4501. ACM (2016)
5. Keogh, E.J., Pazzani, M.J.: Relevance feedback retrieval of time series data. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 183–190. ACM (1999)
6. Mannino, M., Abouzied, A.: Expressive time series querying with hand-drawn scale-free sketches. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, p. 388. ACM (2018)
7. Lee, D.J.-L., Lee, J., Siddiqui, T., Kim, J., Karahalios, K., Parameswaran, A.: You can’t always sketch what you want: understanding sensemaking in visual query systems. IEEE Trans. Vis. Comput. Graph. 26, 1267–1277 (2019)
8. Zamora-Martinez, F., Romeu, P., Botella-Rocamora, P., Pardo, J.: On-line learning of indoor temperature forecasting models towards energy efficiency. Energy Build. 83, 162–172 (2014)
9. Correll, M., Gleicher, M.: The semantics of sketch: flexibility in visual query systems for time series data. In: 2016 IEEE Conference on Visual Analytics Science and Technology (VAST), pp. 131–140. IEEE (2016)
10. Imani, S., Keogh, E.: Matrix profile XIX: time series semantic motifs: a new primitive for finding higher-level structure in time series. In: 2019 IEEE International Conference on Data Mining (ICDM), pp. 329–338. IEEE (2019)
11. Bhattacharjee, T., Lee, G., Song, H., Srinivasa, S.S.: Towards robotic feeding: role of haptics in fork-based food manipulation. IEEE Robot. Autom. Lett. 4(2), 1485–1492 (2019)
12. Chiu, B., Keogh, E., Lonardi, S.: Probabilistic discovery of time series motifs. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 493–498. ACM (2003)
13. Imani, S., Alaee, S., Keogh, E.: Putting the human in the time series analytics loop. In: Companion Proceedings of The 2019 World Wide Web Conference, pp. 635–644. ACM (2019)
14. Imani, S., Keogh, E.: Natura: towards conversational analytics for comparing and contrasting time series. In: Companion Proceedings of the Web Conference 2020, pp. 46–47 (2020)
15. Author: Project website (2019). https://sites.google.com/site/nlptimeseries/
16. Shekarpour, S., Marx, E., Auer, S., Sheth, A.: RQUERY: rewriting natural language queries on knowledge graphs to alleviate the vocabulary mismatch problem. In: Thirty-First AAAI Conference on Artificial Intelligence (2017)
17. Giachanou, A., Crestani, F.: Tracking sentiment by time series analysis. In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1037–1040. ACM (2016)
18. Tixier-Boichard, M., Bed’hom, B., Rognon, X.: Chicken domestication: from archeology to genomics. C.R. Biol. 334(3), 197–204 (2011)
19. Abdoli, A., Murillo, A.C., Yeh, C.-C.M., Gerry, A.C., Keogh, E.J.: Time series classification to improve poultry welfare. In: 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 635–642. IEEE (2018)
20. Alaee, S., Abdoli, A., Shelton, C., Murillo, A.C., Gerry, A.C., Keogh, E.: Features or shape? Tackling the false dichotomy of time series classification. In: Proceedings of the 2020 SIAM International Conference on Data Mining, pp. 442–450. SIAM (2020)
21. Dau, H.A., Bagnall, A., Kamgar, K., Yeh, C.-C.M., Zhu, Y., Gharghabi, S., Ratanamahatana, C.A., Keogh, E.: The UCR time series archive. arXiv preprint arXiv:1810.07758 (2018)
22. Hochheiser, H., Shneiderman, B.: A dynamic query interface for finding patterns in time series data. In: CHI 2002 Extended Abstracts on Human Factors in Computing Systems, pp. 522–523. ACM (2002)
23. Safdar, M., Cui, G., Kim, Y.J., Luo, M.R.: Perceptually uniform color space for image signals including high dynamic range and wide gamut. Opt. Express 25(13), 15131–15151 (2017)
24. Goin, D.E., Ahern, J.: Identification of spikes in time series. arXiv preprint arXiv:1801.08061 (2018)
Establishing a Formal Benchmarking Process for Sentiment Analysis for the Bangla Language
AKM Shahariar Azad Rabby (1, ✉), Aminul Islam (1), and Fuad Rahman (2)
(1) Apurba Technologies, Dhaka, Bangladesh, {rabby,aminul}@apurbatech.com
(2) Apurba Technologies, Sunnyvale, CA, USA, [email protected]
Abstract. Tracking sentiments is a critical task in many natural language processing applications. A lot of work has been done on many leading languages in the world, such as English. However, in many languages such as Bangla, sentiment analysis is still in early development. Most of the research on this topic suffers from three key issues: (a) the lack of standardized publicly available datasets, (b) the subjectivity of the reported results, which generally manifests as a lack of agreement on core sentiment categorizations, and finally, (c) the lack of an established framework where these efforts can be compared to a formal benchmark. Thus, this seems to be an opportune moment to establish a benchmark for sentiment analysis in Bangla. With that goal in mind, this paper presents benchmark results of ten different sentiment analysis solutions on three publicly available Bangla sentiment analysis corpora. As part of the benchmarking process, we have optimized these algorithms for the task at hand. Finally, we establish and present sixteen different evaluation metrics for benchmarking these algorithms. We hope that this paper will jumpstart an open and transparent benchmarking process, one that we plan to update every two years, to help validate new algorithms reported in this area in the future.
Keywords: Sentiment analysis · NLP · Bangla sentiment corpus · Annotation · Benchmarking
1 Introduction
The explosion of information technology, especially the use of social media, has resulted in a vast amount of content that is thrown at human beings at any given moment. A lot of this content is tied to social, political, and economic interests, and the publishers of this content have a vested interest in tracking whether the audience likes it or not. For instance, data-driven trend analysis is an essential part of modern politics and advertising. Less dramatic, but equally critical, applications of sentiment analysis include customer reviews on online shopping sites and opinion mining on newspapers to gauge public sentiment on national security issues, just to name a few.
© Springer Nature Switzerland AG 2021 K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 428–448, 2021. https://doi.org/10.1007/978-3-030-63089-8_28
Establishing a Formal Benchmarking Process
Bangla is spoken as the first language by almost 200 million people worldwide, 160 million of whom hold Bangladeshi citizenship. But Natural Language Processing (NLP) development of the Bangla language is in its very early stages, and there is not yet enough labeled data to work with for the language. Because of the scarcity of labeled data and standardized corpora, little work has been reported in this space. Recently, a sentiment analysis corpus of about 10,000 sentences was made public by Apurba Technologies [1]. We searched for and located two additional, albeit smaller, open-source datasets in this space [2]. We built ten different sentiment analysis algorithms using Machine Learning (ML), statistical modeling, and other methods. This paper benchmarks these 10 algorithms on the above-mentioned 3 annotated corpora. The paper is arranged as follows. We begin by reviewing the existing state of the art of sentiment analysis in Bangla, which, as stated already, is not very rich. The principal issue that becomes clear is that the efforts reported on this topic cannot be compared with one another, since they use different datasets and the datasets are almost never available to other researchers. As a natural segue from this topic, we then present how we combined all the publicly available sources of sentiment corpora into a large dataset. We then move to designing 14 different metrics that form the benchmarking framework. We then describe 10 different sentiment analysis algorithms that have been reported in the literature. Although this list is not exhaustive in any sense, it does cover the majority of the work reported in this space. We not only implemented these algorithms, we also fine-tuned the parameters to optimize each of these solutions. Finally, these 10 algorithms were benchmarked on the 14 metrics identified earlier. The paper ends with a discussion of the reported work.
2 Brief Background
There are three classification levels in sentiment analysis: document-level, sentence-level, and aspect-level. At the document level, overall sentiment is assessed based on the complete text. Sentence-level analysis aims to classify the sentiment expressed in each sentence. The first step is to identify whether the sentence is subjective or objective. If the sentence is subjective, sentence-level analysis will determine whether the sentence expresses positive or negative opinions [3]. In aspect-based sentiment analysis, sentiments are assessed on aspects or points of view of a topic, especially with multi-clausal sentences. For the rest of this paper, we will exclusively focus on sentence-level sentiment analysis. Machine learning techniques for sentiment analysis are getting better, especially for vector representation models, where some of these models can extract semantics that help to understand the intent of the messages [4]. Many machine learning and deep learning techniques have been reported for identifying and classifying sentiment polarity in a document or sentence. Existing research demonstrates that Long Short-Term Memory networks (LSTMs) are capable of learning the context and inherent meaning of a word and provide more accurate results for sentiments [5]. Classification algorithms such as Random Forest, Decision Tree Classifier, and the k-nearest neighbors (KNN) algorithm are suitable for classification based on feature sets. Naive Bayes is based on Bayes’ theorem. Convolutional
Neural Networks (CNNs), a commonly used tool in deep learning, work well for sentiment analysis because their standard architecture can map sentences of variable length into fixed-size scattered vectors [6].¹

¹ Recently, many pre-trained language models such as BERT [30], ELMo [31], and XLNet have been reported to achieve promising results on several NLP tasks, including sentiment analysis. However, these models mainly target the English language, not Bangla.

AKM Shahariar Azad Rabby et al.

Table 1. Bangla sentiment analysis - previous work

| Paper Title | Author | Method | Dataset | Size | Availability | Acc | Year |
|---|---|---|---|---|---|---|---|
| Sentiment Analysis of Bengali Comments with Word2Vec and Sentiment Information of Words [7] | Md. Al-Amin, Md. Saiful Islam, Shapan Das Uzzal | word2vec and sentiment extraction of words | Self-collected comments data | 15,000 comments | Not publicly available | 75.5% | 2017 |
| Performing Sentiment Analysis in Bangla Microblog Posts [8] | Shaika Chowdhury, Wasifa Chowdhury | Support Vector Machine (SVM) and Maximum Entropy (MaxEnt) | Bangla tweets | 1,300 tweets | Not publicly available | SVM 88%, MaxEnt 88% | 2014 |
| Sentiment Analysis for Bengali Newspaper Headlines [9] | Mohammad Samman Hossain, Israt Jahan Jui, Afia Zahin Suzana | Support Vector Machine, Logistic Regression, etc. | Self-collected news headline data | 15,325 headlines | Not publicly available | LR: 75.91%, SVM: 79.56%, Tree: 76.64% | 2017 |
| Sentiment Analysis on Bangla and Romanized Bangla Text (BRBT) using Deep Recurrent models [10] | Asif Hassan, Mohammad Rashedul Amin, Abul Kalam Al Azad, Nabeel Mohammed | LSTM, using two types of loss functions: binary cross-entropy and categorical cross-entropy | Self-collected | 10,000 Bangla text samples | Not publicly available | 78% | 2016 |
| Exploring Word Embedding for Bangla Sentiment Analysis [11] | Sakhawat Hosain Sumit, Md. Zakir Hossan, Tareq Al Muntasir, Tanvir Sourov | Word embedding methods Word2vec Skip-Gram and Continuous Bag of Words, with an additional Word to Index model for SA in the Bangla language | Bangla Web Crawl, Bangla Sentiment Dataset | 1,899,094 sentences; 23,506,262 words; 394,297 unique words | Not publicly available | 83.79% | 2018 |
| Sentiment Analysis of Bangla Microblogs Using Adaptive Neuro Fuzzy System [12] | Md. Asimuzzaman, Pinku Deb Nath, Farah Hossain, Asif Hossain, Rashedur M. Rahman | Fuzzy rules to represent semantic rules that are simple but greatly influence the actual polarity of the sentences | Bangla tweets using Twitter APIs | NA | Not publicly available | MSE 0.0529 | 2017 |
| An Automated System of Sentiment Analysis from Bangla Text using Supervised Learning Techniques [13] | Rashedul Amin Tuhin, Bechitra Kumar Paul, Faria Nawrine, Mahbuba Akter, Amit Kumar Das | Naïve Bayes classification algorithm and topical approach to extract the emotion | Self-collected | 7,500 Bangla sentences | Not publicly available | Above 90% | 2019 |
| Extracting Severe Negative Sentence Pattern from Bangla Data via Long Short-term Memory Neural Network [14] | Abdul Hasib Uddin, Sumit Kumar Dam, Abu Shamim Mohammad Arif | Long Short-term Memory (LSTM) neural networks for analyzing negative sentences in Bangla | Dataset from Hassan, Asif, et al. | 9,337 posts | Not publicly available | 84.4% | 2019 |
| Design an Empirical Framework for Sentiment Analysis from Bangla Text using Machine Learning [15] | Nusrath Tabassum, Muhammad Ibrahim Khan | Random Forest classifier to classify sentiments | Self-collected | 1,050 Bangla texts | Not publicly available | 87% | 2019 |
| Sentiment analysis for Bangla sentences using convolutional neural network [16] | Md. Habibul Alam, Md-Mizanur Rahoman, Md. Abul Kalam Azad | Model generated by a neural network variant, the Convolutional Neural Network | Self-collected | 850 Bangla comments from different sources | Not publicly available | 99.87% | 2017 |
| Sentiment mining from Bangla data using mutual information [17] | Animesh Kumar Paul, Pintu Chandra Shill | Mutual Information (MI) for feature selection and Multinomial Naive Bayes (MNB) for classification | Generated from Amazon's Watches English dataset | 68,356 translated reviews | Not publicly available | 88.54% | 2016 |
| Detecting Multilabel Sentiment and Emotions from Bangla YouTube Comments [18] | Nafis Irtiza Tripto, Mohammed Eunus Ali | Deep learning based models to classify a Bangla sentence with a three-class | Self-collected YouTube comments | 15,689 YouTube comments | Not publicly available | 65.97% (three labels), 54.24% (five labels) | 2018 |
| Detecting Sentiment from Bangla Text using Machine Learning Technique and Feature Analysis [19] | Muhammad Mahmudun Nabi, Md. Altaf, Sabir Ismail | Used tf-idf to arrive at a better solution and more accurate results by extracting different features | Collected from various social sites | 1,500 short Bangla comments | Not publicly available | 83% | 2016 |
| N-Gram Based Sentiment Mining for Bangla Text Using Support Vector Machine [20] | SM Abu Taher, Kazi Afsana Akhter, K.M. Azharul Hasan | One vector containing more than one word, using N-grams | Collected from different sources | 9,500 comments | Not publicly available | 89.271% | 2018 |
| Sentiment Analysis of Bangla Song Review - A Lexicon Based Backtracking Approach [21] | Tapasy Rabeya, Narayan Ranjan Chakraborty, Sanjida Ferdous, Manoranjan Dash, Ahmed Al Marouf | A backtracking algorithm, where the heart of the approach is a sentiment lexicon | Collected from YouTube | 201 comments | Not publicly available | 70% | 2019 |
| Sentiment Extraction from Bangla Text: A Character Level Supervised Recurrent Neural Network Approach [22] | Mohammad Salman Haydar, Mustakim Al Helal, Syed Akhter Hossain | Represents Bangla sentences based on characters and extracts information from the characters using an RNN | Collected from Facebook using the Facebook Graph API | 45,000 | Not publicly available | 80% | 2018 |
| Sentiment analysis on the Facebook group using lexicon-based approach [23] | Sanjida Akter, Muhammad Tareq Aziz | Naïve Bayes and dictionary-based approach for lexicon-based sentiment analysis | Collected from a Facebook group | 9,000 words | Not publicly available | 73% | 2016 |
| Sentiment Analysis of Bengali Texts on Online Restaurant Reviews Using Multinomial Naïve Bayes [24] | Omar Sharif, Mohammed Moshiul Hoque, Eftekhar Hossain | Multinomial Naïve Bayes used for sentiment analysis | Self-collected | 1,000 restaurant reviews | Not publicly available | 80.48% | 2019 |
Table 1 shows the state of the art of Bangla sentiment analysis research. One observation that is painfully plain in this table is that all of the authors of these papers spent valuable time in building and annotating their own datasets. What is even more alarming is that none of these datasets were then made publicly available. This has made it impossible to compare the validity and relative strengths or weaknesses for any of these solutions, making the task of establishing a benchmark framework impossible.
3 Dataset
In this research, we used three different datasets. The first dataset is our own, which we previously published [1], representing the largest open-access sentiment analysis dataset for Bangla, with 9,630 samples. The second is the ABSA Sports dataset [2], with 2,979 samples. The third and final dataset [2] is the ABSA Restaurant dataset, with 2,059 samples. All datasets have three sentiment categorizations: positive, negative, and neutral. For simplicity, we excluded all of the neutral data from our datasets. After eliminating the neutral samples, the Apurba, ABSA Sports, and ABSA Restaurant datasets have 7,293, 2,718, and 1,808 positive and negative samples, respectively. The proposed benchmarking system has four stages: data collection, data preprocessing, training, and evaluation.
3.1 Dataset Collection
The Apurba dataset was collected from a popular online news portal, “Prothom Alo”, tagged manually, and checked twice for validation. The dataset is open source for all types of non-commercial usage and is intended for educational and research use. The other two datasets can easily be obtained from GitHub. We also merged these three datasets into a mixed dataset.
3.2 Data Pre-processing
Data cannot be used as-is in most machine learning algorithms; it needs to be processed before anything else can be done. In this research, we took the text and the annotated sentiment values. We excluded the neutral samples and represented the positive class with 0 and the negative class with 1. We removed all unnecessary characters, including punctuation, URLs, extra white space, emoticons, symbols, pictographs, transport and map symbols, iOS flags, digits, and 123 other characters. After all these steps, the preprocessed dataset looks as shown in Fig. 1.
Fig. 1. Processed dataset sample
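The cleaning step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function names `clean_text` and `encode_label`, the label strings "positive" and "negative", and the choice to drop every character in the Unicode punctuation, symbol, and number classes are all hypothetical, since the paper does not list the exact characters it removes.

```python
import re
import unicodedata

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
WHITESPACE_PATTERN = re.compile(r"\s+")


def clean_text(text):
    """Strip URLs, punctuation, symbols (emoji, flags, pictographs) and digits."""
    text = URL_PATTERN.sub(" ", text)
    kept = []
    for ch in text:
        # Unicode major classes: P = punctuation, S = symbols, N = numbers.
        # Letters and combining marks (Bangla vowel signs) are kept.
        if unicodedata.category(ch)[0] in "PSN":
            kept.append(" ")
        else:
            kept.append(ch)
    # Collapse the runs of white space left behind by the removals.
    return WHITESPACE_PATTERN.sub(" ", "".join(kept)).strip()


def encode_label(sentiment):
    """Map the annotated sentiment to the class ids used in the paper."""
    return {"positive": 0, "negative": 1}[sentiment]


print(clean_text("খেলাটা দারুণ ছিল!!! 😀 https://example.com ১২৩"))
```

Filtering by Unicode category rather than by a fixed character list keeps the Bangla vowel signs intact, which a plain `\w` regular expression would not.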
Tokenization is the task of splitting a given sentence into its individual words, which are then known as tokens. Tokenizers accomplish this task by locating word boundaries: the ending point of one word and the beginning of the next. We tokenize each sentence based on white space. The next step is removing stop-words, which are commonly used words (such as “a” or “and”) that our algorithm ignores. Figure 2 shows a typical example of these steps.
Fig. 2. Pre-processing steps
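A minimal sketch of the two steps above, white-space tokenization followed by stop-word removal, could look as follows. The stop-word set is a small hypothetical sample chosen for illustration; the paper does not publish the list it used.

```python
# Hypothetical stop-word sample; the actual list used in the paper is not given.
STOP_WORDS = {"এবং", "ও", "কিন্তু", "যে", "এই"}


def tokenize(sentence):
    """Split on white space; str.split() with no argument handles repeated spaces."""
    return sentence.split()


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stop_words]


tokens = tokenize("খেলাটা দারুণ ছিল এবং দর্শক খুশি")
print(remove_stop_words(tokens))
```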
We then prepare a “term frequency-inverse document frequency” vectorization, commonly known as tf-idf, which creates a sparse matrix. The sparse matrix contains a vector representation of our data. The tf-idf output is used as a weighting factor that measures how important a word is to a document within the given corpus. Then we split our data into two portions: 80% for training and 20% for testing model performance. Figure 3 shows the flowchart of these pre-processing steps.
Fig. 3. Flowchart of the pre-processing steps
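To make the vectorization and the 80/20 split concrete, the sketch below computes an unsmoothed tf-idf weight (term frequency times log(N / document frequency)) and performs a seeded shuffle split. It is an assumption-laden illustration: the paper does not state which tf-idf variant or which library it used, library implementations typically add smoothing and normalization, and the names `tfidf_vectors` and `split_train_test` are hypothetical.

```python
import math
import random
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse {term: weight} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in docs:
        vector = {}
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)                   # term frequency in this document
            idf = math.log(n_docs / doc_freq[term])    # rarer terms get a larger weight
            vector[term] = tf * idf
        vectors.append(vector)
    return vectors


def split_train_test(samples, test_ratio=0.2, seed=42):
    """Shuffle with a fixed seed and hold out test_ratio of the samples."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(samples) * test_ratio)
    test = [samples[i] for i in indices[:n_test]]
    train = [samples[i] for i in indices[n_test:]]
    return train, test


vectors = tfidf_vectors([["ভালো", "খেলা"], ["খারাপ", "খেলা"]])
train, test = split_train_test(list(range(10)))
print(vectors[0], len(train), len(test))
```

Storing each document as a dictionary of non-zero weights mirrors the sparse matrix mentioned in the text: a term shared by every document gets weight zero and carries no information for classification.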
4 Benchmarking Indices
Sensitivity analysis determines how target variables are affected by changes in other variables, known as input variables. This approach, also referred to as what-if or simulation analysis, is a way to predict the outcome of a decision given a certain range of variables. By creating a given set of variables, an analyst can determine how changes in one variable affect the outcome. We have used a set of universally standardized indices for validating the algorithms, including Confusion Matrix (CM), True Positive Rate (TPR), True Negative Rate (TNR), False Negative Rate (FNR), False Positive Rate (FPR), Positive Predictive Value (PPV), Negative Predictive Value (NPV), False Discovery Rate (FDR), False Omission Rate (FOR), Accuracy (ACC), F1 Score, R2 Score, Receiver Operating Characteristic (ROC), and Area Under the Curve (AUC) [24–28].
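Most of the indices listed above follow directly from the four cells of a binary confusion matrix. The sketch below shows the standard definitions; the matrix layout [[TN, FP], [FN, TP]] is an assumption (the paper does not state it), chosen because it reproduces the reported Multinomial Naive Bayes values for the Apurba dataset. The function name `sensitivity_indices` is hypothetical.

```python
def sensitivity_indices(cm):
    """cm is assumed to be [[TN, FP], [FN, TP]]; results are percentages."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp

    def pct(numerator, denominator):
        # An empty denominator means the index is undefined for this matrix.
        if denominator == 0:
            return None
        return round(100.0 * numerator / denominator, 2)

    return {
        "TPR": pct(tp, tp + fn),   # sensitivity / recall
        "TNR": pct(tn, tn + fp),   # specificity
        "FNR": pct(fn, tp + fn),
        "FPR": pct(fp, tn + fp),
        "PPV": pct(tp, tp + fp),   # precision
        "NPV": pct(tn, tn + fn),
        "FDR": pct(fp, tp + fp),
        "FOR": pct(fn, tn + fn),
        "ACC": pct(tp + tn, total),
        "F1": pct(2 * tp, 2 * tp + fp + fn),
    }


# Confusion matrix reported for Multinomial Naive Bayes on the Apurba dataset.
print(sensitivity_indices([[342, 264], [195, 658]]))
```

For this matrix the sketch gives TPR 77.14, TNR 56.44, PPV 71.37, NPV 63.69, F1 74.14, and ACC 68.54, which agree with the values the paper reports for that model and dataset.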
5 Sentiment Analysis Algorithms
We used ten different algorithms: Multinomial Naive Bayes, Bernoulli Naive Bayes, Logistic Regression, Random Forest, Decision Tree Classifier, K-Nearest Neighbors Classifier (KNN), Support Vector Machine (SVM), Ada-Boost Classifier, Extreme Gradient Boosting (XGBoost), and Long Short-Term Memory (LSTM). LSTM achieves the best performance among them. We used K-fold cross-validation and Grid Search to find the best parameters for all of our algorithms.
5.1 Multinomial Naive Bayes
Multinomial Naive Bayes estimates the conditional probability of a particular word given a class as the relative frequency of term t in samples belonging to class c. Multinomial Naive Bayes simply assumes a multinomial distribution for all the pairs, which seems to be a reasonable assumption in some cases, especially for word counts in documents.
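The relative-frequency estimate described above, together with the K-fold grid search mentioned at the start of this section, can be sketched in plain Python. This is not the authors' implementation, which the paper does not show: the function names, the additive smoothing term `alpha`, the interleaved fold assignment, and the candidate grid are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_mnb(docs, labels, alpha=1.0):
    """Fit Multinomial Naive Bayes on token lists; alpha > 0 is additive smoothing."""
    vocab = {token for tokens in docs for token in tokens}
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
    model = {"log_prior": {}, "log_lik": {}}
    for c, n_c in class_docs.items():
        total = sum(term_counts[c].values()) + alpha * len(vocab)
        model["log_prior"][c] = math.log(n_c / len(labels))
        # P(t | c): smoothed relative frequency of term t in samples of class c.
        model["log_lik"][c] = {
            t: math.log((term_counts[c][t] + alpha) / total) for t in vocab
        }
    return model


def predict_mnb(model, tokens):
    """Return the class with the highest log posterior; unseen terms are skipped."""
    scores = {}
    for c, log_prior in model["log_prior"].items():
        log_lik = model["log_lik"][c]
        scores[c] = log_prior + sum(log_lik[t] for t in tokens if t in log_lik)
    return max(scores, key=scores.get)


def kfold_accuracy(docs, labels, alpha, k=5):
    """Mean accuracy over k folds; every k-th sample forms one fold."""
    correct = 0
    for fold in range(k):
        test_idx = set(range(fold, len(docs), k))
        train_docs = [d for i, d in enumerate(docs) if i not in test_idx]
        train_labels = [y for i, y in enumerate(labels) if i not in test_idx]
        model = train_mnb(train_docs, train_labels, alpha)
        correct += sum(predict_mnb(model, docs[i]) == labels[i] for i in test_idx)
    return correct / len(docs)


def grid_search_alpha(docs, labels, grid=(0.1, 0.5, 0.9, 1.0), k=5):
    """Pick the alpha from the grid with the best cross-validated accuracy."""
    return max(grid, key=lambda alpha: kfold_accuracy(docs, labels, alpha, k))


docs = [["good", "match"], ["great", "win"], ["bad", "match"], ["poor", "loss"]]
labels = [0, 0, 1, 1]  # 0 = positive, 1 = negative, as in the paper's encoding
model = train_mnb(docs, labels, alpha=0.9)
print(predict_mnb(model, ["good", "win"]), grid_search_alpha(docs * 3, labels * 3, k=3))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a word that never occurred in one class from forcing that class's probability to zero.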
5.2 Bernoulli Naive Bayes
The Bernoulli Naive Bayes classifier assumes that all our features are binary, that is, they take only two values. This is similar to Multinomial Naive Bayes, but the predictors are Boolean variables. The features used to predict the class variable take only the values yes or no; for example, whether a word occurs in the text or not.
5.3 Logistic Regression
Logistic Regression is a statistical method for modeling a binary dependent variable. In this technique, the model estimates the probability of each class. Logistic Regression is an ML classification algorithm used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable that contains data coded as either 1 (yes, success, etc.) or 0 (no, failure, etc.). In other words, the logistic regression model predicts P(Y = 1) as a function of X.
5.4 Random Forest
A forest usually consists of many trees; in a random forest, a large number of individual decision trees operate as an ensemble. Every decision tree votes for a particular class, and the class that gets the most votes is selected as the model prediction.
5.5 Decision Tree Classifier
A decision tree is the purest form of classification algorithm. Decision trees consist of: (a) nodes that test the value of a particular attribute, (b) edges/branches that correspond to the outcome of a test and connect to the next node or leaf, and (c) leaf nodes, which are terminal nodes that predict the outcome (such as class labels or a class distribution).
5.6 KNN Classifier
In the field of AI, the k-nearest neighbors algorithm is a non-parametric technique used for classification. It is easy to implement, but its major problem is that it becomes slow as the amount of data increases.
5.7 SVM Classifier
A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm builds an optimal hyperplane that separates new examples into their constituent classes. In two-dimensional space, this hyperplane is a line dividing a plane into two parts, with each class lying on one side.
5.8 Ada-Boost Classifier
The general idea behind boosting methods is to train predictors sequentially, each trying to correct its predecessor. The basic concept behind Ada-Boost is to set the weights of the classifiers and of the training samples in each iteration so that accurate predictions are obtained even for unusual observations.
5.9 XGBoost
XGBoost is a decision-tree-based ensemble ML algorithm that uses a gradient boosting framework. Gradient-boosted models such as XGBoost can increase accuracy over a traditional statistical or conditional model and apply quite well to the two primary types of targets.
5.10 LSTM
Long Short-Term Memory (LSTM) networks are a modified version of recurrent neural networks that enables the memory storage of past data. RNN’s vanishing gradient problem is solved here. LSTM is ideal for classifying, analyzing, and forecasting time series owing to uncertain time lags.
6 Performance
6.1 Multinomial Naive Bayes
We found that with the alpha value set to 0.9, Multinomial Naive Bayes reaches a maximum accuracy of 76.65%. Table 2 shows the performance of Multinomial Naive Bayes, and Table 3 shows the sensitivity analysis for this algorithm.
Table 2. Multinomial Naive Bayes performance

| Dataset | CM | ACC | ROC AUC |
|---|---|---|---|
| Apurba | [[342, 264], [195, 658]] | 68.54% | 73.05% |
| ABSA sports | [[38, 72], [55, 379]] | 76.65% | 67.93% |
| ABSA restaurant | [[225, 37], [52, 48]] | 75.41% | 72.64% |
| All dataset | [[566, 466], [271, 1061]] | 68.82% | 73.05% |

Table 3. Sensitivity analysis of Multinomial Naive Bayes (values in %)

| Dataset | TPR | TNR | FNR | FPR | PPV | NPV | FDR | FOR | F1 |
|---|---|---|---|---|---|---|---|---|---|
| Apurba | 77.14 | 56.44 | 22.86 | 43.56 | 71.37 | 63.69 | 28.63 | 36.31 | 74.14 |
| ABSA sports | 87.33 | 34.55 | 12.67 | 65.45 | 84.04 | 40.86 | 15.96 | 59.14 | 85.65 |
| ABSA restaurant | 48.0 | 85.88 | 52.0 | 14.12 | 56.47 | 81.23 | 43.53 | 18.77 | 51.89 |
| All dataset | 79.65 | 54.84 | 20.35 | 45.16 | 69.48 | 67.62 | 30.52 | 32.38 | 74.22 |

6.2 Bernoulli Naive Bayes
For all datasets, we found that an alpha value of 0.8 gave the best performance. Table 4 shows the performance, and Table 5 shows the sensitivity analysis for Bernoulli Naive Bayes.
Table 4. Bernoulli Naive Bayes performance

| Dataset | CM | ACC | ROC AUC |
|---|---|---|---|
| Apurba | [[342, 264], [195, 658]] | 69.16% | 73.27% |
| ABSA sports | [[23, 87], [20, 414]] | 80.33% | 70.50% |
| ABSA restaurant | [[225, 37], [52, 48]] | 71.82% | 73.64% |
| All dataset | [[566, 466], [271, 1061]] | 67.98% | 73.54% |

Table 5. Sensitivity analysis of Bernoulli Naive Bayes (values in %)

| Dataset | TPR | TNR | FNR | FPR | PPV | NPV | FDR | FOR | F1 |
|---|---|---|---|---|---|---|---|---|---|
| Apurba | 78.19 | 56.44 | 21.81 | 43.56 | 71.64 | 64.77 | 28.36 | 35.23 | 74.78 |
| ABSA sports | 92.86 | 23.64 | 7.14 | 76.36 | 82.75 | 45.61 | 17.25 | 54.39 | 87.51 |
| ABSA restaurant | 25.0 | 89.69 | 75.0 | 10.31 | 48.08 | 75.81 | 51.92 | 24.19 | 32.89 |
| All dataset | 80.56 | 51.74 | 19.44 | 48.26 | 68.3 | 67.34 | 31.7 | 32.66 | 73.92 |

6.3 Logistic Regression
Table 6 shows the performance, and Table 7 shows the sensitivity analysis for Logistic Regression.
Table 6. Logistic Regression performance

| Dataset | CM | ACC | ROC AUC |
|---|---|---|---|
| Apurba | [[338, 268], [203, 650]] | 67.72% | 72.51% |
| ABSA sports | [[23, 87], [20, 414]] | 80.33% | 70.50% |
| ABSA restaurant | [[237, 25], [66, 34]] | 74.86% | 75.39% |
| All dataset | [[566, 466], [276, 1056]] | 68.61% | 74.30% |

Table 7. Sensitivity analysis of Logistic Regression (values in %)

| Dataset | TPR | TNR | FNR | FPR | PPV | NPV | FDR | FOR | F1 |
|---|---|---|---|---|---|---|---|---|---|
| Apurba | 76.2 | 55.78 | 23.8 | 44.22 | 70.81 | 62.48 | 29.19 | 37.52 | 73.4 |
| ABSA sports | 95.39 | 20.91 | 4.61 | 79.09 | 82.63 | 53.49 | 17.37 | 46.51 | 88.56 |
| ABSA restaurant | 34.0 | 90.46 | 66.0 | 9.54 | 57.63 | 78.22 | 42.37 | 21.78 | 42.77 |
| All dataset | 79.28 | 54.84 | 20.72 | 45.16 | 69.38 | 67.22 | 30.62 | 32.78 | 74.0 |

6.4 Random Forest
Table 8 shows the performance, and Table 9 shows the sensitivity analysis for the Random Forest model.

Table 8. Random Forest performance

| Dataset | CM | ACC | ROC AUC | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| Apurba | [[340, 266], [309, 544]] | 60.59% | 65.56% | 65.42% | 67.16% | 63.77% |
| ABSA sports | [[47, 63], [41, 393]] | 80.88% | 73.30% | 88.31% | 86.18% | 90.55% |
| ABSA restaurant | [[240, 22], [75, 25]] | 73.20% | 70.00% | 34.01% | 53.19% | 25% |
| All dataset | [[629, 403], [387, 945]] | 66.58% | 71.36% | 70.52% | 70.10% | 70.94% |

Table 9. Sensitivity analysis of Random Forest (values in %)

| Dataset | TPR | TNR | FNR | FPR | PPV | NPV | FDR | FOR | F1 |
|---|---|---|---|---|---|---|---|---|---|
| Apurba | 64.71 | 59.08 | 35.29 | 40.92 | 69.0 | 54.32 | 31.0 | 45.68 | 66.79 |
| ABSA sports | 88.71 | 43.64 | 11.29 | 56.36 | 86.13 | 49.48 | 13.87 | 50.52 | 87.4 |
| ABSA restaurant | 28.0 | 91.98 | 72.0 | 8.02 | 57.14 | 77.0 | 42.86 | 23.0 | 37.58 |
| All dataset | 68.77 | 62.02 | 31.23 | 37.98 | 70.03 | 60.61 | 29.97 | 39.39 | 69.39 |

6.5 Decision Tree Classifier
Table 10 shows the performance, and Table 11 shows the sensitivity analysis of the Decision Tree Classifier.
Table 10. Decision Tree performance

| Dataset | CM | ACC | ROC AUC | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| Apurba | [[316, 290], [341, 512]] | 56.75% | 57.11% | 61.87% | 63.84% | 60.02% |
| ABSA sports | [[49, 61], [73, 361]] | 75.37% | 65.88% | 84.34% | 85.55% | 83.18% |
| ABSA restaurant | [[216, 46], [55, 45]] | 72.10% | 65.13% | 47.12% | 49.45% | 45% |
| All dataset | [[601, 431], [492, 840]] | 60.96% | 60.99% | 64.54% | 66.09% | 63.06% |

Table 11. Sensitivity analysis of Decision Tree (values in %)

| Dataset | TPR | TNR | FNR | FPR | PPV | NPV | FDR | FOR | F1 |
|---|---|---|---|---|---|---|---|---|---|
| Apurba | 58.85 | 55.61 | 41.15 | 44.39 | 65.11 | 48.98 | 34.89 | 51.02 | 61.82 |
| ABSA sports | 83.18 | 47.27 | 16.82 | 52.73 | 86.16 | 41.6 | 13.84 | 58.4 | 84.64 |
| ABSA restaurant | 41.0 | 82.06 | 59.0 | 17.94 | 46.59 | 78.47 | 53.41 | 21.53 | 43.62 |
| All dataset | 63.21 | 60.95 | 36.79 | 39.05 | 67.63 | 56.21 | 32.37 | 43.79 | 65.35 |

6.6 K-NN Classifier
Table 12 shows the performance, and Table 13 shows the sensitivity analysis of KNN.
Table 12. K-NN Classifier performance

| Dataset | CM | ACC | ROC AUC |
|---|---|---|---|
| Apurba | [[293, 313], [308, 545]] | 57.44% | 57.42% |
| ABSA sports | [[25, 85], [29, 405]] | 79.04% | 66.31% |
| ABSA restaurant | [[236, 26], [77, 23]] | 71.55% | 63.69% |
| All dataset | [[500, 532], [368, 964]] | 61.92% | 63.10% |

Table 13. Sensitivity analysis of KNN (values in %)

| Dataset | TPR | TNR | FNR | FPR | PPV | NPV | FDR | FOR | F1 |
|---|---|---|---|---|---|---|---|---|---|
| Apurba | 63.89 | 48.35 | 36.11 | 51.65 | 63.52 | 48.75 | 36.48 | 51.25 | 63.71 |
| ABSA sports | 93.32 | 22.73 | 6.68 | 77.27 | 82.65 | 46.3 | 17.35 | 53.7 | 87.66 |
| ABSA restaurant | 23.0 | 90.08 | 77.0 | 9.92 | 46.94 | 75.4 | 53.06 | 24.6 | 30.87 |
| All dataset | 72.37 | 48.45 | 27.63 | 51.55 | 64.44 | 57.6 | 35.56 | 42.4 | 68.18 |

6.7 SVM Classifier
Table 14 shows the performance, and Table 15 shows the sensitivity analysis of the SVM.
Table 14. SVM performance

| Dataset | CM | ACC | ROC AUC |
|---|---|---|---|
| Apurba | [[293, 313], [308, 545]] | 66.83% | 72.24% |
| ABSA sports | [[25, 85], [29, 405]] | 70.77% | 69.37% |
| ABSA restaurant | [[236, 26], [77, 23]] | 69.89% | 72.87% |
| All dataset | [[500, 532], [368, 964]] | 67.94% | 73.95% |

Table 15. Sensitivity analysis of SVM (values in %)

| Dataset | TPR | TNR | FNR | FPR | PPV | NPV | FDR | FOR | F1 |
|---|---|---|---|---|---|---|---|---|---|
| Apurba | 69.75 | 62.71 | 30.25 | 37.29 | 72.47 | 59.56 | 27.53 | 40.44 | 71.09 |
| ABSA sports | 75.81 | 50.91 | 24.19 | 49.09 | 85.9 | 34.78 | 14.1 | 65.22 | 80.54 |
| ABSA restaurant | 62.0 | 72.9 | 38.0 | 27.1 | 46.62 | 83.41 | 53.38 | 16.59 | 53.22 |
| All dataset | 70.35 | 64.83 | 29.65 | 35.17 | 72.08 | 62.88 | 27.92 | 37.12 | 71.2 |

6.8 Ada-Boost Classifier
We obtained the best accuracy for Ada-Boost when the number of estimators was set to 50. Table 16 shows the performance, and Table 17 shows the sensitivity analysis of the Ada-Boost Classifier.
Table 16. ADA Boost performance

| Dataset | CM | ACC | ROC AUC |
|---|---|---|---|
| Apurba | [[293, 313], [308, 545]] | 64.22% | 65.92% |
| ABSA sports | [[25, 85], [29, 405]] | 79.42% | 66.74% |
| ABSA restaurant | [[236, 26], [77, 23]] | 73.20% | 69.38% |
| All dataset | [[500, 532], [368, 964]] | 65.44% | 70.44% |

Table 17. Sensitivity analysis of ADA Boost (values in %)

| Dataset | TPR | TNR | FNR | FPR | PPV | NPV | FDR | FOR | F1 |
|---|---|---|---|---|---|---|---|---|---|
| Apurba | 82.77 | 38.12 | 17.23 | 61.88 | 65.31 | 61.11 | 34.69 | 38.89 | 73.01 |
| ABSA sports | 96.77 | 11.82 | 3.23 | 88.18 | 81.24 | 48.15 | 18.76 | 51.85 | 88.33 |
| ABSA restaurant | 18.0 | 93.89 | 82.0 | 6.11 | 52.94 | 75.0 | 47.06 | 25.0 | 26.87 |
| All dataset | 82.88 | 42.93 | 17.12 | 57.07 | 65.21 | 66.02 | 34.79 | 33.98 | 72.99 |

6.9 XGBoost
Table 18 shows the performance, and Table 19 shows the sensitivity analysis of XGBoost.

Table 18. XGBoost performance

| Dataset | CM | ACC | ROC AUC |
|---|---|---|---|
| Apurba | [[291, 315], [140, 713]] | 68.81% | 65.80% |
| ABSA sports | [[15, 95], [16, 418]] | 79.60% | 54.97% |
| ABSA restaurant | [[244, 18], [67, 33]] | 76.52% | 63.06% |
| All dataset | [[490, 542], [185, 1147]] | 69.25% | 66.80% |

Table 19. Sensitivity analysis of XGBoost (values in %)

| Dataset | TPR | TNR | FNR | FPR | PPV | NPV | FDR | FOR | F1 |
|---|---|---|---|---|---|---|---|---|---|
| Apurba | 83.59 | 48.02 | 16.41 | 51.98 | 69.36 | 67.52 | 30.64 | 32.48 | 75.81 |
| ABSA sports | 96.31 | 13.64 | 3.69 | 86.36 | 81.48 | 48.39 | 18.52 | 51.61 | 88.28 |
| ABSA restaurant | 33.0 | 93.13 | 67.0 | 6.87 | 64.71 | 78.46 | 35.29 | 21.54 | 43.71 |
| All dataset | 86.11 | 47.48 | 13.89 | 52.52 | 67.91 | 72.59 | 32.09 | 27.41 | 75.94 |

6.10 LSTM
In word2vec [31], vector representations help capture closer relationships among words. Deep learning models such as LSTMs can remember important information across long stretches of sequences [32]. Semantic understanding, or ‘meaning’ based on context, is important for getting the actual sentiment of a sentence [4]. Hence, an LSTM model with word2vec was implemented to obtain results on the newly published corpora. Here are the implementation details:
• Word embedding using word2vec
• Window size: 2
• Minimum word count frequency: 4 (words occurring fewer than 4 times are ignored)
• Dimensionality of the word vectors: 100
• Embedding layer dropout: 50
• LSTM layer dropout: 20
• Recurrent dropout: 20
• Dimensionality of the output space: 100
• Activation function: Sigmoid
• Optimizer: Adam
• Loss function: Binary cross-entropy
• Number of epochs: 10
• Batch size: 100
Table 20 shows the performance, and Table 21 shows the sensitivity analysis on the datasets. On the two ABSA datasets, the model does not work well because there is not enough data in both classes, so it is biased toward a single class on each of them. Figure 4 shows the proposed LSTM model.
Table 20. LSTM performance

| Dataset | CM | ACC | ROC AUC |
|---|---|---|---|
| Apurba | [[361, 245], [175, 678]] | 69.52% | 69.53% |
| ABSA sports | [[0, 110], [0, 434]] | 79.77% | 50% |
| ABSA restaurant | [[262, 0], [100, 0]] | 72.38% | 50% |
| All dataset | [[579, 453], [181, 1151]] | 73.18% | 71.26% |

Fig. 4. Proposed LSTM architecture

Table 21. Sensitivity analysis of LSTM (values in %)

| Dataset | TPR | TNR | FNR | FPR | PPV | NPV | FDR | FOR | F1 |
|---|---|---|---|---|---|---|---|---|---|
| Apurba | 79.25 | 60.56 | 20.75 | 39.44 | 73.88 | 67.46 | 26.12 | 32.54 | 76.47 |
| ABSA sports | 100 | 0 | 0 | 100 | 79.78 | – | 20.22 | – | – |
| ABSA restaurant | 0 | 100 | 100 | 0 | – | 72.38 | – | 27.62 | – |
| All dataset | 82.81 | 62.5 | 17.19 | 37.5 | 74.03 | 73.80 | 25.97 | 26.20 | 78.17 |
7 Discussion
In this section, we will benchmark the ten algorithms. Table 22 shows the comparison of all the algorithms on all the datasets. The algorithms are sorted based on their performance on the merged dataset. According to this evaluation, LSTM performs the best, followed by XGBoost and Multinomial Naive Bayes and so forth.

Table 22. Benchmark comparison - 1

| Algorithm | Acc Apurba | Acc Sports | Acc Restaurant | Acc All Data |
|---|---|---|---|---|
| LSTM | 69.52% | 79.77% | 72.38% | 73.18% |
| XGBoost | 68.81% | 79.60% | 76.52% | 69.25% |
| Multinomial Naive Bayes | 68.54% | 76.65% | 75.42% | 68.82% |
| Logistic Regression | 67.72% | 80.33% | 74.86% | 68.61% |
| Bernoulli Naive Bayes | 69.16% | 80.33% | 71.82% | 67.98% |
| SVM | 66.83% | 70.77% | 69.89% | 67.94% |
| Random Forest | 60.59% | 80.88% | 73.20% | 66.58% |
| ADA Boost | 64.22% | 79.42% | 73.20% | 65.44% |
| K-NN Classifier | 57.44% | 79.04% | 71.55% | 61.92% |
| Decision Tree Classifier | 56.75% | 75.37% | 72.10% | 60.96% |
Note that although LSTM performs best on the combined dataset, it was beaten by Random Forest on the Sports and by XGBoost on the Restaurant datasets, respectively, as noted by the highlighted cells in Table 22. Another point to note is that Bernoulli Naive Bayes is twice in the second-best position: on the Apurba and the Sports datasets, as indicated by the gray cells in Table 22. To rank these algorithms based on how consistent they are, we start by assigning 1, 2, … 10 positions for each dataset, and then adding up their ranks on each dataset. The algorithm with the smallest sum can be ranked as most consistent, assuming the degree of difficulty of each dataset is the same, which, admittedly, we cannot know for sure. But it still gives us a ‘sense’ of how they perform over a range of different problem domains. Table 23 shows this revised ranking. This indicates that LSTM and XGBoost are tied in the first place, followed by another tie between Multinomial Naive Bayes and Logistic Regression. Decision Tree Classifier is again at the bottom of this table.

Table 23. Benchmark comparison - 2

| Algorithm | Accuracy Apurba | Accuracy sports | Accuracy restaurant | Accuracy all data | Sum of rankings | Overall ranking |
|---|---|---|---|---|---|---|
| LSTM | 1 | 3 | 5 | 1 | 10 | 1st |
| XGBoost | 3 | 4 | 1 | 2 | 10 | 1st |
| Multinomial Naive Bayes | 4 | 7 | 2 | 3 | 16 | 2nd |
| Logistic Regression | 5 | 2 | 3 | 4 | 14 | 2nd |
| Bernoulli Naive Bayes | 2 | 2 | 7 | 5 | 16 | 3rd |
| SVM | 6 | 9 | 9 | 6 | 30 | 6th |
| Random Forest | 8 | 1 | 4 | 7 | 20 | 4th |
| ADA Boost | 7 | 5 | 4 | 8 | 24 | 5th |
| K-NN Classifier | 9 | 6 | 8 | 9 | 32 | 7th |
| Decision Tree Classifier | 10 | 8 | 6 | 10 | 34 | 8th |
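The rank-sum procedure described above can be sketched as follows. The accuracy values are the ones reported in the first benchmark comparison; the function names and the tie rule (tied accuracies share a rank and the next rank is not skipped) are assumptions, the latter inferred from the per-dataset ranks the paper reports for tied algorithms.

```python
# Accuracy (%) per algorithm: (Apurba, Sports, Restaurant, All data).
ACCURACY = {
    "LSTM": (69.52, 79.77, 72.38, 73.18),
    "XGBoost": (68.81, 79.60, 76.52, 69.25),
    "Multinomial Naive Bayes": (68.54, 76.65, 75.42, 68.82),
    "Logistic Regression": (67.72, 80.33, 74.86, 68.61),
    "Bernoulli Naive Bayes": (69.16, 80.33, 71.82, 67.98),
    "SVM": (66.83, 70.77, 69.89, 67.94),
    "Random Forest": (60.59, 80.88, 73.20, 66.58),
    "ADA Boost": (64.22, 79.42, 73.20, 65.44),
    "K-NN Classifier": (57.44, 79.04, 71.55, 61.92),
    "Decision Tree Classifier": (56.75, 75.37, 72.10, 60.96),
}


def dense_ranks(scores):
    """Rank 1 = best score; tied scores share a rank and no rank is skipped."""
    ordered = sorted(set(scores.values()), reverse=True)
    position = {value: index + 1 for index, value in enumerate(ordered)}
    return {name: position[value] for name, value in scores.items()}


def rank_sums(table):
    """Add up each algorithm's per-dataset rank; a smaller sum is more consistent."""
    n_datasets = len(next(iter(table.values())))
    sums = dict.fromkeys(table, 0)
    for column in range(n_datasets):
        ranks = dense_ranks({name: accs[column] for name, accs in table.items()})
        for name, rank in ranks.items():
            sums[name] += rank
    return sums


for name, total in sorted(rank_sums(ACCURACY).items(), key=lambda item: item[1]):
    print(f"{total:3d}  {name}")
```

Under these assumptions the computed sums match the reported sum-of-rankings values, for example 10 for both LSTM and XGBoost and 34 for the Decision Tree Classifier.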
Since LSTM leads the ranking in both tables, we should take a closer look at this algorithm. LSTM is a deep learning algorithm and therefore learns from data in a different way. The other nine models are classification algorithms using various types of features. As described earlier, LSTM learns the context or semantic meaning from word2vec, whereas the rest of the models work on the frequency of a given word in the encoded vector representation. As the dataset contains only about 12,000 records, this is not enough to get consistent and accurate output, especially for LSTM, which learns the context or semantic lexicon and needs more data to perform better. We tested the LSTM model with parameter tuning, input shuffling, and changes to the input size. We found that it sometimes produces very different outputs for small changes in the parameter values.
8 Conclusion and Future Work
This paper presents a detailed benchmarking of ten sentiment-analysis algorithms on three publicly available Bangla datasets. One of the core issues that we face in Bangla natural language processing research is the unavailability of standard datasets. In other languages, such as English or Chinese, this is not a concern. The absence of a standard, publicly available dataset means that every researcher has to first collect and label the data before any training can take place. And since each new algorithm is evaluated on a different dataset, it is also virtually impossible to compare the different approaches in terms of their accuracy and quality. We hope that this paper will alleviate those problems to some degree. Since we have fine-tuned the algorithms for these particular datasets, researchers in the future can improve on these algorithms by comparing their performance against these benchmarked datasets, which will aid the overall development of NLP tools for Bangla. One of the essential factors in sentiment analysis that has not been addressed in this paper is multi-aspect sentence evaluation. In a sentence, there might be multiple clauses, and different clauses may have different sentiments. For example, examine the following quote: “Sakib’s batting was good, but he did not bowl well.” Here, we need to assess the sentiment based on the aspects of batting and bowling. The same goes for customer reviews: a product may be bad or good from different perspectives. So, a future task would be to extend these benchmarking models to aspect-based sentiment analysis. For sentiment analysis, there are some smarter and more complicated models, such as CNN-LSTM, where the dimensional approach can provide more fine-grained sentiment analysis [14]. We decided not to include those models since we wanted to start the benchmarking with the fundamental, commonly used algorithms, especially within the nascent Bangla NLP domain.
In the next iteration of this research, we plan to include some of these more advanced models. Finally, the size of the datasets used in this benchmarking is still minimal. We hope that other researchers will come forward and fill this gap by publicly offering larger labeled datasets for Bangla sentiment analysis.
References 1. Rahman, F., Khan, H., Hossain, Z., Begum, M., Mahanaz, S., Islam, A., Islam, A.: An annotated Bangla sentiment analysis corpus. In: 2019 International Conference on Bangla Speech and Language Processing (ICBSLP) (2020) 2. Rahman, M., Kumar Dey, E.: Datasets for aspect-based sentiment analysis in Bangla and its baseline evaluation. Data 3(2), 15 (2018) 3. Medhat, W., Hassan, A., Korashy, H.: Sentiment analysis algorithms and applications: A survey (2014) 4. LeCun, Y., Bengio, Y.: Convolutional networks for images, speech, and time series. In: The Handbook of Brain Theory and Neural Networks, vol. 3361, no. 10 (1995) 5. Le, M., Postma, M., Urbani, J., Vossen, P.: A deep dive into word sense disambiguation with LSTM. In: Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 354–356. Association for Computational Linguistics, August 2018 6. Sentiment analysis using deep learning techniques: A review. Int. J. Adv. Comput. Sci. Appl
Establishing a Formal Benchmarking Process
7. Al-Amin, M., Islam, M.S., Uzzal, S.D.: Sentiment analysis of Bengali comments with word2vec and sentiment information of words. In: 2017 International Conference on Electrical, Computer and Communication Engineering (ECCE), pp. 186–190. IEEE, February 2017 8. Chowdhury, S., Chowdhury, W.: Performing sentiment analysis in Bangla microblog posts. In: 2014 International Conference on Informatics, Electronics & Vision (ICIEV), pp. 1–6. IEEE, May 2014 9. Hossain, M.S., Jui, I.J., Suzana, A.Z.: Sentiment analysis for Bengali newspaper headlines. Doctoral dissertation, BRAC University (2017) 10. Hassan, A., Amin, M.R., Mohammed, N., Azad, A.K.A.: Sentiment analysis on Bangla and Romanized Bangla text (BRBT) using deep recurrent models. arXiv:1610.00369 (2016) 11. Sumit, S.H., Hossan, M.Z., Al Muntasir, T., Sourov, T.: Exploring word embedding for bangla sentiment analysis. In: 2018 International Conference on Bangla Speech and Language Processing (ICBSLP), pp. 1–5. IEEE, September 2018 12. Asimuzzaman, M., Nath, P.D., Hossain, F., Hossain, A., Rahman, R.M.: Sentiment analysis of Bangla microblogs using adaptive neuro fuzzy system. In: 2017 13th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery, pp. 1631– 1638 (2017) 13. Tuhin, R.A., Paul, B.K., Nawrine, F., Akter, M., Das, A.K.: An automated system of sentiment analysis from Bangla text using supervised learning techniques. In: 2019 IEEE 4th International Conference on Computer and Communication Systems (ICCCS), pp. 360–364. IEEE (2019) 14. Uddin, A.H., Dam, S.K., Arif, A.S.M.: Extracting severe negative sentence pattern from bangla data via long short-term memory neural network. In: 2019 4th International Conference on Electrical Information and Communication Technology (EICT), pp. 1–6. IEEE, December 2019 15. Tabassum, N., Khan, M.I.: Design an empirical framework for sentiment analysis from Bangla text using machine learning. 
In: 2019 International Conference on Electrical, Computer and Communication Engineering (ECCE), pp. 1–5. IEEE, February 2019 16. Alam, M.H., Rahoman, M.M., Azad, M.A.K.: Sentiment analysis for Bangla sentences using convolutional neural network. In: 2017 20th International Conference of Computer and Information Technology (ICCIT), pp. 1–6. IEEE, December 2017 17. Paul, A.K., Shill, P.C.: Sentiment mining from Bangla data using mutual information. In: 2016 2nd International Conference on Electrical, Computer & Telecommunication Engineering (ICECTE), pp. 1–4. IEEE, December 2016 18. Tripto, N.I., Ali, M.E.: Detecting multilabel sentiment and emotions from Bangla YouTube comments. In: 2018 International Conference on Bangla Speech and Language Processing (ICBSLP), pp. 1–6. IEEE, September 2018 19. Taher, S.A., Akhter, K.A., Hasan, K.A.: N-gram based sentiment mining for Bangla text using support vector machine. In: 2018 International Conference on Bangla Speech and Language Processing (ICBSLP), pp. 1–5. IEEE, September 2018 20. Rabeya, T., Chakraborty, N.R., Ferdous, S., Dash, M., Al Marouf, A.: Sentiment analysis of Bangla song review-a lexicon based backtracking approach. In: 2019 IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT), pp. 1–7. IEEE, February 2019 21. Haydar, M.S., Al Helal, M., Hossain, S.A.: Sentiment extraction from Bangla text: a character level supervised recurrent neural network approach. In: 2018 International Conference on Computer, Communication, Chemical, Material and Electronic Engineering (IC4ME2), pp. 1–4. IEEE, February 2018
22. Akter, S., Aziz, M.T.: Sentiment analysis on Facebook group using lexicon-based approach. In: 2016 3rd International Conference on Electrical Engineering and Information Communication Technology (ICEEICT), pp. 1–4. IEEE, September 2016 23. Sharif, O., Hoque, M.M., Hossain, E.: Sentiment analysis of Bengali texts on online restaurant reviews using multinomial Naïve Bayes. In: 2019 1st International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT), pp. 1–6. IEEE, May 2019 24. Fawcett, T.: An introduction to ROC analysis. Pattern Recogn. Lett. 27(8), 861–874 (2006). https://doi.org/10.1016/j.patrec.2005.10.010 25. Powers, D.M.W.: Evaluation: from precision, recall and f-measure to ROC, informedness, markedness & correlation. J. Mach. Learn. Technol. 2(1), 37–63 (2011) 26. Ting, K.M.: Encyclopedia of Machine Learning. Springer (2011). ISBN 978-0-387-30164-8 27. Brooks, H., Brown, B., Ebert, B., Ferro, C., Jolliffe, I., Koh, T.-Y., Roebber, P., Stephenson, D.: WWRP/WGNE Joint Working Group on Forecast Verification Research. Collaboration for Australian Weather and Climate Research. World Meteorological Organisation (2015). Accessed 17 July 2019 28. Chicco, D., Jurman, G.: The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 21(6) (2020). https://doi.org/10.1186/s12864-019-6413-7. PMC 6941312. PMID 31898477 29. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, vol. abs/1810.04805 (2018) 30. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. In: Proceedings of NAACL (2018) 31. Mikolov, T., Chen, K., Corrado, G.S., Dean, J.: Efficient estimation of word representations in vector space. CoRR, vol. abs/1301.3781 (2013) 32. Hochreiter, S., Schmidhuber, J.: Long short-term memory. 
Neural Comput. 9(8), 1735–1780 (1997)
Detection of Malicious HTTP Requests Using Header and URL Features
Ashley Laughter, Safwan Omari, Piotr Szczurek, and Jason Perry
Lewis University, Romeoville, IL 60446, USA
{ashleyrlaughter,omarisa,szczurpi,perryjn}@lewisu.edu
Abstract. Cyber attackers leverage the openness of internet traffic to send specially crafted HyperText Transfer Protocol (HTTP) requests and launch sophisticated attacks for a myriad of purposes including disruption of service, illegal financial gain, and alteration or destruction of confidential medical or personal data. Detection of malicious HTTP requests is therefore essential to counter and prevent web attacks. In this work, we collected web traffic data and used HTTP request header features with supervised machine learning techniques to predict whether a message is likely to be malicious or benign. Our analysis was based on two real world datasets: one collected over a period of 42 days from a low interaction honeypot deployed on a Comcast business class network, and the other collected from a university web server for a similar duration. In our analysis, we observed that: (1) benign and malicious requests differ with respect to their header usage, (2) three specific HTTP headers (i.e., accept-encoding, accept-language, and content-type) can be used to efficiently classify a request as benign or malicious with 93.6% accuracy, (3) HTTP request line lengths of benign and malicious requests differ, (4) HTTP request line length can be used to efficiently classify a request as benign or malicious with 96.9% accuracy. This implies we can use a relatively simple predictive model with a fast classification time to efficiently and accurately filter out malicious web traffic.
Keywords: HTTP request, URL, Header, Web security, Malicious traffic, Classification, Cyber threat, Machine learning, IoT, Web page, Honeypot
© Springer Nature Switzerland AG 2021 K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 449–468, 2021. https://doi.org/10.1007/978-3-030-63089-8_29
1 Introduction
In the Ninth Annual Cost of Cybercrime Study 2019, Accenture, together with the Ponemon Institute, reports the annualized cost of cybercrime to businesses [5]. Specifically, the data show that the cost of web-based attacks increased by 13% from 2017 to 2018, to an average of $2.3 million, making them second only to malware attacks ($2.6 million) as the costliest attacks during this period. The average cost of cybercrime in 2018 increased by 12% from 2017. In general, newly developing business models introduce technology vulnerabilities faster than they can be secured. Ensuring the security of web applications is difficult, mainly due to the variability and complexity of web systems, non-standardized, custom usage of various scripting languages, and conservative local security patching policies [1, 2]. Other relevant factors contributing to the current security state are a very large user base (i.e., World Wide Web),
the need for multiple website inter-compatibility and functionality, and the premium placed on ‘user experience’ as compared to security concerns [2]. These aspects, when combined, can generate numerous vulnerabilities (i.e., loopholes, bugs, etc.) in web applications that can be exploited by malicious users [3]. As of March 2020, the Open Web Application Security Project (OWASP) lists an injection attack as the top web application security vulnerability [4]. This type of attack allows a malicious user to inject code into a program or query for the purpose of executing remote commands that can read or modify a database. Injection attacks are often associated with web applications that integrate with databases. As an example, SQL injection attacks use SQL query syntax to inject unauthorized commands that can read or modify the contents of a database. Many large-scale databases can be costly for a business to design and implement, and involve sensitive or proprietary confidential data; therefore, injection attacks carry a high risk and must be prioritized accordingly. For this study, we specifically target low profile non-commercial websites, such as blogs, university department websites, small home business networks, and IoT device webpages, as these consist of mostly static pages and will therefore exhibit different HTTP request patterns as compared with commercial websites. Low-profile webpages are also less secure in general due to more limited resources and therefore present attractive targets for malicious users and hackers. Such websites are more easily compromised and recruited as part of large-scale botnets posted for sale (or rent) on the dark web. A botnet is a large network of individual computers infected with malicious software and controlled as a group, often by a single attacker (botmaster), without owner knowledge of these events. 
By compromising many smaller, low profile targets, the attacker can collectively launch large-scale attacks against larger commercial targets. As listed in 2017, the top five high profile botnets are: (1) ‘Star Wars’ Twitter botnet; comprised 350,000 bot accounts, (2) Hajime Malware botnet; with 300,000 compromised devices, (3) WireX Android botnet; estimated tens of thousands of devices, (4) Reaper IoT botnet; potentially millions of compromised networks, and (5) Satori IoT botnet; with 280,000 compromised IP addresses [6]. The presence of IoT device botnets within the top five botnet list for 2017 underscores the need for improved security measures applied to IoT devices and their corresponding websites. Current statistics on botnets show that 39.3% of botnet attacks observed in 2018 were new attacks as compared with those encountered in 2017 [7], with web injection being one of the most common attacks. Even though the global economic impact of botnets on business organizations is difficult to assess due to the paucity of financial data available, the estimated annual cost of botnet attacks is $270 K with an individual attack cost of approximately $1 k per business entity per year [8]. Based on these alarming statistics, we recognize the severity of the problem, especially as directed towards low profile, mostly static, non-commercial websites where traditional intrusion detection and prevention systems (IDPS) techniques may not work well. We argue that the inherent characteristics of HTTP requests directed to such websites are different than for their more complex counterparts. In our work, we therefore propose a method to detect malicious HTTP requests and raise an alarm quickly, prior to processing of the request by the HTTP server or web application. If a back-end script initiates processing of the HTTP payload, an exploit is probably already underway, and detection and recovery is too late by then. We therefore utilize
features present within the first few lines of the HTTP packet (i.e., request line and headers) that do not require significant processing of the HTTP request beyond detecting the existence of certain HTTP header fields and counting of characters. An example of this data is shown in Fig. 1.
Fig. 1. Parsed URL and raw request data
Note that the ‘request_URL’ column contains only the URL and the query string, while the ‘request_raw’ column contains the method, URL, query string, protocol version, associated header, and request body for each HTTP request. In our approach, we computed lengths of the URL and raw request strings and used them as features to train our models. We also used individual header values parsed from the raw request data and converted these to a binary representation. For our solution, we implemented a model positioned outside of the HTTP server that connects to the network and performs string processing of the HTTP request start line and headers. Note that our approach is similar to conventional network intrusion detection systems (IDS) in that we seek to implement a binary classification scheme on incoming HTTP traffic such that malicious HTTP requests are identified quickly prior to payload processing and the packet dropped or safely re-routed for further analysis. For encrypted HTTPS traffic, commercial solutions are available that provide a decryption solution capable of handling any session or protocol [29]. Intercepted and decrypted HTTPS content is simply routed to our solution for further processing. The work presented in this paper is based on two real-world datasets. The first dataset was collected via deployment of a low interaction, web-based honeypot (i.e., Glastopf) on a Comcast business class network for a period of 42 days from 11/23/15 to 1/4/16 and represents our ground truth positive class, i.e., malicious HTTP request data [9]. Note that Glastopf emulates a vulnerable web server by pretending to host multiple web pages and applications with thousands of vulnerabilities (i.e., SQL injection, Cross site scripting, etc.) and allows attackers to upload malicious payloads [9, 10]. 
The second dataset was collected from a university dedicated web server over a period of 42 days from 11/8/19 to 12/20/19 and therefore represents our ground truth negative class, i.e., benign HTTP request data. We recognize that our benign set may contain malicious traffic; however, the ratio of malicious traffic in real-world applications is very low. We compare our benign set to an enriched set of malicious traffic to further reduce the effect of noise.
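The header-presence features described above can be illustrated with a minimal sketch. It assumes a raw request stored as a single string; the function names, the four-header vocabulary, and the sample request are hypothetical and stand in for the 84-header feature set, which is not listed here.

```python
from typing import Dict, List, Tuple

# Hypothetical header vocabulary used only for illustration.
HEADER_VOCAB: List[str] = [
    "accept-encoding",
    "accept-language",
    "content-type",
    "user-agent",
]


def parse_request(raw: str) -> Tuple[str, Dict[str, str]]:
    """Split a raw HTTP request into its request line and a header dict."""
    # Headers end at the first blank line; anything after it is the body.
    head, _, _body = raw.replace("\r\n", "\n").partition("\n\n")
    lines = head.split("\n")
    request_line = lines[0]
    headers: Dict[str, str] = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:  # ignore malformed lines that have no colon
            headers[name.strip().lower()] = value.strip()
    return request_line, headers


def header_presence(headers: Dict[str, str],
                    vocab: List[str] = HEADER_VOCAB) -> List[int]:
    """Encode each vocabulary header as 1 if present in the request, else 0."""
    return [1 if name in headers else 0 for name in vocab]


# Hypothetical sample request (not taken from either dataset).
raw_request = (
    "GET /index.php?id=1 HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "User-Agent: curl/7.68.0\r\n"
    "Accept-Encoding: gzip\r\n"
    "\r\n"
)
request_line, headers = parse_request(raw_request)
features = header_presence(headers)
```

Only the header names are inspected, so the header values (including the user-agent string) never need to be interpreted, which keeps per-request processing cheap.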
For this study, we sought to apply supervised machine learning techniques to our datasets to address several research questions. First, do benign and malicious requests differ with respect to header usage, and can headers be used as features to classify a request as benign or malicious with high accuracy? Second, which headers demonstrate the best predictive power? Third, do the URL and raw request lengths differ between malicious and benign requests, and can these features classify a request as benign or malicious with high accuracy? Based on the outcome of these questions, we propose to design and implement a separate warning system that connects to a network, processes HTTP requests as they arrive, and raises an alarm if a suspicious request is received. The system could modify the HTTP server to drop a suspicious packet without passing the payload to the backend script (i.e., application). We envision this as a lightweight ‘early-warning’ system appropriate for low profile non-commercial websites, such as blogs, university department websites, small home business networks, and IoT device webpages. These websites consist of mostly static pages, experience low to moderate levels of network traffic as compared with commercial websites and should therefore exhibit HTTP request patterns similar to our benign data set. Low-profile webpages are also less secure in general due to more limited resources and therefore present attractive targets for malicious users and hackers. We propose that smaller networks such as these could benefit from our system. The rest of this paper is organized as follows. Section 2 discusses related work. Section 3 outlines and discusses the data sets and data processing. Section 4 presents the modeling systems used and discusses the experimental results. Section 5 provides concluding remarks and details for future work.
2 Related Work
Niu et al. [14] used a template generation algorithm to create feature templates based on HTTP headers and applied the XGBoost algorithm to differentiate between benign and malicious traffic. The authors reported a detection accuracy of 98.72% with a false positive rate of less than 1%. This work is similar to our work in that HTTP request header data is used as a feature to classify network traffic and we also evaluate XGBoost as a classifier for our data. This work differs from ours in that we extract feature vectors from the raw HTTP request data while Niu et al. execute well-characterized malware samples to generate their malicious data set. Zhang et al. [15] used a convolutional neural network (CNN) to detect web attacks by analyzing the HTTP request packets. After ten (10) epochs of training, the CNN was run on test data; a detection rate (recall) of 93.35% along with an accuracy of 96.49% is reported. This work differs from ours in that much less raw data processing is required during our feature extraction. We computed the length of the URL and raw request and use these numerical values as feature vectors for our classifiers. Zhang et al. conduct more extensive processing by converting the URL into words and special characters and then apply ‘word2vec’.
Yu et al. [17] modeled HTTP traffic as a natural language sequence. The authors developed and used a deep neural network approach for HTTP traffic anomaly detection (‘DeepHTTP’) and reported high precision (>95%), recall (>96%) and F1 (>95%) scores using different sample ratios for their model. In contrast, we used HTTP header data combined with the URL string length as features for our classifier but did not attempt to represent HTTP traffic as a natural language sequence. We achieved comparable precision, recall and F1 measures using only the URL length and three to four headers. Zhang et al. [16] demonstrated a grammar-guided (regex-based) user-agent string method to classify HTTP flows. The authors showed that their method can identify non-standard user-agent strings in HTTP flows with high precision and recall metrics. This work is dissimilar to ours in that we did not utilize the user-agent field string data as a feature. In contrast, we used the presence or absence of header fields, including the user-agent field, and converted these fields to a simpler, more efficient binary representation. Goseva-Popstojanova et al. [18] characterized and classified malicious web traffic captured using high-interaction honeypots. The authors used Support Vector Machines, J48 decision trees, and partial decision trees to perform the classification; the decision tree algorithms outperformed the SVM classifier with high accuracy, recall and precision. The authors demonstrated that a set of four to six features could be used with comparable metrics (>97% recall). This work is similar to our work in that we attempt to classify web traffic captured from a low-interaction honeypot, and this data represents our positive class. In contrast, we use a larger set of classifiers, and an initial data set containing 84 different features (headers); however, we also perform feature selection to reduce this set to the most predictive features. Li et al. 
[19] propose a system (MalHunter) to detect security threats using statistical features of HTTP requests generated by malware. The authors report 98.32% precision and 98.70% recall using the XGBoost ensemble classifier. This work is similar to ours in that URL and header features are extracted from the HTTP request traffic and used to train classifiers. The implementation differs in that we do not use the header sequence as a feature in our work. We achieve similar precision, recall and F1 scores with our implementation but with less feature extraction and processing. Zarras et al. [22] propose a framework (BOTHOUND) for network-level detection of malicious traffic that uses the HTTP protocol as the main communication channel. The authors report a classification accuracy of 99.97% and a false positive rate of 0.04% for their implementation. This work is similar to ours in that we also extract header data to build feature vectors to use for traffic classification; however, our approach is simpler in that we use the presence or absence of a header to construct a sparse matrix of feature vectors that are applied to our downstream classifiers. Yong et al. [20] propose a Hidden Markov Model (HMM) based detection system (OwlEye) to evaluate the malicious “score” of a web request. The authors report a recall of 99.9% for SQL injection attacks which shows that their system can detect abnormal HTTP requests associated with injection attacks. This work is similar to ours in that the authors use data extracted from the HTTP requests to construct feature vectors applied to a downstream classifier. Like Yong et al., we seek to apply our system to low profile IoT devices and websites. Our work differs in that we classify malicious traffic generally, and do not attempt to target a specific type.
Ogawa et al. [21] extract feature vectors from the HTTP request interval, request/response body size and header bag of words. Features are clustered using the k-means++ algorithm and an appearance ratio is calculated. The authors achieve a 96% average recall with an optimal cluster size of 500. This work is dissimilar to ours in that we use feature vectors consisting of headers and URL length and apply these vectors to our classifiers without clustering. Our work also differs in that we use the presence or absence of a header as a feature instead of a BOW model. Kheir [23] provides an analysis of malware generated anomalies within the user-agent HTTP header field. Kheir observes that the implementation increases the detection rate for malicious traffic that specifically alters the user-agent field. This work differs from ours in that we convert all header fields present within the HTTP request (including the user-agent field) to a binary representation and therefore discard the information present within the field. Our approach does not require an additional detection method as we attempt to detect all malicious requests in general with one tool. McGahagan et al. [24] provide an extensive evaluation of HTTP header features and their relation to malicious website detection. The authors report content-length, content-encoding gzip, transfer-encoding chunked, content-type text/html, and vary-accept as the top five most important header features. This work differs from ours in that the authors use HTTP header features to detect malicious websites; we use HTTP headers as feature vectors to detect malicious HTTP requests. Rovetta et al. [25] use supervised and unsupervised learning methods to differentiate between web-bot and human sessions in a Web commerce store. A multilayer perceptron and Support Vector Machine (SVM) were used to classify the test sessions. K-means and GPCM (Graded Possibilistic c-Means) were used to perform unsupervised classification of session data. 
The authors report SVM and k-means (with k = 761) achieved the highest accuracies, i.e., 99.16% and 99.18% respectively, as compared with the other classifiers. This work differs from ours in that the authors classify web sessions (browsing activities on an e-commerce website) as human or bot using information extracted from access logs, whereas we classify individual HTTP requests as malicious or benign using header and URL lengths as features. Seyyar et al. [26] propose a rule-based method to detect web application vulnerability scans. Using the rule-based model applied to the collective dataset, the authors report a 99.38% accuracy, 100% precision, 75% recall and 85.71% F1 score. In contrast, we use multiple classifiers including more complex ensemble methods, and a neural network to classify web traffic, though we do not focus on vulnerability scans specifically. Husak et al. [27] analyze HTTP traffic using network flow monitoring and parsing of HTTP requests to extract multiple fields. The authors detect 16 previously undetected brute-force password attacks and 19 HTTP vulnerability scans per day as directed at a low-profile university campus network. Like Husak et al., we also parse HTTP requests for usable data; however, our end product is different, and our approach is simpler since we seek to classify all malicious traffic, regardless of type. Zolotukhin et al. [28] analyze HTTP logs to detect network intrusions. Relevant features are extracted using n-gram models and the data applied to clustering and anomaly detection algorithms. The authors report anomalous web resources within HTTP requests are detected with 99.2% accuracy using SVDD. K-means applied to
feature vectors achieves 100% accuracy for injection attacks, while DBSCAN achieves 97.5% accuracy for user-agent field anomalies. This work differs from ours in that the authors extract features using n-gram models and then apply this data to clustering and anomaly detection algorithms to classify different types of network attacks. Our approach is more general since we extract basic features from HTTP requests and use these to classify each request (i.e., benign or malicious). We also obtain similar classification metrics using our approach with less data pre-processing. A summary comparison of several previous works is shown in Table 1.
Table 1. Model performance comparison - classification accuracy
Study | Model(s) Tested | Features | Accuracy
Niu et al. [14] | XGBoost | HTTP headers | 98.7%
Zhang et al. [15] | CNN | HTTP request URL | 96.4%
Goseva-Popstojanova et al. [18] | SVM, J48 DT, PART DT | HTTP requests; 43 session attributes | 99.5%
Zarras et al. [22] | BOTHOUND | HTTP requests header chains | 99.9%
Yong et al. [20] | HMM | HTTP requests key value pairs | 99.9%
McGahagan et al. [24] | KNN, LR, RF, AB, GB, ET, BC, NN | HTTP headers | 91.0%
Rovetta et al. [25] | MLP, SVM, K-Means, GPCM | HTTP requests, various fields | 99.1%
Seyyar et al. [26] | Rule-based methods | Vulnerability scans, HTTP URI | 99.3%
Zolotukhin et al. [28] | SVDD, DBSCAN, SOM, K-means | HTTP requests, n-gram models | 100.0%
Current study | SGD, RF, KNN, MLP | HTTP headers (n = 84) | 97.1%
Current study | DT, SVM, XGBoost, RF, KNN, MLP | HTTP headers (n = 3) | 93.6%
Current study | RF | HTTP URL/raw request (n = 2) | 96.9%
3 Data Set Processing
Our malicious dataset was collected via deployment of a low interaction, web-based honeypot (i.e., Glastopf) on a Comcast business class network for a period of 42 days from 11/23/15 to 1/4/16 [9]. Note that honeypots are legitimate computer systems intended to simulate targets for cyberattacks and can be used to detect attacks or deflect attacks away from valuable assets. The data set acquired during this deployment session comprises a total of 78,592 different HTTP attack requests from 2,189 unique IP addresses originating from 104 different countries and therefore represents our ground truth positive class, i.e., malicious HTTP request data. Note that Glastopf emulates a vulnerable web server by hosting multiple web pages and applications with thousands of vulnerabilities (i.e., SQL injection, Cross site scripting, etc.) and allows attackers to upload malicious payloads [9, 10]. Through utilization of the honeypot technology, we were able to obtain an enriched source of malicious HTTP traffic and therefore avoid extraction of low-level malicious traffic from a larger benign stream. Most attack
origins were concentrated in a few distinct geographical areas, namely, the northeast and central U.S., Europe, Brazil, and southeast China. A smaller, but significant number of attacks originated from the U.S. west coast, northern tip of Africa (Algeria and Morocco), and Indonesia. SQL injection and phpMyAdmin attacks collectively represented 63.2% of all attacks captured. SQL injection attacks attempt to execute SQL statements in the database server and may result in destruction, unauthorized disclosure, or alteration of sensitive database data. The phpMyAdmin attack can result in deletion of a database and originates from multiple vulnerabilities in the phpMyAdmin database management tool. Note that OWASP lists injection attacks as the number one web application security vulnerability [4]. Based on the attack type distribution, we conclude that our data capture is representative of the types of attacks (i.e., SQL injection, command injections, local file inclusion, etc.) that are prominent in today’s landscape. The benign dataset was collected from a dedicated university web server over a period of 42 days from 11/8/19 to 12/20/19. The university’s Computer Science website contains information and links related to various aspects of the department, i.e., program information and curriculum, admission and aid, athletics, student life, etc. [11]. The site includes links to videos, virtual tours, and application materials for prospective undergraduate/graduate students. The dataset comprises a total of 39,265 different HTTP requests from 24,404 unique IP addresses sent to the Computer Science website, and therefore represents our ground truth negative class, i.e., benign HTTP request data. We recognize that our benign set may contain malicious traffic; however, the ratio of malicious traffic in real-world applications is very low. We compare our benign set to an enriched set of malicious traffic to further reduce the effect of noise for this study. 
We therefore predict that this aspect should not significantly affect our overall analysis or results given our experimental design. Note also that the web administrator was not aware of any web attacks during the collection period and the Computer Science webpage is situated behind a firewall that performs filtering of requests. Both datasets, initially in the form of raw Apache log files, were processed using Python [12] in combination with the Spyder IDE [13]. A new MySQL database was created by parsing and uploading relevant data from each log file to the database. The individual raw HTTP requests were parsed to extract header key/value pairs and store these as tables within the database. Header data for each labeled class was extracted and exported as standalone .csv master files for further processing. In addition to header data, the length of the HTTP request line and the length of the raw request for each instance were imported directly from the database and .csv master files created for further processing. Due to selective enrichment provided by the honeypot technology, our data set contained significantly more malicious instances than benign ones (79,879 vs 39,265 instances). Therefore, we randomly undersampled the malicious instances in order to match the number of instances in our benign set. The net result is a class-balanced dataset containing a total of 78,530 instances representing both classes evenly. The dataset was then split into training and testing sets at a 70/30 ratio. This operation allows us to compare our classifier predictions to the test set to determine how well the models generalize to new unseen data.
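The class balancing and 70/30 split described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the study: the function name, the fixed seed, the plain (non-stratified) shuffle, and the toy instances are all hypothetical.

```python
import random
from typing import List, Tuple

# (feature vector, label) with label 1 = malicious, 0 = benign.
Instance = Tuple[List[float], int]


def balance_and_split(malicious: List[Instance],
                      benign: List[Instance],
                      test_ratio: float = 0.3,
                      seed: int = 0) -> Tuple[List[Instance], List[Instance]]:
    """Undersample the larger class, shuffle, and split into train/test."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    n = min(len(malicious), len(benign))
    # Random undersampling: draw n instances without replacement per class.
    data = rng.sample(malicious, n) + rng.sample(benign, n)
    rng.shuffle(data)
    n_test = int(round(len(data) * test_ratio))
    return data[n_test:], data[:n_test]


# Toy data: 20 malicious and 10 benign instances with one dummy feature.
toy_malicious = [([float(i)], 1) for i in range(20)]
toy_benign = [([float(i)], 0) for i in range(10)]
train, test = balance_and_split(toy_malicious, toy_benign)
```

With the counts reported above, undersampling to the 39,265 benign instances yields 2 x 39,265 = 78,530 instances in total before the split.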
3.1 HTTP Request Data Processing
Two features are extracted from the HTTP request: (1) the request line length, and (2) the raw request length. The request line of the HTTP request consists of the HTTP method, host/DNS name of the server, resource path, and query string if present. The HTTP raw request consists of all fields of the request line in addition to the HTTP headers and the request body (payload). Note that due to privacy concerns, we do not use source IP addresses as a feature for classification. IP-based blacklisting approaches are already widely used in industry-grade firewalls and intrusion detection systems (IDS). Our goal was to complement such techniques with a behavior-based classification approach that is not defeated by IP spoofing. To derive features for building our models, the HTTP method/protocol information was removed from each instance in the request_url column. The string length of each instance was then computed. The same process was repeated for the request_raw column, except that the method/protocol information was not removed. As a result, the former length feature accounts for the length of the request line including the query string, and the latter accounts for the length of the request line, query string, all HTTP headers, and the request body. We then applied a log transformation to the string length features to reduce positive skew and produce a distribution that is closer to normal. The URL and raw request length distributions are shown in Figs. 2 and 3 (note: lengths are log-transformed).
Fig. 2. URL length class distribution
Fig. 3. Raw Request length class distribution
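The two length features described above can be sketched as follows. The paper does not state the base of the logarithm or how zero-length strings were handled, so the natural logarithm and the `log1p` form used here are assumptions, as are the function name and the example request.

```python
import math

def length_features(request_line, raw_request):
    """Return (log URL length, log raw-request length) for one HTTP request."""
    # Drop the method and protocol tokens, e.g. "GET /a?b=1 HTTP/1.1" -> "/a?b=1".
    parts = request_line.split(" ")
    url = " ".join(parts[1:-1]) if len(parts) >= 3 else request_line
    # log1p(x) = log(1 + x); an assumption that keeps empty strings well defined.
    return math.log1p(len(url)), math.log1p(len(raw_request))

# Hypothetical request used only for illustration.
raw = "GET /index.php?id=1 HTTP/1.1\r\nHost: example.org\r\n\r\n"
request_line = raw.split("\r\n")[0]
url_feat, raw_feat = length_features(request_line, raw)
print(round(url_feat, 3), round(raw_feat, 3))
```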
Both distributions demonstrate class overlap; however, the degree of overlap is higher for the raw request feature, as shown in Fig. 3. The relationship between the two features is shown in Fig. 4.
Fig. 4. URL_counts vs raw_counts scatterplot (log scaled values; b = benign, m = malicious)
Figure 4 shows that longer URLs combined with mid-range and longer raw requests are predictive of malicious HTTP requests. Once the URL exceeds a log-scaled length of 4.5, the requests transition almost entirely to the malicious class regardless of raw request length. Figure 4 also shows some class overlap in the mid-range lengths. The distributions of the URL and raw request length features, including potential outliers, are shown in Fig. 5.
Fig. 5. Boxplots of features (log scaled values; b = benign, m = malicious)
As shown in Fig. 5, the URL length distributions differ between the malicious and benign classes. A similar difference is observed between the malicious and benign classes for the raw request length; however, the relationship is reversed.
3.2 Header Data Processing
To generate header-based features, header key/value pairs were extracted from each raw HTTP request. The final output data set was an 85-column matrix containing 1s and 0s denoting the presence or absence of each of the 84 headers for each request instance, plus one column for the class label (malicious or benign).
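The presence/absence encoding described above can be sketched as follows. This is an illustrative sketch rather than the authors' parser: the function names and the two example requests are hypothetical, and it assumes CRLF line endings and case-insensitive header names.

```python
def parse_header_names(raw_request):
    """Return the set of lower-cased header names in a raw HTTP request."""
    head = raw_request.split("\r\n\r\n", 1)[0]   # drop the body, if any
    names = set()
    for line in head.split("\r\n")[1:]:          # skip the request line
        if ":" in line:
            names.add(line.split(":", 1)[0].strip().lower())
    return names

def presence_matrix(raw_requests, labels):
    """Build rows of 1/0 header indicators with the class label as last column."""
    per_request = [parse_header_names(r) for r in raw_requests]
    columns = sorted(set().union(*per_request))
    rows = [[1 if c in names else 0 for c in columns] + [label]
            for names, label in zip(per_request, labels)]
    return columns, rows

# Hypothetical requests; label 0 = benign, 1 = malicious.
requests = [
    "GET / HTTP/1.1\r\nHost: a\r\nAccept-Encoding: gzip\r\nAccept-Language: en\r\n\r\n",
    "POST /x HTTP/1.1\r\nHost: a\r\nContent-Type: text/plain\r\n\r\nbody: not a header",
]
columns, rows = presence_matrix(requests, [0, 1])
print(columns)
print(rows)
```

In the study the same construction would yield 84 indicator columns plus the label column.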
4 Predictive Modeling Methodology and Results
The primary goal of our study was the evaluation of various modeling techniques for the classification of HTTP requests. A secondary goal of this work was to determine whether a small subset of headers along with URL length can be used to classify an HTTP request as benign or malicious with high accuracy. The following subsections detail our approach using supervised machine learning methods to achieve these goals.
4.1 Machine Learning Methods
To generate a predictive model, we experimented with various supervised machine learning approaches, including stochastic gradient descent (SGD), logistic regression (LR), decision trees (DT), support vector machines (SVM), extreme gradient boosting (XGBoost), AdaBoost, random forests (RF), k-nearest neighbors (KNN), and a multilayer neural network (MNN) [32].
SGD is an optimization algorithm that can be used to train models for classification and regression. At each iteration, the algorithm selects a random instance from the training data and computes the gradient of the cost function that it seeks to minimize [30]. SGD therefore updates the parameter weights incrementally, one randomly selected training instance at a time [31].
LR uses a sigmoid function to predict the probability that an instance belongs to a specific class [31]. The output of the sigmoid function is a number between 0 and 1 (i.e., a probability), which is then used to classify the instance.
DTs are hierarchical models for supervised learning in which the local region is identified through a sequence of recursive splits in a small number of steps. A DT is built from decision nodes, where each node implements a test function with an outcome. The process begins at the root and is repeated until a leaf node is reached; the value within the leaf is the output [31].
SVM is a discriminative classifier that creates a separating hyperplane between two or more classes in the feature space. The hyperplane is then used as the basis for separating new instances. SVM seeks an optimal hyperplane that maximizes the margin between classes [31].
XGBoost is an optimized ensemble method that uses gradient boosted decision trees to perform either classification or regression tasks [33]. Boosted ensemble methods build a sequence of decision trees in order to reduce the bias of the combined estimator. Each individual decision tree can be considered a weak learner; however, when multiple weak learners are combined, the net result is often a very strong ensemble learner.
AdaBoost is an ensemble method similar to XGBoost in that both use weak learners to create a more powerful collective learner; however, AdaBoost focuses on the instances that were misclassified and therefore assigns a larger weight to these instances [30].
Random Forest (RF) is an ensemble learning method used for classification and regression. The algorithm constructs a predetermined number of decision trees during the training phase and outputs either the mode of the classes (classification) or the mean prediction (regression) of the individual trees. More specifically, RF is a collection of decision trees where each tree is slightly different from the others due to the random nature of the attributes chosen to construct an individual tree. In general, each tree can be a good predictor but may tend to overfit on part of the data. Building many decision trees based on randomly sampled training data and then averaging the results can avoid
overfitting yet result in a model with higher classification accuracy [31].
KNN is an instance-based, non-parametric machine learning method that uses a majority voting scheme over the training data to classify a new instance [31]. To evaluate an instance, the KNN algorithm locates the k nearest neighbors of the instance, where k is a parameter of the method. KNN then assigns a class label to the instance using majority voting.
A Multilayer Neural Network (MNN) is a feed-forward artificial neural network that contains several layers of artificial neurons or nodes [31]. The MNN will generally have one input layer, one or more hidden layers, and one output layer. An MNN can distinguish data that is not linearly separable and supports a variety of optimizers and hyperparameter tuning options. For this study, we use a Multilayer Perceptron (MLP) as our MNN.
4.2 Model Implementation and Classification Results
For this study, we used Python’s scikit-learn package to perform classification using the SGD, LR, DT, SVM, AdaBoost, RF, and KNN methods. All methods were executed using default parameters for the initial experiment. For the XGBoost method we used Python’s XGBoost library. Lastly, for the MLP method, we used Python’s Keras API [34]. The MLP model used four dense (i.e., fully connected) layers with 1000, 500, 250, and 2 nodes, respectively. The ReLU activation function was used for the first three layers, and softmax with L2 regularization of 0.1 was used for the last layer. Adaptive moment estimation (Adam) was used as the optimizer [35]. The model was trained for 15 epochs. All processing was performed on a PC with an Intel® i3-7100U CPU @ 2.40 GHz with 8 GB of RAM.
Header Dataset Analysis
Results obtained from each classifier using the full feature set (84 features/headers) are shown in Table 2. We also show classifier testing times in seconds. Best results are highlighted in gray.
Table 2. Classifier result - full feature set (header data)
Classifier              Precision %  Recall %  F1-Measure %  Accuracy %  Time (s)
SGD                     96.3         96.2      96.2          97.1        0.16
Random Forest           97.1         97.1      97.1          97.1        1.14
MLP                     97.1         97.1      97.1          97.1        3.94
K-Nearest Neighbor      97.0         97.0      96.9          96.9        149.0
Logistic Regression     96.5         96.5      96.5          96.5        0.13
XGBoost                 96.5         96.5      96.5          96.4        0.76
Support Vector Machine  96.0         96.0      96.0          96.0        190.35
Decision Tree           95.2         95.1      95.1          95.1        0.09
AdaBoost                94.1         94.1      94.1          94.0        5.66
As shown in Table 2, RF, MLP, and SGD achieved the highest classification accuracy (97.1%). Of the three, SGD was the fastest, with a testing time of 0.16 s (third fastest overall). The DT classifier scored a lower classification accuracy (95.1%) but registered the fastest testing time of 0.09 s. SVM and KNN registered the slowest testing times at 190.35 and 149.0 s, respectively; however, both demonstrated very good classification accuracies of 96.0% and 96.9%, respectively. LR had the second-best testing time of 0.13 s. All classifiers scored at least 94.0% on all metrics considered. These results indicate that the header state (i.e., absence or presence) in an HTTP request can be used to identify a malicious request with high accuracy.
A second goal of our study was to determine whether a small subset (i.e., 3–5 features) of the original 84 features can be used to identify a malicious request with high accuracy. This goal aligns with our desire to create a lightweight intrusion detection tool that processes the minimum required information from the raw HTTP request yet provides the highest possible classification accuracy for malicious instances. We therefore applied feature selection based on feature importance metrics using three of the above classifiers: RF, XGBoost, and DT. RF feature importance is implemented in Scikit-Learn by averaging the decrease in node impurity at each split over all trees as they are created. Features that provide the greatest overall mean decrease in impurity are given higher feature importance. DT feature importance is implemented in a manner similar to Random Forest in Scikit-Learn and uses the default criterion of ‘Gini’ impurity measured at each split. Features that provide the greatest decrease in impurity at each split are given higher feature importance. XGBoost by default uses ‘gain’ to measure the feature importance of each attribute. The gain criterion determines the relative contribution of each feature to the model based on each tree. Features resulting in higher average gain are therefore considered more important when making predictions. Feature selection results are presented in Table 3, which shows the top five most important header features as scored by each classifier.
Table 3. Feature selection results
Importance rank  Random forest    XGBoost          Decision tree
1                accept-encoding  accept-encoding  accept-encoding
2                accept-language  content-type     content-type
3                referer          accept-language  accept-language
4                content-type     accept           accept
5                from             user-agent       x-wallarm-scanner-info
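The impurity-based importance criterion described above can be illustrated for a single split on one binary header feature. This is a simplified sketch with made-up toy data and hypothetical function names; Scikit-Learn additionally weights each node by the fraction of samples reaching it, sums over all nodes that use a feature, and normalizes the result.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels: 1 - p^2 - (1-p)^2 = 2p(1-p)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def impurity_decrease(feature, labels):
    """Weighted Gini decrease from splitting on a binary (0/1) feature."""
    left = [y for x, y in zip(feature, labels) if x == 0]
    right = [y for x, y in zip(feature, labels) if x == 1]
    n = len(labels)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - children

# Toy data: 1 = malicious, 0 = benign; features are header presence flags.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
informative_header = [1, 1, 1, 1, 0, 0, 0, 0]    # present only in benign requests
uninformative_header = [1, 0, 1, 0, 1, 0, 1, 0]  # unrelated to the class
print(impurity_decrease(informative_header, labels))
print(impurity_decrease(uninformative_header, labels))
```

A header that separates the classes perfectly removes all of the parent impurity, while a header unrelated to the class removes none.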
As shown in Table 3, the three highest-rated header features common to all three classifiers are: (1) accept-encoding, (2) accept-language, and (3) content-type. Note that Random Forest feature selection rated the referer header as more important than content-type, but only marginally so. To confirm these results, the permutation_importance function from the Scikit-Learn inspection module was used to calculate feature importances using the trained Random Forest model created from the 84-feature training set. This testing confirmed that
the ‘accept-encoding’ and ‘content-type’ headers have the two highest importance scores, in line with the previous results. The ‘accept-language’ header, originally ranked in the top three based on the previous analysis, ranked lower in the permutation importance testing, suggesting that the original score for this header may have been inflated by the models (i.e., XGBoost, RF, and DT). However, experimental model testing using different combinations of the top six headers, with ‘accept-encoding’ and ‘content-type’ as the two base headers, showed that the ‘accept-language’ header provides incrementally more classification accuracy. Therefore, based on this testing, we included the ‘accept-language’ header as the third feature for our models. Using these three features, we retrained all the classification models. Results obtained from each classifier using the reduced feature set are shown in Table 4. Best results are highlighted in gray.
Table 4. Classifier result - three feature set (header data)
Classifier              Precision %  Recall %  F1-Measure %  Accuracy %  Time (s)
Decision Tree           93.6         93.6      93.6          93.6        0.26
Support Vector Machine  93.6         93.6      93.6          93.6        8.85
XGBoost                 93.6         93.6      93.6          93.6        0.17
Random Forest           93.6         93.6      93.6          93.6        0.40
K-Nearest Neighbor      93.6         93.6      93.6          93.6        12.84
MLP                     93.6         93.6      93.6          93.6        4.72
SGD                     91.9         91.7      91.7          91.7        0.12
Logistic Regression     91.9         91.7      91.7          91.7        0.23
AdaBoost                91.9         91.7      91.7          91.7        1.93
As shown in Table 4, all classifiers achieved comparable results using the three-feature set. Compared to the models with the full feature set, the best accuracy decreased from 97.1% to 93.6%. However, testing times were reduced for most classifiers, making the three-feature set more suitable for a real-time IDS application. These results indicate that the absence or presence of three specific headers in an HTTP request (i.e., accept-encoding, accept-language, and content-type) can be used to efficiently identify a malicious request with good accuracy. To determine whether the accuracy results could be improved, we performed a grid search using GridSearchCV as implemented in Scikit-Learn. We chose three of the top-performing classifiers (i.e., Decision Tree, XGBoost, and Random Forest) and tuned multiple hyperparameters specific to each to improve upon our previous metrics using the three-feature data set. The grid search was performed using 10-fold cross-validation to estimate model generalization performance during hyperparameter tuning. Optimization results did not differ significantly from the previous results and demonstrated similar overall metrics, suggesting that the default parameters were already close to optimal for our dataset.
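The 10-fold cross-validation used during tuning can be sketched as an index-splitting routine. This is a simplified, unstratified illustration with a hypothetical function name; when given an integer number of folds for a classifier, Scikit-Learn uses stratified folds, which also preserve the class ratio in each fold. The sketch assumes the data has already been shuffled.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, val_idx) pairs for k roughly equal, contiguous folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if fold < extra else 0)
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

# Small example: 25 samples split into 10 folds.
folds = list(k_fold_indices(25, k=10))
for train_idx, val_idx in folds:
    print(len(train_idx), len(val_idx))
```

Each hyperparameter candidate is trained on the training part of every fold and scored on the held-out part; the mean of the ten scores is the estimate used to compare candidates.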
HTTP Raw Request and URL Dataset Analysis
For the second phase of our study, we built classification models using the processed HTTP raw request data set (i.e., a matrix with dimensions 78,530 × 2), with the same methods as used for the header dataset. The data set contained the lengths of the URL and of the complete raw request for each instance, and these two numerical values were used as features for our models. Results are shown in Table 5. Best results are highlighted in gray.
Table 5. Classifier result - full feature set (URL/Raw HTTP request data)
Classifier              Precision %  Recall %  F1-Measure %  Accuracy %  Time (s)
Random Forest           96.9         96.9      97.0          96.9        1.52
K-Nearest Neighbor      96.5         96.5      96.5          96.5        3.49
XGBoost                 93.2         93.2      93.2          93.2        0.41
AdaBoost                92.0         91.9      92.0          91.9        1.61
Decision Tree           88.9         88.8      88.7          88.7        0.11
Support Vector Machine  88.7         88.2      88.2          88.3        20.75
MLP                     84.3         82.7      86.2          84.2        4.99
SGD                     79.7         78.5      78.1          78.3        0.28
Logistic Regression     76.9         76.7      76.6          76.7        0.12
As shown in Table 5, the Random Forest (RF) model achieved the highest accuracy at 96.9%, although the Decision Tree (DT) model registered the fastest testing time of 0.11 s and scored a classification accuracy of 88.7%. To gain an intuition for how the RF model achieved the high classification accuracy, the decision boundaries that the model used to classify the HTTP requests are shown in the following Fig. 6.
Fig. 6. RF model (decision boundaries, log scaled values; 0 = benign, 1 = malicious)
As shown in Fig. 6, the decision boundaries for the RF model can be visualized as irregular square and rectangular regions in the two-dimensional feature space. The green region represents benign instances, while the red region represents malicious instances. A few misclassified instances can be observed in the plot, as well as some instances close to the decision boundaries; however, the majority (i.e., 96.9%) are correctly classified. Contrast the shapes of the decision boundaries produced by the RF model with those generated by the K-Nearest Neighbors (KNN) model, shown in Fig. 7:
Fig. 7. KNN model (decision boundaries, log scaled values; 0 = benign, 1 = malicious)
As mentioned earlier, KNN operates by identifying the k nearest neighbors and returning the majority vote of the labels for these instances. Compared with the Random Forest model, the decision boundaries created by KNN are more angular and region-specific, since the algorithm aggregates over multiple data instances based on local proximity. In contrast, the RF model produces a series of rectangular and stair-step regions in the feature space, resulting in sharp linear boundaries. As with the header data experiments, we performed a grid search to optimize the hyperparameters of the RF, KNN, and DT models. These were chosen because RF and KNN demonstrated the highest accuracies, and DT had the fastest testing time. The specific hyperparameters tuned and the values used for each are shown in Table 6.
Table 6. Selected hyperparameters and associated values
Classifier           Hyperparameter  Values
K-Nearest Neighbors  leaf_size       {25, 30, 35, 40, 45}
K-Nearest Neighbors  n_neighbors     {3, 5, 7, 9}
K-Nearest Neighbors  p               {1, 2}
Decision Tree        max_depth       {4, 6, 8, 10}
Decision Tree        criterion       {‘Entropy’, ‘Gini’}
Random Forest        n_estimators    {100, 150, 200}
Random Forest        max_depth       {4, 6, 8, 10}
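To give a sense of the size of this search, the grids in Table 6 can be expanded into their individual candidate settings. The sketch below uses only the standard library; the `expand` helper is hypothetical, the criterion names are written in the lower-case spelling that Scikit-Learn expects, and the 10-fold figure assumes the same cross-validation setting as in the header experiments.

```python
from itertools import product

# Grids transcribed from Table 6.
PARAM_GRIDS = {
    "K-Nearest Neighbors": {
        "leaf_size": [25, 30, 35, 40, 45],
        "n_neighbors": [3, 5, 7, 9],
        "p": [1, 2],
    },
    "Decision Tree": {
        "max_depth": [4, 6, 8, 10],
        "criterion": ["entropy", "gini"],
    },
    "Random Forest": {
        "n_estimators": [100, 150, 200],
        "max_depth": [4, 6, 8, 10],
    },
}

def expand(grid):
    """Return every combination of values in a parameter grid as a dict."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

for name, grid in PARAM_GRIDS.items():
    n = len(expand(grid))
    # Assumption: 10-fold cross-validation, i.e. ten model fits per candidate.
    print(f"{name}: {n} candidates, {n * 10} fits with 10-fold CV")
```

Dictionaries of this shape are what Scikit-Learn's GridSearchCV accepts as its `param_grid` argument.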
Once trained, the best estimators were selected for each model and then predictions derived on the test set. Results are shown in Table 7.
Table 7. Classifier results - optimized models
Classifier          Precision %  Recall %  F1-Measure %  Accuracy %  Time (s)
Random Forest       96.9         97.0      97.0          96.9        1.00
K-Nearest Neighbor  96.6         96.7      96.6          96.6        2.16
Decision Tree       95.8         95.8      95.8          95.7        0.15
Results from Table 7 show a significant improvement for the Decision Tree model. The prediction accuracy increased from 88.7% using default parameters to 95.7% after optimization, an increase of 7.0 percentage points. This brings it close in accuracy to the RF and KNN models, while having significantly lower processing time. The optimized model parameters for the Decision Tree classifier are: (1) max_depth = 10, and (2) criterion = ‘Gini’. For all three models, the results indicate a low false positive rate and high recall. Given our desire to design a lightweight, fast classifier for incoming HTTP traffic with minimal processing overhead, the optimized Decision Tree model could represent the best overall model when all factors are considered.
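The metrics reported in the tables above, and their link to the false positive rate, can be made concrete with a short sketch. The function name and the toy predictions are made up for illustration, and the sketch treats the malicious class as the positive class; the averaging scheme used for the reported values is not specified in the text.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute common metrics, treating `positive` (malicious) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": (tp + tn) / len(pairs),
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }

# Toy labels: 1 = malicious, 0 = benign.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

On a class-balanced test set such as the one used here, high precision for the malicious class implies that few benign requests are flagged, which is the sense in which the results indicate a low false positive rate.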
5 Conclusion
Low-profile, non-commercial websites are less secure due to more limited resources and represent attractive targets for malicious users and hackers. Such websites are compromised as part of large-scale botnets and sold on the dark web for illicit financial gain. Current statistical trends show that botnet attacks will increasingly target financial organizations and that attacks related to cryptocurrency are on the rise. We therefore propose a machine learning approach to detect malicious HTTP requests potentially present in routine network traffic directed towards non-commercial websites. Our approach examines the first few lines of an HTTP packet and performs string
processing of the request line (URL length/raw request) and header fields. After minimal processing, the resultant feature vectors are passed to downstream supervised models for classification as either benign or malicious (attacks). The datasets used in our study were collected from two sources: a low-interaction honeypot, and network traffic logs collected from a university department web server. Our experimental results show that a combination of three HTTP request headers (i.e., accept-encoding, accept-language, and content-type) is sufficient to identify malicious requests with an accuracy of 93.6% using an optimized Decision Tree (DT) model. Average precision (93.6%) and recall (93.6%) were high, as was the F1-score (93.4%), indicating a low false-positive rate and high detection rate. In addition to header data, the URL and raw request lengths extracted from the HTTP requests were converted to input feature vectors and passed to our downstream classifiers. Results from these experiments show that URL and raw request lengths can be used as predictive features to identify malicious requests with an accuracy of 96.9% using an optimized Random Forest (RF) model. Average precision (96.9%) and recall (97.0%) for the RF model were high, as was the F1-score (97.0%). Overall, our study results demonstrate that machine learning models can be successfully applied to network traffic analysis and can identify malicious traffic with high efficiency and accuracy. For future work, we plan to implement a lightweight IDS tool for non-commercial web sites, with high performance and minimal HTTP packet processing, situated upstream of the web application. We envision multiple potential implementation scenarios, including a stand-alone system where all HTTP/HTTPS traffic is mirrored to a monitoring box.
Required features (i.e., headers, URL length) would be extracted and processed in real time as requests arrive, and each request would then be classified as benign or malicious. The system would route back into a firewall to block or drop malicious connections. A second potential system could involve an embedded module situated inside the HTTP server, with feature extraction and classifier functions implemented as a processing step by the server. The HTTP request would be classified and either admitted for full payload processing, or dropped by the server and routed to storage for later analysis, trending, and inclusion into a database reserved for malicious instances.
References
1. Calzavara, S., Conti, M., Focardi, R., Rabitti, A., Tolomei, G.: Machine learning for web vulnerability detection: the case of cross-site request forgery. In: IEEE Security & Privacy, January 2019
2. Calzavara, S., Focardi, R., Squarcina, M., Tempesta, M.: Surviving the web: a journey into web session security. ACM Comput. Surv. 5, 451–455 (2018)
3. Khalid, M., Farooq, H., Iqbal, M., Alam, M.T., Rasheed, K.: Predicting web vulnerabilities in web applications based on machine learning. In: Presented at Intelligent Technologies and Applications. Communications in Computer and Information Science, vol. 932. Springer, Singapore, March 2019
4. https://owasp.org/www-project-top-ten/. Accessed 15 Mar 2020
5. https://www.accenture.com/acnmedia/PDF-99/Accenture-Cost-Cyber-Crime-Infographic.pdf#zoom=50. Accessed 15 Mar 2020
6. https://www.pentasecurity.com/blog/top-5-botnets-2017/. Accessed 15 Mar 2020
7. https://securelist.com/bots-and-botnets-in-2018/90091/. Accessed 15 Mar 2020
8. Putman, C., Abhishta, A., Nieuwenhuis, B.: Business model of a botnet. In: Presented at Proceedings of the 26th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP, April 2018
9. Omari, S., Mescioglu, I., Rajshree, S.: Experiences on the deployment of honeypots for collection and analysis of web attacks. In: Presented at MBAA International Conference, June 2016
10. https://www.honeynet.org/projects/old/glastopf/. Accessed 16 Mar 2020
11. https://www.lewisu.edu/academics/comsci/. Accessed 16 Mar 2020
12. https://www.python.org/. Accessed 16 Mar 2020
13. https://www.anaconda.com/. Accessed 16 Mar 2020
14. Niu, W., Li, T., Zhang, X., Hu, T., Jiang, T., Wu, H.: Using XGBoost to discover infected hosts based on HTTP traffic. In: Security/Communication Networks, pp. 1–11 (2019)
15. Zhang, M., Xu, B., Bai, S., Lu, S., Lin, Z.: A deep learning method to detect web attacks using a specially designed CNN. In: Presented at 24th International Conference on Neural Information Processing. Proceedings Part V, LNCS, vol. 10638, pp. 828–836, October 2017
16. Zhang, Y., Mekky, H., Zhang, Z., Torres, R., Lee, S., Tongaonkar, A., Mellia, M.: Detecting malicious activities with user-agent-based profiles. Int. J. Netw. Manage. 25(5) (2015)
17. Yu, Y., Yan, H., Guan, H., Zhou, H.: DeepHTTP: semantics-structure model with attention for anomalous HTTP traffic detection and pattern mining. In: Proceedings of ACSAC, New York, NY, USA (2018)
18. Goseva-Popstojanova, K., Anastasovski, G., Dimitrijevikj, A., Pantev, R., Miller, B.: Characterization and classification of malicious web traffic. Comput. Secur. 42, 92–115 (2014)
19. Li, K., Chen, R., Gu, L., Liu, C., Yin, J.: A method based on statistical characteristics for detection malware requests in network traffic. In: Presented at IEEE Third International Conference on Data Science in Cyberspace, pp. 527–532, June 2018
20. Yong, B., Xin, L., Qingchen, Y., Liang, H., Qingguo, Z.: Malicious web traffic detection for internet of things environments. Comput. Electr. Eng. 77, 260–272 (2019)
21. Ogawa, H., Yamaguchi, Y., Shimada, H., Takakura, H., Akiyama, M., Yagi, T.: Malware originated HTTP traffic detection utilizing cluster appearance ratio. In: Presented at International Conference on Information Networking (ICOIN), pp. 248–253, January 2017
22. Zarras, A., Papadogiannakis, A., Gawlik, R., Holz, T.: Automated generation of models for fast and precise detection of HTTP-based malware. In: Presented at Annual Conference on Privacy, Security and Trust, PST 2014, pp. 249–256, July 2014
23. Kheir, N.: Behavioral classification and detection of malware through HTTP user agent anomalies. J. Inf. Secur. Appl. 18, 2–13 (2013)
24. McGahagan, J., Bhansali, D., Gratian, M., Cukier, M.: A comprehensive evaluation of HTTP header features for detecting malicious websites. In: Presented at European Dependable Computing Conference, pp. 75–82, September 2019
25. Rovetta, S., Suchacka, G., Masulli, F.: Bot recognition in a web store: an approach based on unsupervised learning. J. Netw. Comput. Appl. 157 (2020)
26. Seyyar, M., Catak, F., Gul, E.: Detection of attack-targeted scans from the apache HTTP server access logs. Appl. Comput. Inf. 14, 28–36 (2017)
27. Husák, M., Velan, P., Vykopal, J.: Security monitoring of HTTP traffic using extended flows. In: Presented at 10th International Conference on Availability, Reliability and Security, pp. 258–265, August 2015
28. Zolotukhin, M., Hamalainen, T., Kokkonen, T., Siltanen, J.: Analysis of HTTP requests for anomaly detection of web attacks. In: Presented at International Conference on Dependable, Autonomic and Secure Computing, pp. 406–411, August 2014
29. https://www.nubeva.com/product. Accessed 22 Mar 2020
30. Geron, A.: Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd edn. O'Reilly Media, Inc., Sebastopol (2019)
31. Raschka, S., Mirjalili, V.: Python Machine Learning, 3rd edn. Packt Publishing, Birmingham (2019)
32. https://scikit-learn.org/stable/supervised_learning.html#supervised-learning. Accessed 31 Mar 2020
33. https://xgboost.readthedocs.io/en/latest/. Accessed 31 Mar 2020
34. Keras Homepage. https://keras.io/. Accessed 31 Mar 2020
35. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. In: Presented at International Conference on Learning Representations, January 2015
Comparison of Classifiers Models for Prediction of Intimate Partner Violence
Ashly Guerrero, Juan Gutiérrez Cárdenas(✉), Vilma Romero, and Víctor H. Ayma
Universidad de Lima, Lima, Peru
[email protected], {jmgutier,vromero,vayma}@ulima.edu.pe
Abstract. Intimate partner violence (IPV) is a problem that has been studied by different researchers to determine the factors that influence its occurrence, as well as to predict it. In Peru, 68.2% of women have been victims of violence, of which 31.7% were victims of physical aggression, 64.2% of psychological aggression, and 6.6% of sexual aggression. Therefore, in order to predict psychological, physical, and sexual intimate partner violence in Peru, we used the database of complaints registered in 2016 by the “Ministerio de la Mujer y Poblaciones Vulnerables”. This database comprises 70,510 complaints and 236 variables concerning the characteristics of the victim and the aggressor. First, we used the Chi-squared feature selection technique to find the most influential variables. Next, we applied the SMOTE and random undersampling techniques to balance the dataset. Then, we processed the balanced dataset using 10-fold cross-validation on Multinomial Logistic Regression, Random Forest, Naive Bayes, and Support Vector Machine classifiers to predict the type of partner violence and compare their results. The results indicate that the Multinomial Logistic Regression and Support Vector Machine classifiers performed better in different scenarios with different feature subsets, whereas the Naïve Bayes classifier showed inferior performance. Finally, we observed that the classifiers improved their performance as the number of features increased.
Keywords: Intimate partner violence · Random forest · Multinomial logistic regression · Support Vector Machine · Naïve Bayes · SMOTE
1 Introduction
Intimate partner violence (IPV) is a type of gender violence that is not a recent issue but has evolved over the years. In the 1970s, this crime was a severe social problem, which caused the resurgence of the Women’s Movement [48]. This problem has dramatically affected society since it goes against the fundamental rights of the person, and it is a social and interpersonal crime that affects the entire environment of society (political, economic, and social) [36].
A. Guerrero and V. H. Ayma—Authors contributed equally to this manuscript © Springer Nature Switzerland AG 2021 K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 469–488, 2021. https://doi.org/10.1007/978-3-030-63089-8_30
This offense, of which 68.2% of women are victims in Peru, has been categorized according to the type of violence action committed [16]: a) Physical violence, which consists of physical aggression towards the victim, causing injuries, fissures, bruises, among others; 31.7% of women suffer from this type of aggression in Peru. b) Psychological violence, which consists of dominating and confining the victim against his or her will, as well as undermining and causing psychological damage (psychiatric alterations of the victim); 64.2% of women suffer from this type of aggression in Peru. c) Sexual violence, which is done against a person without their consent. 6.6% of women suffer from this type of aggression in Peru. Moreover, it should be highlighted that is type of criminal act has several risk factors, which are factors that increase the likelihood of violence occurring between the couple. One of the possible factors is economic problems such as poverty, and if we associated it to male identity, we could have another risk factor identified [20]. An additional issue is psychological problems such as jealousy, insecurity of people, low self-esteem, and stress [20]. Also, statistics indicate that 6.4% of the husbands or partners of the victims drink alcoholic beverages frequently or occasionally. In relationship with this information, 49.1% of women expressed that they were once assaulted when their partner was under the influence of alcohol or drugs [16]. The consequences of this type of violence in society are diverse; one of them is the impact on the income of women in Peru. A research [30] showed that, in Peru, women who suffer aggression generate an average of 80% of the monetary income earned by those living in a situation without violence. This problem also has an impact on the psychiatric level (physical and mental). Additionally, the research showed that victims of violence are more likely, with 7% more accuracy, to have complications in childbirth. 
Also, regarding mental health, victims may suffer anxiety or distress so severe that it prevents them from fulfilling their work or other obligations.
Although the causes of this gender violence are known, the factors with the most significant influence on it remain unknown. Furthermore, no prediction experiments have been performed, because studies that select the relevant attributes or variables needed to make these violence predictions are lacking. Some research has been carried out to address this problem. For example, Babu and Kar, 2010; Saile et al., 2013; and Iverson et al., 2013 tried to find the most influential factors in intimate partner violence using statistical models such as logistic regression, latent class growth analysis, multinomial logistic regression, Poisson regression, and linear regression. Also, Ghosh, 2007; Berk, 2016; Hsieh et al., 2018; and Wijenayake, Graham, and Christen, 2018 tried to predict this type of crime using heuristic models. Moreover, statistical methods such as logistic regression have been used to make predictions [31].
Regarding our home country, Peru, there is little evidence of research on this type of crime. For that reason, this research proposes the use of data wrangling and feature selection techniques to find the variables that are most relevant for predicting gender violence. The selected variables serve as input to a set of classifier models, namely Multinomial Logistic Regression, Naive Bayes, Random Forest, and Support Vector Machines, in order to compare them and obtain the optimal model considering the metrics of accuracy, precision, recall, and F1-score.
Comparison of Classifiers Models
This research is divided into the following sections. In Sect. 2, we analyze the current state of the art of research in the field of gender violence. In Sect. 3, we give a succinct description of the techniques that we use. The steps employed in this research, followed by the experimentation processes, are described in Sect. 4. Having specified the methodology, we explain the results in Sect. 5. Finally, the discussion and conclusions are stated in the last two sections.
2 Related Work
2.1 Identification of Influential Features
In 2010, Genuer, Poggi, and Tuleau-Malot [11] carried out a research project to select the most relevant features. First, the authors used Random Forest importance to observe the significance of the variables and eliminate the least important ones. After that, the features were arranged in descending order of importance and entered into several Random Forests to obtain the optimal number of variables based on the out-of-bag (OOB) error. In the same year, Babu and Kar [2] analyzed the factors related to the victimization and perpetration of domestic violence in eastern India. The authors modeled the binary variable (presence or absence) of domestic violence with a logistic regression on each independent variable. Likewise, they ran several multivariate logistic regression models with stepwise backward elimination, in which variable elimination relied on a p-value of less than 0.10. On the other hand, Abramsky et al. [1] focused on identifying risk factors and protective factors (factors that prevent the risk of violence) in the couple, and on how these factors differ according to the environment. These authors used bivariate logistic regression to obtain the association between the variables of the environment or characteristics of the victim and the crime committed. For feature selection, they used multivariate logistic regression to identify factors of sexual or physical violence. In 2012, Swartout, Cook, and White [45] used latent class growth analysis to fit a set of longitudinal models and select the one with the best fit. The model selected was a multinomial logistic regression, into which they entered variables describing negative childhood experiences, such as sexual or physical abuse and whether the study subject had witnessed acts of violence.
The objective of this study was to measure the extent to which latent variables affected the occurrence of partner violence. At the same time, Schafer et al. [41] investigated the relationship between violence between couples and the development of HIV disease. The authors used univariate analysis, applying a separate test for the categorical predictors and a sample t-test for the continuous predictors. Likewise, they used multivariate logistic regression analysis for the variables that obtained a p-value less than or equal to 0.30.
In 2013, Saile, Neuner, Ertl, and Catani [40] used Spearman's correlations between the risk factors of violence and the level of violence experienced by women to analyze the bivariate relationships between them. After that, they ran an independent linear regression analysis to test the independent predictors identified by those correlations. The researchers used four linear regression models to investigate the independent associations of the predictive variables with the different subtypes of abuse experience. Also, Iverson, Litwack, Pineles, Suvak, Vaughn, and Resick [18] used Poisson regression to obtain the risk factors for revictimization by partner violence, focusing mainly on the factors that reduce the risk of future violence. In contrast, Izmirli, Sonmez, and Sezik [19] used Fisher's tests for bivariate comparisons and logistic regression models with backward elimination to select the influential variables in domestic violence among married women of reproductive age in southwestern Turkey. Two years later, Clark et al. [10] assessed whether partner violence in late adolescence and early adulthood was related to cardiovascular problems. To this end, the authors used restricted cubic spline functions to check whether the victimization and perpetration scores, neighborhood poverty, and the educational achievement of the subjects studied and of their parents were linearly related to cardiovascular problems. Furthermore, they used a linear regression model to assess the relationship of violence with health problems. At the same time, Brignone and Gomez [7] sought to identify which groups of patients admitted to the emergency department due to partner violence were more likely to suffer homicide. The authors ran a multiple linear regression to obtain the relationship between the variables used and the score obtained from the risk assessment.
Moyano, Monge, and Sierra [32] investigated the relationship between sexual double standards and attitudes related to rape in order to estimate the probability that a person would commit that aggression. The authors used logistic regression to obtain that relationship and the variables that influence the perpetration of the crime. On the other hand, Laeheem and Boonprakarn [26] analyzed the predictive factors of domestic violence among Thai Muslim married couples in Pattani province and created a predictive equation for domestic violence. The authors used Pearson's correlation to obtain the relationships between family background in education, the experience of violence, the authoritarian relationship, and domestic violence. Also, the authors built a backward multiple regression equation to predict domestic violence by calculating multiple correlation coefficients. In parallel, Leonardsson and San Sebastian [27] studied the prevalence and predictors of help-seeking among women in India who had been victims of violence during marriage. To analyze the factors that lead a victim to seek help, the authors used a bivariate logistic regression model, including only one independent variable at a time, to obtain crude odds ratios for each variable. They then ran a multivariate logistic regression on the remaining variables. In contrast, Jung et al. [22] investigated the relationship between childhood exposure to abuse and other forms of domestic violence and the risk of involvement in future partner violence. The authors used latent class analysis to obtain this relationship. On the other hand, Silva et al. [43] used artificial neural networks to find predictors of partner violence. Their research was conducted with data obtained from the "National Survey Demography and Health" in Colombia 2017.
2.2 Crime Prediction
Ghosh [12] compared logistic regression, classification tree, and Random Forest models to predict the vulnerability of women living in India to domestic violence. First, the author performed a simple cross-tabulation to obtain the predictive variables associated with domestic violence. Then, the author ran the predictive models and compared them in a classification table based on false negatives and false positives, using the accuracy metric. Moraes et al. [31] used a multinomial logistic regression model to predict physical violence within a couple, which was divided into three classes: the first was the absence of physical violence, the second was at least one episode of violence during or after pregnancy, and the third was the occurrence of the crime in both stages. They performed a univariate analysis to identify a parsimonious set of descriptors. The variables that were statistically significant at the 5% level were considered in the predictive model. However, in the prediction model, some variables lost significance because of one or more variables previously included. On the other hand, Berk, Sorenson, and Barnes [4] investigated the prediction of domestic violence with the aim of finding a subset of offenders who could be released without conditions, that is, offenders who would not commit further domestic violence. To do this, they used a Random Forest model, which classified the data into three classes. The first class contained the offenders who were not arrested for domestic violence; the second was made up of offenders arrested for domestic violence that did not include injuries, attempts, or physical threats; and the third consisted of offenders arrested for domestic violence who physically hurt, attempted to hurt, or threatened the victim. In 2018, Hsieh et al. [14] did similar work, focusing primarily on the prediction and management of domestic violence risk in Taiwan.
First, they performed an exploratory analysis of the data to obtain the most important characteristics. They then used the Random Forest model to build a repeated-victimization prediction model to discard false allegations. Wijenayake, Graham, and Christen [50] sought to predict whether an offender would commit the crime of domestic violence again within 24 months of the date on which the court hearing was held. The authors used a decision tree model.
3 Background
3.1 Data Balancing via the Synthetic Minority Oversampling Technique
A dataset is considered imbalanced if one of its classes has more instances than the other classes [47]. The class with the most instances is known as the majority class, and the one with the fewest instances is known as the minority class [28]. Class imbalance can significantly affect the performance of machine learning classifiers, introducing errors during the training phase. To solve this problem, machine learning researchers and practitioners often use the Synthetic Minority Oversampling Technique (SMOTE), which oversamples the minority class by introducing new samples along the segments that link a minority sample to any or all of its closest neighbors of the same class. Depending on the number of instances to be generated, neighbors are randomly selected from among the nearest neighbors [9]. In order to apply SMOTE, a sample E of the minority class is first chosen, and its closest neighbors within that class are obtained. Then, a sample E' is randomly selected from the closest neighbors found, and a new minority sample is constructed, see Eq. 1 [51].

$$E_{new} = E + \mathrm{rand}(0, 1) \times (E' - E) \quad (1)$$
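As an illustration only, Eq. 1 can be sketched in plain Python. The helper names (`nearest_neighbors`, `smote_sample`, `smote`), the toy points, the value k = 2, and the seed are assumptions made for this sketch and are not taken from the experiments; a library implementation would normally be used in practice.

```python
import math
import random


def nearest_neighbors(sample, others, k):
    """Return the k points in `others` closest to `sample` (Euclidean distance)."""
    return sorted(others, key=lambda p: math.dist(sample, p))[:k]


def smote_sample(sample, neighbor, rng):
    """Eq. 1: E_new = E + rand(0, 1) * (E' - E), applied to every coordinate."""
    gap = rng.random()  # a single draw in [0, 1) shared by all coordinates
    return [e + gap * (n - e) for e, n in zip(sample, neighbor)]


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples (k and seed are illustrative)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        e = rng.choice(minority)                      # sample E
        others = [p for p in minority if p is not e]  # candidates for E'
        e_prime = rng.choice(nearest_neighbors(e, others, k))
        synthetic.append(smote_sample(e, e_prime, rng))
    return synthetic


# Hypothetical minority-class feature vectors.
minority = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
new_points = smote(minority, n_new=5)
```

Because every synthetic point lies on the segment between E and E', the new samples stay inside the region already covered by the minority class.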
To obtain the nearest neighbors, the Euclidean distance from each sample of the minority class to all other samples of that class is calculated. Then a sampling rate N is established according to the imbalance ratio; each sample E of the minority class selects N samples from its nearest neighbors, which are denoted Ei (i = 1, 2, 3, …, N). Finally, E and each Ei (i = 1, 2, 3, …, N) are used repeatedly to construct a new sample with Eq. 1 [51]. It should be noted that the selected samples are feature vectors [15].
3.2 Feature Selection via the χ²
Feature selection is a set of methods that allows identifying, among all the features, the most relevant ones for a given classification model [50]. These techniques are used to enhance classification performance and often improve the understanding of the data [8]. Similarly, feature selection reduces overfitting and helps in gaining knowledge of the processes that generated the data [39]. The literature categorizes feature selection into filter, wrapper, and embedded methods. The filter methods evaluate the relevance of a variable based on the intrinsic relationships within the data. The wrapper methods, on the other hand, not only evaluate the relevance of the features but also evaluate the results of the model for each of the relevant variables [17, 52]. The embedded methods integrate feature selection into the training stage to reduce the time taken to classify the different subsets of relevant features [8]. Among the filter methods, Chi-squared (χ²) is a well-known method for selecting the most important features. This technique consists of independence tests that estimate whether a class is independent of a feature. The statistic is calculated by using Eq. 2 [37]:

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - \mu_{ij})^2}{\mu_{ij}} \quad (2)$$

Where c is the number of classes, r is the number of values of the feature, and $n_{ij}$ is the number of samples with value i that belong to class j. The value of $\mu_{ij}$ is obtained with Eq. 3 [37]:

$$\mu_{ij} = \frac{n_j \, n_i}{n} \quad (3)$$

In this equation, $n_j$ is the number of samples in class j, $n_i$ is the number of samples whose feature takes the value i, and n is the total number of samples.
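The χ² score of Eqs. 2 and 3 can be computed directly from a contingency table of counts. The following is a minimal sketch under stated assumptions: the function names `chi_squared` and `rank_features`, and the toy columns, are hypothetical and serve only to show the calculation.

```python
from collections import Counter


def chi_squared(feature_values, class_labels):
    """Chi-squared statistic between one categorical feature and the class (Eq. 2)."""
    n = len(feature_values)
    n_ij = Counter(zip(feature_values, class_labels))  # observed counts
    n_i = Counter(feature_values)                      # samples per feature value
    n_j = Counter(class_labels)                        # samples per class
    stat = 0.0
    for i in n_i:
        for j in n_j:
            mu_ij = n_i[i] * n_j[j] / n                # expected count (Eq. 3)
            stat += (n_ij[(i, j)] - mu_ij) ** 2 / mu_ij
    return stat


def rank_features(columns, class_labels):
    """Sort feature names by decreasing chi-squared score."""
    scores = {name: chi_squared(values, class_labels) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: one feature aligned with the class, one independent of it.
labels = ["x", "x", "y", "y"]
columns = {
    "independent": ["a", "b", "a", "b"],
    "aligned": ["a", "a", "b", "b"],
}
ranking = rank_features(columns, labels)
```

A feature that is independent of the class has observed counts equal to the expected counts, so its score is zero and it falls to the bottom of the ranking.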
3.3 Classifier Models
Multinomial Logistic Regression. The multinomial regression model is used when the dependent variable is polytomous, that is, it has more than two values, and the independent variables or predictors are continuous or categorical. The dependent variable is measured on a nominal scale; unlike binary logistic regression, where the dependent variable can only take two values, the dependent variable of this model can have more than two categorically coded classes, and one of them is taken as the reference class [49]. This model is also known as SoftMax regression [53]. The SoftMax regression algorithm extends binary logistic regression to multiple classes at once. For each attribute, a weight (θ) is calculated through a stochastic gradient function, and an activation function determines whether the sample belongs to a class or not. The probability that a given sample $x_i$ belongs to a class k is determined by Eq. 4 [53]:

$$p_k(x_i \mid \theta_k) = \frac{\exp(x_i^T \theta_k)}{\sum_{j=1}^{K} \exp(x_i^T \theta_j)} \quad (4)$$
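To make Eq. 4 concrete, the class probabilities can be evaluated numerically. This is a sketch only: the function name `softmax_probabilities`, the sample `x`, and the parameter vectors in `thetas` are invented for illustration and are not fitted values.

```python
import math


def softmax_probabilities(x, thetas):
    """Evaluate Eq. 4 for one sample x and one parameter vector per class."""
    scores = [sum(xi * ti for xi, ti in zip(x, theta)) for theta in thetas]
    # Subtracting the largest score leaves the ratio in Eq. 4 unchanged
    # and avoids overflow in exp().
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


x = [1.0, 2.0]                                   # hypothetical sample
thetas = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.0]]  # hypothetical parameters, K = 3
probs = softmax_probabilities(x, thetas)
predicted_class = probs.index(max(probs))
```

The probabilities sum to one across the K classes, and the predicted class is the one with the largest linear score.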
In Eq. 4, $\theta_k$ is the vector of parameters associated with class k. Likewise, $X = [X_1, X_2, X_3, \ldots, X_n]$ represents the matrix of the samples and θ is the parameter matrix. Therefore, given a set of training data, the SoftMax model learns the parameter matrix through maximum likelihood [53].

Random Forest. Random Forest is a model composed of many decision trees, and it belongs to the techniques known as "ensemble" methods. It uses trained classifiers whose predictions are combined to classify new instances [13, 42]. In other words, it is a combination of tree predictors, each of which depends on an independent random vector and an input vector [6]. To make the predictions, a set of CARTs (classification and regression trees) is used, each created from a subset of the training samples drawn with replacement. In other words, a sample can be selected many times while others are not selected at all [6], a technique called "bootstrapping" [24]. The samples are divided into two data sets: one contains about two-thirds of the samples and is known as the "in-bag" sample; the other contains about one-third of the data and is known as the out-of-bag sample. These data sets are used in an internal cross-validation technique to obtain the out-of-bag error, which allows observing the performance of the random forest [3].

Support Vector Machine. This model uses a hyperplane to obtain a classifier with a maximum margin [23]. In the training phase, several separating hyperplanes can be obtained, and the classifier must select one of them to represent the decision boundary, based on how well it is expected to classify the test data. Each decision boundary is associated with two hyperplanes, which are formed by moving hyperplanes parallel to the decision boundary until they touch a support vector. The margin between these shifted hyperplanes is known as the classifier margin. It should be noted that the smaller the margin, the higher the risk of overfitting and the greater the generalization error [46]. The advantage of support vector machines (SVM) is that this model applies structural risk minimization. In other words, SVM reduces the classification error on unseen data without making prior assumptions about the probability distribution of the data. Also, this model performs well in scenarios affected by the curse of dimensionality [5].

Naive Bayes. Naive Bayes is one of the simplest yet effective classifiers [34]. The term "naive" refers to the assumption that the features of a data set are mutually independent. In practice, this assumption is often not fulfilled, yet these classifiers continue to work well [38]. This model is a linear classifier based on Bayes' theorem. The theorem builds on two probabilities. The first is the joint probability of two random variables X and Y, represented by $P(X = x, Y = y)$. The second is the conditional probability, which is the probability that a random variable takes a value given that the value of another random variable is known. This probability is represented by $P(Y = y \mid X = x)$ [46]. Bayes' theorem is represented in Eq. 5:

$$P(Y \mid X) = \frac{P(X \mid Y) \, P(Y)}{P(X)} \quad (5)$$

Where $P(Y \mid X)$ is the posterior probability of Y, also known as conditional probability, and $P(Y)$ is the prior probability. In addition, in Eq. 5, the class-conditional probability is represented by $P(X \mid Y)$, and the evidence is represented by $P(X)$, which can be ignored because its value is constant. It should be noted that the prior probability is obtained as the fraction of training records belonging to each class, and the class-conditional probability is obtained through the independence assumption of the Naive Bayes model. This assumption is represented in Eq. 6 [46]:

$$P(X \mid Y = y) = \prod_{i=1}^{d} P(X_i \mid Y = y) \quad (6)$$

Where d is the number of attributes of $X = \{X_1, X_2, X_3, \ldots, X_d\}$.
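Eqs. 5 and 6 translate into a short procedure for categorical data: estimate the prior from class frequencies, estimate each attribute's class-conditional probability from counts, and pick the class with the largest product. The sketch below is an assumption-laden illustration: the function names, the toy rows, and the Laplace smoothing term `alpha` (added so that unseen values do not produce a zero probability) are not described in the text above.

```python
import math
from collections import Counter, defaultdict


def fit_naive_bayes(rows, labels):
    """Collect the counts needed for P(Y) and P(X_i | Y) from categorical rows."""
    n = len(labels)
    class_counts = Counter(labels)
    priors = {y: c / n for y, c in class_counts.items()}
    # value_counts[(i, y)][v]: number of class-y rows whose attribute i equals v
    value_counts = defaultdict(Counter)
    values_seen = defaultdict(set)
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[(i, y)][v] += 1
            values_seen[i].add(v)
    return priors, class_counts, value_counts, values_seen


def predict_naive_bayes(model, row, alpha=1.0):
    """Return the class maximizing P(Y) * prod_i P(X_i | Y); P(X) is ignored."""
    priors, class_counts, value_counts, values_seen = model
    best_class, best_score = None, -math.inf
    for y, prior in priors.items():
        score = math.log(prior)  # log-space avoids underflow of the product
        for i, v in enumerate(row):
            numerator = value_counts[(i, y)][v] + alpha  # Laplace smoothing (assumption)
            denominator = class_counts[y] + alpha * len(values_seen[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_class, best_score = y, score
    return best_class


# Hypothetical toy data with two categorical attributes and two classes.
rows = [("yes", "high"), ("yes", "low"), ("no", "low"), ("no", "low")]
labels = ["A", "A", "B", "B"]
model = fit_naive_bayes(rows, labels)
```

Working with sums of logarithms instead of the raw product in Eq. 6 gives the same ranking of classes while remaining numerically stable when d is large.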
4 Methodology
4.1 Proposed Method
Figure 1 presents a general flow diagram of our methodology. It consists of selecting the most influential variables with the χ² technique, after applying descriptive analysis techniques. Afterwards, these variables are entered into our classification models (Multinomial Logistic Regression, Random Forest, Naive Bayes, and Support Vector Machine) in order to obtain the best classifier for the problem at hand. Algorithm 1 shows the steps that we follow in our experimental procedure:
Fig. 1. Methods used to predict IPV. Source: Own elaboration.
Algorithm 1: Feature selection and testing of the classifiers
Step 1: Data preprocessing: use the mode for missing values, and delete variables composed mostly of null values.
Step 2: Use Cramér's V to remove those variables that contribute the same information to the target value. We end up with 32 variables.
Step 3: Use χ² to sort the variables according to their importance level.
Step 4: Perform one-hot encoding on the nominal categorical variables.
Step 5: Divide our dataset into subsets of 6, 12, 18, 24, and 32 features considering the results from Steps 3 and 4.
Step 6: For each classifier, perform a cross-validation with 10 folds:
  a. Balance the training sets using SMOTE and under-sampling. We gave the minority class a share of around 10%.
  b. Apply the classification model.
  c. Validate the classification model.
Step 7: Obtain metrics from the models and compare them.
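Step 6 of Algorithm 1 can be sketched as a loop in which balancing is applied to the training folds only, so that the held-out fold keeps its original class distribution. This is a simplified illustration: random duplication of minority rows stands in for SMOTE plus under-sampling, a majority-class predictor stands in for the real classifiers, and all function names and toy data are hypothetical.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and deal them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def oversample_minority(rows, labels, target_share, rng):
    """Duplicate random rows of the rarest class until it reaches target_share.
    Stand-in for SMOTE plus under-sampling (Step 6a), not the same technique."""
    rows, labels = list(rows), list(labels)
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    pool = [r for r, y in zip(rows, labels) if y == minority]
    while labels.count(minority) / len(labels) < target_share:
        rows.append(rng.choice(pool))
        labels.append(minority)
    return rows, labels


def cross_validate(rows, labels, fit, predict, k=10, target_share=0.10, seed=0):
    """Return the accuracy obtained on each of the k held-out folds."""
    rng = random.Random(seed)
    accuracies = []
    for fold in k_fold_indices(len(rows), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(rows)) if i not in held_out]
        train_rows, train_labels = oversample_minority(
            [rows[i] for i in train], [labels[i] for i in train], target_share, rng)
        model = fit(train_rows, train_labels)                           # Step 6b
        hits = sum(predict(model, rows[i]) == labels[i] for i in fold)  # Step 6c
        accuracies.append(hits / len(fold))
    return accuracies


def fit_majority(rows, labels):
    """Placeholder classifier: remember the most frequent training label."""
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, row):
    return model


# Hypothetical data: 100 samples with a 60 / 37 / 3 class split.
rows = [[i] for i in range(100)]
labels = ["a"] * 60 + ["b"] * 37 + ["c"] * 3
scores = cross_validate(rows, labels, fit_majority, predict_majority)
```

Keeping the balancing step inside the loop prevents synthetic or duplicated minority samples from leaking into the fold used for validation.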
4.2 Experimental Design
Dataset. The dataset used was obtained from the "Ministerio de la Mujer y Poblaciones Vulnerables" (Ministry of Women and Vulnerable Populations, MIMP) of Peru. This dataset contains records of complaints made throughout 2016, across the whole country, by persons who were victims of psychological, physical, or sexual violence. The dataset covers family relationships between the aggressor and the victim, such as spouse, father/mother, brother or sister, and father-in-law, among others. The dataset comprises 70,510 records with 236 features each, which gather information related to the aggressor and the victim, such as age and gender, area of residence of the couple, educational level, pregnancy of the victim, number of children of the couple, ethnicity and native language, whether the victim and the aggressor generate income, whether they live together, the type of violence committed, and whether the woman received any treatment, among others. It also contains variables not related to the act of violence, such as the time of arrival, the location of the place where the report was made, and the actions carried out by the center where the complaint was made, among others.
It is worth mentioning that more than 95% of the dataset is composed of nominal and ordinal categorical features. Additionally, hereafter we use the terms variables and features interchangeably.
4.3 Implementation
Data Preprocessing. In order to provide a consistent dataset for the model selection stage, we removed the features unrelated to violence, as well as those with more than 70% null values, such as the Emergency Center and the complaint time, among others. Similarly, we identified the features that had empty values but were significant for our classification models. This process was done through the SPSS statistical program by exporting our original dataset into CSV format. We also filtered the dataset by the type of relationship between the aggressor and the victim (husbands, boyfriends, partners, or sexual partners, with or without children) so that it reflects partner/couple violence; furthermore, we only considered cases in which the victim was a woman and the aggressor a man. Moreover, we used the mode to fill in the features with less than 5% missing values [44], and filled with zeros the binary categorical features that had more than 70% null values. We also transformed the aggressor and victim age variables into five categories, taking as reference the biopsychosocial sub-stages of human development [29]. In addition, for variables that had null values and whose values depend on another variable, we used the iterative imputer technique. Finally, we performed a descriptive statistical analysis to determine which variables provide useful information for our classifiers. We found that some variables were highly associated with each other, such as the general state of the aggressor (sober, under the influence of alcohol, under the influence of drugs, or both), the age of the aggressor, and the age of the victim. The measure of association employed in this procedure was Cramér's V coefficient, a useful measure when analyzing the relation between two nominal variables. This indicator takes a value from 0 (no association) to 1 (perfect association).
For this research, the focus was on values greater than or equal to 0.5 (see Fig. 2).
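For reference, Cramér's V between two nominal variables is the square root of χ² divided by n times (min(r, c) − 1), where r and c are the numbers of distinct values of each variable. The sketch below illustrates the calculation together with the 0.5 threshold mentioned above; the function name `cramers_v` and the toy columns are assumptions made for this example.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Cramér's V between two nominal variables given as equal-length lists."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    chi2 = 0.0
    for a in row_totals:
        for b in col_totals:
            expected = row_totals[a] * col_totals[b] / n
            chi2 += (joint[(a, b)] - expected) ** 2 / expected
    smaller_dim = min(len(row_totals), len(col_totals))
    if smaller_dim < 2:
        return 0.0  # a constant variable carries no association
    return math.sqrt(chi2 / (n * (smaller_dim - 1)))


# Hypothetical columns: the second repeats the information of the first.
first = ["sober", "sober", "alcohol", "alcohol"]
second = ["no", "no", "yes", "yes"]
unrelated = ["u", "v", "u", "v"]

v_redundant = cramers_v(first, second)
v_unrelated = cramers_v(first, unrelated)
is_redundant = v_redundant >= 0.5  # threshold used in this research
```

Pairs of variables whose coefficient reaches the threshold carry largely the same information, so only one of them needs to be kept.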
Fig. 2. Cramér's V coefficient results, panels (a) to (d).
Feature Selection. Feature selection allowed us to choose the most influential variables to serve as input for the classification models. We used the χ² technique to obtain the degree of association between each feature and the target variable. After sorting the features according to their χ² score (most relevant features first), we built five new datasets with the 6, 12, 18, 24, and 32 most relevant features. Table 1 shows the 32 variables that compose our final dataset, where the first six attributes belong to the subset of the six most relevant attributes, the first twelve attributes belong to the subset of the twelve most relevant variables, and so on. It should be noted that the variables are sorted by order of importance.

Table 1. List of the 32 features in the dataset sorted by relevance after the χ²
32 variables
RELATIONSHIP WITH AGGRESSOR
VICTIM HAS PREVIOUSLY REPORTED THE AGGRESSION
VICTIM AGE
AGGRESSOR AGE
FREQUENCY OF THE AGGRESSION
AGGRESSOR IS JEALOUS
PREGNANCY
VICTIM HAS REMUNERATED WORK
EDUCATIONAL LEVEL OF THE VICTIM
GENERAL STATE OF THE AGGRESSOR
VICTIM ETHNICITY
EDUCATIONAL LEVEL OF THE AGGRESSOR
VICTIM LIVES WITH AGGRESSOR
VICTIM MOTHER TONGUE
VICTIM DECIDES TO SEPARATE
GENERAL STATE OF THE VICTIM
VICTIM IS IN TREATMENT
AGGRESSOR HAS HISTORY OF VIOLENCE
HOME AREA
AGGRESSOR REFUSES RELATIONSHIP WITH VICTIM
AGGRESSOR BURNED VICTIM
AGGRESSOR HAS REMUNERATED WORK
AGGRESSOR COMES FROM VICTIM
VICTIM DEMAND AGGRESSOR
AGGRESSOR POISON VICTIM
AGGRESSOR IS FOREIGN
VICTIM IS FOREIGN
AGGRESSOR GOES TO THE VICTIM HOUSE
VICTIM START A NEW RELATIONSHIP
AGGRESSOR CRASH VICTIM
AGGRESSOR FORCES VICTIM TO KNOCK OFF
CONDITION OF THE REPORT
Furthermore, one-hot encoding was applied to the nominal categorical variables, for example, ethnicity, mother tongue, and educational level.
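One-hot encoding replaces a nominal variable with one binary column per category. A minimal sketch follows; the function name `one_hot_encode` and the category labels are hypothetical examples, not the actual values in the dataset.

```python
def one_hot_encode(values):
    """Return the sorted category list and one 0/1 indicator row per value."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


# Hypothetical values of an educational-level variable.
levels = ["primary", "secondary", "none", "primary"]
categories, encoded = one_hot_encode(levels)
```

Each encoded row contains exactly one 1, so no artificial ordering is imposed on the categories.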
Model Selection. For the modelling part, we applied cross-validation with 10 folds to each of the chosen classifiers. To do this, we first balanced the dataset using the SMOTE technique together with random under-sampling, to compensate for the minority and majority classes, respectively. In this way, the class distribution of the psychological, physical, and sexual types of violence in the dataset went from 52.81%, 45.46%, and 1.73% to 47.35%, 41.77%, and 10.88%, respectively. The resulting classes are not equal in size because, in sensitive contexts like the one studied in this research, a balanced dataset does not require every class of the target to have the same number of observations. Thus, the main consideration in the process was the trade-off between a significant loss of information from the majority classes and the inclusion of a considerable percentage of artificial data in the minority class. Finally, before comparing the models, we carried out an exhaustive search for the best hyperparameters of the SVM and the Random Forest. At first, grid search was used for both models on the sets of 6, 12, 18, 24, and 32 variables, but for Random Forest it returned very high values for some hyperparameters, for example, the number of trees. For that reason, we chose to double-check our results by tuning each hyperparameter separately and using the out-of-bag (OOB) error to determine its recommended value.
4.4 Validation of the Models
The sets of 6, 12, 18, 24, and 32 variables resulting from the χ² feature selection technique were entered into our classifier models (Multinomial Logistic Regression, Random Forest, Naive Bayes, and Support Vector Machine) in order to predict intimate partner violence. We used a cross-validation technique with ten folds and obtained the mean and standard deviation of the accuracy, precision, recall, and F1 metrics. This cross-validation was used for all models, including Random Forest, even though this model uses internal bootstrapping [21].
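For a three-class problem, precision, recall, and F1 must be averaged over the classes. The sketch below assumes a support-weighted average, which is an assumption on our part since the averaging scheme is not stated; under that scheme, weighted recall is mathematically equal to accuracy. The function name `weighted_metrics` and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1 (assumed scheme)."""
    n = len(y_true)
    support = Counter(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f1 = 0.0
    for cls, count in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        predicted = sum(p == cls for p in y_pred)
        cls_precision = tp / predicted if predicted else 0.0
        cls_recall = tp / count
        denom = cls_precision + cls_recall
        cls_f1 = 2 * cls_precision * cls_recall / denom if denom else 0.0
        weight = count / n  # share of true samples belonging to this class
        precision += weight * cls_precision
        recall += weight * cls_recall
        f1 += weight * cls_f1
    return accuracy, precision, recall, f1


# Hypothetical true and predicted labels for one validation fold.
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "c", "c"]
accuracy, precision, recall, f1 = weighted_metrics(y_true, y_pred)
```

Computing these four values on every fold and then taking `statistics.mean` and `statistics.stdev` over the ten folds would yield a mean and standard deviation per metric.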
5 Results
As mentioned in Sect. 4.3, we performed hyperparameter tuning for the SVM and Random Forest models by using grid search and the OOB error. The hyperparameters obtained for the SVM classifier on all the datasets were C: 100, gamma: 0.1, and kernel: RBF; the hyperparameters for Random Forest are shown in Table 2.

Table 2. Hyperparameters for Random Forest
# Features      Hyperparameters
6 – 12          'max_depth': 40, 'n_estimators': 100
18 – 24 – 32    'max_depth': 125, 'n_estimators': 300, 'min_samples_leaf': 1
Table 3. Results obtained from the different classifiers proposed (feature selection technique: χ²)

6 variables
Classifier                       Accuracy          Precision         Recall            F1
Multinomial Logistic Regression  61.59% +/−2.22%   61.62% +/−2.07%   61.59% +/−2.22%   61.34% +/−2.18%
Random Forest                    61.15% +/−2.66%   61.30% +/−2.27%   61.15% +/−2.66%   60.46% +/−2.63%
Naïve Bayes                      58.42% +/−1.58%   58.86% +/−1.51%   58.42% +/−1.58%   58.43% +/−1.56%
SVM                              60.92% +/−2.55%   61.13% +/−2.25%   60.92% +/−2.55%   60.20% +/−2.65%

12 variables
Classifier                       Accuracy          Precision         Recall            F1
Multinomial Logistic Regression  62.67% +/−2.47%   62.68% +/−2.30%   62.67% +/−2.47%   62.33% +/−2.37%
Random Forest                    60.88% +/−8.68%   60.86% +/−8.65%   60.88% +/−8.68%   60.76% +/−8.70%
Naïve Bayes                      59.11% +/−1.63%   59.83% +/−1.49%   59.11% +/−1.63%   59.18% +/−1.54%
SVM                              63.07% +/−2.50%   63.36% +/−2.33%   63.07% +/−2.50%   62.34% +/−2.20%

18 variables
Classifier                       Accuracy          Precision         Recall            F1
Multinomial Logistic Regression  63.24% +/−2.53%   63.31% +/−2.31%   63.24% +/−2.53%   62.81% +/−2.43%
Random Forest                    61.84% +/−10.00%  61.78% +/−10.00%  61.84% +/−10.00%  61.68% +/−10.04%
Naïve Bayes                      59.84% +/−1.77%   60.71% +/−1.64%   59.84% +/−1.77%   59.90% +/−1.69%
SVM                              63.37% +/−2.51%   63.32% +/−2.35%   63.37% +/−2.51%   62.68% +/−2.35%

24 variables
Classifier                       Accuracy          Precision         Recall            F1
Multinomial Logistic Regression  63.21% +/−2.66%   63.21% +/−2.48%   63.21% +/−2.66%   62.77% +/−2.53%
Random Forest                    62.90% +/−10.26%  62.85% +/−10.26%  62.90% +/−10.26%  62.74% +/−10.31%
Naïve Bayes                      59.80% +/−1.73%   60.62% +/−1.66%   59.80% +/−1.73%   59.80% +/−1.64%
SVM                              63.46% +/−2.49%   63.70% +/−2.31%   63.46% +/−2.49%   62.78% +/−2.33%

32 variables
Classifier                       Accuracy          Precision         Recall            F1
Multinomial Logistic Regression  63.22% +/−2.53%   63.30% +/−2.32%   63.22% +/−2.53%   62.78% +/−2.38%
Random Forest                    63.36% +/−10.39%  63.28% +/−10.40%  63.36% +/−10.39%  63.17% +/−10.41%
Naïve Bayes                      59.79% +/−1.78%   60.60% +/−1.69%   59.79% +/−1.78%   59.80% +/−1.78%
SVM                              63.33% +/−2.40%   63.57% +/−2.18%   63.33% +/−2.40%   62.68% +/−2.25%
Table 3 shows the results obtained from the different classifiers with the sets of variables selected by using first Cramér's V and then the χ² technique. In this table, the columns contain the metrics analyzed, and the rows represent the models employed. Each cell gives two values: the first is the average of the results obtained from the cross-validation procedure, and the second is the standard deviation of those results. In Table 3, we can see that for the set of 6 variables, the best model is the Multinomial Logistic Regression, with an accuracy of 61.59% and an F1-score of 61.34%. For all the models, and specifically for the case of violence towards women, the focus should be not only on the accuracy but also on the F1-score, because the F1-score takes both false positives and false negatives into account. A woman could be more prone to being classified as suffering a given type of violence simply because that class has many instances, when this is not actually her situation, and vice versa. The second-best model is the SVM with an accuracy of 60.92% and an F1-score of 60.20%. For the set of 12 variables, the best model is the SVM with an accuracy of 63.07% and an F1-score of 62.34%, and the second-best model is the Multinomial Logistic Regression. With 18 features or more, the best classifier is the SVM, with a mean accuracy of 63.38% and a mean F1-score of 62.71% across those sets of features. At this point, we can hypothesize that with 18 variables the SVM reaches a plateau with no meaningful increase or decrease in the evaluated metrics; that is, no significant information is added (see Table 3).
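The plateau can be checked with simple arithmetic on the SVM rows of Table 3. The snippet below only re-uses the reported accuracy and F1 means for 18, 24, and 32 variables; the variable names are illustrative.

```python
from statistics import mean

# SVM mean accuracy and F1 (in %) reported in Table 3 for each feature set.
svm_accuracy = {18: 63.37, 24: 63.46, 32: 63.33}
svm_f1 = {18: 62.68, 24: 62.78, 32: 62.68}

mean_accuracy = mean(svm_accuracy.values())
mean_f1 = mean(svm_f1.values())

# Range of the metric across the three feature sets, in percentage points.
accuracy_spread = max(svm_accuracy.values()) - min(svm_accuracy.values())
f1_spread = max(svm_f1.values()) - min(svm_f1.values())
```

The averages come out at roughly 63.39% and 62.71%, in line with the values quoted above up to rounding, and the spread across the three feature sets (about 0.13 and 0.10 percentage points) is far smaller than the per-fold standard deviations of more than 2 points, which is compatible with the plateau hypothesis.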
6 Discussion

The descriptive statistical analysis showed that, in most of the reported cases, the aggressors’ ages are similar to those of their victims; the same holds for their educational levels. Likewise, we found that older aggressors had a lower educational level. On the other hand, younger aggressors tend to have jealousy problems, to poison their partner, and to burn her. Moreover, victims between the ages of twenty-four and sixty tend to make more reports, and aggressors between the ages of twenty-four and sixty tend to commit more acts of IPV. Finally, older aggressors commit acts of violence more frequently.

With the set of 6 variables, Multinomial Logistic Regression obtains the best performance. This may be because the classes are approximately linearly separable with the selected features. The second-best model considering the results was Random Forest. However, for the sets of 12, 18, 24, and 32 features, the standard deviation of the Random Forest metrics is around 10%, which makes it too unstable to be considered an accurate model. These high standard deviations may arise because the features have low importance in the construction of this model. Starting from the subset of 12 features, we can observe that SVM shows a slight improvement in accuracy and F1-score. This situation, in which an SVM classifier outperforms or closely matches the accuracy of a Random Forest, was also observed by Nitze [33], Phan and Kappas [35], and Kranjčić et al. [25]. Finally, when all the models are compared across all the sets of variables used, it is observed that the metrics increase as the number of variables grows, with the best results obtained with the set of 18 variables and SVM as the classifier; for parsimony, we decided to choose this model.
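The feature sets compared above were selected with Cramér's V and χ². As a rough sketch of how such a ranking can be computed, the code below evaluates both statistics on contingency tables of a categorical feature against the class label. The tables and feature names are invented for illustration, and the sketch assumes that every row and column total is nonzero.

```python
import math


def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def cramers_v(table):
    """Cramér's V = sqrt(chi2 / (n * min(rows - 1, cols - 1))), in [0, 1]."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi_square(table) / (n * k))


# Hypothetical feature-versus-class contingency tables.
tables = {
    "feature_a": [[20, 0], [0, 20]],    # perfectly associated with the class
    "feature_b": [[10, 10], [10, 10]],  # independent of the class
    "feature_c": [[15, 5], [5, 15]],    # moderately associated
}

# Rank features from most to least associated with the class.
ranking = sorted(tables, key=lambda name: cramers_v(tables[name]), reverse=True)
for name in ranking:
    print(name, round(cramers_v(tables[name]), 3))
```

Keeping the top-ranked features from such a list is one simple way to form subsets of increasing size, such as the sets of 6, 12, and 18 variables.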
7 Conclusions

We conducted experiments to find the most accurate classifier and a relevant set of variables for the problem of gender violence towards women. The three types of violence that our models tried to predict or classify were physical, psychological, and sexual violence. Of these classes, the physical and psychological classes were in the range of approximately 40% of the total, whereas sexual violence accounted for only 1% of the total data. For this reason, we had to balance the minority class by using SMOTE. Additionally, we found 32 relevant features in a dataset of 84 variables, and our experiments showed that the most accurate model is the SVM with a subset of 18 variables. To select the relevant features, we used Cramér’s V and χ². Our result, in which an SVM outperforms or obtains classification results similar to those of the second-best classifier that we found, Random Forest, is supported by other research in the current literature. The accuracy and F1-score obtained from the SVM were 63.37% with a standard deviation of ±2.51% and 62.68% with a standard deviation of ±2.35%, respectively. We believe that these metrics could be improved by considering other recent techniques, such as kernels based on random forests. An analysis of the entropy of the dataset may also be needed, in case we are facing a problem of randomness in the values of the features or variables.
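For readers unfamiliar with SMOTE [9], the following is a simplified sketch of its interpolation step, not the implementation used in this study. It assumes numeric feature vectors; the points, the function name `smote`, and the parameters `k` and `seed` are hypothetical.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Other minority samples, sorted by squared distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Made-up minority-class samples with two numeric features.
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 3.0)]
new_points = smote(minority, n_new=4)
print(new_points)
```

Each synthetic point lies on a line segment between two existing minority samples, so the minority class grows without duplicating records exactly.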
References

1. Abramsky, T., Watts, C.H., Garcia-Moreno, C., Devries, K., Kiss, L., Ellsberg, M., Heise, L.: What factors are associated with recent intimate partner violence? Findings from the WHO multi-country study on women’s health and domestic violence. BMC Pub. Health 11(1), 109 (2011)
2. Babu, B.V., Kar, S.K.: Domestic violence in Eastern India: factors associated with victimization and perpetration. Pub. Health 124(3), 136–148 (2010)
3. Belgiu, M., Drăguţ, L.: Random Forest in remote sensing: a review of applications and future directions. ISPRS J. Photogr. Remote Sens. 114, 24–31 (2016)
4. Berk, R.A., Sorenson, S.B., Barnes, G.: Forecasting domestic violence: a machine learning approach to help inform arraignment decisions. J. Empir. Legal Stud. 13(1), 94–115 (2016)
5. Bengio, Y., Delalleau, O., Le Roux, N.: The curse of dimensionality for local kernel machines. Technical report, 1258 (2005)
6. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
7. Brignone, L., Gomez, A.M.: Double jeopardy: predictors of elevated lethality risk among intimate partner violence victims seen in emergency departments. Prevent. Med. 103, 20–25 (2017)
8. Chandrashekar, G., Sahin, F.: A survey on feature selection methods. Comput. Electr. Eng. 40(1), 16–28 (2014)
9. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
10. Clark, C.J., Alonso, A., Everson-Rose, S.A., Spencer, R.A., Brady, S.S., Resnick, M.D., Borowsky, I.W., Connett, J.E., Krueger, R.F., Nguyen-Feng, V.N., Feng, S.L., Feng, S.L.: Intimate partner violence in late adolescence and young adulthood and subsequent cardiovascular risk in adulthood. Preventive Med. 87, 132–137 (2016)
11. Genuer, R., Poggi, J.-M., Tuleau-Malot, C.: Variable selection using Random Forests. Pattern Recogn. Lett. 31(14), 2225–2236 (2010)
12. Ghosh, D.: Predicting vulnerability of Indian women to domestic violence incidents. Res. Pract. Soc. Sci. 3(1), 48–72 (2007)
13. Goel, E., Abhilasha, E.: Random Forest: a review. Int. J. Adv. Res. Comput. Sci. Softw. Eng. 7(1), 251–257 (2017)
14. Hsieh, T.C., Wang, Y.-H., Hsieh, Y.-S., Ke, J.-T., Liu, C.-K., Chen, S.-C.: Measuring the unmeasurable—a study of domestic violence risk prediction and management. J. Technol. Hum. Serv. 36(1), 56–68 (2018). https://doi.org/10.1080/15228835.2017.1417953
15. Hu, S., Liang, Y., Ma, L., He, Y.: MSMOTE: improving classification performance when training data is imbalanced. In: 2009 Second International Workshop on Computer Science and Engineering (2009). https://doi.org/10.1109/wcse.2009.756
16. Instituto Nacional de Estadística e Informática: Perú: Indicadores de violencia familiar y sexual, 2000–2017 (2017)
17. Ismi, D.P., Panchoo, S., Murinto, M.: K-means clustering based filter feature selection on high dimensional data. Int. J. Adv. Intell. Inf. 2(1), 38–45 (2016)
18. Iverson, K., Litwack, S., Pineles, S., Suvak, M., Vaughn, R., Resick, P.: Predictors of intimate partner violence revictimization: the relative impact of distinct PTSD symptoms, dissociation, and coping strategies. J. Traumat. Stress 26(1), 102–110 (2013)
19. Izmirli, G., Sonmez, Y., Sezik, M.: Prediction of domestic violence against married women in southwestern Turkey. Int. J. Gynecol. Obstet. 127(3), 288–292 (2014)
20. Jewkes, R.: Intimate partner violence: causes and prevention. The Lancet 359(9315), 1423–1429 (2002)
21. Jia, J., Liu, Z., Xiao, X., Liu, B., Chou, K.-C.: pSuc-Lys: predict lysine succinylation sites in proteins with PseAAC and ensemble random forest approach. J. Theor. Biol. 394, 223–230 (2016). https://doi.org/10.1016/j.jtbi.2016.01.020
22. Jung, H., Herrenkohl, T.I., Skinner, M.L., Lee, J.O., Klika, J.B., Rousson, A.N.: Gender differences in intimate partner violence: a predictive analysis of IPV by child abuse and domestic violence exposure during early childhood. Violence Against Women 25(8), 903–924 (2019)
23. Kecman, V.: Support vector machines–an introduction. In: Support Vector Machines: Theory and Applications, pp. 1–47. Springer, Heidelberg (2005)
24. Koning, M., Smith, C.: Decision Trees and Random Forests: A Visual Introduction for Beginners: A Simple Guide to Machine Learning with Decision Trees. Seattle (2017)
25. Kranjčić, N., Medak, D., Župan, R., Rezo, M.: Machine learning methods for classification of the green infrastructure in city areas. ISPRS Int. J. Geo-Inf. 8, 463 (2019)
26. Laeheem, K., Boonprakarn, K.: Factors predicting domestic violence among Thai Muslim married couples in Pattani province. Kasetsart J. Soc. Sci. 38(3), 352–358 (2017)
27. Leonardsson, M., San Sebastian, M.: Prevalence and predictors of help-seeking for women exposed to spousal violence in India–a cross-sectional study. BMC Women’s Health 17(1), 99 (2017)
28. Longadge, R., Dongre, S.: Class imbalance problem in data mining review. arXiv preprint arXiv:1305.1707 (2013)
29. Mansilla, M.: Etapas del desarrollo humano. Revista de investigación en Psicología 3(2), 105–116 (2000)
30. Ministerio de la Mujer y Poblaciones Vulnerables: Impacto y consecuencias de la violencia contra las mujeres. Lima (2017)
31. Moraes, C.L., de Tavares da Silva, T.S., Reichenheim, M.E., Azevedo, G.L., Dias Oliveira, A.S., Braga, J.U.: Physical violence between intimate partners during pregnancy and postpartum: a prediction model for use in primary health care facilities. Paediatr. Perinat. Epidemiol. 25(5), 478–486 (2011)
32. Moyano, N., Monge, F.S., Sierra, J.C.: Predictors of sexual aggression in adolescents: Gender dominance vs. rape supportive attitudes. Eur. J. Psychol. Appl. Legal Context 9(1), 25–31 (2017)
33. Nitze, I., Schulthess, U., Asche, H.: Comparison of machine learning algorithms random forest, artificial neural network and support vector machine to maximum likelihood for supervised crop type classification. In: Fourth International Conference on Geographic Object-Based Image Analysis (GEOBIA), 035, Rio de Janeiro, 7–9 May 2012 (2012)
34. Parsian, M.: Data Algorithms: Recipes for Scaling Up with Hadoop and Spark. O’Reilly Media, Inc., Sebastopol (2015)
35. Phan, T.-N., Kappas, M.: Comparison of random forest, k-nearest neighbor, and support vector machine classifiers for land cover classification using sentinel-2 imagery. Sensors 18, 18 (2017). https://doi.org/10.3390/s18010018
36. Pueyo, A., Redondo Illescas, S.: Predicción de la violencia: Entre la peligrosidad y la valoración del riesgo de violencia. Papeles del Psicólogo 157–173 (2007)
37. Rachburee, N., Punlumjeak, W.: A comparison of feature selection approach between greedy, IG-ratio, Chi-square, and mRMR in educational mining. In: 7th International Conference on Information Technology and Electrical Engineering (ICITEE) (2015)
38. Raschka, S.: Naive Bayes and text classification i-introduction and theory. arXiv preprint arXiv:1410.5329 (2014)
39. Saeys, Y., Inza, I., Larrañaga, P.: A review of feature selection techniques in bioinformatics. Bioinformatics 23(19), 2507–2517 (2007)
40. Saile, R., Neuner, F., Ertl, V., Catani, C.: Prevalence and predictors of partner violence against women in the aftermath of war: a survey among couples in Northern Uganda. Soc. Sci. Med. 86, 17–25 (2013)
41. Schafer, K.R., Brant, J., Gupta, S., Thorpe, J., Winstead-Derlega, C., Pinkerton, R., Laughon, K., Ingersoll, K., Dillingham, R.: Intimate partner violence: a predictor of worse HIV outcomes and engagement in care. AIDS Patient Care STDs 26(6), 356–365 (2012)
42. Sheridan, R.P.: Using random forest to model the domain applicability of another random forest model. J. Chem. Inf. Model. 53(11), 2837–2850 (2013)
43. Silva, J., Aleman, E.G., Acuña, G.C., Bilbao, O.R., Hernandez-P.H., Castro, B.L., Meléndez, P.A., Neira, D.: Use of artificial neural networks in determining domestic violence predictors. In: International Conference on Swarm Intelligence, pp. 132–141. Springer, Cham, July 2019
44. Suthar, B., Patel, H., Goswami, A.: A survey: classification of imputation methods in data mining. Int. J. Emerg. Technol. Adv. Eng. 2(1), 309–312 (2012)
45. Swartout, K.M., Cook, S.L., White, J.W.: Trajectories of intimate partner violence victimization. West. J. Emerg. Med. 13(3), 272 (2012)
46. Tan, P.N., Steinbach, M., Kumar, V.: Introduction to Data Mining. Pearson Education India (2016)
47. Ting, K.M.: Confusion Matrix. Encyclop. Mach. Learn. Data Min. 260–260 (2017). https://doi.org/10.1007/978-1-4899-7687-1_50
48. Tjaden, P., Thoennes, N.: Prevalence, Incidence, and Consequences of Violence Against Women: Findings from the National Violence Against Women Survey. National Institute of Justice Centers for Disease Control and Prevention. Research in Brief (1998)
49. Wang, Y.: A multinomial logistic regression modeling approach for anomaly intrusion detection. Comput. Secur. 24(8), 662–674 (2005)
50. Wijenayake, S., Graham, T., Christen, P.: A decision tree approach to predicting recidivism in domestic violence. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 3–15. Springer, Cham, June 2018
51. Xiang, Y., Xie, Y.: Imbalanced data classification method based on ensemble learning. In: International Conference in Communications, Signal Processing, and Systems, pp. 18–24. Springer, Singapore, July 2018
52. Xing, E.P., Jordan, M.I., Karp, R.M.: Feature selection for high-dimensional genomic microarray data. In: ICML, vol. 1, pp. 601–608, June 2001
53. Yin, M., Zeng, D., Gao, J., Wu, Z., Xie, S.: Robust multinomial logistic regression based on RPCA. IEEE J. Sel. Top. Sig. Process. 12(6), 1144–1154 (2018)
Data Consortia

Eric Bax, John Donald, Melissa Gerber, Lisa Giaffo, Tanisha Sharma, Nikki Thompson, and Kimberly Williams

Future Studies Group, Verizon Media, Los Angeles, USA
[email protected]
Abstract. We consider the potential for groups of consenting, informed users to pool their data for their own benefit and that of society. Given the trajectory of concerns and legal changes regarding privacy and user control over data, we offer a high-level design for a system to harness user data in ways that users specify, beyond just the targeting of ads, with compensation offered for the use of data. By reviewing examples, we show that most parts of such a system already exist separately, and the remaining parts, such as user compensation based on the value of their data, are being designed now.
Keywords: Big Data · Data dividend · Privacy · Data · Society

1 Introduction
Web-based organizations access, store, and analyze user data in ways that enhance their users’ lives. Users see more relevant search results more quickly because their past searches are used to determine which results are most likely of interest to them and, collectively, to offer a selection of query completions, saving the need to type long, exact queries. Email users benefit from having their emails stored, organized, and indexed for quick search. They also benefit from email providers analyzing patterns across emails to determine which emails are spam or contain links to malware. Users of online media services benefit from collaborative filtering over user viewing and listening data to provide better recommendations about what to experience next. In each case, users allow organizations to access data about the users in order to provide the users with better service.

Web-based organizations also analyze user data to monetize their services [12,21]. Many web-based services are offered to users without direct cost. In exchange, users experience advertisements, which are selected based on user data. Done well, targeted advertising means that users experience relevant messages about products, brands, and issues that interest them, advertisers connect with interested users, and the quality of advertising reflects positively on the media itself. Done poorly, it can leave users with the impression that their attention has been abused and that their data has been used against them, enabling uninteresting, obnoxious, or offensive messages to chase users around the web [3,14,32]. The same can occur with paid services as well, offending users with efforts to elicit future payments, to elicit feedback on their services or providers, or to recruit users as an informal salesforce targeting their friends.

Concerns about these and other issues around the use of data have prompted legislative action in the European Union, in the form of the General Data Protection Regulation (GDPR) [2], and legislative interest in the United States, most notably in the form of congressional hearings. Tim Cook, CEO of Apple, has spoken in favor of regulations that allow users to have more knowledge of and control over how their data is used [29]. Users have begun to ask questions like “What is being done with my data without my knowledge?” and “Is my data being used against me?” Users are beginning to understand that their data has value [6]. In the future, in the same way that people expect a return on money, we predict that users will expect a return on the aggregation and analysis of their data. The return may be in the form of direct financial or other benefits for data contributors, or in the form of benefits for society at large.

© Springer Nature Switzerland AG 2021. K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 489–498, 2021. https://doi.org/10.1007/978-3-030-63089-8_31
2 Data Consortia
Fig. 1. A data consortium collects data from members. An administration team may then augment the data, analyze it, and apply the analysis to inform, bargain for, and invest on behalf of the members.
Our goal is to explore the potential for groups of people to pool and invest their data for individual or societal returns. We use the term data consortium to mean an organization with members and an administration team, in which

– Data consortium members grant data access to a consortium’s administration team. Each member selects which data sources it allows the consortium to access, and which filters to place on data collected by the consortium. For example, a user may select to share their email inbox and their browsing behavior with the consortium, but not records of which shows they watch on their television. And a user may agree to grant access to their email inbox, but only for receipts, not for personal emails, and only for the amounts of receipts, items purchased, and company sending the receipt.
– The consortium’s administration team establishes access to the data it is permitted to collect. This step may include receiving external access to the data as if a user or making an agreement with a data collecting entity, such as an email provider or web browser supplier, to receive the data.
– The consortium’s administration team uses computer systems to aggregate the user data and analyze it. The analysis may include trends, analysis by locale and demographics, weighting data for different users to make the consortium data a better reflection of some other population, and identifying sets of consortium members whose data most effectively predicts different types of trends. One example of analysis is to compute aggregated trend data about consortium members’ purchases from different companies, indicating which companies are increasing their revenue and which are decreasing.
– The consortium’s administration team may sell some analysis, for example to investment companies, and pay some of the proceeds to consortium members.
– The administration team itself may use the analysis as an input to decide on which companies to invest in, returning some profits to or giving some payment to consortium members, and/or give consortium members beneficial terms as investors, and/or restrict investment based on the analysis only to consortium members.
– The consortium administration team may publish the results of some analysis to the consortium members. This may include fashion-buying trends, alerts when infectious disease levels are elevated in a locale, trends in popularity of TV shows or movies by demographics, whether people are buying more books or using libraries more, and which food delivery services receive the fewest complaints and cancellations.
– Results of analysis may be used to automatically adjust users’ data-generating experiences. For example, if analysis shows that most consortium members who buy from a company return what they have bought, then a web browser extension for consortium members may warn members who navigate to the company’s web site.
– Payments to consortium members may be based on the amount of data they contribute and on how useful that data is for making accurate predictions, profitable decisions, or valuable insights.
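Two of the steps listed above, member-specified filtering and aggregated purchase trends, can be sketched in a few lines. This is only an illustrative sketch: the record layout, field names, company names, and amounts are all hypothetical and do not describe an existing system.

```python
from collections import defaultdict

# Hypothetical receipt records extracted from members' inboxes.
RECEIPTS = [
    {"member": "m1", "company": "Acme", "month": "2020-01", "amount": 40.0, "note": "gift"},
    {"member": "m2", "company": "Acme", "month": "2020-01", "amount": 10.0, "note": "personal"},
    {"member": "m1", "company": "Acme", "month": "2020-02", "amount": 70.0, "note": "personal"},
    {"member": "m2", "company": "Bolt", "month": "2020-01", "amount": 30.0, "note": "personal"},
    {"member": "m3", "company": "Bolt", "month": "2020-02", "amount": 15.0, "note": "gift"},
]

# Hypothetical filter: fields that members agreed to share.
ALLOWED_FIELDS = {"company", "month", "amount"}


def apply_filter(record, allowed_fields):
    """Keep only the fields the member agreed to share."""
    return {k: v for k, v in record.items() if k in allowed_fields}


def spending_by_company_month(records):
    """Aggregate members' spending per (company, month)."""
    totals = defaultdict(float)
    for r in records:
        totals[(r["company"], r["month"])] += r["amount"]
    return dict(totals)


def trend(totals, company, earlier, later):
    """Change in aggregate member spending at a company between two months."""
    return totals.get((company, later), 0.0) - totals.get((company, earlier), 0.0)


shared = [apply_filter(r, ALLOWED_FIELDS) for r in RECEIPTS]
totals = spending_by_company_month(shared)
print("Acme:", trend(totals, "Acme", "2020-01", "2020-02"))
print("Bolt:", trend(totals, "Bolt", "2020-01", "2020-02"))
```

The filter runs before aggregation, so fields a member did not agree to share (here, the member identifier and the free-text note) never reach the analysis step.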
Figure 1 illustrates the flows of data and compensation among the parties and functional parts of a data consortium. The administration team may be part of a larger organization, for example a team within an organization that already hosts member data. Alternatively, the administration team may be drawn from the members and perhaps use software and processes supplied by a third party that specializes in organizing data consortia, or the administration team may be a third-party company that serves that function for one or more data consortia.
Each data consortium may use member data for a specific purpose, with members perhaps joining multiple data consortia for different purposes, each with an administration team focused on its purpose – a type of vertical specialization. Or there may be horizontal specialization: different organizations developing expertise in different tasks, such as data extraction, aggregation, analysis and execution for different purposes, with administration teams relying on these external organizations to accomplish some of their tasks, making it easier for a single data consortium to use member data for multiple purposes. Organizations that currently store and manage user data may offer computation and interfaces for data consortia to more conveniently access member data, charging for the service while perhaps enhancing security and lowering bandwidth requirements by doing some of the data filtering and analysis at the site of the data before transmitting it to the data consortia. Data access and expertise gives these organizations an advantage should they choose to offer the services of a data consortium to their existing users. The next few sections discuss in more detail some potential use cases in a few areas: financial investment, consumer spending, and informed living. Then we conclude with a discussion of potential challenges for data consortia.
3 Financial Investment
Aggregated user-generated data can provide accurate and actionable financial forecasts and measurements. Examples include economic forecasts based on search data [10,16,24] and equity investment decisions [30,33,34] and market insights [18,19] based on analyzing email receipts.

Data-based investment requires data, analysis, and capital. In the most encapsulated form, data consortium members could supply all three, forming an online version of an investment club. Alternatively, members could supply data and capital, and arrange for a financial institution to supply analysis, along with decision-making, trade execution, and accounting. A data consortium could also use outside capital, allowing external investors who could pay a fee to the consortium for use of its data. In the least encapsulated forms, a data consortium simply sells investors or investment institutions access to aggregated member data or analysis of it.

Using data consortium member data for investment decisions raises some interesting issues and questions. One issue is inadvertent insider trading. Members’ insider status would need to be recorded in order to avoid collecting data from insiders, and other members’ data may also need to be checked for whether it contains insider information. This concern is smaller when scanning only for receipts and greater for unstructured text, which may hold valuable information and may be explored in a much more automated way than receipts, making it more difficult to structurally assess what is and is not insider information.

One question is how to value and compensate for data vs. analysis and execution vs. capital. Inevitably, the answers will vary, as they do for the division of investment gains and losses today between capital and administration. For some data consortia, members may wish to be compensated based on the value of their individual data. That may seem impossible, much as micro payments once did. However, a combination of big-data analysis and micro-economic theory can do just this. For details, refer to [4,8,22].
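The methods of [4,8,22] are not reproduced here. As one illustrative possibility, and purely as an assumption rather than a description of those works, the value of each member's data could be apportioned with a Shapley-style average of marginal contributions. The member names and the profit table below are made up.

```python
from itertools import permutations


def shapley_values(members, value):
    """Average marginal contribution of each member over all join orders."""
    totals = {m: 0.0 for m in members}
    orders = list(permutations(members))
    for order in orders:
        coalition = frozenset()
        for m in order:
            with_m = coalition | {m}
            totals[m] += value(with_m) - value(coalition)
            coalition = with_m
    return {m: t / len(orders) for m, t in totals.items()}


# Hypothetical profit obtainable from each subset of members' data.
PROFIT = {
    frozenset(): 0.0,
    frozenset("a"): 10.0,
    frozenset("b"): 10.0,
    frozenset("c"): 0.0,
    frozenset("ab"): 20.0,
    frozenset("ac"): 30.0,
    frozenset("bc"): 10.0,
    frozenset("abc"): 40.0,
}

shares = shapley_values(["a", "b", "c"], PROFIT.__getitem__)
print(shares)
```

In this toy table the shares add up to the full profit of 40, and member c receives a positive share even though c's data is worth nothing alone, because it raises the value of a's data. Enumerating all join orders is only feasible for a handful of members; a real consortium would need an approximation.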
4 Consumer Spending
There are existing services designed to use individual user data for the financial advantage of users. Earny and Paribus analyze member emails for receipts, then monitor prices for items bought. If prices fall within a specified time period, then Earny and Paribus arrange partial refunds for their members. Trim analyzes emails and other personal data sources to detect subscriptions, present them to users, and cancel subscriptions on behalf of users who no longer desire them, or, in some cases, were never even aware of them. BillShark analyzes user billing data and negotiates on behalf of users to lower their bills. Note that these services also use aggregate user data, even if indirectly. Earny, Paribus, and Trim use the fact that they act on behalf of many users to achieve economies of scale through automation, making it much easier for users to get discounts or cancel subscriptions through their services than on their own. Trim and BillShark can invest much more time and resources to learning which deals are the best available than individual consumers can, because these companies’ representatives become experts through experience and because they can amortize the cost of research over their customer bases. A data consortium could use the information from member data to better inform bargaining. Members who are already getting good deals may be the best source of information about which packages or promotions are the best available for members who are getting worse deals. This need not require effort by members who are getting the best deals, just access to their data. This applies to subscriptions for services, and also to tuition and fees for universities, for payments for medical procedures, and to salary and benefits for employees. Many cities have transit riders’ unions, which advocate on behalf of users of public transportation. Having access to member data could make these organizations more effective. 
For example, if they could analyze their members’ aggregated travel patterns in detail, then they could advocate for specific investments to decrease crowding on the most congested routes or to extend routes to where the most members are traveling by other, more expensive, means.

Data consortia could combine collective bargaining with aggregated member data to give consumers more power. For example, a data consortium whose membership includes a significant fraction of the users of a service might develop an agreement among some of those users that if the price of the service rises above a specified amount, then those users will all unsubscribe from the service. Similarly, the data consortium might develop an agreement that some number of its members who are not using a service provider will switch to it if they can all get a specified bargain rate. It will be interesting to see whether and how aggregated data, analyzed on behalf of consumers, leads to collective action by consumers.
5 Informed Living
Aggregate data can improve quality of life. Searches and social media data can detect influenza outbreaks [5,17]. Using a combination of genealogy and genetic and medical data for the people of Iceland, deCODE was founded to develop medical diagnostics and drugs [27]. This information has led to important discoveries in medicine and anthropology [1,11,15]. The nation of Bhutan collects data to regularly evaluate a measure called Gross National Happiness [35,36], to understand the state of well-being and the needs of its people.

For health, access to member data should allow data consortia to more accurately detect and even predict local outbreaks of illnesses. For example, access to emails could allow a consortium to identify which school or schools, and even which classes, each member’s children attend. Combined with member search data or phone records, a consortium could identify which classes at a school are beginning to host an illness and also which nearby schools are likely to be infected next, through transmission via siblings. This would give parents a warning to find care options for kids or to perhaps keep them away from school for the most dangerous day or two.

On a longer scale, access to member medical records and genetic data, similar to the Iceland database, and perhaps activity data from cell phones and fitness wearables and electronic receipts for food and activities, would give data consortia the ability to warn individual members about which conditions should concern them most, how to test for them, how to treat them, and, ideally, how to avoid them through lifestyle changes. Many health conditions in later life have their roots in habits or exposure in younger life; a health-focused data consortium could help members draw the connections and also help them distinguish the health issues that have the most impact on their lives from the noise.
For happiness, sentiment analysis [9,26,28,31] of member data, such as texts and emails and media consumed, can provide useful insights into the emotional states of members. In aggregate, this data can inform lifestyle decisions for individuals. For example, it would be useful to know whether people who make a decision to walk or bike more instead of using a car for short trips become happier as a result. It would be interesting – if perhaps somewhat controversial – to find out whether adopting a dog or a cat makes a person happier. Sentiment analysis could also inform larger decisions. When selecting among jobs, selecting a town or city, or even selecting a neighborhood or building, people would find it useful to know where people tend to be happier, more optimistic, or more outgoing, and whether those things are changing for the better over time.

In addition to providing this information to members, data consortia could sell sentiment analysis of aggregated member data to organizations. For retailers, it would be useful to find out whether customers find that they are happier after visiting the retailer’s store or making purchases; the same information about competitors would also be valuable. For governments, it would be valuable to understand emotional well-being and its changes at a local level. It could inform policy decisions, in conjunction with economic measurements and projections.
It will be interesting to see the impact of sentiment measurement on strategy. Will some governments work for the long-term well-being of their people, while others find that their power derives in some part from a certain level of disaffection among the people and so seek to perpetuate it? Will governments find it more efficient to improve the well-being in places that are lacking it or to encourage people to move to places where people are already living well and abandon unhappy places? Will local growth itself be a driver of well-being for some places but a harm for others? Similarly, will some companies attempt to make their customers fundamentally happy, while others seek to create a short-term emotional high, followed by a low that makes customers crave renewal, prompting another purchase?
6 Discussion – Challenges
Data consortia – organizations designed to use data on behalf of users – have great potential, but they will face, and perhaps create, some novel challenges.

Structurally, as a data consortium gains members, it gains statistical accuracy in its analysis and may also gain purchasing power. So a larger data consortium may offer more value to each member than a smaller one. As a result, one large data consortium with a natural monopoly may dominate for each purpose – investing, spending, health, etc. – and a large monopoly established for one purpose may find that it has a natural advantage in pursuing other purposes as well, simply because it already has access to data from, and established relationships with, so many members.

However, where gains are shared among members, there may be pressure to keep data consortia small. In general, accuracy increases with the square root of the number of members, so there are diminishing returns to statistical analysis for each new member. If a data consortium can sell aggregate analysis of its members’ data for a fixed price if it reaches a threshold of statistical significance, then members should only want enough other members to achieve that threshold. On the other hand, if members are also investors, then the returns scale with the number of members, while accuracy also increases and overhead costs per member decrease. So members should welcome new members. For investing, though, at some scale, trades tend to move the market against the trader, perhaps counterbalancing the advantages of adding more members.

Data consortium members may wish to be compensated directly for providing their data. If the analysis leads to decisions that have gains or losses, then the methods discussed in [4,8,22] can be applied. In other cases, members will have to come to agreements with consortia. Different members’ data may have different value for different purposes.
For example, a consortium that needs to draw inferences that apply to a larger population may have some segments of that population overrepresented among its members and other segments underrepresented. In general, this will make data from underrepresented-segment members more valuable, because it helps fill in gaps in analysis for the population as a whole. As another example, if a health-focused data consortium finds that a new
drug is highly effective, then data from the first members who take the new drug and demonstrate its efficacy is very valuable, and the same is true for members who try out lifestyle changes that prove to be beneficial. In these cases, the value of a member’s data is not necessarily known a priori; it can only be accurately assessed after the data are collected, outcomes are measured, and maybe even after the benefits accrue to other members. Biased [7,25] and false data [13,20,23] will be a challenge for data consortia. Suppose a data consortium sells aggregated information about the shopping habits of a small segment of people whose habits are of interest to a larger group of people, who wish to imitate them. The data consortium consists of members from the small segment, who are paid for their data. Then there is bias in the sense that the members are only from the subsegment of the small segment who are willing to give access to their data in exchange for payment. And there will be false data, because people who are not members of the small segment will attempt to falsely claim that they are, in order to become members of the data consortium. Also, sellers will have an incentive to sell to data consortium members on preferential terms or to create false identities and data streams for people in the small segment, make them members, and place receipts for their items in those false data streams. Similarly, suppose that a data consortium pays members more if they are part of some demographic segment, in order to create a balanced panel. Then members will have an incentive to falsely claim to be part of that demographic segment. Suppose that a data consortium examines email receipts of its members and buys stakes in companies that have increased receipts. Then companies have incentives to target members for marketing and even to email them false receipts. 
Just as there are myriad schemes to inflate ratings and sales numbers on many commercial websites, there will be efforts to feed data consortia false data.
References 1. Letters from Iceland. Nat. Genetics, 47, 425 (2015) 2. Regulation (eu) 2016/679 of the european parliament and of the council. Official J. Eur. Union (2016) 3. Abrams, Z., Schwarz, M.: Ad auction design and user experience. Applied Economics Research Bulletin, Special Issue I(Auctions), 98–105 (2008) 4. Agarwal, A., Dahleh, M., Sarkar, T.: A marketplace for data: an algorithmic solution. Econ. Comput. (EC) 2019, 701–726 (2019) 5. Alessa, A., Faezipour, M.: A review of influenza detection and prediction through social networking sites. Theoret. Biol. Med. Model. 15(1), 2 (2018) 6. Au-Yeng, A.: California wants to copy Alaska and pay people a ‘data dividend.’ is it realistic? Forbes.com, 14 February 2019 (2019) 7. Baeza-Yates, R.: Bias on the web. Commun. ACM 61(6), 54–61 (2018) 8. Bax, E.: Computing a data dividend. EC (Economics and Computation) (2019) 9. Cambria, E., Schuller, B., Xia, Y., Havasi, C.: New avenues in opinion mining and sentiment analysis. IEEE Intell. Syst. 28(2), 15–21 (2013) 10. D’Amuri, F., Marcucci, J.: The predictive power of Google searches in forecasting US unemployment. Int. J. Forecast. 33(4), 801–816 (2017)
11. Ebenesersdóttir, S.S., Sandoval-Velasco, M., Gunnarsdóttir, E.D., Jagadeesan, A., Guðmundsdóttir, V.B., Thordardóttir, E.L., Einarsdóttir, M.S., Moore, K.H.S., Magnúsdóttir, D.N., Jónsson, H., Snorradóttir, S., Hovig, E., Sigurðsson, Á., Møller, P., Kockum, P.I., Olsson, T., Alfredsson, L., Hansen, T.F., Werge, T., Cavalleri, G.L., Gilbert, E., Lalueza-Fox, C., Walser, J.W., Kristjánsdóttir, S., Gopalakrishnan, S., Árnadóttir, L., Magnússon, Ó.Þ., Gilbert, M.T.P., Stefánsson, K., Helgason, A.: Ancient genomes from Iceland reveal the making of a human population. Science, 360(6392), 1028–1032 (2018) 12. Edelman, B., Ostrovsky, M., Schwarz, M.: Internet advertising and the generalized second-price auction: selling billions of dollars worth of keywords. Am. Econ. Rev. 97(1), 242–259 (2007) 13. Elmurngi, E., Gherbi, A.: An empirical study on detecting fake reviews using machine learning techniques. In: Seventh International Conference on Innovative Computing Technology (INTECH), pp. 107–114 (2017) 14. Goldfarb, A., Tucker, C.: Online display advertising: targeting and obtrusiveness. Market. Sci. 30(3), 389–404 (2011) 15.
Gudbjartsson, D.F., Helgason, H., Gudjonsson, S.A., Zink, F., Oddson, A., Gylfason, A., Besenbacher, S., Magnusson, G., Halldorsson, B.V., Hjartarson, E., Sigurdsson, G.T., Stacey, S.N., Frigge, M.L., Holm, H., Saemundsdottir, J., Helgadottir, T.H., Johannsdottir, H., Sigfusson, G., Thorgeirsson, G., Th Sverrisson, J., Gretarsdottir, S., Walters, G.B., Rafnar, T., Thjodleifsson, B., Bjornsson, E.S., Olafsson, S., Thorarinsdottir, H., Steingrimsdottir, T., Gudmundsdottir, T.S., Theodors, A., Jonasson, J.G., Sigurdsson, A., Bjornsdottir, G., Jonsson, J.J., Thorarensen, O., Ludvigsson, P., Gudbjartsson, H., Eyjolfsson, G.I., Sigurdardottir, O., Olafsson, I., Arnar, D.O., Magnusson, O.T., Kong, A., Masson, G., Thorsteinsdottir, U., Helgason, A., Sulem, P., Stefansson, K.: Large-scale whole-genome sequencing of the Icelandic population. Nat. Genet. 47, 435 (2015) 16. Hellerstein, R., Middeldorp, M.: Forecasting with internet search data. Liberty Street Economics (2012) 17. Sharpe, J.D., Hopkins, R.S., Cook, R.L., Striley, C.W.: Evaluating google, twitter, and wikipedia as tools for influenza surveillance using bayesian change point analysis: a comparative analysis. JMIR Public Health Surveill. 2(2) (2016) 18. Kooti, F., Grbovic, M., Aiello, L.M., Bax, E., Lerman, K.: iphone’s digital marketplace: characterizing the big spenders. In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, WSDM 2017, pp. 13–21, New York, NY, USA. ACM (2017) 19. Kooti, F., Grbovic, M., Aiello, L.M., Djuric, N., Radosavljevic, V., Lerman, K.: Analyzing uber’s ride-sharing economy. In: Proceedings of the 26th International Conference on World Wide Web Companion, WWW 2017 Companion, pp. 574– 582, Republic and Canton of Geneva, Switzerland. International World Wide Web Conferences Steering Committee (2017) 20. Lappas, T., Sabnis, G., Valkanas, G.: The impact of fake reviews on online visibility: a vulnerability assessment of the hotel industry. Inf. Syst. Res. 
27(4) (2016) 21. Levin, J.: The economics of internet markets. Discussion Papers 10-018, Stanford Institute for Economic Policy Research, February 2011 22. Mehta, S., Dawande, M., Janakiraman, G., Mookerjee, V.: How to sell a dataset? Pricing policies for data monetization. Economics and Computation (EC) 2019 (2019) 23. Mukherjee, A., Liu, B., Glance, N.: Spotting fake reviewer groups in consumer reviews. In: Proceedings of the 21st International Conference on World Wide Web, WWW 2012, pp. 191–200, New York, NY, USA. ACM (2012)
24. Onder, I., Gunter, U.: Forecasting tourism demand with google trends for a major European city destination. Tourism Analysis 21, 203–220 (2015) 25. O’Neil, C.: Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown Publishing Group, New York (2016) 26. Andrew, O., Clore, G., Collins, A.: The Cognitive Structure of Emotions. Cambridge University Press, Cambridge (1988) 27. Palmer, K.M.: Why Iceland is the world’s greatest genetic laboratory. Wired.com (2015) 28. Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up? sentiment classification using machine learning techniques. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 79–86 (2002) 29. Ramli, D.: Apple’s tim cook calls for more regulations on data privacy. Bloomberg.com (2018) 30. Rogovskyy, V.: How companies use alternative data and AI in fintech market. Intellias.com (2018) 31. Stevenson, R., Mikels, J., James, T.: Characterization of the affective norms for English words by discrete emotional categories. Behav. Res. Methods 39, 1020– 1024 (2007) 32. Stourm, V., Bax, E.: Incorporating hidden costs of annoying ads in display auctions. Int. J. Res. Market. (2017) 33. Thomas, A.: Email receipts used to forecast amazon and uber revenues. Quandl.com (2016) 34. Thomas, A.: How email receipts predicted gopro’s Q3 earnings. Quandl.com (2016) 35. Ura, K., Alkire, S., Zangmo, T., Wangdi, K.: An Extensive Analysis of GNH Index. The Centre for Bhutan Studies (2012) 36. Ura, K., Alkire, S., Zangmo, T., Wangdi, K.: A Short Guide to Gross National Happiness Index. The Centre for Bhutan Studies (2012)
StreamNet: A DAG System with Streaming Graph Computing

Zhaoming Yin1,3(B), Anbang Ruan2, Ming Wei2, Huafeng Li3, Kai Yuan3, Junqing Wang3, Yahui Wang3, Ming Ni3, and Andrew Martin4

1 StreamNet Chain LLC, Hangzhou, Zhejiang 310007, China [email protected]
2 Octa Innovation, Beijing 100036, China {ar,wm}@8lab.cn
3 TRIAS Lab, Hangzhou, Zhejiang 310008, China {lhf,yuankai,wangjunqing,wangyahui,ming.ni}@trias.one
4 University of Oxford, Oxford OX1 4BH, England [email protected]
http://www.streamnet-chain.com/, https://www.8lab.cn/, https://www.trias.one/, https://www.cs.ox.ac.uk/people/andrew.martin/

Abstract. To achieve high throughput in POW-based blockchain systems, researchers have proposed a series of methods, and DAG-based designs are among the most active and promising. We designed and implemented StreamNet, aiming to engineer a scalable and durable DAG system. When a new block is attached to the DAG, only two tips are selected. One is the 'parent' tip, defined as in Conflux; the other is selected with the Markov Chain Monte Carlo (MCMC) technique, defined as in IOTA. We infer a pivotal chain along the path of each epoch in the graph, so that a total order of the graph can be calculated without a centralized authority. To scale up, we leverage the streaming property of the graph, so that a high transaction validation speed is maintained even as the DAG grows. To scale out, we designed the 'direct signal' gossip protocol to disseminate block updates, so that messages are passed through the network more efficiently. We implemented our system on top of IOTA's reference code (IRI) and ran comprehensive experiments over clusters of different sizes and multiple network topologies.

Keywords: Blockchain · Graph theory · Consensus algorithm

1 Introduction
© Springer Nature Switzerland AG 2021. K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 499–522, 2021. https://doi.org/10.1007/978-3-030-63089-8_32

Since Bitcoin [36] was proposed, blockchain technology has been studied for 10 years. Blockchain technologies have been adopted extensively in real-world applications such as financial services, with potential regulation challenges [34,46], supply chains [4,25,47], health care [6,49], and IoT devices [9]. The core of blockchain technology is the consensus algorithm applied to the open distributed computing world, where computers can join and leave the network and can cheat. As the first protocol that can solve the so-called Byzantine generals problem, the Bitcoin system suffers from a low transaction rate, approximately 7 transactions per second (TPS), and a long confirmation time (about an hour). As more machines join the network, they compete for the privilege of attaching the next block (mining), which results in a massive waste of electric power, while skyrocketing fees are paid to make sure that money transfers are placed in the chain. Meanwhile, there are multiple proposals to solve the low transaction speed issue. One approach tries to solve the speed problem without changing the chain data structure, for instance segregated witness [31], or off-chain technologies such as the Lightning Network [39] or Plasma [38]. Another approach hard-forks the Bitcoin protocol; for example, Bitcoin Cash tries to improve the throughput of the system by enlarging the data size of each block from 1 MB to 4 MB. To minimize the computational cost of POW, multiple organizations have proposed a series of proof-of-stake (POS) methods [12,14,18,20,48], under which the privilege to attach a block is proportional to a participant's token share. Another idea, which uses the power spent on POW for useful and meaningful tasks such as training machine learning models, has also been proposed [32]. Besides, inspired by the PBFT algorithm [8] and a set of related variations, the so-called hybrid (or consortium) chain was proposed. The general idea is a two-step algorithm: the first step elects a committee, and the second step uses the committee to run PBFT for consensus. Bitcoin-NG [17] is an early adopter of this idea; it splits the blocks of Bitcoin into two groups, one for master election and another for regular transaction blocks.
Honey-badger [35] is the system that first introduced the consensus committee; it uses predefined members to perform the PBFT algorithm to reach consensus. The Byzcoin system [24] brought forth the idea of using POW for the committee election and uses a variation of PBFT called collective signing for speed. Algorand [19] uses a random function to elect a committee and uses this committee to commit blocks anonymously; each member of the committee has only one chance to commit a block. Other popular systems include Ripple [41], Stellar [33], and COSMOS [26]. All these systems have one feature in common: they split the players in the network into layers, which results in implementation complexity. While the methods above aim to avoid side chains, another thread of effort uses a directed acyclic graph (DAG) to merge side chains. The first such idea was to grow the blockchain as a tree instead of a chain [44], which resulted in the well-known GHOST protocol [45]. If a block links to two or more previous blocks, the data structure grows as a DAG instead of a tree [28]; SPECTRE [42] and PHANTOM [43] are systems of this type. Byteball [10] constructs a main chain and leverages it to help infer the total order; nonetheless, the selection of the main chain depends on a role called witness, which is purely centralized. Conflux is an improvement of the GHOST-based DAG algorithm, which also uses a pivotal (main) chain without introducing witnesses, and claims to achieve a TPS of 6000 in practice [29]. IOTA tried to avoid the finality of constructing a linear total order by introducing probabilistic confirmation in the network [40]. The systems mentioned above are permissionless chains; DAG technology is also applied in permissioned chains. HashGraph [7] uses the gossip-on-gossip algorithm to propagate the block graph structure and achieves consensus by link analysis in the DAG; this method is proved to be Byzantine fault-tolerant and does not rely on voting. Blockmania [11] is based on the original PBFT design, but its underlying log is DAG-based. Some side-chain methods also borrow the idea of a DAG, such as Nano [27] and VITE [30]. In practice, these systems rely on centralized methods to maintain their stability.

Social network analysis has widely adopted streaming graph computing [15,16,21], which deals with how to quickly maintain information on a temporally or spatially changing graph without traversing the whole graph. We view the DAG-based method as a streaming graph problem: computing the total order and achieving consensus without consuming more computing power as the graph grows. In distributed database systems, replicating data across machines is a well-studied topic [13]. Because of the Bitcoin network's low efficiency, multiple methods have been proposed to accelerate message passing [23]. However, they do not deal with network complexity. We view scaling the DAG system in a network of growing size and topological complexity as another challenging issue, and we propose our gossip solution.

The main contribution of this paper is to show how streaming graph analysis methods and a new gossip protocol can enable a truly decentralized and stably growing DAG system.
2 Basic Design

2.1 Data Structure
The local state of a node in the StreamNet protocol is a directed acyclic graph (DAG) G = <B, g, P, E>. B is the set of blocks in G. g ∈ B is the genesis block; for instance, vertex g in Fig. 1 represents the genesis block. P is a function that maps a block b to its parent block P(b); specially, P(g) = ⊥. In Fig. 1, parent relationships are denoted by solid edges. Note that there is always a parent edge from a block to its parent block (i.e., ∀b ∈ B, <b, P(b)> ∈ E). E is the set of direct reference edges and parent edges in the graph. e = <b, b'> ∈ E is an edge from block b to block b', which means that b' happens before b. For example, in Fig. 1, vertex 1 represents the first block, which is the parent of the subsequent blocks 2, 3, and 4. Vertex 5 has two edges: one is the parent edge pointing to 3, and the other is a reference edge pointing to 4. A block that is not yet referenced by any other block is called a tip. For example, in Fig. 1, block 6 is a tip. All blocks in the StreamNet protocol share a predefined deterministic hash function Hash that maps each block in B to a unique integer id. It satisfies ∀b ≠ b', Hash(b) ≠ Hash(b').
Fig. 1. Example of the StreamNet data structure.
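The data structure described above can be sketched in a few lines of Python. This is a minimal illustration and not the StreamNet implementation: the class name `Dag`, the helper `block_hash` (SHA-256 is an arbitrary choice), and the concrete edges of the example graph, in particular the reference edge from block 6 to block 2, are assumptions chosen to be consistent with the description of Fig. 1.

```python
import hashlib
from dataclasses import dataclass, field

GENESIS = "g"


def block_hash(block_id: str) -> int:
    """Deterministic Hash(): map a block id to an integer (SHA-256 is an assumption)."""
    return int.from_bytes(hashlib.sha256(block_id.encode()).digest(), "big")


@dataclass
class Dag:
    """Local state G = <B, g, P, E> of one node (names are illustrative)."""
    # P: block -> parent block, with P(g) = None standing in for the bottom symbol.
    parent: dict = field(default_factory=lambda: {GENESIS: None})
    # E: pairs (b, older), meaning `older` happens before b.
    edges: set = field(default_factory=set)

    @property
    def blocks(self) -> set:
        """B: the set of all known blocks."""
        return set(self.parent)

    def attach(self, block, parent, reference=None):
        """Add a block with its parent edge and an optional reference edge."""
        if block in self.parent:
            raise ValueError(f"block {block!r} already exists")
        for target in (parent, reference):
            if target is not None and target not in self.parent:
                raise KeyError(f"unknown block {target!r}")
        self.parent[block] = parent
        self.edges.add((block, parent))
        if reference is not None:
            self.edges.add((block, reference))

    def tips(self) -> set:
        """Blocks that no other block points to."""
        referenced = {older for _, older in self.edges}
        return self.blocks - referenced


def example_dag() -> Dag:
    """Graph matching the textual description of Fig. 1 (edge 6 -> 2 is assumed)."""
    dag = Dag()
    for block, parent, reference in [
        ("1", "g", None), ("2", "1", None), ("3", "1", None),
        ("4", "1", None), ("5", "3", "4"), ("6", "5", "2"),
    ]:
        dag.attach(block, parent, reference)
    return dag
```

Calling `example_dag().tips()` returns the single tip, block 6, as in the figure description.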
2.2 StreamNet Architecture
Algorithm 1: StreamNet node main loop.
Input: Graph G = <B, g, P, E>
while Node is running do
    if Received G' = <B', g, P', E'> then
        G'' ← <B ∪ B', g, P ∪ P', E ∪ E'> ;
        if G ≠ G'' then
            G ← G'' ;
            Broadcast updated G to neighbors ;
    if Generate block b then
        a ← Pivot(G, g) ;
        r ← MCMC(G, g) ;
        G ← <B ∪ b, g, P ∪ <b, a>, E ∪ <b, a> ∪ <b, r>> ;
        Broadcast updated G to neighbors ;
end
Figure 2 presents the architecture of StreamNet, which consists of multiple StreamNet machines. Each StreamNet machine grows its DAG locally and broadcasts the changes using the gossip protocol. Eventually, every machine will have a unified view of the DAG. By calling the total ordering algorithm, every machine can sort the DAG into a total order, and the data in each block can have a relative order regardless of its local upload time. Figure 3 shows the local architecture of a StreamNet node. In each node, a transaction pool accepts transactions from the HTTP API. A block generator packs a certain number of transactions into a block. It first finds a parent block and a reference block to attach the new block to; then, based on the hashes of these two blocks and the metadata of the new block itself, it performs the proof of work (POW) to calculate the nonce for the new block. Algorithm 1 summarizes the server logic of a StreamNet node. In the algorithm, the parent block is found by Pivot(G, g), and the reference block is found by MCMC(G, g), the Markov Chain Monte Carlo (MCMC) random walk algorithm [40]. The two algorithms are described in the following section.

Fig. 2. StreamNet architecture.

Fig. 3. One node in StreamNet protocol.
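As a rough sketch of Algorithm 1, the two events a node reacts to can be written as pure functions over the local state. The state is represented here as a parent map plus an edge set; `pick_parent` and `pick_reference` are hypothetical stand-ins for Pivot(G, g) and MCMC(G, g), and broadcasting and the proof of work are deliberately left out.

```python
def on_receive(parent, edges, recv_parent, recv_edges):
    """Merge a neighbor's graph into the local one (union of blocks and edges).

    Returns the merged state and a flag that tells the caller whether anything
    changed, i.e. whether the update should be broadcast further.
    """
    merged_parent = {**parent, **recv_parent}
    merged_edges = edges | recv_edges
    changed = merged_parent != parent or merged_edges != edges
    return merged_parent, merged_edges, changed


def on_generate(parent, edges, block, pick_parent, pick_reference):
    """Attach a freshly generated block to one parent tip and one reference tip."""
    a = pick_parent(parent, edges)      # stand-in for Pivot(G, g)
    r = pick_reference(parent, edges)   # stand-in for MCMC(G, g)
    new_parent = {**parent, block: a}
    new_edges = edges | {(block, a), (block, r)}
    return new_parent, new_edges
```

Returning new objects instead of mutating the inputs keeps the comparison between the old and the merged state (the `G ≠ G''` test in the listing) a plain equality check.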
2.3 Consensus Protocol
Based on the predefined data structure, to present the StreamNet consensus algorithm, we first define several utility functions and notations, which are variations of the definitions in the Conflux paper [29]:
– Chain() returns the chain from the genesis block to a given block following only parent edges.
– $\overline{\mathrm{Chain}}(G, b)$ returns all blocks except those in the chain.
– Child() returns the set of child blocks of a given block.
– Sibling() returns the set of siblings of a given block.
– Subtree() returns the subtree of a given block in the parental tree.
– Before() returns the set of blocks that are immediately generated before a given block.
– Past() returns the set of blocks generated before a given block (including the block itself).
– After() returns the set of blocks that are immediately generated after a given block.
– Later() returns the set of blocks generated after a given block (including the block itself).
– SubGraph() returns the subgraph obtained by removing all blocks and edges except those of a given set of blocks.
– ParentScore() gives the weight of a block; each block accumulates this score when it is referenced as a parent.
– Score() gives the weight of a block; each block accumulates this score as blocks are attached to the graph.
– TotalOrder() returns the 'flattened' order inferred from the consensus algorithm.
Figure 4 gives the definitions of these utility functions.
Fig. 4. The Definitions of Chain(), Child(), Sibling(), Subtree(), Before(), Past(), After(), Later(), SubGraph(), ParentScore(), Score(), and TotalOrder().
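To make these helpers concrete, the following sketch implements a few of them over a small hard-coded graph. The edge set is an assumption chosen to agree with the descriptions of Figs. 1 and 5, and the lowercase function names are illustrative; the formal definitions are those of Fig. 4.

```python
# Example graph (edges are an assumption consistent with Figs. 1 and 5).
PARENT = {"g": None, "1": "g", "2": "1", "3": "1", "4": "1", "5": "3", "6": "5"}
REFERENCE = {"5": "4", "6": "2"}  # extra (non-parent) reference edges


def before(b):
    """Blocks that b points to directly: its parent and its reference."""
    return {x for x in (PARENT.get(b), REFERENCE.get(b)) if x is not None}


def after(b):
    """Blocks that point to b directly."""
    return {x for x in PARENT if b in before(x)}


def child(b):
    """Children of b in the parental tree."""
    return {x for x, p in PARENT.items() if p == b}


def chain(b):
    """Path from the genesis block to b following parent edges only."""
    path = []
    while b is not None:
        path.append(b)
        b = PARENT[b]
    return path[::-1]


def _closure(b, step):
    """b plus everything reachable from b by repeatedly applying `step`."""
    seen, stack = {b}, [b]
    while stack:
        for x in step(stack.pop()) - seen:
            seen.add(x)
            stack.append(x)
    return seen


def past(b):
    """All blocks generated before b, including b itself."""
    return _closure(b, before)


def later(b):
    """All blocks generated after b, including b itself."""
    return _closure(b, after)


def subtree(b):
    """b and its descendants in the parental tree."""
    return _closure(b, child)
```

With these definitions, `past("5") - past("3")` evaluates to the set containing blocks 4 and 5, which is the epoch of block 5 used later in the text.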
Algorithm 2: MCMC(G, b).
Input: The local state G = <B, g, P, E> and a starting block b ∈ B
Output: A random tip t
t ← b
do
    for b' ∈ Child(G, t) do
        $P_{tb'} = \dfrac{e^{\alpha\,\mathrm{Score}(G, b')}}{\sum_{z: z \to t} e^{\alpha\,\mathrm{Score}(G, z)}}$
    end
    t ← choose b' by $P_{tb'}$
while Score(G, t) != 0 ;
return t ;
Fig. 5. An example of total order calculation.
Parent Tip Selection by Pivotal Chain. Algorithm 3 presents our pivot chain selection algorithm (i.e., the definition of Pivot(G, b)). Given a StreamNet state G, Pivot(G, g) returns the last block of the pivot chain starting from the genesis block g. The algorithm repeatedly advances to the child block whose subtree contains the largest number of blocks, as calculated by ParentScore(G, b). When multiple child blocks have the same score, the algorithm selects the child block with the largest block hash. The algorithm terminates when it reaches a tip. Each block in the pivot chain defines an epoch: the blocks in Past(G, b) − Past(G, p), where p is the parent of b, belong to the epoch of block b. For example, in Fig. 5, the pivot chain is <g, 1, 3, 5, 6>, and the epoch of block 5 contains two blocks, 4 and 5.

Reference Tip Selection by MCMC. The tip selection method based on the Markov Chain Monte Carlo (MCMC) random walk is shown in Algorithm 2. Starting from the genesis block, each step of the random walk chooses a child to jump to, and the probability of jumping from one block to the next is calculated using the formula in the algorithm. The constant α in the formula scales the randomness of the walk: the smaller α is, the more random the walk. The algorithm returns when it reaches a tip.
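The random walk of Algorithm 2 can be sketched as follows. Two interpretations are assumed here and are not taken from the paper: Score(G, b) is read as the number of blocks that reference b directly or indirectly (so a tip has score 0), and each step may move to any block that directly references the current one, through a parent or a reference edge. The example graph is the same assumed graph as in the description of Figs. 1 and 5.

```python
import math
import random

# Example graph (edges are an assumption consistent with Figs. 1 and 5).
PARENT = {"g": None, "1": "g", "2": "1", "3": "1", "4": "1", "5": "3", "6": "5"}
REFERENCE = {"5": "4", "6": "2"}


def before(b):
    """Blocks that b points to directly."""
    return {x for x in (PARENT.get(b), REFERENCE.get(b)) if x is not None}


def approvers(b):
    """Blocks that point to b directly, in a stable order."""
    return sorted(x for x in PARENT if b in before(x))


def score(b):
    """Number of blocks that reference b directly or indirectly (0 for a tip)."""
    seen, stack = set(), [b]
    while stack:
        for x in approvers(stack.pop()):
            if x not in seen:
                seen.add(x)
                stack.append(x)
    return len(seen)


def transition_probabilities(t, alpha):
    """Softmax over the scores of the blocks the walk can jump to from t."""
    candidates = approvers(t)
    weights = [math.exp(alpha * score(c)) for c in candidates]
    total = sum(weights)
    return {c: w / total for c, w in zip(candidates, weights)}


def mcmc(start, alpha, rng):
    """Walk from `start` until a tip (score 0) is reached and return it."""
    t = start
    while score(t) != 0:
        probs = transition_probabilities(t, alpha)
        t = rng.choices(list(probs), weights=list(probs.values()))[0]
    return t
```

With α = 0 every candidate is equally likely; a larger α biases the walk toward heavily referenced blocks. Recomputing `score` on every step is exactly the cost that the streaming optimization in Sect. 3 is meant to remove.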
Algorithm 3: Pivot(G, b).
Input: The local state G = <B, g, P, E> and a starting block b ∈ B
Output: The tip in the pivot chain
do
    tmpMaxScore ← -1 ;
    tmpBlock ← ⊥ ;
    for b' ∈ Child(G, b) do
        pScore ← ParentScore(G, b') ;
        if pScore > tmpMaxScore || (pScore = tmpMaxScore and Hash(b') < Hash(tmpBlock)) then
            tmpMaxScore ← pScore ;
            tmpBlock ← b' ;
        end
    end
    b ← tmpBlock ;
while Child(G, b) != ∅ ;
return b ;
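A compact version of the pivot selection is sketched below. It assumes that ParentScore(G, b) is the size of b's subtree in the parental tree, and it breaks ties by the smaller hash, following the comparison written in Algorithm 3; the hash function and the example graph are illustrative assumptions.

```python
import hashlib

# Parental tree of the example graph (an assumption consistent with Fig. 5).
PARENT = {"g": None, "1": "g", "2": "1", "3": "1", "4": "1", "5": "3", "6": "5"}


def block_hash(b):
    """Deterministic integer hash of a block id (SHA-256 is an assumption)."""
    return int.from_bytes(hashlib.sha256(b.encode()).digest(), "big")


def child(b):
    """Children of b in the parental tree."""
    return [x for x, p in PARENT.items() if p == b]


def parent_score(b):
    """Size of b's subtree in the parental tree, b included."""
    return 1 + sum(parent_score(c) for c in child(b))


def pivot_chain(start="g"):
    """Follow the heaviest child from `start` until a block without children."""
    chain = [start]
    while child(chain[-1]):
        # Highest ParentScore wins; ties go to the smaller hash, as in Algorithm 3.
        best = min(child(chain[-1]),
                   key=lambda c: (-parent_score(c), block_hash(c)))
        chain.append(best)
    return chain


def pivot(start="g"):
    """Last block of the pivot chain."""
    return pivot_chain(start)[-1]
```

On the example graph the chain is g, 1, 3, 5, 6, matching the pivot chain stated for Fig. 5. The recursive `parent_score` is the naive version; a node would normally keep these scores cached.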
Total Order. Algorithm 4 defines StreamNetOrder(), our block ordering algorithm. Given the local state G and a block b in the pivot chain, StreamNetOrder(G, b) returns the ordered list of all blocks that appear in or before the epoch of b. Using StreamNetOrder(), the total order of a local state G is defined as TotalOrder(G). The algorithm recursively orders all blocks in previous epochs (i.e., the epoch of P(b) and before). It then computes the set of blocks in the epoch of b as B_Δ, topologically sorts the blocks in B_Δ, and appends them to the result list. The algorithm uses the unique block hash to break ties. In Fig. 5, the final total order is <g, 1, 3, 4, 5, 2, 6>.

2.4 The UTXO Model
In StreamNet, transactions use the unspent transaction output (UTXO) model, the same as in Bitcoin. In the confirmation process, the user calls TotalOrder() to get the relative order of the blocks; if two blocks conflict, the conflicting content of the block that comes later in the total order is eliminated. Figure 6 shows an example of UTXO storage in StreamNet and how a conflict is resolved. Two blocks reference the same block, in which Alice has five tokens, and each constructs a new transaction output, representing a transfer of tokens to Bob and to Jack, respectively. After calling TotalOrder(), the Bob transfer block precedes the Jack transfer block; thus, the latter block is discarded.
Algorithm 4: StreamNetOrder(G, b).
Input: The local state G = <B, g, P, E> and a tip block b ∈ B
Output: The block list of the total topological order from the genesis block to the given block b in G
L = ⊥
do
    p ← Parent(G, b) ;
    B_Δ ← Past(G, b) - Past(G, p) ;
    do
        G' ← SubGraph(B_Δ) ;
        B'_Δ ← {x | |Before(G', x)| = 0} ;
        Sort all blocks in B'_Δ in order as b_1, b_2, ..., b_k
        such that ∀ 1 ≤ i ≤ j ≤ k, Hash(b_i) ≤ Hash(b_j) ;
        L ← L + b_1 + b_2 + ... + b_k ;
        B_Δ ← B_Δ - B'_Δ ;
    while B_Δ ≠ ∅ ;
    b = p ;
while b != g ;
return L ;
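The ordering procedure can be sketched as below. For readability the sketch walks the pivot chain from the genesis block forward and appends one epoch at a time, whereas Algorithm 4 starts from the tip; that the two produce the same list is an assumption of this illustration, as are the hash function and the example edges. Inside an epoch, blocks are emitted layer by layer (blocks with no remaining predecessor in the epoch first), with ties inside a layer broken by hash.

```python
import hashlib

# Example graph (edges are an assumption consistent with Fig. 5).
PARENT = {"g": None, "1": "g", "2": "1", "3": "1", "4": "1", "5": "3", "6": "5"}
REFERENCE = {"5": "4", "6": "2"}


def block_hash(b):
    """Deterministic integer hash of a block id (SHA-256 is an assumption)."""
    return int.from_bytes(hashlib.sha256(b.encode()).digest(), "big")


def before(b):
    """Blocks that b points to directly."""
    return {x for x in (PARENT.get(b), REFERENCE.get(b)) if x is not None}


def past(b):
    """All blocks reachable from b through Before(), including b."""
    seen, stack = {b}, [b]
    while stack:
        for x in before(stack.pop()) - seen:
            seen.add(x)
            stack.append(x)
    return seen


def epoch_order(epoch):
    """Topologically sort one epoch, layer by layer, breaking ties by hash."""
    remaining, ordered = set(epoch), []
    while remaining:
        ready = [x for x in remaining if not (before(x) & remaining)]
        ready.sort(key=block_hash)
        ordered.extend(ready)
        remaining -= set(ready)
    return ordered


def streamnet_order(tip):
    """Total order of Past(tip); `tip` is assumed to end the pivot chain."""
    chain, b = [], tip
    while b is not None:
        chain.append(b)
        b = PARENT[b]
    chain.reverse()
    order, placed = [], set()
    for b in chain:
        epoch = past(b) - placed  # equals Past(G, b) - Past(G, p)
        order.extend(epoch_order(epoch))
        placed |= epoch
    return order
```

For the example graph, `streamnet_order("6")` returns g, 1, 3, 4, 5, 2, 6, which is the total order stated for Fig. 5.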
Fig. 6. An example of UTXO.
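The conflict rule described for Fig. 6 amounts to replaying transactions in total order and dropping any transaction whose inputs are already spent. The sketch below illustrates only that rule; the record layout, the identifiers (`alice:0`, `to_bob`, block names `X` and `Y`), and the absence of amount or signature checks are all simplifying assumptions.

```python
def resolve(total_order, txs_by_block, initial_utxos):
    """Replay transactions in total order; a later double spend is discarded.

    total_order   -- block ids in the order returned by the ordering algorithm
    txs_by_block  -- block id -> list of {"id", "inputs", "outputs"} records
    initial_utxos -- utxo id -> (owner, amount)
    """
    unspent = dict(initial_utxos)
    accepted, discarded = [], []
    for block in total_order:
        for tx in txs_by_block.get(block, []):
            if all(utxo in unspent for utxo in tx["inputs"]):
                for utxo in tx["inputs"]:
                    del unspent[utxo]          # inputs are consumed
                unspent.update(tx["outputs"])  # outputs become spendable
                accepted.append(tx["id"])
            else:
                discarded.append(tx["id"])     # conflicts with an earlier tx
    return unspent, accepted, discarded


# Alice's five tokens are spent twice, in two different blocks.
UTXOS = {"alice:0": ("Alice", 5)}
TXS = {
    "X": [{"id": "to_bob", "inputs": ["alice:0"],
           "outputs": {"bob:0": ("Bob", 5)}}],
    "Y": [{"id": "to_jack", "inputs": ["alice:0"],
           "outputs": {"jack:0": ("Jack", 5)}}],
}
```

If the total order places block X before block Y, the transfer to Bob is accepted and the transfer to Jack is discarded; reversing the order reverses the outcome, which is why every node must agree on the same total order.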
2.5 Gossip Network
In the Bitcoin and IOTA networks, block information is disseminated by direct mail [13]. Suppose there are N nodes and L links in the network. To spread a block of size B, the direct mail algorithm has a total complexity of O(LB), and the average complexity per node is O(LB/N). In a chain-based system this is acceptable, because the design of the system already assumes that the transaction rate will be low. In a DAG-based system, however, this manner of gossip results in low scalability because of the high block generation rate, and it can flood the network. What is worse, given the heterogeneity and long diameter of the network topology, convergence of the DAG takes a long time, which delays the confirmation of blocks.

2.6 Differences with Other DAG Protocols
Here, we compare our protocol with two mainstream DAG-based protocols: IOTA and Conflux.

IOTA. There are three major differences from IOTA:
– First, both tips in the IOTA tip selection algorithm are chosen randomly, whereas in ours one tip is chosen deterministically, for total ordering purposes, and the other is chosen randomly, to maintain the DAG property.
– Second, the IOTA consensus algorithm is not purely decentralized: it relies on a central coordinator to issue milestones for multiple purposes. Our algorithm does not depend on such a facility.
– Last, IOTA has no concept of total order, and there are three ways to judge whether a transaction is confirmed:
  – The first way is that the common nodes covered by all the tips are considered fully confirmed.
  – The second way is that all transactions referenced by the milestone tip are confirmed.
  – The third way is to use MCMC: call the tip selection algorithm N times to select a tip. Each time the selected tip references a block, the block's credibility count increases by 1. If the block has been referenced M times after N selections, its credibility is M/N.

Conflux. There are two major differences from Conflux:
– First, Conflux approves all tips in the DAG along with the parent, which is much more complicated than our MCMC-based two-tip method. Moreover, when the DAG is wide, much more space is needed to maintain such a data structure.
– Second, the Conflux total ordering algorithm advances from the genesis block to the end, while StreamNet advances in the reverse direction. This is one of the major contributions to our streaming-graph-based optimizations, which are discussed in the next section. The Conflux paper does not describe how to deal with the complexity that comes with a growing graph.
2.7 Correctness
Safety and Liveness. StreamNet uses the GHOST rule to select the pivot chain, the same as in Conflux; thus, it shares the same safety and correctness properties as Conflux. Although the choice of reference chain in StreamNet is different from Conflux, it only affects the inclusion rate, which is the probability that a block is included in the total order.

Confirmation. According to Theorem 10 in [45] and the deduction in [29], given a period [t − d, t] and a block b in the pivot chain in this period, the chance that b is kicked out by its sibling b' is no more than Pr(b_drop) in (1), which is the same as in Conflux.

$$\Pr(b_{drop}) \le \sum_{k=0}^{n-m} \zeta_k\, q^{\,n-m-k+1} + \sum_{k=n-m+1}^{\infty} \zeta_k, \qquad \zeta_k = e^{-q\lambda_h t}\,\frac{(-q\lambda_h t)^k}{k!} \qquad (1)$$
Following the definitions in the Conflux paper [29], in (1), n is the number of blocks in the subtree of b before t, and m is the number of blocks in the subtree of b' before t. λ_h is the honest nodes' block generation rate, and q (0 ≤ q ≤ 1) is the ratio of the attacker's block generation rate to λ_h. From the equation, we can conclude that as time t goes on, the chance that a block b in the pivot chain is reverted decreases exponentially.
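To get a feeling for the bound in (1), it can be evaluated numerically. The sketch below makes one explicit assumption: it takes ζ_k to be the Poisson probability of k events with mean qλ_h t, that is, it uses (qλ_h t)^k without the minus sign printed inside the power, so that every ζ_k is a non-negative probability. The function names and the parameter values are illustrative only.

```python
import math


def poisson_pmf(k, mean):
    """Probability of exactly k events for a Poisson distribution."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def drop_bound(lead, q, lam_h, t):
    """Right-hand side of (1) for lead = n - m.

    Assumes zeta_k = Poisson(k; q * lam_h * t), see the note above.
    """
    mean = q * lam_h * t
    zetas = [poisson_pmf(k, mean) for k in range(lead + 1)]
    # First sum: attacker produced k <= lead blocks and still has to catch up.
    catch_up = sum(z * q ** (lead - k + 1) for k, z in enumerate(zetas))
    # Second sum: attacker already produced more than `lead` blocks.
    tail = max(0.0, 1.0 - sum(zetas))
    return catch_up + tail
```

For a fixed attacker ratio, the bound shrinks quickly as the lead n − m of the pivot block's subtree over its sibling's subtree grows, for example when comparing `drop_bound(5, 0.2, 1.0, 10.0)` with `drop_bound(10, 0.2, 1.0, 10.0)`.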
3 Optimization Methods
One of the biggest challenges in maintaining the stability of the DAG system is that, as the local data structure grows, the graph algorithms (Pivot(), MCMC(), StreamNetOrder()) rely on graph operators that need to be recalculated for every newly generated block, which is very expensive. Table 1 lists all the expensive graph operators that are called. Suppose the depth of the pivot chain is d; the complexity can then be analyzed as follows. ParentScore() and Score() rely on breadth-first search (BFS), whose average complexity is O(|B|); because each call of Pivot() or MCMC() evaluates these scores for up to |B| blocks, the total complexity is O(|B|^2) in both cases. The calculation of Past() also relies on BFS; in the StreamNetOrder() algorithm, the complexity accumulates to O(|B| ∗ d). TopOrder() is used to rank the blocks within the same epoch. It is the classical topological sorting problem, and its complexity in StreamNetOrder() is O(|B|).
Table 1. Analysis of graph properties calculation

Graph property           Algorithm used      Complexity   Total
ParentScore(G, b)        Pivot()             O(|B|)       O(|B|^2)
Score(G, b)              MCMC()              O(|B|)       O(|B|^2)
Past(G,b) - Past(G,p)    StreamNetOrder()    O(|B|)       O(|B| ∗ d)
TopOrder(G, b)           StreamNetOrder()    O(|B|)       O(|B|)
Algorithm 5: UpdateScore(G, b).
Input: Graph G, Block b, Score map S
Output: Updated score map S
Q = [b] ;
visited = {} ;
while Q != ∅ do
    b' = Q.pop() ;
    for b'' ∈ Before(G, b') do
        if b'' ∉ visited ∧ b'' != ⊥ then
            Q.append(b'') ;
            visited.add(b'') ;
            S[b''] ++ ;
        end
    end
end
return S ;
New blocks are generated and merged into the local data structure in a streaming way, so the expensive graph properties can be maintained dynamically as the DAG grows. The cost of calculating these properties is then amortized over the events in which a new block is generated or merged. In the following sections, we discuss how to design streaming algorithms to achieve this goal.

3.1 Optimization of Score() and ParentScore()
In the optimized version, the DAG keeps a map that stores the score of each block. Whenever a block is newly generated or merged, it triggers the BFS-based UpdateScore() algorithm, which updates the scores in the map of the blocks referenced by the new block. The skeleton of the UpdateScore() algorithm is shown in Algorithm 5.
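The streaming update can be sketched as a single backward BFS per new block. The sketch assumes that a block's score counts the blocks that reference it directly or indirectly, so each block in the past of the new block is incremented exactly once; `before` is a hypothetical adjacency map from a block to the blocks it references.

```python
from collections import deque


def update_score(before, scores, new_block):
    """Add 1 to the score of every block in the past of `new_block`.

    before -- block id -> iterable of blocks it references (parent, reference)
    scores -- block id -> current score, updated in place and returned
    """
    scores.setdefault(new_block, 0)  # a freshly attached block is a tip
    queue, visited = deque([new_block]), {new_block}
    while queue:
        current = queue.popleft()
        for older in before.get(current, ()):
            if older not in visited:
                visited.add(older)
                queue.append(older)
                scores[older] = scores.get(older, 0) + 1
    return scores
```

Feeding the blocks of the example graph one at a time keeps the map current without ever recomputing a score from scratch; the cost of each update is bounded by the size of the new block's past.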
3.2 Optimization of Past(G,b) - Past(G,p)
Algorithm 6: GetDiffSet(G, b, C).
  Input: Graph G, Block b, covered block set C
  Output: diff set D ← Past(G, b) − Past(G, p)
  D = Ø ;
  Q ← [b] ;
  visited = {b} ;
  p = Parent(G, b) ;
  while Q != Ø do
      b' = Q.pop() ;
      for b'' ∈ Before(G, b') do
          if ¬IsCovered(G, p, b'', C) ∧ b'' != ⊥ then
              Q.append(b'') ;
              visited.add(b'') ;
      end
      D.add(b') ;
      C.add(b') ;
  end
  return D ;
We abbreviate the calculation of Bδ = Past(G,b) - Past(G,p) as GetDiffSet(G,b,C), which is shown in Algorithm 6. This algorithm is, in essence, a dual-direction BFS. Starting from block b, it traverses all the blocks that b references. Every time a new referenced block b' is discovered, it performs a backward BFS to 'look back' and check whether b' is already covered by p, the parent block of b. If so, b' is not added to the forward BFS queue. To limit the cost of the backward BFS, we add the previously calculated diff set to the covered set C, which is passed to GetDiffSet() as a parameter. More specifically, when a backward BFS is performed, the blocks in C are not added to the search queue. This backward search is denoted IsCovered() and is described in detail in Algorithm 7. Figure 7 shows an example of the GetDiffSet() method for block 5. It first performs a forward BFS and finds block 4, which has no children, so block 4 is added to the diff set. The search then moves forward to block 1, which has three children. If the backward search detects block 3, which is the parent of block 5, it stops promptly. If it continues searching on blocks 2 or 4, these two blocks are not added to the search queue, because they are already in the covered set.
Algorithm 7: IsCovered(G, p, b', C).
  Input: Graph G, Block b', parent p, covered block set C
  Output: true if covered by parent, else false
  Q ← [b'] ;
  visited = {b'} ;
  while Q != Ø do
      b'' = Q.pop() ;
      for t ∈ Child(G, b'') do
          if t = p then
              return true ;
          else if t ∉ visited ∧ t ∉ C then
              Q.add(t) ;
              visited.add(t) ;
      end
  end
  return false ;
Fig. 7. Example of the streaming get diff set method.
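The dual-direction search can be sketched in Python as below. This is a variant written for illustration, not a transcription of Algorithms 6 and 7, and it rests on stated assumptions: the graph is given as two hypothetical dictionaries (`before` for referenced blocks, `children` for the reverse edges); the parent p is treated as part of its own past; and the set `covered` is read as "blocks already known to lie in the past of p", so that reaching one of them ends the backward search with a positive answer. Blocks already placed in the current diff set are the ones pruned from the backward queue.

```python
from collections import deque


def is_covered(children, parent, block, covered, diff):
    """Backward BFS over child edges: can `block` reach `parent`?"""
    queue = deque([block])
    visited = {block}
    while queue:
        current = queue.popleft()
        if current == parent or current in covered:
            return True  # reached the parent or a block already in its past
        for child in children.get(current, ()):
            # Blocks already in the diff set cannot lead to the parent.
            if child not in visited and child not in diff:
                visited.add(child)
                queue.append(child)
    return False


def get_diff_set(before, children, parent, block, covered):
    """Forward BFS from `block`, keeping the blocks not covered by `parent`."""
    diff = {block}
    queue = deque([block])
    while queue:
        current = queue.popleft()
        for ref in before.get(current, ()):
            if ref in diff:
                continue
            if not is_covered(children, parent, ref, covered, diff):
                diff.add(ref)
                queue.append(ref)
    covered |= diff  # the next call, whose parent is `block`, may reuse these
    return diff


# Toy DAG (hypothetical): pivot chain g <- a <- p <- n, side block s.
before = {"g": [], "a": ["g"], "s": ["g"], "p": ["a"], "n": ["p", "s"]}
children = {}
for blk, refs in before.items():
    for ref in refs:
        children.setdefault(ref, []).append(blk)

covered = set()
epochs = []
for parent, block in [(None, "g"), ("g", "a"), ("a", "p"), ("p", "n")]:
    diff = get_diff_set(before, children, parent, block, covered)
    epochs.append(sorted(diff))
print(epochs)
```

Calling the function once per pivot block, in chain order, lets each call reuse the diff sets of the earlier calls instead of searching back to the genesis.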
3.3 Optimization of TopOrder()
The topological order is used to sort the blocks within the same epoch. Computed naively, it requires a topological sort of the whole DAG from scratch every time. However, the topological order can easily be updated when a new block is added or merged. The update rule is that, when a new block is added, its topological position is given by (2). This step can be done in O(1).

$$\mathrm{TopScore}(G, b) \leftarrow \min\bigl(\mathrm{TopScore}(G, \mathrm{Parent}(b)),\ \mathrm{TopScore}(G, \mathrm{Reference}(b))\bigr) + 1 \tag{2}$$
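A minimal Python sketch of the constant-time update follows. The function name `add_block`, the dictionary `top_score`, and the choice of 0 as the score of the genesis block are assumptions of the example. The rule is implemented as printed in (2), with min; the combining function is left as a parameter (also an assumption of this sketch), since a rank that has to exceed the rank of every referenced block would call for max instead.

```python
def add_block(top_score, block, parent, reference, combine=min):
    """Assign a topological score to a new block in O(1) using rule (2)."""
    top_score[block] = combine(top_score[parent], top_score[reference]) + 1
    return top_score[block]


top_score = {"g": 0}  # assumption: the genesis block starts at score 0
add_block(top_score, "a", parent="g", reference="g")
add_block(top_score, "b", parent="g", reference="g")
add_block(top_score, "c", parent="a", reference="b")
print(top_score)
```

Only two dictionary lookups are needed per new block, which is what makes the update independent of the size of the DAG.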
To summarize, the optimized streaming operators achieve the performance improvement shown in Table 2.

Table 2. Analysis of graph properties calculation

Graph Property           Algorithm used      Complexity   Tot
Score(G, b)              MCMC()              O(|B|)       O(|B|)
ParentScore(G, b)        Pivot()             O(|B|)       O(|B|)
Past(G,b) - Past(G,p)    StreamNetOrder()    O(|B|)       O(|B|)
TopOrder(G, b)           StreamNetOrder()    O(1)         O(1)

3.4 Genesis Forwarding
The algorithms above solve the problem of dynamically maintaining the information needed for graph computation. However, the information still has to be updated all the way back to the genesis block, so the update becomes more expensive to compute as the graph grows. As the DAG grows, old confirmed blocks are confirmed by more and more blocks and become hard to mutate; the exact probability can be computed with formula (1). Hence, we can design a strategy that forwards the genesis periodically and fixes the historical blocks into a totally ordered chain. The criterion for forwarding the genesis is based on a threshold on ParentScore(). Suppose we define this threshold as h = n − m; then we forward the genesis only if:
$$\exists b \mid b \in \mathrm{Chain}(G, g),\ \text{for}\ \forall b' \mid b' \in \mathrm{Chain}(G, g),\ \text{such that}\ \mathrm{ParentScore}(b) > \mathrm{ParentScore}(b') + h \tag{3}$$

In Fig. 8, we set h = 5, and there are three side chains with ParentScore(b') = 9 for all b' ∈ Chain(G, g). They are candidates for the new genesis, and we choose the block with the minimum ParentScore as the new genesis. After the new genesis has been chosen, we induce a new DAG in memory from this genesis and persist the 'snapshot' total order in the local database (the Conflux paper has the same definition but does not show the technical detail, which we do not consider trivial). When the total order is queried, a total order based on the current DAG is appended to the end of the historical snapshot total order and returned. The vertices in the UTXO graph that belong to the fixed blocks are also evicted from memory and persisted to disk. The procedure is shown in Algorithm 8.
Fig. 8. Example of genesis forward method.
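The snapshot handling described in this section (the "persist O − O'" step of Algorithm 8 and the rule for answering total-order queries) can be illustrated with a short Python sketch. The function names are hypothetical, a plain list stands in for the local database, and the two topological orders are given by hand rather than computed from a graph; selecting the new genesis with criterion (3) is not modeled here.

```python
def forward_genesis(old_order, new_order, snapshot):
    """Move the blocks that leave the in-memory DAG (O - O') into the snapshot."""
    still_in_memory = set(new_order)
    snapshot.extend(b for b in old_order if b not in still_in_memory)


def query_total_order(snapshot, current_order):
    """Historical snapshot order followed by the order of the current DAG."""
    return snapshot + current_order


snapshot = []                           # stands in for the local database
old_order = ["g", "a", "b", "c", "d"]   # O  = TopOrder(G, g)
new_order = ["c", "d"]                  # O' = TopOrder(G', g') with g' = "c"
forward_genesis(old_order, new_order, snapshot)

# A block "e" arrives later and is ordered within the induced DAG only.
total = query_total_order(snapshot, new_order + ["e"])
print(total)
```

After forwarding, ordering work touches only the blocks from the new genesis onward, while queries still see the full history.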
3.5 The Direct Signal Gossip Protocol
Solutions exist in [13] to minimize message passing in the gossip network. Moreover, Hyperledger [5] has adopted a PUSH and PULL model for gossip message propagation. However, that system targets permissioned chains. Supposing the size of the hash of a block is H, we designed the direct signal algorithm. The algorithm is divided into two steps. Once a node generates or receives a block, it first broadcasts the hash of the block; this is the PUSH step. Once a node receives a hash or a set of hashes, it picks one source of the hash and requests the block content from it; this is the PULL step. The complexity of the direct signal algorithm is O(LH + NB), which averages to O(LH/N + 1) per node. The algorithm is shown in Algorithm 9.
Algorithm 8: Genesis Forward Algorithm.
  Input: Graph G = <B, g, P, E>
  while Node is running do
      if ∃b satisfies (3) then
          O = TopOrder(G, g) ;
          g' ← b ;
          G' ← induceGraph(G, g') ;
          pS = ParentScore(G', g') ;
          S = Score(G', g') ;
          O' = TopOrder(G', g') ;
          G ← G' ;
          persist O − O' ;
      sleep(t) ;
  end
Algorithm 9: The Direct Signal Gossip Algorithm.
  Input: Graph G = <B, g, P, E>
  while Node is running do
      if Generate block b then
          Broadcast b to neighbors ;
      if Receive block b then
          h ← Hash(b) ;
          cache[h] ← b ;
          Broadcast h to neighbors ;
      if Received request h from neighbor n then
          b ← cache[h] ;
          Send b to n ;
      if Received hash h from neighbor n then
          b ← cache[h] ;
          if b = NULL then
              Send request h to n ;
  end
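The PUSH and PULL steps can be simulated in a single process with the Python sketch below. It is an illustration under assumptions, not the IRI networking code: the `Node` class, the message tuples, the `requested` set used to pick a single source, and the use of SHA-256 as the block hash are all hypothetical choices made for the example. Following the prose description, the generating node announces the hash rather than sending the whole block.

```python
import hashlib
from collections import deque


class Node:
    def __init__(self, name):
        self.name = name
        self.neighbors = []
        self.cache = {}         # block hash -> block content
        self.requested = set()  # hashes this node has already asked for


def connect(a, b):
    a.neighbors.append(b)
    b.neighbors.append(a)


def block_hash(block):
    return hashlib.sha256(block).hexdigest()


def gossip(origin, block):
    """Propagate one block and count the messages of each kind."""
    counts = {"hash": 0, "request": 0, "block": 0}
    digest = block_hash(block)
    origin.cache[digest] = block
    # PUSH: announce only the hash to every neighbor.
    inbox = deque(("hash", origin, peer, digest) for peer in origin.neighbors)
    while inbox:
        kind, src, dst, payload = inbox.popleft()
        counts[kind] += 1
        if kind == "hash":
            # PULL: ask a single source, and only if the block is unknown.
            if payload not in dst.cache and payload not in dst.requested:
                dst.requested.add(payload)
                inbox.append(("request", dst, src, payload))
        elif kind == "request":
            inbox.append(("block", dst, src, dst.cache[payload]))
        else:  # kind == "block"
            h = block_hash(payload)
            dst.cache[h] = payload
            inbox.extend(("hash", dst, peer, h) for peer in dst.neighbors)
    return counts


# A 3-clique of hypothetical nodes.
a, b, c = Node("a"), Node("b"), Node("c")
connect(a, b)
connect(b, c)
connect(a, c)
counts = gossip(a, b"example block")
print(counts)
```

In this run the full block crosses the network once per receiving node, while only hashes travel over every link in both directions, which matches the shape of the O(LH + NB) bound.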
4 Experimental Results

4.1 Implementation
Fig. 9. Block header format. The main transaction information is stored in the signature part. addr is the sender's address; timestamp is the time at which the block was created; current/last index and bundle are used for storing the bundle information; trunk and branch are the hash addresses that store the parent and reference locations; tag is used to store tagging information; attach TS is the time at which the block is attached to the StreamNet; nonce is used in the POW calculation.
We have implemented StreamNet based on the IOTA Java reference code (IRI) v1.5.5 [1]. We forked the code and made our own implementation; the code is freely available at [3]. In this paper, we use version v0.1.4-streamnet in the v0.1-streamnet beta branch.

– The features we have adopted from the IRI are:
  • The block header format, as shown in Fig. 9. Some of the data segments are not used in StreamNet; these are marked grey.
  • The gossip network, a bi-directional network in which every node sends data to and receives data from its peers.
  • The transaction bundle. Because of the bundle hash feature, StreamNet can support both a single transaction per block and batched transactions as a bundle.
  • Sponge hash functions, which are claimed to be quantum immune. In our experiment, the POW hardness is set to 8, the same as on the IOTA testnet.
– The features we have abandoned from the IRI are:
  • IOTA's transaction logic, including the ledger validation part.
  • The milestones issued by coordinators, which is a centralized setup.
– The feature we have modified based on the IRI is:
  • The tip selection method based on MCMC. Since tip selection in IRI has to find a milestone from which to start searching, we replace the milestone with a block in the pivotal chain.
– The features we have added to StreamNet are:
  • The consensus algorithms, in which we have applied the streaming method directly.
  • The UTXO logic, stored in the signature part of the block header; a graph data structure is used to store the UTXOs as well.
  • An in-memory layer. In IOTA's implementation, blocks are stored in RocksDB [2] as the persistence layer, which makes it inefficient to infer the relationships between blocks and to calculate graph features. In our implementation, we introduced an in-memory layer to store the relationships between blocks, so that the tip selection and total ordering algorithms are accelerated.

4.2 Environment Set up
Fig. 10. Cluster set up for different network topologies.
We used AWS cloud services with 7 virtual machines. Each node has a four-core AMD EPYC 7571, 16 GB of memory, and 296 GB of disk. The Java version is 1.8; we deployed our service using Docker, version 18.02.0-ce. We set up 7 node topologies, shown in Fig. 10. These configurations aim to test:

– The performance when the cluster connectivity is high (congestion of communications, as in 3-clique, 4-clique, 7-clique, and 7-star);
– The performance when the cluster diameter is high (long hops to pass a message, as in 4-circle, 7-circle, and 7-bridge).

As for the data, we created 1,000 accounts, with the genesis account holding 1,000,000,000 tokens in the coinbase block. We divided the accounts into two groups of 500 accounts each. The first group participates in the ramp-up step, in which the genesis account distributes tokens to these accounts. For comparison, we issued four sets of transactions of different sizes (5000, 10000, 15000, and 20000). In the execution step, the first group of accounts issues transactions to the second group of accounts, which constructs a bipartite spending graph. Since there are more transactions than accounts, double-spending occurs in this step. The number of threads in this procedure equals the number of nodes in each configuration. JMeter [22] is used as the driver to issue the transactions, and Nginx [37] is used to distribute the requests evenly and randomly to the different nodes.

4.3 Results and Discussions
Block Generation Rate Test. To test the block generation rate, we set each block in StreamNet to contain only one transaction. The performance of this configuration is shown in Fig. 11. First, as the size of the cluster grows, the network shows little performance loss at all of the data scales. We can also see that, as the data grows, the average TPS on most configurations increases slightly (some outliers still need to be triaged); this is because the genesis forwarding algorithm needs some ramp-up time to reach the stable growth stage. Considering that the system is dealing with a growing graph instead of a chain, and given the complexity analysis in the previous section, the experiment shows that our streaming algorithm sheds light on how to deal with a growing DAG.

Fig. 11. Experimental results for block generation rate.

Bundle Transaction Test. By default, each block in StreamNet supports bundle transactions. We set each bundle to contain 20 transactions, and each block includes approximately 3 transactions. The performance of this configuration is shown in Fig. 12. In this experiment, the performance (TPS) more than doubles compared with the block generation rate test. This is because less POW work has to be done. Besides, as the data grows, we do not observe a noticeable performance downturn. Nevertheless, there is some performance thrashing in the experiment, which needs more study.
[Figure 12 plot: TPS versus num_txn (5000, 10000, 15000, 20000), one series per cluster_size: 3_clique, 4_circle, 4_clique, 7_bridge, 7_circle, 7_clique, 7_star.]
Fig. 12. Experimental results for bundle transaction.
5 Conclusion
In this paper, we proposed a way to grow the blocks in growing DAG-based blockchain systems and to maintain the total order as the DAG structure dynamically becomes larger. We built on IRI, one of the earliest DAG implementations, to conduct our own experiments on clusters of different sizes and topologies. Despite the network inefficiency in the IRI implementation, our method is shown to tolerate the increasing complexity of the graph computation problems involved. This is due to the streaming graph computing techniques we have introduced in this paper.
References

1. Iota reference implementation. https://github.com/iotaledger/iri
2. Rocksdb reference implementation. http://rocksdb.org
3. Streamnet reference implementation. https://github.com/triasteam/iri
4. Abeyratne, S.A., Monfared, R.P.: Blockchain ready manufacturing supply chain using distributed ledger (2016)
5. Androulaki, E., Barger, A., Bortnikov, V., Cachin, C., Christidis, K., De Caro, A., Enyeart, D., Ferris, C., Laventman, G., Manevich, Y., et al.: Hyperledger fabric: a distributed operating system for permissioned blockchains. In: Proceedings of the Thirteenth EuroSys Conference, p. 30. ACM (2018)
6. Azaria, A., Ekblaw, A., Vieira, T., Lippman, A.: Medrec: Using blockchain for medical data access and permission management. In: International Conference on Open and Big Data (OBD), pp. 25–30. IEEE (2016)
7. Baird, L.: The swirlds hashgraph consensus algorithm: fair, fast, byzantine fault tolerance. Swirlds Tech Reports SWIRLDS-TR-2016-01, Technical Report (2016)
8. Castro, M., Liskov, B., et al.: Practical byzantine fault tolerance. In: OSDI, vol. 99, pp. 173–186 (1999)
9. Christidis, K., Devetsikiotis, M.: Blockchains and smart contracts for the internet of things. IEEE Access 4, 2292–2303 (2016)
10. Churyumov, A.: Byteball: a decentralized system for storage and transfer of value (2016). https://byteball.org/Byteball.pdf
11. Danezis, G., Hrycyszyn, D.: Blockmania: from block dags to consensus. arXiv preprint arXiv:1809.01620 (2018)
12. David, B.M., Gazi, P., Kiayias, A., Russell, A.: Ouroboros praos: an adaptively-secure, semi-synchronous proof-of-stake protocol. IACR Cryptology ePrint Archive, 2017:573 (2017)
13. Demers, A., Greene, D., Houser, C., Irish, W., Larson, J., Shenker, S., Sturgis, H., Swinehart, D., Terry, D.: Epidemic algorithms for replicated database maintenance. ACM SIGOPS Oper. Syst. Rev. 22(1), 8–32 (1988)
14. Duffield, E., Diaz, D.: Dash: A payments-focused cryptocurrency (2018)
15. Ediger, D., McColl, R., Riedy, J., Bader, D.A.: Stinger: High performance data structure for streaming graphs. In: 2012 IEEE Conference on High Performance Extreme Computing, pp. 1–5. IEEE (2012)
16. Ediger, D., Riedy, J., Bader, D.A., Meyerhenke, H.: Tracking structure of streaming social networks. In: 2011 IEEE International Parallel and Distributed Processing Symposium Workshops and PhD Forum, pp. 1691–1699. IEEE (2011)
17. Eyal, I., Gencer, A.E., Sirer, E., Van Renesse, R.: Bitcoin-ng: A scalable blockchain protocol. In: NSDI, pp. 45–59 (2016)
18. TRON Foundation. Advanced decentralized blockchain platform (2018). In: Whitepaper https://tron.network/static/doc/white paper v 2 0.pdf
19. Gilad, Y., Hemo, R., Micali, S., Vlachos, G., Zeldovich, N.: Algorand: Scaling byzantine agreements for cryptocurrencies. In: Proceedings of the 26th Symposium on Operating Systems Principles, pp. 51–68. ACM (2017)
20. Goodman, L.M.: Tezos–a self-amending crypto-ledger white paper (2014). https://www.tezos.com/static/papers/white paper.pdf
21. Green, O., McColl, R., Bader, D.A.: A fast algorithm for incremental betweenness centrality. In: Proceedings of SE/IEEE International Conference on Social Computing (SocialCom), pp. 3–5 (2012)
22. Halili, E.H.: Apache JMeter: A Practical Beginner's Guide to Automated Testing and Performance Measurement for your Websites. Packt Publishing Ltd, Birmingham (2008)
23. Klarman, U., Basu, S., Kuzmanovic, A., Sirer, E.: bloxroute: a scalable trustless blockchain distribution network whitepaper
24. Kogias, E.K., Jovanovic, P., Gailly, N., Khoffi, I., Gasser, L., Ford, B.: Enhancing bitcoin security and performance with strong consistency via collective signing. In: 25th USENIX Security Symposium (USENIX Security 16), pp. 279–296 (2016)
25. Korpela, K., Hallikas, J., Dahlberg, T.: Digital supply chain transformation toward blockchain integration. In: Proceedings of the 50th Hawaii International Conference on System Sciences (2017)
26. Kwon, J., Buchman, E.: Cosmos: A network of distributed ledgers (2016). https://cosmos.network/whitepaper
27. LeMahieu, C.: Nano: a feeless distributed cryptocurrency network. Nano [Online resource] (2018). https://nano.org/en/whitepaper, Accessed 24 Mar 2018
28. Lewenberg, Y., Sompolinsky, Y., Zohar, A.: Inclusive block chain protocols. In: International Conference on Financial Cryptography and Data Security, pp. 528–547. Springer (2015)
29. Li, C., Li, P., Xu, W., Long, F., Yao, A.C.-C.: Scaling nakamoto consensus to thousands of transactions per second (2018). arXiv preprint arXiv:1805.03870
30. Liu, C., Wang, D., Wu, M.: Vite: a high performance asynchronous decentralized application platform
31. Lombrozo, E., Lau, J., Wuille, P.: Segregated witness (2015)
32. Spoke Matthew and Engineering Team Nuco: Aion: Enabling the decentralized internet. Aion project yellow paper, vol. 151, pp. 1–22 (2017)
33. Mazieres, D.: The stellar consensus protocol: a federated model for internet-level consensus. In: Stellar Development Foundation (2015)
34. Michael, J.W., COHN, A., Butcher, J.R.: Blockchain technology. J. (2018)
35. Miller, A., Xia, A., Croman, K., Shi, E., Song, D.: The honey badger of BFT protocols. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 31–42. ACM (2016)
36. Nakamoto, S.: Bitcoin: A peer-to-peer electronic cash system (2008)
37. Nedelcu, C.: Nginx HTTP Server: Adopt Nginx for Your Web Applications to Make the Most of Your Infrastructure and Serve Pages Faster than Ever. Packt Publishing Ltd, Birmingham (2010)
38. Poon, J., Buterin, V.: Plasma: Scalable autonomous smart contracts. White paper, pp. 1–47 (2017)
39. Poon, J., Dryja, T.: The bitcoin lightning network: Scalable off-chain instant payments (2016)
40. Popov, S.: The tangle. cit.on, p. 131 (2016)
41. Schwartz, D., Youngs, N., Britto, A., et al.: The ripple protocol consensus algorithm. Ripple Labs Inc White Paper, vol. 5 (2014)
42. Sompolinsky, Y., Lewenberg, Y., Zohar, A.: Spectre: Serialization of proof-of-work events: confirming transactions via recursive elections (2016)
43. Sompolinsky, Y., Zohar, A.: Phantom, ghostdag
44. Sompolinsky, Y., Zohar, A.: Accelerating bitcoin's transaction processing. Fast Money Grows on Trees, Not Chains (2013)
45. Sompolinsky, Y., Zohar, A.: Secure high-rate transaction processing in bitcoin. In: International Conference on Financial Cryptography and Data Security, pp. 507–527. Springer (2015)
46. Tapscott, A., Tapscott, D.: How blockchain is changing finance. Harvard Business Review, 1 (2017)
47. Tian, F.: An agri-food supply chain traceability system for china based on RFID and blockchain technology. In: 2016 13th International Conference on Service Systems and Service Management (ICSSSM), pp. 1–6. IEEE (2016)
48. Wood, G.: Ethereum: a secure decentralised generalised transaction ledger. Ethereum Project Yellow Paper 151, 1–32 (2014)
49. Yue, X., Wang, H., Jin, D., Li, M., Jiang, W.: Healthcare data gateways: found healthcare intelligence on blockchain with novel privacy risk control. J. Med. Syst. 40(10), 218 (2016)
A Disaster Management System on Mapping Health Risks from Agents of Disasters and Extreme Events

Christine Diane Ramos, Wilfred Luis Clamor, Carl David Aligaya, Kristin Nicole Te, Magdiyel Reuel Espiritu, and John Paolo Gonzales

De La Salle University, 2401 Taft Ave, Malate, 1004 Manila, Metro Manila, Philippines
{christine.diane.ramos,wilfred.clamor,carl_aligaya,kristin_te,magdiyel_espiritu,john_paolo_gonzalez}@dlsu.edu.ph

Abstract. The World Economic Forum ranks the Philippines third among the countries with the highest disaster risk, with an index value of 25.14%. This is attributed to the location of the archipelago, whose coasts are exposed to hazards such as typhoons, storm surges, and rising sea levels. Local government units often find it difficult to provide immediate relief because reliable large-scale data is decentralized. To improve disaster management and recovery, this research provides a systematic treatment through the development of a national database of health risks, which contains information on exposure and vulnerability to hazards. The system leverages data management, analysis techniques, and visualization so that disaster management researchers can explore which health issues are prevalent for a certain type of disaster or extreme event. In turn, this supports a more preventive approach to building awareness of, and recommendations on, changing disaster risks, and to disseminating risk information for public health emergencies.

Keywords: Disaster · Health Risk

© Springer Nature Switzerland AG 2021. K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 523–537, 2021. https://doi.org/10.1007/978-3-030-63089-8_33

1 Introduction

1.1 Background

The Philippines is known to be exposed to different types of disasters and extreme events. Climate-related disasters, geophysical disasters, and armed conflict are frequent in the country. Because of the country's geographical location, climate-related disasters such as typhoons, continuous rains, and flooding are prevalent. According to the report by Bricker et al. (2014), the Philippines is strongly affected by rain-bearing winds and a high amount of precipitation due to its tropical location. The report also states that the country experiences an average of 20 typhoons every year. As a result, flooding and disruption of drainage are also prevalent because of the high amount of rain the area experiences [1]. Geophysical disasters are likewise predominant extreme events due to the location of the country. The Philippines lies on the Pacific Ring of Fire; hence volcanic eruptions and earthquakes are common. Moreover, the country has 22 active volcanoes, which adds to its susceptibility to geophysical disasters such as earthquakes and volcanic eruptions. According to the Philippine Institute of Volcanology and Seismology (PHIVOLCS), the country experiences on average five earthquakes and 20 tropical cyclones, of which at least five are disruptive, as they may lead further to tsunamis, storm surges, landslides, flooding, and drought [2]. Geophysical factors also transcend administrative boundaries. In terms of armed conflict, the country suffers from the Moro separatist movement for sovereignty and the right to territorial integrity [3].

With various disasters and extreme events in the Philippines, different health risks arise. Diseases such as water-borne and vector-borne diseases are prevalent. Water-borne diseases are prevalent because access to safe and clean water is disrupted. Diseases such as diarrhea, typhoid fever, hepatitis, and cholera occur after drinking water has been contaminated by flooding and other kinds of extreme events. Leptospirosis is also a water-borne disease; however, it can be transmitted by direct contact of an open wound with contaminated water, such as water carrying rodent urine [4]. In terms of vector-borne diseases, extreme events such as climate-related disasters affect vector breeding areas. Still water left by heavy rains and flooding provides breeding sites and hence increases the population of mosquitoes carrying vector-borne diseases such as dengue and malaria. Another disease caused by extreme events is tetanus. It is not transmitted between individuals; however, wounds may be contaminated by the anaerobic tetanus bacillus Clostridium tetani [5].

1.2 Challenges in Disaster Management Response
Various studies have investigated the interconnection of health and disasters. The WHO uses the term Health Emergency and Disaster Risk Management to refer to this relationship. According to WHO (2011), it is the "systematic analysis and management of health risks, posed by emergencies and disasters, through a combination of (1) hazard and vulnerability reduction to prevent and mitigate risks, (2) preparedness, (3) response and (4) recovery measures". Within Health Emergency and Disaster Risk Management, emphasis has been given to studies on health risk due to disasters (Korteweg, Bokhoven, Yzermans, & Grievink, 2010), the interaction of emergency management and health (Clements & Casani, 2016), and public health in disaster risk reduction management (Shoaf & Rottman, 2000) [6].

Overlapping and sometimes conflicting mandates on health of the Department of Health (DOH) and local government units (LGUs), owing to devolution, led to the disintegration of the originally integrated referral system that linked public health services and hospital services. Decentralization was pushed too far, leading to fragmentation in service delivery, as each health unit was inefficiently assigned to a different local government rather than keeping the integrated provincial health system in place. The flow of health funds was also made more complicated and inequitable by the weak link between health budget allocation and devolved health functions. Even with the additional internal revenue allotment (IRA) in 1992, for instance, some LGUs inevitably suffered revenue shortfalls, since the extra IRA was distributed without regard for the distribution of the cost of devolved functions [2].

This prompts government units to include in their strategic thrusts and programs support for research and development, with priority on emerging technology trends for disaster management. As a response to the challenges identified in improving health outcomes, this study aims to create a central repository that collects data anchored on past studies, literature, and reports on health risks from varying hazards and exposures, specifically environment, climate change, seismic exposure, communal infrastructures, and selected human activities. Moreover, it aims to collect data drawn from these agencies' reports and literature within the past five years, to include present community initiatives and interventions meant to reduce risks, and to assess and update the current state of data, literature, and scientific and indigenous knowledge on disaster risk reduction and resiliency. With a centralized system collecting valuable information, disaster management research assistants can provide recommendations to key policy makers and other stakeholders, giving them a better understanding of which priority areas to target.
2 Review of Related Systems

Despite the prevalence of disasters, many organizations and government units are still caught unprepared when they happen. This presents an opportunity for information technology to serve as an assistive medium for disaster risk reduction, for preparedness towards natural disasters, and for prompt response when they occur [7]. Various disaster management systems have been developed. Alazawi et al. developed a smart disaster management system that uses Intelligent Transportation Systems, including Vehicular Ad hoc Networks (VANETs), mobile, and Cloud Computing technologies. Their research focused on two major evacuation strategies, Demand Strategies (DS) and Speed Strategies (SS). The motivation stemmed from the increased disruption caused by man-made and natural disasters such as Typhoon Haiyan in the Philippines. Given that smart cities allow efficiency, especially in the implementation of emergency response systems, the proponents found an opportunity to develop a business model on an advanced infrastructure for the evacuation operation by employing different city evacuation strategies [8]. A similar evacuation management system was developed by Azgar et al. for disasters related to unexpected fire accidents, bomb blasts, stampedes due to crowd panic, incidents of violence, collapse of buildings, and other natural calamities. The system was a crowd tracking technology that used GPS-based and Bluetooth tracking to identify critical locations where congestion may occur. This paved the way for evacuation path guidelines containing proper instructions for self-evacuation [9]. Another simulation study on disaster management control was developed by Song et al., in which GPS records and over three years of data on earthquakes that occurred in Japan were collected to demonstrate model efficiency on human mobility following a natural disaster. These data contained information such as the occurrence time, the earthquake hypocentral location, the magnitude and intensity for impacted places, and damage levels such as destroyed infrastructure and deaths. The results yielded evaluation metrics that measured the overall accuracy of different predictive models of the real movements of persons following a specific disaster [10]. Similar research in the United Kingdom produced ECSKernel, a demonstrator for coordination algorithms within the ALADDIN (Autonomous Learning Agents for Decentralized Data and Information Systems) project. It was developed to show how specific auction mechanisms and coalition formation algorithms can be applied to disaster management problems [11].

Other uses of ICT in disaster management rely on social media as a valuable medium for providing an instant view of the calamity, including sharing information on real-time happenings and ongoing relief operations, and connecting displaced families or friends [12]. The Internet of Things is another prevalent ICT solution for disasters. Embedded electronics act as sensors that determine calamity thresholds, and the recent emergence of 5G presents a significant advancement in radio science that enables communication and interaction through social collaboration [13]. The system closest to this research is the open source software developed by Currion et al., which coordinated institutional and technical resources in the midst of the Indian Ocean tsunami. This disaster management system served as a registry of grassroots information obtained during the calamity. The system comprised components such as an organization registry, a missing people registry, a camp registry (which reports or tracks shelters), inventory management (where supplies were listed), a messaging module, and a mapping module that allowed annotation of text and images of disasters [14].

However, despite the various technologies developed for disaster management, none of these systems specifically focuses on the health risks brought about by the impact of disasters. Whereas such systems focus on simulation models or crowdsourcing, our research uses data analytics and visualization to investigate correlated diseases as a consequence of a disaster that has occurred. In effect, these would be used for spreading health awareness and warnings to the areas most affected by a specific natural disaster or extreme event. The study of Li et al. was used to ensure that the information system design follows the general requirements outlined in the disaster system architecture, especially with respect to information retrieval, information filtering, and data mining [15].
3 Research Design and Methodology
3.1 Requirement Gathering and Problem Analysis
Prior to prototype development, it is vital to understand the problem areas encountered in the analysis of disaster management data in order to conceptualize an IS solution that fits the needs of the stakeholders. Through a series of interviews and document analyses, we used the process of the National Disaster Risk Reduction and Management Council (NDRRMC), the government unit responsible for disaster risk reduction, as a benchmark for how data are gathered, key priority areas are identified, and relief interventions are provided. This served as the baseline against which the disaster management analysis process of the Social Development Research Center was compared. After outlining the main processes, another series of interviews and document reviews was performed to identify the main causes of the difficulty in consolidating, utilizing, and analyzing data in collaborative research activities on disasters. Several factors were identified: (1) timeliness, with the consolidation of data from various disaster-related projects taking 2-3 months, since some researchers cannot share their data immediately because they are bound by a
A Disaster Management System on Mapping Health Risks
copyright of their own data; and (2) difficulty in consolidating data from different source types, which lengthens the consolidation phase. Data collected from field areas impacted by disasters are obtained through multiple instruments such as pen-and-paper forms, surveys, and focus group discussions (FGDs), which makes it difficult to collate all gathered data into one standardized form. Disaster research teams are also given the liberty to model their data according to the visualization they are most familiar or comfortable with, which limits conclusions if the data cannot be transformed, drilled down, or compounded. When published reports are not given to the government unit on time, financial milestones are affected, which hinders the release of the approved budget. All of these issues were modeled and arranged with their corresponding causes and effects using the Ishikawa problem analysis technique, as shown in Fig. 1.
Fig. 1. Ishikawa diagram on disaster management analysis processes of the Social Development Research Center
3.2 Prototype Development
For the system development methodology, the research team followed Scrum as the most applicable framework for this study because it provides flexibility when alterations arise, breaks down the overall team objectives into smaller parts, and enables the team to cooperate closely with the project’s partners as components are finished. Since the project was conducted on a specific timeline, the team deemed Scrum the most applicable way of delivering it on time. Scrum gives the team flexibility to adjust to changes mandated by the stakeholders, which can help eliminate unnecessary tasks when change requests are unavoidable. Using Scrum also enables the team to tackle the overall objectives in smaller parts where
C. D. Ramos et al.
doing the software can be done in small phases. Tackling the work in smaller, easily comprehensible units may help the team become more productive by consistently meeting deadlines. Lastly, Scrum enabled the team to cooperate closely with the project partners as they made progress on the research. By showing the finished products of the software and the project to the stakeholders, the team was able to assess progress consistently and understand where more effort was necessary.
Initiating. The overall objective of this phase is identifying project stakeholders and gathering requirements. The overarching need for the central database came from the Department of Science and Technology - Philippine Council for Industry, Energy and Emerging Technology Research and Development (PCIEERD). A series of meetings with PCIEERD and the Social Development Research Center (SDRC) was conducted in order to understand the users and convert their needs into software requirements. During this phase, sprint planning sessions were also held to define and present smaller achievable tasks. It was in this phase that the main problems were identified and mapped to the proposed module solutions, specifically functions for Data Collection, Data Consolidation, Data Visualization and Analysis, Initial Results and Feedbacking, and Project Completion and Reporting, which are explained further in the next section.
Sprinting/Executing. After the initiating phase, the executing phase starts with the software development. The output of the software was presented incrementally while development continued to produce new output.
Inspect and Adapt. This phase is integrated with the sprint phase since it is done during sprinting periods. During this phase, team members gather every day for what are called stand-ups, which aim to inspect the team’s progress towards the sprint’s goal.
In these stand-ups, team members briefly convey what they did, what they are doing, and what they will do. This makes it possible to monitor progress and apply corrective measures when complications arise.
Continuous Planning and Backlog Grooming. In this phase, the Scrum team members and the product owner meet to discuss what is done and what needs to be done. The development members also demonstrate what is currently finished in order to acquire feedback and to get information from the project owner regarding new changes to the product. Through continuous planning, the team stays flexible to changes and communicative with the project owner.
Closing. At the end of the Scrum framework is the closing phase, where meetings are held after the final sprint. In these meetings, feedback about the project is discussed within the Scrum team, covering good practices, things to improve, and possible solutions to constraints. By discussing these topics, the project team can handle other projects more effectively by knowing its strengths and weaknesses.
4 System Conceptual Framework
The research proponents designed the conceptual framework (see Fig. 2), which consists of the proposed functions and system features that address the problems identified. Alongside these modules are the technologies and tools used during development.
Fig. 2. System conceptual framework
4.1 System Users
The test bed users of the system were the disaster management researchers under the university’s Social Development Research Center. This arm is engaged by various government units, such as the Department of Science and Technology, for disaster-related research. The first user group comprises the research coordinator and director; the second comprises the research project team members coming from different branches of expertise (e.g., a civil engineering expert providing findings on infrastructure, and a humanities or health expert for health or psychological impact); and the third is the public, for viewing published reports and as a participation medium.
Project Director. The project director is responsible for (1) leading all project coordinators and making sure that the resources gathered are on track with the research, (2) building relationships with other organizations that will be beneficial for the research, and (3) conducting the final review of the research survey or questionnaire to ensure that the appropriate data are collected. The project director, who is the owner of the project, can view all the analyses and consolidated data uploaded by the research team and can give feedback.
Research Associate/s or Experts. The project research associate, also known as the research assistant, is responsible for (1) the collection and consolidation of data, which they upload to the system for the research members to use in their analysis, and (2) the sourcing of experts (research members) to be part of a project, as well as delegating their responsibilities. The research associates’ goal is to arrive at a conclusion from the data uploaded by the project research coordinator and, with that conclusion, collaborate with other research associates under the same project to arrive at a conclusion agreed upon among the members.
Public User. As the database is a collaborative tool among disaster research teams, even those outside the university or project team, published sections of the findings (as approved by the project director) can be viewed by public users or external disaster research entities.
4.2 Information Assets
The following outlines the data points that the system initially collects. The system is also flexible enough to accommodate additional data fields or entries depending on the findings from field data gathering.
Literature Upload. This data point allows disaster management researchers to supplement current findings with historical data. Table 1 gives a summary of all Literature Upload data points.
Table 1. Literature upload data points
Field entry | Sub-entries | Categories
Year | |
Title | |
Author | |
Abstract | |
Type of Literature | Gray | News Article, Conference Paper
 | Scientific Journals |
Source | |
Keywords | |
Web URL | |
Health Data. This data point allows disaster management researchers to supplement current findings with historical data. Table 2 gives a summary of all Health Upload data points.
Event Data. This data point records the incident that transpired or the type of disaster that hit the municipality or the barangay. The event data also record the date and the number of deaths and casualties (dead, injured, or missing), the affected families or persons, the
Table 2. Health data points
Field entry | Sub-entries
Year |
Diseases | AW Diarrhea, AB Diarrhea, Hepatitis, Typhoid Fever, Cholera, Dengue, Malaria, Leptospirosis, Tetanus
evacuated families or persons, the number of evacuation centers, damage to houses (total or partial), and damage to properties.
Health Infrastructure Damages. This data point is necessary in order to highlight which health infrastructure needs immediate repair. Table 3 gives a summary of all Health Infrastructure Damages data points.
Table 3. Health infrastructure data points
Field entry | Sub-entries
Infrastructure Damages | Regional, Provincial, Municipal, Barangay, Line/Birthing
Level of Hospital | Primary, Secondary
Water System Damages | Sewage System, Waste Management System, Water System
Location. The unit of analysis is the affected location, including the island group, region, and province.
4.3 Modules of the System
Data Collection. The data collection module will contain features that deal with collecting raw social research data that will be fed into the system. There will be two ways
in which users may be able to feed data in the system. The first is manual user entry, which follows a standard form designed by the Research Coordinator. The second is a data file upload that follows the same defined form; upon upload, the system converts the data and loads them into the database according to structure and form. Specifically, four (4) files can be uploaded into the system, namely Literature Upload, Health Data, Event Data, and Health Infrastructure Damages, each containing a number of field entries, sub-entries, and categories.
Data Consolidation. The data consolidation module identifies lapses in the consolidated data through (1) Data Validation, which prompts the user when the consolidated data contain discrepancies and gives the option to either correct or eliminate each discrepancy through the system itself, and (2) Data Integrity, which scans the consolidated data for missing entries, improper input (e.g., the data asked for was age, but the data entered was a birthdate), redundancies, and typographical errors. To prevent such errors from recurring, a standard with fixed columns will be put in place, which also reduces the time needed to consolidate data.
Data Visualization and Analysis. Since the research information system will deal with significant amounts of both qualitative and quantitative data, the Data Visualization module will be able to generate charts, bars, and graphs that fit the data point and the preference of the researcher. Analysis will be made easier by a drill-down function, with which users filter the data they want to see based on the unit of analysis they selected, and a search function that lets users efficiently retrieve specific data from the data repository. This module will also contain descriptive and predictive analytics features, which will aid in producing the reports needed by the researchers.
Descriptive statistics will be applied to generate reports using averaging and summation, such as auto-generated summary statistics for areas with high-impact or high-risk conditions and averages of trends in persons affected by illnesses. Predictive analytics will also be applied through a facility for forecasting the expected number of casualties using statistical methods defined by the project research member.
Initial Results and Feedbacking. The Initial Results and Feedbacking module will have a facility for tracking communication among users under the same project, including time stamps, user profile accounts, and changes or revisions made to the report. This facility can also set authorization controls over who can see initial results data for feedbacking.
Project Completion and Reporting. The published reports are the findings and accomplishments of the research proper, and each can be set to public or internal. This facility can also produce formatted memorandums containing recommendations for external stakeholders.
Administrative Module. This facility provides role-based access control features, such as limiting view or edit access to a specific research project, and ensures that personal data obtained from field data gathering are anonymized.
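The Data Integrity checks described for the Data Consolidation module (missing entries, improper input, and redundancies) can be sketched as follows. This is a minimal illustration and not the system's actual implementation: the field names in `REQUIRED_FIELDS`, the helper names, and the sample rows are all assumptions made up for the example.

```python
from datetime import datetime

# Hypothetical required columns for an event-data record; the paper does not
# publish its actual schema.
REQUIRED_FIELDS = ("region", "year", "deaths", "injured", "missing")
COUNT_FIELDS = ("deaths", "injured", "missing")


def looks_like_date(value):
    """Return True if the text parses as a date such as 2014-07-15."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


def find_discrepancies(records):
    """Return (row_index, message) pairs for rows that need attention."""
    issues = []
    seen = set()
    for index, row in enumerate(records):
        # Missing entries: a required field is absent or blank.
        for field in REQUIRED_FIELDS:
            if not str(row.get(field, "")).strip():
                issues.append((index, f"missing value for '{field}'"))
        # Improper input: a count column holding a date or other non-number.
        for field in COUNT_FIELDS:
            value = str(row.get(field, "")).strip()
            if value and not value.isdigit():
                kind = "a date" if looks_like_date(value) else "not a whole number"
                issues.append((index, f"'{field}' is {kind}: {value!r}"))
        # Redundancies: the same row appearing more than once.
        key = tuple(str(row.get(f, "")).strip().lower() for f in REQUIRED_FIELDS)
        if key in seen:
            issues.append((index, "duplicate of an earlier row"))
        seen.add(key)
    return issues


# Made-up rows for illustration only.
sample = [
    {"region": "Region A", "year": "2013", "deaths": "12", "injured": "40", "missing": "3"},
    {"region": "Region A", "year": "2013", "deaths": "12", "injured": "40", "missing": "3"},
    {"region": "Region B", "year": "2014", "deaths": "2014-07-15", "injured": "", "missing": "0"},
]
issues = find_discrepancies(sample)
for row_index, message in issues:
    print(row_index, message)
```

Returning the issues as a list, rather than rejecting the upload outright, matches the described behavior in which the user is prompted and can then correct or eliminate each discrepancy.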
5 Results and Discussion
The main objective of the centralized IS was to assist in the mapping of health risks from agents of disasters or extreme events, as performed by disaster researchers such as the Social Development Research Center for government stakeholders such as the NDRRMC and DOST-PCIEERD. The IS solution met this objective: it enabled disaster research teams to access information and data based on their roles, to capture and upload data more efficiently, and to visualize information assets in ways that allowed researchers to elicit more insightful findings and recommendations. The data collection module standardized the information assets, addressing the main problem of consolidating data from various data collection mediums. With pre-defined fields and automated classification of content from the data upload or input, content is easily sorted and searched for all data points (see Fig. 3).
Fig. 3. Standardized information assets
Users are provided with an option for a manual input or a file upload (following a downloadable template) which addresses issues on timeliness in encoding (see Fig. 4).
Fig. 4. Input option for data collection module
Most importantly, the system provides auto-generated findings and visuals, such as charts and graphs, that aid in deeper analysis. Figure 5 shows a heatmap of which regions are most and least affected by incidents, drawn from the event data points. The heatmap can be zoomed into the respective island groups for a micro-level view. At a glance, the heatmap allows a disaster researcher to identify which regions are prone to incidents. In effect, alerts or controls can be placed in these areas for close monitoring.
Fig. 5. Heatmap for incidents per region (or as defined by user)
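The aggregation behind a heatmap of this kind can be sketched as a count of incidents per unit of analysis, with an optional filter for the island-group drill-down. The record fields (`region`, `island_group`, `incident`), the function name, and the sample rows are hypothetical; the paper does not describe how the system computes its heatmap values.

```python
from collections import Counter

# Made-up event records; field names are assumptions, not the system's schema.
events = [
    {"region": "Region A", "island_group": "Luzon", "incident": "flood"},
    {"region": "Region A", "island_group": "Luzon", "incident": "typhoon"},
    {"region": "Region A", "island_group": "Luzon", "incident": "landslide"},
    {"region": "Region B", "island_group": "Visayas", "incident": "typhoon"},
    {"region": "Region B", "island_group": "Visayas", "incident": "flood"},
    {"region": "Region C", "island_group": "Mindanao", "incident": "earthquake"},
]


def incidents_per(records, unit, island_group=None):
    """Count incidents per unit of analysis, optionally within one island group."""
    return Counter(
        record[unit]
        for record in records
        if island_group is None or record["island_group"] == island_group
    )


by_region = incidents_per(events, "region")
ranked = by_region.most_common()          # sorted from most to least incidents
most_affected, least_affected = ranked[0], ranked[-1]
luzon_only = incidents_per(events, "region", island_group="Luzon")

print("most affected:", most_affected)
print("least affected:", least_affected)
print("Luzon drill-down:", dict(luzon_only))
```

The resulting counts are what a plotting layer would map to color intensity; the same function serves both the national view and the zoomed island-group view.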
Several options are available for visualization and feedback, which display descriptive statistical data or correlated data as outlined in Fig. 6. With one click, a report can be generated on key findings such as the diseases with the highest and lowest infection rates, the months with the highest and lowest infection rates, the regions with the highest and lowest infection rates, the communicable diseases with the highest and lowest infection rates, and the non-communicable diseases with the highest and lowest infection rates. This is valuable to government stakeholders in predicting which month sees an influx of a certain disease, so that they can prepare the prevention programs, vaccines, or medicines needed for that month and location. Spot charts were also useful in determining correlation across data points, especially on diseases, incidents, and type of event (see Fig. 7). The system automatically generates key findings such as which diseases are positively correlated with each other. For instance, in the sample data shown below, it can be concluded that when AB Diarrhea
Fig. 6. Visualization for total cases disease
occurs in a certain area, it will be likely that TB Respiratory will also be high in number, and so on. Correlation reports were generated for all numerical data points.
Fig. 7. Spot chart for correlated diseases
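The correlation reports described above are not specified in detail, so the following is only a sketch that assumes a Pearson correlation coefficient between two series of case counts. The function name and the monthly counts are made up for illustration and are not the study's data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Made-up monthly case counts for three hypothetical diseases.
disease_a = [10, 20, 30, 40]
disease_b = [12, 25, 33, 48]   # rises together with disease_a
disease_c = [40, 30, 20, 10]   # falls as disease_a rises

print(f"a vs b: {pearson(disease_a, disease_b):.3f}")
print(f"a vs c: {pearson(disease_a, disease_c):.3f}")
```

A coefficient near +1 means two diseases tend to rise together in the data; it does not by itself show that one causes the other, so a positively correlated pair is best read as a prompt for further investigation.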
The system also helped improve communication by providing a commentary section where the research team can interact and finalize their analysis, which addresses the problem that members have no common time to convene and compare each other’s findings and feedback. In addition, the system helps end users keep track of their expenses, ensuring that the budget does not go out of control. It also helps end users generate assessment reports while minimizing human error, since most of the data are extracted from the system. Various system tests were also performed, including unit testing after the completion of each module and integration testing afterwards. The users went through the system according to the intended process, covering both common and special cases, to ensure that the system operates appropriately regardless of the nature of the case. The users tested all modules with dummy data, live data, or a mix of both to ensure that they could operate the system properly. This process ensures that the modules are linked correctly and handle situations appropriately. The system was also checked to ensure that it efficiently accommodates the identified amount of data. The user acceptance testing (UAT) used a Likert scale from 1, the lowest, to 5, the highest. The UAT results yielded an average of 4.8, which further validated the credibility and ease of use of the software. The UAT form was divided into four (4) sections: Interface and Design, Content, System Layout, and Website Navigation. The research team conducted five (5) user acceptance tests with end users from SDRC and select IT professionals and software engineers for professional validation.
The results for this UAT are as follows: 4.8 out of 5 for the Interface and Design, 4.9 out of 5 for the Content, 4.8 out of 5 for System Layout, and lastly 4.7 out of 5 for Website Navigation.
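As a quick check of the arithmetic, the reported overall average of 4.8 is consistent with an unweighted mean of the four section scores. The paper does not state how the overall figure was computed, so the equal weighting below is an assumption.

```python
# Section scores as reported in the text (Likert scale, 1 to 5).
section_scores = {
    "Interface and Design": 4.8,
    "Content": 4.9,
    "System Layout": 4.8,
    "Website Navigation": 4.7,
}

# Assumption: each section contributes equally to the overall average.
overall = sum(section_scores.values()) / len(section_scores)
print(f"overall UAT average: {overall:.1f}")
```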
6 Conclusion and Recommendation
This study addresses the Philippines’ current difficulty in disaster management response. By improving and promoting accountability in analysis, transparency of data, and orderly practices in disaster research, the system can have a greater impact on health care relief in Philippine society. Automating the SDRC process will help expedite disaster research, provide a real-time data repository, and promote optimized collaboration. The effects of the new system extend to beneficiaries in high-risk, disaster-prone areas. Because access to data sets on health, disaster events, and the infrastructure capabilities of various regions gives researchers quicker turnaround times in processing data and making recommendations, the system can help in creating accurate, timely, and necessary preventive measures for calamities, disasters, or even epidemics. These recommendations can in turn be used by the local government, concerned private sectors, and other organizations to provide better services by directing health funds towards the areas that need them the most, and better facilities by promoting accessibility for citizens and imparting proactive solutions that mitigate the risks brought about by calamities. It is with
the hope that future researchers will contribute to the objective of the research, which may potentially save lives, equip communities, and help citizens prepare for any health or disaster-related incidents.
Acknowledgments. The proponents would like to thank the De La Salle University - University Research Coordination Office (URCO) for funding this project.
References
1. Bricker, J.D., et al.: Spatial variation of damage due to storm surge and waves during Typhoon Haiyan in the Philippines. J. Japan Soc. Civ. Eng. Ser. B2 70(2), 231–235 (2014)
2. Department of Health: National Objectives for Health: Philippines 2017–2022. Heal. Policy Dev. Plan. Bur. Dep., no. 1908–6768 (2018)
3. Buendia, R.G.: The state-Moro armed conflict in the Philippines: Unresolved national question or question of governance? Asian J. Polit. Sci. 13(1), 109–138 (2005)
4. Lo, S.T.T., et al.: Health emergency and disaster risk management (Health-EDRM): Developing the research field within the Sendai framework paradigm. Int. J. Disaster Risk Sci. 8(2), 145–149 (2017)
5. Watson, J.T., Gayer, M., Connolly, M.A., et al.: Epidemics after natural disasters. Emerg. Infect. Dis. 13(1), 1–5 (2007)
6. Grinnell, M.: Status of technology and digitization in the nation’s museums and libraries. J. Acad. Librariansh. 32(4), 445 (2006)
7. Landry, B.J.L., Koger, M.S.: Dispelling 10 common disaster recovery myths: Lessons learned from Hurricane Katrina and other disasters. ACM J. Educ. Resour. Comput. 6(4), 6 (2006)
8. Alazawi, Z., Alani, O., Abdljabar, M.B., Altowaijri, S., Mehmood, R.: A smart disaster management system for future cities. In: WiMobCity 2014 – Proceedings 2014 ACM International Workshop Wireless Mobile Technology. Smart Cities, co-located with MobiHoc 2014, pp. 1–10 (2014)
9. Ibrahim, A.M., Venkat, I., Subramanian, K.G., Khader, A.T., De Wilde, P.: Intelligent evacuation management systems: a review. ACM Trans. Intell. Syst. Technol. 7(3), 1–27 (2016)
10. Song, X., Zhang, Q., Sekimoto, Y., Shibasaki, R., Yuan, N.J., Xie, X., et al.: Prediction and simulation of human mobility following natural disasters. ACM Trans. Intell. Syst. Technol. 8(2), 1–23 (2016)
11. Ramchurn, S.D., et al.: Agent-based coordination technologies in disaster management. In: Proceedings of International Joint Conference Autonomous Agents Multiagent System. AAMAS, vol. 3, pp. 1605–1606 (2008)
12. Nguyen, Q.N., Frisiello, A., Rossi, C.: Co-design of a crowdsourcing solution for disaster risk reduction. In: I-TENDER 2017 – Proceeding 2017 1st Conext Workshop ICT Tools Emergency Networks DisastEr Relief, pp. 7–12 (2017)
13. Velev, D., Zlateva, P., Zong, X.: Challenges of 5G usability in disaster management. In: ACM International Conference Proceeding Series, pp. 71–75 (2018)
14. Currion, P., De Silva, C., De Walle, B.: Open source software for disaster management. Commun. ACM 50(3), 61–65 (2007)
15. Li, T., et al.: Data-driven techniques in disaster information management. ACM Comput. Surv. 50(1), 1–455 (2017)
Graphing Website Relationships for Risk Prediction: Identifying Derived Threats to Users Based on Known Indicators
Philip H. Kulp and Nikki E. Robinson
Cybrary Fellow, College Park, MD 20737, USA
[email protected]
Abstract. The hypothesis for the study was that the relationship based on referrer links and the number of hops to a malicious site could indicate the risk to another website. The researchers chose Receiver Operating Characteristics (ROC) analysis as the method of comparing true-positive and false-positive rates for captured web traffic to test the predictive capabilities of the created model. Known threat indicators were used as designators and leveraged with the Neo4j graph database to map the relationships between other websites based on referring links. Using the referring traffic, the researchers mapped user visits across websites with a known relationship to track the rate at which users progressed from a non-malicious website to a known threat. The results were grouped by the hop distance from the known threat to calculate the predictive rate. The results of the model produced true-positive rates between 58.59% and 63.45% and false-positive rates between 7.42% and 37.50%, respectively. The true and false-positive rates suggest an improved performance based on the closer proximity from the known threat, while an increased referring distance from the threat resulted in higher rates of false-positives.
Keywords: Cyber security · Graphing database · Receiver operating characteristics · Neo4j · Website · Threat model
1 Introduction
The purpose of the research was to test a method of identifying risks to websites based on the relationship to known threats. Since a lag exists between identifying an instance of malware and propagation of the threat notifications, blacklists cannot provide a complete solution to identifying risks [1]. The rationale for the study was to test the predictive nature of risks associated with non-malicious websites based on the proximity to malicious sites as determined by known threat indicators. In the current study, the researchers sought a method of enhancing the existing threat indicators with additional metadata about the evaluation of risk based on relationships. Receiver Operating Characteristics (ROC) analysis [2] was used to evaluate the performance of predicting the risk of visiting websites based on the proximity to known malicious sites. The researchers defined proximity as the number of hops to a known
© Springer Nature Switzerland AG 2021 K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 538–549, 2021. https://doi.org/10.1007/978-3-030-63089-8_34
threat. ROC analysis allows researchers to evaluate models in medical trials, Intrusion Detection Systems (IDS), and machine learning. Researchers use ROC analysis to assess the performance of IDS and other cybersecurity-related research [3]; therefore, the researchers concluded the method could provide an acceptable approach for testing binary classification of threats. ROC analysis can evaluate multiple classes of a single model, and in the current study, the classes define the distance (in hops) from the known threat. By delineating threats based on the relationships to other websites, organizations could use the information to determine the threshold for acceptable risk. The risk evaluation determines the probability that a user visiting a website may follow a path to malicious content.
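To make the ROC quantities concrete, the sketch below computes the true-positive rate, TP / (TP + FN), and the false-positive rate, FP / (FP + TN), for each hop-distance class, which gives one point in ROC space per class. The function name and the confusion counts are invented for illustration; they are not the study's data, and the study's own counting procedure may differ.

```python
def roc_point(tp, fn, fp, tn):
    """Return (false_positive_rate, true_positive_rate) for one class."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return fpr, tpr


# Made-up confusion counts per hop distance; these are NOT the study's data.
counts_by_hop = {
    2: {"tp": 30, "fn": 10, "fp": 5, "tn": 55},
    3: {"tp": 24, "fn": 16, "fp": 15, "tn": 45},
}

for hop, counts in sorted(counts_by_hop.items()):
    fpr, tpr = roc_point(**counts)
    print(f"hop {hop}: TPR={tpr:.2%} FPR={fpr:.2%}")
```

A class whose point lies closer to the top-left corner of ROC space (high true-positive rate, low false-positive rate) is the better predictor, which is how results at different hop distances can be compared.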
2 Related Work
Gyöngyi, Garcia-Molina, and Pedersen researched a topic similar to the current study, using the equivalent of Google PageRank to determine whether a website should be classified as spam [4]. The researchers focused the discussion on a method called TrustRank, which evaluated multiple indicators to assess the trust of a website. Wen, Zhao, and Yan presented research [5] similar to that of Gyöngyi et al., except that they analyzed the content of pages to determine whether a website should be classified as malicious. While both studies addressed topics similar to the current study, the goal of the current study was to track activity across related websites to test the likelihood of predicting that a user would visit a malicious website after entering other websites in the relationship. Rawal, Liang, Loukili, and Duan reviewed the prior ten years of research on cyberattacks and attackers’ patterns and offered a tool to predict the subsequent moves of attackers [6]. This research aligns with the current study in attempting to understand the attacker’s next step, which would provide an advantage to the organization under threat of attack. Rawal et al. proposed a mathematical model to predict cyberattacks. The information used for this model included an analysis of hundreds of cyberattacks from 2000 to 2015. The researchers first identified the possible attackers, including internal and external intruders, and then reviewed the research to identify the top attack methods used. The top three attack vectors were Denial of Service (DoS), Cross-site Scripting (XSS), and Brute Force [6]. The results of their research suggest the need for a tool to predict when a user or organization is at risk of a cyberattack. Chiba, Tobe, Mori, and Goto [1] concurred with the findings of Rawal et al. that web-based malware attacks are on the rise and are considered a severe threat. Chiba et al. tested a method of blacklisting websites using IP characteristics [1].
This study is complementary to previous research as the researchers were not attempting to replace blacklists or common cybersecurity tools but to create a model that may aid in identifying malicious websites. Chiba et al. used two Internet Protocol (IP) characteristics of stability and address space characteristics to assist in the blacklisting of websites. The researchers were able to demonstrate the attributes within the experimental use of their feature extraction model. A key takeaway from this research is the usage of a
ROC curve and agreement that there is a need to develop a tool to detect malicious websites. While Chiba et al. focused on specifically blacklisting of websites, the current research focused on determining relationships between malicious websites. Bruijn and Janssen studied the difficulties in defining a framework for improving the communication of cybersecurity-related issues [7]. The researchers built an evidence-based framework to enhance cybersecurity awareness and improve perspectives for solving problems. Bruijn and Janssen identified four main topics in creating cybersecurity policies; intangible nature, socio-technical dependence, the ambiguous impact of a cyberattack, and defense against cyberattackers. The researchers focused on the inability to communicate risk to users about accessing websites. They also determined the need for creating policies to educate users on website safety. The selected research goal was to develop a model that could provide users with risk-based alerts during browsing near malicious websites. The implementation of the model in the current research could aid organizations with the creation of policies and education for the end-users. Schwarz and Morris determined that incorrect and misleading information on websites can have severe consequences for users [8]. The researchers focused on the identification of credibility in web pages to aid users in assessing the validity of the content. This research is critical to the current study, as it reinforces the value of flagging malicious websites to notify users. As Bruijn and Janssen stated, the human factor (or socio-technical aspects) of cybersecurity can be the most crucial in preventing a cyberattack [7]. Schwarz and Morris created a visualization of search results to help individuals determine the validity of a website. 
Providing a tool that helps users make the right choice when engaging with website content [8] complements the current research, which holds that users need to comprehend how close they may be to a malicious website. The selected model aims to build on this idea by providing users with the knowledge of how close they are to a dangerous web page.
3 Method
The researchers designed the quantitative study to test a model for identifying risk to websites based on the relationship to known threat indicators. The general problem is that websites can include content from other sites that may load malicious data into a web browser and cause harm to users. The specific problem is that known malicious website content can be blocked, but a safe site may load content from a threat that has not yet been labeled with an indicator in the last defense cybersecurity software. The purpose of the research was to describe a method of identifying risks to websites based on the relationship to known threats. For the study, the independent variable was a threat indicator to a known malicious website. The dependent variable was the relationship between websites inferred by the referring links between all the web traffic captured. For the same window of the captured data, blocked websites were extracted as the threat indicators and mapped as malicious websites. All websites in the captured traffic represented the population for the study, and the websites with a relationship to malicious websites represented the sample of the population. Relationships are described as any path from a target website to a
Graphing Website Relationships for Risk Prediction
known malicious website. The relationships were limited to four “hops” from a website to a malicious site with a threat indicator. The research was performed at a single organization; therefore, the traffic for analysis was limited to malicious websites from the users of that organization. The malicious website indicator notification was determined by the firewall device installed in the location and presented a second limitation to the study. Finally, since no preliminary website risk graph database existed, the model had to be created based on the existing indicators. This limitation manifested as 100% true positive rate for first-hop analysis. These limitations were understood by the researchers and accepted as a basis to continue with testing the model. Threat indicators classify malicious websites, but based on the purpose of the study, the researchers sought to test a method of identifying risks to users before they visited a malicious website. Predicting threats to users and averting risk provides a first-move advantage [6] and could reduce the potential risks to an organization. A website that loads content from another website represents a possible link one hop away. If a user clicks on a link to visit a new website, the new website could load data from a third website, which represents a risk two hops away. A website may contain relationships with multiple malicious sites at multiple hop lengths, so risks could be elevated based on not only the distance of the relationship but also the totality of the relationships. We did not attempt to establish a quantitative sum of the risk in the current study, but the topic is addressed in the future research section. The following hypothesis and null hypothesis provide the structure and framing for the study: H = The relationship based on referrer links and the number of hops to a malicious site can indicate the risk to a website. 
H0 = A risk of visiting a website cannot be indicated by the relationship to known threats.
3.1 Data Collection
Once the research design and method were developed, the extract, transform, and load (ETL) process began. The data was repurposed from previously collected website traffic [9] and sanitized before ingesting into the databases. The website traffic was gathered from unencrypted network traffic using the http-monitor tool [10] and written to JavaScript Object Notation (JSON) files. The researchers developed a custom Python program to extract the domain, port, URI (Uniform Resource Identifier), and referrer data points for ingest into the databases. The referrer represents the URI from the previous website, which leads to the current site. The URI for the original and referrer website was stripped of variables to sanitize any potentially sensitive content. The Python program transformed the content and populated the data into the Neo4j and MySQL databases.
Instrumentation. Data collection occurred over 21 days and was limited to the metadata of web traffic occurring on port 80 since metadata from encrypted traffic cannot be extracted. The monitoring tool collected the metadata and web headers of 6,038,580 web traffic connections. The researchers filtered the traffic destined for the websites of the organization to avoid capturing sensitive data. After filtering potentially
sensitive data, the number of web traffic connections stored in the MySQL database was 3,735,355. The resulting traffic was stored in MySQL to support queries of domain association generated from the graph database. Neo4j was selected as the graph database for the research based on the popularity of the software and the robust capabilities of existing Python modules. The researchers leveraged the work of the Neomodel module [11] since the object-oriented design allowed for quick code development. All website domains were created as Website nodes in the database, and the referring website was also created as a Website node. A connection was established between the two nodes using the relationship name of Refers. All blocked domains during the time of the web traffic sampling were created as Indicator nodes, and a Threat relationship was established to the Website node. Figure 1 contains the nodes and relationships incorporated within the Neo4j database after the data was loaded.
Fig. 1. Neo4j populated nodes and relationships
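The extraction and sanitization step described in Sect. 3.1 could look roughly like the following minimal sketch. It is not the authors' program: the input keys `url` and `referrer` and the output field names are hypothetical, and only the Python standard library is used. The sketch keeps the domain, port, and path and drops query variables and fragments.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Return (domain, port, path) with query variables and fragment dropped."""
    parts = urlsplit(url)
    # Fall back to the default port of the scheme when none is given.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return parts.hostname, port, parts.path or "/"


def to_record(entry):
    """Build one sanitized record from a captured request.

    `entry` is a hypothetical dict with a 'url' key and an optional
    'referrer' key; the real capture format may differ.
    """
    domain, port, uri = split_url(entry["url"])
    record = {"domain": domain, "port": port, "uri": uri,
              "referrer_domain": None, "referrer_uri": None}
    if entry.get("referrer"):
        ref_domain, _, ref_uri = split_url(entry["referrer"])
        record["referrer_domain"] = ref_domain
        record["referrer_uri"] = ref_uri
    return record


example = to_record({
    "url": "http://example.com/a/b?user=alice&token=1",
    "referrer": "http://news.example.org/story?id=7",
})
print(example)
```

Records of this shape could then be written to both the graph and the relational store, with the referrer domain supplying the source of each Refers relationship.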
With all nodes created and relationships established, the researchers could query the Neo4j database using the cypher query language. The following cypher code provides an example query used during the study to build relationships to malicious websites identified by threat indicator within three hops.
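As an illustration only, and not the query used in the study, a query of this shape could be assembled in Python as sketched below. The node labels (Website, Indicator) and relationship names (Refers, Threat) come from the text. The relationship directions, the variable names, and the helper `build_hop_query` are assumptions.

```python
def build_hop_query(max_hops=3):
    """Return a Cypher query string for paths of 1..max_hops Refers links
    that end at a Website flagged by an Indicator.

    Assumed directions: (referrer)-[:Refers]->(site) and
    (Indicator)-[:Threat]->(Website).
    """
    hops = int(max_hops)
    if hops < 1:
        raise ValueError("max_hops must be at least 1")
    return (
        f"MATCH path = (site:Website)-[:Refers*1..{hops}]->(bad:Website)"
        "<-[:Threat]-(:Indicator) "
        "RETURN path"
    )


print(build_hop_query(3))
```

The variable-length pattern `*1..3` bounds the traversal, which matters on a dense referrer graph where unbounded path matching can become expensive.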
The Neo4j application provides a graphical visualizer capability to render the results of a query. Sample results from the above cypher query are included in Fig. 2 to provide a visualization of all possible paths to a malicious website. Websites include referring links to other websites, and the hierarchy of associations can be plotted out to any number of “hops.” The term hop represents the number of links a user would need to click to arrive at the website under investigation. In Fig. 2, the malicious website is labeled as “px.powe” and located on the middle-left with the incoming Threat relationship.
Fig. 2. Referrer hops to a malicious website
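The notion of a hop can also be expressed outside the graph database. The following minimal sketch runs a breadth-first search over an in-memory referrer map and returns the fewest hops from a site to any known threat, limited to four hops as in the study. The site names and edges are invented for illustration and do not reproduce the captured data.

```python
from collections import deque


def hops_to_threat(refers, threats, start, max_hops=4):
    """Return the fewest Refers hops from `start` to any site in `threats`,
    or None if no threat is reachable within `max_hops`."""
    if start in threats:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        site, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in refers.get(site, ()):
            if nxt in seen:
                continue
            if nxt in threats:
                return dist + 1
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None


# Hypothetical referrer map: each site lists the sites it refers to.
refers = {
    "portal": ["ads", "cdn"],
    "ads": ["sync"],
    "sync": ["tracker"],
    "cdn": [],
}
threats = {"tracker"}

print(hops_to_threat(refers, threats, "portal"))
```

Breadth-first search returns the shortest distance only. Counting all paths per site, as Fig. 2 suggests may be relevant, would need a different traversal.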
As denoted in Fig. 2, multiple paths exist to the malicious Website node. Two of the paths traverse 3-hop relationships, and seven paths involve a 2-hop relationship via nodes such as “csync.s.” The shortest number of hops does not denote the expected path a user will traverse, but additional research could test the number of relationships to determine the most likely path. Websites such as “healthza” in the top-right location represent nodes that have multiple paths to the malicious domain via 2-hop and 4-hop relationships. The cypher query execution identified a malicious Website node if it contains an Indicator with a Threat relationship. The website a user visits is indicated as a node that contains a Refers relationship with one to four hops of relationship to the malicious Website node. For example, the node on the top-right denoted as “www.go” represents the website the user wants to visit, and the node on the lower left in the figure denoted as “px.powe” with a Threat relationship represents the malicious website. Some of the nodes are annotated in the figure with the number of hops away from the desired website. The domain names are not fully annotated to provide some anonymity to the website traffic.
Sampling. For the current study, the researchers were interested in associations with websites with known threat indicators. Existing threat indicators were selected from firewall software that blocked traffic to known malicious websites. All metadata for website traffic was captured, but the sampling for the study relied on malicious websites and derived relationships.
While the relationships of the nodes were analyzed within the Neo4j database, the researchers loaded the same data into MySQL for alternate analysis methods. The same custom Python script extracted the data from the source JSON files and inserted it into the Neo4j and MySQL databases simultaneously. The two databases have different technical strengths, so Neo4j was used to express relationships, while MySQL was relied on for the ability to combine and query objects.
4 Results
The researchers loaded the captured data into Neo4j to build the relationships, then queried the same data in MySQL to track the paths of users through related websites. The results of the tracked paths of users populated a secondary database table in MySQL to facilitate the final analysis. The researchers queried from the derived table to extract the data needed for the ROC analysis. Table 1 contains the results of the queried data, which was grouped by the number of hops the website was located from the known malicious site. In the table, False Positive (FP) represents all visitors to related websites who did not visit the final threat. False Negative (FN) represents direct visits to a threat that the selected model had no way of representing. True Positive (TP) represents all visitors to related websites who also visited the threat. True Negative (TN) represents all visits to websites that were not related to a known threat.
Table 1. Results of the queried data, grouped by the number of hops from the known malicious site.
Hop  False positive  False negative  True positive  True negative
4    2406            159             276            6257
3    1884            159             271            6257
2    901             159             241            6257
1    0               159             298            6257
0    474             159             225            6257
Note. The data represents the summation of the analysis of traffic visits and the association of websites to known threats. The hop column denotes the distance to a known threat with a hop of zero representing the threat. FP for 0-hop data represents direct visits to non-threat websites without any referring traffic.
Fawcett defined ROC analysis based on calculations from a confusion matrix [2], as depicted in Table 2. Each hop row of data provided input to a single confusion matrix, and the inclusion of all matrices represented the developed model for the current study. Each hop count analysis was evaluated independently, and no analysis was performed on all hop data as a summary of the model. Still, the researchers provided further analysis of the data in the discussion section.
Table 2. Confusion matrix [2].
                p                 n
Y               True positives    False positives
N               False negatives   True negatives
Column totals:  P                 N
A confusion matrix contains the four possible outcomes for any instance with the true class represented by the columns and the hypothetical class represented by the rows [2]. For each of the hops in Table 1, the researchers populated a confusion matrix and performed the ROC calculations. The results of the calculations are documented in Table 3. The calculations were performed based on common metrics suggested by Fawcett [2].
Table 3. Summary of ROC analysis.
Hop  FP rate  TP rate  Precision  Sensitivity  Accuracy  F-measure
4    0.375    0.635    0.103      0.635        0.719     0.177
3    0.294    0.630    0.126      0.630        0.762     0.210
2    0.140    0.603    0.211      0.603        0.860     0.313
1    0.000    0.652    1.000      0.652        0.976     0.789
0    0.074    0.586    0.321      0.586        0.911     0.415
Note. The data in the table represents the results of the calculations from the confusion matrices. The FP and TP are used to graph the data, while the remaining metrics provided the researchers with values to evaluate the certainty of the results.
The negative predictive value was not calculated for the data since the TN and FN values remain static across all hop distances. The values were static since they were based on direct visits to known threats and visits to websites not related to known threats. The value for all hop distances for the negative predictive values was 0.9752. The positive predictive value is denoted in Table 3 as the “Precision” value.
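The calculations can be retraced from the counts in Table 1. The following minimal sketch applies the common confusion-matrix formulas described by Fawcett [2]; the function and variable names are hypothetical. The TP rate, precision, F-measure, and negative predictive value computed this way agree with the reported values to within rounding. The FP rate below uses FP / (FP + TN), which for the 4-hop row gives roughly 0.278 rather than 0.375; the FP rate column of Table 3 appears consistent with a denominator of FN + TN instead, so that column is not reproduced by this sketch.

```python
# Rows of Table 1 as hop -> (FP, FN, TP, TN).
TABLE_1 = {
    4: (2406, 159, 276, 6257),
    3: (1884, 159, 271, 6257),
    2: (901, 159, 241, 6257),
    1: (0, 159, 298, 6257),
    0: (474, 159, 225, 6257),
}


def roc_metrics(fp, fn, tp, tn):
    """Compute common confusion-matrix metrics for one hop distance."""
    tp_rate = tp / (tp + fn)            # also called sensitivity or recall
    fp_rate = fp / (fp + tn)            # standard definition, see lead-in
    precision = tp / (tp + fp)          # positive predictive value
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f_measure = 2 * precision * tp_rate / (precision + tp_rate)
    npv = tn / (tn + fn)                # negative predictive value
    return {
        "tp_rate": tp_rate,
        "fp_rate": fp_rate,
        "precision": precision,
        "accuracy": accuracy,
        "f_measure": f_measure,
        "npv": npv,
    }


for hop, counts in TABLE_1.items():
    metrics = roc_metrics(*counts)
    print(hop, {name: round(value, 3) for name, value in metrics.items()})
```

Because FN and TN are identical in every row, the negative predictive value is the same for all hop distances, which matches the observation in the text.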
5 Discussion
Zweig and Campbell explained that the graph of ROC performance would plot a more accurate model as tending toward the upper-left corner [12]. The upper-left corner represents a perfect result with a 100% TP rate and a 0% FP rate. Fawcett further explained that “conservative” classifiers tend toward the left side of the graph, while more “liberal” classifiers tend toward the right side of the graph [2]. The difference between the two is that liberal classifiers predict more true positives at the cost of a higher willingness to accept increased false-positive rates [2]. The graph of the ROC performance for the current study is depicted in Fig. 3. The dotted line represents x = y and denotes a model whereby
Fig. 3. ROC performance graph based on distance from a threat
the true-positive and false-positive rates are equal and, therefore, cannot accurately distinguish between true and false-positive values. Predictive results can only provide a useful model by occurring above the dotted line. The ROC performance of the 1-hop value does not conform to the pattern of the other hops and is due to the nature of the data collection. The website relationships were generated based on referring links from a known threat and subsequent referring links from derived relationships. Since the first hop from known threats was always visited links, no data could be derived from links to an unknown set. The selected model of predicting risk to websites based on association to known threats can still provide value based on the hops of further distances. The issue of inconclusive results for 1-hop values was addressed in the future work section. The FP rate increased as the number of hops increased, which suggests that the further the relationships are from the original website, the less likely the model can predict threat. The conclusion appears to be valid since the further away a website is from the threat, the more possible paths exist; therefore, the more likely a user will follow an alternate path. The conclusion will not be valid for all websites since some content may provide a higher appeal, which could attract users at an increased rate compared to other content. When considering all websites as a unit, the likelihood of predicting a path to a malicious website appeared to be reduced the further away a relationship was evaluated. More specifically, the false-positive rate increased with the distance and may present an unacceptable model for organizations and the users.
The model did not reveal substantial differences in the TP rate as the hop count relationships increased. The FP rate did degrade as the hop count relationship increased; therefore, the data suggests that sacrificing FP rates did not yield a substantial increase in predictive value. Specifically, the FP rate more than doubled from 2-hop to 3-hop relationships. The increased rate of blocked sites could increase user annoyance without providing a significant reduction in threat. Each organization would need to evaluate the user experience compared to threat reduction to determine the proper balance. The hypothesis for the current study was that the relationship based on referrer links and the number of hops to a malicious site could indicate the risk to a website. The null hypothesis for the current study was that the risk of visiting a website could not be indicated by the relationship to known threats. Since the analysis based on the hop distance to a known threat produced true-positive rates that exceeded the corresponding false-positive rates, and therefore performed better than a random guess, the null hypothesis was rejected, and the hypothesis was accepted.
6 Conclusions
While true-positive rates in the study between 0.5859 and 0.6345 may not suggest an outstanding performance, the goal of the study was to develop a model for enhancing existing capabilities. Cybersecurity products provide indicators to threats, but actions on the indicators are executed singularly without evaluating the relationships. Improving the performance of existing indicators with threat levels could provide users with additional information about the potential risk during browsing without necessitating a block of the network activity. The model in the current research leveraged the existing indicators to extract further value by building a website relationship association in a graph database.
Wen, Zhao, and Yan presented research similar to that of Gyöngyi et al., except that the researchers analyzed the content of pages to determine whether the website should be classified as malicious [5]. The current research model was developed to increase the speed of analysis by determining risk before a user visited a website. The analysis of relationship risk could improve website browsing response by not requiring a review of the website’s content. Determining risk based on the content also presents a potential false assurance if the website provides a high rate of dynamic content or loading of third-party JavaScript.
This study is complementary to previous research. The researchers were not attempting to replace common cybersecurity tools but creating a model that could extract further benefits from the existing methods of identifying malicious websites. Chiba et al. were able to demonstrate the attributes within the experimental use of their feature extraction model [1]. The researchers demonstrated the usage of a ROC curve analysis and stated the need to develop a tool to detect malicious websites. The current research proposed the implementation of a website risk association model to discover malicious websites.
Bruijn and Janssen identified four main topics in creating cybersecurity policies and identified the need for creating policies to educate users on website safety [7]. The
implementation of the model in the current research could alleviate the need to educate the users. Assessing the risk of websites based on the relationships to known malicious websites could provide a technical implementation of the knowledge transfer. The scaled risk model could also provide a notification to users for moderate-risk websites, which would provide instant awareness to the users.
7 Future Work
The association and distance to a known threat could be developed into a method for establishing the risk to non-malicious websites. Researchers could perform a future study to produce a numerical scale that corresponds to the risk to a website based on the number of relationships to known threats. As in the definition of an oracle function by Gyöngyi et al. [4], a website one hop from a malicious website may present a greater risk than a website two hops from a known risk. The calculations would need to assign values to the distance from a risk to determine if, for example, a website with two relationships to a known threat at a distance of two hops is worse than a website with one hop from a known threat. Once researchers can describe the calculation of risk for websites based on relationships, a scale could be developed to quantify the threat. The scale could limit the results to a known set such as between 1 and 10, and thresholds could be developed for ranges. If the evaluation of the risk model were implemented in a product, an organization could set the threshold for various ranges with actions for each range. For example, threat values below five would not produce any notification to users, but values between five and seven would inject a warning notification that a user was browsing near known threats. Finally, values above seven would trigger a block of traffic since the risk of a user visiting a malicious site was very likely to occur. Each organization could evaluate the scale to determine the acceptable risk based on business needs.
As stated in the discussion section, the method of data collection produced skewed results for 1-hop relationships. To avoid the problem and test the model again, researchers could build the relationships of websites based on a web crawl. The web crawl would process web page content, identify links, and build relationships based on the links versus the selected method of building relationships based on user web traffic.
The links would then be followed to build the relationships of the sites to be studied. This method would require collecting web traffic and sampling only the website visits, which corresponded to domains within existing relationships.
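The example thresholds proposed for the risk scale can be written down as a small policy function. This is a minimal sketch under stated assumptions: the score is assumed to lie on the suggested 1 to 10 scale, the boundary values five and seven are assumed to fall in the warning range, and the function name, parameter names, and action labels are hypothetical. How the score itself would be computed remains open, as noted above.

```python
def action_for(score, warn_at=5, block_above=7):
    """Map a risk score on a 1-10 scale to an action.

    Below `warn_at`: no notification. From `warn_at` up to and including
    `block_above`: warning. Above `block_above`: block the traffic.
    """
    if not 1 <= score <= 10:
        raise ValueError("score must be between 1 and 10")
    if score > block_above:
        return "block"
    if score >= warn_at:
        return "warn"
    return "none"


for score in (3, 5, 7, 8):
    print(score, action_for(score))
```

Keeping the thresholds as parameters reflects the point that each organization could tune the scale to its own acceptable risk.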
References
1. Chiba, D., Tobe, K., Mori, T., Goto, S.: Detecting malicious websites by learning IP address features. In: 2012 IEEE/IPSJ 12th International Symposium on Applications and the Internet, pp. 29–39. IEEE, Izmir (2012)
2. Fawcett, T.: An introduction to ROC analysis. Pattern Recogn. Lett. 27(8), 861–874 (2006)
3. Gillani, F., Al-Shaer, E., AsSadhan, B.: Economic metric to improve spam detectors. J. Netw. Comput. Appl. 65(C), 131–143 (2016)
4. Gyöngyi, Z., Garcia-Molina, H., Pedersen, J.: Combating web spam with trustrank. In: Proceedings of the Thirtieth International Conference on Very Large Data Bases, Endowment, Toronto, vol. 30, pp. 576–587. VLDB (2004)
5. Wen, S., Zhao, Z., Yan, H.: Detecting malicious websites in depth through analyzing topics and web-pages. In: Proceedings of the 2nd International Conference on Cryptography, Security and Privacy - ICCSP 2018, pp. 128–133. ACM, New York (2018)
6. Rawal, B., Liang, S., Loukili, A., Duan, Q.: Anticipatory cyber security research: An ultimate technique for the first-move advantage. TEM J. 5(1), 3–14 (2016)
7. de Bruijn, H., Janssen, M.: Building cybersecurity awareness: the need for evidence-based framing strategies. Gov. Inf. Q. 34(1), 1–7 (2017)
8. Schwarz, J., Morris, H.: Augmenting web pages and search results to support credibility assessment. In: Proceedings of the 2011 Annual Conference on Human Factors in Computing Systems - CHI 2011, pp. 1245–154. ACM, New York (2011)
9. Kulp, P.: (Doctoral dissertation). Active cyber defense: A case study on responses to cyberattacks. Retrieved from ProQuest dissertations and theses database (UMI No. 13886134)
10. Http-sniffer. https://github.com/caesar0301/http-sniffer. Accessed 19 Aug 2019
11. Neomodel. https://neomodel.readthedocs.io/en/latest. Accessed 02 Nov 2019
12. Zweig, M., Campbell, G.: Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clin. Chem. 39(4), 561–577 (1993)
FLIE: Form Labeling for Information Extraction
Ela Pustulka, Thomas Hanne, Phillip Gachnang, and Pasquale Biafora
Institute for Information Systems, School of Business, FHNW University of Applied Sciences and Arts Northwestern Switzerland, Riggenbachstrasse 16, 4600 Olten, Switzerland
{elzbieta.pustulka,thomas.hanne}@fhnw.ch
http://www.fhnw.ch/en/elzbieta-pustulka, http://www.fhnw.ch/en/thomas-hanne
Abstract. Information extraction (IE) from forms remains an unsolved problem, with some exceptions, like bills. Forms are complex and the templates are often unstable, due to the injection of advertising, extra conditions, or document merging. Our scenario deals with insurance forms used by brokers in Switzerland. Here, each combination of insurer, insurance type and language results in a new document layout, leading to a few hundred document types. To help brokers extract data from policies, we developed a new labeling method, called FLIE (form labeling for information extraction). FLIE first assigns a document to a cluster, grouping by language, insurer, and insurance type. It then labels the layout. To produce training data, the user annotates a sample document by hand, adding attribute names, i.e. provides a mapping. FLIE applies machine learning to propagate the mapping and extracts information. Our results are based on 24 Swiss policies in German: UVG (mandatory accident insurance), KTG (sick pay insurance), and UVGZ (optional accident insurance). Our solution has an accuracy of around 84–89%. It is currently being extended to other policy types and languages.
Keywords: Artificial intelligence systems · Information extraction · Schema matching · Feature engineering · Insurance policy · German document · Switzerland
1 Introduction
The insurance business in Switzerland is dominated by a small number of players who offer standardized policies, some of which are required by law. Insurers also create new policy types, such as cyber risk insurance, in response to newly arising threats. There is minimal regulation which makes sure that insurers and brokers are licensed. Policies are issued by an insurer separately for each risk, or as a bundle, and there is no ruling on the layout and form of policies. The industry has so far agreed on a small number of core attributes that can be shared [1]
© Springer Nature Switzerland AG 2021. K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 550–567, 2021. https://doi.org/10.1007/978-3-030-63089-8_35
FLIE: Information Extraction
but there is no published schema for information exchange. Most information is passed around via email, as free text or pdf. In this setting, we envisage a novel service for data exchange where policies are submitted to a web service which extracts structured information, using a given schema. The service can be used by brokers to compare the existing policies and new offers, and to optimize the policy mix for their customers. Having more than 15 insurers, and more than 15 types of insurance in four languages prohibits the use of known solutions which handcraft a data extraction template for each insurer/insurance type/language combination. Constructing such a template using currently known methods is very time consuming, as the elapsed time from annotation start to solution release may be around 8 days, which does not scale. To deal with data complexity, we aimed at a method which is simpler, more generic, more reliable, and has a shorter time to market. In the long term, we aim to deliver a solution for all types of insurance offered in Switzerland. Here, however, we focus on the following types: UVG (accident insurance, Bundesgesetz über die Unfallversicherung, 20.03.1981), KTG (sick pay insurance, Kollektiv-Krankentaggeldversicherung), and UVGZ (optional accident insurance). In parallel to this work, we are preparing data models for vehicle insurance, building insurance, and company liability insurance. Our English policies are predominantly liability insurance and they provide the data to illustrate our method. Our project started with business process modeling, leading to the understanding of the business scenario we are supporting, and was followed with data acquisition and anonymization [17]. We sourced some 20’000 policies and documents of various types in four languages. Data modeling was done partly with the help of student projects in collaboration with our business partner.
The research work focused on data clustering, natural language processing and machine learning and has not considered rule based approaches yet. Our contributions are as follows. First, we present FLIE, a novel annotation scheme which shortens the manual annotation process for the learning set by about 50%, as compared to the current solution. FLIE annotates a document page with geometry and a human uses this to label a small data sample to provide a training set. Second, we demonstrate the use of machine learning (ML) to achieve high quality of annotation propagation based on just 24 labeled documents. Third, we explain the data extraction process and outline the outstanding work in schema mapping propagation. This paper introduces related work in Sect. 2. In Sect. 3 we focus on data and methods, including the presentation of FLIE. In Sect. 4 we present our results and in Sect. 5 we discuss them and conclude.
2 Related Work
2.1 Digitalization in Insurance
New technological developments in insurance come under the name of InsurTech [13] and include many new trends such as Blockchain, smart contracts, AI, IOT,
E. Pustulka et al.
FinTech, and robotic advice. New technologies are changing the way insurance is sold, priced and controlled, and lead to new business models which require changes to the legal framework [23]. In Switzerland, FINMA [2] supervises the financial market including insurance. All products and insurers have to gain authorization. New technological solutions include new products for data extraction and brokerage from players like Chisel AI [11]. In the absence of further technical details, one cannot however say if the adoption of tools such as those will just outsource the work to Chisel AI or construct a flexible solution that is future proof and extensible. This is the reason why we are developing a new information extraction method applicable to various industries which use form-based pdf data exchange at present.
Fig. 1. Example: a UVG extract (in German).
2.2 Information Extraction
Cowie and Lehnert [8] have brought the topic of information extraction (IE) to everyone’s attention. In their definition of IE, tasks are well defined, use real-world text, pose difficult and interesting natural language processing (NLP) problems, and IE should perform like humans (90% accuracy). Sarawagi surveys the field [20] and classifies IE according to the following dimensions, where in brackets we show how our task fits in her categories: type of structure extracted (forms), type of unstructured source (documents, templatized), type of input resources available for extraction (labeled unstructured data), method (statistical, trained), and output (a database). Nasar et al. [16] show a newer perspective, and focus on the following methodologies: rule-based approaches, Hidden
Markov Models, Conditional Random Fields (CRF), Support Vector Machines, Naive-Bayes classification and Deep Learning. As insurance policies are legal documents, one could use LexNLP [7] which supports the linguistic analysis of legal documents, but falls short of the requirement to process the forms we are analyzing and would not be easy to extend to the languages we are dealing with. IE from tables has been discussed by Gatterbauer et al. [10], and applied to web pages. Their results are not directly applicable, as they focus on completely aligned tables, whereas our forms are only weakly aligned, see Fig. 1. Adelfio and Samet [3] focus on schema extraction from web tables where explicit structural information is absent. They use CRF and consider the visual attributes of table cells as input for the classifier. Their work is similar to ours, as our output is also a schema with extracted values, but they use highly regular tables, and not weakly aligned tables like ours. Their work is typical of the area of IE from tables, with solutions for regular tables coming from Google [5] and others. Sinha et al. extract tables from drawings and then extract information from those tables [22]. Their approach is specific to the area of engineering drawing and is not easily generalizable. Insurance policies we are working with have a higher complexity and are less regular, as shown in the following sections. That is why machine learning has the potential to solve a more general question in an adaptive fashion.
3 Data and Methods
3.1 FLIE
Figure 2 presents an overview of the data analysis flow. We assume that a pdf document has been processed, for instance with pdfminer.six [21], and is available as a CSV consisting of bounding boxes containing text, and coordinates for each box. In step 0, the document is assigned to a cluster which groups together the policies from the same insurer, insurance type and language. In steps 1–2, the geometry of each page is analyzed and boxes are grouped into horizontal groups and assigned a column label within a group (place), see Fig. 3 in Sect. 3.4. Assuming the user has annotated a similar document before, in steps 3–4 training data in the database (DB) are used by a machine learning algorithm to assign two further labels: the metadata/value label, which tells us whether a box holds metadata or a value, and, if it is a value, the mapping label (an attribute name previously provided by a user). In step 5, attribute names are used to produce a data extract where each attribute is mapped to a value.

3.2 Data Acquisition and Anonymization
Insurance policies in pdf format were acquired from insurance brokers and data was extracted using pdfminer.six [21] which extracts text, layout and font. The resulting CSV data were anonymized using lists of names and Swiss addresses, see [17]. The brokers checked the anonymization quality and possibly amended some entries before releasing data, to comply with the law. Each bounding box
E. Pustulka et al.
Fig. 2. FLIE overview. Step 0: document cluster assignment. Steps 1–2 geometry: group and place assignment for each box. Step 3: a bounding box is annotated as metadata (meta) or value (val). Step 4: a mapping is assigned (attribute name). Step 5: information extraction.
on a page is a row in the CSV, with geometric coordinates on a page and text. A pdf page is laid out using Cartesian coordinates (x, y) with origin (0, 0) at the bottom left corner and x and y dimensions reflecting the dpi (dots per inch) paper dimensions. In our data set we have seen x ranging from −48 to 953 and y from −17 to 1200, possibly due to page rotation or foreign paper sizes. The columns in the table are: BoxID (ID for each bounding box), FileID, page number, BoxType (textual or non-text), the coordinates (left, bottom, right, top), and the text contained in the bounding box. A data sample is shown in Table 1.
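As a minimal sketch of how such rows might be loaded, the Python fragment below parses a CSV with the columns listed above into typed records. The header names (BoxID, FileID, Page, BoxType, L, B, R, T, Text), the `BoundingBox` class and the `load_boxes` function are illustrative assumptions, since the exact CSV header is not given here; the IDs in the sample are made up, while the coordinates and text come from Table 1.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class BoundingBox:
    box_id: int
    file_id: int
    page: int
    box_type: str   # textual or non-text
    left: float
    bottom: float
    right: float
    top: float
    text: str


def load_boxes(csv_file):
    """Read bounding boxes from a CSV file object (hypothetical header names)."""
    reader = csv.DictReader(csv_file)
    return [
        BoundingBox(
            box_id=int(row["BoxID"]),
            file_id=int(row["FileID"]),
            page=int(row["Page"]),
            box_type=row["BoxType"],
            left=float(row["L"]),
            bottom=float(row["B"]),
            right=float(row["R"]),
            top=float(row["T"]),
            text=row["Text"],
        )
        for row in reader
    ]


# Two rows with coordinates and text from Table 1; IDs, page and type are invented.
SAMPLE = """BoxID,FileID,Page,BoxType,L,B,R,T,Text
1,7,1,text,192.0,615.1,254.5,626.6,Policy Number
2,7,1,text,279.3,615.0,324.1,626.1,PNUMBER
"""

boxes = load_boxes(io.StringIO(SAMPLE))
```

Keeping the coordinates as floats preserves the sub-point differences (for example 615.1 versus 615.0) that the later alignment tolerance has to absorb.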
3.3 Identifying Insurer, Insurance Type and Document Type
After acquiring data, we explored it to get an idea of data composition. We histogrammed text lengths per file and performed SQL LIKE queries looking for known insurer names and insurance types. Language assignment was performed with detectlang [9] which uses a non-deterministic naive Bayesian algorithm [15] with character n-grams. We handcrafted SQL queries to select policies of two types: UVG (accident insurance) and KTG (sick pay insurance). To stratify the data sets by insurer and document type, we used clustering [18] for the two policy types separately. We preprocessed the text by removing digits and lowercasing. The initial clustering used TFIDF [12] (using sklearn [4]) and excluded terms
Table 1. Sample English Data, Very Similar to a Policy in German. Capitalized Text Results from Anonymization (ONAME: Organization, FNAME: First Name, LNAME: Last Name, PNUMBER: Policy Number, ZIP: Year or Swiss Postcode). L Stands for Left, B for Bottom, R for Right and T for Top.

L      B      R      T      Text
192.2  735.3  524.9  751.9  Chubb Commercial Excess And Umbrella Insurance
192.0  696.9  267.6  713.0  ONAME
446.4  663.8  527.2  674.9  JANUARY 01, ZIP
192.0  663.1  247.9  674.6  Policy Period
429.8  662.9  441.1  674.4  FNAME LNAME
279.3  663.0  360.2  674.1  JANUARY 01, ZIP
192.0  639.1  251.6  650.6  Effective Date
279.3  639.0  340.5  650.1  January 1, ZIP
192.0  615.1  254.5  626.6  Policy Number
279.3  615.0  324.1  626.1  PNUMBER
that appear in fewer than 5 documents or in more than 95% of documents. Using first the elbow and then the silhouette method [19], we decided on the number of clusters, K, to be used in K-Means clustering [18]. We inspected the 20 most frequent words of each cluster by eye to identify insurer names and insurance types. To confirm the clustering, we used an alternative method. Preprocessing involved stemming and stop word removal. We calculated the word overlap similarity of the documents, using a dictionary $D_i$ for each document, with $\mathrm{overlapsim}(D_1, D_2) = |D_1 \cap D_2| / \min(|D_1|, |D_2|)$. In each cluster we calculated the cluster mean. Section 4.2 discusses the clustering results.
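The overlap similarity defined above can be sketched in a few lines of Python. The function names are hypothetical, and taking the cluster mean over all unordered document pairs is an assumption, as the text only states that a cluster mean was calculated. The example terms are invented.

```python
from itertools import combinations


def overlap_similarity(doc1_terms, doc2_terms):
    """|D1 ∩ D2| / min(|D1|, |D2|) over the sets of distinct terms."""
    d1, d2 = set(doc1_terms), set(doc2_terms)
    if not d1 or not d2:
        return 0.0  # assumption: an empty dictionary shares nothing
    return len(d1 & d2) / min(len(d1), len(d2))


def mean_cluster_similarity(docs):
    """Mean pairwise overlap similarity in a cluster (assumed definition)."""
    pairs = list(combinations(docs, 2))
    if not pairs:
        return 0.0  # assumption: undefined for fewer than two documents
    return sum(overlap_similarity(a, b) for a, b in pairs) / len(pairs)


# Invented, already stemmed term sets for three documents of one cluster.
doc_a = {"police", "uvg", "praemie", "lohn"}
doc_b = {"police", "uvg", "ktg"}
doc_c = {"brief", "uvg"}

sim_ab = overlap_similarity(doc_a, doc_b)
cluster_mean = mean_cluster_similarity([doc_a, doc_b, doc_c])
```

Dividing by the smaller dictionary means that a short document fully contained in a long one scores 1, which suits clusters that mix short letters with long policies.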
3.4 FLIE Layout Encoding
Given the coordinates of a bounding box, FLIE automatically encodes its position on a page as belonging to a group and column, see Fig. 3. This encoding is a simplified representation of the geometry that appears to be sufficient for information extraction. The data gets augmented with two new columns group and place which encode the horizontal and vertical arrangement of bounding boxes and are used for HTML generation, and as features in machine learning (ML). Each horizontal run of boxes is a group. Each group is further subdivided into columns which are labeled from left to right. This part of annotation is automated, which we explain further on. At a later stage, during the training set preparation, an expert labels each bounding box to say if a box contains metadata or data, and in the case of data,
Fig. 3. Page layout encoding. Groups are ordered top to bottom and columns left to right. Abbreviations for Place: L left, M middle, R right, and multiple columns in the center are Ci.
provides the attribute name. After the automated labeling and manual labeling, the data has four additional columns: group, place, type, attribute. The column type tells us if the box contains metadata or a value (meta/val) and the column attribute holds the attribute name to be used in data extraction. Figure 4 shows the algorithm for group generation which produces horizontal groups. It uses anonymized CSV data stored in a table called BoundingBox indexed by FileID and page. It first selects a page to process (line 0). As a pdf page uses Cartesian coordinates with (0, 0) in the bottom left corner, and the output is not sorted to start with, we first sort the boxes top to bottom (line 1). After initialization (lines 2–4), we calculate the top and bottom of a new group in lines 5–6 (and do the same for every subsequent group in lines 14–15). Then the list is traversed (line 7) and boxes are assigned to groups one by one. The algorithm tolerates a small error eps, currently set to 3.9, as boxes that look horizontally aligned to the eye are not always aligned in absolute terms. We add a box to a group if one of the following conditions holds: alignment at the top or at the bottom (line 8), or the box is contained within the group (line 9). Otherwise, we start a new group. In theory, we could expect a fourth option, a partial box overlap, but we saw that considering this condition led to incorrect layouts; such overlaps might be a side effect of using OCR (optical character recognition) and pdfminer. Figure 5 shows the next algorithmic step where each group is ordered left to right by using the left coordinate of the bounding box. After executing Place Generator, each box is labeled with a place label P. The data previously shown
Fig. 4. Algorithm Group Generator. Columns L, B, R, T as in Table 1. Group information is written in column (attribute) group as box.group.
Fig. 5. Algorithm Place Generator assigns the column label (place: P). A new attribute P is filled with values corresponding to placement: l, m, r, and cn (cn is used for the middle columns if there are more than three columns).
in Table 1 is now extended with two new columns, shown in Table 2. Note that group 2 now has 4 columns (l, c2, c3, r). Columns type and attribute are filled later, see the following sections.

Table 2. Sample Data after Labeling with Group g, Place p, Type and Attribute.

Text                                            G  P   TYPE  ATTRIBUTE
Chubb Commercial Excess And Umbrella Insurance  0  l   Val   Insurer Insurance type
ONAME                                           1  l   Meta
Policy Period                                   2  l   Meta
JANUARY 01, ZIP                                 2  c2  val   Start
FNAME LNAME                                     2  c3  Meta
JANUARY 01, ZIP                                 2  r   val   End
Effective Date                                  3  l   Meta
January 1, ZIP                                  3  4   val   Effective date
Policy Number                                   4  l   Meta
PNUMBER                                         4  r   val   Policy no
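The two labeling steps of this section can be sketched in Python as follows. This is a reconstruction from the prose only, not the algorithms of Figs. 4 and 5: the names `Box`, `assign_groups` and `assign_places` are hypothetical, keeping a group's top and bottom fixed at the values of its first box is an assumption, and the label scheme (l, m, r, and cn for inner columns when there are more than three) is inferred from the caption of Fig. 5 and the group 2 row of Table 2.

```python
from dataclasses import dataclass

EPS = 3.9  # alignment tolerance quoted in the text


@dataclass
class Box:
    left: float
    bottom: float
    right: float
    top: float
    text: str
    group: int = -1
    place: str = ""


def assign_groups(boxes, eps=EPS):
    """Number horizontal groups from the top of the page downwards."""
    ordered = sorted(boxes, key=lambda b: b.top, reverse=True)
    group = -1
    g_top = g_bottom = 0.0
    for box in ordered:
        aligned = group >= 0 and (
            abs(box.top - g_top) <= eps or abs(box.bottom - g_bottom) <= eps
        )
        contained = group >= 0 and box.top <= g_top and box.bottom >= g_bottom
        if not (aligned or contained):
            group += 1                      # start a new group
            g_top, g_bottom = box.top, box.bottom
        box.group = group
    return ordered


def assign_places(boxes):
    """Label the boxes of each group from left to right."""
    groups = {}
    for box in boxes:
        groups.setdefault(box.group, []).append(box)
    for members in groups.values():
        members.sort(key=lambda b: b.left)
        count = len(members)
        for index, box in enumerate(members):
            if index == 0:
                box.place = "l"
            elif index == count - 1:
                box.place = "r"
            elif count == 3:
                box.place = "m"
            else:
                box.place = f"c{index + 1}"
    return boxes


# Coordinates (L, B, R, T) and text of the rows of Table 1.
ROWS = [
    (192.2, 735.3, 524.9, 751.9, "Chubb Commercial Excess And Umbrella Insurance"),
    (192.0, 696.9, 267.6, 713.0, "ONAME"),
    (446.4, 663.8, 527.2, 674.9, "JANUARY 01, ZIP"),
    (192.0, 663.1, 247.9, 674.6, "Policy Period"),
    (429.8, 662.9, 441.1, 674.4, "FNAME LNAME"),
    (279.3, 663.0, 360.2, 674.1, "JANUARY 01, ZIP"),
    (192.0, 639.1, 251.6, 650.6, "Effective Date"),
    (279.3, 639.0, 340.5, 650.1, "January 1, ZIP"),
    (192.0, 615.1, 254.5, 626.6, "Policy Number"),
    (279.3, 615.0, 324.1, 626.1, "PNUMBER"),
]

boxes = assign_places(assign_groups([Box(*row) for row in ROWS]))
```

On these ten rows the sketch yields five groups numbered 0 to 4, and the four boxes of group 2 receive the labels l, c2, c3 and r in left-to-right order.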
3.5 Manual Data Annotation
We selected 24 policies for annotation: 11 UVG, 11 KTG, one UVGZ (supplementary accident policy offering extra benefits) and one which was a 3-in-1 including all three types. Document choice was based on the clusters we calculated beforehand, and on SQL queries for the policy type and date, aiming for one recent policy of each type per insurer. The FLIE algorithm was used to assign group and place labels to each bounding box. As we did not have the original pdf files, for each file we produced an HTML representation to help the annotator visualize the data. Based on existing knowledge, we generated data models for the UVG, UVGZ and KTG policies by manual data inspection and agreed with our business partner on the attribute names to use. We then annotated by entering for each box a label meta/val and if it was a value, an attribute name, using a spreadsheet, resulting in data shown in Table 2. This was then checked and corrected. Annotation took about half a day per document as we were acquiring complex domain knowledge at the same time and some policies were very long (over 10 pages). Documents were annotated by researchers and Master students with no previous knowledge of insurance, based on examples provided by the company where one analyst is an expert responsible for the development of the new application.
3.6 Feature Selection
As features for meta/value annotation using ML we used the following data items: page number, top position (T) of a bounding box, left position (L) of a bounding box, box placement P (column), TFIDF of the text field, TFIDF(text, ngram = (1, 2)), and length(text) in characters. The placement (column on the page) is first encoded on a scale from 0 to 1 for each group, according to the column count in a group. For mapping assignment using ML, we used the following features: group text (text of the entire group enclosing the box), previous group text, placement (encoded as above), and insurance type (UVG, KTG, UVGZ, or mixed).
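One plausible reading of the placement encoding is sketched below. The exact formula is not stated, so dividing the zero-based column index by the column count minus one is an assumption, as is the hypothetical function name `encode_place`.

```python
def encode_place(column_index, column_count):
    """Map a zero-based column index within a group to the range 0 to 1."""
    if column_count < 1 or not 0 <= column_index < column_count:
        raise ValueError("column_index must lie within the group")
    if column_count == 1:
        return 0.0  # assumption: a lone box counts as leftmost
    return column_index / (column_count - 1)


# Group 2 of Table 2 has four columns: l, c2, c3, r.
encoded = [encode_place(i, 4) for i in range(4)]
```

With this choice the leftmost box of any group maps to 0 and the rightmost to 1, so groups with different column counts become comparable as a single numeric feature.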
3.7 Testing Meta/Val Label Assignment
We tested six ML classifiers in 10-fold cross-validation [18]: SVM (support vector classifier with a linear kernel, with C = 1), MNB (multinomial naive Bayes), KNN (k-nearest neighbors), decision tree, LR (logistic regression), and random forest (n_jobs = 2), with default sklearn settings [4]. We report on the outcome in the results section.

3.8 Attribute Name Propagation
We tested two methods: KNN (k-nearest neighbors) and the Radius Neighbor Classifier (RNC) [6]. We used a range of k values (number of neighbors) for KNN and several radius values for the RNC.

3.9 Mapping Extraction
We extract the boxes where type is annotated as value and look up the attribute name in column attribute and text in text. Where we find separators (white space, comma) between the attribute names in the attribute column, we also split the text on the separator. Optionally, the currency symbol CHF may be removed.
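The extraction step just described could look roughly like the sketch below. The row layout (dictionaries with the keys text, type and attribute), the function names and the attribute names in the sample are hypothetical; the sketch also assumes that attribute names contain no separator characters, and it borrows the DUMMY placeholder mentioned in Sect. 4.6 for values without an attribute name.

```python
import re

SEPARATORS = re.compile(r"[,\s]+")  # white space or comma
CURRENCY = re.compile(r"\bCHF\b")


def split_on_separators(value):
    return [part for part in SEPARATORS.split(value) if part]


def extract_mappings(rows, strip_currency=True):
    """Return (attribute, value) pairs for all boxes annotated as values."""
    pairs = []
    for row in rows:
        if row["type"].lower() != "val":
            continue  # metadata boxes are not exported
        text = row["text"]
        if strip_currency:
            text = CURRENCY.sub(" ", text)
        text = " ".join(text.split())
        names = split_on_separators(row["attribute"]) or ["DUMMY"]
        values = split_on_separators(text)
        if len(names) > 1 and len(names) == len(values):
            # several attributes in one box: split the text the same way
            pairs.extend(zip(names, values))
        else:
            pairs.append((" ".join(names), text))
    return pairs


ROWS = [
    {"text": "Policy Number", "type": "Meta", "attribute": ""},
    {"text": "PNUMBER", "type": "val", "attribute": "Policy_no"},
    {"text": "January 1, ZIP", "type": "val", "attribute": "Effective_date"},
    {"text": "CHF 148200 CHF 300000", "type": "val",
     "attribute": "Min_salary, Max_salary"},
    {"text": "730 Tage", "type": "val", "attribute": ""},
]

pairs = extract_mappings(ROWS)
```

Splitting the text only when the number of parts matches the number of attribute names keeps single values such as dates intact even though they contain commas or spaces.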
4 Results
4.1 Data Exploration
The data set consists of over 22’000 files and is held in a database of around 8 GB, including indexes on FileID, page and BoxID. We have separately created dictionaries for all files, to support overlap similarity calculations, as those are time consuming (2 h for just over 1’000 UVG policies). 97% of the documents are in German, with over 250 in English, over 200 French, some 50 Italian and 150
Fig. 6. UVG document lengths. The length in words is on the x axis.
not assigned. We see many OCR errors and multiple languages present in one document, such as a policy in three languages, with the first part in German, then a part in French, and the same in Italian, or a mixture of languages on one page. A more detailed analysis of the policies classified as English showed document parts in Arabic. A global analysis indicates that the data are representative of the market in terms of insurance types and insurers, with the most important insurers present, but also some companies we did not expect to see. We looked more closely at the UVG and KTG policies. We identified 1’414 UVG-related documents, some of which turned out to be not policies but emails, additional information, terms of business and insurance quotations. This was reflected in document lengths varying from 137 to 10’760 characters (i.e. up to circa 4000 words), shown in Fig. 6.

4.2 Clustering
Figure 7 shows the silhouette scores (k-means clustering) for the UVG documents with the number of clusters k = 21, chosen after performing elbow and silhouette comparisons for various cluster sizes. The clusters are ordered on the y axis by cluster number and colored. The x axis shows the silhouette score of each cluster. We see 20 homogeneous clusters and one mixture (cluster 19), with the lowest silhouette score of all (negative values). This cluster also has the lowest overlap similarity score of 0.34, see entry 19 in Table 3 where it is shown in bold.
Similarly, for the KTGs we used k=21, based on the silhouette score. Looking at top 20 words in each cluster allowed us to characterize the UVG clusters as shown in Table 3. UVGZ is an extra insurance on top of UVG. UVG adjust stands for UVG premium adjustment which some documents contain. Cluster 19 is the one with a negative silhouette and the lowest mean overlap similarity (in bold).
Fig. 7. UVG silhouette scores for k = 21. The x axis shows the silhouette score and the y axis the cluster number. The cluster with negative scores (number 19 on the y axis) is heterogeneous. Mean silhouette score is 0.28.
We find that insurance policies can be reliably clustered using TFIDF with k-means, and that the clustering is confirmed by overlap similarity. In a web service scenario, an incoming document will be assigned automatically to a cluster. This will subsequently improve the quality of information extraction, as cluster assignment is used in ML-based label assignment. For a user, this is very relevant as well, as an automated assignment of document type will free a person submitting a batch of policies from having to enter each insurer, policy type, and language by hand, and will prevent erroneous data entry.

4.3 Manual Inspection and Annotation
Policy clustering for the UVG followed by visual inspection of top 20 terms, and visual inspection of the outliers, shows that pdfs sometimes combine several documents in one, for instance the policy, the quotation and the initial request
Table 3. UVG Document Clusters, with Mean Overlap Similarity. Cluster 19 Combines Various Insurers.

ID  Insurer            Doc Type        Mean sim
1   ElipsLife          UVG/KTG         0.8
2   CSS                UVG             0.8
3   Allianz/ElipsLife  UVG             0.6
4   Mobiliar           UVG             0.8
5   Axa                UVG/KTG         0.86
6   Axa                UVG             0.73
7   Axa                UVGZ            0.82
8   Helsana            UVG             0.64
9   Zurich             UVG             0.8
10  Allianz/Vaudoise   UVG             0.72
11  Visana             UVG             0.84
12  Allianz            UVG adjust      0.71
13  Axa                Letter          0.95
14  Axa                Letter/leaflet  0.79
15  Concordia          Letter
16  Vaudoise           UVG             0.55
17  Axa                UVG adjust      0.91 0.66
18  CSS                Letter          0.68
19  Various            UVG             0.34
20  Helsana            UVG adjust      0.72
21  Basler             UVG             0.6
for a quotation, or we see a bundle of several policies issued by one insurer, see Table 3. During manual policy annotation, we only annotated the policy part if it was a policy with additional documents, or, in the case of a policy bundle, we annotated all parts in the bundle. We saw multiple occurrences of the same policy number in a document, or a summary page at the start stating the main conditions, followed by the details, which leads to repetitions of attribute names but may prove to be useful. We obtained a list of possible mapping terms from our business partner: 146 for the UVG, 76 for the KTG and 494 for the UVGZ. This complexity made for a difficult annotation task which took a long time. During annotation we discovered that some of the data had no matching category, and we are in frequent contact with the business partner who guides us in extending the mapping lists. The policies contain bounding boxes which are sometimes composed of several values, reflecting a number of attributes. The column containing the corresponding metadata and the value column usually have carriage returns separating text which can be used to parse the text and match metadata to data. The metadata column is often to the left of the value column or at the top of the column. The 24 documents we annotated had 3694 bounding boxes, of which 561 boxes contained one or more values to be extracted (15%). Boxes contained between 1 and 5 values, with additional text also present, including the currency CHF. We counted 138 unique attributes, of which 65 were UVG, 60 KTG and 13 UVGZ. Some attributes are sufficiently well represented in the annotated data but some are only singletons, which shows the need for more annotation.

4.4 Label Propagation: Metadata or Values
We tested six ML algorithms, to find out which of those is best at propagating the annotation of meta/value, using 10-fold cross-validation. This was done first with 12 policies, see Table 4, with a maximum accuracy of 84% (in bold), and then with 24 policies, see Table 5, with the highest accuracy of 89% (in bold). The best results were achieved using random forest and the worst using logistic regression. We show the accuracy (Acc), the F1 measure, Matthews' coefficient [14], and in Table 5 also the confusion matrix (CM) and the time in seconds. The methods are SVM with C = 1, MNB (multinomial naive Bayes), KNN (k-nearest neighbors), tree (decision tree), LR (logistic regression) and RF (random forest). The confusion matrix we show makes it clear that the value class is not recognized properly. A perfect confusion matrix would be $\begin{pmatrix} 3694 & 0 \\ 0 & 561 \end{pmatrix}$. In terms of identifying the values, which are our target in data extraction, the tree performs the best with 336 values, and the SVM is second best (306 correct value assignments). Globally, top accuracy is achieved using random forest (89%), which is high, but it only finds 267 values correctly. This is reflected in the low values of Matthews' coefficient, of only circa 50%, which is poor and reflects the class imbalance typical of a policy document with only circa 15% of text boxes containing information that is to be extracted. The experiment times range from 1 s (MNB and KNN) to 1629 s (SVM) and the top performer in terms of overall accuracy, random forest, has an acceptable time of 12.6 s.

Table 4. Assignment of Metavalue/Value Label for 12 Policies. SVM is Support Vector Classifier, MNB Multinomial Naive Bayes, KNN is k-Nearest Neighbors, Tree is Decision Tree, LR is Logistic Regression and RF is Random Forest.

          SVM   MNB   KNN   Tree  LR    RF
Acc       0.82  0.82  0.82  0.83  0.80  0.84
F1        0.89  0.88  0.89  0.90  0.87  0.90
Matthews  0.45  0.43  0.43  0.51  0.42  0.45
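To illustrate why a high accuracy can coexist with a Matthews coefficient of about 0.5, the following sketch recomputes both measures from three of the confusion matrices reported for 24 policies in Table 5. Reading each matrix row-wise as (meta predicted as meta, meta predicted as value; value predicted as meta, value predicted as value) and treating value as the positive class are assumptions, chosen because they agree with the counts of correctly found values quoted in the text; the function names are hypothetical.

```python
import math


def accuracy(tn, fp, fn, tp):
    return (tp + tn) / (tn + fp + fn + tp)


def matthews(tn, fp, fn, tp):
    """Matthews correlation coefficient for a 2x2 confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# (tn, fp, fn, tp) with "value" as the positive class (assumed reading).
TABLE5_CM = {
    "SVM": (2943, 190, 255, 306),
    "Tree": (2893, 240, 225, 336),
    "RF": (3023, 110, 294, 267),
}

scores = {
    name: (round(accuracy(*cm), 2), round(matthews(*cm), 2))
    for name, cm in TABLE5_CM.items()
}
```

Under this reading the rounded results are 0.88 and 0.51 for the SVM, 0.87 and 0.52 for the tree, and 0.89 and 0.52 for the random forest. Accuracy is dominated by the large metadata class, whereas the Matthews coefficient also penalizes the several hundred missed values.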
Table 5. Assignment of Metavalue/Value Label using 24 Policies, with 3694 Rows of Data, of which 561 Contain Values and 3090 Metadata. The Tree Identifies the Largest Number of Values Correctly (336, in Bold).

          SVM       MNB       KNN       Tree      LR        RF
Acc       0.88      0.86      0.86      0.87      0.87      0.89
F1        0.93      0.92      0.92      0.92      0.93      0.94
Matthews  0.51      0.45      0.41      0.52      0.45      0.52
CM        2943 190  2896 237  2939 194  2893 240  2985 148  3023 110
          255 306   275 286   316 245   225 336   316 245   294 267
Time s.   1629      1.1       1.6       2.8       2.8       12.6

4.5 Mapping Attribute Names
Next, we examine the quality of mapping assignment where we predict the attribute name. As many attribute names are singletons at this early stage of our project, we could only predict the mapping for 18 attributes which had over 8 data points. We tested KNN and the Radius Neighbor Classifier (RNC). KNN was tested with k set to 3, 5, 7, 9, 11 and 13, and RNC with radius set to 1, 10, 100, 1000 and 10000. Table 6 shows the classification accuracy and time for the mapping test. We achieve the best results with the KNN classifier, with settings for k from 9 to 13. The radius neighbor classifier performs well with radius in the range 100 to 10’000. Time performance shows a range of 2 to 3 s. These results are only indicative and show that we need to annotate more data to achieve good coverage and prediction quality.

Table 6. Mapping Assignment Quality for 18 Attributes Assigned 8 Times or More. n in Kn is the Number of Neighbors in KNN and n in Rn Stands for Radius in RNC.

         K3    K5    K7    K9    K11   K13   R1    R100  R1000  R10000
Acc      0.86  0.86  0.86  0.87  0.87  0.87  0.39  0.85  0.85   0.85
Time s.  2.3   2.4   2.3   2.2   2.2   2.2   2.3   2.8   2.8    2.9

4.6 Information Extraction
Given labeled policies, one selects the data rows which contain a value label or an attribute label. Then, text and attribute labels are split on white space, currency symbols removed, and the result exported to a database or spreadsheet. Sample output is shown in Table 7.
Table 7. Extract: A German Mixed Policy (UVG, UVGZ and KTG), Pages 1–2.

Attribute                 Value (anonymized in capitals)
KTG-Policennummer         A.AAAA.AAA
UVG-Policennummer         A.AAAAAA.A
UVGZ-Policennummer        A.AAAAAA.AAA
KTG-Beginn                11.12.2013
KTG-Ablauf                31.12.2015
KTG-Zahlungsart           jährlich
KTG-Hauptverfall          01.01.
KTG-Versicherer           elipsLife
KTG-Branche               STREET : Krankentaggeldversicherung
KTG-Personengruppe        Betriebsinhaberin
KTG-Personengruppe        Personal
KTG-Prämiensatzgarantie   -
KTG-Deckung-%-des-Lohnes  80% vom massgeb. STREET
KTG-Wartefrist            ONAME 30 Tage
KTG-Leistungsdauer        730 Tage abzüglich WF je STREET
KTG-Prämiensatz           0.88%
KTG-Geburtentaggeld       Geburtengeld
KTG-Prämiensatz           0.03
Currently, as annotating is a bottleneck and the mappings are still in development, we use a DUMMY variable for an attribute whose name is unknown or could not be predicted because of poor data coverage. This allows us to output such data alongside the data which is labeled with an attribute label.
5 Discussion and Conclusion
FLIE is a new approach to information extraction from forms which is driven by the need for simplicity. It can be applied to policies from various insurers, of various types, and in various languages. It has the potential to create new products for the digital economy with less effort. We applied a business-driven and data-driven approach to product development. In preparation, we developed business models showing the future product scenario and gathered anonymized data. As can be seen in our sample data shown here, the anonymization we applied was too strong, as it was not context-sensitive but purely based on dictionaries and regular expressions (street dictionary, name dictionary and other heuristics). From a legal point of view, this was correct; however, it is inconvenient now. We used standard methods to explore and cluster the data, with a successful combination of TFIDF, overlap similarity and k-means clustering. Setting of k
= 21 was appropriate for the UVG and we discovered it was also suitable for the other insurance type, the KTG. This approach was fruitful and helped us select the data for annotation. Our main contribution is a simplified labeling scheme which reduces the complexity of annotation and supports the use of ML for information extraction. The time needed to annotate a policy is reduced, subjectively by about 50%. Instead of dealing with a pdf and clicking on fields, maintaining a thesaurus in a separate tool, etc., the annotator views an HTML representation and annotates in a spreadsheet, using a controlled vocabulary. We are considering some automation in this task, offering controlled vocabularies to choose from, i.e. a user interface for the annotator, to be used in various business contexts and not limited to policy documents. Currently, after the annotation of 24 policies, the annotator sees the predictions made by ML, and the majority vote, which speeds up the work. The solution to metadata/value propagation has an accuracy of 89%, which is not sufficient yet, as we do not find the values we want to extract reliably enough. We are testing various preprocessing steps which may reduce the class imbalance. We are also annotating a larger number of policies and exploring rule-based approaches in this area. The problem of mapping (allocation of the ATTRIBUTE label) is now in focus and we can try alternative approaches once we have enough annotated data. Future work includes working with other insurance types and further languages. The system needs to be tested and tuned with real data that has not been anonymized. We are implementing a GUI which will be used by the end user for data extraction and by an annotator. The annotation GUI will be supported by rule-based and ML techniques. Further options include the use of more advanced machine learning techniques.
Refinements to our methods are ongoing, including the use of optimization methods in feature engineering to deliver a better result overall.

Acknowledgments. We gratefully acknowledge funding from the Innosuisse, www.innosuisse.ch, grant no 34604.1 IP-ICT, and from the FHNW.
References
1. IG B2B for Insurers + Brokers (2020)
2. Swiss Financial Market Supervisory Authority FINMA (2020)
3. Adelfio, M.D., Samet, H.: Schema extraction for tabular data on the web. Proc. VLDB 6(6), 421–432 (2013)
4. Albon, C.: Machine Learning with Python Cookbook: Practical Solutions from Preprocessing to Deep Learning, 1st edn. O’Reilly, Sebastopol (2018)
5. Balakrishnan, S., Halevy, A.Y., Harb, B., Lee, H., Madhavan, J., Rostamizadeh, A., Shen, W., Wilder, K., Wu, F., Yu, C.: Applying web tables in practice. In: CIDR 2015 (2015)
6. Bentley, J.L.: A Survey of Techniques for Fixed Radius Near Neighbor Searching. Technical report, 8 (1975)
7. Bommarito II, M.J., Katz, D.M., Detterman, E.M.: LexNLP: natural language processing and information extraction for legal and regulatory texts. arXiv preprint arXiv:1806.03688 (2018)
8. Cowie, J.R., Lehnert, W.G.: Information extraction. Commun. ACM 39(1), 80–91 (1996)
9. Danilak, M.M.: Langdetect - Python port of Google’s language-detection (2016)
10. Gatterbauer, W., Bohunsky, P., Herzog, M., Krüpl, B., Pollak, B.: Towards domain-independent information extraction from web tables. In: Proceedings of the 16th International Conference on World Wide Web, pp. 71–80 (2007)
11. Glozman, R.: Chisel AI (2020)
12. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)
13. Marano, P., Noussia, K. (eds.): InsurTech: A Legal and Regulatory View. Springer (2020)
14. Matthews, B.W.: Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta 405(2), 442–451 (1975)
15. Nakatani, S.: Language Detection Library for Java (2010)
16. Nasar, Z., Jaffry, S.W., Malik, M.K.: Information extraction from scientific articles: a survey. Scientometrics 117(3), 1931–1990 (2018)
17. Pustulka, E., Hanne, T.: Text mining innovation for business. In: Dornberger, R. (ed.) New Trends in Business Information Systems and Technology. Studies in Systems, Decision and Control, vol. 294. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-48332-6_4
18. Rogers, S., Girolami, M.A.: A First Course in Machine Learning. CRC Press, Chapman and Hall/CRC machine learning and pattern recognition series (2011)
19. Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)
20. Sarawagi, S.: Information extraction. Found. Trends Databases 1(3), 261–377 (2008)
21. Shinyama, Y., Guglielmetti, P.: PDFMiner.six - Python PDF Parser (2020)
22. Sinha, A., Bayer, J., Bukhari, S.S.: Table localization and field value extraction in piping and instrumentation diagram images. In: 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), vol. 1, pp. 26–31. IEEE (2019)
23. Tereszkiewicz, P.: Digitalisation of insurance contract law: preliminary thoughts with special regard to insurer’s duty to advise. In: InsurTech: A Legal and Regulatory View, pp. 127–146. Springer (2020)
Forecasting Time Series with Multiplicative Trend Exponential Smoothing and LSTM: COVID-19 Case Study
M. A. Machaca Arceda1, P. C. Laguna Laura1, and V. E. Machaca Arceda2(B)
1 Universidad Nacional de San Agustín de Arequipa, Arequipa, Peru {mmachacaa,plaguna}@unsa.edu.pe
2 Universidad la Salle, Arequipa, Peru [email protected]
Abstract. In this work, we present an analysis of time series of COVID-19 confirmed cases with Multiplicative Trend Exponential Smoothing (MTES) and Long Short-Term Memory (LSTM). We evaluated the results using COVID-19 confirmed cases data from countries with more confirmed cases, such as the United States (US), Italy, and Spain, and from other countries that have presumably stopped the virus, like China, New Zealand, and Australia. Additionally, we used data from a Git repository which is updated daily; at the time of the experiments we used data up to April 28th. We used 80% of the data to train both models and then computed the Root Mean Square Error (RMSE) between the ground-truth test data and the predictions. In our experiments, MTES outperformed LSTM; we believe this is caused by a lack of historical data and the particular behavior of each country. To conclude, we forecast new COVID-19 confirmed cases 10 days ahead using MTES.
Keywords: MTES · LSTM · COVID-19 · Time series · Forecasting

1 Introduction
The coronavirus COVID-19 pandemic started in late December 2019 in Wuhan, capital of Hubei Province. Since then, it has spread rapidly across China and other countries. Furthermore, 84 347 confirmed cases and 214 000 deaths were reported in China. In addition, 3 083 467 confirmed cases and 213 824 deaths were reported worldwide up to April 28th [15]. Many governments imposed quarantines and social distancing in their countries to stop this illness. Moreover, it seems that some countries like Australia and New Zealand have recently stopped the virus. Nevertheless, the rate of confirmed cases is still increasing in other countries, as we can see in Fig. 2(a), (b), and (c). Due to this, many countries decided to extend the quarantine period in order to stop the spread of the virus. Their decisions are based on updated information about newly confirmed, active, mortal, and recovery cases. It should be noted that in this context there is an important lack of future information, current data analysis, and forecasting capabilities which could support future decisions to deal with this widespread disease.

In this paper, we present a comparison between MTES and LSTM for the prediction of COVID-19 confirmed cases from January 22, 2020 to April 28, 2020. We take into account that the rates of confirmed cases differ between countries, and we therefore forecast each country independently. Additionally, we focus on the countries with the most confirmed cases up to April 28th, 2020, such as the United States (US), Italy, and Spain, and also on other countries that have controlled the virus, like China, New Zealand and Australia. We forecast each country using Multiplicative Trend Exponential Smoothing (MTES) and Long Short-Term Memory (LSTM) methods. We used a dataset from a Git repository [30].

© Springer Nature Switzerland AG 2021
K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 568–582, 2021. https://doi.org/10.1007/978-3-030-63089-8_36
Fig. 1. Evolution of confirmed, recovered, mortal and active cases of COVID-19 in Australia (a) and New Zealand (b)
The work is structured as follows: Sect. 2 presents the most relevant related work, Sect. 3 describes the MTES and LSTM methods used, Sect. 4 describes the data and parameters needed to replicate the experiment, Sect. 5 describes the results and a comparison between MTES and LSTM, Sect. 6 discusses the results, Sect. 7 presents the conclusions, and finally Sect. 8 presents future work.
2 Related Work
In the following paragraphs, we review the latest works that forecast and analyse the COVID-19 pandemic. As it is a recent event, we detail the period of time each work used for its analysis. We also point to research that has successfully used MTES and LSTM methods to forecast time series. During the first quarter of 2020, several COVID-19 analysis and forecasting papers were published. For instance, a research group has
M. A. M. Arceda et al.
Fig. 2. Evolution of confirmed, recovered, mortal and active cases of COVID-19 in US (a), Spain (b), Italy (c) and China (d)
analyzed and forecast confirmed cases in China, Italy, and France from January 22nd to March 15th using mean-field models [13]. Other researchers have used phenomenological models to forecast short-term cumulative confirmed case reports in Hubei province from February 2nd to February 24th [36], and in Guangdong and Zhejiang with data up to February 13th [37]. In addition, a modified stacked auto-encoder for real-time forecasting of COVID-19 in China was proposed in order to estimate the size, length, and ending time of the epidemic, from February 11th to February 27th [22]. Other research [3] predicted China's case fatality and recovery ratios with 90% accuracy; the authors used data from January 11th to February 10th with a Susceptible-Infectious-Recovered-Dead (SIDR) model. Likewise, an adaptive neuro-fuzzy inference system (ANFIS) with an enhanced Flower Pollination Algorithm (FPA) and Salp Swarm Algorithm (SSA) was used to forecast confirmed cases in China 10 days ahead [2]. In addition, a research group has predicted the excess demand for hospital beds, ICU beds, and ventilators for the following 4 months [11]. Moreover, some tools have been presented, such as the web page When Will COVID-19 End [28], which presents a pandemic life cycle estimation
and predicts when the pandemic might end. Also, a visual dashboard [30] with GIS presents the current COVID-19 confirmed cases. In addition, the Holt method and its variants (including MTES) are widely used for forecasting. For instance, ARIMA and Holt-Winters, two adapted time-series methods, were used to predict electricity consumption [23], short-term electricity demand [42], and multi-scale internet traffic [10]. Forecasting with a damped trend is also possible with the Holt-Winters method, as Grubb and Mason did when forecasting long lead-time air passengers in the United Kingdom [16]. Holt-Winters can forecast using the additive [1] or the multiplicative method [27], with good results for both. Additionally, there is a specific forecasting model from the Holt family of methods, the Multiplicative Trend Exponential Smoothing (MTES) model, which was used to predict the novel COVID-19 timeline so that governments can plan and make decisions [35]. Besides, Long Short-Term Memory (LSTM) is a recurrent neural network from the machine learning field that is used to predict events in time series. It has been used to forecast traffic flow [12,14,43,46,48], head pose [18], stock prices [4,24–26,38,39,49], financial time series [5,7,40,47], malaria incidence [9,44] and dengue epidemics [8,31].
3 Time Series Forecasting
Brockwell and Davis define a time series as a set of observations $x_t$, each one recorded at a specific time $t$. The goal of time series analysis is the selection of a suitable probability model for future observations [6]. Widely used forecasting techniques include the ARAR algorithm [32], Holt-Winters [20] and, more recently, Long Short-Term Memory [19].

3.1 Multiplicative Trend Exponential Smoothing
According to Pegels [33], exponential smoothing methods are classified into nine forms. Each method is characterized by the kind of series it suits: constant level, additive or multiplicative trend, and additive, multiplicative or no seasonality. The Holt method [21] is usually applied when the data has a linear trend and is non-seasonal [17]; the Holt additive trend method estimates the local growth by smoothing successive differences of the local level, and the sum of the level and the projected growth gives the forecast. Holt-Winters is usually known as a seasonal method with a multiplicative trend [41]. Holt-Winters [45] extends the Holt method [20] by adding seasonality; there are two versions: the additive method (also known as Holt-Winters) and the multiplicative method (also known as Holt-Winters seasonal) [6]. Moreover, in 2003 the Holt-Winters method was extended to multiple seasonalities [42]. The Multiplicative Trend Exponential Smoothing (MTES) model, also called the multiplicative Holt method, was proposed by Pegels [33] and expresses forecasts as
the product of the level and growth rate. Moreover, in the real world, most series have multiplicative trends [41]. The multiplicative Holt method models the local growth rate $R_t$ (Eq. 2) by smoothing successive ratios $S_t / S_{t-1}$ of the local level $S_t$ (Eq. 1). The method gives the forecasts $\hat{X}_t(m)$ (Eq. 3), modelling the trend in a multiplicative way. It is also important to constrain the two smoothing parameters to $0 < \alpha < 1$ and $0 < \gamma < 1$ [41].

$$S_t = \alpha X_t + (1 - \alpha)(S_{t-1} R_{t-1}) \tag{1}$$

$$R_t = \gamma (S_t / S_{t-1}) + (1 - \gamma) R_{t-1} \tag{2}$$

$$\hat{X}_t(m) = S_t R_t^m \tag{3}$$

3.2 Long Short-Term Memory
Long Short-Term Memory, proposed by Hochreiter and Schmidhuber [19], is a special case of recurrent networks. In LSTM, the most complex unit is called a memory cell (Fig. 3). The $j$-th memory cell is denoted $c_j$. Each cell has a fixed self-connection. In addition to $net_{c_j}$, $c_j$ gets input from a multiplicative unit $out_j$ (output gate) and from another multiplicative unit $in_j$ (input gate). The activation of $in_j$ at time $t$ is denoted by $y^{in_j}(t)$, and the activation of $out_j$ is denoted by $y^{out_j}(t)$. Also, we have:

$$y^{out_j}(t) = f_{out_j}(net_{out_j}(t)) \tag{4}$$

$$y^{in_j}(t) = f_{in_j}(net_{in_j}(t)) \tag{5}$$

where:

$$net_{out_j}(t) = \sum_u w_{out_j u}\, y^u(t-1) \tag{6}$$

$$net_{in_j}(t) = \sum_u w_{in_j u}\, y^u(t-1) \tag{7}$$

$$net_{c_j}(t) = \sum_u w_{c_j u}\, y^u(t-1) \tag{8}$$

At time $t$, the output $y^{c_j}(t)$ of $c_j$ is computed as:

$$y^{c_j}(t) = y^{out_j}(t)\, h(S_{c_j}(t)) \tag{9}$$

The internal state $S_{c_j}(t)$ is:

$$S_{c_j}(0) = 0, \qquad S_{c_j}(t) = S_{c_j}(t-1) + y^{in_j}(t)\, g(net_{c_j}(t)) \tag{10}$$
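As an illustration only, Eqs. 4–10 can be traced for a single memory cell in plain Python. The sketch below is not the network used in the experiments: the function names, the logistic function for the gates, and the use of tanh for both g and h are assumptions made for the example.

```python
import math


def logistic(x):
    """Logistic squashing function, assumed here for the gate activations f."""
    return 1.0 / (1.0 + math.exp(-x))


def weighted_sum(weights, prev_activations):
    """net(t) = sum over u of w_u * y^u(t-1), the common form of Eqs. 6-8."""
    return sum(w * y for w, y in zip(weights, prev_activations))


def memory_cell_step(w_in, w_out, w_c, y_prev, s_prev,
                     f=logistic, g=math.tanh, h=math.tanh):
    """Advance one memory cell by one time step.

    w_in, w_out, w_c: weights into the input gate, output gate and cell input.
    y_prev: activations y^u(t-1) of the units feeding the cell.
    s_prev: internal state S_cj(t-1); use 0.0 for the first step.
    g and h are assumed to be tanh; this is an illustrative choice.
    """
    y_in = f(weighted_sum(w_in, y_prev))               # Eq. 5 with Eq. 7
    y_out = f(weighted_sum(w_out, y_prev))             # Eq. 4 with Eq. 6
    s = s_prev + y_in * g(weighted_sum(w_c, y_prev))   # Eq. 10 with Eq. 8
    y_c = y_out * h(s)                                 # Eq. 9
    return y_c, s


# One step from the initial state S_cj(0) = 0 with a single input unit.
y_c, s = memory_cell_step(w_in=[0.0], w_out=[0.0], w_c=[1.0],
                          y_prev=[1.0], s_prev=0.0)
```

Because the state update in Eq. 10 is additive, the value of `s` is carried from step to step unchanged unless the input gate lets new information in.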
Fig. 3. Architecture of memory cell cj and its gate units inj , outj . Source: [19]
4 Methods and Materials
In this work, we present a comparison between MTES and LSTM for predicting COVID-19 confirmed cases from January 22, 2020 to April 28, 2020. The data set was collected from a Git repository that is updated daily [30]. Figures 1 and 2 show the confirmed, active, mortal and recovered cases of COVID-19. China has controlled the virus, but other countries have not yet managed to do so. However, the active cases have started to decrease in Spain. We opted for the MTES model, which belongs to the exponential smoothing family [33], because this family has shown good accuracy in several forecasting competitions and is also suitable for short series [29]. Although algorithms exist to select a forecasting model [34], we chose to select judgmentally a model that reflects the nature of the data: as Figs. 1 and 2 show, the time series are non-seasonal, and we assume that the trend will continue growing. Given that perspective, we rejected the S-curve model, which assumes that COVID-19 cases converge. Furthermore, choosing models judgmentally gives results similar to, if not better than, algorithmic selection [34]. For optimal forecasting with the MTES method, we used Solver to find the minimum prediction error: we changed the values of the smoothing parameters until we found the optimal values. For LSTM, we used one layer of 40 cells followed by a dense layer; Table 1 details the parameters used. We implemented the network with Keras and TensorFlow in Python. In order to make the results replicable, we used a random seed equal to 1.

Table 1. LSTM parameters.
Parameter          Value
Num. layers        1
Cells by layer     40
Batch size         1
Validation split   20%
Look back          1
Epochs             50
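The MTES recursion of Eqs. 1–3 and a parameter search of the kind described in this section can be sketched as follows. This is a minimal illustration rather than the procedure used in this work: a coarse grid search stands in for Solver, the initial level and growth rate are taken from the first two observations, the in-sample one-step RMSE is the error being minimised, and the toy series is synthetic. All of these choices, and all function names, are assumptions.

```python
import math


def mtes_fit(series, alpha, gamma):
    """Run Eqs. 1-2 over a strictly positive series.

    Returns the final level S_t, the final growth rate R_t and the list of
    in-sample one-step-ahead errors. Initialisation (assumed): the level is
    the first observation and the rate is the first observed ratio.
    """
    level = series[0]
    rate = series[1] / series[0]
    errors = []
    for x in series[1:]:
        prev_level = level
        one_step = prev_level * rate                     # forecast made before seeing x
        errors.append(x - one_step)
        level = alpha * x + (1 - alpha) * one_step       # Eq. 1
        rate = gamma * (level / prev_level) + (1 - gamma) * rate  # Eq. 2
    return level, rate, errors


def mtes_forecast(level, rate, horizon):
    """Eq. 3: forecasts S_t * R_t**m for m = 1..horizon."""
    return [level * rate ** m for m in range(1, horizon + 1)]


def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def grid_search(train, steps=19):
    """Pick alpha and gamma on a grid strictly inside (0, 1) by in-sample RMSE."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    best = None
    for alpha in grid:
        for gamma in grid:
            _, _, errors = mtes_fit(train, alpha, gamma)
            err = math.sqrt(sum(e * e for e in errors) / len(errors))
            if best is None or err < best[0]:
                best = (err, alpha, gamma)
    return best


# Synthetic cumulative series with 10% growth per step (not real case data).
series = [100.0 * 1.1 ** t for t in range(20)]
split = int(len(series) * 0.8)                           # 80% train, 20% test
train, test = series[:split], series[split:]

_, alpha, gamma = grid_search(train)
level, rate, _ = mtes_fit(train, alpha, gamma)
test_error = rmse(test, mtes_forecast(level, rate, len(test)))
```

On real data the grid could be refined around the best pair, or replaced by any bounded optimiser; the recursion itself does not change.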
5 Results
For LSTM, we ran the network with different configurations in order to find the best parameters. In Fig. 4, we evaluate different network architectures. We varied the number of cells in the first layer (the second layer is a dense layer) and computed the Root Mean Square Error (RMSE) against the number of epochs for the confirmed cases in the US; we used 80% of the whole data set for training and 20% for testing. We did not use more layers because the model's RMSE increased. In addition, as Fig. 4 shows, with more epochs the model started to overfit. Furthermore, this behavior is different for each country. In Figs. 5 and 6, we present how LSTM performs on the confirmed cases data of the US, Italy, Spain, China, New Zealand and Australia. All countries behave differently, and we needed to adapt some network parameters in order to improve the results. For example, for the US, Spain, Italy and China we used 50 epochs and 40 cells in the LSTM layer (we used one layer), and all of them gave acceptable results. On the other hand, for New Zealand and Australia, we used the same network topology, but we had to increase the number of epochs to 100 and 270, respectively. Moreover, we used 80% of the data of each country to train both LSTM and MTES. Then, we used only the last training sample to predict the following days with both models, plotted each prediction, and compared it with the ground truth; Fig. 7 presents the results. For LSTM, we made a forecast using the last training sample, and this predicted value was then used to predict a new one; therefore, an error in one prediction causes a chain of errors, as shown in Fig. 7. In addition, MTES outperforms LSTM; this happens because LSTM needs more historical data, and the parameter tuning
Fig. 4. RMSE vs. epochs on the US test dataset. Each curve represents an LSTM architecture with a different number of cells in the first layer.
Fig. 5. Comparison of prediction and ground-truth curves on the train and test datasets of LSTM for US (a), Spain (b), Italy (c) and China (d)
Fig. 6. Comparison of prediction and ground-truth curves on the train and test datasets of LSTM for New Zealand (a) and Australia (b)
Table 2. RMSE in test dataset for each country.
Country        LSTM          MTES
US             1234694.24    485278.14
Italy          263230.19     25251.04
Spain          725785.44     24644.52
China          11551.90      222.93
Australia      672.35        905.39
New Zealand    527.77        73.18
depends on the data; since each country has a particular behavior, it is difficult to have a general model for all countries. Table 2 presents the Root Mean Square Error (RMSE) of both models. Furthermore, we predicted the confirmed cases from April 29th to May 9th using the MTES model, with data up to April 28th for training. Fig. 8 presents the results; as we can see, the MTES predictions are consistent with the trend of the COVID-19 confirmed cases curve. Table 3 lists the predicted COVID-19 confirmed cases up to May 9th using the MTES model.

5.1 Limitations
Unfortunately, the information on confirmed cases around the world is not trustworthy. For example, asymptomatic cases are not tracked in almost any country, and the number of confirmed cases depends on the number of tests performed by each country; for instance, if a country doubles the number of tests, the number of confirmed cases will increase. All of this information has to be taken into account, but it is not published by the countries.
6 Discussion
LSTM is a neural network used in forecasting. It seems to perform well when we use a ground-truth sample to predict the next sample (Figs. 5 and 6). Nevertheless, when we use a predicted sample as input, the network makes poor predictions, as shown in Fig. 7. We got poor results because of the small number of samples: the data only goes back to January. Furthermore, it is known that MTES performs well for short-term predictions. In this case, for COVID-19 confirmed cases, MTES outperformed LSTM (see Fig. 7 and Table 2) with this small number of samples.
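The difference between the two evaluation modes discussed here can be made concrete with a toy example. The sketch below does not use an LSTM: a hypothetical one-step model that slightly underestimates the growth rate stands in for it, and the series, growth rates and function names are assumptions chosen only to show how feeding predictions back as inputs compounds an otherwise small error.

```python
def one_step_forecasts(model, series):
    """Predict each next value from the true previous value."""
    return [model(x) for x in series[:-1]]


def recursive_forecasts(model, last_observed, horizon):
    """Predict `horizon` values, feeding each prediction back as the next input."""
    predictions = []
    x = last_observed
    for _ in range(horizon):
        x = model(x)
        predictions.append(x)
    return predictions


TRUE_GROWTH = 1.05    # synthetic series grows 5% per step
MODEL_GROWTH = 1.04   # hypothetical model underestimates the growth slightly


def biased_model(x):
    """Stand-in one-step forecaster with a small systematic error."""
    return MODEL_GROWTH * x


series = [1000.0 * TRUE_GROWTH ** t for t in range(11)]
targets = series[1:]

one_step = one_step_forecasts(biased_model, series)
recursive = recursive_forecasts(biased_model, series[0], len(targets))

# Relative error per step for both modes.
rel_err_one_step = [abs(p - a) / a for p, a in zip(one_step, targets)]
rel_err_recursive = [abs(p - a) / a for p, a in zip(recursive, targets)]
```

With true inputs the relative error stays the same at every step, whereas in the recursive mode it grows with the horizon, which is the error chain visible in Fig. 7.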
Fig. 7. Comparison of LSTM and MTES for prediction of test dataset. US (a), Spain (b), Italy (c), China (d), New Zealand (e) and Australia (f)
Fig. 8. Confirmed COVID-19 cases predicted from April 29th to May 9th. US (a), Spain (b), Italy (c), China (d), New Zealand (e) and Australia (f)
Table 3. Prediction of COVID-19 confirmed cases up to May 9th using MTES model.
Date         US        Spain    Italy    China   New Zealand   Australia
04/29/2020   1037569   234583   203532   83957   1476          6762
04/30/2020   1063172   237089   205600   83976   1478          6782
05/01/2020   1089407   239622   207689   83994   1480          6802
05/02/2020   1116289   242181   209799   84013   1482          6822
05/03/2020   1143835   244768   211931   84032   1484          6842
05/04/2020   1172060   247382   214084   84050   1487          6862
05/05/2020   1200982   250025   216259   84069   1489          6882
05/06/2020   1230617   252695   218456   84088   1491          6903
05/07/2020   1260984   255394   220675   84106   1493          6923
05/08/2020   1292100   258122   222917   84125   1495          6944
05/09/2020   1323984   260879   225182   84143   1497          6964

7 Conclusions
A comparison between MTES and LSTM is presented. We evaluated the models by forecasting COVID-19 confirmed cases in the US, Spain, Italy, China, New Zealand, and Australia, using data up to April 28th. MTES outperformed LSTM in terms of RMSE. We used 80% of the data for training and then predicted with both models; the results are presented in Fig. 7 and Table 2. LSTM got poor results, which we believe is caused by a lack of historical data. Besides, we used only the last sample to predict the confirmed cases on the next day; this predicted value was then used for a new prediction, and so on, so a minor error at the beginning could cause a chain of errors. Every country behaves differently, and it is difficult to have a single model for all countries. Evidence of this is the case of Australia and New Zealand, for which we needed more epochs in order to get better results.
8 Future Work
In this work, MTES outperformed LSTM because of the scarce historical information. In future work, we are planning to collect more samples and attributes in order to analyse the confirmed cases with multivariate time series techniques.
References
1. Abdurrahman, M., Irawan, B., Latuconsina, R.: Flood forecasting using Holt-Winters exponential smoothing method and geographic information system. In: 2017 International Conference on Control, Electronics, Renewable Energy and Communications (ICCREC), pp. 159–163. IEEE (2017)
2. Al-Qaness, M.A., Ewees, A.A., Fan, H., Abd El Aziz, M.: Optimization method for forecasting confirmed cases of COVID-19 in China. J. Clin. Med. 9(3), 674 (2020)
3. Anastassopoulou, C., Russo, L., Tsakris, A., Siettos, C.: Data-based analysis, modelling and forecasting of the COVID-19 outbreak. PLoS ONE 15(3), e0230405 (2020)
4. Baek, Y., Kim, H.Y.: Modaugnet: a new forecasting framework for stock market index value with an overfitting prevention LSTM module and a prediction LSTM module. Expert Syst. Appl. 113, 457–480 (2018)
5. Bao, W., Yue, J., Rao, Y.: A deep learning framework for financial time series using stacked autoencoders and long-short term memory. PLoS ONE 12(7), e0180944 (2017)
6. Brockwell, P.J., Davis, R.A.: Introduction to Time Series and Forecasting. Springer, Cham (2016)
7. Cao, J., Li, Z., Li, J.: Financial time series forecasting model based on CEEMDAN and LSTM. Phys. A 519, 127–139 (2019)
8. Chakraborty, T., Chattopadhyay, S., Ghosh, I.: Forecasting dengue epidemics using a hybrid methodology. Phys. A 527, 121266 (2019)
9. Connor, S.J., Mantilla, G.C.: Integration of seasonal forecasts into early warning systems for climate-sensitive diseases such as malaria and dengue. In: Seasonal Forecasts, Climatic Change and Human Health, pp. 71–84. Springer, Dordrecht (2008)
10. Cortez, P., Rio, M., Rocha, M., Sousa, P.: Multi-scale internet traffic forecasting using neural networks and time series methods. Expert Syst. 29(2), 143–155 (2012)
11. IHME COVID, Murray, C.J.L., et al.: Forecasting COVID-19 impact on hospital bed-days, ICU-days, ventilator-days and deaths by US state in the next 4 months. medRxiv (2020)
12. Cui, Z., Ke, R., Pu, Z., Wang, Y.: Deep bidirectional and unidirectional LSTM recurrent neural network for network-wide traffic speed prediction. arXiv preprint arXiv:1801.02143 (2018)
13. Fanelli, D., Piazza, F.: Analysis and forecast of COVID-19 spreading in China, Italy and France. Chaos Solitons Fractals 134, 109761 (2020)
14. Fu, R., Zhang, Z., Li, L.: Using LSTM and GRU neural network methods for traffic flow prediction. In: 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC), pp. 324–328. IEEE (2016)
15. Google. Google news COVID-19. https://news.google.com/covid19/map?hl=enUS&gl=US&ceid=US:en. Accessed 28 Apr 2020
16. Grubb, H., Mason, A.: Long lead-time forecasting of UK air passengers by Holt-Winters methods with damped trend. Int. J. Forecast. 17(1), 71–82 (2001)
17. Hanke, J.E., Wichern, D.W.: Pronosticos en los negocios. Technical report (2006)
18. Hasan, I., Setti, F., Tsesmelis, T., Del Bue, A., Galasso, F., Cristani, M.: MX-LSTM: mixing tracklets and vislets to jointly forecast trajectories and head poses. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6067–6076 (2018)
19. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
20. Holt, C.C.: Forecasting seasonals and trends by exponentially weighted moving averages. ONR Research Memorandum, 52 (1957)
21. Holt, C.C.: Forecasting trends and seasonals by exponentially weighted averages. Carnegie institute of technology. Technical report, Pittsburgh ONR memorandum (1957)
22. Hu, Z., Ge, Q., Jin, L., Xiong, M.: Artificial intelligence forecasting of COVID-19 in China. arXiv preprint, arXiv:2002.07112 (2020)
23. Hussain, A., Rahman, M., Memon, J.A.: Forecasting electricity consumption in Pakistan: the way forward. Energy Policy 90, 73–80 (2016)
24. Jiang, Q., Tang, C., Chen, C., Wang, X., Huang, Q.: Stock price forecast based on LSTM neural network. In: International Conference on Management Science and Engineering Management, pp. 393–408. Springer, Heidelberg (2018)
25. Kim, H.Y., Won, C.H.: Forecasting the volatility of stock price index: a hybrid model integrating LSTM with multiple GARCH-type models. Expert Syst. Appl. 103, 25–37 (2018)
26. Kim, T., Kim, H.Y.: Forecasting stock prices with a feature fusion LSTM-CNN model using different representations of the same data. PLoS ONE 14(2), e0212320 (2019)
27. Koo, B.-G., Kim, M.-S., Kim, K.-H., Lee, H.-T., Park, J.-H., Kim, C.-H.: Short-term electric load forecasting using data mining technique. In: 2013 7th International Conference on Intelligent Systems and Control (ISCO), pp. 153–157. IEEE (2013)
28. SUTD Data-Driven Innovation Lab. When will COVID-19 end. https://ddi.sutd.edu.sg/when-will-covid-19-end/. Accessed 04 Apr 2020
29. Makridakis, S., Andersen, A., Carbone, R., Fildes, R., Hibon, M., Lewandowski, R., Newton, J., Parzen, E., Winkler, R.: The accuracy of extrapolation (time series) methods: results of a forecasting competition. J. Forecast. 1(2), 111–153 (1982)
30. Miller, M.: 2019 novel coronavirus COVID-19 (2019-nCoV) data repository. Bulletin-Association of Canadian Map Libraries and Archives (ACMLA), no. 164, pp. 47–51 (2020)
31. Mussumeci, E., Coelho, F.C.: Machine-learning forecasting for dengue epidemics-comparing LSTM, random forest and lasso regression. medRxiv (2020)
32. Newton, H.J., Parzen, E.: Forecasting and time series model types of 111 economic time series. Technical report, Texas A&M Univ College Station Inst of Statistics (1983)
33. Pegels, C.C.: Exponential forecasting: some new variations. Manag. Sci. 15, 311–315 (1969)
34. Petropoulos, F., Kourentzes, N., Nikolopoulos, K., Siemsen, E.: Judgmental selection of forecasting models. J. Oper. Manag. 60, 34–46 (2018)
35. Petropoulos, F., Makridakis, S.: Forecasting the novel coronavirus COVID-19. PLoS ONE 15(3), e0231236 (2020)
36. Roosa, K., Lee, Y., Luo, R., Kirpich, A., Rothenberg, R., Hyman, J.M., Yan, P., Chowell, G.: Real-time forecasts of the COVID-19 epidemic in China from February 5th to February 24th, 2020. Infect. Disease Model. 5, 256–263 (2020)
37. Roosa, K., Lee, Y., Luo, R., Kirpich, A., Rothenberg, R., Hyman, J.M., Yan, P., Chowell, G.: Short-term forecasts of the COVID-19 epidemic in Guangdong and Zhejiang, China: February 13–23, 2020. J. Clin. Med. 9(2), 596 (2020)
38. Selvin, S., Vinayakumar, R., Gopalakrishnan, E.A., Menon, V.K., Soman, K.P.: Stock price prediction using LSTM, RNN and CNN-sliding window model. In: 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pp. 1643–1647. IEEE (2017)
39. Shao, X., Ma, D., Liu, Y., Yin, Q.: Short-term forecast of stock price of multi-branch LSTM based on k-means. In: 2017 4th International Conference on Systems and Informatics (ICSAI), pp. 1546–1551. IEEE (2017)
40. Siami-Namini, S., Namin, A.S.: Forecasting economics and financial time series: ARIMA vs. LSTM. arXiv preprint, arXiv:1803.06386 (2018)
41. Taylor, J.W.: Exponential smoothing with a damped multiplicative trend. Int. J. Forecast. 19(4), 715–725 (2003)
42. Taylor, J.W.: Short-term electricity demand forecasting using double seasonal exponential smoothing. J. Oper. Res. Soc. 54(8), 799–805 (2003)
43. Tian, Y., Zhang, K., Li, J., Lin, X., Yang, B.: LSTM-based traffic flow prediction with missing data. Neurocomputing 318, 297–305 (2018)
44. Verma, A.K., Kuppili, V.: Data-oriented neural time series with long short-term memories (LSTM) for malaria incidence prediction in Goa, India. In: 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT), pp. 1–6. IEEE (2019)
45. Winters, P.R.: Forecasting sales by exponentially weighted moving averages. Manag. Sci. 6(3), 324–342 (1960)
46. Wu, Y., Tan, H.: Short-term traffic flow forecasting with spatial-temporal correlation in a hybrid deep learning framework. arXiv preprint, arXiv:1612.01022 (2016)
47. Yan, H., Ouyang, H.: Financial time series prediction based on deep learning. Wireless Pers. Commun. 102(2), 683–700 (2018)
48. Zhao, Z., Chen, W., Wu, X., Chen, P.C., Liu, J.: LSTM network: a deep learning approach for short-term traffic forecast. IET Intell. Transp. Syst. 11(2), 68–75 (2017)
49. Zhuge, Q., Xu, L., Zhang, G.: LSTM neural network with emotional analysis for prediction of stock price. Eng. Lett. 25(2), 167–175 (2017)
Quick Lists: Enriched Playlist Embeddings for Future Playlist Recommendation
Brett Vintch
iHeartRadio, New York City, NY, USA

Abstract. Recommending playlists to users in the context of a digital music service is a difficult task because a playlist is often more than the mere sum of its parts. We present a novel method for generating playlist embeddings that are invariant to playlist length and sensitive to local and global track ordering. The embeddings also capture information about playlist sequencing, and are enriched with side information about the playlist user. We show that these embeddings are useful for generating next-best playlist recommendations, and that side information can be used for the cold start problem.

Keywords: Playlist recommendation · Playlist embeddings

1 Introduction
Playlists are a common medium for music consumption and dissemination, and thus an important domain for the development of recommendation engines. While playlists are composed of individual tracks, the collection itself can be a distinct entity. Each track can be associated with multiple genres, moods, or concepts, and it is a track’s context that defines its meaning and interpretation. For example, a Rage Against the Machine track could be included in both a rock genre playlist and in a protest song playlist, and the overall playlist context could plausibly affect a user’s reaction to the track. In this work, we present a new method to embed playlists into a high dimensional space that is sensitive to local track context, and is naturally suited to recommending next-best playlists.

To fully capture the complexity of playlists, we believe that embeddings should meet a number of criteria. Embeddings should be invariant to playlist length and be sensitive to local or global track ordering. They should also ideally encode information about playlist sequencing, or the next-best future playlists given a current playlist. Much work has been done on embedding individual tracks using both user behavior [7,8] and audio content [12], but it is not clear how one should aggregate these embeddings to the playlist level. Operations on individual item embeddings tend to employ order-invariant aggregations across the collection, such as sums, averages, or maximums. Though these approaches allow for comparison between playlists and are length-agnostic, they do not account for sequencing within a playlist or between playlists.

© Springer Nature Switzerland AG 2021 K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 583–593, 2021. https://doi.org/10.1007/978-3-030-63089-8_37
B. Vintch
Fig. 1. Playlist recommendation is treated as a classification task. The training paradigm seeks to embed current and future playlists such that actual future playlists are selected with a higher probability than random playlists. The classifier is trained with a Bayesian Personalized Ranking (BPR) loss function.
There are strong analogies between the task of representing playlists and that of representing natural language. Sentences are collections of words, where word order matters and phrase context gives additional meaning. Similarly, playlists are collections of tracks, where track ordering may be important and local track context can have an impact on the perceived mood or meaning. Recent works have made a similar connection [6], but viable solutions mostly focus on recurrent neural networks for playlist completion [3,11]. These types of recommenders are notoriously difficult to tune in order to produce useful recommendations, and they are also slow to train. Instead, we take inspiration from a new method that explicitly embeds sentences for use in determining the next most logical sentence [5]; importantly, this method frames the process as a simple classification problem. Our primary contribution is to show the utility of sentence embedding models for the task of recommending playlists. We extend this model to user side information and show that it is possible to manipulate recommendations with the addition of side information, and even to use only side information in a cold start situation. This new model meets the criteria for playlist embeddings outlined above, and is efficient to learn.
2 Methods
2.1 Model
The quick thoughts model, introduced by Logeswaran & Lee [5], treats sentence representation as a classification task. Sentences are encoded such that they are maximally predictive of the next sentence, as determined by a classifier. This discriminative approach to sentence embedding operates an order of magnitude
Quick Lists
faster than generative approaches, and learns to ignore aspects of sentences not connected to their meaning. Our approach for embedding playlists, quick lists, borrows this framework, substituting sequences of sentences with sequences of playlists. We further extend the framework by allowing for the inclusion of side information that describes the playlist listeners themselves.

Fig. 2. Encoders embed playlists and user side information with a deep neural network that includes track embeddings, a bidirectional LSTM, a bank of convolutional kernels, and a final dense network. Numbers depict layer output sizes.
Our goal is to embed playlists such that embeddings are maximally predictive of the future playlists in a sequence. We define two encoders. The first, “current” encoder embeds a playlist into a high dimensional space. The second, “future” encoder embeds a user’s subsequent playlist into the same space. A classifier then seeks to identify the correct playlist from a pair of playlists, where one is the actual future playlist and the other is a random playlist (Fig. 1); that is, pairs of actual current and future playlists should be close together, and random playlists should be far apart. It is important that the current encoder and the future encoder are learned separately; although we want the embeddings from each encoder to live in the same space, current and future playlists are expected to be composed of different tracks that reflect a user’s listening trajectory over time. We chose our loss function to be analogous to Bayesian Personalized Ranking loss (BPR) [10], which seeks to maximize the probability that a user $u$’s preferred item $p$ ranks higher than a user’s non-preferred item $n$:

$$P(p > n \mid \Theta, u) = \sigma(\hat{x}_{upn}(\Theta)),$$

where $\sigma$ is a sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$

$\hat{x}_{upn}(\Theta)$ is an arbitrary function parameterized by $\Theta$ that captures the relationship between a user’s current and future playlists, $c$ and $p$, and compares it against the relationship between the user’s current playlist and a random playlist, $n$. That is, $\hat{x}_{upn}(\Theta)$ captures the extent to which the actual future playlist is closer to the current playlist than a random future playlist. We restrict our classifier to simple distance metrics so that learning is targeted to the playlist encoders and not the classifier; we prefer a powerful encoder for the generalizability of the embeddings. We considered Euclidean distance, cosine distance, and a distance metric based on the dot product between two embeddings. Though dot products are commonly used for similarity calculations in recommendation tasks, we find that this metric’s sensitivity to vector length encourages a bias towards popular content and does not produce qualitatively good predictions for less popular content. Though we observed that Euclidean-based models tended to take longer to converge, we also noticed a tendency for inference to be more equitable across content types; this was preferable for our use case, and so the experiments described below use Euclidean distance. Thus,

$$\hat{x}_{upn}(\Theta) = \lVert v_{uc} - v_{up} \rVert - \lVert v_{uc} - v_n \rVert,$$

where vectors $v_{ue}$ represent playlist embeddings, and $e$ can reference current ($c$), preferred ($p$), and non-preferred ($n$) playlists.

The encoders for the current and next playlists both share the same architecture (Fig. 2), but they are trained independently so that they can adapt to sequence order. The encoders operate on a padded sequence of track vectors that are concatenated into a 2D matrix. This matrix is passed through a 1D max pooling function before being fed into a bidirectional LSTM with 16 hidden units. This output is then processed by a bank of 3 convolutional layers with different filter sizes (2, 5, and 10) and ReLU activation functions.
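As a sketch of the ranking objective defined above, the loss for a single (current, future, random) triple can be written in plain Python. This is not the training code of this work: the function names are hypothetical, the loss is taken to be the negative log of the BPR probability, and the sign of the score is an assumption, chosen so that a larger score means the actual future playlist is closer to the current one than the random playlist is. The printed equation lists the two distance terms in the opposite order.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sigmoid(x):
    """Logistic sigmoid, sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def bpr_loss(v_current, v_future, v_random):
    """Negative log-probability that the actual future playlist outranks a random one.

    Assumed sign convention: the score is positive when the actual future
    playlist embedding is closer to the current embedding than the random one.
    """
    score = euclidean(v_current, v_random) - euclidean(v_current, v_future)
    return -math.log(sigmoid(score))


# Toy 2-D embeddings: the future playlist is near the current one, the random one is far.
current, future, random_playlist = [0.0, 0.0], [1.0, 0.0], [3.0, 4.0]
loss_when_close = bpr_loss(current, future, random_playlist)
loss_when_far = bpr_loss(current, random_playlist, future)
```

The loss is small when the encoders already place the actual future playlist nearer than the random one, and large when the ordering is reversed, so minimising it over many triples pushes the two encoders toward the behavior described in Fig. 1.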
Each filter output is subjected to another 1D max pooling function, and 50% dropout is applied to this filter bank during training. The final output of the network is a dense layer with Tanh activation functions and L2 regularization; this layer produces the final mapping of each playlist to its embedding.
An additional factor often found in the context of playlists but not in natural language is the existence of user side information. We hypothesized that this information could be useful for recommendations, especially in the case of new users and cold starts. In the spirit of Wide and Deep models [1], we include a shallow network that combines categorical and scalar user information with the output of the encoder just before the final Tanh activation layer.
2.2 Training
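Before the training details, the Euclidean comparison from Sect. 2.1 that the network optimizes can be written out as a minimal NumPy sketch. The function name `triplet_score` and the toy vectors are illustrative assumptions, not the author's code, and the sign convention simply follows the expression as printed in Sect. 2.1.

```python
import numpy as np

def triplet_score(v_uc, v_up, v_n):
    """Distance(current, preferred) minus distance(current, random).

    Hypothetical helper: the inputs are playlist embeddings of equal
    dimension, assumed to come from the current and future encoders.
    """
    d_pos = np.linalg.norm(v_uc - v_up, axis=-1)  # current vs. actual future
    d_neg = np.linalg.norm(v_uc - v_n, axis=-1)   # current vs. random playlist
    return d_pos - d_neg

# Toy embeddings (illustrative values only).
v_uc = np.array([0.0, 0.0])
v_up = np.array([0.3, 0.4])   # Euclidean distance 0.5 from v_uc
v_n = np.array([3.0, 4.0])    # Euclidean distance 5.0 from v_uc
score = triplet_score(v_uc, v_up, v_n)
```

Under this convention a negative score means the actual future playlist lies closer to the current playlist than the random one does.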
We define and train the network in Keras [2] with an Adam optimizer [4]. Track embeddings are initialized as their word2vec embeddings learned over playlists as if they were sentences (we use the gensim implementation [9] and drop tracks with 5 or fewer plays in our dataset). However, track embeddings are not fixed
and are further learned along with the rest of the model during optimization. We find that fixing track embeddings hinders performance, and this seems to be especially true for Euclidean-based classifiers. The model is trained over 100 epochs using a learning schedule. The schedule drops the learning rate by a factor of 0.25 every 10 epochs. Training takes about 16 h on an NVIDIA Tesla K80. By manual inspection, epochs past the point where training loss begins to asymptote (10–20 epochs) help to fine-tune the recommendations, and help most for users who play rare or unpopular content.
2.3 Data
The quick lists algorithm is designed to embed and recommend playlists. However, any ordered sequence of tracks can be used as input. Our primary use case is to recommend a next-best playlist to a user, and so for the purpose of these experiments we define a user’s current playlist to be the sequence of most recently played tracks, regardless of their source. iHeartRadio has an array of digital music products, including live radio, custom artist radio, and user-generated playlists. We take the last 50 played tracks for each user across all of these products and split them into “current” and “future” playlists. We do not include tracks that were thumbed down or skipped. In the case where a user listened to between 25 and 50 tracks, the last 25 tracks are assigned to the future playlist, and the rest are assigned to the current playlist. Data is collected for a random sample of 300,000 users in January 2019. Users are further randomly split into training and testing sets with an 85/15 ratio.
We also collect side information for each user where available. We one-hot encode the user’s gender, age (in 5-year bins), and country of registration (out of 5 possible countries), and multi-hot encode their self-reported genre or radio format preferences from user on-boarding. There are 57 unique genres that were chosen, with the most popular being “Top 40 & Pop”, “Hip Hop and R&B”, and “Country”. While a user’s stated genre preference does not always reflect their revealed preference in actual listening, these preferences are of considerable interest to us as a possible solution to the cold start problem. In total, there are 86 binary features that describe a user, and 72% of users had at least one active feature.
2.4 Experiments and Analysis
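Before turning to the experiments, the current/future split described in Sect. 2.3 can be made concrete with a short sketch. The function name `split_history` is hypothetical, and returning `None` for users with fewer than 25 tracks is an assumption, since the text does not say how such users are treated. Skipped and thumbed-down tracks are assumed to have been filtered out beforehand.

```python
def split_history(tracks, max_len=50, future_len=25):
    """Split a chronologically ordered play history into (current, future).

    Keeps at most the last `max_len` tracks; the final `future_len` tracks
    form the future playlist and the remainder forms the current playlist.
    """
    recent = tracks[-max_len:]
    if len(recent) < future_len:
        return None  # assumption: too little history to form a future playlist
    return recent[:-future_len], recent[-future_len:]

# A user with 60 plays: only the last 50 are used, split 25/25.
current, future = split_history(list(range(60)))

# A user with 30 plays: 5 tracks in the current playlist, 25 in the future one.
short_current, short_future = split_history(list(range(30)))
```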
The quick lists procedure is intended to recommend future playlists to users based upon recent listening history. Playlist recommendation is treated as a nearest neighbor task. A user’s current state, consisting of their recently played tracks and their profile side information, is embedded with the current encoder. Meanwhile, all possible future playlists are encoded with the future encoder. The future playlist that is closest to the current playlist in the embedded space is recommended to the user.
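The nearest neighbor step can be sketched as a brute-force scan over precomputed embeddings. This is a minimal illustration under stated assumptions: the name `recommend_nearest` is hypothetical, the embeddings are assumed to have already been produced by the current and future encoders, and a production system might well use an approximate index instead of an exhaustive search.

```python
import numpy as np

def recommend_nearest(current_vec, future_vecs):
    """Return (index, distance) of the future playlist embedding that is
    closest in Euclidean distance to the current playlist embedding."""
    dists = np.linalg.norm(future_vecs - current_vec, axis=1)
    idx = int(np.argmin(dists))
    return idx, float(dists[idx])

# Toy candidate embeddings, one row per candidate future playlist.
future_vecs = np.array([[1.0, 0.0],
                        [0.0, 2.0],
                        [0.2, 0.1]])
best_idx, best_dist = recommend_nearest(np.array([0.0, 0.0]), future_vecs)
```

Here the third candidate (index 2) is returned because it lies closest to the origin, which stands in for the user's current state.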
B. Vintch
Fig. 3. Average (line) and 25th and 75th percentiles (band) of $\hat{x}_{upn}$ while training, for both training and test sets.
The embedded proximity of pairs of current and future playlists is an indicator of how well the model fits the data. During training we track the distribution of distances between current and future playlists and compare it to the distribution of distances between random current and future playlists. After fitting, we contrast these distributions to versions of the model that are effectively “lesioned” by omitting the input data for different feature sets (set to zeros). As a proxy for future playlist prediction accuracy we analyze the accuracy of track recommendations in predicted playlists. Specifically, we measure the overlap in tracks between the predicted playlists and each user’s actual future playlist for the test set users, measured as an F1 score which combines precision and recall. We use this metric because it allows for easy comparisons between models, including baseline models. Note, however, that it does not take into account track order within playlists, despite sensitivity to order being a desired property; the metric may therefore miss more subtle quality differences between models. We also measure the percentage of tracks recommended in the predicted future playlist that also appear in a user’s current playlist, as some use cases (including our own) may wish to penalize familiarity (i.e. repetitiveness) and encourage novelty. We compare the quick lists model predictions to several baselines and to a word2vec-based approach. We consider a baseline where the recommended playlist is built from random tracks, one where the recommended playlist is an identical set of the most popular tracks in the data set, and one where the current playlist is simply repeated as the recommended future playlist. For the word2vec-based model, we average the word2vec vectors for each track in the current playlist and create a recommended playlist by finding the tracks closest to the average. 
For each of these approaches we draw the playlist length from the distribution of actual playlist lengths for test-set users.
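The two evaluation measures described above can be sketched as follows. The function names are hypothetical, and treating each playlist as a set of track identifiers when computing F1 (so duplicates and order are ignored) is an assumption; the text only states that track order is not taken into account.

```python
def f1_overlap(predicted, actual):
    """F1 score of the track overlap between a predicted and an actual
    future playlist, treating both as sets of track identifiers."""
    pred, act = set(predicted), set(actual)
    hits = len(pred & act)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(act)
    return 2 * precision * recall / (precision + recall)

def familiarity(predicted, current):
    """Fraction of predicted tracks that already appear in the current playlist."""
    if not predicted:
        return 0.0
    seen = set(current)
    return sum(track in seen for track in predicted) / len(predicted)

predicted = ["a", "b", "c", "d"]
actual_future = ["b", "d", "e"]
current = ["a", "x"]

f1 = f1_overlap(predicted, actual_future)    # precision 1/2, recall 2/3
fam = familiarity(predicted, current)        # 1 of 4 predicted tracks is familiar
```

Repeating the current playlist as the prediction gives a familiarity of 1.0 by construction, which is why that baseline is reported with 100% familiarity.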
Fig. 4. Model performance measured as the distance between pairs of current and future playlists, with and without lesioning (red largely overlaps blue). (Color figure online)
3 Results
3.1 Predictive
The quick list loss function encourages current and future user playlists to be close together in embedded space and random current and future playlists to be far apart. The model learns to distinguish these two categories during training, with performance beginning to asymptote around 10–20 epochs for both training and testing data (Fig. 3). It continues to improve in subsequent epochs but at a slower rate. We justify the continued training by observing qualitative improvements for less popular content.
We assess the relative importance of each of the two model inputs by omitting one at a time during inference. We use the distance between actual pairs of current and future playlists as a measure of performance quality for users in the test set, where the ideal distance of zero would mean that real current and future playlist embeddings perfectly overlap. The distributions of distances across users for the full model and the model without side information show similar performance (Fig. 4; blue and red distributions, which largely overlap), with the lesioned model performing only 0.5% worse than the full model on average. This is an indication that side information is less informative in predicting future playlists than a user’s recent listening history. In contrast, removing recent playlist information reduces average performance by 346% (green). Reversing the current playlist before embedding also leads to a decrease in performance of 18.5% on average (not shown), which indicates the importance of track ordering in deriving embeddings. However, all models perform better than a weak baseline, where current and future playlists are paired randomly (gray; 645% decrease in performance compared to the full model, on average). Thus, even the scenario where side information is used alone shows some predictive power.
We also examine recommendation quality by measuring the frequency with which tracks in the recommended playlist actually occur in the future playlist.
Table 1. Model performance
Model                                            F1        Familiarity
Baseline - Random tracks                         0.00022   0.014%
Baseline - Popular tracks                        0.026     2.1%
Baseline - Current playlist as future playlist   0.092     100%
Word2vec - Closest tracks to average             0.020     2.8%
Quick lists - No current playlist                0.018     1.5%
Quick lists - No user information                0.076     7.5%
Quick lists - Full model                         0.075     7.5%
Quick lists - Reversed current playlist order    0.072     7.5%
We measure this overlap as the F1 score between the predicted future playlist and the actual future playlist for each user in the test set. We also measure the percentage of tracks in the predicted future playlist that are found in the current playlist as a measure of familiarity (low overlap means lower familiarity but higher novelty, which is generally preferable). Table 1 shows these metrics for a collection of baseline models and quick list models. The quick lists model performs relatively well compared to most baseline models, with only moderate repetition between the recommended future playlist and the current playlist. Reversing the current playlist order reduced predictive power slightly, but removing information about the current playlist drastically decreases accuracy. This lesioned model, however, does still have some predictive power above random playlists, and may still be useful for cold start users (see the Qualitative section below). In the context of the test set users, the lesioned model with no user information slightly outperforms the full model.
Among the baseline models, simply using the current playlist as the recommended future playlist performs surprisingly well, beating the best quick lists model F1 score. With a repetition rate of 100%, the recommended playlist is a poor user experience and thus not viable for most production purposes. However, it does demonstrate real users’ preferences for familiarity and repetition.
3.2 Qualitative
Recommendations can also be generated for manipulated or arbitrary inputs. For example, if we start with a user that most recently listened to songs in the classic rock genre, but inject a strong preference for country music via the user’s side information, we can change the user’s recommended future playlist from pure rock to something that crosses the boundaries between classic rock and country (Fig. 5(a)). Similarly, we can use a user’s side information to help solve the cold start problem. Recommended playlists can be generated for hypothetical new users where the model is only supplied with the user’s age range (Fig. 5(b)).
[Fig. 5 panels: a) current playlist; recommended future playlist; recommended future playlist with user bias for country. b) recommended playlist for age bin 1965-1970; recommended playlist for age bin 1985-1990. Each panel lists track titles and artist names.]
Fig. 5. Examples of manipulating a user’s side information to generate playlists. a) Recommended playlists with manipulation of side information for an actual user. Left: actual current playlist. Middle: recommended future playlist for this user. Right: recommended future playlist with an artificial preference for Country music injected via side information. b) Recommended playlists for a new user with no current playlist and only one active age bin feature.
Finally, we observe that track order is important in generating embeddings for use in recommending future playlists. We take two playlists that contain the same set of ten tracks; in one, tracks are ordered in a progression from the alternative tracks to the classic rock tracks, and in the other playlist they are reversed. Despite an identical set of tracks, each input produces a recommendation that emphasizes continuity with the tracks that were most recently played. The first two recommended tracks for the current playlist ending with alternative tracks are from the artists Flora Cash and Fall Out Boy, while they are from the artists Supertramp and Bonnie Tyler for the current playlist ending in classic rock (full playlists not shown for space).
4 Discussion
We present a novel method to embed playlists and use those embeddings to make recommendations for future playlists. This method builds upon recent advances in natural language representation for embedding sentences, and adds the ability to leverage side information for the playlist user. Though side information alone does not appear to provide very accurate recommendations compared to recent
listening history, we demonstrate that it may still be useful for the cold start problem and for playlist manipulation. Real listeners demonstrate repetitive behavior, listening to a handful of tracks many times. This pattern leads to the surprising result that simply using a user’s current playlist as a prediction for their best future playlist is a reasonably accurate approach. Prior work has indeed shown a reliable preference of real listeners for familiar music [13]. Unfortunately, for real-world music recommendation products this simple tactic is usually not a viable solution because the user experience is undesirable.
In the experiments described above we define a playlist as a collection of tracks listened to in sequence, regardless of their source. This liberal definition was chosen because in our use case we wish to make future playlist recommendations to users using their most recent history, regardless of how they listened to each track. However, this definition may also increase problem difficulty as compared to a scenario in which a user listened to user-generated or curated playlists. This is because these types of playlists are more likely to be narrow in scope and coherent in theme. Despite this added difficulty, we find that the model trained with the more liberal definition of a playlist still produces useful recommendations.
A logical next step in this work is to improve the decoder for recommendation. In this work, we rely on recommending playlists that other users have actually created through the use of nearest neighbor lookup. However, there is nothing barring one from training a separate decoder that takes playlist embeddings and produces new, never-before-created playlists. We have begun experimenting with the use of recurrent neural networks, specifically LSTMs, to create a generator for playlists that are potentially infinitely long; we see encouraging results thus far.
References
1. Cheng, H.T., Koc, L., Harmsen, J., Shaked, T., Chandra, T., Aradhye, H., Anderson, G., Corrado, G., Chai, W., Ispir, M., Anil, R., et al.: Wide & deep learning for recommender systems. In: Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pp. 7–10. ACM (2016)
2. Chollet, F., et al.: Keras: Deep learning library for Theano and TensorFlow. https://keras.io/k (2015)
3. Emil, K-S., Michael, S.: Music predictions using deep learning. Could LSTM networks be the new standard for collaborative filtering? (2016)
4. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
5. Logeswaran, L., Lee, H.: An efficient framework for learning sentence representations. arXiv preprint arXiv:1803.02893 (2018)
6. McFee, B., Lanckriet, G.R.G.: The natural language of playlists. ISMIR 11, 537–541 (2011)
7. Moore, J.L., Chen, S., Joachims, T., Turnbull, D.: Learning to embed songs and tags for playlist prediction. In: ISMIR, vol. 12, pp. 349–354 (2012)
8. Maciej, P.: A matrix factorization algorithm for music recommendation using implicit user feedback (2009)
9. Řehůřek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. Citeseer (2010)
10. Rendle, S., Freudenthaler, C., Gantner, Z., Schmidt-Thieme, L.: BPR: Bayesian personalized ranking from implicit feedback. In: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 452–461. AUAI Press (2009)
11. Vall, A., Quadrana, M., Schedl, M., Widmer, G.: Order, context and popularity bias in next-song recommendations. Int. J. Multimed. Inf. Retr. 8(2), 101–113 (2019)
12. Van den Oord, A., Dieleman, S., Schrauwen, B.: Deep content-based music recommendation. In: Advances in Neural Information Processing Systems, pp. 2643–2651 (2013)
13. Ward, M.K., Goodman, J.K., Irwin, J.R.: The same old song: The power of familiarity in music choice. Mark. Lett. 25(1), 1–11 (2014)
Data Security Management Implementation Measures for Intelligent Connected Vehicles (ICVs)
Haijun Wang(✉), Yanan Zhang, and Chao Ma
China Automotive Technology and Research Center Co., Ltd, Tianjin 300300, China
Abstract. The new wave of technological revolution and industrial transformation sweeping the globe has made Intelligent Connected Vehicles (ICVs) a necessary technological change for the transformation and upgrading of the automobile industry. Although the Internet of Vehicles is convenient for consumers, data security issues, including leakage of personal information and sensitive data, are increasingly prominent. Vehicle data security issues have raised great concern among the governments of various countries and regions. The United States, the European Union, and other countries and regions are thus actively introducing vehicle data management policies. Under such circumstances, China should accelerate its overall data security planning for intelligent connected vehicles and actively introduce policies to ensure their continued development.
Keywords: Intelligent connected vehicles · Data security · Automobile industry · Management
© Springer Nature Switzerland AG 2021. K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 594–600, 2021. https://doi.org/10.1007/978-3-030-63089-8_38
1 Foreword
Following rapid developments in new technology infrastructures, such as 5G, artificial intelligence, and the Internet of Things, recent technology trends in the automobile industry include intelligence, networking, vehicle electrification, and sharing. In recent years, the global automobile industry has witnessed intelligent connected vehicles (ICVs) become a strategic focus of innovation and development. Automobiles are gradually transforming from conventional transportation tools into smart mobile spaces and application terminals. Currently, core technological breakthroughs in ICVs have improved the basic ICV support system, and the related industrial ecosystem is maturing. However, vehicle interconnection data security issues are also increasingly prominent. In May 2018, Honda India leaked the data of more than 50,000 customers because of insecure AWS S3 buckets. Leaked data included names, phone numbers, passwords, gender, and email addresses of customers and their trusted contacts. Additionally, vehicle information including VIN and Connect ID was leaked. In July of the same year, Canadian automobile supplier Level One suffered a data breach that exposed 157 gigabytes of data, comprising nearly 47,000 files, because Level One failed to restrict user access rights and authentication by IP address on its backup server. The data breach affected more than 100 car makers, including Ford, Toyota, Volkswagen and Tesla. In March 2019, hackers breached Toyota Motor’s IT systems twice, resulting in the leakage of the personal information of approximately 3.1 million customers [1].
As connected vehicles develop, vehicles are becoming open systems and their functions are highly dependent on massive amounts of data. Because of the gradual increase in open connections, data interaction is more frequent between related devices and systems. Hackers can not only intercept communication information and attack the cloud server to steal user information and vehicle data, but can also endanger personal safety and social order. In addition, potential security threats across the life cycle of data collection, storage, processing, transmission, and sharing pose a challenge to the safe operation of vehicles and the protection of user information. Therefore, how to make full and effective use of data on the basis of safety assurance is a problem that needs to be solved urgently in order to realize the large-scale development and commercialization of intelligent connected vehicles.
2 ICVs Data Security Risks
The networking features of ICVs distinguish them from traditional cars in the amount of data obtained. Conventional vehicles generally obtain data through four major buses: Controller Area Network (CAN), Local Interconnect Network (LIN), FlexRay, and Media Oriented Systems Transport (MOST). The amount of data generated by the CAN bus is on the order of tens of kilobytes per second (kB/s). In contrast, ICVs can utilize various physical interfaces of the on-board computing platform and the IoV communication interface to obtain data. It is estimated that each ICV can generate up to 100 GB of data per second (GB/s) [2]. ICVs can generate massive amounts of data, including vehicle data, user data, map data, location data, real-time traffic data, business data, and third-party data.
The security attributes of ICV data can be categorized into confidentiality, availability, and integrity. Confidentiality means that user privacy data, test scenario data, location data, and other data are not disclosed to unauthorized users and entities. Availability means that authorized users and entities can access and use autopilot data and resources when needed. Integrity means that decision-making data, control data, dynamic traffic environment data, and other data are not subjected to unauthorized changes or destruction, ensuring accurate generation, storage, and transmission of vehicle data [3]. ICV data security is not limited to user privacy and safe vehicle operation; it is also significant for the safety management of the entire automobile industry [4]. Currently, there are security risks in data collection, data storage, data transmission, and data usage.
2.1 Data Collection
Automobile data collection devices include pre-mounted devices, mobile terminals, and retrofitted devices. There are two main data risks. The first is caused by incomplete security mechanisms and inadequate protective measures on the data collection device, which can lead to data leakage, hijacking, tampering, and forgery. The second is contaminated or lost data caused by the aging or damage of the data collection device.
2.2 Data Storage
Automotive data storage includes cloud storage, local storage on the vehicle, and mobile terminal storage. Cloud storage is used extensively because of the massive amount of data. However, over-reliance on cloud storage presents several risks. Data stored in the cloud is often mingled without classified isolation and graded protection, and data can be stored without control over the access mechanism. Moreover, fine-grained access levels matched to the sensitivity of the stored data have not been applied.
2.3 Data Transmission
The ICV industry not only requires a large amount of data transmission but also has high data reliability and data security requirements. The main risks include internal vehicle communication transmission risks, such as tampered or forged CAN messages, a blocked communication bus, unavailable or delayed data, and a lack of verification mechanisms during data transmission between the CAN bus and the Electronic Control Unit (ECU). ICVs are also vulnerable to off-vehicle communication transmission risks, such as eavesdropping or man-in-the-middle attacks during short-range Bluetooth or Wi-Fi communication or long-distance communication via 4G, 5G, or C-V2X. Leakage of geographic information and vehicle trajectory information can also occur during V2V broadcast communication.
2.4 Data Usage
Since vehicle data usage involves drivers, OEMs, and cloud service providers, it generates three main security risks. Firstly, the permitted use of relevant data is not defined, and there is a risk of unauthorized access to important sensitive data. Secondly, effective data management and control are lacking because data ownership, use rights, and other related rights and responsibilities are undefined, which can easily lead to data abuse. Lastly, there is a potential privacy leakage risk when user profiles are exposed during data analysis and mining.
3 International Data Security Management of Intelligent Connected Vehicles
As informatization and digitization become increasingly important in everyday life and production, the data-driven intelligence era is fast approaching. Hidden data security problems caused by vehicle intelligentization and networking have raised concerns among the governments of various countries. The United States (U.S.), the European Union, and other countries have actively promoted the deployment of data security policies for connected vehicles and accelerated the introduction of automotive data management requirements.
3.1 The U.S.
The current U.S. federal privacy protection laws are unsuitable for ICVs. In recent years, the U.S. has paid special attention to automotive data security management while strongly promoting the realization of autonomous driving. In 2017, the U.S. House of Representatives passed the SELF DRIVE Act. The Act requires ICV developers to develop a data privacy protection plan and prohibits manufacturers from selling vehicles without a privacy protection plan [5]. In 2019, U.S. legislators again proposed the Security and Privacy in Your Car Act; this time the emphasis was on transparent data operations, the right of users to terminate data collection and storage, and the requirement that manufacturers or operators use information collected by cars for advertising or marketing purposes only after obtaining express consent from users [6]. In both acts, car manufacturers are responsible for user privacy protection.
3.2 The European Union
European countries have always valued data protection. In 2014, the Alliance of Automobile Manufacturers and the Association of Global Automakers formulated seven privacy protection principles: transparency, selectivity, respect for scenarios, data minimization, data security, completeness, desirability, and accountability. The organizations clearly stipulated that automobile manufacturers can only share consumers’ personal data with third parties based on contracts, with consumer consent, or for compliance with legal requirements. The General Data Protection Regulation, one of the most stringent data protection acts to date, came into effect in May 2018. The provisions on personal data protection will be uniformly applied to ICVs, car manufacturers, parts manufacturers, Internet of Vehicles service providers, and other related enterprises. All these organizations need to follow personal data protection law requirements to effectively protect users’ data. In March 2019, the European Network Information Security Agency (ENISA) released Toward a Framework for Policy Development in Cybersecurity: Security and Privacy Considerations in Autonomous Agents, noting that data collection is key to autonomous driving. ENISA proposed establishing a comprehensive policy framework that incorporates security and privacy design principles, baseline security requirements, and the establishment of ethical codes [7]. In January 2020, the European Data Protection Board adopted Guidelines on Processing Personal Data in the Context of Connected Vehicles and Mobility Related Applications. Regarding personal data within the vehicle, personal data exchanged between vehicles and connected devices, and data collected by the vehicle and submitted to external entities for further processing, the guidelines indicate that upstream and downstream ICV enterprises may act as data controllers or data processors. These enterprises should follow general recommendations, such as data relevance, data minimization, data protection by design and by default, security, and confidentiality, when handling personal data within the vehicle.
4 Intelligent Connected Vehicles Data Security Management Measures in China
Although automotive data security problems have gradually emerged in recent years, China currently does not have comprehensive laws that regulate personal data and privacy protection. The country has yet to establish a dedicated national regulatory body to supervise and adjudicate whether the use of personal information complies with regulations. For the time being, the implementation of ICV data security management is based on multiple policies and regulations. The Network Security Law, enacted in 2016, was the first to establish a comprehensive closed-loop legal system for protecting personal information. The Network Security Law highlights important principles and requirements for personal data collection, use, and transmission, as well as data localization requirements for key information infrastructure operators. In May 2019, the Data Security Management Measures (Exposure Draft) focused on personal information and important data security, and systematically regulated network operators in terms of data collection, data processing, data usage, and data security supervision and management. In June 2019, the Outbound Security Evaluation Methods for Personal Information and Important Data (Exposure Draft) extended data localization requirements to all network operators. In December 2019, the Network Security Level Protection 2.0 System was officially implemented, which stipulated requirements for the protection and use of personal information and data. In general, the current regulations and published draft policy documents do not address ICV-specific privacy and data security issues.
In 2019, China released several standard documents, including GB/T 37973-2019 Big Data Security Management Guide, GB/T 37988-2019 Information Security Technology Data Security Capability Maturity Model, and GB/T 37964-2019 Personal Information Deidentification Guide, to instruct enterprises to use effective technology and management measures to ensure data security. These standards are universal and can help ICV-related enterprises carry out data security management activities to a certain extent. However, China still lacks dedicated standards and specifications for protecting ICV data. Standards covering the classification and grading of vehicle data, safety technical requirements, safety management requirements, and safety assessment requirements still need to be formulated. To sum up, the lack of coordination between higher-level laws and standard systems for data security management in China has made it impossible to provide effective guidance and support for enterprises [8]. Therefore, China urgently needs to strengthen ICV data security management. Accordingly, the following countermeasures are proposed, covering standards and regulations, technological breakthroughs, and certification assessment.
4.1 Strengthen Top-Level Design to Introduce a Specification Guide
A data security legal system for intelligent connected vehicles should be established on the basis of the Network Security Law, the Data Security Law, and the Personal Information Protection Law, so that data generated by ICVs is incorporated into these higher-level laws and regulations. In light of the Network Security Law, the rights and liabilities of the government, enterprises, and users in ICV data security management have to be defined under the current legal framework, and the data coverage, classification and grading, protection measures, and key technologies of intelligent connected vehicles have to be specified, in order to form a guideline policy document that guides enterprises in related work. Lastly, the ICV data security standard system should be established and improved to provide a basis for meeting the industry's data security protection needs and for effectively implementing data security management requirements.
4.2 Establish a Management Mechanism to Strengthen Supervision and Management
The responsibilities of the various ministries and commissions at the national level need to be clarified, the regulatory boundaries defined, and the establishment of a safety management mechanism that covers the entire life cycle of intelligent vehicle data explored. Efforts should also be made to study and form special ICV data security management methods. OEMs, parts suppliers, cloud service providers, users, and other relevant principals should be urged to effectively implement their data protection responsibilities through enterprise self-inspection and government spot checks.
4.3 Breakthroughs in Key Technologies to Build a Protection System
To ensure that data usage complies with privacy protection laws, in-depth research needs to be conducted on the full data life cycle: collection, transmission, storage, processing, exchange, and destruction. Research should also be conducted on data cleaning and comparison, data leak prevention, anonymization, de-identification, data desensitization, security audit, secure multi-party computation, transparent encryption, data tracing, recovery and destruction technologies, and other key technologies. In addition, intrusion detection and protection technologies should be dynamically integrated, China-made cryptographic algorithms should be applied in ICVs, and a vehicle data security protection system should be built to effectively improve data security protection capabilities.
5 Conclusion
This paper analyzes the security risks of ICVs in data collection, data storage, data transmission, data use, and other aspects on the basis of the security attributes of ICV data. Drawing on the automotive data security policies deployed by the United States, the European Union, and other countries and regions, and taking into account the current status of data security management in China, it puts forward implementation suggestions for China: establishing and optimizing the higher-level law and standard systems for data security management, clarifying the responsibilities and boundaries of the different ministries and commissions, establishing a safety management mechanism that covers the entire life cycle of intelligent vehicle data, and achieving breakthroughs in key technologies such as anonymization and de-identification.
References
1. Ma, Y.: Industrial data leakage of automobile manufacturers cannot be ignored. China Inf. Wkly 20 (2019)
2. Zhang, Y., Liu, Y.: A study on the development of intelligent connected vehicles in the era of big data. Jiangsu Sci. Technol. Inf. 24, 7–9
3. IBM Business Value Research Institute: Accelerating Vehicle Information Security: Winning Vehicle Integrity and Data Privacy Competition. IBM Business Value Research Institute, Beijing (2017)
4. Zhao Shijia, X., Ke, X.X., et al.: Implementation countermeasures for information security management of intelligent connected vehicles. Strateg. Study Chin. Acad. Eng. 021(003), 108–113 (2019)
5. Anran, T.: Overseas references for the administrative regulations of autonomous vehicles. J. Shanxi Police Acad. 27(02), 45–50 (2019)
6. Liu, K.: The laws and regulations and supervision system for personal data privacy protection in the United States. Global Sci. Technol. Econ. Outlook 4 (2019)
7. Jiaying, H.: Comment on 'establishing the development framework of network security policy-security and privacy of independent agents'. Inf. Secur. Commun. Confidentiality 05, 72–79 (2019)
8. Zhang, Y.: A research on legal issues of personal data security supervision in the era of big data. Sichuan Acad. Soc. Sci. (2017)
Modeling Dependence Between Air Transportation and Economic Development of Countries
Askar Boranbayev1, Seilkhan Boranbayev2, Tolendi Muratov2, and Askar Nurbekov2
1 Nazarbayev University, Nur-Sultan, Kazakhstan [email protected]
2 L. N. Gumilyov Eurasian National University, Nur-Sultan, Kazakhstan [email protected], [email protected], [email protected]
Abstract. This article is devoted to modeling the relationship between air transportation and the economic development of countries. An analysis is made of various countries of the world, including some countries that were part of the former USSR. For these countries, an analysis of gross domestic product and aviation passenger flow was made. A connection was established between these indicators and recommendations were made.
Keywords: Modeling · Air transportation · Economic development · Aviation · Passenger flow · Gross domestic product · Information system · Automation · Reliability · Safety
1 Introduction
Civil aviation is an important mode of transport for the successful functioning of the global economy and the maintenance of its sustainable economic growth and social development. The rapid movement of people and goods between different countries contributes to the development of world trade and the international tourism industry. Aviation demonstrates an impressive level of macroeconomic performance in providing services to the community and regions. The development of infrastructure creates initial employment, and the subsequent operation of airports and airlines creates new networks of suppliers and tourist flows, and provides local producers with access to remote markets. Such nascent trade and tourism economies then continue to expand, providing wider and more sustainable regional growth. Since the mid-1970s, air traffic has consistently defied recession cycles, doubling every 15 years. Air transport does not succumb to such recessions precisely because it is one of the most effective tools to combat them, an important factor in difficult economic conditions [1]. The acceleration of economic development around the world in modern conditions is largely due to air transport. The economic benefits of air transport are manifested in
© Springer Nature Switzerland AG 2021 K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 601–613, 2021. https://doi.org/10.1007/978-3-030-63089-8_39
increased links between different cities, allowing the flows of various goods, people, capital, technology and ideas to move freely. Therefore, the success of many types of business currently largely depends on the coordinated operation of the air transportation system [2]. The unprecedented breadth of passenger choice leads not only to an increase in services and tariffs in the aviation industry, but also to the dominance of large airlines acquiring effective management based on data voluntarily provided by customers. When making strategic management decisions, it is necessary to take into account highly intelligent dynamic systems (knowledge base management systems (KBMS) and database management systems (DBMS)) that allow real-time forecasting of a particular process, heuristic rules and techniques, as well as expert knowledge.
2 Air Transportation and Economic Development of the Republic of Kazakhstan
In recent years, the airline industry in the Republic of Kazakhstan has shown positive dynamics. Key growth indicators significantly exceed the corresponding values in neighboring countries. Table 1 shows the growth of the key indicators [3].
Table 1. Statistics on the aviation industry in Kazakhstan
Indicator | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 (realtime data)
Passengers transported (mln people) | 5.5 | 5.9 | 6 | 7.4 | 7.9 | 8.6
Freight, baggage, cargo luggage carried (thousand tons) | 19.6 | 17.0 | 18.1 | 22.4 | 29.1 | 26.4
Passengers served (mln people) | 10.7 | 12.1 | 12.2 | 14.3 | 15.0 | 17
Air transit (million km) | 179.8 | 171.3 | 169 | 175.2 | 186.8 | 194
The integration of Kazakhstan airlines into the global airline industry is actively ongoing. In addition, since November 2019, the "open skies" regime has been introduced at airports in 11 cities of Kazakhstan. It provides for the removal of restrictions on the number of flights and grants foreign airlines the fifth "freedom of the air" on routes that Kazakhstan carriers do not operate. Flights under the fifth freedom are flights of foreign airlines through the cities of Kazakhstan to the cities of third countries, which will increase the transit potential of Kazakhstan and transit traffic through domestic airports. The introduction of this regime in Kazakhstan will help attract new foreign carriers, open new international routes, increase competition, and ultimately reduce airfares and make air transport more accessible to the general population, as well as support the development of tourism and improve the transport accessibility of the International Financial Center "Astana" [4].
A separate area of development and integration is the application of the best modern techniques and research in the civil aviation industry. The main events in the history of the implementation of automation of planning processes are shown in Table 2 [5].
Table 2. Automation of planning processes in aviation
No. | Process automation | Airline, process automation implementation year
1 | Using mathematical programming methods to determine optimal flight routes | Dantzig, 1956 [6]
2 | The first interactive scheduling system | British European Airways division (BEA), 1969
3 | Upper limit sales control | 1970
4 | The solution to the problem of placing the aircraft fleet | Trans World Airlines, 1976
5 | Unified interactive computer program (PAM, the profit analysis model) | British European Airways division (BEA), 1977
6 | Fleet planning in order to maximize profits on a given route network | Air France, 1984
7 | First Route Revenue Management System | American Airlines, 1985
8 | Optimization of flight chain planning | American Airlines, 1992
9 | First Computerized Booking System (Amadeus) | Air France, 1992
10 | First Route Revenue Management System | United Airlines, 1994
11 | Crew scheduling using the branch and grade method | United Airlines, 1995
12 | Implementation of the first control system for tariffs without restrictions | British Midland Airways, 2002
13 | Solving the problem of arranging the aircraft fleet taking into account demand for routes | United Airlines, American Airlines, 2004
14 | Transition to scheduling | Lufthansa, 2006
15 | Using customer behavior forecast | GOL Airlines, 2007
Methods of mathematical modeling and operations research are successfully used in planning tasks for air transportation. Traditionally, there are four main areas: planning of the aircraft flight schedule, revenue management, planning of ground handling operations, and process control on the day of flights [7]. The methods of mathematical modeling used in the aviation industry are based on extensive scientific research in control theory, statistics, operations research, game theory, optimization, and other related fields. Leading Western airlines not only use the results of scientific research, but also actively participate in the development of the relevant disciplines. Moreover, some areas of knowledge, for example revenue management, owe their origin to the airline industry. Having first appeared as a tool to satisfy the practical needs of the airline business, this applied research then developed into an independent discipline and is now applied in various fields.
Familiarity with, study of, and active use of decision-making techniques is a prerequisite for the competitiveness of any airline [5]. The application of air transportation optimization is directly related to the choice of mathematical modeling methods, and implementing the methods themselves requires modern computer technology and software. Small airlines, as well as airlines flying on subsidized routes, can be excluded when solving optimization problems. The software used today in Kazakhstan airlines can be divided into two groups: 1) software developed by Russian manufacturers (providers); 2) software developed by foreign providers.
Software Developed by Russian Manufacturers (Providers). Consider the software of the first group. Among the companies involved in the development of appropriate software, the following can be mentioned: Aviabit (St. Petersburg), ATIS (Moscow), IATVT (Moscow), Mirage (St. Petersburg), RIVC-Pulkovo (St. Petersburg), TAIS (Moscow), and Sirena-Travel (Moscow). For an airline, using software of the first group is more expensive than a website or software developed in-house. The products in this group are much more mature, and comprehensive solutions are available. However, the application of mathematical methods in these systems is not widespread. The situation is even worse for tasks in the field of airline revenue management. This is primarily due to the complexity of the tasks themselves, which require widespread use of mathematical models. The lack of revenue management solutions is also due to the need for close integration of such systems with the inventory systems in which airlines store their seat resources. This integration is difficult because of the different "weight categories" of domestic providers and the foreign developers that hold the inventory systems (SITA, Sabre, Amadeus, etc.).
In addition, foreign developers quite logically lobby their own interests in order to promote their solutions. As an example, take the Kazakh airline SCAT, which uses a Russian software product, namely the Websky e-commerce system developed by Sirena-Travel.
Software Developed by Foreign Providers. Consider the software of the second group. The foreign providers are Amadeus, Lufthansa Systems, SITA, and Sabre. Most of these providers offer truly complete integrated product lines for airlines (either their own products or products integrated with those of their partners), in which optimization tasks and mathematical modeling methods are widely used. But a number of constraining factors currently hinder the widespread penetration of such software into the Kazakhstani market. Firstly, such solutions are expensive, and very few Kazakh airlines can afford such a "luxury". Secondly, all these providers offer off-the-shelf technology as part of their products. Within certain limits it can be tuned to a particular airline, but it nevertheless remains an established standard technology [5]. It is often difficult for a Kazakh airline to overhaul its existing working practices, which is why the acquired system cannot function to the maximum of its capabilities. In addition, these systems may simply not fulfill a number of specific and mandatory requirements for a Kazakhstan airline.
As an example, we cite the Kazakh airline Air Astana, which has been using Amadeus software products for a long time.
3 Air Transportation and Economic Development of Some Countries of the Former USSR
Demand for air transportation is subject to frequent fluctuations and is characterized by procyclicality, seasonality, and instability. The positive factors affecting air traffic, in addition to the direct price of airline tickets, include the gross domestic product, population growth, the level of political stability, the average amount of money people allocate for leisure, and the availability of the air transportation market [8]. Government programs to subsidize routes can be added to this list. For example, in order to meet the needs of the population for domestic flights, Kazakhstan has subsidized air routes since 2002. The main purpose of the subsidies is to provide air traffic between the regional centers of the Republic of Kazakhstan and to develop tourist destinations within Kazakhstan [9]. The negative factors affecting air traffic include military operations, natural disasters, and epidemics (pandemics). Business travel is particularly sensitive to fluctuations in these factors, which have a very large impact on airline revenues. In fact, all airline revenue depends on the occupancy of seats in the cabin, which makes air transportation a risky business, extremely vulnerable to external crises and negative phenomena. To measure the growth of air traffic in civil aviation, the RPK indicator (revenue passenger-kilometers) is used. This indicator characterizes the number of kilometers traveled by paying passengers. The algorithm for calculating this indicator is simple: multiply the number of passengers who paid for a ticket by the distance they traveled. For example, a plane with 150 paying passengers on board, flying over a distance of 500 km, will generate 75 thousand RPK.
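The RPK calculation described above can be sketched in a few lines of Python. This is only an illustration of the multiplication rule stated in the text; the function names are hypothetical, and summing over several flights is an assumed (though standard) extension of the single-flight example.

```python
def revenue_passenger_km(paying_passengers, distance_km):
    """RPK of one flight: paying passengers multiplied by distance flown."""
    return paying_passengers * distance_km


def total_rpk(flights):
    """Sum RPK over an iterable of (paying_passengers, distance_km) pairs."""
    return sum(revenue_passenger_km(p, d) for p, d in flights)


# The example from the text: 150 paying passengers flown over 500 km.
single_flight = revenue_passenger_km(150, 500)
print(single_flight)  # 75000
```

Non-paying passengers (for example, crew travelling on duty) would be left out of `paying_passengers`, which is what distinguishes RPK from a plain passenger-kilometer count.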
Interestingly, a number of scientific papers still debate the causal relationship between the GDP and RPK indicators. Some authors believe that GDP is the main driver of demand for air travel. Other authors believe that a better air transportation infrastructure leads to an increase in passengers and, consequently, affects GDP growth. A third group holds both positions, in other words, recognizes a two-way causal relationship between GDP and RPK. However, the GDP indicator should still be considered the primary one, because it reflects the market value of all final goods and services produced per year in all sectors of the economy, and not just in the civil aviation sector [8]. Therefore, good aviation infrastructure can only partially affect GDP growth. According to the forecast of the Japan Aircraft Development Corporation for the medium and long term (until 2038), global GDP will grow by an average of 2.8%, while RPK will grow by an average of 4.4% per year [10]. We examine the strength of the relationship between GDP and RPK. To do this, we use the data on GDP and RPK growth from 2015 to 2018, taken from the annual forecasts of Boeing Current Market Outlook [11]. The percentage change in the values of the GDP and RPK indicators is given in Table 3.
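As a rough plausibility check, the forecast growth rates quoted above can be related to the "doubling every 15 years" observation from the introduction. The sketch below assumes constant compound annual growth, which is a simplification and not something the cited forecast states; the function names are illustrative.

```python
import math


def doubling_time_years(annual_growth):
    """Years needed to double at a constant compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


def implied_annual_growth(doubling_years):
    """Constant annual growth rate that doubles a quantity in the given time."""
    return 2 ** (1 / doubling_years) - 1


rpk_doubling = doubling_time_years(0.044)  # RPK at 4.4% per year
gdp_doubling = doubling_time_years(0.028)  # GDP at 2.8% per year
implied = implied_annual_growth(15)        # rate behind "doubling every 15 years"

print(f"RPK doubles in about {rpk_doubling:.1f} years")
print(f"GDP doubles in about {gdp_doubling:.1f} years")
print(f"Doubling every 15 years implies about {implied:.1%} growth per year")
```

Under this assumption, RPK growth of 4.4% per year corresponds to a doubling time of roughly 16 years, close to the historical 15-year figure, whereas GDP at 2.8% per year would take about 25 years to double.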
Table 3. Percentage change in the values of GDP and RPK
Year | Indicator | Asia | North America | Europe | Middle East | Latin America | CIS countries | Africa | Worldwide
2014 | GDP (%) | 4.3 | 2.5 | 1.8 | 3.8 | 3.4 | 2.4 | 4.5 | 3.1
2014 | RPK (%) | 6.1 | 3.1 | 3.8 | 6.2 | 6 | 3.7 | 5.7 | 4.9
2015 | GDP (%) | 4.1 | 2.3 | 1.8 | 3.8 | 2.9 | 2.5 | 3.7 | 2.9
2015 | RPK (%) | 6 | 3.1 | 3.7 | 5.9 | 5.8 | 3.7 | 6.1 | 4.8
2016 | GDP (%) | 3.9 | 2.1 | 1.7 | 3.5 | 3 | 2 | 3.5 | 4.7
2016 | RPK (%) | 5.7 | 3 | 3.7 | 5.6 | 6.1 | 4.3 | 5.9 | 4.7
2017 | GDP (%) | 3.9 | 2 | 1.7 | 3.5 | 3 | 2 | 3.3 | 2.8
2017 | RPK (%) | 5.7 | 3.1 | 3.8 | 5.2 | 5.9 | 3.9 | 6 | 4.7
2018 | GDP (%) | 3.9 | 1.9 | 1.6 | 3.2 | 2.9 | 2 | 3.4 | 2.7
2018 | RPK (%) | 5.5 | 3.2 | 3.6 | 5.1 | 5.9 | 3.3 | 5.9 | 4.6
We estimate the strength of the connection between GDP and RPK using the Pearson correlation coefficient formula:

$$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sum_{j=1}^{n} (Y_j - \bar{Y})^2}} \tag{1}$$

where $\bar{X}$, $\bar{Y}$ are the sample mean values.
The calculation of the correlation coefficient between GDP and RPK growth indicators shows that in 2015 the coefficient is r ≈ 0.875, in 2016 r ≈ 0.849, in 2017 r ≈ 0.864, and in 2018 r ≈ 0.875. To interpret the data obtained, we use the Chaddock scale (see Table 4).

Table 4. Assessment of the value of the indicator of the strength of the connection between two variables
Indicator value | Interpretation of the indicator
0–0.3 | Very weak
0.3–0.5 | Weak
0.5–0.7 | Average
0.7–0.9 | High
0.9–1 | Very high
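The calculation can be illustrated with a minimal Python sketch of formula (1) applied to the 2015 rows of Table 3. It assumes that the sample consists of the seven regional columns without the Worldwide aggregate; the paper does not state this explicitly, but under that assumption the reported value is reproduced. The function names, and the choice to treat the upper bound of each interval in Table 4 as inclusive, are illustrative assumptions.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient, following formula (1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sqrt(
        sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys)
    )
    return numerator / denominator


def chaddock_label(r):
    """Verbal strength of |r| per Table 4 (upper bounds inclusive: an assumption)."""
    strength = min(abs(r), 1.0)
    scale = [(0.3, "Very weak"), (0.5, "Weak"), (0.7, "Average"),
             (0.9, "High"), (1.0, "Very high")]
    for upper, label in scale:
        if strength <= upper:
            return label


# 2015 growth rates (%) from Table 3, in the order: Asia, North America,
# Europe, Middle East, Latin America, CIS countries, Africa.
gdp_2015 = [4.1, 2.3, 1.8, 3.8, 2.9, 2.5, 3.7]
rpk_2015 = [6.0, 3.1, 3.7, 5.9, 5.8, 3.7, 6.1]

r_2015 = pearson_r(gdp_2015, rpk_2015)
print(round(r_2015, 3), chaddock_label(r_2015))  # 0.875 High
```

With only seven observations per year, the coefficient describes the cross-regional pattern in a single year rather than a time trend.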
In our case, in accordance with Table 4, the correlation coefficients lie in the range 0.7–0.9, which indicates a strong relationship between the growth rates of GDP and RPK on a global scale.
We also examine the strength of the relationship between the GDP and RPK indicators of the CIS countries, in particular Kazakhstan, Russia, Belarus, and Uzbekistan. The percentage change in the values of the GDP and RPK indicators of the CIS countries is given in Table 5.

Table 5. Percentage change in the values of GDP and RPK indicators of the CIS countries
Year | Indicator | Kazakhstan | Russia | Belarus | Uzbekistan
2014 | GDP (%) | 4.2 | 0.7 | 1.7 | 7.2
2014 | RPK (%) | 2.0 | 3.0 | 17.0 | 0.0
2015 | GDP (%) | 1.2 | −2.3 | −3.8 | 7.4
2015 | RPK (%) | 2.0 | 2.0 | 9.0 | −4.0
2016 | GDP (%) | 1.1 | 0.3 | −2.5 | 6.1
2016 | RPK (%) | 1.0 | −2.0 | 20.0 | −1.0
2017 | GDP (%) | 4.1 | 1.6 | 2.5 | 4.5
2017 | RPK (%) | 12.0 | 16.0 | 21.0 | 11.0
2018 | GDP (%) | 4.1 | 2.3 | 3.0 | 5.1
2018 | RPK (%) | 30.0 | 12.0 | 12.0 | 18.0
The correlation coefficients between the GDP and RPK indicators of the CIS countries are calculated according to formula (1). The data obtained are shown in Table 6. As can be seen, they differ from the values of the correlation coefficients on the global scale.

Table 6. Correlation coefficients between GDP and RPK indicators of the CIS countries
Year | Correlation coefficient r between the indicators of GDP and RPK of the CIS countries
2014 | −0.539
2015 | −0.902
2016 | −0.7
2017 | −0.730
2018 | 0.555
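The per-year coefficients of Table 6 can be recomputed from the Table 5 data with a short Python sketch. It assumes that each year's sample is the four countries in the order Kazakhstan, Russia, Belarus, Uzbekistan; the variable names are illustrative. The results agree with Table 6 to within about 0.01.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient, following formula (1)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Growth rates (%) from Table 5: Kazakhstan, Russia, Belarus, Uzbekistan.
gdp = {
    2014: [4.2, 0.7, 1.7, 7.2],
    2015: [1.2, -2.3, -3.8, 7.4],
    2016: [1.1, 0.3, -2.5, 6.1],
    2017: [4.1, 1.6, 2.5, 4.5],
    2018: [4.1, 2.3, 3.0, 5.1],
}
rpk = {
    2014: [2.0, 3.0, 17.0, 0.0],
    2015: [2.0, 2.0, 9.0, -4.0],
    2016: [1.0, -2.0, 20.0, -1.0],
    2017: [12.0, 16.0, 21.0, 11.0],
    2018: [30.0, 12.0, 12.0, 18.0],
}

r_by_year = {year: pearson_r(gdp[year], rpk[year]) for year in gdp}
for year, r in r_by_year.items():
    print(year, round(r, 3))
```

Because each coefficient rests on only four observations, a single country (for example, Belarus with its large RPK growth in years of falling GDP) can change the sign of the result, so these values are best read as descriptive rather than statistically robust.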
The value of the Pearson correlation coefficient r varies in the range from −1 to +1, that is, −1 ≤ r ≤ 1. The sign of r indicates whether one variable increases as the other increases (positive r) or decreases as the other increases (negative r). If the correlation coefficient is 1, the two variables are perfectly linearly related. To interpret the data, we use the Chaddock scale (Table 4). As can be seen from Table 6, the correlation coefficient between the GDP and RPK indicators of the CIS countries is negative from 2014 to 2017. Only in 2018 does the coefficient fall in the range 0.5–0.7, which indicates an average relationship between GDP and RPK growth indicators in that year. From 2014 to 2016, some CIS countries showed negative dynamics of air transportation due to macroeconomic factors (devaluation of national currencies) and political factors. Statistical data on RPK and GDP of the CIS countries from 2013 to 2016 are shown in Table 7.

Table 7. Statistics from 2013 to 2016 on GDP and RPK
Country | 2013 RPK (mln passenger-km) | 2013 GDP (billion $) | 2014 RPK (mln passenger-km) | 2014 GDP (billion $) | 2015 RPK (mln passenger-km) | 2015 GDP (billion $) | 2016 RPK (mln passenger-km) | 2016 GDP (billion $)
Kazakhstan | 9 352 | 236.6 | 9 583 | 221.4 | 9 692 | 184.4 | 9 791 | 137.3
Russia | 162 367 | 2297 | 176 360 | 2064 | 179 680 | 1368 | 176 622 | 1285
Belarus | 1 624 | 75.5 | 1 903 | 78.8 | 1 974 | 56.4 | 2 378 | 47.7
Uzbekistan | 6 906 | 57.7 | 6 758 | 63.1 | 6 464 | 66.9 | 6 401 | 67.1
From December 2013 to January 2016, the Kazakhstan tenge depreciated by 148%. The causes of the devaluation were falling oil prices, growing imports, and the reserves spent on maintaining the tenge exchange rate. From 2012 to 2017, the Russian ruble fell by 78%. The depreciation of the ruble was due to many factors, in particular the fall in oil prices; the economic sanctions introduced against Russia in connection with the events in Ukraine also had an effect. From the beginning of 2012 to 2017, the Belarusian ruble devalued by 131%, the reason being a decrease in Belarusian exports. From January 2014 to December 2015, the Uzbek soum fell by 15% against the dollar, and in September 2017 it devalued by 97%. In 2017 and 2018 the macroeconomic situation in these countries improved and GDP growth resumed, which immediately affected the growth of air transportation. Statistics for 2017 and 2018 on revenue passenger-kilometers (million) and GDP are shown in Table 8.

Table 8. Statistics 2017 to 2018 on GDP and RPK
Country | 2017 RPK (mln passenger-km) | 2017 GDP (billion $) | 2018 RPK (mln passenger-km) | 2018 GDP (billion $)
Kazakhstan | 10 979 | 159.4 | 16 329 | 179.3
Russia | 205 407 | 1 557.5 | 229 060 | 1 657.5
Belarus | 2 871 | 54.4 | 3 217 | 59.6
Uzbekistan | 7 113 | 48.7 | 8 408 | 50.5
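A currency cannot lose more than 100% of its value, so devaluation figures above 100%, such as those quoted for the tenge and the Belarusian ruble, are most plausibly increases in the exchange rate (local currency units per dollar). That reading is an assumption, since the text does not say how the percentages were computed. Under it, the corresponding loss in the currency's dollar value can be sketched as follows; the function name is illustrative.

```python
def value_loss_pct(rate_rise_pct):
    """Loss of a currency's dollar value (%) when the exchange rate
    (local units per dollar) rises by rate_rise_pct percent."""
    return (1 - 1 / (1 + rate_rise_pct / 100)) * 100


# Percentages quoted in the text, read as exchange-rate increases (assumption).
quoted = {
    "Kazakhstan tenge": 148,
    "Russian ruble": 78,
    "Belarusian ruble": 131,
    "Uzbek soum (September 2017)": 97,
}

for currency, rise in quoted.items():
    print(f"{currency}: rate +{rise}% -> value -{value_loss_pct(rise):.1f}%")
```

For example, a 148% rise in the tenge-per-dollar rate corresponds to the tenge losing roughly 60% of its dollar value, which helps explain the fall in dollar-denominated GDP in Table 7 even in years when real growth was positive.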
Thus, in Kazakhstan, even a relatively small increase in GDP (4.1%) in 2017 was accompanied by a significant increase in air transportation (12%), which was also facilitated by the Universiade-2017 and the international exhibition "Expo-2017" held in Astana.
4 Reliability and Software Security
The pace of development of information technologies and modern trends (microtrends, macrotrends) allows us to say that the level of reliability and security of information systems now determines the efficiency of business process management. Without ensuring the reliability and security of automated information systems, an uninterrupted, error-free, and trouble-free control system cannot be achieved. Historically, business processes in the aviation industry have required an effective approach to working with data. Reliability and completeness of the data provided several advantages, but it was extremely important to automate business processes. Until 1980, airlines used different mathematical models for planning and managing the fleet and scheduling flights, but the effectiveness of the airline was determined by the profitability of the flight or the optimal route. The concept of "revenue management" arose as a result of airline deregulation in 1979. In the 1960s, American Airlines developed the first online booking system, called Sabre (Semi-Automated Business Research Environment). The Sabre system provided a platform for designated American Airlines employees to track the level of actual bookings in various fare categories, compare them with the forecast, and then adjust the range of seats at different prices accordingly. The reliability and effectiveness of the semi-automated Sabre system depended on the decisions of the designated employees. By 1988, American Airlines had introduced the DINAMO module (Dynamic Inventory and Maintenance Optimizer), which combines overbooking, discount allocation, and traffic management [12]. The aviation industry is a collection of specific business processes. In order to manage them, reliable control systems are needed, including those operating in an automatic or automated mode.
Such an opportunity appeared with the rapid development of computer technology and software [13]. For example, the requirements for the reliability and safety of the software of the Russian airline Aeroflot are determined primarily by the requirements of business units, relevant legislation, and the SkyTeam airline alliance, as well as the recommendations of the auditors. The airline carried out a security audit of various information systems (IS), during which the following activities were carried out:
– information survey of the IS;
– examination of the technical solutions proposed for implementation;
– development of a package of regulatory documents on information security and reliability of the IS.
Taking into account the results of the audit of information security and reliability at Aeroflot, it was decided to switch to a centralized model of the information and communication structure, combine information resources and application services, and develop documents governing the reliability and security of information systems [14]. Information security of Air Astana is recognized as complying with the ISO/IEC 27001:2013 management system standard. The certificate of compliance was awarded to the company in April 2019 [15].
Information security and reliability issues in Kazakhstan are dealt with in several universities and organizations. In particular, these issues are dealt with at the L.N. Gumilyov Eurasian National University and Nazarbayev University, where technologies, software and methods have been developed to increase the level of safety and reliability [16–46].
5 Conclusion
Demand for air transportation depends on many factors, the main one of which is considered to be the dynamics of GDP in each region. But there are countries, in particular the CIS countries, where the analysis showed that this is not always the case. In the study of the impact of GDP growth in the CIS countries (Kazakhstan, Russia, Belarus, Uzbekistan) on the growth in revenue passenger-kilometers, it was found that the expected positive correlation between these two indicators was absent in certain years. The absence of such a correlation between GDP and RPK is explained by macroeconomic, political, and environmental factors, epidemics (pandemics), and so on. In the development of the air transportation market, it is important to forecast routes (networks, lines), the branching and condition of the airfield network, and the need for aircraft by a certain date. But it is also necessary to take into account negative factors such as hostilities, epidemics (pandemics), and natural disasters. To develop such a forecast, a dynamic information system is needed that makes it possible to assess the current situation in real time and make the best decision.
References 1. Global Air Navigation Plan 2016–2030 Doc 9750-AN/963. Fifth edition. https://www.icao. int/publications/Documents/9750_cons_ru.pdf 2. Brutyan, M.M.: Environmental tax and its role in the innovative development of civil aviation. Econ. Anal. Theory Pract. 10(265), 22–26 (2012) 3. http://www.gov.kz/memleket/entities/aviation/documents/details/14031?lang=ru 4. http://www.gov.kz/memleket/entities/miid/press/news/details/v-aeroportah-11-gorodovkazahstana-vvoditsya-rezhim-otkrytogo-neba?lang=ru 5. Vinogradov, L.V., Fridman, G.M., Shebalov, S.M.: Mathematical modeling in the optimization of air transportation planning: development prospects and the effect of use. Sci. Bull. MSTU GA Appl. Math. Ser. Inf. 132 (2008) 6. Ferguson, A.R., Dantzig, G.B.: The allocation of aircraft to routes-an example of linear programming under uncertain demand. Manage. Sci. 3(1), 45–73 (1956) 7. Vinogradov, L.V., Fridman, G.M., Shebalov, S.M.: Mathematical modeling in the optimization of air transportation planning: formulations and methods for solving typical problems. Sci. Bull. MSTU GA Appl. Math. Ser. Inf. 132 (2008) 8. Brutyan, M.M.: The world civil aviation market: current state and development forecast. Bull. Eur. Sci. 1 (2019). https://esj.today/PDF/20ECVN119.pdf 9. Rules for holding a tender for subsidized air routes, approved by the Government of the Republic of Kazakhstan dated 31 January 2013 No. 69
Modeling Dependence between Air Transportation
611
10. Japan Aircraft Development Corporation. Worldwide Market Forecast 2019–2038, March 2019. http://www.jadc.jp/en/data/forecast/ 11. Commercial Market Outlook 2019–2038, Boeing Commercial Airplanes 12. Donovan, A.W.: Yield management in the airline industry. J. Aviat. Aerosp. Educ. Res. 14 (3) (2005). Art. 9 13. Kalashnikova, K.A., Orlova, D.R.: Automation of planning and management of the airline. In: Proceedings of the International Symposium “Reliability and Quality”, vol. 2, pp. 159– 161 (2018) 14. http://lib.itsec.ru/articles2/control/bezopasnost_informacionnyh_sistem 15. https://rusregister.ru/news/ejr-astana-proshla-audit-po-informatsionnoj-bezopasnosti-iso-iec27001-2013/ 16. Boranbayev, A., Boranbayev, S., Nurusheva, A., Yersakhanov, K.: Development of a software system to ensure the reliability and fault tolerance in information systems. J. Eng. Appl. Sci. 13(23), 10080–10085 (2018) 17. Boranbayev, A., Boranbayev, S., Nurusheva, A., Yersakhanov, K.: The modern state and the further development prospects of information security in the Republic of Kazakhstan. In: Advances in Intelligent Systems and Computing, vol. 738, pp. 33–38 (2018) 18. Boranbayev, S., Goranin, N., Nurusheva, A.: The methods and technologies of reliability and security of information systems and information and communication infrastructures. J. Theor. Appl. Inf. Technol. 96(18), 6172–6188 (2018) 19. Akhmetova, Z., Zhuzbaev, S., Boranbayev, S.: The method and software for the solution of dynamic waves propagation problem in elastic medium. Acta Phys. Polonica A 130(1), 352– 354 (2016) 20. Boranbayev, A., Boranbayev, S., Nurusheva, A.: Development of a software system to ensure the reliability and fault tolerance in information systems based on expert estimates. In: Advances in Intelligent Systems and Computing, vol. 869, pp. 924–935 (2018) 21. 
Boranbayev, A., Boranbayev, S., Yersakhanov, K., Nurusheva, A., Taberkhan, R.: Methods of ensuring the reliability and fault tolerance of information systems. In: Advances in Intelligent Systems and Computing, vol. 738, pp. 729–730 (2018) 22. Hritonenko, N., Yatsenko, Y., Boranbayev, S.: Environmentally sustainable industrial modernization and resource consumption: is the Hotelling’s rule too steep? Appl. Math. Modell. 39(15), 4365–4377 (2015) 23. Boranbayev, S., Altayev, S., Boranbayev, A.: Applying the method of diverse redundancy in cloud based systems for increasing reliability. In: The 12th International Conference on Information Technology: New Generations (ITNG 2015). 13–15 April 2015, Las Vegas, Nevada, USA, pp. 796–799 (2015) 24. Turskis, Z., Goranin, N., Nurusheva, A., Boranbayev, S.: A fuzzy WASPAS-based approach to determine critical information infrastructures of EU sustainable development. Sustainability (Switzerland) 11(2), 424 (2019) 25. Turskis, Z., Goranin, N., Nurusheva, A., Boranbayev, S.: Information security risk assessment in critical infrastructure: a hybrid MCDM approach. Inf. (Netherlands) 30(1), 187–211 (2019) 26. Boranbayev, A.S., Boranbayev, S.N., Nurusheva, A.M., Yersakhanov, K.B., Seitkulov, Y. N.: Development of web application for detection and mitigation of risks of information and automated systems. Eur. J. Math. Comput. Appl. 7(1), 4–22 (2019) 27. Boranbayev, A.S., Boranbayev, S.N., Nurusheva, A.M., Seitkulov, Y.N., Sissenov, N.M.: A method to determine the level of the information system fault-tolerance. Eur. J. Math. Comput. Appl. 7(3), 13–32 (2019)
612
A. Boranbayev et al.
28. Boranbayev, A., Boranbayev, S., Nurbekov, A., Taberkhan, R.: The development of a software system for solving the problem of data classification and data processing. In: 16th International Conference on Information Technology - New Generations (ITNG 2019), vol. 800, pp. 621–623 (2019) 29. Boranbayev, A., Boranbayev, S., Nurusheva, A., Yersakhanov, K., Seitkulov, Y.: A software system for risk management of information systems. In: Proceedings of the 2018 IEEE 12th International Conference on Application of Information and Communication Technologies (AICT 2018), 17–19 October 2018, Almaty, Kazakhstan, pp. 284–289 (2018) 30. Boranbayev, S., Boranbayev, A., Altayev, S., Nurbekov, A.: Mathematical model for optimal designing of reliable information systems. In: Proceedings of the 2014 IEEE 8th International Conference on Application of Information and Communication Technologies (AICT 2014), Astana, Kazakhstan, 15–17 October 2014, pp. 123–127 (2014) 31. Boranbayev, S., Altayev, S., Boranbayev, A., Seitkulov, Y.: Application of diversity method for reliability of cloud computing. In: Proceedings of the 2014 IEEE 8th International Conference on Application of Information and Communication Technologies (AICT 2014), Astana, Kazakhstan, 15–17 October 2014, pp. 244–248 (2014) 32. Boranbayev, S.: Mathematical model for the development and performance of sustainable economic programs. Int. J. Ecol. Dev. 6(1), 15–20 (2007) 33. Boranbayev, A., Boranbayev, S., Nurusheva, A.: Analyzing methods of recognition, classification and development of a software system. In: Advances in Intelligent Systems and Computing, vol. 869, pp. 690–702 (2018) 34. Boranbayev, A.S., Boranbayev, S.N.: Development and optimization of information systems for health insurance billing. In: ITNG2010 - 7th International Conference on Information Technology: New Generations, pp. 1282–1284 (2010) 35. 
Akhmetova, Z., Zhuzbayev, S., Boranbayev, S., Sarsenov, B.: Development of the system with component for the numerical calculation and visualization of non-stationary waves propagation in solids. Front. Artif. Intell. Appl. 293, 353–359 (2016) 36. Boranbayev, S.N., Nurbekov, A.B.: Development of the methods and technologies for the information system designing and implementation. J. Theor. Appl. Inf. Technol. 82(2), 212– 220 (2015) 37. Boranbayev, A., Shuitenov, G., Boranbayev, S.: The method of data analysis from social networks using apache hadoop. In: Advances in Intelligent Systems and Computing, vol. 558, pp. 281–288 (2018) 38. Boranbayev, S., Nurkas, A., Tulebayev, Y., Tashtai, B.: Method of processing big data. In: Advances in Intelligent Systems and Computing, vol. 738, pp. 757–758 (2018) 39. Yatsenko, Y., Hritonenko, N., Boranbayev, S.: Non-equal-life asset replacement under evolving technology: a multi-cycle approach. Eng. Econ. (2020) 40. Boranbayev, A., Boranbayev, S., Nurbekov, A.: Estimation of the degree of reliability and safety of software systems. In: Advances in Intelligent Systems and Computing. AISC, vol. 1129, pp. 743–755 (2020) 41. Boranbayev, A., Boranbayev, S., Nurbekov, A.: Development of the technique for the identification, assessment and neutralization of risks in information systems. In: Advances in Intelligent Systems and Computing. AISC, vol. 1129, pp. 733–742 (2020) 42. Boranbayev, A., Boranbayev, S., Nurusheva, A., Seitkulov, Y., Nurbekov, A.: Multi criteria method for determining the failure resistance of information system components. In: Advances in Intelligent Systems and Computing, vol. 1070, pp. 324–337 (2020) 43. Boranbayev, A., Boranbayev, S., Nurbekov, A., Taberkhan, R.: The software system for solving the problem of recognition and classification. In: Advances in Intelligent Systems and Computing, vol. 997, pp. 1063–1074 (2019)
Modeling Dependence between Air Transportation
613
44. Boranbayev, S.N., Nurbekov, A.B.: Construction of an optimal mathematical model of functioning of the manufacturing industry of the republic of Kazakhstan. J. Theor. Appl. Inf. Technol. 80(1), 61–74 (2015) 45. Seitkulov, Y.N., Boranbayev, S.N., Davydau, H.V., Patapovich, A.V.: Speakers and auditors selection technique in assessing speech information security. J. Theor. Appl. Inf. Technol 97 (12), 3305–3316 (2019) 46. Akhmetova, Z., Boranbayev, S., Zhuzbayev, S.: The visual representation of numerical solution for a non-stationary deformation in a solid body. In: Advances in Intelligent Systems and Computing, vol. 448, pp. 473–482 (2016)
Sentiment Analysis to Support Marketing Decision Making Process: A Hybrid Model
Alaa Marshan (1), Georgia Kansouzidou (1), and Athina Ioannou (2)
(1) Brunel University London, London UB8 3PH, UK
(2) University of Surrey, Guildford GU2 7XH, UK
Abstract. Marketers aim to understand what influences people's decisions when purchasing products and services, which have been shown to be driven by natural instincts that lead humans to follow the behavior of others. This research therefore investigates the use of sentiment analysis techniques and proposes a hybrid approach that combines lexicon-based and machine learning-based methods to analyze customers' reviews from a major e-commerce platform. The lexicon approach was first applied at word level to explore the reviews and provide preliminary results about the most frequent words used in the reviews, in the form of word clouds. Then, the lexicon approach was applied at sentence level to obtain sentiment polarity results, which were used to train machine learning models. Next, the trained models were tested on unlabelled reviews (test data), showing that Naïve Bayes (NB) outperformed the other classifiers. The hybrid model described in this research can offer organizations a better understanding of customers' attitudes towards their products.
Keywords: Sentiment analysis · Marketing · Hybrid machine learning model
1 Introduction
The success of a business depends on whether consumers prefer its products or services to others offered in the market. Consequently, marketers try to understand what affects people's decisions when purchasing commodities, which have been shown to be driven by natural instincts that lead humans to follow the behavior of others [1]. Thus, it is considered critical for the marketing strategy of a business to collect data about customers that provide knowledge about their preferences and expectations as well as emerging trends, because this results in better marketing-related decision-making outcomes [2]. The word-of-mouth (WOM) phenomenon, in which customers share their shopping experiences and impressions about a certain product or service with others, has gained a lot of attention due to the spread of social media platforms that are used in this way, resulting in the electronic word-of-mouth (e-WOM) marketing model [3]. Similarly, e-commerce platforms collect an enormous number of product reviews. Such platforms generate huge amounts of customer opinions and reviews in the form of unstructured data (one form of Big Data), which prove to be very important for businesses in order to measure
© Springer Nature Switzerland AG 2021 K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 614–626, 2021. https://doi.org/10.1007/978-3-030-63089-8_40
consumers' satisfaction with their merchandise and understand certain aspects of their shopping tendencies [4]. In particular, users' opinions on products or brands that are available via online sources, mostly social media and e-commerce platforms, have created the need for new methods of analyzing text (e.g. online reviews) in order to understand what consumers think and how they feel about companies' merchandise, services and reputation. Sentiment analysis and opinion mining are among the automated techniques that can evaluate the polarity and emotions of customers' reviews [4]. There is a high volume of research focusing on sentiment analysis, experimenting with different tools and methods to perform text mining tasks as well as comparing and categorizing them. As of 2017, approximately 7000 research articles had been published on this subject, the majority of them after 2004 [5]. Sentiment analysis employs text mining and natural language processing (NLP) techniques for recognizing opinions in text, polarity classification, agreement detection and subjectivity-objectivity detection [6]. There are different lexicon-based approaches to sentiment analysis, such as lexical affinity [7] and keyword spotting [6]. The topic has attracted a considerable amount of attention, and many different approaches have been proposed to analyze written language. Nevertheless, the sentiment analysis field is growing significantly and its methods are still evolving, because every problem is unique and needs a different approach to address its specific characteristics. In addition, the fact that written language changes over the years, especially on social media and e-commerce platforms where various acronyms can be noticed, forces sentiment analysis models to adapt to these changes.
In this research, we aim to develop a hybrid model that combines lexicon-based and machine learning-based methods to perform sentiment analysis on product reviews. More specifically, a lexicon approach at word level is applied first to give preliminary results about the most frequent words used in the reviews. Then a lexicon approach at sentence level is used to provide the sentiment polarity results, which form the data set used to train the machine learning models. The paper is structured as follows. Section 2 reviews the works related to sentiment analysis approaches discussed in the literature. Section 3 clarifies the research methodology, namely CRISP-DM, that is followed to pre-process the data and to develop and validate the proposed hybrid sentiment analysis model. Moreover, this section explains the research setting and the data used to demonstrate the performance of the proposed hybrid approach. Next, Sect. 4 discusses the results and their potential impact on the marketing decision-making process. Section 5 concludes the work and notes future work that we will undertake in advancing the state of the art.
2 Sentiment Analysis in the Literature
Over the last decade, organizations have become more interested in collecting their customers' online reviews in order to understand customers' opinions about their products with the use of sentiment analysis algorithms [8]. Consequently, the exploration of people's sentiments expressed in the form of reviews has been covered
in many research articles. Sentiment analysis tasks vary and can be applied in numerous situations, such as analyzing users' opinions, which are increasingly available due to the rapid development of e-commerce platforms [9]. Sentiment analysis classification can be divided into three main levels: document level, sentence level and aspect level. There are two main approaches to performing sentiment analysis: lexicon-based and machine learning-based [10].
2.1 Sentiment Analysis Classification Levels
Sentiment analysis at document level focuses on extracting the overall sentiment of the document, assuming that each document contains an opinion about a single entity (e.g. a product or topic). Comparing entities, or different aspects of one entity, is therefore difficult at this classification level: because the focus is on the overall sentiment, different feelings towards various aspects of an entity cannot be obtained [11]. Furthermore, besides the challenge of mixed emotions found in the same document, Yessenalina, Yue and Cardie [12] mention two additional problems of applying sentiment analysis at document level: 1) the difficulties that result from the varying amount of information contained in different parts of the document, and 2) the combination of objective and subjective sentences in a single text, which cannot be easily distinguished by the learning methods.
When there is a need to extract several emotions from the same document, sentence-level classification is more suitable. Two assumptions are made at this classification level. First, the entity to which each sentence refers is known. Second, each sentence expresses a single opinion [13]. This method measures the overall polarity of a document from each sentence's sentiment polarity and its significance to the document. The significance of a sentence depends on its position, which makes the first sentence the most important one, as it usually declares the main opinion of the document [14].
Contrary to the two previous methods, aspect-level sentiment analysis is interested not only in measuring the sentiment towards an entity, but also in determining the sentiment towards various aspects of that entity. Aspect-level methods need to be robust and flexible, among other characteristics.
Robustness is crucial for dealing with informal writing, and flexibility makes the method applicable in numerous domains [15]. Aspect-level classification is often used for tasks that deal with customer reviews, and it follows three steps: identification, classification and aggregation [16]. In the first step, the sentiment-target pairs of the text are identified, where a target is an aspect of the entity under analysis. In the second step, they are classified into a sentiment category, for example positive or negative. Lastly, the third step is the aggregation of the sentiment for every aspect [15]. Aspect-level sentiment analysis differs from syntactical techniques, as it is more closely related to natural language and can also recognize less obvious sentiments [17].
2.2 Sentiment Analysis Approaches
The lexicon-based approach uses a lexicon to classify the sentiment of a text, by comparing the words in the text with those existing in the lexicon, which are assigned
with a sentiment value or label. The overall sentiment of the text is generated by aggregating all the sentiment scores [18]. Various lexicons have been created over time and have proven to be very efficient in classifying the sentiment of product reviews if suitable weights are assigned to the sentiment words [19]. One disadvantage of lexicon-based methods is the limited number of words they cover [20]. To tackle this problem, and the failure of some lexicons to address the sentiment of some words in a particular domain, Cho et al. [21] proposed merging various lexicons by standardizing them to the same score range and adjusting the words in the lexicons and their respective scores based on the positive/negative word occurrence ratio and the review polarity ratio. Nevertheless, Asghar et al. [22] argue that because a word's sentiment score varies across domains, existing lexicons fail to assign the correct sentiment values to words and consequently misclassify their sentiment. Therefore, for combined lexicons, they suggested a method that avoids this misclassification by changing the sentiment score of a word when its meaning implies the opposite of the sentiment assigned to it by an existing lexicon. Along the same lines, in their attempt to address the domain-specific lexicon problem, Taboada et al. [23] proposed another model, creating by hand separate dictionaries for nouns, verbs and adverbs for each domain.
The second method of sentiment analysis is the machine learning (ML) based method. This approach depends in large part on supervised machine learning algorithms to classify a text as positive or negative. Although these methods can be used to generate text mining models for specific tasks and domains, they require already labeled data, which are not always available [24].
The ML methods employ classifiers, such as Naïve Bayes (NB), k-nearest neighbors (KNN) [25–27] or Support Vector Machine (SVM), that are first trained on known, labelled data and then applied to the actual data of interest [28]. Some of the most popular text mining techniques are term presence and frequency, parts of speech, opinion words and phrases, and negations. The term presence and frequency technique exploits the appearance of words or n-grams, or their frequency counts. The parts-of-speech technique focuses on adjectives, because they indicate an opinion, while the opinion words and phrases technique deals with the most common words and phrases used to state opinions. Lastly, the negations technique addresses the polarity shift problem [29]. To tackle the polarity shift challenge, Xia et al. [30] proposed a dual sentiment analysis (DSA) model, which not only uses the original reviews but also creates their reversed counterparts and uses these pairs in the training and prediction process. In addition, Cruz, Taboada and Mitkov [31] developed a model that considers speculation cues. Xia et al. [32] also created the Polarity Shift Detection, Elimination and Ensemble (PSDEE) three-stage cascade model, which applies varying weights to different parts of the text. In their model, they first split each document into a set of sub-sentences and build a hybrid model that employs rules and statistical methods to detect explicit and implicit polarity shifters. Then, they employ a polarity shift elimination method to remove polarity shifts in negations. Finally, they train base classifiers on training subsets divided by type of polarity shift and use a weighted combination of the component classifiers for sentiment classification. Dhaoui, Webster and Tan [33] compared the machine learning and lexicon-based methods on social media conversations and concluded that their performance is
approximately the same for this type of data. Sankar and Subramaniyaswamy [34] have also discussed the various sentiment analysis techniques. Both studies agree that the combination of the two approaches is promising, with the latter arguing that each method has different drawbacks and that their combination will consequently provide a method with higher accuracy. Hybrid approaches are thus combinations of lexicon-based and machine learning-based methods, and can benefit from the accuracy of machine learning and the speed of lexicon-based methods [35]. Zhang et al. [36] were among the first to propose a hybrid approach for Twitter sentiment analysis using both lexicon-based and machine learning-based methods. Their strategy was to first use a lexicon-based approach at entity level (e.g. product), which is more detailed but has low sensitivity to polarity shifters. Then, using the information extracted by the lexicon-based approach together with a Chi-square test, they were able to automatically identify other relevant opinionated tweets, which help with the training of a sentiment classifier that identifies the polarity of the newly identified tweets. The binary classifier was trained on data provided by the lexicon-based method and then used in the machine learning SVM algorithm to determine the text polarity. The authors emphasize that their method is independent of manual labeling in the training process, as the training is done by feeding the algorithm previous results, and that it can automatically adjust to new trends. Similarly, Mukwazvure and Supreethi [37] and Alhumoud, Albuhairi and Alohaideb [38] proposed hybrid models for sentiment analysis tasks, which first employ lexicon-based methods to create the training data and then use these data to train a machine learning classifier that will classify new data.
Hybrid approaches are the least popular among the three types of sentiment analysis approaches. In addition, there are many variations of hybrid methods, because several combinations of lexicons and machine learning classifiers can be employed to create them. Every hybrid method is therefore unique and achieves a different level of performance.
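The hybrid pattern described above, in which a lexicon produces the training labels and a classifier is then trained on them, can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the implementation of any of the cited studies: the toy lexicon, the sample reviews, and the names `lexicon_label` and `NaiveBayes` are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

# Toy lexicon: word -> sentiment score (hypothetical values, not a published lexicon).
LEXICON = {"great": 2, "love": 2, "good": 1, "bad": -1, "terrible": -2, "broken": -2}

def lexicon_label(text):
    """Label a text by summing the lexicon scores of its words."""
    score = sum(LEXICON.get(w, 0) for w in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_label, best_logp = None, -math.inf
        for label, n in self.class_counts.items():
            logp = math.log(n / total)  # class prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in text.lower().split():
                if w in self.vocab:  # ignore words never seen in training
                    logp += math.log((self.word_counts[label][w] + 1) / denom)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label

# Step 1: the lexicon labels the reviews; no manual annotation is involved.
reviews = [
    "great tablet love the screen",
    "good battery great price",
    "terrible speaker arrived broken",
    "bad screen terrible battery",
]
labels = [lexicon_label(r) for r in reviews]

# Step 2: the classifier is trained on the lexicon-labelled reviews.
model = NaiveBayes().fit(reviews, labels)
prediction = model.predict("love the price")
```

The classifier can label text containing words that are absent from the lexicon (here "price"), because it learns word statistics from the lexicon-labelled reviews rather than from the lexicon itself.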
3 Dataset, Research Methodology and Analysis Steps
The key components of every research project are data collection, data analysis, and the methods chosen in each step of the research. This project uses a publicly available data set, which can be obtained from the Kaggle website, an online community engaged with data science [39]. The data set includes information about Amazon's users and products. Amazon is an American company focused on e-commerce, among other activities, and attracts 89% of online consumers [40]. In all data science projects, Data Mining (DM) plays a highly significant role, since it provides a way to turn previously unused data into useful information. When it was first introduced, the main focus was to construct algorithms that a company can apply to overcome the difficulties faced when trying to extract knowledge from a very large amount of data [41]. For this research, the cross-industry process for data mining (CRISP-DM) methodology is adopted. This methodology is divided into six steps, which are explained in the following sub-sections [42]. The R
programming language was used to perform the technical steps of the CRISP-DM methodology.
3.1 Business Understanding
In the business understanding step, the main goal is to comprehend the company's objectives, which set the requirements and purpose of the project. Although the business under analysis is Amazon, every e-commerce platform that sells products or provides online services to customers can be treated in the same way, as they share a common purpose, which is customer satisfaction. As explained before, the key action for customer-centered companies to achieve their goal is to understand their customers. If these companies know what their customers desire and what they like or dislike, they have the opportunity to adjust their products or services and achieve higher customer satisfaction. Therefore, Amazon's objective is to obtain knowledge about its customers' preferences and turn it into valuable insight. The easiest way to acquire this knowledge is through the exploitation of consumers' reviews. Consequently, the research objective is to analyze these reviews by employing sentiment analysis methods to determine their sentiment polarity and, hence, customer satisfaction.
3.2 Data Understanding and Preparation
The second and third steps are data understanding and preparation, which are about inspecting and understanding the variables included in the dataset and dealing with missing values or any possible irregularities. The data contain 32,308 observations and multiple variables, such as the ID, the name of the product, the text of the review, the product rating and a variable that indicates whether a customer recommends the product or not. The last two variables were used to give the researcher the ability to compare the sentiment analysis results with other relevant and interrelated features. Other data preparation tasks included the removal of duplicate rows as well as of observations that contained reviews in languages other than English. Finally, several data sets, one for every unique product, were created from the main set. This step was taken in order to analyze each product individually and get some preliminary results.
3.3 Model Development
The next step is concerned with building the hybrid sentiment analysis model, which is a combination of the lexicon and machine learning approaches for sentiment analysis at sentence level. First, a word-level sentiment analysis is applied to the reviews of each product separately. The outcome of this step is a word cloud for every product, which shows the most frequent words used in the reviews of the product. Word clouds are word visualizations in which the size of each word depends on the word's frequency [43]. In other words, the more frequent the word, the bigger its size in the plot. The word clouds of the three most reviewed products, “Amazon Kindle Paper White – eBook reader”, “Echo (smart speaker)” and “Fire tablet”, are presented in
Fig. 1. The word cloud visualization provided a first impression of the consumer sentiment for each product.
Fig. 1. Word clouds for three most reviewed products (left to right) “Echo (smart speaker)”, “Amazon Kindle Paper White – eBook reader” and “Fire tablet”
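The word-level step behind a word cloud amounts to counting word frequencies and mapping each count to a display size. The following Python sketch illustrates the idea under stated assumptions: the stop-word list, the sample reviews, and the function names `word_frequencies` and `font_sizes` are hypothetical, and the study itself used R rather than this code.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "and", "is", "it", "i", "to", "for", "my", "this", "of"}

def word_frequencies(reviews):
    """Count non-stop-word tokens across all reviews of one product."""
    counts = Counter()
    for review in reviews:
        tokens = re.findall(r"[a-z']+", review.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts

def font_sizes(counts, min_size=10, max_size=48):
    """Scale each word's font size linearly with its frequency."""
    top = max(counts.values())
    return {w: min_size + (max_size - min_size) * c / top for w, c in counts.items()}

# Made-up reviews for illustration only.
sample_reviews = [
    "I love this tablet, great for reading",
    "Great tablet for the price",
    "The screen is great and easy to read",
]
freqs = word_frequencies(sample_reviews)
sizes = font_sizes(freqs)
```

In this sample the most frequent word receives the maximum font size, and all other words are scaled in proportion to their counts.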
Two different sentiment analysis packages were used to perform sentence-level sentiment analysis in order to create a more accurate model. The results were compared, and the outcome of the better-performing package was used in the machine learning phase. The first package was “Syuzhet”, which embeds four standard lexicons: “AFINN”, “BING”, “NRC” and the default lexicon, “SYUZHET” [44]. The sentiment analysis task was performed by a function that splits the text into sentences, extracts the sentiment score of each sentence using the combination of the lexicons available in “Syuzhet” and, lastly, computes the overall sentiment score for each review. This function was applied to Amazon's processed dataset to acquire the sentiment of the reviews. The second package was “Sentimentr”. In contrast to “Syuzhet”, “Sentimentr” tries to handle negators, amplifiers and de-amplifiers by using valence shifters to reverse, increase or decrease the score of a polarized word [44]. The lexicon in the “Sentimentr” package that deals with valence shifters is “hash_valence_shifters”. The function that computes the sentiment with this lexicon also allows setting the number of words before and after the polarized word in which valence shifters are considered. Finally, while processing the text of the reviews, it was discovered that some of the reviews included emojis. To deal with these, we used a “Sentimentr” function that replaces emoticons with their word equivalents. Both packages were applied to the reviews in the combined dataset to measure their sentiments. Additionally, the “reviews.rating” and “reviews.doRecommend” variables were considered for comparison with the sentiment analysis outcomes of both packages (see Figs. 2 and 3). These variables should be highly correlated, since their notions are connected.
For example, one would expect a user who gave a positive review of a product to give the product a high rating and recommend it. High correlations between the sentiments measured from the reviews and the product ratings and recommendations can be observed in Figs. 2 and 3. Such high correlations validate the sentiment detection performance. Surprisingly, however, the comparison between the two packages shows that the “Syuzhet” package outperforms the “Sentimentr” package
in classifying neutral emotions into either positive or negative and is more consistent with the results of the rating and recommendations of the products.
Fig. 2. Comparison between products recommendation (left) and product rating (right)
Fig. 3. Sentiments classification comparison between Syuzhet (left) and Sentimentr (right) packages
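To make the sentence-level lexicon step more concrete, the sketch below scores each sentence from a polarity lexicon, lets negators, amplifiers and de-amplifiers adjust the score of a polarized word, and aggregates the sentence scores into a review label. It is a simplified Python illustration of the general idea only, not the algorithm of “Syuzhet” or “Sentimentr”: the word lists, weights, the preceding-words-only window, and the function names are all assumptions.

```python
import re

# Hypothetical toy lexicons; not the AFINN, BING, NRC or sentimentr tables.
POLARITY = {"good": 1.0, "great": 2.0, "love": 2.0, "bad": -1.0, "slow": -1.0}
NEGATORS = {"not", "never", "no"}
AMPLIFIERS = {"very": 1.5, "really": 1.5}
DEAMPLIFIERS = {"slightly": 0.5, "somewhat": 0.5}

def sentence_score(sentence, window=2):
    """Sum polarized word scores, adjusted by shifters in the preceding window."""
    words = re.findall(r"[a-z']+", sentence.lower())
    total = 0.0
    for i, word in enumerate(words):
        if word not in POLARITY:
            continue
        score = POLARITY[word]
        # Simplification: only words before the polarized word are inspected.
        for shifter in words[max(0, i - window):i]:
            if shifter in NEGATORS:
                score = -score
            elif shifter in AMPLIFIERS:
                score *= AMPLIFIERS[shifter]
            elif shifter in DEAMPLIFIERS:
                score *= DEAMPLIFIERS[shifter]
        total += score
    return total

def review_polarity(review):
    """Split a review into sentences, score each, and label the mean score."""
    sentences = [s for s in re.split(r"[.!?]+", review) if s.strip()]
    if not sentences:
        return "neutral"
    mean = sum(sentence_score(s) for s in sentences) / len(sentences)
    if mean > 0:
        return "positive"
    if mean < 0:
        return "negative"
    return "neutral"
```

Without the shifter handling, a sentence such as "The screen is not good" would be scored as positive, which is the kind of error valence shifters are meant to reduce.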
The results of the lexicon approach were fed into a machine learning model, so that the labeled text could be used to train the sentiment classifier. The machine learning approach does not perform additional sentiment analysis, nor does it provide new information about the reviews' sentiment. However, it delivers a model that is already trained on known data and can be used directly for the classification of unlabeled text, saving time in the analysis. For this hybrid model, the Naïve Bayes (NB), k-nearest neighbors (KNN) and Support Vector Machine (SVM) classifiers were chosen to be trained on the known data, which were labeled by the “Syuzhet” lexicon [28]. The accuracies achieved by these models were 86.3%, 79.6% and 82.2%, respectively.
3.4 Evaluation
In the evaluation step, the performance of the chosen classifiers is measured using a confusion matrix [45]. For this study, the classification classes are Positive, Negative and Neutral. Therefore, each cell of the confusion matrix is assigned one of the following six outcomes: true positive, false positive, true negative, false negative, true
neutral and false neutral. The word “true” before a class name indicates that the classifier assigned the correct class to the text, while the word “false” marks misclassified text. The Naïve Bayes model achieved the highest accuracy, 86.3%. Accuracy is the proportion of correct classifications (true positives, true negatives and true neutrals) among the total number of predictions. The accuracy achieved using the Naïve Bayes classifier reflects a good classification performance.
3.5 Deployment
The deployment step focuses on the explanation, organization and implementation of the project. However, because the dataset used for this research is public and the researchers were not collaborating with a specific company, the model could not be deployed. This is the last phase of the CRISP-DM methodology, and it is as crucial as the other phases because it explains every step taken to complete the project [46].
4 Analysis Results and Discussion
The main data analysis model created for this project starts by comparing two sentiment analysis packages: “Syuzhet” and “Sentimentr”. To provide better insight into how these packages work, we used their built-in functions to create an HTML file that contains all the reviews, with sentences highlighted in red or green to represent negative or positive sentiment, respectively. Some sections of this file are illustrated in Figs. 4 and 5.
Fig. 4. Correctly classified text
Fig. 5. Misclassified text
Sentiment analysis of text cannot be completely accurate. Every method that can be used to carry out sentiment analysis will produce some misclassifications, because written language has many peculiarities. The “Sentimentr” package employs valence shifters to deal with negators, amplifiers and de-amplifiers. However, the two figures above suggest that in many cases the valence shifters can result in incorrect classification of the text. For example, the phrase “Couldn’t be happier with this purchase” seems to receive a negative score because of the negation appearing before the positive word “happier”. Hence, the review was falsely classified as neutral. The same conclusion can be drawn from Fig. 6, where it is apparent that “Syuzhet” managed to classify more of the neutral cases into either the positive or the negative class. Therefore, the outcome of the “Syuzhet” package is considered better, and it was chosen for use in the machine learning phase. Moving forward, the results of the machine learning models applied to the unlabeled dataset show that the Naïve Bayes model achieved the highest performance, with 86.3% accuracy. Other metrics (sensitivity and specificity) were also calculated and show high percentages, with the lowest value being approximately 62% in most cases.
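The metrics mentioned above can be derived from a three-class confusion matrix, as in the following minimal Python sketch. The counts are hypothetical and are not the results of this study; sensitivity and specificity are computed per class in a one-vs-rest manner, which is one common convention and is assumed here.

```python
CLASSES = ["positive", "negative", "neutral"]

# Hypothetical counts: rows are actual classes, columns are predicted classes.
CONFUSION = [
    [80, 5, 15],   # actual positive
    [4, 30, 6],    # actual negative
    [10, 5, 45],   # actual neutral
]


def accuracy(matrix):
    """Share of predictions on the diagonal (correct classifications)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def sensitivity(matrix, k):
    """True positive rate of class k: TP / (TP + FN)."""
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp
    return tp / (tp + fn)


def specificity(matrix, k):
    """True negative rate of class k (one-vs-rest): TN / (TN + FP)."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp
    fp = sum(row[k] for row in matrix) - tp
    tn = total - tp - fn - fp
    return tn / (tn + fp)


def report(matrix):
    """Collect accuracy plus per-class sensitivity and specificity."""
    per_class = {
        name: {"sensitivity": sensitivity(matrix, k), "specificity": specificity(matrix, k)}
        for k, name in enumerate(CLASSES)
    }
    return {"accuracy": accuracy(matrix), "per_class": per_class}
```

Calling `report(CONFUSION)` returns the overall accuracy together with one sensitivity and one specificity value per class, so the lowest per-class value can be read off directly.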
Fig. 6. Comparison of the sentiment classification performance of the “Syuzhet” and “Sentimentr” packages.
Overall, this study met its objectives and can support companies’ marketing decision-making process in two ways. The word-level analysis can provide a company with knowledge about the primary sentiments of consumers towards a product. However, because this knowledge is not in-depth, a lexicon approach was combined with a machine learning approach, which resulted in a model with high accuracy. The lexicon approach performs sentiment analysis on the text of user reviews, while the machine learning approach contributes an already trained model that can classify unlabeled text. Therefore, the Naïve Bayes model can be used on its own to classify new consumer reviews of products, making the sentiment
analysis task less time consuming. The results of sentiment analysis of customer reviews are highly significant to a company, as they can support better marketing decision making by equipping the business with insights related to its products.
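The hybrid idea described in this section, in which lexicon-derived labels train a classifier that is later used on its own, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the study's R implementation: the classifier is a small multinomial Naïve Bayes with add-one smoothing written from scratch, and the training texts and their labels are hypothetical stand-ins for reviews labeled by a lexicon.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_logp = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the log likelihood of each known word.
            logp = math.log(n_docs / total_docs)
            n_words = sum(self.word_counts[label].values())
            for tok in tokenize(text):
                if tok not in self.vocab:
                    continue  # words never seen in training are skipped
                count = self.word_counts[label][tok]
                logp += math.log((count + 1) / (n_words + len(self.vocab)))
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label


# Hypothetical reviews; the labels stand in for the output of a lexicon step.
TRAIN_TEXTS = [
    "great tablet love it",
    "love the screen great value",
    "terrible battery bad purchase",
    "bad sound terrible support",
    "it is a tablet",
    "the box has a cable",
]
TRAIN_LABELS = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

model = NaiveBayes().fit(TRAIN_TEXTS, TRAIN_LABELS)
```

Once fitted, `model.predict("great value love it")` classifies a new review without running the lexicon step again, which is the time saving the hybrid approach aims for.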
5 Conclusion and Future Work
The sentiment analysis field has grown significantly over the years, and its importance to businesses is increasing. Businesses need to find new ways to understand their customers in order to improve their performance. This study focused on developing a hybrid method for analyzing online user reviews that combines the lexicon and machine learning approaches to sentiment analysis. The word-level analysis provides some preliminary results about the consumers’ sentiment towards a product by creating a word cloud visualization. To further analyze the reviews, we compared the lexicons available in two R packages (“Syuzhet” and “Sentimentr”) at sentence level to classify the reviews as positive, negative or neutral. The better performing of the two packages, “Syuzhet”, was chosen to label the text on which the classifiers were trained. The constructed Naïve Bayes model performed very well, achieving a high accuracy of 86.3%. The aim of this study was to develop a sentiment analysis model that helps businesses make better marketing decisions by turning consumers’ online reviews into valuable insight for the company. By observing the products’ word clouds and the output of the hybrid model, a company can gain a better understanding of the customers’ sentiment towards its products or services. Hence, the company can decide to improve offerings that are reviewed negatively or to promote those that are reviewed positively. As with all research studies, this study has certain limitations. For example, more combinations of the lexicon and machine learning approaches could be tested and compared in order to find the best performing hybrid model.
References
1. Willcox, M.: The Business of Choice: Marketing to Consumers’ Instincts. Pearson FT Press, Upper Saddle River (2015)
2. Howells, K., Ertugan, A.: Applying fuzzy logic for sentiment analysis of social media network data in marketing. Proc. Comput. Sci. 120, 665 (2017)
3. Wu, S., Chiang, R., Chang, H.: Applying sentiment analysis in social web for smart decision support marketing. J. Ambient Intell. Humanized Comput. 1–10 (2018)
4. Verhoef, P.C., Kooge, E., Walk, N.: Creating Value with Big Data Analytics: Making Smarter Marketing Decisions. Routledge, London (2016)
5. Mäntylä, M., Graziotin, D., Kuutila, M.: The evolution of sentiment analysis - a review of research topics, venues, and top cited papers. Comput. Sci. Rev. 27, 16–32 (2018)
6. Cambria, E., Schuller, B., Xia, Y., Havasi, C.: New avenues in opinion mining and sentiment analysis. IEEE Intell. Syst. 28, 15–21 (2013)
7. Hu, X., Tang, J., Gao, H., Liu, H.: Unsupervised sentiment analysis with emotional signals. In: Proceedings of the 22nd international conference on World Wide Web - WWW 2013, pp. 607–618 (2013)
8. Gupta, E., Kumar, A., Kumar, M.: Sentiment analysis: a challenge. Int. J. Eng. Technol. 7(2.27), 291 (2018)
9. Ravi, K., Ravi, V.: A survey on opinion mining and sentiment analysis: tasks, approaches and applications. Knowl. Based Syst. 89, 14–46 (2015)
10. Medhat, W., Hassan, A., Korashy, H.: Sentiment analysis algorithms and applications: a survey. Ain Shams Eng. J. 5(4), 1093–1113 (2014)
11. Behdenna, S., Barigou, F., Belalem, G.: Document level sentiment analysis: a survey. EAI Endorsed Trans. Context Aware Syst. Appl. 4 (2018)
12. Yessenalina, A., Yue, Y., Cardie, C.: Multi-level structured models for document-level sentiment classification. In: EMNLP (2010)
13. Feldman, R.: Techniques and applications for sentiment analysis. Commun. ACM 56(4), 84 (2013)
14. Wang, H., Yin, P., Zheng, L., Liu, J.: Sentiment classification of online reviews: using sentence-based language model. J. Exp. Theor. Artif. Intell. 26(1), 13–31 (2013)
15. Schouten, K., Frasincar, F.: Survey on aspect-level sentiment analysis. IEEE Trans. Knowl. Data Eng. 28(3), 814 (2016)
16. Vanaja, S., Belwal, M.: Aspect-Level sentiment analysis on e-commerce data. In: International Conference on Inventive Research in Computing Applications (ICIRCA), p. 1276 (2018)
17. Poria, S., Cambria, E., Winterstein, G., Huang, G.: Sentic patterns: dependency-based rules for concept-level sentiment analysis. Knowl. Based Syst. 69, 46 (2014)
18. Deng, S., Sinha, A., Zhao, H.: Adapting sentiment lexicons to domain-specific social media texts. Decis. Support Syst. 94, 66 (2017)
19. Khoo, C., Johnkhan, S.: Lexicon-based sentiment analysis: comparative evaluation of six sentiment lexicons. J. Inf. Sci. 44(4), 491–511 (2017)
20. Vu, L., Le, T.: A lexicon-based method for sentiment analysis using social network data. In: International Conference Information and Knowledge Engineering (IKE 2017) (2017)
21. Cho, H., Kim, S., Lee, J., Lee, J.: Data-driven integration of multiple sentiment dictionaries for lexicon-based sentiment classification of product reviews. Knowl. Based Syst. 71, 61–71 (2014)
22. Asghar, M., Khan, A., Ahmad, S., Qasim, M., Khan, I.: Lexicon-enhanced sentiment analysis framework using rule-based classification scheme. PLoS ONE 12(2), e0171649 (2017)
23. Taboada, M., Brooke, J., Tofiloski, M., Voll, K., Stede, M.: Lexicon-based methods for sentiment analysis. Comput. Linguist. 37(2), 267–307 (2011)
24. Gonçalves, P., Araújo, M., Benevenuto, F., Cha, M.: Comparing and combining sentiment analysis methods. In: Proceedings of the First ACM Conference on Online Social Networks, pp. 27–38 (2019)
25. Goel, A., Gautam, J., Kumar, S.: Real time sentiment analysis of tweets using Naive Bayes. In: 2016 2nd International Conference on Next Generation Computing Technologies (NGCT) (2016)
26. Dey, L., Chakraborty, S., Biswas, A., Bose, B., Tiwari, S.: Sentiment analysis of review datasets using Naïve Bayes’ and K-NN classifier. Int. J. Inf. Eng. Electron. Bus. 8(4), 54–62 (2016)
27. Pratama, Y., Roberto Tampubolon, A., Diantri Sianturi, L., Diana Manalu, R., Frietz Pangaribuan, D.: Implementation of sentiment analysis on Twitter using Naïve Bayes algorithm to know the people responses to debate of DKI Jakarta governor election. J. Phys.: Conf. Ser. 1175, 012102 (2019)
28. Singh, V., Piryani, R., Uddin, A., Waila, P., Marisha: Sentiment analysis of textual reviews; evaluating machine learning, unsupervised and SentiWordNet approaches. In: 2013 5th international conference on knowledge and smart technology (KST), pp. 122–127. IEEE (2013)
29. Pannala, N., Nawarathna, C., Jayakody, J., Rupasinghe, L., Krishnadeva, K.: Supervised learning based approach to aspect based sentiment analysis. In: IEEE International Conference on Computer and Information Technology (CIT) (2016)
30. Xia, R., Xu, F., Zong, C., Li, Q., Qi, Y., Li, T.: Dual sentiment analysis: considering two sides of one review. IEEE Trans. Knowl. Data Eng. 27(8), 2120–2133 (2015)
31. Cruz, N., Taboada, M., Mitkov, R.: A machine-learning approach to negation and speculation detection for sentiment analysis. J. Assoc. Inf. Sci. Technol. 67(9), 2118–2136 (2015)
32. Xia, R., Xu, F., Yu, J., Qi, Y., Cambria, E.: Polarity shift detection, elimination and ensemble: a three-stage model for document-level sentiment analysis. Inf. Process. Manage. 52(1), 36–45 (2016)
33. Dhaoui, C., Webster, C., Tan, L.: Social media sentiment analysis: lexicon versus machine learning. J. Consum. Market. 34(6), 480–488 (2017)
34. Sankar, H., Subramaniyaswamy, V.: Investigating sentiment analysis using machine learning approach. In: 2017 International Conference on Intelligent Sustainable Systems (ICISS) (2017)
35. Thakkar, H., Patel, D.: Approaches for sentiment analysis on Twitter: a state-of-art study. In: Proceedings of the International Network for Social Network Analysis Conference, Xi’an, China (2013)
36. Zhang, L., et al.: Combining lexicon-based and learning-based methods for Twitter sentiment analysis. Technical Report HPL-2011–89 (2011)
37. Mukwazvure, A., Supreethi, K.: A hybrid approach to sentiment analysis of news comments. In: 2015 4th International Conference on Reliability, Infocom Technologies and Optimization (ICRITO) (Trends and Future Directions) (2015)
38. Alhumoud, S., Albuhairi, T., Alohaideb, W.: Hybrid sentiment analyser for Arabic Tweets using R. In: 2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K), pp. 417–424 (2015)
39. Kaggle.com. Consumer reviews of Amazon products (2019). https://www.kaggle.com/datafiniti/consumer-reviews-of-amazon-products
40. Masters, K.: 89% of consumers are more likely to buy products from Amazon than other ecommerce sites: study. Forbes.com. (2019). https://www.forbes.com/sites/kirimasters/2019/03/20/study-89-of-consumers-are-more-likely-to-buy-products-from-amazon-than-other-ecommerce-sites/#dae18c64af1e. Accessed 21 Dec 2019
41. Marbán, O., Segovia, J., Menasalvas, E., Fernández-Baizán, C.: Toward data mining engineering: a software engineering approach. Inf. Syst. 34(1), 87–107 (2009)
42. Marbán, O., Mariscal, G., Segovia, J.: A data mining & knowledge discovery process model. In: Data Mining and Knowledge Discovery in Real Life Applications, Vienna: I-Tech (2009)
43. Ramasubramanian, K., Singh, A.: Machine Learning Using R. Apress, Chapter 4 (2016)
44. Naldi, M.: A review of sentiment computation methods with R packages. https://arxiv.org/pdf/1901.08319v1.pdf (2019)
45. Sammut, C., Webb, G.: Encyclopedia of Machine Learning. Springer, New York (2011)
46. Nadali, A., Kakhky, E., Nosratabadi, H.: Evaluating the success level of data mining projects based on CRISP-DM methodology by a Fuzzy expert system. In: 2011 3rd International Conference on Electronics Computer Technology (2011)
Jupyter Lab Based System for Geospatial Environmental Data Processing
Nikita A. Terlych and Ramon Antonio Rodriges Zalipynis
National Research University Higher School of Economics, Moscow, Russia
[email protected], [email protected]
Abstract. This paper describes a new system that uses the Jupyter Lab development environment as the base for a graphical user interface (GUI) and extends it to provide geospatial environmental data (geodata) processing functionality. We aim to make environmental data exploration in diverse domains, including precision agriculture, hazard monitoring, and surface classification, easier for researchers and practitioners. The development brings together an easy-to-use command line language, a syntax highlighting extension, command autocompletion, documentation hints, and back-end interpretation and execution. To make geodata processing even easier, our system supports graphical representation of data in the form of interactive Web maps with layers. We used Jupyter Project solutions to enable users to work within the familiar environment of Notebooks while keeping their data secure and easy to share with other users. Our solution is designed to serve as a front-end to an arbitrary environmental data processing system.
Keywords: Jupyter Notebook · Geospatial environmental data · Command line language · Web map · GUI
1 Introduction
A wide variety of ecological monitoring problems are solved using geospatial environmental data [1, 2] (geodata from here on). Since environmental data is used for agriculture, soils, air quality [3, 4] and climate assessment [5], investigating this data helps to discover environmental patterns, predict weather, find the best and most secure approaches to using natural areas, and reduce dangers. One of the most efficient ways to represent geodata is to visualize it, e.g. on an interactive map. The geodata is distributed among several layers, each of which holds a thematic set of objects. Thus, geodata analysis is tightly bound to the appropriate map representation. We aim to make environmental data exploration easier and came up with the idea of using an existing, well-known web-based GUI for geospatial data processing. Such a GUI should be simple to install and flexible enough to embed geodata analysis and visualization tools. To let the system work with maps and layers, we decided to use a simple command line language. We examined different systems and chose Jupyter Lab [7], the successor of Jupyter Notebook [6], as the project that best satisfies these requirements.
© Springer Nature Switzerland AG 2021 K. Arai et al. (Eds.): FTC 2019, AISC 1289, pp. 627–638, 2021. https://doi.org/10.1007/978-3-030-63089-8_41
Jupyter Lab provides an extensible system of plugins, different types of extensions, and kernels through which developers can bring new functionality to Jupyter. We used this system to make Jupyter Lab support our command line language with code highlighting, introspection and autocompletion. In the same way, we added a function for opening interactive maps. The purpose of the project was to develop a set of tools for querying geodata and visualizing it on a multilayer map. The system allows users to perform several tasks: create a multilayer map, edit the map, and display it as an interactive map based on Leaflet [10] in the client Web browser. Jupyter Lab alone cannot meet all the requirements of the project; therefore, it was coupled with Jupyter Hub and complemented by a server that we developed to handle command language processing. As a result, we obtained a complete client-server system capable of processing geodata that inherits the powerful and extensible Jupyter Lab GUI.
2 System Architecture
2.1 System Overview
Our system has eight parts, including Jupyter Hub, Jupyter Lab, the syntax highlighting extension, the Geo kernel, the Leaflet map, the Netty server and the database. An overview of the system is presented in Fig. 1.
Fig. 1. System overview
The whole system is divided into two parts, serving as the backend and the frontend. The backend runs on a Netty [8] server, which is responsible for command line language interpretation and the respective database queries (Sect. 2.3). The server runs commands received from the Geo kernel (which was designed to introduce a new language into Jupyter Notebook) and handles the update process for all opened map instances. The other part serves as the GUI. A crucial part of the frontend is Jupyter Lab, whose instances are spawned by Jupyter Hub [9], which creates configurable environments for users (Sect. 2.2). Jupyter Lab includes the extension for command line language
highlighting and the Geo kernel for code interpretation, autocompletion and introspection. When a user runs a Jupyter Lab cell, all the code goes to the Geo kernel, which then sends it to the Netty server. The Netty server interprets the code and sends the result back to the Geo kernel, which returns it to Jupyter Lab, where the user can see it. All data is transmitted between the Geo kernel and the Netty server via the secure WebSocket protocol [11].
To streamline the processing of geospatial data, we use a command line language. Its syntax is similar to that of many other command languages: a line of code contains an object (e.g. a map or a layer) followed by an operation (e.g. “create” or “delete”) together with command attributes. The syntax of an attribute is a keyword prefixed with a dash, followed by a value. For example, “layer update -name Streets” updates a layer called “Streets”. Our language supports CRUD operations for maps and layers and some GDAL [12] commands. All the commands are stored as nested JSON objects with the following hierarchy: object, command, argument, default values. This structure itself provides autocompletion suggestions as a list of the objects at the next hierarchical level. Each stored object contains a description that is displayed during code inspection in notebook cells.
The described architecture renders maps and manages command input on the client side and uses the server to parse the user’s commands and return the requested data. Hence, this system may be applied to web applications with client-side or server-side architectural approaches, as described in [13]. A thick server approach based on geodata processing libraries could be added to the Java server. In this case, the command line language would be extended to support the new server functionality.
2.2 Jupyter Components
Jupyter Hub. Project Jupyter developed Jupyter Hub to build scalable systems based on Jupyter Notebook or Jupyter Lab. Jupyter Hub gives users access to computational environments and resources and automatically performs installation and maintenance tasks. Jupyter Hub consists of three parts: the authenticator, the spawner, and the user database (Fig. 2). The authenticator controls access to Jupyter Hub, i.e. the login and registration logic. The spawner controls the configuration of a new notebook server for a user. The user database keeps records of registered users. In our work we used Native Authenticator [14], Docker Spawner, and the SQLite database engine [15]. Native Authenticator enabled us to add a registration page: out of the box, Jupyter provides authentication only for users already registered in the system. Docker Spawner sets up the Jupyter environment for each user in a separate container, which makes user environments independent of the system. We use the SQLite database only for keeping user accounts. Jupyter Hub was originally developed with Jupyter Notebook [16] in mind, and the Hub panel does not appear in Jupyter Lab by default, so the “jupyterlab-hub” [17] extension should be installed. Jupyter Hub allows configuring users’ environments, so Jupyter
Fig. 2. Jupyter Hub structure diagram
Lab may be set as the default GUI and all needed extensions and kernels can be installed for all users.
Jupyter Lab. Jupyter Lab is the next-generation user interface for the notebook server, but it uses the same notebook document format as Jupyter Notebook and has the same basic structure, as shown in Fig. 3.
Fig. 3. Jupyter notebook structure
The Jupyter Lab UI is built as a combination of plugins and extensions that provide functionality such as rendering different types of data, syntax highlighting for programming languages, and more. Code execution, inspection and autocompletion in Jupyter Lab are provided by subprograms called kernels.
The frontend, shown as “Browser”, runs in a web browser and is powered by the notebook server. The core of the Jupyter Notebook system is the notebook server, which handles user actions in the GUI, including cell execution and notebook file management. Each time a user runs a cell with code, the commands are sent through the notebook server to the kernel of the respective language. For each opened notebook, the server starts a separate instance of the respective kernel.
Geo Kernel. We added support for our language to Jupyter Lab by creating a new kernel. Jupyter allows developers to add a new kernel in two ways (Fig. 4).
Fig. 4. Kernel implementation choices
One way is to develop a native kernel in any programming language. The developer may use any architecture but must implement communication with the notebook server over several predefined sockets using the ZeroMQ protocol. The other way, which we used to develop our Geo kernel, is to use Python and inherit the kernel class from the IPython kernel [18], which is a part of Jupyter Lab. The developer may then use the inherited IPython kernel methods to communicate with Jupyter Lab and define the autocompletion and introspection logic in them, so that only the interpretation of the language has to be written from scratch. We redefined the following IPython kernel methods: “do_complete”, “do_inspect” and “do_execute”. The “do_execute” method is called when a user runs a cell. It receives the code and some additional parameters as input and should return an execution result dictionary with the following fields: ‘status’, which is ‘ok’, and ‘execution_count’, a counter of executions available as ‘self.execution_count’ in the IPython kernel. The main purpose of this method in our kernel is to send the code to the Netty server and then display the received result. To display execution results in the output block of the cell, we used the “send_response” method of the IPython kernel. The arguments of this method are the socket, the message type, and a content dictionary that holds the data and metadata dictionaries. The example code:
    self.send_response(self.iopub_socket, 'display_data', {
        'data': {'text/html': data},
        'metadata': {}
    })
Jupyter Lab lets users request autocompletion suggestions (a list of words to write next) by pressing “Tab” and introspection (keyword descriptions) by pressing “Shift + Tab”. These events are handled by the “do_complete” and “do_inspect” methods, which take as input the code and the cursor position from which the method was called. The return message for a “do_complete” request contains:
1. ‘status’: ‘ok’ or ‘error’;
2. ‘cursor_start’ and ‘cursor_end’ as integers;
3. ‘matches’ as a list of suggestions.
The range from ‘cursor_start’ to ‘cursor_end’ is the part of the code that will be replaced with the chosen autocompletion. A “do_inspect” request should return a dictionary with:
1. ‘status’: ‘ok’ or ‘error’;
2. ‘found’: Boolean value;
3. ‘data’: {‘text/plain’: inspection};
4. ‘metadata’: {}.
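The reply formats listed above can be illustrated with a minimal Python sketch that builds completion and inspection replies from a nested command hierarchy. Everything named here is an assumption made for illustration: the `COMMANDS` dictionary, its descriptions and the helper names are hypothetical, and the sketch splits the current line on whitespace instead of reproducing the kernel's actual matching logic. The extra `metadata` key in the completion reply follows the Jupyter messaging specification. In a wrapper kernel, dictionaries of this shape would be returned from “do_complete” and “do_inspect”.

```python
# Hypothetical command hierarchy: object -> command -> argument descriptions.
COMMANDS = {
    "map": {
        "description": "A multilayer map.",
        "commands": {
            "create": {"description": "Create a map.", "arguments": {"-name": "Map name."}},
            "delete": {"description": "Delete a map.", "arguments": {"-name": "Map name."}},
        },
    },
    "layer": {
        "description": "A map layer.",
        "commands": {
            "update": {"description": "Update a layer.", "arguments": {"-name": "Layer name."}},
            "delete": {"description": "Delete a layer.", "arguments": {"-name": "Layer name."}},
        },
    },
}


def _candidates(tokens_before):
    """Return the names valid at the hierarchy level after `tokens_before`."""
    if not tokens_before:
        return sorted(COMMANDS)
    obj = COMMANDS.get(tokens_before[0])
    if obj is None:
        return []
    if len(tokens_before) == 1:
        return sorted(obj["commands"])
    cmd = obj["commands"].get(tokens_before[1])
    if cmd is None:
        return []
    return sorted(cmd["arguments"])


def complete_reply(code, cursor_pos):
    """Build a reply shaped like the result of a `do_complete` request."""
    line = code[:cursor_pos].split("\n")[-1]
    tokens = line.split()
    # The word under the cursor is a partial prefix unless the line is
    # empty or ends in whitespace.
    prefix = "" if (not line or line[-1].isspace()) else tokens.pop()
    matches = [name for name in _candidates(tokens) if name.startswith(prefix)]
    return {
        "status": "ok",
        "matches": matches,
        "cursor_start": cursor_pos - len(prefix),
        "cursor_end": cursor_pos,
        "metadata": {},
    }


def inspect_reply(code, cursor_pos):
    """Build a reply shaped like the result of a `do_inspect` request."""
    tokens = code[:cursor_pos].split("\n")[-1].split()
    text = None
    if tokens and tokens[0] in COMMANDS:
        node = COMMANDS[tokens[0]]
        text = node["description"]
        if len(tokens) > 1 and tokens[1] in node["commands"]:
            node = node["commands"][tokens[1]]
            text = node["description"]
            if len(tokens) > 2 and tokens[-1] in node["arguments"]:
                text = node["arguments"][tokens[-1]]
    found = text is not None
    data = {"text/plain": text} if found else {}
    return {"status": "ok", "found": found, "data": data, "metadata": {}}
```

For instance, `complete_reply("layer up", 8)` suggests `update` and sets `cursor_start` to 6, so that only the partial word `up` is replaced by the chosen suggestion.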
We use the cursor position from the input arguments to determine the hierarchy level of the expected suggestion. There are three levels, according to the command line language structure: the object, the command and the argument. To offer a context-based response, we compare the last line of the code with regular expressions representing the different levels of the hierarchy. For example, if a user calls autocompletion with the cursor positioned after an object, the kernel returns a list of the commands available for this object.
Highlight Extension. GUI features such as interactive controls and data rendering functions are provided by Jupyter Lab. Among others, there is the CodeMirror plugin [19], which provides syntax highlighting (predefined keywords inside notebook cells are colored differently to make the text easier to read). The only way to add syntax highlighting rules for a new language is to extend this plugin. SimpleMode is an addon for the CodeMirror library with which a developer can describe the syntax of a programming language via regular expressions. This is how the command line language used in the project was introduced to CodeMirror. We defined highlighting for all objects, commands and arguments of the command line language and registered “.geo” as the file type for the language.
2.3 Netty Server
The back-end core of the system is the Netty server, which we developed to serve four main purposes: to maintain connections and transmit data to and from all running Geo kernels, to run database queries, to interpret the command line language, and to handle opened map instances. The server architecture is split into three groups of classes according to these purposes (Fig. 5).
Fig. 5. Netty server class diagram
The first group contains handler classes for the server pipeline that process data transmission via the WebSocket protocol. The “Server” class runs an instance of the Netty server with a pipeline of “Server”, “WebSocket” and “ActiveMaps” handlers. ServerHandler initiates a handshake for new WebSocket requests from clients; WebSocketHandler then reads the data, determines the type of the client (kernel or map) and either runs the interpretation process or passes the data to ActiveMapsHandler. Clients pack request data using a JSON-based protocol in which each message is a JSON object with three fields: type, token and data. The “type” field is the type of the client (map or kernel), the “token” field is a generated security key that validates the content, and the “data” field carries code or map attributes. The server replies via the same protocol; the only difference is that messages from the server do not include the “token” field and use the “type” field to define the data type of the content (e.g. html or text).
The second group is for working with the database. It is built around the main “DBManager” class, which implements the database connection logic and holds the classes representing each of the command line language objects: “MapSet”, “LayerSet” and “UserSet”. These object classes contain the SQL queries that support the CRUD, login and registration commands. The database stores users’ accounts and their maps and layers (Fig. 6).
The last group implements the interpretation of command line language code. It consists of the Interpreter and MapGenerator classes. Interpreter is called from the socket handlers and uses the database classes to make the requested changes to maps and layers. This class tokenizes the received commands and runs the corresponding methods. We made the list of supported commands easily extensible by using reflection on the database classes and naming their methods after the commands. Thus, when a new method is added to the database entity classes, it is immediately supported by the command line language.
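The tokenization and reflection-based dispatch described above can be sketched in Python, where `getattr` plays the role that Java reflection plays on the server. This is only an analogue under stated assumptions: the in-memory `LayerSet` stand-in, its method signatures and the returned messages are hypothetical and do not reproduce the actual server classes or their SQL queries.

```python
import shlex


class LayerSet:
    """Hypothetical stand-in for a database entity class (kept in memory)."""

    def __init__(self):
        self.layers = {}

    def create(self, name, **attrs):
        self.layers[name] = dict(attrs)
        return f"layer {name} created"

    def update(self, name, **attrs):
        if name not in self.layers:
            return f"layer {name} not found"
        self.layers[name].update(attrs)
        return f"layer {name} updated"

    def delete(self, name):
        removed = self.layers.pop(name, None)
        return f"layer {name} deleted" if removed is not None else f"layer {name} not found"


def tokenize(line):
    """Split '<object> <command> -key value ...' into its three parts."""
    tokens = shlex.split(line)
    if len(tokens) < 2:
        raise ValueError("expected: <object> <command> [-key value ...]")
    obj, command, rest = tokens[0], tokens[1], tokens[2:]
    if len(rest) % 2 != 0:
        raise ValueError("every -key needs a value")
    args = {}
    for key, value in zip(rest[0::2], rest[1::2]):
        if not key.startswith("-"):
            raise ValueError(f"expected an argument keyword, got {key!r}")
        args[key[1:]] = value
    return obj, command, args


class Interpreter:
    """Looks up the method named after the command on the entity object."""

    def __init__(self):
        self.entities = {"layer": LayerSet()}

    def run(self, line):
        obj, command, args = tokenize(line)
        entity = self.entities.get(obj)
        method = getattr(entity, command, None) if entity is not None else None
        if command.startswith("_") or not callable(method):
            return f"unknown command: {obj} {command}"
        try:
            return method(**args)
        except TypeError:
            return f"bad arguments for: {obj} {command}"
```

With this layout, adding a new method to `LayerSet` makes a command of the same name available without touching `Interpreter`, which mirrors the extensibility argument made above.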
Fig. 6. Database diagram
Another important class is MapGenerator, which creates a ready-to-use HTML map template. Interpreter uses this class whenever the “show map” command appears. Each generated map includes an embedded authentication token and JavaScript code that authenticates to the server and receives and applies content updates.
2.4 User Workflow
Before starting to work with geodata, a user must authenticate. Users are given a link to the Jupyter Hub authentication route, where they log in to a personal environment in which Jupyter Lab, the Geo kernel and the highlight extension are preinstalled. The user is then automatically redirected to the Jupyter Lab route. The second step is authentication to the Netty server. The user opens a new Notebook with the Geo kernel and simply runs the “login” or “register” command in a cell. The Notebook replies with an authentication form in the output of the cell. After successful authentication, the user may run any commands and open maps.
We created a sequence diagram to show the communication between the main parts of the system during command execution, excluding Jupyter Hub, which is only used to maintain user environments (Fig. 7). As the diagram shows, Jupyter Lab serves only as a UI for running commands and displaying the output, while an instance of the Geo kernel acts as a bridge between Jupyter Lab and the Netty server. The kernel gets code from a notebook cell, exchanges this data with the Netty server and returns the output to Jupyter Lab. Additionally, the kernel creates an HTML file for the map instance and opens it in a new browser tab.
One of the main use cases is working with maps. Users run commands to create maps, attach layers and show the web representation of a map. The web page with a map has a panel listing all attached layers, which can be switched. Users may zoom and pan the map. Users can only modify and view their own layers and maps.
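The exchange between the Geo kernel and the Netty server during command execution relies on the JSON messages described in Sect. 2.3 (fields type, token and data in requests; type and data in replies). A minimal Python sketch of this envelope is given below. The WebSocket transport is omitted, and the function names, the token check against a set of known tokens and the reply texts are assumptions made for illustration, not the server's actual behavior.

```python
import json

CLIENT_TYPES = {"kernel", "map"}


def pack_request(client_type, token, data):
    """Client side: wrap code or map attributes into a request message."""
    if client_type not in CLIENT_TYPES:
        raise ValueError(f"unknown client type: {client_type!r}")
    return json.dumps({"type": client_type, "token": token, "data": data})


def handle_request(raw, valid_tokens, interpret):
    """Server side: check the token and build a reply without a token field.

    `valid_tokens` is a hypothetical set of issued tokens and `interpret`
    is any callable that turns command text into an output string.
    """
    message = json.loads(raw)
    if message.get("token") not in valid_tokens:
        return json.dumps({"type": "text", "data": "authentication required"})
    if message.get("type") == "kernel":
        # Code sent by a kernel is interpreted; the output goes back as text.
        return json.dumps({"type": "text", "data": interpret(message["data"])})
    # Messages from map clients would be passed on to the map handling logic.
    return json.dumps({"type": "html", "data": "<p>map update received</p>"})
```

A kernel would send `pack_request("kernel", token, code)` over its WebSocket connection and display the `data` field of the reply in the cell output, choosing the rendering according to the reply's `type`.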
Fig. 7. Sequence diagram for executing the command
3 Conclusions
The described project aims to simplify geodata processing by carrying it out through a notebook GUI. Jupyter Lab serves as the foundation of the project, and several other components were designed and implemented around it. We developed the Geo kernel for Jupyter Lab to communicate with the Netty server and to support autocompletion and introspection for the command line language. To improve the visual representation of code, we implemented a CodeMirror mode highlighting extension (Fig. 8). We extended the notebooks to groups of users via Jupyter Hub, with a secure individual workspace for each user. To interpret command line language queries and apply them to the geodata database, we developed the Netty server, which generates HTML pages with interactive Leaflet maps (Fig. 9) and keeps their instances up to date. Together, these components form a Jupyter Lab based system that helps researchers process geospatial data by constructing and viewing multilayer maps. The system may be used in geospatial data exploration projects to help with geodata selection and representation. The Jupyter Lab customization mechanisms reviewed here may help developers create their own Jupyter Lab GUI based systems for other purposes. The described system can be used to perform time series analysis [20], display isolines [21], carry out ecological analysis [22], and as an interface to existing systems
N. A. Terlych and R. A. Rodriges Zalipynis
Fig. 8. The system in action: login, create a map with layers and open the map
Fig. 9. Generated Map (by the command in Fig. 8)
[23–25]. The server could be extended to handle data in a distributed fashion [26], use compressed data [27], or migrate to the Cloud [28–30]. Future development includes connecting a real geospatial database to the system with support for GDAL commands as in [31], extending the interactive map functionality, including improvements to the user interface, and supporting different layer types.
References
1. ArcGIS book (2019). https://learn.arcgis.com/en/arcgis-imagery-book/
2. Rodriges Zalipynis, R.A.: Array DBMS in environmental science: satellite sea surface height data in the Cloud. In: 9th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, IDAACS 2017, Bucharest, Romania, 21–23 September 2017, pp. 1062–1065. IEEE (2017). https://doi.org/10.1109/IDAACS.2017.8095248
3. Averin, G., Rodriges Zalipynis, R.A.: AIR-Q-GOV Report Supplement–Pollutant Emissions Inventory and Air Quality Monitoring in Ukraine (2012). http://www.wikience.org/rodriges/publications/averin_rodriges_mwh_2012.pdf
4. Rodriges Zalipynis, R.A.: The place of Ukraine in Europe according to the level of air pollution using Earth remote sensing data. In: Proceedings of IV All-Ukrainian Congress of Ecologists with International Participation, Vinnytsia, Ukraine, 25–27 September, pp. 130–132 (2013). http://www.wikience.org/rodriges/publications/rodriges-vinnytsia-2013.pdf
5. Newberry, R.G., Lupo, A.R., Jensen, A.D., Rodriges Zalipynis, R.A.: An analysis of the spring-to-summer transition in the West Central Plains for application to long range forecasting. Atmos. Clim. Sci. 6(3), 375–393 (2016)
6. Jupyter Notebook (2020). https://jupyter-notebook.readthedocs.io/en/stable/notebook.html#introduction
7. Jupyter Lab (2020). https://jupyterlab.readthedocs.io/en/latest/getting_started/overview.html
8. Netty Framework (2020). https://netty.io/
9. Jupyter Hub (2020). https://jupyter.org/hub
10. Leaflet (2020). https://leafletjs.com/
11. WebSocket protocol (2020). https://tools.ietf.org/html/rfc6455
12. GDAL (2020). https://www.gdal.org/index.html
13. Kulawiak, M., Dawidowicz, A., Pacholczyk, M.E.: Analysis of server-side and client-side Web-GIS data processing methods on the example of JTS and JSTS using open data from OSM and geoportal. Comput. Geosci. 129, 26–37 (2020)
14. Native Authenticator (2020). https://native-authenticator.readthedocs.io/en/latest/index.html
15. SQLite (2020). https://www.sqlite.org/index.html
16. The Jupyter Notebook (2020). https://jupyter-notebook.readthedocs.io/en/stable/notebook.html
17. Jupyterlab-hub (2020). https://github.com/jupyterhub/jupyterlab-hub
18. IPython (2020). https://ipython.org/
19. CodeMirror (2020). https://codemirror.net/
20. Rodriges Zalipynis, R.A.: Representing Earth remote sensing data as time series. Syst. Anal. Environ. Soc. Sci. 2(3), 135–145 (2012)
21. Rodriges Zalipynis, R.A.: Efficient isolines construction method for visualization of gridded georeferenced data. Probl. Model. Des. Autom. 10(197), 111–123 (2011)
22. Rodriges Zalipynis, R.A.: Ecologic assessment of air pollution by nitrogen dioxide over the territory of Europe using Earth remote sensing data. Inform. Cybern. Comput. Eng. 1(19), 126–130 (2014)
23. Rodriges Zalipynis, R.A.: ChronosServer: real-time access to “native” multi-terabyte retrospective data warehouse by thousands of concurrent clients. Inform. Cybern. Comput. Eng. 14(188), 151–161 (2011)
24. Rodriges Zalipynis, R.A.: ChronosServer: fast in situ processing of large multidimensional arrays with command line tools. In: Voevodin, V., Sobolev, S. (eds.) RuSCDays 2016. CCIS, vol. 687, pp. 27–40. Springer, Cham (2016). http://doi.org/10.1007/978-3-319-55669-7_3
25. Rodriges Zalipynis, R.A., et al.: The Wikience: community data science. Concept and implementation. In: Informatics and Computer Technologies, pp. 113–117. DNTU (2011)
26. Rodriges Zalipynis, R.A.: Generic distributed in situ aggregation for Earth remote sensing imagery. In: van der Aalst, W.M.P., et al. (eds.) AIST 2018. LNCS, vol. 11179, pp. 331–342. Springer, Cham (2018). http://doi.org/10.1007/978-3-030-11027-7_31
27. Rodriges Zalipynis, R.A.: Evaluating array DBMS compression techniques for big environmental datasets. In: 10th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, IDAACS 2019, Metz, France, 18–21 September 2019, pp. 859–863. IEEE (2019). https://doi.org/10.1109/IDAACS.2017.8095248
28. Rodriges Zalipynis, R.A., et al.: Retrospective satellite data in the Cloud: an array DBMS approach. In: Voevodin, V., Sobolev, S. (eds.) RuSCDays 2017. Communications in Computer and Information Science, vol. 793, pp. 351–362. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-71255-0_28
29. Rodriges Zalipynis, R.A., et al.: Array DBMS and satellite imagery: towards big raster data in the Cloud. In: van der Aalst, W.M.P., et al. (eds.) AIST 2017. LNCS, vol. 10716, pp. 267–279. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-73013-4_25
30. Rodriges Zalipynis, R.A.: Distributed in situ processing of big raster data in the Cloud. In: Petrenko, A.K., Voronkov, A. (eds.) PSI 2017. LNCS, vol. 10742, pp. 337–351. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-74313-4_24
31. Rodriges Zalipynis, R.A.: In-situ processing of big raster data with command line tools. Russian Supercomputing Days, pp. 20–25. RussianSCDays (2016)
Collaboration-Based Automatic Data Validation Framework for Enterprise Asset Management
Kennedy Oyoo
College of Engineering and Information Technology, University of Arkansas at Little Rock (UALR), Little Rock, USA
Abstract. Automatic Data Validation (ADV) is critical for the effective and successful implementation of Enterprise Asset Management (EAM) systems in the Power and Utilities (P&U) and other asset-intensive industries as part of their digital transformation initiatives. During such implementations, data is consolidated, standardized and integrated from multiple systems. The exclusion of ADV explains why data quality issues are encountered during the project phase and after deployment of the EAM system. Enforcing ADV in EAM will directly contribute to asset data quality, which is defined by a set of dimensions such as completeness, objectivity, relevancy, reputation, timeliness, accuracy, and consistency. This research proposes a framework called Collaboration-Based Automatic Data Validation Framework for Enterprise Asset Management (CBADVFEAM) to complement the traditional data Extraction, Transformation and Loading (ETL) process. The research introduces data domains that emphasize direct engagement with the asset management stakeholders in the early stages of EAM system implementations. The CBADVFEAM framework will also deploy an intelligent toolset based on an algorithm that (a) detects data anomalies from distributed, heterogeneous data sources, (b) automatically validates the accuracy, and (c) reports on the variances. Finally, this research will set the stage for future studies on the importance of ADV during the implementation of EAM solutions in the P&U industry and thus raise general awareness of data quality problems.
Keywords: Data quality · Data validation · Asset management · Enterprise Asset Management · Power · Utilities
© Springer Nature Switzerland AG 2021 K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 639–657, 2021. https://doi.org/10.1007/978-3-030-63089-8_42
1 Introduction
The Power and Utilities (P&U) industry is generally known to be asset intensive and puts strong emphasis on Enterprise Asset Management (EAM). For these organizations to generate consistent revenue and maintain customer satisfaction, they need to use their production assets effectively and efficiently. Asset management is therefore regarded as an essential business process in the P&U industry. Furthermore, with the proliferation of Internet-of-Things (IoT) enabled devices, studies and applications of P&U big data have steadily increased [18]. Almost
every process and activity that revolves around asset management in the Power and Utilities sector requires data. A review of the existing literature on data quality in Enterprise Asset Management reveals no existing framework that drives and enables automatic data validation. The traditional approach to integrating data from multiple systems is Extraction, Transformation, and Loading (ETL). Typically, the data sources maintained by these systems differ in semantics, structure and organization. In the P&U technology landscape, EAM systems typically span many business units such as power generation, transmission and distribution. Consequently, EAM has gained significant recognition in P&U organizations as being at the forefront of contributing to their vision and financial objectives. This trend has contributed to the implementation of EAM solutions that focus on standardizing business processes, optimizing asset utilization, maximizing asset availability and minimizing asset maintenance cost throughout the entire asset lifecycle. There is strong evidence that P&U organizations have far more data than they can possibly use. At the same time, they continue to generate large volumes of both structured and unstructured data, which in most cases do not meet the quality needed to provide the required insights [8]. Asset data has become a key ingredient in P&U organizations, driven by the increasing number of embedded systems, such as condition monitoring in power plants, which produce huge amounts of data. Despite this explosion of data, P&U asset custodians and managers are always concerned about the accuracy, reliability, consistency, and timeliness of the data used in decision-making. This lack of data visibility and quality often leads to strategic business decisions being made based on expert judgment rather than on data.
Based on the author's experience working in the P&U industry, maintaining the quality of asset data is widely acknowledged to be problematic. Future research will incorporate additional data sets and apply the CBADVFEAM framework, the ADV tool and Policy-Based templates to answer the following three questions:
1. Is asset data quality currently a key consideration in Power and Utilities organizations during the implementation of EAM systems?
2. What is the quality of asset data after the consolidation of data from the legacy systems during the implementation of the EAM systems?
3. What are the key data quality dimensions that are important to the asset management stakeholders in Power and Utilities organizations?
Data and information are often used interchangeably and, unless specified otherwise, this paper uses data interchangeably with information.
2 Enterprise Asset Management
Enterprise Asset Management (EAM) maximizes the performance of fixed, physical, or capital assets that have a direct and significant impact on achieving corporate objectives [19]. Steed [20] indicates that during its lifetime, an asset is subjected to a host of external factors such as environmental conditions, system events, normal and abnormal loads, and wear and tear. At several critical stages, information is required on the
condition of the assets. Knowing what to measure, how to measure it, and what to do with the information becomes particularly important. Sandberg [21] argues that contemporary asset management demands elevated ability and knowledge to continuously support the asset management process in terms of data acquisition, real-time monitoring, and computer-supported categorization and recording of divergences from standard operations. P&U organizations rely on strategic assets, which are often interdependent and span power generation, transmission and distribution, to provide services to their customers. Ideally, the data required for the maintenance of these tightly interdependent assets should be managed as a set of unified enterprise information resources in order to achieve higher asset performance. The number of Data Warehousing projects that run in parallel with EAM implementations shows that P&U organizations are increasingly managing the interdependencies between the different types of asset data that drive their operations. Traditionally, the generation, transmission, and distribution business units have operated their assets separately and independently of one another. However, they now recognize that, from a strategic perspective, the data required to maintain these assets needs to be viewed across the entire organization. This explains why many EAM implementation projects have goals such as “standardizing business processes so that industry best practices can be put in place”. During the asset life cycle, the asset is monitored and the required data, such as the condition of the asset, is collected. At this point, knowing what asset attributes to measure, how to measure them, and what to do with the data becomes important.
The process of asset management in the P&U industry requires substantial information to be collected from many different business units within the organization including Supply Chain, Finance, Fleet, Generation, Transmission and Distribution. This information must be maintained for many years in order to identify long term trends during the entire life of the asset. The two diagrams below show the information flow in asset management (Fig. 1) and Data quality framework for lifecycle asset management (Fig. 2). Figure 1 shows the scope of the EAM Systems in the overall Information flow in asset management. Figure 2 shows the various systems involved in the asset life-cycle management from a technology perspective.
3 Literature Review
Numerous researchers have conducted studies that highlight the importance of Data Quality (DQ) in the area of Enterprise Asset Management (EAM). Many of these studies have been conducted in engineering fields that put heavy emphasis on asset maintenance and life cycle costing. In the P&U industry, however, no studies directly link data quality with enterprise asset management, although it is evident from these other studies that DQ is a critical issue for effective asset management across all industries. The studies indicate that most organizations have DQ problems [12]. One study indicates that achieving data quality in asset management is the key challenge engineering organizations face today [13]. Additionally, international standards are now available for organizations that wish
Fig. 1. Information flow in asset management (Source CIEAM Business Plan, adopted and modified from Bever.)
Fig. 2. Data quality framework for lifecycle asset management (Note: The IPWEA lifecycle asset management processes are the basis of this model.)
to implement a systematic approach to asset management. These standards include ISO 55001 (“Asset management - Management systems – Requirements”) and ISO 55002 (“Asset management - Management systems - Guidelines for the application of ISO 55001”). Asset intensive organizations must take a series of decisions to achieve effective and efficient management of those assets. The decisions to manage the P&U assets can only be robust when the asset data are complete and available to provide insights that are relevant to the asset management practice. The data should be fit for the purpose of the asset management stakeholder taking the right decision at the right time based on the underpinning asset data characteristics conforming to the requirements. This conformance to requirements is the formal definition of quality as per ISO 9000 (“Quality management systems - Fundamentals and vocabulary”) [17]. Previous studies have also found that data quality requirements can be best described by using a TOP multiple-perspectives approach. Mitroff and Linstone [7] argue that any phenomenon, subsystem, or system needs to be analyzed from what they call a Multiple Perspective method – employing different ways of seeing, to seek perspectives on the problem. These different ways of seeing are demonstrated in the TOP model of Linstone [9] and Mitroff and Linstone [7]. The TOP model allows analysts to look at the problem context from either Technical, Organizational, or Personal points of view: The technical perspective (T) sees organizations as hierarchical structures or networks of interrelationships between individuals, groups, organizations, and systems. The organizational perspective (O) considers an organization’s performance in terms of effectiveness and efficiencies. For example, leadership is one of the concerns of the organizational perspective. The personal perspective (P) focuses on the individual’s concerns. 
For example, the issues of job description and job security are the main concerns in this perspective. Mitroff and Linstone [7] suggest that these three perspectives can be applied as “three ways of seeing” any problems arising for or within a given phenomenon or system. Werhane [10] further notes that the dynamic exchanges of ideas which emerge from using the TOP perspectives are essential, because they take into account “the fact that each of us individually, or as groups, organizations, or systems, creates and frames the world through a series of mental models, each of which, by itself, is incomplete”. In other words, a single perspective does not provide a comprehensive or balanced view of the problem, nor an insightful appreciation of it.
Data and knowledge about DQ are crucial for any organization that wants to make informed decisions. However, despite its importance, and despite three decades of DQ research and various methodologies for DQ assessment, DQ is still not considered in everyday decisions, for two reasons: (1) research so far has led to “fragmented and sparse results in the literature” [1], with techniques and tools still missing; (2) there is a gap between the techniques and measurements developed by researchers and their actual use in practice [15].
Business Asset Management mainly uses data from Business, Financial, Human Resource, Inventory, and Maintenance systems, such as Facility Management Systems (FMS) or Financial Systems (FS). These systems could include critical or non-critical data. They could be fully developed systems or technologies designed with a specific focus in mind, such as systems used to record dynamic snapshots for control and monitoring, e.g., SCADA (Supervisory Control and Data Acquisition).
The problem with all these systems from various domains is that they do not communicate with each other since they use different technologies, store data in separate databases, and use different data structure formats
[16]. Even though these past studies do not focus on the P&U industry, it is evident that data quality is a key ingredient in the management of enterprise assets. What is lacking in the past literature is a data quality validation framework for integrating, consolidating or rationalizing the systems involved in the management of assets. The “Theoretical Framework for Integrated Asset Management” designed by CIEAM [17] highlights all the modules involved in the management of assets and further highlights the importance of a generic data validation framework.
4 The Need for Collaboration-Based Automatic Data Validation Framework for Enterprise Asset Management (CBADVFEAM)
Previous research in asset management [1–4] suggests that a common, critical concern with EAM is the lack of quality data, which trickles down to the implementation of the EAM systems. Phase one of this research focuses on highlighting the need for a collaborative framework that puts more emphasis on the engagement of key asset management business stakeholders during the implementation of EAM solutions.
4.1 The Unique Characteristics of Enterprise Asset Management in the P&U Industry
In most P&U organizations, asset management is not traditionally considered a core business activity and is mostly aligned with Information Technology (IT) organizations [8]. This traditional alignment of EAM with IT does not give asset custodians accurate knowledge of asset management or of the information contained in the asset registry. Applying the CBADVFEAM framework will engage key asset management stakeholders and improve their understanding of how asset data quality improves the efficiency of the processes involved in EAM practice.
4.2 Assets are the Foundation of the Power and Utilities Organization
Power Generation, Transmission, and Distribution assets are the lifeblood of the P&U industry, and Return-On-Assets (ROA) is the key measure of asset performance. Maximizing asset performance is always a key challenge facing P&U organizations, since the objective of the EAM practice is to optimize the lifecycle value of the deployed assets by minimizing the ownership, operating and replacement costs of the asset while ensuring uninterrupted service to customers. The performance of the asset and of the asset management program in P&U must therefore be periodically assessed, responsibly managed and continuously improved [6], all of which depend on good quality data.
4.3 Manual and Automatic Data Capture
In many Power and Utilities organizations, the asset management data is collected either automatically or manually and the sources can include sensors or technicians using field mobile devices while working on the asset. In practice the data collected are
stored in their respective systems and formats, which are not comprehensive and are business-process centric. This makes the data difficult to reuse for other processes [8]. Traditionally, the asset registry databases and their respective systems are islands of separate data dispersed throughout the organization. Access to the dispersed data by other business units is often difficult, which limits the effectiveness of the organization’s knowledge base for asset management [11]. Several data integration and conversion tools are available to consolidate and translate the data from dispersed systems such as SCADA, GIS, and OMS, but effective implementation of these tools is a complex activity. The disconnects between these systems make it extremely difficult to bring good-quality asset management data into the management decision-making process. The lack of process-to-product data transformation capabilities in linking business systems and EAM applications continues to pose significant data quality problems, which ultimately affect data-driven decision-making in the utilities industry.
5 Proposed Collaboration-Based Automatic Data Validation Framework for Enterprise Asset Management (CBADVFEAM)
The proposed framework in Fig. 3 captures the complexity of a consolidated asset management program that uses data from multiple sources. The framework emphasizes client engagement in the earlier stages of the data conversion and transformation activities and follows the same architecture design as in Fig. 3 [5]. The main capabilities of the CBADVFEAM framework are the engagement of the asset management stakeholders, automatic data validation, and the ETL layers. Stakeholder participation and engagement are driven by the eight domains described below and follow the approach indicated in Table 1 [14]. The CBADVFEAM architecture will be supported by the deployment of an Automatic Data Validation (ADV) tool focusing on large Enterprise Asset Management (EAM) and technology transformation programs in P&U or other asset-intensive industries.
5.1 Consolidation
This is the first domain in the CBADVFEAM framework. It applies the concept of matching and de-duplicating data from multiple sources. The activities in this domain are similar to traditional entity resolution, since the existing EAM legacy systems have data spanning multiple repositories and in different formats.
5.2 Classification
Data classification is about categorizing and organizing data for better analysis and decision making. There are many ways to classify data and for the CBADVFEAM framework, classification is based on how critical the asset data is to the business. In the
Fig. 3. Proposed collaboration-based automatic data validation framework for enterprise asset management (CBADVFEAM)
Table 1. Example – missing specification attribute values between GIS and Maximo for a reclosure asset
Field                  ESRI GIS              MAXIMO
Asset #                12345                 12345
Attribute              LONGITUDE             LONGITUDE
Description            Location Longitude    Location Longitude
Data Type              ALN                   ALN
Alpha Numeric Value    −90.54086             −90.54086
Numeric Value          10                    (missing)
Unit of Measure        IN                    (missing)
Table Value            50                    (missing)
Start Measure          100                   (missing)
End Measure            1000                  (missing)
Power and Utilities industry, classification of data can be based on compliance and regulations as part of a risk management program, which is applicable to nuclear power assets. One example of data classification in a P&U EAM program is the establishment of asset and location hierarchies. When P&U organizations implement an EAM solution like IBM Maximo, they often struggle with the fundamental development of their asset and location hierarchies. Typically, their structure is fragmented and incomplete, and does not allow the maintenance crew to logically find or organize their assets. For example, an asset might be associated with multiple addresses or accounts. The classification domain in the CBADVFEAM framework will enable the early discovery of such anomalies in the ETL process by fully engaging the key stakeholders.
5.3 Nomenclature
With EAM systems implementation comes a stronger need for enterprise-level asset data standardization. Data conversion and integration from legacy EAM systems require the alignment of data definitions, representation, and structure. It is therefore important that data standardization is considered a core data quality technique and deliverable to facilitate the consistency and synchronization of data across the enterprise. To reinforce the need for data standardization, below are some examples of how similar asset data can be represented differently across multiple systems:
• An asset custodian name may be captured as separate fields for First, Middle, and Last names in one system, but as a single field for Full Name in another.
• An asset custodian address in one system may have one single address line for street number and name, but two or more in another.
The goal of this domain is to have a single naming standard that is flexible enough to adapt and transform data as needed.
5.4 Extraction
The Data Extraction domain is responsible for extracting data from the multiple EAM legacy systems that will be loaded into the new EAM system. Each legacy data source has its distinct set of characteristics that need to be managed in order to extract data effectively. This domain focuses on integrating data from systems that run on different platforms, and the engagement team should (a) know which database drivers are used to connect to the databases, e.g., Open Database Connectivity (ODBC) or Java Database Connectivity (JDBC), (b) understand the data structure of the source legacy systems, and (c) know how to handle data sources on different platforms, e.g., mainframes.
5.5 Transformation
Data Transformation is the process of converting data from one format or structure into another. Data extracted from source systems is raw and not usable in its original form. Therefore, it needs to be cleansed, mapped and transformed. This is a key domain where the ETL process adds value and changes the data. This domain will focus on engaging the key asset management stakeholders early in the process in validating the
goals of the data transformation and the target of the new format. Activities in this domain include converting data types, cleaning data by removing nulls or duplicates, and performing data enrichment or data aggregation.
5.6 Loading
This domain primarily focuses on the activities in the loading component of the ETL process. After data is retrieved and combined from multiple sources (extracted), then cleaned and formatted (transformed), it is loaded into a new EAM storage system. This domain ensures that proper types of ETL tools and methods are used and that they meet requirements such as loading large volumes of data.
5.7 Validation
This domain focuses on validating the new data format for accuracy and quality against the source data before it is put into production. The Automatic Data Validation (ADV) tool will be used to validate the data and report on variances. In the context of the CBADVFEAM framework, data validation is performed after the ETL (Extraction, Transformation, and Loading) process to make sure the data meets the asset management business requirements. This domain takes into consideration the consumers’ viewpoint of “fitness for use” in conceptualizing the underlying aspects of data quality [11].
5.8 Acceptance
This domain primarily engages the key asset management stakeholders and focuses on going through a checklist to make sure that the completeness data quality dimension, which aligns with the CBADVFEAM framework, has been met.
To illustrate and support the importance of the proposed CBADVFEAM framework, the collaborative asset lifecycle management model [11] shown in Fig. 4 has been adopted to reinforce a deeper understanding of the complex processes in the asset management lifecycle that require collecting data from multiple systems. The Asset Lifecycle Domain captures the processes related to the asset, such as design, build, installation, and decommissioning; the primary stakeholders in this domain are the asset owner and the operators. The Asset Operation Domain recognizes that asset operations take a significant toll on the asset’s capabilities, which are restored periodically with parts and services acquired by the organization. The Asset Performance Management Domain includes the processes that must be performed during the asset’s operation to monitor its condition and manage its performance. Asset management is at the heart of this model since it is a collaborative activity involving multiple stakeholders responsible for the care and performance improvement of the assets.
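To make the validation step of Sect. 5.7 more concrete, the sketch below compares records extracted from a source system with the records loaded into the target EAM system and lists the variances. It is only an illustration under stated assumptions: the function name find_variances, the field names, and the sample records are hypothetical, and the paper does not describe the actual algorithm of the ADV tool.

```python
def find_variances(source, target, key="asset_num"):
    """Compare source and target records by key and list every variance found."""
    target_by_key = {row[key]: row for row in target}
    variances = []
    for row in source:
        loaded = target_by_key.get(row[key])
        if loaded is None:
            # The whole record never reached the target system.
            variances.append((row[key], "<record>", "missing in target"))
            continue
        for field, expected in row.items():
            if expected in (None, ""):
                continue  # nothing to validate when the source itself is empty
            actual = loaded.get(field)
            if actual in (None, ""):
                variances.append((row[key], field, "missing in target"))
            elif actual != expected:
                variances.append(
                    (row[key], field, f"expected {expected!r}, got {actual!r}")
                )
    return variances


# Hypothetical sample data, loosely modeled on the attributes in Table 1.
source_rows = [
    {"asset_num": "12345", "attribute": "LONGITUDE", "numeric_value": 10, "unit": "IN"},
    {"asset_num": "67890", "attribute": "LATITUDE", "numeric_value": 20, "unit": "IN"},
]
target_rows = [
    {"asset_num": "12345", "attribute": "LONGITUDE", "numeric_value": None, "unit": "FT"},
]

for variance in find_variances(source_rows, target_rows):
    print(variance)
```

Returning the variances as plain tuples keeps the report easy to review with stakeholders during the acceptance step. A real tool would also need rules for type coercion and unit conversion, which are omitted here.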
Fig. 4. Collaborative asset lifecycle management model
6 Research Method
The research selected the three largest P&U organizations in the United States that have implemented the IBM Maximo EAM system as their asset management system. Maximo is a robust platform that can transform how P&U organizations manage their physical assets by offering insights into asset usage over the entire lifecycle. The criteria for selecting these organizations are the researcher’s membership of the IBM Maximo EAM implementation team and firsthand experience of the problem that this research addresses. The research is divided into two phases. Phase one, on which this paper is based, takes a qualitative observational research approach through the definition of a specific data quality problem. Using the Data Quality (DQ) dimension of completeness, this initial phase focuses on highlighting data anomalies in an ongoing EAM implementation at one of the organizations selected for the research. The researcher has experienced the problem while working with the organization’s Data Management Team. The preliminary results of this initial phase further show that ETL processes alone are not enough to produce data that meets the qualification of “fitness for the purpose of use” without a supporting data validation framework. As the Data Management Body of Knowledge states, “One expectation of completeness indicates that certain attributes always have assigned values in a data set”. Focusing on the completeness dimension of data quality, this initial phase of the research will
simulate the ETL process previously used in the organization to load the asset specification attributes. The same attributes have been flagged by the Data Management Team as having incorrect or missing values and therefore as not meeting the DQ completeness dimension. It is also important to note that completeness is a contextual DQ dimension: it is defined not only as the degree to which values are not missing, but also as the extent to which all required data are available and sufficient to meet the required needs. This paper focuses on highlighting the need for a framework that complements the ETL process and provides preliminary results to support that need. The future research will use a questionnaire of 10 structured questions. Forty-five (45) questionnaires will be distributed, 15 at each of the three organizations, to EAM stakeholders such as asset managers, maintenance engineers, technicians, data operators, and asset custodians. Responses will be collated and analyzed using quantitative data analysis such as descriptive and inferential statistics. The quantitative analysis will enable rigorous exploration of the raw data and identification of relationships between the objects and attributes being examined. The questionnaires will include questions about the background of the organization in relation to its asset management practice, the participants’ roles, and their views about data quality issues in managing the assets. A quantitative research approach has been proposed as appropriate for future research given that there is no existing empirical research on data quality as it pertains to the implementation of EAM solutions in the Power and Utilities industry. This approach will enable the exploration of the general research question by capturing data from experienced domain practitioners through the identification of key asset data issues. It will then apply the CBADVFEAM framework to reflect the reality of current practice.
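As a rough illustration of the completeness dimension described above, the following minimal sketch computes the share of required attribute slots that carry a value. The record layout, the sample values, and the rule that `None` or a blank string counts as missing are assumptions made for this example; they are not taken from the organization's data model.

```python
from typing import Mapping, Optional, Sequence


def is_populated(value: Optional[str]) -> bool:
    """Treat None and blank strings as missing (an assumption of this sketch)."""
    return value is not None and value.strip() != ""


def completeness(records: Sequence[Mapping[str, Optional[str]]],
                 required: Sequence[str]) -> float:
    """Fraction of required attribute slots that are populated."""
    total = len(records) * len(required)
    if total == 0:
        return 1.0  # nothing is required, so nothing is missing
    filled = sum(is_populated(rec.get(attr))
                 for rec in records for attr in required)
    return filled / total


# Hypothetical recloser records with two required specification attributes.
sample = [
    {"CONTROL_TYPE": "TYPE-A", "CONTROL_SERIAL_NO": "A123"},
    {"CONTROL_TYPE": "", "CONTROL_SERIAL_NO": "B456"},
    {"CONTROL_TYPE": None},  # serial number attribute absent altogether
]
required_attrs = ["CONTROL_TYPE", "CONTROL_SERIAL_NO"]
score = completeness(sample, required_attrs)
print(f"completeness = {score:.2f}")
```

A contextual definition of completeness would also need a per-asset-type list of required attributes; here a single list is used for brevity.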
7 Data for the Initial Phase of the Research
For this phase, the researcher has been working with the Data Management Team (DMT) in one of the organizations selected for the research to correct the production asset data that were converted from a legacy EAM system into Maximo using the ETL process. The researcher is a complete participant, i.e., a member of the Maximo EAM implementation team. The organization currently has approximately 6 million power distribution assets of different types, such as transformers, switchgear, poles, and reclosers. The first data sample used for this research was extracted from the Maximo production environment and consisted of 1,000 recloser assets, each with 15 asset specification attributes, for a total of 15,000 data points (1,000 × 15 = 15,000). It is important to note that a typical recloser asset has approximately 178 specification attributes in total. For this research, fifteen (15) attributes have been selected. Reclosers are circuit breakers located at the top of electric distribution poles and are typically used to isolate a section of the feeder in fault or overload conditions and thereby minimize the number of customers without service. From maintenance and life-cycle costing perspectives, each recloser attribute specification data point must be tracked in the EAM system (IBM Maximo). For illustration purposes, the diagram below (Fig. 5) shows the physical recloser asset mounted on an electric distribution pole.
Fig. 5. Physical recloser asset on electric distribution pole
Working with the Data Management Team, the researcher reviewed each of the five thousand (5,000) recloser assets with incorrect attribute specification values using a data comparison tool (RedGate Schema/Data Compare). The researcher was able to simulate the ETL process previously used to load data into IBM Maximo by executing all the scripts in virtual environments. For illustration purposes, Table 1 below shows sample attributes with missing values in Maximo that should have been populated from the GIS system for the recloser asset according to the data conversion mapping document.
8 Research Steps
As indicated in Fig. 6 on the following page, the research steps follow a sequence from the data analysis to the ETL script execution. There are five cycles in the ETL script execution, all of which produce a variance report based on comparing the source database schema and the ADV database schema mappings. Each ETL script cycle run consists of the same attribute specification selection for the same recloser asset.
9 Data Analysis
9.1 Identification of Data Attribute
Asset specification attributes have been identified as having null values, which are not accepted by the asset management business stakeholders. This problem spans multiple asset types, such as transformers. In IBM Maximo EAM, classifications are used to categorize the data that can be used later for analysis.
Fig. 6. Flowchart – research steps
9.2 Attribute Selection
Asset Classification attributes have been selected because they can have a number of technical parameters assigned. They describe the features of a specific asset object and
at the same time they allow searching by parameter, e.g., diameter > 5”, in addition to offering functions such as data standardization. Below is a typical example showing why asset classification attribute data are important and how they can be organized in hierarchical structures.
Car (attribute: engine type)
Passenger Car (attribute: engine type, number of passengers)
Pickup Truck (attribute: engine type, load capacity)
Truck (attribute: engine type, load capacity, number of axles)
Special Purpose Vehicle (attribute: engine type, equipment type)
It is important to note that in Maximo, each sub-classification inherits attributes from the parent record, but it can also have its own specific set of attributes.
9.3 ETL Script Execution
For the cycle runs of each attribute, the ETL run script was simulated using the variables shown in Table 2 below, which were used in the initial production load.
9.4 Report Variances
Reporting the variance is the process of verifying data between the source and target systems after the ETL process, using the variables defined in the ETL run script as indicated in Table 2. During this process, the target data is compared with the source data to ensure that the Automatic Data Validation (ADV) tool is transferring data and producing results accurately. The ADV tool will use a mathematical algorithm to produce the data variance report based on the following measurements:
• Missing values
• Incorrect values
• Duplicate records
• Badly formatted values
• Missing relationships between objects & attributes

Table 2. Example – Variables – ETL Run Script
SYSTEM = INFORMATICA: RUN_ID, RUN_SEQ, WORKFLOWNAME, SESSION_NAME, TARGET_TABLE, DATABASE, SCHEMA, RUN_DATE, TOTAL_COUNT, LOAD_START_TIME, LOAD_END_TIME, LOAD_DURATION, LOAD_STATUS
SYSTEM = MAXIMO: PROCESS_START_TIME, PROCESS_END_TIME, DURATION, STATUS, EXECUTED_BY
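The five measurements listed above can be sketched as simple checks on a target data set compared with its source. This is only an illustrative sketch, not the ADV tool's actual algorithm, which the paper does not specify: the row layout, the sample values, the `formats` rule, and the `known_locations` set are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical source rows (e.g. from GIS), keyed by asset number.
source = {
    "R1": {"CONTROL_TYPE": "F6", "GROUND_TRIP_AMP": "100", "LOCATION": "L1"},
    "R2": {"CONTROL_TYPE": "F4", "GROUND_TRIP_AMP": "200", "LOCATION": "L2"},
    "R3": {"CONTROL_TYPE": "F6", "GROUND_TRIP_AMP": "50", "LOCATION": "L9"},
}
# Hypothetical target rows after the ETL run; R2 was loaded twice.
target_rows = [
    ("R1", {"CONTROL_TYPE": "F6", "GROUND_TRIP_AMP": "100", "LOCATION": "L1"}),
    ("R2", {"CONTROL_TYPE": None, "GROUND_TRIP_AMP": "2OO", "LOCATION": "L2"}),
    ("R2", {"CONTROL_TYPE": None, "GROUND_TRIP_AMP": "2OO", "LOCATION": "L2"}),
    ("R3", {"CONTROL_TYPE": "F4", "GROUND_TRIP_AMP": "50", "LOCATION": "L9"}),
]
known_locations = {"L1", "L2"}  # locations that exist in the target system
formats = {"GROUND_TRIP_AMP": re.compile(r"\d+")}  # assumed format rule


def variance_report(source, target_rows, known_locations, formats):
    """Collect the five kinds of variance for each asset in the target."""
    report = {"missing": [], "incorrect": [], "duplicate": [],
              "bad_format": [], "missing_relationship": []}
    counts = Counter(asset for asset, _ in target_rows)
    report["duplicate"] = sorted(a for a, n in counts.items() if n > 1)
    seen = set()
    for asset, row in target_rows:
        if asset in seen:
            continue  # inspect each asset only once
        seen.add(asset)
        for attr, expected in source[asset].items():
            actual = row.get(attr)
            if actual is None or actual == "":
                report["missing"].append((asset, attr))
            elif attr in formats and not formats[attr].fullmatch(actual):
                report["bad_format"].append((asset, attr))
            elif actual != expected:
                report["incorrect"].append((asset, attr))
        if row.get("LOCATION") not in known_locations:
            report["missing_relationship"].append(asset)
    return report


report = variance_report(source, target_rows, known_locations, formats)
for kind, items in report.items():
    print(kind, items)
```

Keeping one list per measurement makes it straightforward to count variances per category for each ETL cycle run.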
The Automatic Data Validation Module indicated in Fig. 7 shows that multiple objects can be loaded into the ADV tool in one run, in a defined sequence, using the data load module. For example, asset objects cannot be loaded before site or location objects are loaded, because an asset must be associated with a location or a site. The ADV tool will check for discrepancies using the schema mapping and the validation rules to produce the report.
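One way to derive such a load sequence is a topological sort over the dependencies between object types. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the object names and the dependency chain are assumptions for illustration and are not taken from the ADV tool.

```python
from graphlib import TopologicalSorter

# Object type -> object types that must be loaded before it.
# The chain below is a hypothetical example.
depends_on = {
    "SITE": set(),
    "LOCATION": {"SITE"},
    "ASSET": {"LOCATION"},
    "ASSET_SPEC": {"ASSET"},
}

# static_order() yields every object type after all of its prerequisites.
load_order = list(TopologicalSorter(depends_on).static_order())
print(load_order)
```

If the dependencies contained a cycle, `static_order()` would raise `graphlib.CycleError`, which is a useful early signal that the load plan is inconsistent.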
Fig. 7. Automatic Data Validation Module
10 Preliminary Research Findings
The preliminary results of this research indicate that, even as P&U organizations continue to invest heavily in EAM system implementation and to rely on the ETL process, asset data quality remains a big challenge. Based on the results presented in Table 3 below, it is evident that the ETL process alone does not guarantee the level of data quality needed to manage the assets from a decision-making standpoint. Even though the ETL process was executed in five (5) iterative cycles for the same attribute data sets, the results varied from one cycle run to the next. The results were not consistent with the expected outcome in Cycle 1, which was to populate all the specification attribute values. Each cycle run produced different results, which further confirms why these 1,000 recloser assets were identified in the sample as having incorrect data.
Table 3. Preliminary Results – Asset Data/Attribute Review
Cycle Number of Assets Reviewed: 1,000
Cycle Number of Attributes Reviewed: 15
Data Quality (DQ) Dimension: Completeness
Expected Attribute Value populated (YES = 1, NO = 0)

No.  Recloser asset specification attribute name  Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  YES  NO
1    ELECTRONIC_PHASE_TD_ADDER                    1        0        1        1        0        3    2
2    ELECTRONIC_PHASE_MULT                        0        0        0        1        0        1    4
3    ELECTRONIC_PHASE_FAST_ADDER                  0        1        1        1        1        4    1
4    MINIMUM_GROUND_TRIP_SETTING                  1        1        0        1        0        3    2
5    GROUND_OPERATION_SEQUENCE                    1        0        0        0        1        2    3
6    MINIMUM_PHASE_TRIP_SETTING                   1        1        1        0        0        3    2
7    PHASE_OPERATION_SEQUENCE                     1        0        0        1        1        3    2
8    CONTROL_SERIAL_NO                            0        0        1        1        1        3    2
9    CONTROL_TYPE                                 1        1        0        0        1        3    2
10   GROUND_SLOW_TRIP_SHOTS                       1        0        0        1        1        3    2
11   GROUND_SLOW_TRIP_SEQUENCE                    0        0        1        1        0        2    3
12   GROUND_FAST_TRIP_SHOTS                       1        1        1        1        1        5    0
13   GROUND_FAST_TRIP_SEQUENCE                    0        0        1        0        1        2    3
14   GROUND_TRIP_AMP                              1        1        0        1        1        4    1
15   PHASE_SLOW_TRIP_SHOTS                        1        1        1        0        1        4    1
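The YES and NO totals of Table 3 can be recomputed from the per-cycle flags, which also makes it easy to see how many attributes were populated in each cycle. The sketch below transcribes the flags from Table 3; the variable names and the per-cycle summary are additions for illustration and are not part of the original analysis.

```python
CYCLES = 5

# Populated flag (1 = YES, 0 = NO) for Cycles 1..5, transcribed from Table 3.
results = {
    "ELECTRONIC_PHASE_TD_ADDER":   [1, 0, 1, 1, 0],
    "ELECTRONIC_PHASE_MULT":       [0, 0, 0, 1, 0],
    "ELECTRONIC_PHASE_FAST_ADDER": [0, 1, 1, 1, 1],
    "MINIMUM_GROUND_TRIP_SETTING": [1, 1, 0, 1, 0],
    "GROUND_OPERATION_SEQUENCE":   [1, 0, 0, 0, 1],
    "MINIMUM_PHASE_TRIP_SETTING":  [1, 1, 1, 0, 0],
    "PHASE_OPERATION_SEQUENCE":    [1, 0, 0, 1, 1],
    "CONTROL_SERIAL_NO":           [0, 0, 1, 1, 1],
    "CONTROL_TYPE":                [1, 1, 0, 0, 1],
    "GROUND_SLOW_TRIP_SHOTS":      [1, 0, 0, 1, 1],
    "GROUND_SLOW_TRIP_SEQUENCE":   [0, 0, 1, 1, 0],
    "GROUND_FAST_TRIP_SHOTS":      [1, 1, 1, 1, 1],
    "GROUND_FAST_TRIP_SEQUENCE":   [0, 0, 1, 0, 1],
    "GROUND_TRIP_AMP":             [1, 1, 0, 1, 1],
    "PHASE_SLOW_TRIP_SHOTS":       [1, 1, 1, 0, 1],
}

# Per-attribute totals (the YES and NO columns of the table).
yes_counts = {name: sum(flags) for name, flags in results.items()}
no_counts = {name: CYCLES - yes for name, yes in yes_counts.items()}

# Number of attributes populated in each cycle.
populated_per_cycle = [sum(flags[c] for flags in results.values())
                       for c in range(CYCLES)]

# Attributes that were populated in every cycle.
always_populated = [n for n, yes in yes_counts.items() if yes == CYCLES]

for cycle, count in enumerate(populated_per_cycle, start=1):
    print(f"Cycle {cycle}: {count}/{len(results)} attributes populated")
print("Populated in all cycles:", always_populated)
```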
11 Conclusions and Future Work
The successful implementation of an EAM solution such as IBM Maximo is a fundamental building block for improving the performance and reliability of assets. In many cases, the key asset management stakeholders view the implementation of the EAM solution as a fix for their problems with asset maintenance. This view has led to many unrealistic expectations, especially when the solution does not deliver the much needed or anticipated improvements. It is also important to note that IBM Maximo EAM is just a tool to facilitate asset management processes. More significant to the overall success of the EAM implementation is the quality of the data migrated into Maximo from many years of decentralized systems and processes. Good quality asset data is required to manage the assets operationally, strategically, and tactically. It directly impacts P&U asset management digital transformation initiatives, specifically in asset life-cycle costing. Dependency on ETL processes and tools when asset data is consolidated from multiple systems during EAM system implementations will continue to be the de facto standard. To address the asset data quality issues experienced during the implementation of EAM solutions in P&U asset management programs, a consolidated framework is required, like the one proposed in this paper, that combines people, processes, and technology and leverages automatic data validation. Future research will include basic statistical techniques on which to base any asset data-gathering activity, such as sample size and related rationale, testing and simulation, acceptance criteria, and observed outcomes/results. These techniques will be based on existing standards such as those described in ISO 8000-8:2015, Data quality - Part 8: Information and data quality: Concepts and measuring.
Finally, in future research the proposed CBADVFEAM framework/data validation tool shall be subjected to basic test method validation (TMV) steps to confirm its suitability, i.e., reliability, repeatability, and limitations, for its intended use.
References
1. Woodhouse, J.: Asset Management. The Woodhouse Partnership Ltd (2001). http://www.plant-maintenance.com/articles/AMbasicintro.pdf. Accessed 10 Apr 2004
2. Woodhouse, J.: Asset Management: concepts & practices. The Woodhouse Partnership Ltd. (2003)
3. Eerens, E.: Business Driven Asset Management for Industrial & Infrastructure Assets, Le Clochard, Australia (2003)
4. IPWEA, International Infrastructure Management Manual, Australia/New Zealand Edition (2002)
5. Almorsy, M., Grundy, J., Ibrahim, A.S.: Collaboration-Based Cloud Computing Security Management Framework (2011)
6. Lin, S., Gao, J., Koronios, A.: The Need for A Data Quality Framework in Asset Management (2006)
7. Mitroff, I.I., Linstone, H.A.: The Unbounded Mind: Breaking the Chains of Traditional Business Thinking. Oxford University Press, New York (1993)
8. Levitan, A.V., Redman, T.C.: Data as a resource: properties, implications and prescriptions. Sloan Manag. Rev. 40(1), 89–101 (1998)
9. Linstone, H.A.: Decision Making for Technology Executives: Using Multiple Perspectives to Improve Performance. Artech House Publisher (1999)
10. Werhane, P.H.: Moral imagination and systems thinking. J. Bus. Ethics 38, 33–42 (2002)
11. White Paper - ARC, “Asset Information Management – A CALM Prerequisite”, ARC Advisory Group (2004)
12. Wang, R.Y., Strong, D.M.: Beyond accuracy: what data quality means to data consumers. J. Manag. Inf. Syst. 12(4), 5–33 (1996)
13. Koronios, A., Lin, S., Gao, J.: A data quality model for asset management in engineering organisations (2005)
14. Lin, S., Gao, J., Koronios, A., Chanana, V.: Developing a data quality framework for asset management in engineering organisations. Int. J. Inf. Qual. 1(1), 100–126 (2007)
15. King, T.M., Crowley-Sweet, D.: Best practice for data quality enables asset management for rail. For Improved Decision-Making: A Methodology for Small and Medium-Sized Enterprise, Costing Models for Capacity Optimization In Industry 4.0: Trade-Off, a Fraunhofer Institute for Manufacturing Engineering and Automation IPA, Nobelstrasse 12, 70569 Stuttgart, Germany
16. Koronios, A., Nastasie, D., Chanana, V., Haider, A.: Integration through standards – an overview of international standards for engineering asset management (2007)
17. Stapelberg, R.F.: Assessment of Integrated Asset Management. Professional Skills Training Courses: CIEAM - Cooperative Research Centre for Integrated Engineering Asset Management (2006)
18. Design and Implementation of Enterprise Asset Management System Based on IOT Technology (2015)
19. MRO: Strategic Asset Management Methodology. Executive White Paper (2004)
20. Steed, J.C.: Aspects of how asset management can be influenced by modern condition monitoring and information management systems. IEE (1988)
21. Sandberg, U.: The coupling between process and product quality: the interplay between maintenance and quality in manufacturing. Euromaintenance (1994)
EEG Analysis for Predicting Early Autism Spectrum Disorder Traits
Parneet Kaur Saran and Matin Pirouz
California State University, Fresno, CA 93740, USA
Abstract. Autism Spectrum Disorder (ASD) is a neurodevelopmental disorder associated with impairments in socio-communication and relationships, and with restrictions in thought, imagination, etc. Because autism has been identified as genetic and varies from person to person and with their socio-communication, it is important for computer science researchers to analyze big data visualizations using the phenotypic features (age, sex, etc.) of each patient. This paper aims to develop a framework in which the subject does not need to push emotions. This is done with machine learning algorithms and affective computing to produce a better man-machine interface. The emotions of disabled people were deduced from Electroencephalogram (EEG) signals by placing EEG headset electrodes on their scalp. To classify the emotions, to differentiate a person as autistic or neurotypical, and to extract features (wavelength, waveform, mean, etc.) from EEG signals, the machine learning algorithms K-nearest neighbors (KNN), Random Forest Classifier, Support Vector Machines, and Logistic Regression were used. Based on datasets for autism in toddlers and autism in adults, a prediction model is developed that predicts the chance of ASD characteristics so that parents/guardians can take early steps. The performance rate of every method applied was determined in order to choose the best classifier model; the precision rate achieved for the best classifier model is 73%. This dataset supports the hypothesis that an electroencephalogram reveals information about the performance of the proposed methods and has the potential to benefit individuals with ASD.
Keywords: Electroencephalogram · Brain-computer interface · Classification · Machine learning
1 Introduction
Autism Spectrum Disorder (ASD) is a lifelong neurodevelopmental disorder characterized by impairment in socio-environmental interaction. ASD is, for the most part, viewed as a lifelong disability of as yet uncertain etiology, without an established confirmatory laboratory test, and so far without well-established curative pharmacological or behavioral therapy [10]. In 2016, Manning et al., using birth certificates and Early Intervention data, reported that in the Commonwealth of Massachusetts between 2013 and 2018, the incidence of ASD diagnosed by three years of age increased from 56 to 93 infants per 10,000 [21].
© Springer Nature Switzerland AG 2021. K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 658–675, 2021. https://doi.org/10.1007/978-3-030-63089-8_43
These data have spawned research exploring etiologies as well as diagnosis, in terms of EEG or neuro-imaging, in order to establish early intervention and enable diagnosis at the earliest stage. ASD is widely considered to be a genetically determined disorder; three well-known twin studies estimate heritability at around 90% [7]. Traditional diagnosis was based mainly on behavioral tests and learning activities that involve statistical diagnosis of mental disorders. Different types of autism were once classified as different disorders, but following recent behavioral analysis they all fall under one category called Autism Spectrum Disorder. Computer-aided diagnosis is a better alternative; it is not intended to diagnose by itself but assists clinicians as a supporting tool [14]. Many children with ASD show an affinity for computers, which has prompted the use of computer-assisted technologies (CAT) to improve facial affect recognition skills in people with ASD [4]. Digital media applications can help transform children's learning process, offering a range of opportunities and enabling the creation of environments that support meaningful activities such as language enhancement, emotion recognition, social-communication skills, educational development, and so on [2,13]. In this paper, emotion classification plays a vital role in building a bridge between humans and machines and in framing the Brain-Computer Interface system, and computer-aided diagnosis is performed using Electroencephalogram (EEG) signals [8,9]. EEG signals have recently been used to recognize attentional state, engagement level, mental workload, and emotions in different applications [19]. The advantage of emotion classification and of finding differences in brain activity is that it can accurately distinguish between a person with ASD and a neurotypical person, which leads to better diagnosis and treatment [1].
An electroencephalogram (EEG) measures the electrical signals generated by the brain through electrodes placed at different positions according to a standard distribution. The signals are acquired using EPOC headsets. EPOC+ is a low-cost electroencephalograph [20], designed as a teaching and mental training device to help people with socio-environmental problems by only using body movements. The high temporal resolution of EEG recordings allows the assessment of brain dynamics on the millisecond timescale [26]. EEG signals have recently been used to recognize attentional state, engagement level, mental workload, and emotions in different applications [4,22]. The EEG recordings are contaminated with large amounts of noise, which makes the analysis difficult, so robust features are extracted from the recordings. The rapid growth in autism cases necessitates datasets related to behavioral traits [11]. The proposed EEG-based BCI for the VR-based driving system comprises three main modules: a Signal Pre-processing module, a Feature Generation module, and a Classification module [15]. The first step is to choose the best classification model to classify toddlers and adults as having ASD or as typically developing. The machine learning algorithms used in this paper are K-Nearest Neighbour (KNN), Random Forest Classifier, and Logistic Regression; KNN and Random Forest Classifier achieve better precision than Logistic Regression. Given the neurodevelopmental nature of the disorder, including phenotypic features (age, sex, etc.) would improve classification accuracy, and deep
learning networks are designed for specific multimodal tasks of combining phenotypic information into neural networks [23,27]. The raw EEG signals collected from the participant's scalp are first fed into a Signal Pre-processing module to remove outliers, correct EOG and EMG artifacts, and improve the signal-to-noise ratio [1]. The Feature Generation module then transforms the time-series signals into a set of meaningful features that the Classification module uses to detect the engagement level, emotional states, and mental workload of the participant [5,25]. In this work, the behavioral ratings data and the EEG-based data can be used as ground truth to train a group of models that could be used in the Classification module to classify engagement, enjoyment, frustration, boredom, and difficulty [6,17]. These results also suggest that an EEG-based BCI could be used in a Virtual Reality-based system to enhance human-computer interaction and, more importantly, to improve the system's efficiency through individualized system adaptation based on multimodal sensory data and performance data [3].
1.1 Contributions
This paper aims to develop a classifier that can accurately distinguish whether a subject is imagining a task that is familiar or new. In addition, it aims to build a framework that identifies emotional and mental state in order to classify between autistic and typically developing patients. The framework takes as input a pre-processed data set containing EEG signals, from which visualizations are drawn. The different feature generation models defined in the methodology are then applied, and the best classifier model is selected from them to classify systematically between autistic and typically developing patients.
– Science and technology are developing at a high rate, yet some people feel a lack of emotion or need emotional comfort. Given this situation, this research aims to bring affective computing into the field of human-computer dialogue using Electroencephalogram signals and epochs. Secondary goals include providing insight into which brain regions and frequency bands are associated with each of the respective classes. If a deep learning approach is found to be viable, these insights may correspond to latent features found within the neural network. Other insights may be obtained from more traditional data processing and machine learning techniques.
– It also aims to provide a systematic evaluation of feature generation models and of the features that influence Autism Spectrum Disorder, such as the effect of jaundice at birth, the ethnicity of the child, and the effect of age, cognitive ability, behavior, emotional responses, interpersonal communication, and so on. Man-machine dialogue should exhibit not only high cognitive skills but also emotional intelligence, so that symptoms can be diagnosed easily.
Figure 1 presents the framework proposed herein.
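Selecting the best classifier model by precision, as described above, can be sketched as follows. The labels and the per-model predictions are hypothetical placeholders chosen only to make the example runnable; they are not results from this paper.

```python
def precision(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP) for the given positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical ground truth (1 = ASD traits, 0 = typically developing)
# and hypothetical predictions from three candidate models.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "knn":                 [1, 0, 1, 0, 0, 1, 1, 0],
    "random_forest":       [1, 0, 1, 1, 0, 0, 0, 0],
    "logistic_regression": [1, 1, 0, 1, 1, 0, 1, 0],
}

scores = {name: precision(y_true, y_pred)
          for name, y_pred in predictions.items()}
best_model = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: precision = {score:.2f}")
print("best:", best_model)
```

In practice the predictions would come from held-out data or cross-validation rather than from a single fixed list.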
Fig. 1. Framework describing the process of paper
2 Related Work
The development of an automated system to analyze brain signals would help clinicians diagnose disorders with greater speed and accuracy [3]. Computer-aided diagnosis using EEG signals has been proposed in several studies. In the work presented by Justin Eldridge [12], the classification methods Logistic Regression, KNN classifier, and Random Forest Classifier were used to classify between autistic children and typically developing children and to choose the best model. The classification accuracy in that work was 80% for toddlers aged 12–36 months and 71% for adults aged 20–60 years. The main approach followed here consists of first identifying Autism Spectrum Disorder and the purpose of this report, then advancing to data
collection, that is, reading signals from EPOC headsets, and lastly the methods and models used; the following papers have helped to reach this goal step by step. As this research deals with human-computer interaction, or put simply, brings computer science research to bear on identifying Autism Spectrum Disorder, Chung Hyuk Park [25] stressed the importance of software applications, stating that digital media applications can help transform children's learning process, offering a range of opportunities and enabling the creation of environments that support meaningful activities such as language enhancement, emotion recognition, social-communication skills, educational development, and so on. The question then arose of how this can be done, or how to combine psychology with computer science. In Hahn's research [15,16], advanced data analysis techniques such as Logistic Regression were proposed for the data set; such techniques generate predictive algorithms from the data themselves, making the model data-driven. An accuracy of 90% was achieved with a radial basis function classifier [1]. Past research on autism shows that early intervention can improve development; however, diagnosis currently relies on clinical observation of behavior, an obstacle to early diagnosis and treatment. Most children are not diagnosed with ASD until after the age of 4 [29]. To apply methods and models in further work on data, William J. Bosl [3] computed nonlinear features from the EEG signal and used them as input to statistical learning methods. Prediction of the clinical diagnostic outcome of ASD or not ASD was highly accurate when using EEG measurements from as early as 3 months of age. Specificity, sensitivity, and PPV were high, exceeding 95% at some ages. In addition, Simon's work [26] captured data from several streams in real time: voice, position, finger tapping, and task performance.
Analyses show that people with ASD tapped their fingers more slowly than controls, and that both groups tapped less rhythmically when the cognitive load was high, especially when they made errors in the cognitive task. For big data analysis and visualization, Di Martino [1] performed exploratory analyses in which an array of regional metrics of intrinsic brain function converged on common loci of dysfunction in ASD and highlighted less commonly explored regions such as the thalamus. The survey of the ABIDE R-fMRI data sets provides demonstrations of both replication and novel discovery. The main task of classification over Electroencephalogram signals was taken from Erik Bates [24], whose paper deals with a network approach to characterizing the spatio-temporal dynamics of EEG data in typically developing and ASD youths. EEG recorded during both wakeful rest (resting state) and a social–visual task was analyzed using cross-correlation analysis of the 32-channel time series to produce weighted, undirected graphs corresponding to functional brain networks. Apostolatos et al. [1] examined and analyzed the white matter structure of pre-teen and adolescent individuals with autism spectrum disorder (ASD) as collected through diffusion tensor imaging and developed with tractography. Given the pivotal role of axon tracts in coordinating communication among brain regions, and the inherent network-like structure of these tracts, the authors argue that network and graph analysis may provide greater insight
and understanding of the relationship between ASD and white matter structure [18,24]. The papers above each address one particular area through independent research and advances; this paper aims to carry out these analyses collectively and consecutively over the collected data set [28].
3 Our Approach
The main focus is on ‘emotion in the computer field’ and on how a computer can be intelligent without emotions. According to past studies, computers rely mostly on logical reasoning systems, not on emotional capability. In Autism Spectrum Disorder, individuals do not understand and accept the learning content, because of which they experience anxiety and unpleasant, fearful emotions. Therefore, in computer-based environments, tracking emotional state is essential, and with the rapid development of information technology, man-machine dialogue is constantly changing. Data sets covering facial expression, gesture analysis, speech recognition, and expression of emotion are considered, and the machine learning techniques explained below are applied to them. Affective computing research will continue to deepen the understanding of people's emotional state; awareness of context improves the computer's ability to become increasingly “intelligent” and to interact with people in a natural, warm, and lively manner [1]. The dataset containing EEG samples was evaluated; the samples were processed by taking a 30-s segment from the beginning of the recording, when the subject was sitting quietly. The selection of the segment was not based on review. The research therefore starts with classification on the openly accessible EEG data sets, which contain information ranging from age to ethnicity together with the values of the EEG signals collected by placing EPOC headsets on the scalp. The feature generation models are then applied, with the aim of choosing the best model. During the preprocessing stage, it is important for the data to be passed through a signal processing block to remove noise; in this paper, however, data already free of artifacts and noise are used. Table 1 presents a summary of the symbols.
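Taking a 30-s segment from the start of a recording and summarizing it with simple features can be sketched as follows. The 128 Hz sampling rate, the synthetic sine-wave signal, and the particular features are assumptions for illustration; the paper does not state which rate or exact feature set was used.

```python
import math
from statistics import mean, pstdev


def first_segment(samples, sampling_rate_hz, seconds=30):
    """Return the first `seconds` of a single-channel recording."""
    n = int(sampling_rate_hz * seconds)
    return samples[:n]


def basic_features(segment):
    """A few simple time-domain features of a segment."""
    return {
        "mean": mean(segment),
        "std": pstdev(segment),
        "peak_to_peak": max(segment) - min(segment),
    }


# Synthetic stand-in for one EEG channel: 60 s of a 10 Hz sine wave
# sampled at an assumed 128 Hz.
fs = 128
signal = [math.sin(2 * math.pi * 10 * t / fs) for t in range(60 * fs)]

segment = first_segment(signal, fs)
features = basic_features(segment)
print(len(segment), features)
```

With real recordings the same functions would be applied per channel, after artifact removal.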
The methods used in the classification are: 1) Logistic Regression: This technique is a classification approach that uses the logistic function to model binary and multi-class dependent variables; in its binary form the dependent variable has two possible values. It performs discrete categorization of sampled data, and the output is measured with variables that have two possible outcomes. The analogy between linear and logistic regression can be explained with the regression hypothesis:
$$\operatorname{logit}(p) = \log\frac{p}{1-p} \tag{1}$$
where p is the probability of presence of the characteristic of interest. 2) K-Nearest Neighbor: It seeks to classify samples based on averages over their nearest neighbors. KNN can be used both for classification and regression
P. K. Saran and M. Pirouz
Table 1. Symbols
| Symbol | Definition |
| L | Distance constant |
| M | Matrix |
| x and y variables | Query points |
| e | Vector of all 1s |
| C | Upper bound |
| Q | n by n positive semi-definite matrix |
| K | Kernel |
| w | Perform mapping of training vectors |
| hm | Normalized feature for importance of tree i in j |
predictive models. When performing classification with K-nearest neighbor, the algorithm takes a majority vote among the K nearest instances to decide which class an observation belongs to. Euclidean distance is used as the distance measure. The performance depends entirely on k. Here k was searched through iterations of k-fold cross-validation (folds 0 to 5), with the values of k for KNN ranging from 2 to 50; this is not a preferred way to find the value of k. As the size of the data increases, the method becomes less efficient, creating the need for a feature decomposition algorithm.
$$p_{ij} = \frac{\exp(-\lVert Lx_i - Lx_j \rVert^2)}{\sum_{k \neq i} \exp(-\lVert Lx_i - Lx_k \rVert^2)}, \qquad p_{ii} = 0 \tag{2}$$
The above formula is based on the Euclidean distance between two points: $\lVert Lx_i - Lx_j \rVert$ is the distance between the query points $x_i$ and $x_j$, and L is constant. It can also be expressed through the Mahalanobis distance metric:
$$\lVert L(x_i - x_j) \rVert^2 = (x_i - x_j)^T M (x_i - x_j) \tag{3}$$
where $M = L^T L$ is a matrix. 3) Support Vector Machines: Support vector machines are a supervised learning method used for classification and outlier detection. The method is effective in high-dimensional spaces and remains usable when the number of dimensions is greater than the number of samples. Different kernel functions can be used for the decision function. The main disadvantage is that the method does not directly provide probability estimates; these are calculated using five-fold cross-validation. In addition, the compute and storage requirements increase rapidly with the number of training vectors.
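As a numerical check of Eqs. (2) and (3) from the K-nearest-neighbor description above, the following minimal sketch evaluates both on toy data. The points `X` and the 2 × 2 matrix `L` are hypothetical values chosen for illustration, and treating L as a small matrix is an assumption (Table 1 describes it only as a distance constant).

```python
import math

def matvec(L, x):
    """Multiply matrix L (a list of rows) by vector x."""
    return [sum(l_rc * x_c for l_rc, x_c in zip(row, x)) for row in L]

def sq_dist(L, xi, xj):
    """Squared distance ||L xi - L xj||^2 used in Eq. (2)."""
    a, b = matvec(L, xi), matvec(L, xj)
    return sum((p - q) ** 2 for p, q in zip(a, b))

def neighbor_probs(L, X, i):
    """Eq. (2): p_ij for a fixed i, with p_ii = 0."""
    weights = [0.0 if j == i else math.exp(-sq_dist(L, X[i], X[j]))
               for j in range(len(X))]
    total = sum(weights)
    return [w / total for w in weights]

def mahalanobis_sq(L, xi, xj):
    """Right-hand side of Eq. (3): (xi - xj)^T M (xi - xj), M = L^T L."""
    n = len(xi)
    M = [[sum(L[k][r] * L[k][c] for k in range(len(L))) for c in range(n)]
         for r in range(n)]
    d = [p - q for p, q in zip(xi, xj)]
    return sum(d[r] * M[r][c] * d[c] for r in range(n) for c in range(n))

# Hypothetical data: three 2-D query points and an arbitrary 2x2 L.
L = [[1.0, 0.5], [0.0, 2.0]]
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

p0 = neighbor_probs(L, X, 0)           # probabilities for point 0
lhs = sq_dist(L, X[1], X[2])           # left-hand side of Eq. (3)
rhs = mahalanobis_sq(L, X[1], X[2])    # right-hand side of Eq. (3)
```

The probabilities for a given point sum to 1 and favor the closer neighbor, and the two sides of Eq. (3) agree up to floating-point error.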
EEG Analysis for Predicting Early Autism Spectrum Disorder Traits
Given training vectors
$$x_i \in \mathbb{R}^p, \quad i = 1, \ldots, n \tag{4}$$
in two classes, and a vector $y \in \{1, -1\}^n$, SVM solves the following primal problem:
$$\min_{w, b, \zeta} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \zeta_i \quad \text{subject to } y_i (w^T \phi(x_i) + b) \ge 1 - \zeta_i, \; \zeta_i \ge 0, \; i = 1, \ldots, n \tag{5}$$
Its dual is
$$\min_{\alpha} \; \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{subject to } y^T \alpha = 0, \; 0 \le \alpha_i \le C, \; i = 1, \ldots, n \tag{6}$$
where e is the vector of all ones, C > 0 is the upper bound, Q is an n by n positive semi-definite matrix,
$$Q_{ij} \equiv y_i y_j K(x_i, x_j), \tag{7}$$
where
$$K(x_i, x_j) = \phi(x_i)^T \phi(x_j) \tag{8}$$
is the kernel. Here the training vectors are implicitly mapped into a higher (possibly infinite) dimensional space by the function φ. 4) Random Forest Classifier: Like support vector machines, this is a supervised machine learning technique. Scikit-learn provides an extra variable with the model that shows the contribution of each feature to the prediction. A random forest is a set of multiple decision trees and prevents overfitting by building the trees on subsets of the data.
$$F(x) = \sum_{m=1}^{M} \gamma_m h_m(x) \tag{9}$$
where F(x) is the combined output over the total number of trees M.
$$h_m = \arg\min_{h} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + h(x_i)) \tag{10}$$
where $h_m$ is the normalized feature for importance of tree i in j.
To calculate the importance of each feature, Random Forest uses the mean decrease in impurity, a parameter for variable selection. As shown in Fig. 2, the larger the decrease, the more significant the variable. A larger decrease is obtained with this model, so of all the methods used, Random Forest is chosen as the best classifier model. In the graph, the x-axis gives the error rate, that is, the percentage rate of the different values obtained after applying the formulas defined above, whereas k is the kernel value. The minimum decrease is obtained when k equals 13.
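The idea behind mean decrease in impurity can be illustrated with a simplified sketch. It assumes Gini impurity as the impurity measure and averages a feature's decrease over trees without the per-node sample weighting and normalization that a full random forest implementation applies; the labels and the helper names `gini`, `impurity_decrease` and `mean_decrease` are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Impurity decrease obtained by splitting parent into left and right."""
    n = len(parent)
    return (gini(parent)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))

def mean_decrease(decreases_per_tree):
    """Average a feature's impurity decrease over the trees that used it."""
    return sum(decreases_per_tree) / len(decreases_per_tree)

# Hypothetical labels: 1 = ASD, 0 = not ASD.
parent = [0, 0, 0, 1, 1, 1]
clean_split = impurity_decrease(parent, [0, 0, 0], [1, 1, 1])  # separates classes
weak_split = impurity_decrease(parent, [0, 0, 1], [0, 1, 1])   # mixes classes
```

A feature whose splits separate the classes cleanly yields a larger decrease than one whose splits leave the classes mixed, which is why a larger mean decrease marks a more significant variable.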
4
4.1 Experiment Setup
The method was implemented on a personal setup using the following libraries and software tools: Jupyter Notebook: Notebook documents are produced by the Jupyter Notebook application and contain both computer code (for example Python) and rich text elements (paragraphs, equations, figures, links, etc.). A notebook is both a human-readable document containing the analysis description and the results (figures, tables, and so on) and an executable document that can be run to perform data analysis. Installing Jupyter Notebook in the conda environment requires the command: pip3 install jupyter. Data manipulation, transformation, simulation, statistical modeling and machine learning are performed in Jupyter Notebook.
Fig. 2. Mean Decrease Impurity graph for best variable selection in random forest.
Anaconda: Anaconda is a package manager, an environment manager, a Python distribution, and a collection of over 1,000 open-source packages. It is free and easy to install, and it offers free community support. More than 150 packages are automatically installed with Anaconda. We used the corresponding command to install the current version of the Anaconda manager. It is also possible to make your own custom packages using the “conda build” command. Some of the packages and tools that we required were made available through Conda:
– Scikit-learn for implementing machine learning in Python. It offers simple and efficient tools for data mining and data analysis; is accessible to everybody and reusable in various contexts; is built on NumPy, SciPy, and matplotlib; and is open source and commercially usable. Most of the built-in machine learning algorithms that we used, such as SVM and multinomial NB, are defined in scikit-learn.
– Pandas is an easy-to-use data structure and data analysis tool for the Python programming language. Reading and analysis of a CSV dataset can be done with pandas, which also has various features that allow us to format datasets and perform cleaning.
– NumPy provides high-level mathematical functions that operate on arrays. NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data types can be defined, which allows NumPy to integrate seamlessly and speedily with a wide variety of databases.
– NLTK is one of the leading platforms for working with human language data in Python. The module we use from this library is the Snowball stemmer; Snowball is a small string processing language designed for creating stemming algorithms for use in information retrieval.
– Requests is an Apache2-licensed HTTP library written in Python. It is designed to be used by humans to interact with the language, so you don’t have to manually add query strings to URLs or form-encode your POST data. Requests lets you send HTTP/1.1 requests using Python. With it, you can add content such as headers, form data, multipart files, and parameters via simple Python libraries. It also lets you access the response data in the same way. This library is required for getting the datasets and can also be used in future work to include streaming data.
– The collections module implements high-performance container data types (beyond the built-in types list, dict, and tuple) and includes many useful data structures that can be used to store information in memory, such as namedtuple, deque, Counter, OrderedDict and defaultdict.
4.2 Data-Sets
The data set was developed using a mobile app called the ASD Test app, created by Dr. Fadi Fayez Thabtah (fadifayez.com) to screen for autism. In this dataset, ten behavioral features were recorded, plus other individual characteristics that have proved effective in detecting ASD cases from controls in behavior science. For the attributes A1-A10, the possible answers to the questions are Always, Usually, Sometimes, Rarely and Never, and the items’ values are mapped to “1” or “0” in the dataset. The remaining features are collected from the “submit” screen in the ASD Test app. All participants had a clinical diagnosis of ASD and scored at or above the clinical cut-off on the Autism Diagnostic Observation Schedule [24]. To ensure the anonymity of the subjects, private data (name, address, etc.) are not published. The other dataset consists of adult information, to find the difference in autism rate between toddlers and adults. All the datasets also include healthy volunteer subjects. The EEG dataset used has 16 channels, which record the EEG signals. The EEG dataset is mainly for research and initial analysis, not for clinical use. This project uses pre-processed data to perform statistical evaluation without performing any cleaning or manipulation.
4.3 Results and Discussion
The research deals with the main task of classifying patients as ASD or not ASD based on the best classifier model. To obtain results, the methods chosen were the KNN classifier, Logistic Regression, the Random Forest classifier and Support Vector Machines. Figure 3 illustrates the classification graph, and Fig. 4 presents the results for jaundice. Figures 5 and 6 present the age distribution and ethnicity, respectively.
Fig. 3. The classification graph representing ASD or not ASD from the datasets.
Fig. 4. The graph determining the link of ASD with children born with jaundice
Fig. 5. Age Distribution Graph of ASD positive toddlers and adults, respectively
Fig. 6. Positive ASD results in toddlers and adults of different ethnicity in USA
The visualization is drawn from the pre-processed dataset. Fig. 3 shows the classification, where the x-axis represents age and the y-axis represents the decision rate based on the percentage of people in the group. It spans the data from the minimum age in the dataset to the maximum age, together with the decision class labelled as healthy or autistic brain. It can be inferred that autism begins from birth and is a lifelong condition; even if it develops noticeably at an early age, it becomes typical by adult age and is then at its highest normal peak.
Fig. 7. Heat Map showing the intensity of features
The classification also involves choosing the best classifier model. The machine learning methods gave a detailed report of support, F-score, precision and recall at various rates for the neurotypical, healthy and autistic brain. Table 2 presents the performance of the random forest classifier, and Tables 3 and 4 show the performance of KNN and logistic regression, respectively. Table 2 gives the accuracy rate of the Random Forest classifier; of the three models, Random Forest and the KNN classifier perform the same overall and better than Logistic Regression. Random Forest and K-nearest neighbor perform best, with an overall precision of 71% and a standard accuracy of 35%. On the other hand, Table 4 shows that Logistic Regression performs with an accuracy of 66% for the decision class with
Table 2. Performance Rate of Random Forest Classifier
| Random Forest Classifier Rates | Precision | f1-score | Recall | Support |
| Healthy brain | 0.71 | 0.35 | 0.47 | 78 |
| Autistic brain | 0.71 | 0.92 | 0.88 | 133 |
| Neuro-typical brain | 0.71 | 0.71 | 0.67 | 211 |
| Between healthy and Neuro-typical brain | 0.71 | 0.71 | 0.71 | 211 |
| Between autistic and Neurotypical brain | 0.71 | 0.63 | 0.63 | 211 |
Table 3. Performance Rate of K-Nearest Neighbor
| K-nearest neighbor Rates | Precision | f1-score | Recall | Support |
| Healthy brain | 0.71 | 0.35 | 0.47 | 78 |
| Autistic brain | 0.71 | 0.92 | 0.88 | 133 |
| Neuro-typical brain | 0.71 | 0.71 | 0.67 | 211 |
| Between healthy and Neuro-typical brain | 0.71 | 0.71 | 0.71 | 211 |
| Between autistic and Neurotypical brain | 0.71 | 0.63 | 0.63 | 211 |
Table 4. Performance Rate of Logistic Regression
| Logistic Regression Rates | Precision | f1-score | Recall | Support |
| Healthy brain | 0.73 | 0.14 | 0.24 | 78 |
| Autistic brain | 0.66 | 0.97 | 0.78 | 133 |
| Neuro-typical brain | 0.69 | 0.66 | 0.58 | 211 |
| Between healthy and Neuro-typical brain | 0.66 | 0.66 | 0.66 | 211 |
| Between autistic and Neurotypical brain | 0.70 | 0.56 | 0.51 | 211 |
a value of 1. Support is defined as the total number of correct responses among the samples that lie in the same class. Precision is defined as
$$\text{Precision} = \frac{TP}{TP + FP}$$
where TP stands for true positives, that is, correct predictions, and FP stands for false positives, meaning incorrect predictions classified as true. The recall score is given by
$$\text{Recall} = \frac{TP}{TP + FN}$$
Recall is also called sensitivity or the true positive rate: the ratio of correct positive predictions to the total number of positive examples. FN stands for false negatives, the positive examples that were incorrectly left out of the predicted positives. The F1 score is a combination of precision and recall. The F1 score is good if there are few false positives and few false negatives.
$$F_1 = 2 \times \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
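The three formulas above can be sketched directly in code. The counts used here are hypothetical and are not taken from Tables 2 to 4; the function names are likewise chosen only for illustration.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Recall = TP / (TP + FN)."""
    return tp / (tp + fn)

def f1_score(p, r):
    """F1 = 2 * precision * recall / (precision + recall)."""
    return 2 * p * r / (p + r)

# Hypothetical confusion counts for one class.
tp, fp, fn = 8, 2, 4
p = precision(tp, fp)   # 8 of 10 predicted positives are correct
r = recall(tp, fn)      # 8 of 12 actual positives are found
f1 = f1_score(p, r)
```

Because F1 is the harmonic mean of precision and recall, it lies between the two values and is pulled toward the smaller one.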
The tables display the performance of the top models, that is, the Random Forest classifier and K-nearest neighbor (which perform at the same overall rate) and Logistic Regression. The performance of the other methods is not reported in the tables because their results are consistently lower than the selected baseline. The introduction of deep learning methods resulted in increased accuracy and tuned the features to provide the best results. Furthermore, visualizations were produced from the given dataset. From Fig. 4 it can be seen that jaundice has a link with ASD patients. The count is almost 6–7 times higher for non-ASD patients (in adults) and 2–3 times higher (in toddlers) for ASD-positive subjects born without jaundice, whereas according to reports it is around 10 times. Children born with jaundice have a strong link with ASD. Also, ASD is more common among boys than girls. For the next result, that is, Fig. 5, adults with positive ASD are around 20 or 30 years of age, whereas toddlers are around 36 months. In adults the number decreases as age increases, whereas in toddlers the number increases with age. Adults develop strategies that help them cope better with age. For toddlers, the significant signs of autism appear around 3 years of age. Similarly, the positive autism results for toddlers and adults of different ethnicities are examined (Fig. 6), and the proportion is high among children aged up to 36 months. The heat map is drawn as shown in Fig. 7, in which every feature is plotted against every other; as the depth of shade increases, the features are more directly proportional to each other in terms of their effect. The green color denotes the constant value.
5 Conclusion
With the causes unknown and cures unavailable, and in the absence of interventions, the purpose of action is to work toward lowering costs. Collaboration among people in educating individuals is very important. Digital media applications can help transform the learning process of children, offering a range of distinctive possibilities and enabling environments with meaningful activities such as language enhancement, emotion recognition, social communication skills, academic enhancement, etc. In this paper, based on the KNN classifier model on the above dataset, if a parent provides the toddler’s age, gender, ethnicity, jaundice at birth, and whether any relative has ASD traits, the model can predict whether or not the toddler has ASD with a precision of 71%.
For future work, we aim to predict the future state of the brain in terms of physical and emotional state and decision making. It is important to connect the work with epochs. The above experimentation also shows that accurate results can be achieved using machine learning techniques on a pre-processed data set before testing on real subjects.
Acknowledgment. This project is partially funded by a grant from Amazon Web Services.
References
1. Ahmadlou, M., Adeli, H.: Electroencephalograms in diagnosis of autism. Comprehensive Guide to Autism, pp. 327–343 (2014)
2. Bodike, Y., Heu, D., Kadari, B., Kiser, B., Pirouz, M.: A novel recommender system for healthy grocery shopping. In: Future of Information and Communication Conference, pp. 133–146. Springer (2020)
3. Bosl, W.J., Tager-Flusberg, H., Nelson, C.A.: EEG analytics for early detection of autism spectrum disorder: a data-driven approach. Sci. Rep. 8(1), 6828 (2018)
4. Daros, A.R., Zakzanis, K.K., Ruocco, A.C.: Facial emotion recognition in borderline personality disorder. Psychol. Med. 43(9), 1953–1963 (2013)
5. Di Martino, A., Yan, C., Li, Q., Denio, E.C., Francisco, X., Alaerts, K., Anderson, J.S., Assaf, M., Bookheimer, S.Y., Dapretto, M., et al.: The autism brain imaging data exchange: towards a large-scale evaluation of the intrinsic brain architecture in autism. Mol. Psych. 19(6), 659 (2014)
6. Dickstein-Fischer, L., Fischer, G.S.: Combining psychological and engineering approaches to utilizing social robots with children with autism. In: 2014 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 792–795. IEEE (2014)
7. Djemal, R., AlSharabi, K., Ibrahim, S., Alsuwailem, A.: EEG-based computer aided diagnosis of autism spectrum disorder using wavelet, entropy, and ANN. BioMed Res. Int. (2017)
8. Doma, V., Pirouz, M.: A comparative analysis of machine learning methods for emotion recognition using EEG and peripheral physiological signals. J. Big Data 7(1), 1–21 (2020)
9. Doma, V., Singh, S., Arora, N., Ortiz, G., Saran, P.K., Chavarin, S., Pirouz, M.: Automated drug suggestion using machine learning. In: Future of Information and Communication Conference, pp. 571–589. Springer (2020)
10. Duffy, F.H., Als, H.: A stable pattern of EEG spectral coherence distinguishes children with autism from neuro-typical controls-a large case control study. BMC Med. 10(1), 64 (2012)
11. Dvornek, N.C., Ventola, P., Duncan, J.S.: Combining phenotypic and resting-state fMRI data for autism classification with recurrent neural networks. In: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pp. 725–728. IEEE (2018)
12. Eldridge, J., Lane, A.E., Belkin, M., Dennis, S.: Robust features for the automatic identification of autism spectrum disorder in children. J. Neurodevelop. Dis. 6(1), 12 (2014)
13. Fan, J., Bekele, E., Warren, Z., Sarkar, N.: EEG analysis of facial affect recognition process of individuals with ASD performance prediction leveraging social context. In: Seventh International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), pp. 38–43. IEEE (2017)
14. Fan, J., Wade, J.W., Bian, D.K., Alexandra, P.W., Zachary, E., Mion, L.C., Sarkar, N.: A step towards EEG-based brain computer interface for autism intervention. In: 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 3767–3770 (2015)
15. Han, L., Li, X.: The appliance of affective computing in man-machine dialogue: assisting therapy of having autism. In: Fourth International Conference on Communication Systems and Network Technologies, pp. 1093–1096. IEEE (2014)
16. Howsmon, D.P., Kruger, U., Melnyk, S., James, S.J., Hahn, J.: Classification and adaptive behavior prediction of children with autism spectrum disorder based upon multivariate data analysis of markers of oxidative stress and DNA methylation. PLoS Comput. Biol. 13(3), e1005385 (2017)
17. Hopcroft, J., Khan, O., Kulis, B., Selman, B.: Natural communities in large linked networks. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 541–546. ACM (2003)
18. Kana, R.K., Libero, L.E., Hu, C.P., Deshpande, H.D., Colburn, J.S.: Functional brain networks and white matter underlying theory-of-mind in autism. Soc. Cogn. Affect. Neurosci. 9(1), 98–105 (2012)
19. Lievesley, R., Wozencroft, M., Ewins, D.: The emotiv EPOC neuroheadset: an inexpensive method of controlling assistive technologies using facial expressions and thoughts? J. Assistive Technol. 5(2), 67–82 (2011)
20. Malaia, E., Bates, E., Seitzman, B., Coppess, K.: Altered brain network dynamics in youths with autism spectrum disorder. Exper. Brain Res. 234(12), 3425–3431 (2016)
21. Manning, S.E., Davin, C.A., Barfield, W.D., Kotelchuck, M.C., Karen, D.H., Osbahr, T., Smith, L.A.: Early diagnoses of autism spectrum disorders in Massachusetts birth cohorts, 2001–2005. Pediatrics 127(6), 1043–1051 (2011)
22. Minzenberg, M.J., Poole, J.H., Vinogradov, S.: Social-emotion recognition in borderline personality disorder. Compr. Psychiatry 47(6), 468–474 (2006)
23. Nunez, P.L., Srinivasan, R., et al.: Electric Fields of the Brain: the Neurophysics of EEG. Oxford University Press, USA (2006)
24. Patel, A.N., Jung, T.P., Sejnowski, T.J., et al.: A wearable multi-modal bio-sensing system towards real-world applications. IEEE Trans. Biomed. Eng. 66(4), 1137–1147 (2018)
25. Shahbodin, F., Mohd, Che Ku Nuraini C.K., Azni, A.H., Jano, Z.: Visual perception games for autistic learners: Research findings. In: Proceedings of the 2019 Asia Pacific Information Technology Conference, pp. 56–60. ACM (2019)
26. Simmons, T.L., Snider, J.A., Moran, N.G., Tse, NGA., Townsend, J., Chukoskie, L.: An objective system for quantifying the effect of cognitive load on movement in individuals with autism spectrum disorder. In: 2019 9th International IEEE/EMBS Conference on Neural Engineering (NER), pp. 1042–1045. IEEE (2019)
27. Tayeb, S., Pirouz, M., Sun, J., Hall, K., Chang, A., Li, J., Song, C., Chauhan, A., Ferra, M., Sager, T., et al.: Toward predicting medical conditions using k-nearest neighbors. In: 2017 IEEE International Conference on Big Data (Big Data), pp. 3897–3903. IEEE (2017)
28. Li, M.W.W., Huang, C., Chena, X.: Stepping community detection algorithm based on label propagation and similarity. Phys. Stat. Mech. Appl. 472, 145–155 (2017)
29. Zheng, Z., Fu, Q., Zhao, H., Swanson, A.R., Weitlauf, A.S., Warren, Z.E., Sarkar, N.: Design of an autonomous social orienting training system (ASOTS) for young children with autism. IEEE Trans. Neural Syst. Rehabil. Eng. 25(6), 668–678 (2016)
Decision Support System for House Hunting: A Case Study in Chittagong
Tanjim Mahmud1(✉), Juel Sikder1, and Sultana Rokeya Naher2
1 Department of Computer Science and Engineering, Rangamati Science and Technology University, Rangamati, Bangladesh [email protected]
2 Department of Computer Science and Engineering, University of Information Technology and Sciences, Dhaka, Bangladesh
Abstract. House hunting is one of the most significant tasks for many families in Bangladesh and worldwide, and it involves difficult decisions. It requires a large number of criteria to be measured and evaluated simultaneously. As house hunting attributes are expressed in both quantitative and qualitative terms, decision-makers have to base their judgments on both quantitative data and practical subjective assessments. Many of these criteria are related to one another in a complex way, and they very often conflict, insofar as an improvement in one often results in a decline in another. The house hunting problem also involves uncertain or incomplete data. Consequently, it is necessary to identify a suitable house by using an appropriate methodology; otherwise, the decision about which house to live in will be unsuitable. Therefore, this paper establishes the application of a method named Analytical Hierarchical Process, which is capable of identifying a suitable house while taking account of the multi-criterion analysis problem. Chittagong, a mega city of Bangladesh, has been considered as the case study area to demonstrate the application of the developed Decision Support System.
Keywords: Decision Support System · Analytical Hierarchical Process · Multi-criterion analysis · House hunting (HH)
1 Introduction
Chittagong is a beautiful city with its city center facing the port. Many families migrate to Chittagong because it provides a nice and safe environment. However, house hunting is a mind-numbing activity in Bangladesh and worldwide. It is difficult to find the perfect area to live in without thorough research of the locations in the city. Selecting the best house is a complex decision process for a home buyer or renter. It requires a large number of criteria to be measured and evaluated concurrently. Many of these criteria are related to one another in a complex way, and they very often conflict, insofar as an improvement in one often results in a decline in another. Furthermore, as house attributes are expressed in both quantitative and qualitative terms, decision-makers have to base their judgments on both quantitative data and practical subjective assessments [1, 2]. It is worth mentioning that the house hunting scenario in Bangladesh is poor, because real estate companies use a static system (Fig. 1), such as a plain search of a database. This does not give efficient results and is a time-consuming process. For this reason, house hunters may still miss out on the ideal home they dream of.
© Springer Nature Switzerland AG 2021 K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 676–688, 2021. https://doi.org/10.1007/978-3-030-63089-8_44
Fig. 1. Scenario in Bangladesh (User Preference)
In this paper, the Analytical Hierarchical Process (AHP) approach, which is capable of processing both quantitative and qualitative measures, is applied as a means of solving the house hunting (HH) crisis [3–5]. For the process of house hunting, a multiple criteria decision model with a hierarchical structure is presented, in which both quantitative and qualitative information is represented in a combined manner. The HH crisis is then fully investigated using the AHP approach. Hence, this paper presents the design, development and application of a Decision Support System (DSS) that finds a suitable house precisely, in a short time and at low cost. Section 2 briefly describes the literature review, and Sect. 3 demonstrates the application of AHP to find a suitable house. The next section presents results and comparisons. Finally, the paper is concluded in Sect. 5.
2 Literature Review
MCDM problems are very common in everyday life. Many methods have been proposed to solve them, such as the Belief Rule Base decision support system, the Evidential Reasoning approach, the Analytic Network Process, etc. In [5], the Evidential Reasoning (ER) method is proposed for house hunting with 16 attributes for 5 alternatives, where a belief structure is used to model an assessment as a distribution. To calculate the degree of belief, 4 evaluation grades were used, namely excellent, good, average and bad. The ER approach was used to obtain the combined degree of belief at the top-level attribute of a hierarchy based on its bottom-level attributes. A utility function was then used to determine the ranking of the different alternatives [5]. The Analytic Hierarchy Process (AHP) is a mathematical technique for multi-criteria decision making (MCDM) originally proposed by Saaty [3]. It enables people to make decisions involving many kinds of concerns, including planning, setting priorities, selecting the best among a number of alternatives, and allocating resources. It is a popular and widely used method for multi-criteria decision making. It allows the use of qualitative as well as quantitative criteria in evaluation, develops a hierarchy of decision criteria, and defines the alternative courses of action [6–8]. The AHP algorithm is basically composed of three steps: the first is structuring the decision problem, selecting the criteria and setting the priority of all the criteria by pairwise comparison (weighting); the second is pairwise comparison of the options on each criterion (scoring); and the final one is comparing both qualitative and quantitative information by using informed judgments to derive weights and priorities [1, 2].
3 AHP to Design Decision Support System for House Hunting
The house hunting problem (HHP) is a massive problem in Bangladesh and globally, because it involves multiple criteria: qualitative attributes such as location, attractiveness, safety and environment, and quantitative attributes such as proximity to hospitals, main roads, educational institutions, shops, offices, recreation centers, police precincts, etc. [5]. Selecting the ‘best’ house is a multiple criteria decision-making (MCDM) process for the client, in which a large number of criteria need to be evaluated. Most of these criteria are related to each other in a complex way. Furthermore, many usually conflict, such that a gain in one criterion requires a trade-off in another. As HHP decision criteria are a mix of qualitative and quantitative characteristics, decision-makers (DMs) have to base their decisions on both quantitative analysis and subjective (typically experiential) judgments. DMs may intuitively find it easier to make subjective judgments by using linguistic variables. However, this can cause problems during the evaluation of alternatives, because it is difficult to aggregate these two types of measure, one quantitative and the other linguistic. It is, therefore, necessary that any MCDA method be capable of aggregating these two types of measures in a coherent and reliable manner, ultimately providing a ranking of all decision alternatives [9]. We provided the same set of criteria used in [5] for HHP and asked some house hunters to select the criteria they consider while selecting a house. We found that 80% of the house hunters did not select the criteria nice neighborhood, proximity to shops, proximity to bus and railway station, proximity to recreation center, police precincts, property insurance and population density.
Here, the qualitative attributes of HHP are Location, Attractiveness, Safeness and Environment, and the quantitative attributes are proximity to Main road, Hospital, Office and Educational institute, and Cost per square foot. The alternatives are Khulsi, Devpahar, Jamal khan, Suganda and Chandgoan, as shown in Fig. 2 [5]. We can build a matrix from the 9 criteria shown in Fig. 2. Because there are 9 criteria to compare, the matrix is 9 by 9. The diagonal elements of the matrix are always 1, and we only need to fill in the upper triangular matrix. To fill in the upper triangular matrix, the following two rules are used:
1. If the judgment value is on the left side of 1, we put the actual judgment value.
2. If the judgment value is on the right side of 1, we put the reciprocal value.
Fig. 2. Alternative Courses of Action
To fill the lower triangular matrix, we use the reciprocal values of the upper triangle [10]. If $a_{ij}$ is the element in row i and column j of the matrix, then the lower triangle is filled using Eq. (1):
$$a_{ji} = \frac{1}{a_{ij}} \tag{1}$$
The preferences of one criterion over the others are set by the users in the form of a comparison matrix, as shown in Table 1. Each entry of the comparison matrix, on a scale from 1 to 9 (or its reciprocal), reflects the degree of preference of one criterion over another. For instance, the entry in the “Prox_education institution” row and “Prox_main roads” column, which is 9, reflects the highest preference of “Prox_education institution” over “Prox_main roads”.
Table 1. Pair wise comparison matrix of criteria
| Criteria | Location | Attractiveness | Safeness | Environment | Prox_education institution | Prox_hospital | Prox_main roads | Prox_office | Cost per squ. ft |
| Location | 1 | 2 | 1/3 | 1/4 | 1/5 | 1/6 | 4 | 1/5 | 1/2 |
| Attractiveness | 1/2 | 1 | 1/2 | 1/3 | 1/6 | 1/4 | 2 | 1/4 | 1/4 |
| Safeness | 3 | 2 | 1 | 4 | 5 | 2 | 3 | 2 | 2 |
| Environment | 4 | 3 | 1/4 | 1 | 4 | 1/3 | 7 | 1/3 | 1/2 |
| Prox_education institution | 5 | 6 | 1/5 | 1/4 | 1 | 1/8 | 9 | 5 | 2 |
| Prox_hospital | 6 | 4 | 1/2 | 3 | 8 | 1 | 8 | 1/7 | 1/5 |
| Prox_main roads | 1/4 | 1/2 | 1/3 | 1/7 | 1/9 | 1/8 | 1 | 5 | 2 |
| Prox_office | 5 | 4 | 1/2 | 3 | 1/5 | 7 | 1/5 | 1 | 1/2 |
| Cost per squ. ft | 2 | 4 | 1/2 | 2 | 1/2 | 5 | 1/2 | 2 | 1 |
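T. Mahmud et al.

$$A = \begin{bmatrix}
1 & 2 & 0.33 & 0.25 & 0.2 & 0.17 & 4 & 0.2 & 0.5 \\
0.5 & 1 & 0.5 & 0.33 & 0.17 & 0.25 & 2 & 0.25 & 0.25 \\
3 & 2 & 1 & 4 & 5 & 2 & 3 & 2 & 2 \\
4 & 3 & 0.25 & 1 & 4 & 0.33 & 7 & 0.33 & 0.5 \\
5 & 6 & 0.2 & 0.25 & 1 & 0.13 & 9 & 5 & 2 \\
6 & 4 & 0.5 & 3 & 8 & 1 & 8 & 0.14 & 0.20 \\
0.25 & 0.5 & 0.33 & 0.14 & 0.11 & 0.13 & 1 & 5 & 2 \\
5 & 4 & 0.5 & 3 & 0.2 & 7 & 0.2 & 1 & 0.5 \\
2 & 4 & 0.5 & 2 & 0.5 & 5 & 0.5 & 2 & 1
\end{bmatrix}$$

Normalized columns:

$$Z = \begin{bmatrix}
0.04 & 0.08 & 0.10 & 0.02 & 0.01 & 0.02 & 0.12 & 0.01 & 0.06 \\
0.02 & 0.04 & 0.14 & 0.03 & 0.01 & 0.02 & 0.06 & 0.02 & 0.03 \\
0.12 & 0.08 & 0.28 & 0.33 & 0.27 & 0.18 & 0.09 & 0.14 & 0.22 \\
0.20 & 0.13 & 0.07 & 0.08 & 0.21 & 0.03 & 0.20 & 0.02 & 0.06 \\
0.20 & 0.26 & 0.06 & 0.02 & 0.05 & 0.01 & 0.26 & 0.36 & 0.22 \\
0.24 & 0.17 & 0.14 & 0.25 & 0.43 & 0.10 & 0.23 & 0.01 & 0.02 \\
0.01 & 0.02 & 0.10 & 0.01 & 0.01 & 0.01 & 0.02 & 0.36 & 0.22 \\
0.20 & 0.17 & 0.14 & 0.25 & 0.01 & 0.63 & 0.01 & 0.07 & 0.06 \\
0.07 & 0.15 & 0.12 & 0.14 & 0.03 & 0.31 & 0.01 & 0.13 & 0.11
\end{bmatrix}$$

Row averages (priority vector):

| Criterion | Row average |
|---|---|
| Location | 0.05 |
| Attractiveness | 0.04 |
| Safeness | 0.17 |
| Environment | 0.10 |
| Prox_edu.Ins | 0.15 |
| Prox_hospital | 0.16 |
| Prox_mainroad | 0.08 |
| Prox_office | 0.14 |
| Cost-pr-sq | 0.12 |

Fig. 3. Criteria weights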
3.1 Criteria Weights

In order to interpret and give relative weights to each criterion, it is necessary to normalize the comparison matrix of Table 1. The normalization divides each entry by the total of its column using Eq. (2), where i and j are the row and column subscripts of the matrix and n is the number of criteria:

$$Z_{ij} = \frac{A_{ij}}{\sum_{k=1}^{n} A_{kj}} \qquad (2)$$

Because Z is normalized, the sum of all elements in each of its columns is 1. The priority vector, which approximates the normalized principal eigenvector, is then calculated by averaging each row of Z, and the sum of its elements is 1. The priority vector shows the relative weights among the things that I compare [11]. In the above, Location is 5%, Attractiveness is 4%, Safeness is 17%, Environment is 10%, Prox_edu.Ins is 15%, Prox_hospital is 16%, Prox_main road is 8%, proximity to office is 14% and cost per square foot is 12%. The house buyer's most preferred selection criterion is safeness, followed by the remaining criteria. In this case, I know more than their ranking: the relative weights form a ratio scale, so they can be divided by one another. For example, I can say that the buyer prefers safeness 3.4 (= 17/5) times more than Location and 2.1 (= 17/8) times more than Prox_main road.
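As a hedged illustration of Eq. (2) and the row-averaging step, the following sketch recomputes the criteria weights from the judgments of Table 1. The function name `priority_vector` and the variable names are assumptions introduced for this example; only the matrix entries come from the paper.

```python
from fractions import Fraction as F

CRITERIA = ["Location", "Attractiveness", "Safeness", "Environment",
            "Prox_edu.Ins", "Prox_hospital", "Prox_main road",
            "Prox_office", "Cost per sq. ft"]

# Pairwise comparison matrix of criteria, copied from Table 1.
A = [
    [1, 2, F(1, 3), F(1, 4), F(1, 5), F(1, 6), 4, F(1, 5), F(1, 2)],
    [F(1, 2), 1, F(1, 2), F(1, 3), F(1, 6), F(1, 4), 2, F(1, 4), F(1, 4)],
    [3, 2, 1, 4, 5, 2, 3, 2, 2],
    [4, 3, F(1, 4), 1, 4, F(1, 3), 7, F(1, 3), F(1, 2)],
    [5, 6, F(1, 5), F(1, 4), 1, F(1, 8), 9, 5, 2],
    [6, 4, F(1, 2), 3, 8, 1, 8, F(1, 7), F(1, 5)],
    [F(1, 4), F(1, 2), F(1, 3), F(1, 7), F(1, 9), F(1, 8), 1, 5, 2],
    [5, 4, F(1, 2), 3, F(1, 5), 7, F(1, 5), 1, F(1, 2)],
    [2, 4, F(1, 2), 2, F(1, 2), 5, F(1, 2), 2, 1],
]


def priority_vector(matrix):
    """Normalize each column to sum to 1 (Eq. 2), then average each row."""
    n = len(matrix)
    col_sums = [sum(row[j] for row in matrix) for j in range(n)]
    z = [[row[j] / col_sums[j] for j in range(n)] for row in matrix]
    return [float(sum(row) / n) for row in z]


weights = dict(zip(CRITERIA, priority_vector(A)))
for name, w in weights.items():
    print(f"{name:16s} {w:.2f}")
```

Under these inputs, Safeness receives the largest weight (about 0.17), in line with the row averages reported in Fig. 3. Row averaging is an approximation of the principal eigenvector; an exact eigenvector computation would give slightly different values.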
3.2 Quantitative Ranking

Distance from the main road is of great importance when selecting the location of a house, which led me to run a campaign to obtain the approximate distances of the locations from the main road. I collected the data from construction developers. Dividing each element by the sum of its column gives the normalized relative weight n0, and the sum of each column is 1. Proximity to educational institutes, hospitals and offices, and cost per square foot, have similar importance when choosing locations. Since for distance and cost the lowest value is best, I calculated the distance scores by subtracting the normalized values n0 from 1. I then normalized the distance scores again to get the normalized value n1. The approximate distances of the alternative locations from the main road, educational institutes, hospitals and offices, and the cost per square foot of the alternatives, are shown in Tables 2, 3, 4, 5 and 6, respectively. I have taken the normalized values of those distances using principal eigenvector theory.

3.3 Qualitative Ranking
Pairwise comparison is a process of comparing alternatives in pairs to judge which entity is preferred over the other or has more of a qualitative property. Tables 7, 8, 9 and 10 show the pairwise comparison matrices of the alternatives based on Location, Attractiveness, Safety and Environment, respectively. The pairwise comparison matrices of the alternatives per criterion are set in the same manner as described in Sect. 3, and the judgment values are taken from the results of the survey of construction developers that I ran to understand the degree of preference of one location over another. The score of the alternatives based on location is calculated from Table 7 using the principal eigenvector theory and priority vector described earlier, as shown in Fig. 4 [11]. Similarly, I calculated the scores of all alternatives for each criterion from the respective pairwise comparison matrix of alternatives. The results form a matrix C of criteria-wise scores {v11, v12, …, v1n}, {v21, v22, …, v2n}, …, {vm1, vm2, …, vmn} for n criteria and m alternatives, which is shown in Table 11.
Table 2. Proximity to Main road

| Alternatives | KM | Normalized n0 | Distance scores | Normalized n1 |
|---|---|---|---|---|
| Khulsi | 1.4 | 0.14 | 0.86 | 0.21 |
| Devpahar | 1.0 | 0.10 | 0.90 | 0.22 |
| Jamalkhan | 2.1 | 0.21 | 0.79 | 0.20 |
| Suganda | 2.5 | 0.26 | 0.74 | 0.19 |
| Chandgoan | 2.8 | 0.29 | 0.71 | 0.18 |
| Summation | 9.8 | 1.00 | 4.0 | 1.00 |

Table 3. Proximity to Education Institute

| Alternatives | KM | Normalized n0 | Distance scores | Normalized n1 |
|---|---|---|---|---|
| Khulsi | 2.1 | 0.19 | 0.81 | 0.20 |
| Devpahar | 3 | 0.28 | 0.72 | 0.18 |
| Jamalkhan | 2 | 0.19 | 0.81 | 0.20 |
| Suganda | 1.9 | 0.18 | 0.82 | 0.21 |
| Chandgoan | 1.7 | 0.16 | 0.84 | 0.21 |
| Summation | 9.8 | 1.00 | 4.0 | 1.00 |

Table 4. Proximity to Hospitals

| Alternatives | KM | Normalized n0 | Distance scores | Normalized n1 |
|---|---|---|---|---|
| Khulsi | 2.3 | 0.19 | 0.81 | 0.20 |
| Devpahar | 2.6 | 0.21 | 0.79 | 0.20 |
| Jamalkhan | 2.4 | 0.19 | 0.81 | 0.20 |
| Suganda | 2.0 | 0.16 | 0.84 | 0.21 |
| Chandgoan | 3.0 | 0.24 | 0.76 | 0.19 |
| Summation | 9.8 | 1.00 | 4.0 | 1.00 |

Table 5. Proximity to Office

| Alternatives | KM | Normalized n0 | Distance scores | Normalized n1 |
|---|---|---|---|---|
| Khulsi | 2 | 0.21 | 0.79 | 0.20 |
| Devpahar | 1.6 | 0.17 | 0.83 | 0.21 |
| Jamalkhan | 1 | 0.10 | 0.90 | 0.22 |
| Suganda | 2 | 0.21 | 0.79 | 0.20 |
| Chandgoan | 3 | 0.31 | 0.69 | 0.17 |
| Summation | 9.6 | 1.00 | 4.0 | 1.00 |
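Pairwise comparison matrix A with column sums:

| A | Khulsi | Dev pahar | Jamalkhan | Suganda | Chandgoan |
|---|---|---|---|---|---|
| Khulsi | 1 | 2 | 3 | 4 | 5 |
| Dev pahar | 0.5 | 1 | 6 | 7 | 8 |
| Jamalkhan | 0.33 | 0.17 | 1 | 0.25 | 9 |
| Suganda | 0.25 | 0.14 | 4 | 1 | 3 |
| Chandgoan | 0.2 | 0.13 | 0.11 | 0.33 | 1 |
| Column sums | 2.28 | 3.44 | 14.11 | 12.58 | 26 |

Normalized columns X with row averages (priority vector):

| X | Khulsi | Dev pahar | Jamalkhan | Suganda | Chandgoan | Row average |
|---|---|---|---|---|---|---|
| Khulsi | 0.43 | 0.58 | 0.21 | 0.32 | 0.19 | 0.34 |
| Dev pahar | 0.21 | 0.29 | 0.43 | 0.56 | 0.31 | 0.36 |
| Jamalkhan | 0.14 | 0.05 | 0.07 | 0.02 | 0.5 | 0.12 |
| Suganda | 0.11 | 0.04 | 0.28 | 0.08 | 0.12 | 0.16 |
| Chandgoan | 0.08 | 0.04 | 0.01 | 0.03 | 0.04 | 0.04 |

Fig. 4. Score of alternatives based on Location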
Table 6. Cost per square foot

| Alternatives | Thousand | Normalized n0 | Distance scores | Normalized n1 |
|---|---|---|---|---|
| Khulsi | 10 | 0.32 | 0.68 | 0.17 |
| Devpahar | 5.5 | 0.17 | 0.83 | 0.21 |
| Jamalkhan | 6.5 | 0.21 | 0.79 | 0.20 |
| Suganda | 6.0 | 0.19 | 0.81 | 0.20 |
| Chandgoan | 3.5 | 0.11 | 0.89 | 0.22 |
| Summation | 31.5 | 1.00 | 4.0 | 1.00 |
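The lower-is-better scoring of Sect. 3.2 (normalize to n0, subtract from 1, normalize again to n1) can be sketched as follows. The function name `lower_is_better_scores` is an assumption made for this example; the distances are the main-road values of Table 2.

```python
def lower_is_better_scores(values):
    """Return (n0, scores, n1) for a criterion where smaller values are better."""
    total = sum(values)
    n0 = [v / total for v in values]          # normalized share of each value
    scores = [1 - x for x in n0]              # invert so that small values score high
    score_total = sum(scores)                 # always len(values) - 1
    n1 = [s / score_total for s in scores]    # renormalize so the scores sum to 1
    return n0, scores, n1


# Distances to the main road in km (Table 2).
km_main_road = {"Khulsi": 1.4, "Devpahar": 1.0, "Jamalkhan": 2.1,
                "Suganda": 2.5, "Chandgoan": 2.8}

n0, scores, n1 = lower_is_better_scores(list(km_main_road.values()))
for name, value in zip(km_main_road, n1):
    print(f"{name:10s} {value:.2f}")
```

With these inputs the rounded n1 values agree with the last column of Table 2. Because the inverted scores always sum to the number of alternatives minus one, this transformation compresses the differences between alternatives, which is one reason the quantitative columns of Table 11 are all close to 0.20.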
Table 7. Pairwise comparison matrix of alternatives based on Location

| Alternatives | Khulsi | Dev pahar | Jamalkhan | Suganda | Chandgoan |
|---|---|---|---|---|---|
| Khulsi | 1 | 2 | 3 | 4 | 5 |
| Dev pahar | 1/2 | 1 | 6 | 7 | 8 |
| Jamalkhan | 1/3 | 1/6 | 1 | 1/4 | 9 |
| Suganda | 1/4 | 1/7 | 4 | 1 | 3 |
| Chandgoan | 1/5 | 1/8 | 1/9 | 1/3 | 1 |

3.4 Overall Assessments of Alternatives
In Sect. 3.1 I calculated the criteria weights wp from the pairwise comparison matrix of criteria, and in Sect. 3.3 I calculated C, the matrix of scores of the m alternatives for each of the n criteria. Finally, the score of each alternative is calculated by taking the weighted sum, as given by Eq. (3):
Table 8. Pairwise comparison matrix of alternatives based on Attractiveness

| Alternatives | Khulsi | Devpahar | Jamalkhan | Suganda | Chandgoan |
|---|---|---|---|---|---|
| Khulsi | 1 | 8 | 2 | 6 | 10 |
| Devpahar | 1/8 | 1 | 1/11 | 1/2 | 8 |
| Jamalkhan | 1/2 | 11 | 1 | 2 | 3 |
| Suganda | 1/6 | 2 | 1/2 | 1 | 9 |
| Chandgoan | 1/10 | 1/8 | 1/3 | 1/9 | 1 |

Table 9. Pairwise comparison matrix of alternatives based on Safety

| Alternatives | Khulsi | Devpahar | Jamalkhan | Suganda | Chandgoan |
|---|---|---|---|---|---|
| Khulsi | 1 | 7 | 4 | 6 | 9 |
| Devpahar | 1/7 | 1 | 1/9 | 1/4 | 7 |
| Jamalkhan | 1/4 | 9 | 1 | 1/2 | 9 |
| Suganda | 1/6 | 4 | 2 | 1 | 5 |
| Chandgoan | 1/9 | 1/7 | 1/9 | 1/5 | 1 |

Table 10. Pairwise comparison matrix of alternatives based on Environment

| Alternatives | Khulsi | Devpahar | Jamalkhan | Suganda | Chandgoan |
|---|---|---|---|---|---|
| Khulsi | 1 | 5 | 4 | 2 | 9 |
| Devpahar | 1/5 | 1 | 1/2 | 1/3 | 2 |
| Jamalkhan | 1/4 | 2 | 1 | 1/3 | 4 |
| Suganda | 1/2 | 3 | 3 | 1 | 7 |
| Chandgoan | 1/9 | 1/2 | 1/4 | 1/7 | 1 |

Table 11. Scores of alternatives for all criteria

| Altern. \ Criteria | Location | Attractiveness | Safety | Environment | Prox_edu.Ins | Prox_hospital | Prox_main road | Prox_office | Cost per sq. ft |
|---|---|---|---|---|---|---|---|---|---|
| Khulsi | 0.25 | 0.46 | 0.50 | 0.61 | 0.20 | 0.20 | 0.21 | 0.20 | 0.17 |
| Dev pahar | 0.47 | 0.08 | 0.08 | 0.27 | 0.18 | 0.20 | 0.22 | 0.21 | 0.21 |
| Jamalkhan | 0.12 | 0.26 | 0.21 | 0.51 | 0.20 | 0.20 | 0.20 | 0.22 | 0.20 |
| Suganda | 0.12 | 0.13 | 0.17 | 0.38 | 0.21 | 0.21 | 0.19 | 0.20 | 0.20 |
| Chandgoan | 0.04 | 0.03 | 0.04 | 0.13 | 0.21 | 0.19 | 0.18 | 0.17 | 0.22 |
$$Y_i = \sum_{p=1}^{n} v_{ip}\, w_p \qquad (3)$$

The process is shown in Fig. 5.

[Figure: the product of the score matrix C and the weight vector yields the final scores 0.30, 0.20, 0.24, 0.21 and 0.15, with the greatest value marked.]

Fig. 5. Final scores of alternatives
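Eq. (3) is a plain weighted sum, which the following sketch evaluates using the rounded scores of Table 11 and the rounded criteria weights of Fig. 3. The variable names are assumptions made for this example. Because the published weights are rounded to two decimals they sum to 1.01 rather than 1, so the results are only expected to match Table 12 to about two decimal places.

```python
# Criteria weights w_p in the column order of Table 11 (rounded values from Fig. 3).
weights = [0.05, 0.04, 0.17, 0.10, 0.15, 0.16, 0.08, 0.14, 0.12]

# Scores v_ip of each alternative for every criterion (Table 11).
scores_by_alternative = {
    "Khulsi":    [0.25, 0.46, 0.50, 0.61, 0.20, 0.20, 0.21, 0.20, 0.17],
    "Dev pahar": [0.47, 0.08, 0.08, 0.27, 0.18, 0.20, 0.22, 0.21, 0.21],
    "Jamalkhan": [0.12, 0.26, 0.21, 0.51, 0.20, 0.20, 0.20, 0.22, 0.20],
    "Suganda":   [0.12, 0.13, 0.17, 0.38, 0.21, 0.21, 0.19, 0.20, 0.20],
    "Chandgoan": [0.04, 0.03, 0.04, 0.13, 0.21, 0.19, 0.18, 0.17, 0.22],
}

# Eq. (3): Y_i = sum over p of v_ip * w_p
final_scores = {
    name: sum(v * w for v, w in zip(values, weights))
    for name, values in scores_by_alternative.items()
}

# Rank alternatives from highest to lowest final score.
ranking = sorted(final_scores, key=final_scores.get, reverse=True)
for rank, name in enumerate(ranking, start=1):
    print(rank, name, f"{final_scores[name]:.3f}")
```

The resulting order places Khulsi first, followed by Jamalkhan, Suganda, Dev pahar and Chandgoan.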
The final result is found from the ranking of the final scores of the alternatives obtained in Sect. 3.4. The scores of the alternatives with their respective ranks are shown in Table 12.

Table 12. Overall assessment and ranking of five different houses

| Alternatives | Alternative final scores | Ranking |
|---|---|---|
| Khulsi | 0.30 | 1 |
| Dev pahar | 0.20 | 4 |
| Jamalkhan | 0.24 | 2 |
| Suganda | 0.21 | 3 |
| Chandgoan | 0.15 | 5 |

Here Khulsi is the best house because its ranking is 1, followed by Jamalkhan (2) > Suganda (3) > Dev pahar (4) > Chandgoan (5). Although Khulsi is only in 2nd, 3rd or 5th position on the quantitative criteria (proximity to main road, proximity to education, proximity to hospital, proximity to office and cost per square foot), it has high scores in the qualitative measurements because of its attractiveness, environment and safety. Hence it obtains the highest score in the overall assessment. Figure 6 shows the study area in Chittagong.
Fig. 6. Study Area Chittagong District
4 Comparisons with Reference Work

To compare the result of the proposed method with the referenced work, I took the average expected utilities of the reference work and converted them to normalized form. Table 13 shows the acquired result and the expected result.

Table 13. Results of AHP and ER approach in normalized form

| Alternatives | Result (AHP) | Result (ER) | Normalized (ER) |
|---|---|---|---|
| Khulsi | 0.30 | 0.86 | 0.22 |
| Dev pahar | 0.20 | 0.74 | 0.19 |
| Jamalkhan | 0.24 | 0.85 | 0.21 |
| Suganda | 0.21 | 0.81 | 0.20 |
| Chandgoan | 0.15 | 0.74 | 0.19 |
The bar graph in Fig. 7 presents the results of both the proposed work and the reference work described earlier in the literature review section. The results are quite similar for Devpahar, Jamalkhan, Suganda and Chandgoan. Although the normalized score for Khulsi in the reference work is considerably lower than in the proposed work, both methods give Khulsi the highest score. Although every step of the AHP approach depends on the previous step, it is a simple approach with only 3 steps, whereas the ER approach is complex, with more than 6 steps.
Fig. 7. Results of proposed work and reference work.
5 Discussion and Conclusions

This research has shown how to select the best house offer while reducing the effect of the interdependence and conflicts among the choice factors in the decision-making process. From the results shown above, it is reasonable to say that the AHP method is a mathematically sound approach to measuring house performance, as it employs a structure to represent an assessment as a distribution. The best alternative may change from Khulsi to another alternative, because the preferences of one criterion over the others are set by the users in Table 1. Here, the user preferred safeness first; others may prefer proximity to office or proximity to hospital, etc. This approach is quite different from other multi-criteria decision-making models. Finally, house performance appraisal is a complex assessment that involves objective and subjective assessments of many basic attributes, as shown in Table 12. Therefore, AHP is seen as a feasible method for performance appraisal. In the future, based on the results of refinements, it would be best to develop a model that includes carefully selected criteria and sub-criteria as well as evaluations from consistent decision makers. The study can also be extended to include more alternatives. Moreover, the AHP method cannot handle uncertainty or incompleteness in the gathered data, and it is subject to the rank reversal phenomenon when new data are added. To address these problems, the fuzzy AHP method can be employed when evaluating the elements of the model.
References

1. Bhushan, N., Rai, K.: Strategic Decision Making: Applying the Analytic Hierarchy Process. Springer, New York (2004)
2. Coyle, G.: The Analytic Hierarchy Process. Pearson Educational, New York (2004). Haas, R., Meixner, O.: An Illustrated Guide to Analytic Hierarchy Process. University of Natural Resources and Applied Life Sciences, Vienna, Austria (2005)
3. Saaty, T.L.: Theory and Applications of the Analytic Network Process: Decision Making with Benefits, Opportunities, Costs, and Risks. RWS Publications, Pittsburgh (2005)
4. Saaty, T.L.: Relative measurement and its generalization in decision making: why pairwise comparisons are central in mathematics for the measurement of intangible factors - the analytic hierarchy/network process. Rev. Spanish Roy. Acad. Sci., Ser. A Math. 102, 251–318 (2008)
5. Mahmud, T., Hossain, M.S.: An evidential reasoning-based decision support system to support house hunting. Int. J. Comput. Appl. 57(21), 51–58 (2012)
6. Saaty, T.L.: Extending the measurement of tangibles to intangibles. Int. J. Inf. Technol. Decis. Making 8(01), 7–27 (2009)
7. Teknomo, K.: Analytic hierarchy process (AHP) tutorial (2006). http://people.revoledu.com/kardi/tutorial/ahp/. Viewed 11 Aug 2019
8. Triantaphyllou, E., Mann, S.H.: Using the analytic hierarchy process for decision making in engineering applications: some challenges. Int. J. Ind. Eng. Appl. Pract. 2, 35–44 (1995)
9. Triantaphyllou, E.: Multi-Criteria Decision Making Methods: A Comparative Study. Springer, New York (2002). ISBN 978-1-4757-3157-6
10. Vargas, L.G.: An overview of the analytic hierarchy process and its applications. Eur. J. Oper. Res. 48, 2–8 (1990)
11. Kostlan, E.: Statistical complexity of dominant Eigenvector calculation. Hawaii J. Complex. 7(4), 371–379 (1991)
Blockchain in Charity: Platform for Tracking Donations

Sergey Avdoshin and Elena Pesotskaya

National Research University Higher School of Economics, 20 Myasnitskaya ulitsa, 101000 Moscow, Russian Federation
{savdoshin,epesotskaya}@hse.ru
Abstract. The paper explores the possibilities of using blockchain technology in charity. Problems in this area require new tools for storing and transferring information between donors, foundations, donation recipients and other charitable actors in order to ensure data security, the integrity of funds and the control of donations. Using blockchain can increase the confidence of potential donors in charitable organizations through guaranteed data security and the ability to track the movement of funds and transactions. In this article, the authors analyze needs and review existing blockchain-based charity platforms in Russia and worldwide. They offer an example of the implementation of a platform for placing and tracking donations of funds for charitable purposes using distributed registry technologies. During the research, the authors cooperated with local funds and NPOs to validate the solution and gain a better understanding of the ecosystem's needs, and they share this experience in the paper.

Keywords: Charity, Blockchain, Smart-contracts, Ethereum, Transparency
© Springer Nature Switzerland AG 2021 K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 689–701, 2021. https://doi.org/10.1007/978-3-030-63089-8_45

1 Introduction

Blockchain is already disrupting many industries. It was intended as a banking platform for digital currency, but blockchain now has applications that go beyond financial transactions, and it is becoming popular in many fields. The idea of blockchain is to use a decentralized system that can replace banks and other trusted third parties. Blockchain is a large structured database distributed among independent participants of the system. This database stores an ever-growing, ordered list of records (blocks). Each block contains a timestamp and a reference to the previous block. A block cannot be changed arbitrarily: each member of the network can see that a transaction has taken place in the blockchain, and a transaction can be performed only by someone who possesses the access rights (private key). Blockchain can solve the problem of trust, and its trust mechanism is a key element of the technology. It is like a public account book that everyone can record in, view and maintain [1]. Blocks are not stored on a single server; the distributed ledger is replicated on thousands of computers worldwide, so users interacting in the blockchain do not need any intermediaries. Blockchain, or distributed register technology, can be shared by individuals, organizations and even devices. It saves time, increases transparency, and
gives the ability to make everything a tradable asset [2]. The World Economic Forum predicts that by 2027 it will be possible to store nearly 10% of global gross domestic product on blockchains [3].

Due to its ability to control and monitor information, distributed register technology has also become useful in social areas, including charity. It is especially important to have guarantees that no one can manipulate or overwrite data. Implementing blockchain technology in the system for transferring and tracking financial tranches and information flows gives each stakeholder an opportunity to track the movement of funds and evaluate the work of each link in the charity chain. This has a direct impact on the key problem of charity: the lack of trust among potential donors. The continuous decline in trust on the part of potential donors is justifiable: according to Essential Research, 35% of US citizens have little or no trust in charitable institutions. 52% of charities are not adequately funded and cannot match the distressing increase in demand for their services [4]. According to a study of charity [5, 27], factors such as openness, transparency and accessibility of reporting are also key when making a donation decision:

• Charity is widespread among capital owners. This gives a significant return in terms of both the regularity and the size of the revenue. However, wealthy donors require not only openness and transparency, but also frequent communication about the results achieved, targeting and the efficient use of funds.
• When choosing a non-profit organization (NPO), the reputation of the fund, its successfully implemented projects and its management play a significant role.

In the framework of this study, the authors highlight the key problems of each stakeholder and the possibility of solving them with blockchain technology:

1. Challenges of donors.
Firstly, donors are constrained by the lack of any practice of publishing successful projects and the effectiveness of each donor's contribution. Secondly, geographic or legal barriers may arise that limit the potential donor's ability to make transactions or impose excessive commissions on the transfer [6]. These difficulties are addressed by the ability to track cash flows from anywhere in the world and by the absence of geographical restrictions and legal barriers to transferring funds. Blockchain also reduces costs through low transaction fees and the absence of taxation on cryptocurrency. The lack of reporting provided to the donor is addressed by the opportunity to see the entire history of a transfer up to its final destination, which is stored in a distributed registry. The donor can see all details about the purpose and the cost of the transaction.

2. NPO challenges. Organizations often face too long a lag between receiving a donation and providing assistance, which is an example of ineffective fund management [6]. Legal restrictions on the work of organizations and the inability to accept assistance from foreign individuals and companies also complicate the work. Using blockchain speeds up assistance thanks to the fast processing of cryptocurrency transactions and the absence of intermediaries. The possibility of fraud faced by funds can be addressed through operations that are transparent to any observer: a blockchain makes it possible to evaluate the size of available assets at any
time, which also increases the efficiency of financial management. When choosing a non-profit organization, the reputation of the fund and its management plays a significant role. According to a survey by Fidelity Charitable, 41% of donors say they have changed the purpose of their donations due to increased knowledge about nonprofit effectiveness. Complicating matters, organizations transferring funds internationally may lose from 3% up to 10% of the funds in transaction fees and inefficiencies caused by having to go through multiple intermediaries such as banks, agencies and governments. This is especially critical for aid organizations dealing with large sums of money and more complex geographies and financial systems [7].

The study has the following structure. The introduction describes the prerequisites for using blockchain in the field of charity. Section 2 covers current technological trends in the field of charity in different countries, the experience of introducing new technologies and the results of their use. Section 3 describes the implementation of the blockchain platform in the field of charity based on the needs of participants. Section 4 contains restrictions and describes the barriers to using the blockchain platform in the process of receiving and distributing charitable contributions. The last section summarizes the results of the study and states what qualitatively new results have been achieved.
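The block structure described in this introduction (an ordered list of records in which each block carries a timestamp and a reference to the previous block) can be illustrated with a minimal, single-process sketch. The field names, the helper functions `block_hash`, `add_block` and `is_valid`, and the sample donation records are all assumptions made for this example; a real blockchain additionally involves consensus, digital signatures and replication across many nodes, none of which is modeled here.

```python
import hashlib
import json
import time


def block_hash(block):
    """Hash the block's contents, including the reference to the previous block."""
    payload = json.dumps(
        {k: block[k] for k in ("index", "timestamp", "data", "prev_hash")},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def add_block(chain, data):
    """Append a new block that points at the hash of the current last block."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    block = {"index": len(chain), "timestamp": time.time(),
             "data": data, "prev_hash": prev_hash}
    block["hash"] = block_hash(block)
    chain.append(block)


def is_valid(chain):
    """Check every block's own hash and its link to the preceding block."""
    for i, block in enumerate(chain):
        if block["hash"] != block_hash(block):
            return False
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False
    return True


chain = []
add_block(chain, {"donor": "anonymous", "amount": 50, "purpose": "food"})
add_block(chain, {"donor": "anonymous", "amount": 20, "purpose": "medicine"})
print(is_valid(chain))  # the untouched chain passes the check
```

Changing any recorded donation after the fact alters that block's hash, so the stored hash and the next block's reference no longer match and the check fails. This is the property that makes a recorded transfer history difficult to overwrite unnoticed.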
2 Charity and Technology Trends

2.1 Best Practices of Using Technology in Charity

Technology can greatly help non-profit organizations, foundations, volunteers and social entrepreneurs. The vigorous implementation of innovative solutions in the field of charity, which has been called "PhilTech" [8], demonstrates that non-profit organizations are not only charity "with an outstretched hand". This is a new market for technological solutions that can help organizations solving social problems to do so more efficiently [9]. The most effective practices of using technology in charity are presented in Table 1.

2.2 Related Works of the Blockchain Application
Blockchain technology is underpinned by an orchestration of supporting hardware and software components that integrate decentralised cryptographic protocols, distributed cloud computing (i.e. processing, storage, connectivity), and development environments to support the implementation and actuation of real-time blockchain applications [10]. According to the IDC study [11], in 2018 the USA was the leader in the implementation of blockchain projects, with 341 startups (36.9% of the total number of startups in the world). Canada ranks third with 42 startups (4.5% of the total). A significant proportion of all projects in the US and Canada are startups in the financial industry, including lending, insurance and investment. Next come industry and the services and retail sector. However, there are many successful blockchain projects in the sphere of charitable giving [12–14].

Table 1. The best practices of using technology in the field of charity

| Technology | Charity opportunities |
|---|---|
| Alternative payment methods | Digital fundraising, as well as contactless methods of transferring funds (for example, PayPal and Apple Pay), can significantly increase donations |
| Mobile applications | Mobile applications allow users to get closer to charity, as well as manage their interaction with organizations (for example, view the history of previous donations, track donated amounts and view a map of their movements up to the final beneficiary) |
| Social networks | Social networks make it possible to attract as many caring people as possible by creating groups and publishing information on fundraising for a specific charitable target |
| Virtual reality (VR) | VR offers a powerful tool for interaction, visually depicting what is happening in a real situation (for example, through a 360° video, creating emotional narratives to attract donors, and also demonstrating how a charity helps those in need) |
| Artificial Intelligence (AI) | AI is an adaptation of assistants and chatbots that allows for optimizing research processes |
| Blockchain | Blockchain is an extremely open structure that allows donors to provide assistance to those in need without mediation, and allows charitable foundations to eliminate corruption and optimize marketing and accounting costs |

Western Europe ranks second in terms of blockchain development costs, while Great Britain ranks second after the United States in terms of the number of blockchain projects (136 projects, or 14.8% of the world total). The UK is actively developing charity projects based on blockchain technology (the largest being the charity project Alice.si). However, the bulk of blockchain projects in the region are aimed at the financial industry. Despite a conservative attitude towards Initial Coin Offerings (ICOs) and cryptocurrency exchanges, China has shown an increased interest in distributed ledger technology. The Chinese government agency responsible for social services intends to use blockchain technology to modernize the charitable donation system, in particular to increase its transparency.

As for Russia, the sector of charitable organizations is growing in scale, and the sector of charitable foundations is being structured. By the beginning of 2018, more than 11.6 thousand charitable organizations were registered in Russia [15]. In 2018, 90% of Russians responded at least once to a request for help, 84% at least once supported the activities of charitable organizations over the past year, and
31% do this at least once every two to three months, and 69% of citizens are ready to donate more if they know exactly what the donations are spent on [16]. Blockchain initiatives dedicated to social impact are still in their early days: 34% were started in 2017 or later, and 74% are still in the pilot or idea stage. However, 55% of social-good blockchain initiatives are estimated to impact their beneficiaries by 2019 [17].

There are already many successful blockchain projects in charity: the OIN.Space blockchain platform (www.oin.space); the Leukemia Fund blockchain reporting (www.leikozu.net); an electronic wallet in the Ethereum cryptocurrency for the Charity Fund for older people (www.starikam.org); the Blockchain Charity Lottery created by DataArt (www.devpost.com/software/blockchaincharitylottery); and the Crypto Charity Fund, aimed at creating a global ecosystem of animal care (www.cryptocharityfund.com), which released the CCF cryptocurrency in the fall of 2017 and also accepts BTC and ETH for its purposes. The UN World Food Programme (WFP) has used blockchain for aid distribution in Jordan to pay vendors directly, facilitate cash transfers for over 10,000 Syrian refugees and audit beneficiary spending (www.innovation.wfp.org). Hypergive is a secure digital food wallet for homeless or hungry people in the community (www.medium.com/hypergive). Akshaya Patra, the world's largest non-profit supplier of cooked meals for school children, used a blockchain solution to reduce the cost of each meal it supplies and so do more with its resources (www.akshayapatra.org). Since Fidelity Charitable began accepting cryptocurrency in 2015, donors had made contributions totaling nearly $106 million by the end of 2018 [18].
Andrej Zwitter and Mathilde Boisse-Despiaux in their work [19] raise the issue of introducing the technology not only in the field of charity but in the social sphere as a whole, and analyze the possible directions, risks and boundaries of the applicability of blockchain to solving human problems. Pratyush Agarwal, Shruti Jalan, and Dr. Abhijit Mustafi [20] propose a platform based on a system of cryptocurrency transactions. The model uses the principle of the stock market: a charity project is placed on the blockchain platform in the form of "certificates", which go through a series of checks for compliance with standards and then go to the "market", where they can be purchased for cryptocurrency. This guarantees the effective use of funds, while all operations are performed on the blockchain and are strictly transparent. Scientists from Saudi Arabia, Lama Abdulwahab Dajim, Sara Ahmed Al-Farras, Bushra Safar Al-Shahrani, Bushra Safar Al-Shahrani, have proposed a decentralized organ donation application using blockchain technology. It will contain all medical information about the required organs, the blood type and the patient's condition. The system will work on the basis of order of priority if a patient is not in critical condition [21]. The work "A model of donation verifications" [22] is devoted to the development of an algorithm for verifying the correctness of a donation. The article tests the hypothesis that the funds received by beneficiaries approximate the total amount of money contributed by donors. The authors also obtain some negative results, which show that it is impossible to verify the correctness of the donation in some circumstances.
As a result of the analysis of social projects based on blockchain and relevant studies on this topic, we can conclude that in most cases projects have the following features:

• Ability to raise funds in cryptocurrency;
• Data storage in the blockchain;
• Motivation of donors and volunteers (through rewards in the form of tokens);
• High integration with a typical client/server architecture (hybrid) as a consequence of the fact that most blockchain-based projects are not fully decentralized.
3 Application of Blockchain in the Donation Industry

In the framework of this work, the authors propose a solution based on a distributed registry (hereinafter referred to as the Platform), which can help monitor and control the process of money transfer from donor to beneficiary and their further information interaction with the service. Currently, digital platforms (sharing platforms, search engines, social networks, e-commerce platforms, etc.) and platform ecosystems are transforming entire industries and various types of socio-economic activity, and are becoming drivers of economic growth, innovation and competition. A digital platform is an online service that makes it possible to carry out transactions between contractors and counterparties. The online system conducts all stages of the transaction, from providing communication between the contractor and the customer to receiving payment and feedback. The platform is exactly the kind of solution that can help people even when they are located on the other side of the globe. The blockchain-based platform is an alternative solution with decentralized and direct transactions that can help charities receive donations and raise funds more efficiently. In order to determine the approach and to create a format for a possible product, we need to understand what the audience needs and what it lacks now.

3.1 The Analysis of Participants and Their Needs
In order to define a blockchain solution for charity and donation tracking and to design a possible product, it is necessary to understand the interests of consumers and service providers:
• Donors (sponsors) are individuals or foundations who invest in a cause, or in an organization that supports their cause.
• A recipient is the ultimate target of a donation, typically an individual [23].
• Fund and charity organizations (aggregators) are commercial entities that provide services for donors and are driven not by profit but by dedication to charity.
When a donor gives money to charity, they want to be sure that their funds will go to the right hands. They need to know that their donation will really benefit those whom they have decided to help. They want to be sure that a charitable organization can be
Blockchain in Charity: Platform for Tracking Donations
trusted, and that it is not a scam. A blockchain solution should bring together donors and recipients within charity projects. Moments when a person performs operations with money and shares personal data are often objectively insecure and create an increased opportunity for abuse and fraud. Therefore, in such situations (especially when money and personal data are involved simultaneously), people behave very vigilantly. This applies especially to payments made over the Internet or by mobile phone. The task of the blockchain platform is to create a mechanism that is safe for the donor and that also gives the donor the subjective assurance that the money is transferred through a secure distributed registry system. Thus, we should create a system with the following features:
• Donors should be able to easily attach instructions to their contributions, such as “I would like my donations to be spent only on food and only on work in my state”.
• Donors need to receive the most complete report on how their money was spent.
• Donors should not need to spend a lot of time looking for projects that match their interests or checking how their contributions have been spent, even if a full account is kept.
• In addition to transferring funds, donors should be able to remain anonymous. However, donors who wish to be publicly recognized, or to incentivize others by matching contributions, should be able to do so easily.
It is necessary to ensure the anonymity of donors and aid recipients with respect to internal operators, with only authorized persons given access to the search technology for the encrypted data used to access personal information. All personal information of donors and aid recipients must be encrypted so that it cannot be read by anyone other than the user [24]. For the funds and charity organizations, we should ensure the efficiency of operations and the following functionality.
Recipient aggregators should be able to easily identify projects and their attributes in ways that facilitate optimal pairing or large-scale aggregation. The system should allow interested parties to stand out. Therefore, attributes such as a low transaction cost should be readily available as conditions. Transparency, safety, and accuracy of operations will affect the reputation of the fund, which will help attract more donors and implement charity projects. At present, the donation route that many funds use is “manual” and non-transparent (see Fig. 1). The route contains many links that are not registered automatically, and funds prepare reports on received and spent financial assets by hand. To implement donation tracking, it is proposed to modify the existing donation scheme by adding database, cloud storage, and blockchain components (Fig. 2).
3.2 System Architecture
The platform solution suggested by the authors consists of the following parts: server and client parts; smart contracts; and a database and an FTP server for file storage (Fig. 3).
Fig. 1. The current donation route without blockchain
Fig. 2. Modified donation route with blockchain
Fig. 3. High level platform architecture
The integration of the platform with charitable Foundation systems (e.g. CRM) takes place through the REST API which is provided. All donations and movements of charitable funds will have to be registered through the REST API.
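The paper does not spell out what is written to the chain when a movement of funds is registered. One common pattern in hybrid designs, assumed here purely for illustration, is to keep the full record in the off-chain database and to anchor only a digest of it on the chain. The sketch below uses in-memory stand-ins; `HybridRegistry`, `record_digest`, and the record fields are hypothetical names, not part of the authors' platform.

```python
import hashlib
import json


def record_digest(record: dict) -> str:
    """Return the SHA-256 hex digest of a record in canonical JSON form."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class HybridRegistry:
    """Toy stand-in: a dict plays the off-chain database, a list plays the chain."""

    def __init__(self):
        self.off_chain = {}  # donation_id -> full record (assumed layout)
        self.on_chain = []   # append-only list of (donation_id, digest)

    def register(self, donation_id: str, record: dict) -> None:
        """Store the full record off-chain and anchor its digest."""
        self.off_chain[donation_id] = dict(record)
        self.on_chain.append((donation_id, record_digest(record)))

    def verify(self, donation_id: str) -> bool:
        """True if the stored record still matches its latest anchored digest."""
        stored = self.off_chain[donation_id]
        anchored = [d for i, d in self.on_chain if i == donation_id]
        return bool(anchored) and anchored[-1] == record_digest(stored)


registry = HybridRegistry()
registry.register("D-001", {"amount": 100, "currency": "USD", "purpose": "food"})
print(registry.verify("D-001"))             # record matches its digest
registry.off_chain["D-001"]["amount"] = 999  # simulate off-chain tampering
print(registry.verify("D-001"))             # mismatch is detected
```

Under this assumption only a fixed-size digest would need to go on the chain per movement, while any later change to the off-chain record becomes detectable.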
The platform uses the Ethereum test network Ropsten. For communication between the server part and the blockchain, we use the Web3.js library. Smart contracts are implemented in Solidity. The server part (REST API) is implemented in JavaScript on the Node.js platform with the Express framework. Node.js offers high performance, has an active community, and is supported by large companies. MySQL is used as centralized (off-chain) data storage, and functions and procedures for quick interaction with the database have been developed. The system also includes a Telegram bot, which interacts with the user when they create a new donation. The bot receives from the user the amount that they want to donate to charity and returns an ID for the donation. The user can then enter this ID on the website or in the Telegram bot to get detailed information about the donation and track where exactly the funds were spent. The functionality of the bot is similar to that of the website. The Telegram bot was implemented in Python to simulate the process of donating and spending funds.
3.3 System Functionality
The main functions available to the donor are “Make a donation” and “Get information about a donation”. The donor can retrieve information about a donation using its unique identifier. The information shows the route of the funds from the account of the charity fund to their ultimate destination. The functions for the charity organization are “Register transaction” and “Upload donation report”. Based on the donation information, a charity will be able to export a report for the Ministry of Justice and a report for publication on its website. The platform implements five key stages (see Fig. 4). Let us consider each stage in more detail.
Fig. 4. Stages of implementing blockchain in charity
Stage 1. Donor registration and beneficiary verification. After registration, the Fund verifies the Beneficiary’s application, determines the type of service, creates an electronic wallet for collecting donations, and indicates the required amount. An electronic wallet is also created for the donor. Each electronic wallet is assigned its own unique ID number.
Stage 2. Making donations. The donor makes a conventional bank payment, specifying the ID of their electronic wallet in the payment purpose field. The Fund confirms that the transfer is completed and credits it to the donor’s electronic wallet.
Stage 3. Disposal of donations. Using their ID, the donor enters the Platform and transfers funds in any proportion from their electronic wallet to the electronic wallets of the beneficiaries whom they want to help.
Stage 4. Expenditure of donations. When the required amount has accumulated in the beneficiary’s wallet, a smart contract for spending funds is executed, and the charity fund transfers money from its account to the Service Provider’s account in the amount indicated in the beneficiary’s electronic wallet.
Stage 5. Control of the donation expenditures by the donor. At all stages, the donor can track the process of helping the beneficiary and the process of spending donations.
Let us consider an example of a transaction using the Platform. An individual or company has allocated $100. Thanks to the blockchain transaction monitoring system, it can be seen that $20 went to protect the national parks of region X, $40 to protect the rights of children in orphanages, $30 to provide network access to the poorest segments of the population, and $10 to purify the water in reservoir N.
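The five stages and the $100 example can be traced with a small in-memory model. This is only a sketch of the bookkeeping, not the authors' smart contracts: the class and method names, the wallet IDs, and the beneficiaries' target amounts are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Wallet:
    wallet_id: str
    balance: int = 0
    target: int = 0  # required amount; only meaningful for beneficiaries


@dataclass
class Platform:
    wallets: dict = field(default_factory=dict)
    transfers: list = field(default_factory=list)  # (source, destination, amount)

    def open_wallet(self, wallet_id, target=0):
        """Stage 1: create a wallet with a unique ID."""
        if wallet_id in self.wallets:
            raise ValueError(f"wallet {wallet_id} already exists")
        self.wallets[wallet_id] = Wallet(wallet_id, target=target)

    def credit(self, donor_id, amount):
        """Stage 2: the fund confirms a bank payment and credits the donor."""
        self.wallets[donor_id].balance += amount
        self.transfers.append(("bank", donor_id, amount))

    def allocate(self, donor_id, beneficiary_id, amount):
        """Stage 3: the donor moves funds to a beneficiary wallet."""
        donor = self.wallets[donor_id]
        if amount <= 0 or amount > donor.balance:
            raise ValueError("invalid amount")
        donor.balance -= amount
        self.wallets[beneficiary_id].balance += amount
        self.transfers.append((donor_id, beneficiary_id, amount))

    def ready_to_spend(self, beneficiary_id):
        """Stage 4 trigger: the required amount has accumulated."""
        wallet = self.wallets[beneficiary_id]
        return wallet.balance >= wallet.target

    def trace(self, donor_id):
        """Stage 5: total sent by a donor to each destination."""
        totals = {}
        for source, destination, amount in self.transfers:
            if source == donor_id:
                totals[destination] = totals.get(destination, 0) + amount
        return totals


platform = Platform()
platform.open_wallet("donor-1")
# Target amounts are invented for the example.
for name, target in [("parks", 20), ("children", 40), ("network", 30), ("water", 50)]:
    platform.open_wallet(name, target=target)
platform.credit("donor-1", 100)
for name, amount in [("parks", 20), ("children", 40), ("network", 30), ("water", 10)]:
    platform.allocate("donor-1", name, amount)
print(platform.trace("donor-1"))
print(platform.ready_to_spend("parks"), platform.ready_to_spend("water"))
```

In this model the spending step for a beneficiary becomes possible only once its wallet reaches the target, which mirrors the condition described in Stage 4.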
4 Limitations
Despite all the advantages of using blockchain for conducting and monitoring financial transactions in charity, the Platform has a number of limitations. Firstly, the service requires substantial computing power, since several thousand operations must be processed simultaneously. Secondly, there are legal barriers: existing personal data law limits the use of this technology. Finally, blockchain is still considered a new technology, and people may be reluctant to use a technology that they perceive as insufficiently tested and risky. That is why all potential risks associated with blockchain technology require careful consideration. It is vital that all new technology is tested in an ethical manner, with minimal risk to beneficiaries [25].
5 Conclusion and Future Work
Blockchain has enormous potential in solving the problems of charity today. The main ones are increasing confidence in NPOs and removing legal and geographical barriers to donations. In a number of developed countries (USA, UK, Canada, China) there are already blockchain platforms and cryptocurrencies that optimize processes in charity. The goal of every social innovation organization is to solve problems at scale. By reducing costs and increasing the flow of funds, blockchain can help social innovation organizations scale up their operations. It requires a huge operational and financial
transformation, especially in building the right kind of ecosystem enabled by emerging technologies like blockchain [26]. The operation of such a Platform can be divided into five stages: registration and verification of donors and beneficiaries, making donations, disposing of funds, spending them, and control of the expenditures by the donor. At each of the stages, the donor controls the movement of their funds. The Platform has technological and legal limitations that must be considered during implementation. Currently, a minimum viable product of the future Platform has been implemented using the Ethereum Ropsten network. Smart contracts are implemented in Solidity. The server part of the platform was developed in JavaScript on the Node.js platform. Telegram bots were developed to simulate the donation process and to retrieve the chain of fund spending. The main advantages of the chosen architecture are as follows:
• Saving disk space on the blockchain.
• A trusted authority for access to the data.
• Fast data processing.
• The ability to implement client applications on other platforms (REST API).
Future development of the Platform involves customizing it for funds’ needs, integrating it with various client systems (e.g. CRMs), and scaling it and increasing its performance. The results were also presented to the Russian National Payment Card System (NSPK), which is responsible for developing payment services, supporting the country’s sovereignty, and setting industry standards, with a further agreement on cooperation and joint projects.
References
1. Yli-Huumo, J., Ko, D., Choi, S., et al.: Where is current research on blockchain technology? A systematic review. Plos One 11(10) (2016). https://doi.org/10.1371/journal.pone.0163477. Accessed 12 Mar 2020
2. Avdoshin, S., Pesotskaya, E.: Blockchain revolution in the healthcare industry. In: Proceedings of the Future Technologies Conference (FTC) 2018, vol. 1. Springer, Switzerland (2019)
3. Herweijer, C., Waughray, D., Warren, S.: Building block(chain)s for a better planet. Fourth Industrial Revolution for the Earth Series. World Economic Forum. http://www3.weforum.org/docs/WEF_Building-Blockchains.pdf. Accessed 22 Feb 2020
4. Fusetti, F.N., Ravalli, M., Minacori, V., et al.: AidCoin Whitepaper. https://www.aidcoin.co/assets/documents/whitepaper.pdf. Accessed 21 Jan 2020
5. Shpak, A., Misytina, V., Oganesyan, A.: Russian_Philanthropist. Skolkovo School of Management. https://common.skolkovo.ru/downloads/documents/SKOLKOVO_WTC/Research/SKOLKOVO_WTC_Russian_Philanthropist_Rus.pdf. Accessed 02 Dec 2019 [In Russian]. Accessed 22 Jan 2020
6. Tambanis, D.: Blockchain Applications: Charitable Giving. Blockchain Philanthropy Foundation – 2018. https://medium.com/bpfoundation/blockchain-applications-charitablegiving-a3c50837f464. Accessed 07 Feb 2020
7. The Future of Philanthropy. Fidelity Charitable Report (2016). https://www.fidelitycharitable.org/docs/future-of-philanthropy.pdf. Accessed 24 Jan 2020
8. Nizza, A.: Blagotvoritelnost na blokchejne [In Russian]/Charity on Blockchain. Higher School of Economics (2017). https://www.hse.ru/news/community/215221346.html. Accessed 14 Jan 2020
9. Cox, L.: 5 ways charities are using technology. Disruption magazine (2017). https://disruptionhub.com/5-ways-charities-use-technology/. Accessed 22 Dec 2019
10. Elsden, C., Manohar, A., Briggs, J.: Making sense of blockchain applications: a typology for HCI. In: The 2018 CHI Conference, April 2018
11. Soohoo, S., Goepfert, J.: Worldwide blockchain 2018–2022 forecast: market opportunity by use case—2H17 update. IDC, August 2018
12. Klein, J.: Blockchain social good organizations that are actually doing something. BreakerMag (2019). https://breakermag.com/73-blockchain-social-good-organizations-thatare-actually-doing-something/. Accessed 17 Jan 2020
13. Culhane, M.: Scientometric study on distributed ledger technology (Blockchain). National Research Council (NRC). Defence Research and Development Canada, March 2019
14. Wong, M., Valeri, D.: Distributed ledger technology/blockchain: perspectives. Chartered Professional Accountants of Canada (CPA Canada). University of Toronto, Institute of Management and Innovation (2019)
15. Ivanov, V., Ivanova, N., Mersiyanova, I., Tumanova, A., et al.: Volunteering and charity in Russia & goals of National development. In: XXI April International Conference on Economic and Social Development Proceedings [In Russian]. Higher School of Economics (2019). https://conf.hse.ru/mirror/pubs/share/262128086. Accessed 22 Feb 2020
16. Yaznevich, E., Babikhina, K., Freik, N.: Professional charity in Russia. Analytical report. Charity Fund “NeedHelp” (2018). [In Russian]. https://nuzhnapomosh.ru/research/2018/professionalnaya-blagotvoritelnos/. Accessed 17 Jan 2020
17. Galen, D., Brand, N., Boucherle, L., et al.: Blockchain for social impact. Stanford Graduate School of Business. www.gsb.stanford.edu/sites/gsb/files/publication-pdf/study-blockchainimpact-moving-beyond-hype.pdf. Accessed 12 Feb 2020
18. Giving report. Fidelity Charitable Report (2019). www.fidelitycharitable.org/content/dam/fcpublic/docs/insights/2019-giving-report.pdf. Accessed 21 Dec 2010
19. Zwitter, A., Boisse-Despiaux, M.: Blockchain for humanitarian action and development aid. J. Int. Hum. Action 3(1), 16 (2018)
20. Agarwal, P., Jalan, S., Mustafi, A.: Decentralized and financial approach to effective charity. In: 2018 International Conference on Soft-computing and Network Security (ICSNS). IEEE (2018)
21. Dajim, L.A., et al.: Organ donation decentralized application using blockchain technology. In: 2019 2nd International Conference on Computer Applications & Information Security (ICCAIS). IEEE (2019)
22. Fu, B., Zhu, F., Abraham, J.: A model for donation verification. arXiv preprint arXiv:1708.07973 (2017)
23. Jain, S., Simha, R.: Blockchain for the common good: a digital currency for citizen philanthropy and social entrepreneurship. In: 2018 IEEE Proceedings Confs on Internet of Things (2018)
24. Lee, J., Seo, A., Kim, Y., Jeong, J.: Blockchain-based one-off address system to guarantee transparency and privacy for a sustainable donation environment. Sustainability 10, 4422 (2018)
25. Zwitter, A., Herman, J. (eds.): Blockchain for Sustainable Development Goals: #Blockchain4SDGs - Report 2018 (7-2018 ed.) Rijksuniversiteit Groningen, Leeuwarden (2018). https://www.rug.nl/research/portal/files/63204374/351162_Paper_Blockchain_4SDGs_A4_RUG_CF_LRdef_2_.pdf. Accessed 20 Feb 2020
26. Podder, S., Roy, P., Tanguturi, P., Singh, S.K.: Blockchain for good. Accenture Lab Report (2017). https://www.accenture.com/_acnmedia/pdf-68/accenture-808045-blockchainpovrgb.pdf. Accessed 14 Jan 2020
27. Lehn, M.B., Voida, A., Bopp, C.: Policy fields, data systems, and the performance of nonprofit human service organizations. Hum. Serv. Organ.: Manag. Leadersh. Govern. 42, 2 (2018). https://doi.org/10.1080/23303131.2017.1422072. Accessed 20 Jan 2020
Data Analytics-Based Maintenance Function Performance Measurement Framework and Indicator
C. I. Okonta and R. O. Edokpia
University of Benin, PMB 1154, Benin City, Nigeria
Abstract. A maintenance tag is a tool used to identify and visualize a deviation of a machine from its basic condition. A failure or deviation is identified during an inspection and then tagged for easy identification. The tag is registered on the computerised maintenance management system (CMMS) as a work request, which is then converted to a work order, scheduled, and executed, after which feedback is given to the operator to remove the tag (de-tagging). In this work, a data analytics-based maintenance function performance measurement framework and indicators are designed, and the maintenance decision support process is applied in a beverage production plant. A classification algorithm groups the notifications of maintenance requests, based on parameters defined by the maintenance policy, into pending tags, current week tags, two weeks tags, three weeks tags, report tags, and solved tags, distributed according to functional locations. With this insight, a correct decision on preventive maintenance can be made based on the actual condition of the machine by selecting the most cost-effective maintenance approach and system to achieve operational safety.
Keywords: Maintenance, Tag, Analytics
© Springer Nature Switzerland AG 2021 K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 702–716, 2021. https://doi.org/10.1007/978-3-030-63089-8_46
1 Introduction
The key determinants of the competitiveness and performance of manufacturing companies are the availability, reliability, and productivity of their production equipment. With this recognition, the perception of maintenance as a necessary evil has changed over the past decade, and maintenance is now seen as a value-adding activity. The ever-increasing demand from stakeholders for increased profitability, coupled with advances in technology, has led to the implementation of advanced manufacturing technologies, increased automation, and a reduction in buffers of spares inventory. This, in turn, has put increasing pressure on maintenance managers to sustain and improve system performance, as maintenance management has become central to manufacturers’ productivity and profitability [1]. A good overview of maintenance processes and achievements, such as that provided by maintenance performance frameworks and indicators, is a prerequisite for maintenance managers to ensure good performance of the production plant. Maintenance performance measurement (MPM) systems may not necessarily guarantee performance improvement if they are generic [2]. Thus, the MPM framework should be linked with the business objectives and the corporate strategy of the organization, so that a customised rather than generic MPM system is developed. In maintenance, the policy that dictates which parameter triggers a maintenance action is referred to as the maintenance policy [3]. General maintenance triggers such as run time and elapsed time may not always fit a company, hence the need for a tailored maintenance concept. Maintenance requires practical approaches, which are still underrepresented in the literature [4, 5], creating a gap between academia and practice [6]. A maintenance performance measurement (MPM) framework should align the maintenance objectives of the strategic, tactical, and operational management levels with the relevant maintenance performance indicators (MPI) [7].
To remain profitable, industry has adopted Industry 4.0, an industrial revolution that brings new improvements to operation and maintenance. Industry 4.0 focuses on data-driven operations and maintenance. Big data analysis techniques allow systematic analysis to reveal hidden patterns in huge streaming data sets. This has facilitated the development of preventative maintenance (PM) and its use in a wide range of monitoring applications, such as vibration, temperature, and noise [8]. To attain sustainable maintenance that aligns with organisational objectives, data-driven maintenance and operations based on data analytics are the most versatile approaches. Sustainable maintenance is a new challenge for companies seeking sustainable development. It is a process of continuous development and constant improvement of maintenance processes, increasing efficiency (operational excellence) and the safety of operations and maintenance of technical objects and installations, focused on employees [9].
2 Data-Driven Process
Data analytics is a systematic approach to identifying the characteristics of data in order to derive logic, patterns, and relationships [10]. The essence of big data analytics in a production system is to get a better understanding of deviations from the normal operational standard. A data-driven process, as shown in Fig. 1, entails the use of data mining and statistical tools to discover patterns, trends, and other valuable information in a data set [11]. The data inherent in automation systems, computerized maintenance management systems, security and access control systems, and IT networks represent an untapped opportunity to improve operation and maintenance (O&M) [12].
2.1 Model of the Data Analytics Process
The data analysis process comprises several steps that are necessary to analyse the current condition of the target component. These steps, shown in Fig. 2, include data pre-processing, feature extraction and selection, and classification. Classification involves mapping suitable features to known classes using machine learning algorithms [13].
[Diagram labels: Data, Analytics, Data-driven maintenance, Data-driven operation]
Fig. 1. Data-driven process
[Diagram labels: CMMS/SAP data, Classification model, Postprocessing, Scorecard, Maintenance alert]
Fig. 2. Data processing
Fig. 3. Sample tags
In this study, maintenance data from the computerised maintenance management system (CMMS) can be treated as preprocessed data, because they are mostly system generated or come from tags raised by operators and technicians as a result of an abnormality observed on a machine. To gain the necessary insight from the preprocessed data, a classification algorithm is developed to model the input data based on maintenance criticality from the parameter settings. For management to keep track of the execution of maintenance activities, the classified data are passed through a postprocessing stage, which transforms them into an anomaly scorecard that is easily interpretable by the domain engineer. The output of the data analytics process is alert information, with the classification parameters of the analytics process, for maintenance decisions. The data analytics model is shown in Fig. 2.
3 Tagging System
The tagging system uses the maintenance tag as a tool to identify and visualize a deviation of a machine from its basic condition. A failure or deviation is identified during an inspection and then tagged for easy identification. The tag is registered on the computerised maintenance management system (CMMS) as a work request, which is then converted to a work order, scheduled, and executed, after which feedback is given to the operator to remove the tag (de-tagging). The steps involved are elaborated in the following subsections, with the flowchart in Fig. 4.
3.1 Initiation of Tag
In the event of any abnormality on the equipment, irrespective of its seriousness, the operator raises a tag by filling out a tagging form indicating the equipment, the type of abnormality, and a description of the abnormality. Samples of different types of tags are shown in Fig. 3.
3.2 Notification
An anomaly leads to a tag. A tag is treated as a work request on the CMMS (in this case SAP) by raising a notification, and it is discussed in the daily meeting (installation control).
3.3 Work Order
When the tag is approved (it can also be rejected) and the repair requires material, the tag becomes a work order. Note that not all tags lead to a work order.
3.4 First Line Maintenance
The work order is executed by a technician, who reports this to the operator. The operator then removes the tag.
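The life cycle of a tag described in this section can be read as a small state machine. The sketch below is one possible encoding; the state names are hypothetical, and the direct path from a notification to execution (for approved tags that need no material, and hence no work order) is an assumption drawn from the remark that not all tags lead to a work order.

```python
from enum import Enum


class TagState(Enum):
    RAISED = "raised"
    NOTIFIED = "notified"
    REJECTED = "rejected"
    WORK_ORDER = "work order"
    EXECUTED = "executed"
    DETAGGED = "de-tagged"


# Allowed transitions; the NOTIFIED -> EXECUTED shortcut is an assumption.
ALLOWED = {
    TagState.RAISED: {TagState.NOTIFIED},
    TagState.NOTIFIED: {TagState.REJECTED, TagState.WORK_ORDER, TagState.EXECUTED},
    TagState.WORK_ORDER: {TagState.EXECUTED},
    TagState.EXECUTED: {TagState.DETAGGED},
    TagState.REJECTED: set(),
    TagState.DETAGGED: set(),
}


def advance(current: TagState, new: TagState) -> TagState:
    """Return the new state, or raise if the transition is not allowed."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {new.value}")
    return new


state = TagState.RAISED
for step in (TagState.NOTIFIED, TagState.WORK_ORDER, TagState.EXECUTED, TagState.DETAGGED):
    state = advance(state, step)
print(state.value)
```

Encoding the transitions as data makes it easy to reject inconsistent updates, for example removing a tag that was never executed.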
[Flowchart labels: Start; Detection of failure/abnormality; Raise tag; Enter tag detail into SAP; Tag to work order; Solved within 1 week?; Solved within 2 weeks?; Solved within 3 weeks?; Resolve failure/abnormality; De-tag; Close on SAP; Escalate to management; End]
Fig. 4. Flowchart of tagging and de-tagging process
4 The Measurement Framework and Indicator
The analytical software developed in this work interfaces with SAP for data streaming, as shown in Fig. 5, and classifies the tags into the following periods:
1. Current week
2. Two weeks
3. Three weeks
4. Report tags
[Diagram labels: PM Analytic Software, CMMS Interfacing (SAP)]
Fig. 5. Data analytics software interfacing
Current-week tags are between 0 and 7 days old, two-week tags between 7 and 14 days, and three-week tags between 14 and 21 days. Report tags are tags that have remained unsolved for up to one month. The software also groups the tags according to equipment and functional location, i.e. lines. Electronic tag boards are generated for each machine. Since the tagging software uses data directly from SAP, it has an interface that allows technicians (or anyone else) to view the tags for any machine, with their classification, for a selected time frame. With this interface, technicians do not need to go to SAP to check the tags on their machine, as this is done automatically. A tag board is generated for each selection, and a copy of the board can be sent via Outlook email, as the tool is linked directly to email. Figure 6 shows the flowchart of this maintenance measurement framework and indicator. A summary of the procedure is shown in Fig. 7, with the pseudo code in Fig. 8.
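The age-based classes above can be expressed as a short function. This is a minimal sketch rather than the authors' implementation: the text does not say which class a tag of exactly 7, 14, or 21 days belongs to, so the boundaries here are assumed to close on the upper end, and every unsolved tag older than 21 days is treated as a report tag.

```python
from datetime import date


def classify_tag(raised_on: date, today: date, solved: bool = False) -> str:
    """Assign a tag to a period class from its age in days."""
    if solved:
        return "solved"
    age = (today - raised_on).days
    if age < 0:
        raise ValueError("tag raised in the future")
    if age <= 7:       # assumed: day 7 still counts as current week
        return "current week"
    if age <= 14:      # assumed upper-closed boundary
        return "two weeks"
    if age <= 21:      # assumed upper-closed boundary
        return "three weeks"
    return "report"    # older unsolved tags are escalated


today = date(2020, 3, 31)
for raised in (date(2020, 3, 30), date(2020, 3, 20), date(2020, 3, 12), date(2020, 2, 20)):
    print(raised.isoformat(), classify_tag(raised, today))
```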
5 Demonstration
The data analytics-based maintenance function performance measurement framework and indicators start with a highlight of the management key priorities, which include structure, process and task, people and competences, information and systems, governance and performance, and reward and recognition. These six design elements make up a high performing organization (HPO) with total productive management (TPM) in place, as shown in Fig. 9. The homepage interface in Fig. 10 gives an overview of the tag management maintenance analytic tool, which covers all the production areas of the plant, from brewing to utilities and packaging. The user interface allows navigation to
[Flowchart labels: Start; Navigate through SAP data; All IDs found? (No: Column not found); Update data table; Generate analyses; Status = Solved?; No. of days = 1, 7, 14, 21, 28?; Email = empty?; Rem = Not Sent?; Update mail data; Send mail; End]
Fig. 6. Maintenance measurement framework and indicator flowchart
[Flowchart labels: Start; Update from CMMS; Run analyses; Visualize result; Send notification; End]
Fig. 7. Summary of the application framework
different sections of the maintenance data analytics and visualization software and shows the date and time of the last update and synchronization with the linked database (in this case, SAP Enterprise). Figure 11 shows the embedded database, which replicates what is stored in the general CMMS database for quick manipulation within the software GUI without having to update from the external database every time. For data integrity, this database is read-only, and it is updated at the request of the user through the update button on the homepage. The classification algorithm groups the notifications into classes based on the parameters as at the last update. The classes, which include pending tags, current week tags, two weeks tags, three weeks tags, report tags, and solved tags, are distributed according to the time interval chosen in the date selection interface. Information on
Pseudo code
1. Open application
2. Check last update
3. Update current = OK
4. Go to 12
5. Update current = Not OK
6. Open CMMS and log on
7. Update database table
8. Complete without error = OK
9. Go to 12
10. Complete with error = Not OK
11. Go to 6
12. Run machine maintenance tag analyses
13. Run tagging and de-tagging update
14. Send notification
15. Visualize result
Fig. 8. Pseudo code
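The control flow of the pseudo code in Fig. 8 can be sketched as a single function: refresh the local copy of the CMMS data if it is stale, retrying on error, and then run the remaining steps in order. The function and parameter names are hypothetical, and the `max_attempts` limit is an addition of this sketch; the pseudo code itself loops back to step 6 without a bound.

```python
def run_cycle(is_current, update_from_cmms, steps, max_attempts=3):
    """Refresh stale data (with retries), then run the analysis steps in order."""
    if not is_current():                    # steps 2, 3 and 5
        for _ in range(max_attempts):       # steps 6 to 11, bounded here
            if update_from_cmms():          # True means completed without error
                break
        else:
            raise RuntimeError("CMMS update did not complete")
    return [step() for step in steps]       # steps 12 to 15


attempts = []


def flaky_update():
    """Stand-in for the CMMS update: fails once, then succeeds."""
    attempts.append(1)
    return len(attempts) >= 2


result = run_cycle(
    is_current=lambda: False,
    update_from_cmms=flaky_update,
    steps=[lambda: "analysed", lambda: "tags updated",
           lambda: "notified", lambda: "visualized"],
)
print(result, len(attempts))
```

Bounding the retries is a design choice for the sketch: it turns a persistent CMMS failure into an explicit error instead of an endless loop.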
Fig. 9. Welcome page with management key priorities
Fig. 10. Homepage interface
Fig. 11. Embedded database
the task assignment, which includes the machine leader responsible for maintenance within that zone, the machine identification number, the date of the request, the notification number, and a description of the maintenance, is also grouped with the primary weekly classes, as in Fig. 12. This makes the analytic tool more flexible, as the maintenance team may want an analysis within a particular time interval. Figure 13 illustrates the scorecard on the performance of the whole plant with respect to tagging and de-tagging. For each functional location, it analyses the number of tags that are raised, pending, and resolved. Copies of this report are sent to the appropriate authority for the necessary action. The performance of the plant can be ascertained from the number of maintenance requests raised by the process
Fig. 12. Data classification
Fig. 13. Sectional performance scorecard
owners and the number of such requests that have been resolved within the selected time period. The management tab in Fig. 14 is for alerts: reports with information on pending maintenance notifications and requests are escalated to the management team for quick decisions and implementation of actions. The report lists all the actions of the selected class with their notification, the date of the request, the initiator of the request, and the duration of the request.
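The per-location counts behind a scorecard such as the one in Fig. 13 amount to a simple aggregation. The sketch below assumes each tag is a dictionary with hypothetical `location` and `solved` fields; the actual SAP field names are not given in the text, and the sample locations are invented.

```python
from collections import defaultdict


def scorecard(tags):
    """Count raised, solved and pending tags per functional location."""
    board = defaultdict(lambda: {"raised": 0, "solved": 0, "pending": 0})
    for tag in tags:
        row = board[tag["location"]]
        row["raised"] += 1
        row["solved" if tag["solved"] else "pending"] += 1
    return dict(board)


# Hypothetical sample data.
sample = [
    {"location": "Brewing", "solved": True},
    {"location": "Brewing", "solved": False},
    {"location": "Packaging", "solved": True},
    {"location": "Packaging", "solved": True},
    {"location": "Utilities", "solved": False},
]
for location, row in scorecard(sample).items():
    print(location, row)
```

The same grouping keyed on the operator who raised the tag, instead of the location, would give per-person counts of raised, solved, and pending tags.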
Fig. 14. Management reporting
The chart in Fig. 15 gives information on the machine selected from the functional location. The illustration shows that, at the time it was produced, only one tag was left in the current-week and two-week classes, which indicates either that the machine is in near-perfect condition or that all the notifications raised within this period have been solved by the maintenance team. A closer look at the three-week tags and the report tags shows that there are more old tags on the machine, which indicates either negligence or the unavailability of spares or of the expertise to handle the maintenance. This has a lot to do with the complexity of the tasks and with the planning of the maintenance management team. The column for the total raised and solved tags gives insight into how many tags are being raised and resolved. It shows the quality and integrity of the inspections carried out by both the operators and the maintenance team. If machine availability is not up to the desired level, several pending tags are to be expected, showing that there are issues that need to be resolved in order to restore the machine to its basic condition. On the other hand, if availability and reliability are not up to the desired standard and there are no or very few tags on the machine, then the maintenance team needs to be questioned about the inspections and their ability to detect anomalies and report them for the necessary interventions, thereby highlighting the area that needs improvement. The navigation page in Fig. 16 allows the user to navigate through the different machines and areas within the plant, with their maintenance status, using the next and previous buttons. The TV icon switches to slideshow mode for automatic animation of the various charts at a predetermined time interval. In an industrial setting, visualization is key in communicating priority objectives to the shop floor. The TV mode interface in Fig. 17 allows user-dependent slides to be displayed on any chosen display screen, which could be a computer, a television, or even a projector located on the shop floor and visible to everybody. Based on the work area, the chart can be selected, and the slide timer can be changed from the timer combo box. People engagement is also captured using the tag championship system in Fig. 18. The system gives an overview of how many maintenance tags are raised by an operator and how many are solved or still pending within the selected area and time interval.
C. I. Okonta and R. O. Edokpia
Fig. 15. Individual machine tag analyses
Fig. 16. Navigation interface
This sharpens the operators' eye for detail to a very large extent, and it is currently supported by a reward for tag championship. The visualization of these tag championship charts serves as a strong motivation to drive shop floor excellence.
Fig. 17. TV mode interface
Fig. 18. Tag championship charts
6 Conclusion
A data analytics tool was developed to track and monitor the performance of the machines and to evaluate the effectiveness of the production and maintenance teams in carrying out inspections and executing maintenance. This system goes beyond visualization to give insight into the maintenance decisions to be made. For example, a machine in a bad state with few tags on it indicates the inability of the maintenance and production teams to identify anomalies (inadequate inspection). Furthermore, a machine full of tags indicates a deficiency in the maintenance team's ability to solve tags. Nothing happens without a cause; early detection of this cause is a measure to curtail the forthcoming forced deterioration, which could eventually lead to failure or breakdown.
The time interval between when an anomaly is discovered and when it is solved is a factor that could be used to estimate the expected availability and reliability of a machine and of the production line in general. Tracking this trend indicates the response rate of the maintenance team to an identified problem. A long response time presents an opportunity for linked deterioration, following a Markov network of state transitions, as one thing leads to another. Properly tracking this response time will help management decide on the nature of maintenance and the type of stop to be embedded in the maintenance decision.
References
1. Crespo Marquez, A., Gupta, J.N.D.: Contemporary maintenance management: process, framework and supporting pillars. Omega 34(3), 313–326 (2006). https://doi.org/10.1016/j.omega.2004.11.003
2. Muchiri, P., Pintelon, L., Gelders, L., Martin, H.: Development of maintenance function performance measurement framework and indicators. Int. J. Prod. Econ. 131(1), 295–302 (2011). https://doi.org/10.1016/j.ijpe.2010.04.039
3. Goossens, A.J.M., Basten, R.J.I.: Exploring maintenance policy selection using the analytic hierarchy process; An application for naval ships. Reliabil. Eng. Syst. Saf. 142, 31–41 (2015). https://doi.org/10.1016/j.ress.2015.04.014
4. Nicolai, R.P., Dekker, R.: Optimal maintenance of multi-component systems: a review. In: Kobbacy, K.A.H., Murthy, D.N.P. (eds.) Complex System Maintenance Handbook, pp. 263–286. Springer, London (2008)
5. Van Horenbeek, A., Pintelon, L., Muchiri, P.: Maintenance optimization models and criteria. Int. J. Syst. Assur. Eng. Manag. 1(3), 189–200 (2010). https://doi.org/10.1007/s13198-011-0045-x
6. Gits, C.W.: Design of maintenance concepts. Int. J. Prod. Econ. 24(3), 217–226 (1992). https://doi.org/10.1016/0925-5273(92)90133-R
7. Van Horenbeek, A., Pintelon, L.: Development of a maintenance performance measurement framework - using the analytic network process (ANP) for maintenance performance indicator selection. Omega 42(1), 33–46 (2014). https://doi.org/10.1016/j.omega.2013.02.006
8. Su, C.-J., Huang, S.-F.: Real-time big data analytics for hard disk drive predictive maintenance. Comput. Electr. Eng. 71, 93–101 (2018). https://doi.org/10.1016/j.compeleceng.2018.07.025
9. Golińska, P. (ed.): EcoProduction and Logistics: Emerging Trends and Business Practices. Springer, Heidelberg (2013)
10. Corporation, E.M.C. (ed.): Data Science & Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. Wiley, Indianapolis (2015)
11. Tiwari, S., Wee, H.M., Daryanto, Y.: Big data analytics in supply chain management between 2010 and 2016: Insights to industries. Comput. Ind. Eng. 115, 319–330 (2018). https://doi.org/10.1016/j.cie.2017.11.017
12. Burak Gunay, H., Shen, W., Newsham, G.: Data analytics to improve building performance: a critical review. Autom. Constr. 97, 96–109 (2019). https://doi.org/10.1016/j.autcon.2018.10.020
13. Uhlmann, E., Laghmouchi, A., Geisert, C., Hohwieler, E.: Decentralized data analytics for maintenance in industrie 4.0. Proc. Manuf. 11, 1120–1126 (2017). https://doi.org/10.1016/j.promfg.2017.07.233
Parallel Mapper
Mustafa Hajij (1, corresponding author), Basem Assiri (2), and Paul Rosen (3)
1 Santa Clara University, Santa Clara, USA [email protected]
2 Jazan University, Jazan City, Saudi Arabia [email protected]
3 University of South Florida, Tampa, FL, USA [email protected]
Abstract. The construction of Mapper has emerged in the last decade as a powerful and effective topological data analysis tool that approximates and generalizes other topological summaries, such as the Reeb graph, the contour tree, and the split and join trees. In this paper we study the parallel analysis of the construction of Mapper. We give a provably correct parallel algorithm to execute Mapper on multiple processors. Our algorithm relies on a divide and conquer strategy for the codomain cover, which gets pulled back to the domain cover. We demonstrate our approach for topological Mapper, then we show how it can be applied to the statistical version of Mapper. Furthermore, we discuss the performance results that compare our approach to a reference sequential Mapper implementation. Finally, we report the performance experiments that demonstrate the efficiency of our method. To the best of our knowledge this is the first algorithm that addresses the computation of Mapper in parallel.
Keywords: Mapper · Topological data analysis · High performance computing
1 Introduction and Motivation
The topology of data is one of the fundamental originating principles in studying data. Consider the classical problem of fitting a data set of points in $\mathbb{R}^n$ using linear regression. In linear regression one usually assumes that the data is distributed near a hyperplane in $\mathbb{R}^n$. See Fig. 1(a). If the data does not meet this assumption, then the model chosen to fit the data may not work very well. On the other hand, a clustering algorithm normally makes the shape assumption that the data falls into clusters. See Fig. 1(b). Data can come in many other forms and shapes, see Fig. 1(c). It is the shape of data [7] that drives the meaning of these analytical methods and determines how successfully they can be applied to the data.
© Springer Nature Switzerland AG 2021. K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 717–731, 2021. https://doi.org/10.1007/978-3-030-63089-8_47
M. Hajij et al.
Fig. 1. (a) The linear shape of the data is a fundamental assumption underlying the linear regression method. (b) Clustering algorithms assume that the data is clustered in a certain way. (c) Data can come in many other forms and shapes.
Topology is the field in mathematics that rigorously defines and studies the notion of shape. Over the past two decades, topology has found enormous applications in data analysis, and the application of topological techniques to study data is now considered a vibrant area of research called Topological Data Analysis (TDA) [6–11,14]. Many popular tools have been invented in the last two decades to study the shape of data, most notably Persistent Homology [17,37] and the construction of Mapper [41]. Persistent Homology has been successfully used to study a wide range of data problems, including the three-dimensional structure of DNA [18], financial networks [20], material science [25] and many other applications [35]. The construction of Mapper has emerged recently as a powerful and effective topological data analysis tool to solve a wide variety of problems [28,34,38], and it has been studied from multiple points of view [12,16,32]. Mapper works as a tool for approximating a topological space by mapping this space via a "lens", sometimes called a filter, to another domain. One then uses properties of the lens and the codomain to extract a topological approximation of the original space. We give the precise notion in Sect. 3. Mapper generalizes other topological summaries such as the Reeb graph, the contour tree, and the split and join trees. Moreover, Mapper is the core software developed by Ayasdi, a data analytics company whose main interest is promoting the usage of methods inspired by topological constructions in data science applications. As the demand for analyzing larger data sets grows, it is natural to consider the parallelization of topological computations. While there are numerous parallel algorithms that tackle the less general topological constructions, such as the Reeb graph and the contour tree, we are not aware of similar attempts targeting the parallel computation of Mapper in the literature. Our work here is an attempt to fill this gap.
This article addresses the parallel analysis of the construction of Mapper. We give a provably correct algorithm to distribute the computation of Mapper on a set of processors and discuss the performance results that compare our approach to a reference sequential implementation for the computation of Mapper. Finally, we report the performance analysis experiments that demonstrate the efficiency of our method.
2 Prior Work
In this section we review the work that has been done towards parallel computing in topological data analysis in regard to the Mapper construction. We also give some of the related literature that is aimed at computing TDA constructions efficiently.
2.1 Parallel Computing in Topological Data Analysis
While there are numerous algorithms to compute topological constructions sequentially, the literature on parallel computing in topology is relatively young. One notable exception is the parallelization of Morse-Smale complex computations [23,40]. The parallelization of merge trees is studied in [21,29,36,39], and more generally that of Reeb graphs in [24]. Other parallel algorithms in topology include multicore homology computation [26], spectral sequence parallelization [27], and distributed contour trees [30]. To the best of our knowledge this is the first algorithm that addresses the computation of Mapper in parallel.
2.2 Efficient Computation of Topological Constructions
There are several other attempts to speed up the serial computation of topological constructions including an optimized Mapper sequential algorithm for large data [42], a memory efficient method to compute persistent cohomology [4], efficient data structure for simplicial complexes [2], optimized computation of persistent homology [13] and Morse-Smale complexes [22].
3 Preliminaries and Definitions
We start this section by recalling basic notions from topology. For more details the reader is referred to standard texts in topology; see for instance [33]. All topological spaces we consider in this paper will be compact unless otherwise specified. An open cover of a topological space $X$ is a collection of open sets $\mathcal{U} = \{A_\alpha\}_{\alpha \in I}$ such that $\cup_{\alpha \in I} A_\alpha = X$. All covers in this article will consist of a finite number of sets unless otherwise specified. Given a topological space $X$ with a cover $\mathcal{U}$, one may approximate this space via an abstract simplicial complex construction called the nerve of the cover $\mathcal{U}$. The nerve of a cover is a simplicial complex whose vertices are represented by the open sets of the cover. Each non-empty intersection
between two sets in the cover defines an edge in the nerve, and each non-empty intersection between multiple sets defines a higher order simplex. See Fig. 3 for an illustrative example. Under mild conditions the nerve of a cover can be considered as an approximation of the underlying topological space. This is usually called the Nerve Theorem [19]. The Nerve Theorem plays an essential role in TDA: it gives a mathematically justified approximation of the topological space, thought of as the data under study, via simplicial complexes, which are suitable for data structures and algorithms. In [41] Singh et al. proposed using a continuous map $f : X \to Z$ to construct a nerve of the space $X$. Instead of covering $X$ directly, Singh et al. suggested covering the codomain $Z$ and then using the map $f$ to pull back this cover to $X$. This perspective is useful from multiple points of view. On one hand, choosing different maps on $X$ can be used to capture different aspects of the space $X$. In this sense the function $f$ is thought of as a "lens" or a "filter" through which we view the space $X$. On the other hand, fixing the map $f$ and choosing different covers for the codomain $Z$ can be used to obtain a multi-level resolution of the Mapper structure. This has recently been studied in detail in [15,16] and utilized to obtain a notion of a persistence-based signature based on the definition of Mapper. The Mapper construction is related to Reeb graphs. To illustrate the relationship, we give the following definition.
Fig. 2. (a) Given a scalar function $f : X \to [a, b]$ and an open cover $\mathcal{U}$ for $[a, b]$ we obtain an open cover $f^*(\mathcal{U})$ for the space $X$ by considering the inverse images of the elements of $\mathcal{U}$ under $f$. (b) The connected components of the inverse images are identified, as well as the intersections between these sets. (c) Mapper is defined as a graph whose vertices represent the connected components and whose edges represent the intersections between these components.
Definition 1. Let $X$ be a topological space and let $\mathcal{U}$ be an open cover for $X$. The 1-nerve $N_1(\mathcal{U})$ of $\mathcal{U}$ is a graph whose nodes are represented by the elements of $\mathcal{U}$ and whose edges are the pairs $A, B$ of $\mathcal{U}$ such that $A \cap B \neq \emptyset$.
Fig. 3. Each open set defines a vertex in the nerve simplicial complex. Each intersection between two sets defines an edge, and intersections between multiple sets define higher order simplices.
A scalar function $f$ on $X$ and a cover for the codomain $[a, b]$ of $f$ give rise to a natural cover of $X$ in the following way. Start by defining an open cover for the interval $[a, b]$ and take the inverse image of each open set to obtain an open cover for $X$. This is illustrated in Fig. 2(a). In other words, if $\mathcal{U} = \{(a_1, b_1), ..., (a_n, b_n)\}$ is a finite collection of open sets that covers the interval $[a, b]$, then $f^*(\mathcal{U}) := \{f^{-1}((a_1, b_1)), ..., f^{-1}((a_n, b_n))\}$ is an open cover for the space $X$. The open cover $f^*(\mathcal{U})$ can now be used to obtain the 1-nerve graph $N_1(f^*(\mathcal{U}))$. With an appropriate choice of the cover $\mathcal{U}$, the graph $N_1(f^*(\mathcal{U}))$ is a version of the Reeb graph $R(X, f)$ [12,32]. This is illustrated in Fig. 2. Observe that different covers for $[a, b]$ give various "resolutions" of the graph $N_1(f^*(\mathcal{U}))$.
The idea of Mapper presented in Definition 1 can be generalized to encompass a larger set of problems. One can replace the interval $[a, b]$ in Definition 1 by any parametrization domain $Z$ to obtain more sophisticated insights on the data $X$. This requires introducing the definition of the nerve of a cover of a topological space.
Definition 2. Let $X$ be a topological space and let $\mathcal{U}$ be a finite cover for $X$. The nerve of $\mathcal{U}$ is the abstract simplicial complex $N(\mathcal{U})$ whose vertices are the elements of $\mathcal{U}$ and whose simplices are the finite subcollections $A_1, ..., A_k$ of $\mathcal{U}$ such that $A_1 \cap ... \cap A_k \neq \emptyset$.
In this paper we will deal with nerves of multiple topological spaces simultaneously. For this reason we will sometimes refer to the nerve of a cover $\mathcal{U}$ of a space $X$ by $N(X, \mathcal{U})$. Figure 3 shows an illustrative example of a nerve on a topological space $X$. We will denote the vertex in $N(\mathcal{U})$ that corresponds to an open set $A$ in $\mathcal{U}$ by $v_A$. Let $f : X \to Z$ be a continuous map between two topological spaces $X$ and $Z$. Let $\mathcal{U}$ be a finite cover of $Z$. The cover that consists of $f^{-1}(U)$ for all open sets $U \in \mathcal{U}$ will be called the pullback of $\mathcal{U}$ under $f$ and will be denoted by $f^*(\mathcal{U})$.
A continuous map $f : X \to Z$ is said to be well-behaved if the inverse image of any path-connected set $U$ in $Z$ consists of finitely many path-connected sets in $X$ [15]. All maps in this paper will be assumed to be well-behaved.
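The nerve and the pullback of a cover can be illustrated on finite data. The following is a minimal sketch, assuming that open sets are modelled by finite Python sets of sample points; the function names `nerve` and `pullback`, the toy space and the lens are hypothetical choices made for illustration and are not taken from the paper.

```python
from itertools import combinations

def nerve(cover, max_dim=2):
    """Return the simplices of the nerve of a finite cover.

    cover: list of Python sets (finite stand-ins for open sets).
    A simplex is a tuple of cover indices whose sets share a point.
    """
    simplices = []
    for k in range(1, max_dim + 2):          # k sets span a (k-1)-simplex
        for idx in combinations(range(len(cover)), k):
            if set.intersection(*(cover[i] for i in idx)):
                simplices.append(idx)
    return simplices

def pullback(points, f, intervals):
    """Pull back open intervals (lo, hi) along f: the sets f^{-1}((lo, hi))."""
    return [{p for p in points if lo < f(p) < hi} for lo, hi in intervals]

points = list(range(10))                     # toy space X = {0, ..., 9}
f = lambda p: float(p)                       # hypothetical lens f: X -> [0, 9]
intervals = [(-1, 4), (2, 7), (5, 10)]       # open cover of the codomain
cover = pullback(points, f, intervals)
one_nerve = [s for s in nerve(cover) if len(s) == 2]   # edges of the 1-nerve
```

Here only consecutive intervals overlap, so the 1-nerve is a path on three vertices. The sketch does not split the preimages into connected components; that refinement appears in the Mapper algorithm itself.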
Definition 3. Let $f : X \to Z$ be a continuous map between two topological spaces $X$ and $Z$. Let $\mathcal{U}$ be a finite cover for $Z$. The Mapper of $f$ and $\mathcal{U}$, denoted by $M(f, \mathcal{U})$, is the nerve $N(f^*(\mathcal{U}))$.
3.1 Some Graph Theory Notions
Our construction requires a few definitions from graph theory. We include these notions here for completeness. See [3] for a more thorough treatment.
Definition 4. Let $G = (V, E)$ be a graph. Let $\sim$ be an equivalence relation defined on the node set $V$. The quotient graph of $G$ with respect to the equivalence relation $\sim$ is a graph $G/\sim$ whose node set is the quotient set $V/\sim$ and whose edge set is $\{([u], [v]) \mid (u, v) \in E\}$.
For example, consider the cyclic graph $C_6$ with $V = \{1, 2, 3, 4, 5, 6\}$ and edges $(1, 2), (2, 3), ..., (6, 1)$. Define the partition $\sim$ on $V$ by $p_1 = \{1, 2\}$, $p_2 = \{3, 4\}$ and $p_3 = \{5, 6\}$. The quotient graph induced by $\sim$ is the cyclic graph $C_3$. See Fig. 4.
Fig. 4. An example of a quotient graph.
We will also need the definition of the disjoint union of two graphs. We will denote the disjoint union of two sets $A$ and $B$ by $A \sqcup B$.
Definition 5. Let $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ be two graphs. The disjoint union of $G_1$ and $G_2$ is the graph $G_1 \sqcup G_2$ defined by $(V_1 \sqcup V_2, E_1 \sqcup E_2)$.
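The two graph operations above can be sketched in a few lines. This is an illustrative sketch only: the helper names `quotient_graph` and `disjoint_union` are hypothetical, graphs are stored as plain sets, and edges that fall inside a single equivalence class (which would become self-loops) are dropped, an assumption made so that the $C_6$ example yields the simple graph $C_3$.

```python
def quotient_graph(nodes, edges, cls):
    """Quotient of the graph (nodes, edges) by the partition encoded in `cls`.

    cls maps each node to a label for its equivalence class. Edges inside a
    single class would become self-loops and are dropped (an assumption).
    """
    q_nodes = {cls[v] for v in nodes}
    q_edges = {frozenset((cls[u], cls[v])) for u, v in edges if cls[u] != cls[v]}
    return q_nodes, q_edges

def disjoint_union(g1, g2):
    """Disjoint union: tag every node with the index of the graph it came from."""
    (v1, e1), (v2, e2) = g1, g2
    nodes = {(1, v) for v in v1} | {(2, v) for v in v2}
    edges = {((1, a), (1, b)) for a, b in e1} | {((2, a), (2, b)) for a, b in e2}
    return nodes, edges

# C6 with the partition p1 = {1, 2}, p2 = {3, 4}, p3 = {5, 6}
c6_nodes = {1, 2, 3, 4, 5, 6}
c6_edges = {(i, i % 6 + 1) for i in c6_nodes}          # (1,2), (2,3), ..., (6,1)
cls = {1: "p1", 2: "p1", 3: "p2", 4: "p2", 5: "p3", 6: "p3"}
q_nodes, q_edges = quotient_graph(c6_nodes, c6_edges, cls)
```

The quotient has the three nodes `p1`, `p2`, `p3` and one edge between each pair, which is the cycle $C_3$ of the example.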
4 Parallel Computing of Mapper
The idea of parallelizing the computation of Mapper lies in decomposing the space of interest into multiple smaller subspaces. The subspaces are chosen to overlap on a smaller portion to ensure a meaningful merging of the individual pieces. A cover of each subspace is then chosen. Each subspace, along with its cover, is then processed independently by a processing unit. The final stage consists of gathering the individual pieces and merging them to produce the final correct Mapper construction on the entire space. Let $f : X \to [a, b]$ be a continuous function. The construction of parallel Mapper on two units goes as follows:
1. Choose an open cover for the interval $[a, b]$ that consists of exactly two subintervals $A_1$ and $A_2$ such that $A := A_1 \cap A_2 \neq \emptyset$. See Fig. 5(a).
2. Choose open covers $\mathcal{U}_1$ and $\mathcal{U}_2$ for $A_1$ and $A_2$, respectively, that satisfy the following conditions. First, the only set common to the two coverings $\mathcal{U}_1$ and $\mathcal{U}_2$ is $A$. Furthermore, the covers $\mathcal{U}_1$ and $\mathcal{U}_2$ do not overlap in any way on any open set other than $A$.
3. Compute the Mapper construction on the covers $f^*(\mathcal{U}_i)$ for $i = 1, 2$. We obtain two graphs $G_1$ and $G_2$. See Fig. 5(b).
4. Merge the graphs $G_1$ and $G_2$ as follows. By the construction of $A$, $\mathcal{U}_1$ and $\mathcal{U}_2$, the set $A$ exists in both covers $\mathcal{U}_i$, $i = 1, 2$. Let $C_1, ..., C_n$ be the path-connected components of $f^{-1}(A)$. Since $A$ appears in both covers, every connected component $C_i$ of $f^{-1}(A)$ occurs in both graphs $G_1$ and $G_2$. In other words, the nodes $v_1, ..., v_n$ that correspond to the components $C_1, ..., C_n$ occur in both $G_1$ and $G_2$, where each vertex $v_i$ corresponds to the set $C_i$. The merge is done by considering the disjoint union $G_1 \sqcup G_2$ and then taking the quotient of this graph that identifies the duplicate nodes $v_1, ..., v_n$ present in both $G_1$ and $G_2$. See Fig. 5(c).
The steps of the previous algorithm are summarized in Fig. 5.
Fig. 5. The steps of the parallel Mapper on two units. (a) The space $X$ is decomposed based on a decomposition of the codomain. (b) Each part is sent to a processing unit and the Mapper graphs are computed on the subspaces. (c) The graphs are merged by identifying the corresponding nodes.
Remark 1. Note that the interval $[a, b]$ in the construction above can be replaced by any domain $Y$ and the construction remains valid. However, for the purpose of this paper we restrict ourselves to the simplest case $Y = [a, b]$.
Now define an $N$-chain cover of $[a, b]$ to be a cover $\mathcal{U}$ of $[a, b]$ that consists of $N$ open intervals $A_1, ..., A_N$ such that $A_{i,j} := A_i \cap A_j \neq \emptyset$ when $|i - j| = 1$ and empty otherwise. By convention, a 1-chain cover for an interval $[a, b]$ is any open interval that contains $[a, b]$.
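One simple way to build an $N$-chain cover is to cut $[a, b]$ into $N$ equal pieces and enlarge each piece slightly on both sides. The sketch below does this; the function names and the `overlap` parameter are assumptions introduced for illustration, and other constructions would satisfy the definition equally well.

```python
def chain_cover(a, b, n, overlap=0.25):
    """Cover [a, b] by n open intervals in which only neighbours overlap.

    `overlap` is the fraction of a base interval's width by which each
    interval is enlarged on both sides; it must stay below 0.5 so that
    A_i and A_{i+2} remain disjoint.
    """
    if not 0 < overlap < 0.5:
        raise ValueError("overlap must lie strictly between 0 and 0.5")
    width = (b - a) / n
    pad = overlap * width
    return [(a + i * width - pad, a + (i + 1) * width + pad) for i in range(n)]

def intersects(u, v):
    """Two open intervals intersect iff each starts before the other ends."""
    return max(u[0], v[0]) < min(u[1], v[1])

cover = chain_cover(0.0, 1.0, 4)   # a 4-chain cover of [0, 1]
```

With `n = 1` the function returns a single open interval containing $[a, b]$, which matches the convention for a 1-chain cover.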
5 The Design of the Algorithm
In this section we discuss the computational details of the parallel Mapper algorithm that we already explained in the previous section from the topological perspective. Before we give our algorithm we recall quickly the reference sequential version.
5.1 The Sequential Mapper Algorithm
The serial Mapper algorithm can be obtained by a straightforward change of terminology from the topological Mapper introduced in Sect. 3. To this end, the topological space $X$ is replaced by the data under investigation. The lens, or filter, $f$ is chosen to reflect a certain property of the data. Finally, the notion of path-connectedness is replaced by an appropriate notion of clustering. This is summarized in Algorithm 1. Note that we will refer to the Mapper graph obtained using Algorithm 1 as the sequential Mapper.
Algorithm 1: Sequential Mapper [41]
Input: A dataset $X$ with a notion of metric between the data points; a scalar function $f : X \to \mathbb{R}^n$; a finite cover $\mathcal{U} = \{U_1, ..., U_k\}$ of $f(X)$;
Output: A graph that represents $N_1(f^*(\mathcal{U}))$.
1 For each set $X_i := f^{-1}(U_i)$, its clusters $X_{ij} \subset X_i$ are computed using the chosen clustering algorithm;
2 Each cluster is considered as a vertex in the Mapper graph. Moreover, we insert an edge between two nodes $X_{ij}$ and $X_{kl}$ whenever $X_{ij} \cap X_{kl} \neq \emptyset$;
5.2 The Main Algorithm
We now give the details of the parallel Mapper algorithm. To guarantee that the output of the parallel Mapper is identical to that of the sequential Mapper we need to perform some processing on the cover that induces the final parallel Mapper output. In parallel Mapper, we consider an N -chain cover of open intervals A1 , · · · , AN of the interval [a, b] along with the their covers U1 , ..., UN . The details of the cover preprocessing are described in Algorithm 2.
Algorithm 2: Cover Preprocessing
Input: A point cloud $X$; a scalar function $f : X \to [a, b]$; a set of $N$ processors ($P$);
Output: A collection of pairs $\{(A_i, \mathcal{U}_i)\}_{i=1}^{N}$ where $\{A_i\}_{i=1}^{N}$ is an $N$-chain cover of $[a, b]$ and $\mathcal{U}_i$ is a cover of $A_i$.
1 Construct an $N$-chain cover of $[a, b]$. That is, cover $[a, b]$ by $N$ open intervals $A_1, ..., A_N$ such that $A_{i,j} := A_i \cap A_j \neq \emptyset$ when $|i - j| = 1$ and empty otherwise;
2 For each open set $A_i$ construct an open cover $\mathcal{U}_i$. The covers $\{\mathcal{U}_i\}_{i=1}^{N}$ satisfy the following conditions: (1) $A_{i,i+1}$ is an open set in both coverings $\mathcal{U}_i$ and $\mathcal{U}_{i+1}$. In other words, $\mathcal{U}_i \cap \mathcal{U}_{i+1} = \{A_{i,i+1}\}$, and (2) if $U_i \in \mathcal{U}_i$ and $U_{i+1} \in \mathcal{U}_{i+1}$ are such that $U_i \cap U_{i+1} \neq \emptyset$, then $U_i \cap U_{i+1} = A_{i,i+1}$, for each $i = 1, ..., N - 1$;
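One possible way to realize the cover preprocessing is sketched below. It rests on an assumed reading of the conditions, in line with the correctness proof: the union of all $\mathcal{U}_i$ is a single fine chain cover of $[a, b]$, consecutive covers share exactly one interval (playing the role of $A_{i,i+1}$), and no other intervals of $\mathcal{U}_i$ and $\mathcal{U}_{i+1}$ meet. The names `preprocess_cover`, `per_unit` and `overlap` are hypothetical.

```python
def chain_cover(a, b, m, overlap=0.25):
    """m open intervals covering [a, b]; only consecutive intervals overlap."""
    w = (b - a) / m
    pad = overlap * w                       # requires 0 < overlap < 0.5
    return [(a + k * w - pad, a + (k + 1) * w + pad) for k in range(m)]

def preprocess_cover(a, b, n_units, per_unit=4, overlap=0.25):
    """Return the list [(A_i, U_i)] for i = 1..n_units.

    Each U_i is a run of `per_unit` consecutive intervals of one fine chain
    cover; consecutive runs share exactly one interval, A_{i,i+1}.
    At least 3 intervals per unit keep A_i and A_{i+2} disjoint.
    """
    if per_unit < 3:
        raise ValueError("per_unit must be at least 3")
    m = n_units * (per_unit - 1) + 1        # shared intervals are counted once
    fine = chain_cover(a, b, m, overlap)
    pairs = []
    for i in range(n_units):
        start = i * (per_unit - 1)
        u_i = fine[start:start + per_unit]
        a_i = (u_i[0][0], u_i[-1][1])       # A_i is the union of the run U_i
        pairs.append((a_i, u_i))
    return pairs

pairs = preprocess_cover(0.0, 1.0, n_units=3)
```

Because $A_i$ is taken to be the union of its run of intervals, the intersection $A_i \cap A_{i+1}$ is exactly the shared interval, so each pair can be handed to a separate processor.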
Algorithm 3: Parallel Mapper
Input: A point cloud $X$; a scalar function $f : X \to [a, b]$; a set of $N$ processors ($P$); a collection of pairs $\{(A_i, \mathcal{U}_i)\}_{i=1}^{N}$ obtained from the cover preprocessing algorithm;
Output: Parallel Mapper graph.
1 for ($i \leftarrow 1$ to $i = N$) do
2   $P_i \leftarrow (A_i, \mathcal{U}_i)$; // Map each $A_i$ and its cover $\mathcal{U}_i$ to the processor $P_i$.
3   Determine the set of points $X_i \subset X$ that maps to $A_i$ via $f$ and run the sequential Mapper construction concurrently on the covers $(f|_{X_i})^*(\mathcal{U}_i)$ for $i = 1, ..., N$. We obtain $N$ graphs $G_1, ..., G_N$. If $N = 1$, return the graph $G_1$;
4 Let $C^i_{j_1}, ..., C^i_{j_i}$ be the clusters obtained from $f^{-1}(A_{i,i+1})$. These clusters are represented by the vertices $v^i_{j_1}, ..., v^i_{j_i}$ in both $G_i$ and $G_{i+1}$ (each vertex $v^i_k$ corresponds to the cluster $C^i_k$) by the choice of the coverings $\mathcal{U}_i$ and $\mathcal{U}_{i+1}$;
5 Merge the graphs $G_1, ..., G_N$ as follows. By the construction of $A_{i,i+1}$, $\mathcal{U}_i$ and $\mathcal{U}_{i+1}$, the sets $f^*(\mathcal{U}_i)$ and $f^*(\mathcal{U}_{i+1})$ share the clusters $C^i_{j_k}$ in $f^*(A_{i,i+1})$. Hence $C^i_{j_k}$ is represented by a vertex in both graphs $G_i$ and $G_{i+1}$. The merging is done by considering the disjoint union graph $G_1 \sqcup ... \sqcup G_N$ and then taking the quotient of this graph that identifies the corresponding vertices in $G_i$ and $G_{i+1}$ for $1 \leq i \leq N - 1$.
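The interplay between Algorithm 1 and Algorithm 3 can be sketched on a toy point cloud. The code below is not the authors' implementation: it assumes a simple single-linkage clustering with a distance threshold `eps` as a stand-in for DBSCAN, a hand-built cover, and Python threads as the "processors" (threads keep the sketch short; they do not give real speedups for CPU-bound work). All function names are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def clusters(ids, pts, eps):
    """Single-linkage clusters: components of the graph joining points closer than eps."""
    ids, seen, out = list(ids), set(), []
    for s in ids:
        if s in seen:
            continue
        comp, stack = set(), [s]
        seen.add(s)
        while stack:
            p = stack.pop()
            comp.add(p)
            for q in ids:
                if q not in seen and math.dist(pts[p], pts[q]) < eps:
                    seen.add(q)
                    stack.append(q)
        out.append(frozenset(comp))
    return out

def sequential_mapper(ids, pts, f, cover, eps):
    """Algorithm 1: one node per cluster of f^{-1}(U), one edge per overlapping pair."""
    nodes = {(u, c) for u in cover
             for c in clusters([p for p in ids if u[0] < f(pts[p]) < u[1]], pts, eps)}
    edges = {frozenset((m, n)) for m in nodes for n in nodes
             if m != n and m[1] & n[1]}
    return nodes, edges

def parallel_mapper(pts, f, pairs, eps):
    """Algorithm 3: run Algorithm 1 per unit, then merge the resulting graphs."""
    def run_unit(pair):
        (lo, hi), u_i = pair
        x_i = [p for p in range(len(pts)) if lo < f(pts[p]) < hi]
        return sequential_mapper(x_i, pts, f, u_i, eps)
    with ThreadPoolExecutor() as pool:
        graphs = list(pool.map(run_unit, pairs))
    # Nodes are keyed by (interval, cluster), so duplicates coming from a shared
    # interval A_{i,i+1} have equal keys; taking set unions therefore acts as the
    # disjoint union followed by the quotient that identifies them.
    nodes = set().union(*(g[0] for g in graphs))
    edges = set().union(*(g[1] for g in graphs))
    return nodes, edges

# Toy data: 24 points on the unit circle, lens f = x-coordinate.
pts = [(math.cos(2 * math.pi * k / 24), math.sin(2 * math.pi * k / 24)) for k in range(24)]
f = lambda p: p[0]
cover = [(-1.1 + 0.3 * k, -1.1 + 0.3 * k + 0.4) for k in range(7)]   # chain cover of [-1.1, 1.1]
u1, u2 = cover[:4], cover[3:]                                        # the two covers share cover[3]
pairs = [((u1[0][0], u1[-1][1]), u1), ((u2[0][0], u2[-1][1]), u2)]
eps = 0.3

seq = sequential_mapper(range(len(pts)), pts, f, cover, eps)
par = parallel_mapper(pts, f, pairs, eps)
```

In this sketch `par == seq` holds, which is the statement of Proposition 1 for this particular input. Keying each node by its interval and point set is a design choice that turns the merge step into a plain set union.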
After preprocessing the cover and obtaining the collection $\{(A_i, \mathcal{U}_i)\}_{i=1}^{N}$, every pair $(A_i, \mathcal{U}_i)$ is mapped to a specific processor $P_i$, which performs the calculations that produce a subgraph $G_i$. At the end, we merge the subgraphs into one graph $G$. The details of the algorithm are presented in Algorithm 3.
5.3 Correctness of the Algorithm
Here we give a detailed proof of the correctness of parallel Mapper that follows the steps of the algorithm.
Proposition 1. The parallel Mapper algorithm returns a graph identical to the sequential Mapper.
Proof. We prove by induction that the parallel Mapper performs the computations on $X$ and correctly produces a graph $G$ that is identical to the graph obtained by the sequential Mapper algorithm. Denote by $N$ the number of units in the initial partition of the interval $[a, b]$, which is also the number of processing units. If $N = 1$, then the parallel Mapper works exactly like the sequential Mapper. In this case $f^{-1}(A_1) = X$ and the single cover $\mathcal{U}_1$ is used to produce the final graph, which Algorithm 3 returns at step (3).
Now assume the hypothesis is true on $k$ units; we show that it holds on $k + 1$ units. In steps (1) and (2) of the cover preprocessing (Algorithm 2), a $(k + 1)$-chain cover for $[a, b]$ consisting of the open sets $A_1, ..., A_k, A_{k+1}$ is constructed. Denote by $\mathcal{U}_i$ the cover of $A_i$ for $1 \leq i \leq k + 1$. We can run Algorithm 3 on the collection $\{(A_i, \mathcal{U}_i)\}_{i=1}^{k}$ and produce sequential Mapper graphs $G_i$, $1 \leq i \leq k$, in step (3). By the induction hypothesis, Algorithm 3 correctly produces a graph $G'$ obtained by merging the sequential Mapper graphs $G_1, ..., G_k$. In other words, the graph $G'$ obtained from Algorithm 3 is identical to the graph obtained by running the sequential Mapper construction on the cover $\cup_{i=1}^{k} \mathcal{U}_i$. Now we show that combining $G'$ and $G_{k+1}$ using our algorithm produces a graph $G$ that is identical to the one obtained by running the sequential Mapper on the covering $\cup_{i=1}^{k+1} \mathcal{U}_i$. Let $\mathcal{U}'$ be the union $\cup_{i=1}^{k} \mathcal{U}_i$ and denote by $A'$ the union $\cup_{i=1}^{k} A_i$. By the construction of the covers $\{\mathcal{U}_i\}_{i=1}^{k+1}$ in step (2), $\mathcal{U}'$ covers $A'$. Moreover, the covers $\mathcal{U}'$ and $\mathcal{U}_{k+1}$ only share the open set $A' \cap A_{k+1}$. This means that there are no intersections between the open sets of the cover $\mathcal{U}'$ and the open sets of the cover $\mathcal{U}_{k+1}$ except for $A' \cap A_{k+1}$. Since there is no intersection between the open sets of $\mathcal{U}'$ and $\mathcal{U}_{k+1}$, no edges are created between the nodes induced from them, and hence the computation of edges done on the first $k$ processors is independent of the computation of edges done on the $(k + 1)$-th processor. Now we compare the node sets of the graphs $G'$, $G_{k+1}$ and the graph $G$. Recall that each node in a sequential Mapper is obtained from a connected component of an inverse image of an open set in the cover that defines the Mapper construction. Since the covers $\mathcal{U}'$ and $\mathcal{U}_{k+1}$ share the open set $A' \cap A_{k+1}$, each connected component of $f^{-1}(A' \cap A_{k+1})$ corresponds to a node that exists in both graphs $G'$ and $G_{k+1}$. This means that each connected component of $f^{-1}(A' \cap A_{k+1})$ is processed twice: once on the first $k$ processors and once on the $(k + 1)$-th processor. Each such component corresponds to a node in both $G'$ and $G_{k+1}$. In step (5) the algorithm checks the graphs $G'$ and $G_{k+1}$ for node duplication and merges them according to their correspondence to produce the graph $G$.
6 Experimentation
In this section, we present practical results obtained using a Python implementation. We ran our experiments on a Dell OptiPlex 7010 machine with a 4-core Intel i7-3770 CPU @ 3.40 GHz and 24 GiB of system memory. The parallel Mapper algorithm was tested on several models from a publicly available dataset [43], and its run time was compared with that of the sequential Mapper. The sizes of the point cloud data are shown in Table 1, where the size of a dataset is the number of points in the point cloud. The sequential Mapper algorithm relies on three inputs: the data $X$, the scalar function $f : X \to [a, b]$ and the choice of cover $\mathcal{U}$ of $[a, b]$. The existing publicly available Mapper implementations, see for instance [31], do not provide the level of control that we require for the cover choice, and so we relied on our own Mapper implementation. The clustering algorithm that we used to specify the Mapper nodes is a modified version of DBSCAN [5].
Table 1. The number of points for each dataset used in our tests.
Data      Size (points)
Camel     21887
Cat       7207
Elephant  42321
Horse     8431
Face      29299
Head      15941
Using parallel Mapper on the data given in Table 1, we obtained a speedup of up to about 4 times compared with the sequential Mapper. Figure 6 shows the speedup results of parallel Mapper obtained in our experiments. The x-axis represents the number of processes, while the y-axis shows the speedup. It is clear from the figure that the curves increase monotonically as we increase the number of processes. Indeed, at the beginning the speedup increases significantly as we increase the number of processes. However, beyond about 10 processes the speedup shows no significant improvement, even when the number of processes is increased to 30.
6.1 Performance of the Algorithm
To verify our experimental results in Fig. 6, we use a well-known theoretical formula, Amdahl's law, to calculate the upper bound on the speedup ratio that comes from parallelism and the improvement percentage [1]. Amdahl's law is formulated as follows:
$$S = \frac{1}{(1 - \mathit{part}) + \mathit{part}/N},$$
where $S$ is the theoretical speedup ratio, $\mathit{part}$ is the proportion of the system or program that can be made parallel, $1 - \mathit{part}$ is the proportion that remains sequential, and $N$ is the number of processes. Generally, there are some systems and applications where parallelism cannot be applied to all data or processes. In this case, part of the data can be processed in parallel, while the rest must be processed sequentially. This may happen because of the nature of the data (e.g. dependencies), the nature of the processes (e.g. heterogeneity) or some other factors. In the parallel Mapper, there are two computational pieces: the clustering piece and the cover construction/merging subgraphs piece. Our algorithm runs the clustering piece completely in parallel, while the cover construction/merging subgraphs piece is processed sequentially. Now, we use Amdahl's law to calculate the theoretical speedup ratios to verify the experimental results. Indeed, considering the algorithm, the clustering piece takes approximately 75% of the execution time, while the cover construction/merging subgraphs piece takes about 25%.
Fig. 6. Speedups obtained by the parallel Mapper as a function of the number of processes that run concurrently.
Table 2. Speedup calculations based on Amdahl's law, using different numbers of processes. It shows the speedup of the parallel Mapper with respect to the sequential Mapper.
N       Speedup (parallel Mapper, part = 0.75)
10      3.07
100     3.88
1000    3.99
10000   3.99
In Table 2, we use Amdahl's law to calculate the theoretical speedup ratios using different numbers of processes. The table shows that the speedup increases as the number of processes increases. Notice that at some point the performance almost stops improving even if we increase the number of processes. Table 2 shows that the speedup for part = 0.75 (the parallel Mapper) reaches 3.07 when N = 10 and goes up to 3.99 when N = 1000. Therefore, the theoretical calculations match the experimental results that appear in Fig. 6.
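The entries of Table 2 can be reproduced directly from Amdahl's law. The short sketch below evaluates the formula for a parallel fraction of 0.75; the function name `amdahl_speedup` is a hypothetical choice.

```python
def amdahl_speedup(part, n):
    """Theoretical speedup for a parallel fraction `part` on `n` processes."""
    return 1.0 / ((1.0 - part) + part / n)

# Speedups for the process counts used in Table 2 (part = 0.75).
speedups = {n: amdahl_speedup(0.75, n) for n in (10, 100, 1000, 10000)}

# As n grows, the speedup approaches 1 / (1 - part).
limit = 1.0 / (1.0 - 0.75)

for n, s in speedups.items():
    print(f"N = {n:>5}: S = {s:.4f}")
```

The computed values agree with Table 2 to within 0.01, and the limiting value $1/(1 - 0.75) = 4$ explains why adding processes beyond a few dozen brings almost no further gain.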
7 Conclusion and Future Work
In this work, we gave a provably correct algorithm to distribute Mapper on a set of processors and run them in parallel. Our algorithm relies on a divide-and-conquer strategy for the codomain cover, which gets pulled back to the domain cover. This work has several potential directions that we have not discussed here. For instance, the recursive nature of the main algorithm was implied throughout the paper but never discussed explicitly. In addition, the algorithm can be used to obtain a multi-resolution Mapper construction. In other words, using this algorithm we can increase the resolution of Mapper for certain subsets of the data and decrease it for others. This is potentially useful for interactive Mapper applications.
References 1. Amdahl, G.M.: Validity of the single processor approach to achieving large scale computing capabilities. In: Proceedings of the 18–20 April 1967, Spring Joint Computer Conference, pp. 483–485. ACM (1967) 2. Bauer, U., Kerber, M., Reininghaus, J., Wagner, H.: Phat-persistent homology algorithms toolbox. J. Symb. Comput. 78, 76–90 (2017) 3. Beineke, L.W., Wilson, R.J.: Topics in Algebraic Graph Theory, vol. 102. Cambridge University Press, Cambridge (2004) 4. Boissonnat, J.-D., Dey, T.K., Maria, C.: The compressed annotation matrix: an efficient data structure for computing persistent cohomology. Algorithmica 73(3), 607–619 (2015) 5. Cali´ nski, T., Harabasz, J.: A dendrite method for cluster analysis. Commun. Stat.Theory Methods 3(1), 1–27 (1974) 6. Carlsson, E., Carlsson, G., De Silva, V.: An algebraic topological method for feature identification. Int. J. Comput. Geom. Appl. 16(04), 291–314 (2006) 7. Carlsson, G.: Topology and data. Bull. Am. Math. Soc. 46(2), 255–308 (2009) 8. Carlsson, G., Ishkhanov, T., De Silva, V., Zomorodian, A.: On the local behavior of spaces of natural images. Int. J. Comput. Vision 76(1), 1–12 (2008) 9. Carlsson, G., M´emoli, F.: Persistent clustering and a theorem of j. Kleinberg. arXiv preprint arXiv:0808.2241, 2008 10. Carlsson, G., Zomorodian, A.: The theory of multidimensional persistence. Discrete Comput. Geom. 42(1), 71–93 (2009) 11. Carlsson, G., Zomorodian, A., Collins, A., Guibas, L.J.: Persistence barcodes for shapes. Int. J. Shape Model. 11(02), 149–187 (2005) 12. Carri`ere, M., Oudot, S.: Structure and stability of the 1-dimensional mapper. arXiv preprint arXiv:1511.05823 (2015) 13. Chen, C., Kerber, M.: Persistent homology computation with a twist. In: Proceedings 27th European Workshop on Computational Geometry, vol. 11 (2011) 14. Collins, A., Zomorodian, A., Carlsson, G., Guibas, L.J.: A barcode shape descriptor for curve point cloud data. Comput. Graph. 28(6), 881–894 (2004) 15. 
Dey, T.K., M´emoli, F., Wang, Y.: Multiscale mapper: topological summarization via codomain covers. In: Proceedings of the Twenty-Seventh Annual ACMSIAM Symposium on Discrete Algorithms, pp. 997–1013. Society for Industrial and Applied Mathematics (2016) 16. Dey, T.K., Memoli, F., Wang, Y.: Topological analysis of nerves, reeb spaces, mappers, and multiscale mappers. arXiv preprint arXiv:1703.07387 (2017) 17. Edelsbrunner, H., Letscher, D., Zomorodian, A.: Topological persistence and simplification. In: Proceedings of 41st Annual Symposium on Foundations of Computer Science, pp. 454–463. IEEE (2000)
730
M. Hajij et al.
18. Emmett, K., Schweinhart, B., Rabadan, R.: Multiscale topology of chromatin folding. In: Proceedings of the 9th EAI International Conference on Bio-inspired Information and Communications Technologies (formerly BIONETICS), pp. 177–180. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering) (2016) 19. Ghrist, R.: Barcodes: the persistent topology of data. Bull. Am. Math. Soc. 45(1), 61–75 (2008) 20. Gidea, M.: Topology data analysis of critical transitions in financial networks (2017) 21. Gueunet, C., Fortin, P., Jomier, J., Tierny, J.: Task-based augmented merge trees with fibonacci heaps. In: IEEE Symposium on Large Data Analysis and Visualization (2017) 22. G¨ unther, D., Reininghaus, J., Wagner, H., Hotz, I.: Efficient computation of 3D morse-smale complexes and persistent homology using discrete morse theory. Vis. Comput. 28(10), 959–969 (2012) 23. Gyulassy, A., Pascucci, V., Peterka, T., Ross, R.: The parallel computation of morse-smale complexes. In: 2012 IEEE 26th International Parallel & Distributed Processing Symposium (IPDPS), pp. 484–495. IEEE (2012) 24. Hajij, M., Rosen, P.: An efficient data retrieval parallel reeb graph algorithm. arXiv preprint arXiv:1810.08310 (2018) 25. Hiraoka, Y., Nakamura, T., Hirata, A., Escolar, E.G., Matsue, K., Nishiura, Y.: Hierarchical structures of amorphous solids characterized by persistent homology. Proc. Natl. Acad. Sci. 113(26), 7035–7040 (2016) 26. Lewis, R.H., Zomorodian, A.: Multicore homology via mayer vietoris. arXiv preprint arXiv:1407.2275 (2014) 27. Lipsky, D., Skraba, P., Vejdemo-Johansson, M.: A spectral sequence for parallelized persistence. arXiv preprint arXiv:1112.1245 (2011) 28. Lum, P.Y., Singh, G., Lehman, A., Ishkanov, T., Vejdemo-Johansson, M., Alagappan, M., Carlsson, J., Carlsson, G.: Extracting insights from the shape of complex data using topology. Sci. Rep. 3, 1236 (2013) 29. Morozov, D., Weber, G.: Distributed merge trees. 
In: ACM SIGPLAN Notices, vol. 48, pp. 93–102. ACM (2013) 30. Morozov, D., Weber, G.H.: Distributed contour trees (2012) 31. M¨ ullner, D., Babu, A.: Python mapper: an open-source toolchain for data exploration, analysis, and visualization (2013). http://math.stanford.edu/muellner/ mapper 32. Munch, E., Wang, B.: Convergence between categorical representations of reeb space and mapper. arXiv preprint arXiv:1512.04108 (2015) 33. Munkres, J.R.: Elements of Algebraic Topology, vol. 2. Addison-Wesley, Menlo Park (1984) 34. Nicolau, M., Levine, A.J., Carlsson, G.: Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival. Proc. Natl. Acad. Sci. 108(17), 7265–7270 (2011) 35. Otter, N., Porter, M.A., Tillmann, U., Grindrod, P., Harrington, H.A.: A roadmap for the computation of persistent homology. EPJ Data Sci. 6(1), 17 (2017) 36. Pascucci, V., Cole-McLaughlin, K.: Parallel computation of the topology of level sets. Algorithmica 38(1), 249–268 (2004) 37. Robins, V.: Towards computing homology from finite approximations. Topol. Proc. 24, 503–532 (1999) 38. Robles, A., Hajij, M., Rosen, P.: The shape of an image: a study of mapper on images. In: VISAPP 2018 (2018, to appear)
Parallel Mapper
731
39. Rosen, P., Tu, J., Piegl, L.: A hybrid solution to calculating augmented join trees of 2D scalar fields in parallel. In: CAD Conference and Exhibition (2017, accepted) 40. Shivashankar, N., Senthilnathan, M., Natarajan, V.: Parallel computation of 2D Morse-Smale complexes. IEEE Trans. Vis. Comput. Graph. 18(10), 1757–1770 (2012) 41. Singh, G., M´emoli, F., Carlsson, G.E.: Topological methods for the analysis of high dimensional data sets and 3D object recognition. In: SPBG, pp. 91–100 (2007) 42. Sn´ aˇsel, V., Nowakov´ a, J., Xhafa, F., Barolli, L.: Geometrical and topological approaches to big data. Future Gener. Comput. Syst. 67, 286–296 (2017) 43. Sumner, R.W., Popovi´c, J.: Deformation transfer for triangle meshes. ACM Trans. Graph. (TOG) 23(3), 399–405 (2004)
Dimensional Analysis of Dataflow Programming
William W. Wadge and Abdulmonem I. Shennat
Department of Computer Science, Faculty of Engineering, University of Victoria, Victoria, BC, Canada
{wwadge,ashennat}@uvic.ca
Abstract. In this paper, we present an algorithm for the Dimensional Analysis (DA) of a two-dimensional dialect of the dataflow language Lucid, one in which the dimensions are 'space' as well as 'time'. DA is indispensable for an efficient implementation of multidimensional Lucid. A Lucid program is a set of equations defining a family of multidimensional datasets, each dataset being a collection of data points indexed by coordinates in a number of dimensions. Every variable in a Lucid program denotes one such dataset, and the variables are defined in terms of input and transformations applied to other variables. In general, not every dimension is relevant in every dataset. It is very important not to include irrelevant dimensions, because otherwise the same data is duplicated for different values of the irrelevant dimension. In most multidimensional systems it is the administrator's responsibility to exclude irrelevant dimensions and to keep track of changes in dimensionality that result from transformations. In other words, DA is performed manually. In Lucid, however, we have an alternative, namely, automated DA. Static program analysis allows us to calculate or estimate the dimensionality of program variables. This is the goal of our research. The problem is far from straightforward because Lucid programs can allow many potential dimensions, the programmer can declare local temporary dimensions, and the transformations can have complicated and even recursive definitions. Our software will be tested and incorporated in the PyLucid (Python-Based) interpreter.

Keywords: Dimensional analysis · Irrelevant dimension · PyLucid (Python-Based) interpreter
© Springer Nature Switzerland AG 2021. K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 732–739, 2021. https://doi.org/10.1007/978-3-030-63089-8_48

1 Introduction
A multidimensional dataset is one in which the individual values are indexed by two or more dimensions or parameters. For example, the measurement of an hour's rainfall can depend on the location (two coordinates), the date (three more dimensions: year, month and day) and the time (a sixth dimension). Multidimensional data is very common in healthcare. There are numerous multidimensional database (MDB) systems as well as Online Analytical Processing (OLAP) systems, which are also based on a multidimensional model [1, 8].

It can happen that some dimensions are not actually needed to specify values in a dataset. For instance, suppose we have a dataset of blood test results. The "patient" dimension is obviously relevant because different patients have different measurements, but the "physician" dimension (usually) is not, because these are standard tests administered according to standard protocols. It is important to know which dimensions are irrelevant because retaining them causes duplicate entries and is therefore very inefficient. Usually it is the database manager who is responsible for identifying irrelevant dimensions and removing them [9, 10]. The aim of our research is to automate this process, at least for datasets produced by Lucid programs [2].

To illustrate the phenomenon of irrelevance, consider a small dataset of blood sugar measurements, of the kind that arises in Health 3.0, the health-related extension of the concept whereby the users' interface with the data and information available on the web is personalized to optimize their experience. In this simple example there are three parameters that determine a measurement: the patient's name, the physician's name, and the particular finger chosen for the blood test. Figure 1 illustrates these measurements, which take one of three possible values. The top layer (yellow) represents high blood sugar, the middle layer (orange) is normal, and the bottom layer (red) shows low blood sugar. The dimensions are represented by three coordinates: physician, patient and the finger chosen. In general, a value can be obtained once the coordinates of all relevant dimensions are known. We do not need to know the coordinates of irrelevant dimensions. The patient dimension is the only relevant dimension in this dataset. Every multidimensional dataset has relevant and irrelevant dimensions, and in this research we work on analyzing programs to determine the relevant dimensions [6].
Fig. 1. Blood-sugar measurements
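The notion of an irrelevant dimension can be made concrete with a small check on an explicit dataset. The sketch below is only illustrative: the patient, physician and finger names and the readings are hypothetical values invented for the example, and `relevant_dimensions` is our own helper, not part of PyLucid. It marks a dimension as relevant when changing that coordinate alone can change the value.

```python
from itertools import product

# Hypothetical coordinates and readings, for illustration only.
patients = ["Ann", "Bob", "Cy"]
physicians = ["Dr. X", "Dr. Y"]
fingers = ["index", "ring"]
level_of = {"Ann": "high", "Bob": "normal", "Cy": "low"}

# The measurement depends on the patient only.
data = {(p, d, f): level_of[p] for p, d, f in product(patients, physicians, fingers)}


def relevant_dimensions(dataset, names):
    """Return the names of the dimensions that influence the stored value."""
    relevant = set()
    for i, name in enumerate(names):
        # Group points that agree on every coordinate except coordinate i.
        groups = {}
        for coords, value in dataset.items():
            key = coords[:i] + coords[i + 1:]
            groups.setdefault(key, set()).add(value)
        # If some group holds more than one value, coordinate i matters.
        if any(len(values) > 1 for values in groups.values()):
            relevant.add(name)
    return relevant


print(relevant_dimensions(data, ("patient", "physician", "finger")))
```

This brute-force check inspects the data itself. The analysis developed in this paper instead works statically on the program that defines the data.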
Our software will be tested and incorporated in the PyLucid interpreter currently being developed at UVIC by W. Wadge and Monem Shennat. We will proceed incrementally, solving increasingly difficult instances of DA corresponding to increasingly sophisticated language features. In particular, we have already solved the case of two separate dimensions (space and time), and in this paper we report on this solution.
2 Background
2.1 Dataflow Programming Language
The dataflow programming language Lucid appeared in 1976 [7]. It is a functional language in which every data object is a stream, that is, a sequence of values. All Lucid operations map streams into streams. A program in Lucid is therefore simply an expression together with definitions of the transformations and data referred to in the expression. Lucid can be considered a dataflow language in which the variables (the streams) name the sequences of data values passing between the actors, which correspond to the functions and operators of the language. The output of the program is simply the data denoted by the program as an expression, i.e. the value of the program [9].
2.2 Statements in Lucid
Lucid programs do have statements, but these are simply equations. They have more in common with high school algebra than with C or Java. The programmer in an imperative language is concerned primarily with the instructions a machine is required to perform rather than with the data that the machine is required to produce. Equations in Lucid are different. Some equations look conventional enough, such as
x = a + b;
whereas others, such as
d = x - next x;
or
i = 1 fby i + 1
use unconventional "temporal" operators not found in conventional languages. The definition is the only kind of statement in Lucid; there are therefore no read or write statements, and no control statements of any kind. Programmers can define their own functions in Lucid, and these are evaluated lazily. The aim is to extend the success of the dataflow methodology by freeing it from its dependence on imperative languages such as C or Java, or on the UNIX shell language itself, which also allows commands [4, 11].
2.3 Lucid Programs
A simple Lucid program consists of definitions of variables representing streams. These definitions may be recursive; for example, the series fib of Fibonacci numbers is:
fib = 1 fby (1 fby (fib + next fib))
Here [3], next takes a stream and discards the first element, and fby is written as an infix operator, taking two streams and producing a resulting stream that consists of the first element of the first stream followed by the whole second stream. It is easy to see that the first two elements of fib are 1, and that element n + 2 is equal to the sum of elements n and n + 1. Moreover, in Lucid the programmer can think of some variables as denoting stored values that are repeatedly modified in the course of the computation. Lucid has one great advantage over other dataflow languages: the programmer can also think in terms of other operational concepts and can understand some statements as specifying an iterative algorithm [11].
2.4 PyLucid
Lucid can best be understood as functional programming with an added time dimension. Various 'features' can be realized in terms of expressions and equations. PyLucid also has arrays, realized by introducing a space parameter s that works like the time parameter t. PyLucid objects are thus functions of t and s, not just t, and can be viewed as time-varying infinite arrays. The value of a variable depends on the natural-number index t and on a second natural-number index s, so V(s, t) denotes the value of variable V at space point s and time point t. Spatial analogs of first, next and fby are added, called init ("initial"), succ ("successor") and sby ("succeeded by") [12]. These operators act only on the space dimension and ignore the time dimension, for example:
succ(V)(s, t) = V(s + 1, t)
Variables that depend only on the space dimension can be thought of as infinite vectors; the equation below defines the vector of all natural numbers:
N = 0 sby N + 1
Variables that depend on both parameters can be pictured in at least three different ways: as streams of (infinite) vectors, as vectors of (infinite) streams, or as two-dimensional matrices [5, 10].
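The view of variables as functions V(s, t) can be sketched directly in Python. This is a toy model and not the PyLucid implementation. The definitions of first, next and fby follow the descriptions in the text, while the definitions of init and sby are assumed by analogy with their temporal counterparts. The helper names (`const`, `plus`, `next_`) are ours.

```python
# Toy model: a variable is a function V(s, t) of a space index s and a time index t.

def const(c):
    return lambda s, t: c

def plus(x, y):
    return lambda s, t: x(s, t) + y(s, t)

# Temporal operators act on t and leave s alone.
def first(x):
    return lambda s, t: x(s, 0)

def next_(x):  # trailing underscore avoids shadowing Python's built-in next
    return lambda s, t: x(s, t + 1)

def fby(x, y):
    return lambda s, t: x(s, 0) if t == 0 else y(s, t - 1)

# Spatial analogs act on s and leave t alone (init and sby assumed by analogy).
def init(x):
    return lambda s, t: x(0, t)

def succ(x):
    return lambda s, t: x(s + 1, t)

def sby(x, y):
    return lambda s, t: x(0, t) if s == 0 else y(s - 1, t)

# fib = 1 fby (1 fby (fib + next fib)); the recursion is resolved lazily at call time.
def fib(s, t):
    return fby(const(1), fby(const(1), plus(fib, next_(fib))))(s, t)

# N = 0 sby N + 1
def N(s, t):
    return sby(const(0), plus(N, const(1)))(s, t)

print([fib(0, t) for t in range(6)])  # fib varies in time only
print([N(s, 0) for s in range(4)])    # N varies in space only
```

The model recomputes values on every call, so it is only suitable for small indices. It is meant to show that fib ignores s and N ignores t, which is exactly the kind of fact that dimensional analysis aims to discover without running the program.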
3 Dimensional Analysis Algorithms
PyLucid has two separate dimensions [1], and in general the intension denoted by a variable can vary in both of them. However, the denotations of some variables vary in only one of these dimensions, or in neither. For example, if a variable is defined (directly or indirectly) in terms of temporal operators only, then it (i.e. its denotation) varies at most in the temporal dimension [5].
We proceed incrementally through the following stages, from the simplest to the more difficult. The first stage identifies the variables that are constant in time, such as Z where Z = first Y. The second stage handles space (s) as well as time (t) dependencies, so there are four possibilities: {s, t} if the variable depends on both dimensions s and t; {s} if it depends on s only; {t} if it depends on t only; and {} for constants. These DA programs can be compiled by PyLucid. The two stages completed so far are explained below.
3.1 Algorithm Steps of One Dimension "Time"
If we only have the time dimension, the DA algorithm is especially simple. We iteratively accumulate the set of variables that we estimate may vary in time. Variables that are never added to this set are guaranteed to be constant in time. We begin with the empty set, then repeatedly add variables that we deduce may vary in time. When no new variables are added to the set, the algorithm terminates. As indicated, we are dealing with 'atomic' equations, i.e. equations with either a constant or an expression with a single operation on the right-hand side. The form of the equation determines whether or not the variable defined is added to the set being accumulated.
• If the variable is defined as a constant, for example V = 5, then V is not added.
• If the variable is defined in terms of a data operation, e.g. V = X + Y, then V is added if either X or Y is already in the set.
• If the equation is of the form V = first X, then V is not added.
• If the equation is of the form V = next X, then V is added if X is already in the set.
• If the equation is of the form V = X fby Y, then V is added.
These rules complete the time-only algorithm.
3.2 Algorithm Steps of Two Dimensions "Time and Space"
The algorithm for time and space is slightly more complex. Instead of accumulating a single set we accumulate a table that assigns to each variable a subset of {s, t}. The subset is the estimated dimensionality of the variable, i.e. the set of dimensions relevant to the value of the variable.
We begin by assigning the empty set {} to each variable. This means that for each variable we estimate that it is constant in time and space. Then we repeatedly update our estimates until there are no further changes.
• If the variable is defined as a constant, the estimate remains {}.
• If the variable is defined in terms of a data operation, e.g. V = X + Y, then V is assigned the union of the sets assigned to X and Y.
• If the equation is of the form V = first X, then V is assigned the set for X minus the dimension t (if present).
• If the equation is of the form V = next X, then V is assigned the set for X.
• If the equation is of the form V = X fby Y, then V is assigned the union of the sets for X and Y plus the dimension t.
• If the equation is of the form V = init X, then V is assigned the set for X minus the dimension s (if present).
• If the equation is of the form V = succ X, then V is assigned the set for X.
• If the equation is of the form V = X sby Y, then V is assigned the union of the sets for X and Y plus the dimension s.
Again the algorithm terminates when no further changes are made in the table. Once this happens, we can guarantee that the sets assigned are upper bounds on the dimensionalities of the variables. In particular, if V ends up being assigned {s} then it is constant in time; if it is assigned {t} then it is constant in space; and if it is assigned {}, it is constant in space and time, i.e. an absolute constant.
Suppose that we have the equations:
I = 6
J = A fby B
C = first J
A = C + J
D = next A
Then the five stages in accumulating the set of time-sensitive variables are:
{}, {J}, {J, A}, {J, A, D}, {J, A, D}
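The time-only rules of Sect. 3.1 and the five stages above can be reproduced with a short Python sketch. The encoding of equations as tuples and the function name `time_sensitive_stages` are our own assumptions for illustration, not the PyLucid implementation. Each stage is computed from the set accumulated at the previous stage.

```python
def time_sensitive_stages(equations):
    """Return the successive estimates of the set of time-varying variables.

    equations maps each variable to a tuple (op, operand, ...), where op is one of
    "const", "data", "first", "next" or "fby" (an assumed encoding of atomic equations).
    """
    stages = [set()]
    while True:
        previous = stages[-1]
        current = set(previous)
        for var, (op, *operands) in equations.items():
            if op == "fby":
                current.add(var)  # always added
            elif op in ("data", "next"):
                # added if some operand is already known to vary in time
                if any(x in previous for x in operands):
                    current.add(var)
            # "const" and "first" never add the variable
        stages.append(current)
        if current == previous:  # no new variables: terminate
            return stages


equations = {
    "I": ("const",),
    "J": ("fby", "A", "B"),
    "C": ("first", "J"),
    "A": ("data", "C", "J"),
    "D": ("next", "A"),
}

for stage in time_sensitive_stages(equations):
    print(sorted(stage))
```

The variables I and C never enter the set, so they are guaranteed to be constant in time.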
Now suppose that the equations are:
P = 8
Q = X sby T
T = P fby Q
V = T + Q
W = init V
X = first Q
Then the four stages in constructing the table of dimensionalities are illustrated in Table 1.
Table 1. Variables vs relevant dimensions

Variable   Set of relevant dimensions
           Stage 1   Stage 2   Stage 3   Stage 4
P          {}        {}        {}        {}
Q          {}        {s}       {s, t}    {s, t}
T          {}        {t}       {s, t}    {s, t}
V          {}        {}        {s, t}    {s, t}
W          {}        {}        {t}       {t}
X          {}        {}        {s}       {s}
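The two-dimensional rules of Sect. 3.2 can be sketched in Python in the same way. The tuple encoding of equations and the name `dimensionality` are again our own assumptions. Variables without a defining equation are treated here as constants, which is a simplification. Intermediate estimates depend on the order in which updates are applied, but the final table does not, because every rule only ever enlarges the sets it is given.

```python
def dimensionality(equations):
    """Estimate, for each variable, the subset of {"s", "t"} it may depend on.

    equations maps each variable to (op, operand, ...), with op one of "const",
    "data", "first", "next", "fby", "init", "succ" or "sby" (an assumed encoding).
    """
    dims = {var: frozenset() for var in equations}  # start from "constant everywhere"
    while True:
        updated = {}
        for var, (op, *operands) in equations.items():
            # Union of the current estimates of the operands
            # (undefined variables are assumed constant).
            union = frozenset().union(*(dims.get(x, frozenset()) for x in operands))
            if op == "const":
                estimate = frozenset()
            elif op in ("data", "next", "succ"):
                estimate = union
            elif op == "first":
                estimate = union - {"t"}
            elif op == "init":
                estimate = union - {"s"}
            elif op == "fby":
                estimate = union | {"t"}
            elif op == "sby":
                estimate = union | {"s"}
            else:
                raise ValueError(f"unknown operator: {op}")
            updated[var] = estimate
        if updated == dims:  # no further changes: the estimates are final
            return dims
        dims = updated


equations_2d = {
    "P": ("const",),
    "Q": ("sby", "X", "T"),
    "T": ("fby", "P", "Q"),
    "V": ("data", "T", "Q"),
    "W": ("init", "V"),
    "X": ("first", "Q"),
}

for var, dimensions in dimensionality(equations_2d).items():
    print(var, sorted(dimensions))
```

For these equations the result agrees with the last column of Table 1: P is an absolute constant, W is constant in space, X is constant in time, and Q, T and V may vary in both dimensions.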
4 Conclusion
Our DA software will make possible the efficient implementation of multidimensional Lucid. In particular, we worked on the first and second stages, namely the algorithm steps for one dimension ("time") and for two dimensions ("time and space"). We have already solved the case of two separate dimensions (space and time), which distinguishes four cases: variables that are constant in both dimensions, variables that are constant in time but not in space, variables that are constant in space but not in time, and variables that vary in both dimensions. Next, we will consider the case of a fixed number of dimensions. Furthermore, tools such as Tableau need to know which dimensions are relevant, and currently this is the user's responsibility. With DA, users can look at data across multiple dimensions and gain a deeper understanding of the challenges of user interfaces for end-user systems such as healthcare systems. Such a focus on new advances in multidimensional data modeling should ultimately improve real-time data analysis, including its efficiency and accuracy, and further improve data analysts' tracking, forecasting, and maintenance of the multidimensional data model.
References
1. Hershberger, J., Shrivastava, N., Suri, S., Toth, C.D.: Adaptive spatial partitioning for multidimensional data streams. Algorithmica 46(1), 97–117 (2006). https://doi.org/10.1007/s00453-006-0070-3
2. Dumont, P., Boulet, P.: Another multidimensional synchronous dataflow: simulating ArrayOL in Ptolemy II. Research Report RR-5516, INRIA, p. 19 (2005)
3. Wadge, W.W., Ashcroft, E.A.: LUCID, the Dataflow Programming Language. Academic Press Professional Inc., San Diego (1985)
4. Stolte, C., Hanrahan, P.: Polaris: a system for query, analysis and visualization of multidimensional relational databases. In: Proceedings of the IEEE Symposium on Information Vizualization 2000, 09–10 October 2000, p. 5 (2000)
5. Chen, Y., Dong, G., Han, J., Wah, B.W., Wang, J.: Multi-dimensional regression analysis of time-series data streams. In: Proceedings of the 28th International Conference on Very Large Data Bases, 20–23 August 2002, Hong Kong, China, pp. 323–334 (2002)
6. Murthy, P.K., Lee, E.A.: Multidimensional synchronous dataflow. IEEE Trans. Signal Process. 50(8), 2064–2079 (2002). https://doi.org/10.1109/tsp.2002.800830
7. Wadge, W.W.: An extensional treatment of dataflow deadlock. Theor. Comput. Sci. 13(1), 3–15 (1981)
8. Najjar, W.A., Lee, E.A., Gao, G.R.: Advances in the dataflow computational model. Parallel Comput. 25(13), 1907–1929 (1999)
9. Ackerman, W.B.: Data flow languages. Computer 15(2), 15–25 (1982)
10. Halbwachs, N., Lagnier, F., Ratel, C.: Programming and verifying real-time systems by means of the synchronous data-flow language LUSTRE. IEEE Trans. Softw. Eng. 18(9), 785–793 (1992)
11. Chudik, J., David, G., Kotov, V.E., Mirenkov, N.N., Ondas, J., Plander, I., Valkovskii, V.A.: Algorithms, software and hardware of parallel computers. In: Miklosko, J., Kotov, V.J. (eds.) Literature Review. Springer (2013)
12. Jagannathan, R., Dodd, C.: GLU programmer's guide. Technical report, SRI International, Menlo Park, California (1996)
EnPower: Haptic Interfaces for Deafblind Individuals to Interact, Communicate, and Entertain
Nimesha Ranasinghe1,2, Pravar Jain2, David Tolley2, Barry Chew2, Ankit Bansal2, Shienny Karwita2, Yen Ching-Chiuan2, and Ellen Yi-Luen Do2,3
1 School of Computing and Information Science, University of Maine, Orono, USA
[email protected]
2 Keio-NUS CUTE Center, National University of Singapore, Singapore, Singapore
{pravar,dtolley}@nus.edu.sg, [email protected], [email protected], [email protected]
3 ATLAS Institute, University of Colorado Boulder, Denver, USA
[email protected], [email protected]

Abstract. For deafblind individuals, the absence of visual and auditory communication channels prevents meaningful interactions with the people and world around them, leading many to suffer from both mental and social issues. Among the remaining sensory channels (e.g., taste, haptic, and olfactory) of deafblind individuals, haptic stimuli are an effective medium that can be digitally employed to enable interactions with the outside world. Thus, this paper discusses the development of a digital communication platform for deafblind individuals called "EnPower" (Enable and Empower individuals with auditory and visual impairments). EnPower enables bi-directional communication between deafblind and non-impaired individuals by using sensory substitution techniques to present information via accessible haptic stimuli. The EnPower platform consists of a physical interface (in wearable and desktop versions) that is paired with a wireless mobile application. Similar to concepts such as Finger Braille, the assistive devices can deliver textual information via tactile stimuli to the deafblind individual, and vice versa. Additionally, the system can translate speech and visual input into tactile stimuli, providing users with greater access to digital information platforms. Using the two devices, a stimuli perception study, as well as a field trial, was conducted to evaluate the effectiveness of our approach (delivering information via tactile sensations on fingertips). Findings not only suggest that participants could understand the information presented via tactile sensations but also reveal several important avenues for future research.

Keywords: Deafblind · Haptic interaction · Spatial interaction · Wearable computing
© Springer Nature Switzerland AG 2021. K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 740–756, 2021. https://doi.org/10.1007/978-3-030-63089-8_49

1 Introduction
Effective communication and interaction with the people and objects around us is essential to our well-being and to achieving an optimal quality of life. Performing these interactions requires us to constantly engage and process information from all of our sensory channels. Although the human body is typically capable of simultaneously sensing and handling information from a range of modalities, including visual, auditory, tactile, olfactory, taste and vestibular, people who suffer from deafblindness are heavily restricted by the absence of visual and auditory information. Due to the frequent difficulties that deafblind individuals experience, and the unavailability of a platform that enables them to clearly express their emotions and thoughts to others, the alienation caused by this condition leads many deafblind individuals to suffer from depression and, in severe cases, even commit suicide [19].

Fig. 1. Main components of EnPower: (A) wearable assistive device with vibrotactile sensations, (B) desktop version with linear actuators to simulate tapping sensations, and (C) the mobile application

Deafblindness¹ is a combined sensory disorder wherein individuals possess varying degrees of both visual and auditory impairment. An estimate from 2010 suggests that there are approximately 358,000 individuals in the U.K. who suffer from some form of deafblindness [16]. These individuals often find many everyday tasks extremely challenging and require support when interacting with people or their surroundings, specifically requiring intensive support in education, navigation and other key activities throughout their lives.

¹ https://senseinternational.org.uk/about-deafblindness/first-global-report-deafblindness

Addressing these issues, sensory substitution is one method that can provide deafblind people with a means of communication. Sensory substitution is a non-invasive technique for circumventing the loss of one sense by feeding its information through another sensory channel to shift the cognitive load between available human senses [1]. For deafblind people, sensory substitution is required to convert inaccessible visual and auditory information into a form where it can be displayed via the individual's haptic, taste and olfactory senses. A commonly utilized form of sensory substitution is Finger Braille, a type of tactile communication where the sender dots Braille code on the fingers of the receiver using the thumb, index, middle and ring fingers of each hand. Finger Braille is predominantly favoured for assisting deafblind people due to the speed and accuracy with which it can be used to communicate. Skilled users of Finger Braille can also express certain emotions through the prosody of their dotting [3]. However, this form of communication requires the presence of a non-disabled interpreter who is also skilled in Finger Braille. Due to the small number of Finger Braille interpreters, accessibility is still a key factor that causes extremely restricted communication and expression for the deafblind.

Addressing this need for assistive technology, we present EnPower: a bi-directional communication and information exchange platform that utilizes haptic stimuli to aid visually and auditorily impaired users. As shown in Fig. 1, the EnPower system consists of two main modules: 1) a hardware device that employs tactile stimuli to provide wearable (based on vibrotactile feedback) and non-wearable (based on linear actuators) interfaces to extend the Finger Braille technique, and 2) a paired mobile application that converts speech, text and visual information into tactile stimuli, and vice versa.

The rest of the paper is organized as follows. In Sect. 2, we present background information on popular communication practices of deafblind individuals and describe several technologies for digital interactions, mainly passive (read-only) interactions. Section 3 presents the system description of the EnPower platform. To evaluate the tactile sensitivity of the fingertips, Sect. 4 details the experimental evaluation along with the main findings. In addition to the experiment on tactile perception, Sect. 5 reports a field trial conducted with deafblind participants. Before we conclude the paper in Sect. 7, we depict and describe several future usage scenarios in Sect. 6.
2 Related Work Traditionally, deafblind individuals utilize several communication methods such as 1) Print-on-palm (tracing letters on the palm), 2) Tactile sign language and finger alphabet, 3) Bulista (braille printed out on tape) and 4) Finger braille [11]. As all of these methods require specialized training and extended learning, deafblind individuals must rely on dedicated interpreters for these types of communication. Alternatively, there are currently several deafblind communication technologies available, such as BrailleNote Apex and Brailliant Braille display by HumanWare Inc.2 , BrailleSense by HIMS Inc.3 , FSTTY, FaceToFace, and Focus communication solutions by Freedom Scientific4 . Unfortunately, these technologies are relatively expensive and only support text-based message delivery for the user. Consequently, existing real-time deafblind communication technologies such as these are either impractical or inaccessible for many users. With regard to developments in assistive tools for deafblind people, the majority of existing projects have focused on creating different types of tactile displays that employ a wide range of actuation methods to provide haptic feedback on the deafblind person’s fingers, hands or wrists. As noted by Velazquez, these actuation methods include the use of servomotors, electromagnetic coils, piezoelectric ceramics, pneumatics, shape 2 3 4
Footnotes: 2. http://www.humanware.com/en-usa/products/ 3. https://www.hims-inc.com/blindness/ 4. https://www.freedomscientific.com/
EnPower
memory alloys (SMAs), electroactive polymers (EAPs), electrorheological (ER) fluids and airborne ultrasound [21]. As Finger Braille is already an established form of communication that is known by many deafblind people, several hand-worn tactile displays have been developed to enable users to send and receive Finger Braille-based messages. In Finger Braille, the fingers of both hands are likened to the keys of a Braille typewriter, which makes the method appropriate for real-time communication. By incorporating Finger Braille, these assistive tools aim to 1) minimize the time it takes for users to learn the system and 2) provide efficient communication with non-deafblind people who do not have strong knowledge of Finger Braille. These systems exist in multiple forms, where accelerometers and actuators can be mounted on the user’s fingertips [6, 20], on rings on the user’s fingers [2], on the fingers’ middle phalanges [5] or on bracelet-like wearables [8]. As an alternative to systems that are mounted on the user’s fingers, some assistive tools for deafblind people, such as the Mobile Lorm Glove [4], cover the user’s entire hand. This glove-based system translates touches on the hand, in the Lorm touch alphabet, into text and vice versa. In contrast, our approach utilizes Braille, which is both simpler and more widely accepted.

As a low-cost and real-time communication approach, EnPower utilizes the method of Finger Braille and is primarily motivated by the aforementioned research works. It comprises a robust two-way Finger Braille interface paired with a software application, enabling real-time communication between individuals (sensory disabled or non-disabled), each using their preferred communication protocol. This sidesteps the burden of mastering the other individual’s preferred communication protocol.
The system aims to serve as a single end-to-end solution for a range of sensory disabled communities, augmenting the quality of communication and providing the ability to express and access information via powerful public platforms such as the Internet and social media networks. With the EnPower platform, we specifically intend to improve on three factors compared to the existing solutions:
– Reduced time needed to learn how to use the device, as it is intuitive and replicates the manual method of Finger Braille.
– Enhanced ergonomics and portability of the devices, particularly the wearable device.
– Ability to express many types of data, including visual as well as spatial representations of surroundings.
3 The EnPower Platform

In this section, we present technical information pertaining to the EnPower platform and how it enables effective communication with and between deafblind, blind, deaf and vocally impaired individuals. As shown in the high-level system diagram of Fig. 5, the platform comprises a tactile feedback interface in two form factors, a wearable and a desktop version, which are linked to a mobile software application. Both devices contain a set of eight input/output points (for the thumb, index, middle and ring finger of each hand). As shown in Fig. 2, the wearable version consists of one module for each hand. Here, Eccentric Rotating Mass (ERM) vibration motors are
N. Ranasinghe et al.
Fig. 2. The wearable Finger Braille device with control module (communication and text-to-braille conversion), vibrotactile actuators (input), and force-sensing resistors (output/typing).
employed to generate vibrotactile events on the user’s fingertips (delivering information to the user), and input commands are detected through Flexi-force sensors (formulating text messages by typing Finger Braille codes on any surface). The desktop version of the device, shown in Fig. 3, utilizes Linear Electro-Mechanical Actuators (LEMA) to output tapping sensations on the user’s fingertips. Correspondingly, push buttons act as an input interface for users to type Finger Braille codes. The relative locations of the linear actuators and push buttons are arranged to enable users to place their fingers comfortably on the interface. Each text character that is received by the device generates a specific system code, which triggers the actuation of a specific tactile pattern in accordance with the Braille protocol. Conversely, Braille-based tactile input from the user is translated back to representative characters, which are received by the partnering mobile application. By employing these configurations, both systems can provide the user with Finger Braille in 6-dot and 8-dot forms, depending on the user’s preference and ability. The main hardware modules of both devices are illustrated in Fig. 4. Both devices are powered by standard 9 V batteries and communicate with the mobile application via Bluetooth™.

An Android smartphone application was developed to incorporate different modes of communication using the devices. In partnership with the devices, the mobile application handles data and performs conversions between different modalities including image-to-speech (visual to auditory), image-to-braille (visual to tactile), image-to-text (visual to textual), speech-to-braille (auditory to tactile), speech-to-text (auditory to textual), braille-to-speech (tactile to auditory), and braille-to-text (tactile to textual).
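The character-to-pattern conversion described above can be illustrated with a minimal Python sketch. It assumes the standard 6-dot Braille cell and lists only a handful of letters; the 6-bit integer used as the "system code" and the function names `char_to_code`, `encode` and `decode` are hypothetical choices for illustration, not the encoding used by the EnPower firmware.

```python
# Standard 6-dot Braille cells for a few letters (dot numbers 1-6).
# Only a subset of the alphabet is listed to keep the sketch short.
BRAILLE_DOTS = {
    "a": {1}, "b": {1, 2}, "c": {1, 4}, "d": {1, 4, 5}, "e": {1, 5},
    "g": {1, 2, 4, 5}, "h": {1, 2, 5}, "j": {2, 4, 5}, "l": {1, 2, 3},
    "o": {1, 3, 5}, "x": {1, 3, 4, 6}, "y": {1, 3, 4, 5, 6},
}


def char_to_code(ch: str) -> int:
    """Pack the dots of one character into a 6-bit code (dot n -> bit n-1).

    The bitmask layout is an assumption made for this sketch.
    """
    return sum(1 << (dot - 1) for dot in BRAILLE_DOTS[ch.lower()])


# Reverse lookup used when the user types a pattern on the device.
CODE_TO_CHAR = {char_to_code(ch): ch for ch in BRAILLE_DOTS}


def encode(text: str) -> list[int]:
    """Text received by the device -> sequence of actuation codes."""
    return [char_to_code(ch) for ch in text]


def decode(codes: list[int]) -> str:
    """Patterns typed by the user -> text sent to the mobile application."""
    return "".join(CODE_TO_CHAR[code] for code in codes)


if __name__ == "__main__":
    codes = encode("hello")
    print(codes, decode(codes))
```

Because every listed letter maps to a distinct set of dots, the same table serves both directions: output (text to tactile pattern) and input (typed pattern to text). An 8-dot variant would only need two more bits per code.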
Fig. 3. The desktop Finger Braille device with control module (communication and text-to-braille conversion), linear actuators (input), and push buttons (output/typing).
The application works in synchronization with the two devices, using a series of commands to establish handshaking for lossless communication. Aside from facilitating direct communication between users, the mobile application also enables direct access to several other information sources, such as social media content and visual information captured by the camera of the mobile device. To provide access to news and social media content, the mobile application lets users access a preset Twitter™ feed, which is displayed to deafblind users through the partnering haptic interface devices. Additionally, the mobile application uses the camera of the mobile device to capture images and translate visual information into both auditory stimuli (for blind users) and tactile (braille) stimuli (for deafblind users). In this functionality, the user is guided by auditory and vibrational feedback to hold the camera in a particular direction as images are automatically captured. These images are then analyzed to generate information about the user’s surroundings. In this manner, both deafblind and visually-impaired users can access information about their surroundings, such as the identity and spatial arrangement of nearby objects. To achieve this, the Clarifai API (footnote 5) is incorporated to analyze image content, while the Android Text-to-Speech and Speech-to-Text engines are used to handle voice input and output.
Footnote 5: https://www.clarifai.com/
Fig. 4. Main hardware modules of the desktop and wearable versions
Fig. 5. Components of the EnPower platform and their interactions
As visually-impaired users need to navigate the mobile application, it was imperative to provide an interaction technique that did not require visual information. To address this, we developed a tap-based navigation method in which interactions depend on the number of taps on the screen rather than on the location of the taps. For example, on each page of the application, users can perform single-, double- and triple-taps as well as long presses to trigger different interactions. Correspondingly, all available interactions are described through speech-based feedback every time the user reaches a new page. If a deafblind user wants to learn the UI navigation, they may request the help instructions via a special key input from the wearable or desktop devices. The respective navigation flow is shown in Fig. 6. By combining the mobile application with the tactile hardware interfaces, the EnPower platform supports effective communication and improved accessibility to other forms of interactive information.
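The tap-based navigation described above can be sketched in Python as follows. This is a minimal illustration rather than the application's actual implementation: the `Touch` record, the function names, and both timing thresholds are assumptions introduced here, and the paper does not report the values it uses.

```python
from dataclasses import dataclass

# Timing thresholds are illustrative assumptions, not values from the paper.
LONG_PRESS_MS = 600     # a single touch held at least this long is a long press
MULTI_TAP_GAP_MS = 350  # taps separated by at most this gap form one gesture


@dataclass
class Touch:
    down_ms: int  # time at which the finger touched the screen
    up_ms: int    # time at which the finger left the screen


def split_gestures(touches: list[Touch]) -> list[list[Touch]]:
    """Group consecutive touches into gestures using the inter-tap gap."""
    gestures: list[list[Touch]] = []
    current: list[Touch] = []
    for touch in touches:
        if current and touch.down_ms - current[-1].up_ms > MULTI_TAP_GAP_MS:
            gestures.append(current)
            current = []
        current.append(touch)
    if current:
        gestures.append(current)
    return gestures


def classify(gesture: list[Touch]) -> str:
    """Name a gesture by its tap count only; screen position is never used."""
    if len(gesture) == 1 and gesture[0].up_ms - gesture[0].down_ms >= LONG_PRESS_MS:
        return "long-press"
    names = {1: "single-tap", 2: "double-tap", 3: "triple-tap"}
    return names.get(len(gesture), "unknown")


if __name__ == "__main__":
    stream = [Touch(0, 80), Touch(200, 280), Touch(1500, 2300)]
    print([classify(g) for g in split_gestures(stream)])
```

Since the classifier never reads touch coordinates, the same gesture works anywhere on the screen, which is the property that makes the scheme usable without visual feedback. Each page would then map the four gesture names to its own actions.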
Fig. 6. Navigation flow of the mobile application. In the speech-to-haptic option, Jane, a deafblind user, has one of the devices and receives voice messages from Tom, a non-disabled person, through the mobile application and the device. The mobile app can be used by the deafblind user through the tapping navigation method, or it may be configured by the non-disabled user (alternatively, the non-disabled user may install their own copy and connect to the deafblind user’s device with the deafblind user’s permission).
4 Experimental Evaluation

As the EnPower platform aims to automate real-time information exchange following the Finger Braille protocol, an experimental study was conducted to evaluate the tactile sensitivity of the fingertips.

4.1 Experimental Setup

Both wearable and desktop versions were evaluated in an isolated and noise-proof room. Laptops were used to play white noise through disposable earphones, while smartphones were used to send actuation signals to the devices. A set of disposable eye-masks was used to blindfold the participants (simulated blindness).
Table 1. Stimuli configurations and corresponding significance

Stimulus No. | Fingers actuated | Condition being evaluated
1            | 1                | Individual finger (Ring) sensation
2            | 2                | Individual finger (Middle) sensation
3            | 3                | Individual finger (Index) sensation + Braille code for ‘a’
4            | 1, 2             | Confusion between Ring & Middle Finger
5            | 1, 3             | Phantom sensation in Middle Finger
6            | 2, 3             | Confusion between Middle & Index Finger + Braille code for ‘b’
7            | 1, 2, 3          | All three fingers of single hand (Ring, Middle & Index Finger) + Braille code for ‘l’
8            | 3, 4             | Confusion between Index fingers across both hands + Braille code for ‘c’
9            | 1, 2, 5, 6       | Confusion between Ring & Middle Finger across both hands
10           | 1, 3, 4, 6       | Phantom sensation in Middle Finger across both hands + Braille code for ‘x’
11           | 2, 3, 4, 5       | Confusion between Middle & Index Finger across both hands + Braille code for ‘g’
12           | 2, 4, 5          | Confusion in case of 3 simultaneous actuation + Braille code for ‘j’
13           | 1, 3, 4, 5, 6    | Confusion in case of 5 simultaneous actuation + Braille code for ‘y’

4.2 Methodology
Thirty non-disabled participants (23 male and 7 female), aged between 18 and 30 (AVG = 24.57, M = 25, SD = 3.85), were recruited for this experiment. The participants were blindfolded and briefed about the two devices being evaluated. To evaluate the desktop version, the participants were asked to rest their hands on the device with the fingers placed in line with the linear actuators. First, they were instructed to expect one or more simultaneous tactile stimuli and to identify the fingers on which tapping sensations were felt. Then, they were asked to report the perceived stimuli combinations by typing them in with the buttons on the input interface. Similarly, to evaluate the wearable version, the participants were guided to wear the finger capsules and asked to identify the fingers on which vibration patterns were felt, and to indicate them by keying in with the Flexi-force sensors on the wearable interface. The administrators helped participants to switch between the devices. The order of evaluating the wearable and desktop versions was randomized and counterbalanced.
Fig. 7. Accuracy in recognizing the correct stimulus with different actuation durations and intervals: (A) single hand using linear actuators, (B) single hand using vibration motors, (C) both hands using linear actuators, and (D) both hands using vibration motors.
Fig. 8. Accuracy in recognizing the correct number of stimuli with different actuation durations and intervals: (A) single actuation using linear actuators, (B) single actuation using vibration motors, (C) conveying Braille characters using linear actuators, and (D) conveying Braille characters using vibration motors.
4.3 Experimental Protocol

The study with each device was split into five categories, each containing a set of 13 stimuli (Table 1) that include permutations using individual fingers, combinations of fingers on a single hand as well as across both hands, and a set of selected braille characters. Parameters such as the duration of actuation and the time interval between actuations were varied for each category to evaluate temporal resolution and the likelihood of confusion arising (pseudo-tactile sensations, also known as phantom sensations) [7]. Based on previous research, 300 ms was chosen as the optimal duration of actuation (it also allows enough time for ERM motors to reach full speed) [2, 5, 9]. Based on the preliminary trials conducted, the actuation durations evaluated were 200 ms, 500 ms, and 800 ms, and the time intervals between actuations were 0 ms, 200 ms, and 400 ms. These parameters were kept consistent throughout each category. Participants were given up to three trials per stimulus to provide their feedback.

4.4 Results

Here, we describe our main findings under five categories.

1. Sensory Evaluation of Individual Fingers: Stimuli with varying actuation durations of 200 ms, 500 ms, and 800 ms show that the fingers recognize different stimuli with high perception accuracy (> 93.1%), irrespective of the type of actuation.
2. Actuation of Multiple Fingers on Single Hand: Here, only the participants’ responses that exactly match the corresponding stimuli presented were counted as correct. The percentage of correct responses for each stimulus with varying actuation durations (200 ms, 500 ms, and 800 ms) and intervals between actuations of the same stimulus (0 ms, 200 ms, and 400 ms) using linear actuators and vibration motors is presented in Fig. 7 (A) and Fig. 7 (B), respectively. Based on the density of the scatter lines, the linear actuators seem to show a marginal improvement in accuracy over the vibration motors. Participants expressed confusion in recognizing the presence or absence of actuation on the middle finger. With the actuation period set to 200 ms, a relative reduction in response accuracy with the vibration motor setup was observed for the Ring+Index finger setup due to the occurrence of phantom sensation on the middle finger for some participants. At the same time, a decreased accuracy for the Ring+Middle+Index finger setup was observed, as actuation on the middle finger failed to generate a sensation in some participants. The data also shows that these confusions can be minimized by varying 1) the actuation duration and 2) the interval between each actuation of the same stimulus.

3. Actuation of Multiple Fingers Across Both Hands: Similar to the previous category, responses that exactly match the corresponding stimuli presented were counted as correct. The percentage of correct responses for each stimulus with varying actuation durations (200 ms, 500 ms, and 800 ms) and intervals between actuations of the same stimulus (0 ms, 200 ms, and 400 ms) using linear actuators and vibration motors is presented in Fig. 7 (C) and Fig. 7 (D), respectively. In this case, the density of the scatter lines indicates that the linear actuators show higher average accuracy than the vibration motors. As expected, participants expressed confusion in recognizing the presence or absence of actuation on the middle fingers across both hands. The data presented reiterates that these phantom effects can be minimized by varying the actuation duration and the interval between each actuation of the same stimulus. The relative decrease in accuracy compared to actuation on a single hand (2nd category) indicates that confusion rises as the number of stimuli increases.

4. Rise of Confusion with Increase in Number of Actuations: Average accuracy rates were calculated from the correct responses to stimuli comprising one to five actuations, with varying actuation durations (200 ms, 500 ms, and 800 ms) and intervals between different actuations of the same stimulus (0 ms, 200 ms, and 400 ms); these rates reflect the content that users correctly perceive. Results show that incorrectly perceived sensations (and thus incorrectly delivered content) increase with the number of actuations (as in Fig. 8 (A) and (B)). Furthermore, results from a t-test analysis confirmed significantly reduced accuracy of feedback from both devices. For the desktop device (using linear actuators), the actuation period of 500 ms with an interval of 400 ms between each actuation in the same stimulus delivered the most reliable results in terms of correct feedback responses, with an average accuracy of 94.58% (STDEV = 2.33). Likewise, for the wearable device (using vibration motors), the actuation period of 500 ms with an interval of 400 ms between each actuation of the same stimulus delivered the most accurate results, with an average accuracy of 90.3% (STDEV = 4.6).
Fig. 9. Field Trial: evaluation of the device with sensory impaired volunteers
5. Accuracy of Recognizing Braille Characters: As the aim of this study was to evaluate the ability of these two devices to accurately convey braille characters, not to assess the participants’ memory of braille, participants were asked to respond by indicating the fingers on which they felt a sensation rather than reporting the corresponding braille character. Responses that exactly matched the actuation of the corresponding braille character were counted as correct. As plotted in Fig. 8 (C) and (D), the density of the scatter lines indicates the increased accuracy of the responses recorded via linear actuators when compared to the vibration motors. A decrease in accuracy was noticed for characters such as ‘g’, ‘j’, ‘x’ and ‘y’, which have a higher number of simultaneous actuations than characters such as ‘a’ and ‘b’. Character ‘x’ corresponds to actuations on the Ring+Index fingers of both hands simultaneously. Its decrease in accuracy is therefore likely due to the occurrence of phantom sensations on the middle finger when the Ring+Index fingers are actuated at the same time. The lowest accuracy level is observed for character ‘y’, which requires five simultaneous actuations. These observations were similar for both devices, regardless of the actuation method.
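The exact-match scoring rule used in categories 2 to 5 can be sketched in Python as follows. Finger numbers follow Table 1 (1 to 3 are the ring, middle and index fingers of one hand; 4 to 6 are the index, middle and ring fingers of the other hand). The function names are hypothetical, and the responses in the usage example are invented solely to show the mechanics; they are not data from the study.

```python
# Braille characters from Table 1, as sets of actuated finger numbers.
STIMULI = {
    "a": {3}, "b": {2, 3}, "l": {1, 2, 3}, "c": {3, 4},
    "x": {1, 3, 4, 6}, "g": {2, 3, 4, 5}, "j": {2, 4, 5},
    "y": {1, 3, 4, 5, 6},
}


def accuracy(trials: list[tuple[set[int], set[int]]]) -> float:
    """Percent of (presented, reported) pairs that match exactly."""
    hits = sum(presented == reported for presented, reported in trials)
    return 100.0 * hits / len(trials)


def phantom_fingers(presented: set[int], reported: set[int]) -> set[int]:
    """Fingers reported although they were not actuated."""
    return reported - presented


def missed_fingers(presented: set[int], reported: set[int]) -> set[int]:
    """Fingers actuated but not reported."""
    return presented - reported


if __name__ == "__main__":
    # Invented responses, for illustration only.
    trials = [
        (STIMULI["x"], {1, 3, 4, 6}),     # exact match
        (STIMULI["x"], {1, 2, 3, 4, 6}),  # phantom sensation on finger 2
        (STIMULI["a"], {3}),              # exact match
        (STIMULI["l"], {1, 3}),           # middle finger not perceived
    ]
    print(accuracy(trials))
    print(phantom_fingers(*trials[1]), missed_fingers(*trials[3]))
```

Treating each stimulus as a set makes the two error types reported above easy to separate: a phantom sensation appears as an extra element in the response, while a missed actuation appears as a missing one.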
5 Field Trial

In addition to the stimuli perception study, we also conducted a field trial with both devices. A total of four volunteers evaluated the devices at the Singapore Association For the Deaf: volunteers A (Male, 46) and B (Female, 50), with conditions of deafblindness from early childhood; hearing impaired volunteer C (Male, 36), with loss of peripheral vision (Kalnienk vision); and hearing impaired volunteer D (Female, 32), with progressive vision loss. All the volunteers evaluated the desktop device, while volunteers C and D also evaluated the wearable device. The standard 6-dot braille protocol was employed. After a few rounds of familiarization, braille characters and test words such as “hello” and “good” were conveyed through the devices. The volunteers were also encouraged to send messages via the input interface. A typical setup of the field trial is displayed in Fig. 9.

Volunteer A, who had prior experience with braille devices, evaluated the desktop version and easily comprehended the test characters and words. Volunteer A advised us to integrate haptic guides on the device to ensure that the end-user can locate the actuators and buttons independently. Volunteer B, who had no prior experience with braille devices, was hesitant to step out of her comfort zone and evaluate the device. With assistance from caretakers and interpreters, she was introduced to the concept of Finger Braille. Although she did not comprehend the concept during the first few rounds, she indicated comprehension of the test words in later rounds. The relatively younger volunteers C and D operated the desktop and wearable devices without any difficulties. To ensure comfort in keeping up with the conversation, the importance of reconfirmation was highlighted: the displayed text or synthesized speech should be fed back to the deafblind individual for confirmation. Volunteers C and D also enthusiastically interacted with and explored the image-to-braille and Twitter functionality of the supporting software application.
6 Usage Scenarios and Discussion

The proposed technology has potential applications in many different environments and scenarios. Possible applications for this platform include:

1. Smart Communication Tools: Deafblind individuals mainly use Finger Braille or other types of tactile sign language to communicate. However, only a small number of people are familiar with these languages, and this can mean that a deafblind individual’s social interactions are severely restricted. The proposed assistive platform may help users to directly communicate with others by translating speech, text and image data into Finger Braille, and vice versa (as shown in Fig. 10).

2. Smart Technology for Mobility: Due to their impairments, deafblind individuals require others to accompany them whilst travelling. By integrating the proposed assistive platform into public transport services, users will have the opportunity to access information that could enable them to select transport routes and explore more independently (as shown in Fig. 11).
Fig. 10. An illustration showing how the proposed assistive platform enables communication between impaired and non-impaired individuals by handling text, speech and Finger Braille.
3. Accessibility to Information on Surrounding Landmarks and Objects: In the example of a museum or gallery (as shown in Fig. 12), an exhibit may include a 3D replica that acts as a textural representation of the original work. At the same time, just as a non-impaired person may rent an audio guide or listen to a docent (tour guide), deafblind visitors can receive the same real-time information via the proposed Finger Braille device. Furthermore, the assistive platform may be incorporated in such a way that users can also scan exhibits, using smartphones, to obtain additional tactile feedback information on the work.

Additionally, it is essential to answer several research questions before future developments of this work, as explained below:

1. Study Different Haptic Sensations on Different Parts of the Body for Effective Information Transfer: At present, there does not seem to exist a language for communicating and codifying haptic-based information delivery systems/methods, for example, Finger Braille. Different research groups around the world have conducted many experiments that have used different body parts, including the hand and the back of the body (dorsal thoracic area). These studies have not used a consistent method (poking vs. electrical stimulation) to compare effectiveness. Therefore, to answer this question, systematic studies of
Fig. 11. An illustration showing how the proposed assistive platform may be integrated into public transport services to provide accessible information.
Fig. 12. An illustration showing how the proposed assistive platform may be incorporated into museums and galleries in order to provide accessible information regarding exhibits.
these different schemes are required, primarily focusing on muscle memory and motor learning through haptic sensations.

2. Design, Implement and Evaluate End-to-End New Media Platform for the People with Different Sensory Impairments to Interact with the Outside World: As explained above, once the taxonomy (and a proper model) is developed, extended studies are required to understand the effects of these variables on delivering information. In addition, we would also study how these technological solutions can empower people with sensory impairments and give them a sense of equality with the rest of the world.

3. How to Generalize and Apply this Proposed Platform to Work with Multisensory Modalities: Once we have a platform that can deliver information using multiple senses, a whole new avenue of study will open: how multisensory information is processed and understood by the human brain. To answer these future questions, the platform we propose will need to be generalized and extended further while experiments are conducted in parallel. In the future, it is also possible to utilize other haptic modalities, such as thermal, scratching, and pinching sensations, as well as the smell and taste modalities, to establish alternative non-audiovisual interaction methods [10, 12–15, 17, 18].
7 Conclusion

In this paper, we have described EnPower, an assistive communication platform that converts text, speech, and spatial image data into accessible Finger Braille-based tactile stimuli. The EnPower platform consists of three main components: a wearable assistive device based on vibrotactile stimulation, a desktop version using linear actuation to simulate tapping sensations, and a mobile application connected to these two devices. Preliminary experimental results and a field trial revealed several important findings on formulating different tactile stimuli on multiple fingertips, including the delay required between stimuli for improved perception. To provide real-world applications of the platform, we have included and illustrated several everyday scenarios in which the proposed system could be employed to enhance accessibility for individuals with visual and auditory impairments. To preserve effective, crucial, and meaningful interactions with the people and objects around us, we believe that the development of assistive platforms must be continually explored and supported by the emergence of new, more accessible technologies.

Acknowledgments. This research is supported by the National Research Foundation, Prime Minister’s Office, Singapore under its International Research Centres in Singapore Funding Initiative.
References

1. Bach-y Rita, P., Kercel, S.W.: Sensory substitution and the human–machine interface. Trends Cogn. Sci. 7(12), 541–546 (2003)
2. Fukumoto, M., Tonomura, Y.: “Body coupled FingerRing”: wireless wearable keyboard. In: Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems, pp. 147–154. ACM (1997)
3. Fukushima, S.: Person with DeafBlind and Normalization. Akashi Shoten, Tokyo (1997)
4. Gollner, U., Bieling, T., Joost, G.: Mobile lorm glove: introducing a communication device for deaf-blind people. In: Proceedings of the Sixth International Conference on Tangible, Embedded and Embodied Interaction, pp. 127–130. ACM (2012)
5. Hoshino, T., Otake, T., Yonezawa, Y.: A study on a finger-braille input system based on acceleration of finger movements. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 85(3), 731 (2002)
6. Koo, I.M., Jung, K., Koo, J.C., Nam, J.-D., Lee, Y.K., Choi, H.R.: Development of soft-actuator-based wearable tactile display. IEEE Trans. Robot. 24(3), 549–558 (2008)
7. Lee, J., Kim, Y., Kim, G.: Funneling and saltation effects for tactile interaction with virtual objects. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI’2012, New York, NY, USA, pp. 3141–3148. ACM (2012)
8. Ochi, T., Kozuki, T., Suga, H.: Bracelet type braille interface. Correspondence Hum. Interface 5(1) (2003)
9. Osada, K., Katayama, S., Wang, K., Kitajima, K.: Development of communication system with finger braille robot for deaf-blind people. Proc. Hum. Interface Symp. 2000, 41–44 (2000)
10. Peiris, R.L., Feng, Y.-L., Chan, L., Minamizawa, K.: Thermalbracelet: exploring thermal haptic feedback around the wrist. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI’2019, New York, NY, USA, pp. 170:1–170:11. ACM (2019)
11. Ramirez-Garibay, F., Olivarria, C.M., Aguilera, A.F.E., Huegel, J.C.: Myvox - device for the communication between people: blind, deaf, deaf-blind and unimpaired. In: IEEE Global Humanitarian Technology Conference (GHTC 2014), pp. 506–509. IEEE (2014)
12. Ranasinghe, N., Jain, P., Karwita, S., Tolley, D., Do, E.Y.-L.: Ambiotherm: enhancing sense of presence in virtual reality by simulating real-world environmental conditions. In: Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, CHI’2017, New York, NY, USA, pp. 1731–1742. ACM (2017)
13. Ranasinghe, N., Jain, P., Tram, N.T.G., Koh, K.C.R., Tolley, D., Karwita, S., Liangkun, L.Y.Y., Shamaiah, K., Tung, C.E.W., Yen, C.C., Do, E.Y.-L.: Season traveller: multisensory narration for enhancing the virtual reality experience. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, CHI’2018, New York, NY, USA, pp. 577:1–577:13. ACM (2018)
14. Ranasinghe, N., Tram Nguyen, T.N., Liangkun, Y., Lin, L.-Y., Tolley, D., Do, E.Y.-L.: Vocktail: a virtual cocktail for pairing digital taste, smell, and color sensations. In: Proceedings of the 25th ACM International Conference on Multimedia, pp. 1139–1147 (2017)
15. Ranasinghe, N., Tolley, D., Tram Nguyen, T.N., Yan, L., Chew, B., Do, E.Y.-L.: Modulation of flavour experiences through electric taste augmentation: Augmented flavours. Food Res. Int. 117, 60–68 (2019)
16. Robertson, J., Emerson, E.: Estimating the Number of People with Co-Occurring Vision and Hearing Impairments in the UK. Centre for Disability Research, Lancaster University, Lancaster (2010)
17. Tewell, J., Bird, J., Buchanan, G.R.: Heat-NAV: using temperature changes as navigation cues. In: Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, CHI’2017, New York, NY, USA, pp. 1131–1135. ACM (2017)
18. Tolley, D., Tram Nguyen, T.N., Tang, A., Ranasinghe, N., Kawauchi, K., Yen, C.C.: Windywall: exploring creative wind simulations. In: Proceedings of the Thirteenth International Conference on Tangible, Embedded, and Embodied Interaction, TEI’2019, New York, NY, USA, pp. 635–644. ACM (2019)
19. Turner, O., Windfuhr, K., Kapur, N.: Suicide in deaf populations: a literature review. Ann. Gen. Psychiatry 6(1), 26 (2007)
20. Uehara, N., Aoki, M., Nagashima, Y., Miyoshi, K.: Bidirectional telecommunication using wearable i/o terminal for finger braille. Proc. Hum. Interface Symp. 2000, 2000–9 (2000)
21. Velázquez, R.: Wearable assistive devices for the blind. In: Wearable and Autonomous Biomedical Devices and Systems for Smart Environment, pp. 331–349. Springer (2010)
Adaptive Customized Forward Collision Warning System Through Driver Monitoring

Marco Stang, Martin Sommer, Daniel Baumann, Yuan Zijia, and Eric Sax

Karlsruhe Institute of Technology, Institute for Information Processing Technologies, Engesserstraße 5, 76131 Karlsruhe, Germany
{marco.stang,ma.sommer,d.baumann,eric.sax}@kit.edu
Abstract. Forward Collision Warning (FCW) is an advanced driver assistance system (ADAS) designed to prevent or reduce the severity of a collision by providing early warnings to the driver. The core algorithm of an FCW system uses the distance between the vehicle and the obstacle as a warning threshold. Since the system assumes one particular reaction time for all drivers, the threshold value cannot be individualized for different driver states, which leads to false or unnecessary alarms. Therefore, this paper proposes an adaptive custom collision warning system. The system relies on camera images to collect data such as the age, emotion, fatigue, and attention of the driver through Deep Residual Networks and the PERCLOS method (Percentage eye openness tracking). The information obtained is evaluated using fuzzy logic, and an appropriate reaction time is derived. The improved safe-distance algorithm calculates a safety distance appropriate to the driver’s condition. The system is evaluated with the National Highway Traffic Safety Administration (NHTSA) FCW standard test in the simulation environment IPG CarMaker.

Keywords: Advanced Driver Assistance System (ADAS) · Forward Collision Warning (FCW) · Facial information recognition · Fuzzy logic inference
1 Introduction
According to the World Health Organization’s Global Status Report on Road Safety 2018, approximately 1.35 million people are killed in road accidents every year. Between 20 and 50 million people suffer non-fatal injuries, and many of them are left disabled as a result [1]. Safety is, first and foremost, a matter of the “human factor”: Bekiaris et al. stated that about 90% of all road accidents are due to human error [2]. A system that is directly related to the driver’s condition, such as the forward collision warning system, is therefore essential to assist the driver. The shared basis of FCW systems is the use of sensors to detect moving or stationary vehicles. As soon as the distance to an obstacle falls below a safe distance and a crash is imminent, a signal warns the driver. At that point it is still possible to brake or to perform an evasive maneuver and hence prevent a possible crash. A driver in an observant condition can avoid potential dangers in advance. However, a driver in an adverse condition such as exhaustion or distraction could cause an accident even in a simple traffic scenario. The condition of the driver can be analyzed through many modalities, such as video signals, audio signals, physiological signals [3] and driving behavior. However, not every condition can be detected by every modality. For example, a person’s speech may reflect their emotional state well, but it reveals little about whether they are distracted or tired, because the relevance between speech and the monitored state is limited [4].
© Springer Nature Switzerland AG 2021 K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 757–772, 2021. https://doi.org/10.1007/978-3-030-63089-8_50
2 State of the Art
2.1 Collision Warning Algorithm
Collision warning algorithms, also called collision avoidance algorithms, use variables of the vehicle and the environment as a measure for rating the severity of traffic conflicts and for discriminating critical from normal behavior. The collision warning algorithm determines the trigger logic and timing of the warning system and plays a decisive role in its performance. A functioning collision warning algorithm should not only provide the collision warning signal in time, but also ensure that the driver is not disturbed by a premature warning. For this purpose there are currently two main algorithms: the safe time logic algorithm and the safe distance logic algorithm. The safe time logic algorithm issues a collision warning when the time to collision (TTC) is less than a certain fixed threshold. The safe distance logic algorithm, in contrast, issues a collision warning when the distance between the vehicle and the obstacle ahead is less than a certain fixed distance threshold. These two types of collision algorithms are widely used in various FCW systems, and many experts and scholars have improved them to better adapt to the actual driving situation.
2.2 Safe Distance Model
The safe distance model is an algorithm that uses the distance between the vehicle and the obstacle ahead to evaluate the current risk situation. Using information from the vehicle’s sensors, the model observes and analyzes the entire process, from detecting the obstacle in front to braking and stopping the vehicle. The distance covered during this process is calculated and set as the warning threshold. Because the safe distance algorithm takes full account of the vehicle dynamics during braking, its results are more reliable. Traditional safe distance models include the Mazda model, the Honda model and the Berkeley model. Many newer algorithms build on these models with various improvements. The formulas for calculating the safe distance for each model are shown in Table 1.
Table 1. Safe distance models
Model | Formula
Mazda model | $d = \frac{1}{2}\left(\frac{v_1^2}{a_1} - \frac{v_2^2}{a_2}\right) + v_1 t_1 + v_2 t_2 + d_0$
Honda model | $d = \begin{cases} v_2 t_2 + a_1 t_1 t_2 - \frac{a_1 t_1^2}{2}, & \frac{v_2}{a_2} \ge t_2 \\ v_1 t_1 - \frac{a_1 (t_1 - t_2)^2}{2} - \frac{v_2^2}{2 a_2}, & \frac{v_2}{a_2} < t_2 \end{cases}$
Berkeley model | $d = \frac{v_2^2 - v_1^2}{2a} + v_2 t_2 + d_0$

2.3 Fuzzy Control System
Based on the literature review, reaction time is the main indicator for assessing the driver’s condition. However, only very limited data sets relate age, fatigue, emotion and attention to reaction time in a form that is applicable for machine learning or data analysis. These factors are difficult to measure and their metrics are heterogeneous. For example, it is difficult to measure people’s reaction time under different emotions, because there is no effective way to induce the various emotions. In addition, even if a single factor can be measured, the combined relationship between the four factors and the reaction time is still hard to calculate. Nevertheless, the qualitative relationship between the four inputs and the output is known from basic logic and experience. For example, an old, tired, angry and distracted driver should have a longer reaction time than an attentive young driver. Fuzzy control systems are a class of knowledge-based artificial intelligence systems that can describe uncertain and hard-to-describe variables which cannot be calculated with exact mathematical models. Therefore, a fuzzy inference system is especially appropriate for the proposed application. The block diagram of the fuzzy control system is shown in Fig. 1. A fuzzification interface converts controller inputs into information that the inference mechanism uses to activate and apply rules. A set of if-then rules contains a fuzzy logic quantification of the expert’s linguistic description of how to achieve appropriate control. A defuzzification interface converts the conclusions of the inference mechanism into actual inputs of the process [5].
Fig. 1. Block diagram of a Fuzzy Control System [5]
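The three stages of the block diagram (fuzzification, rule evaluation, defuzzification) can be illustrated with a minimal Mamdani-style sketch in Python. This is not the authors' controller: the two normalized inputs, the membership functions, the two rules and the 0.4 s to 1.8 s output range are all assumptions chosen only to make the data flow visible.

```python
def tri(x, a, b, c):
    """Triangular membership: 0 at a and c, 1 at the peak b (a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical input sets on normalized inputs in [0, 1].
def low(x):
    return tri(x, -1.0, 0.0, 1.0)


def high(x):
    return tri(x, 0.0, 1.0, 2.0)


# Hypothetical output sets on an assumed reaction-time universe (seconds).
T_MIN, T_MAX, STEPS = 0.4, 1.8, 140


def short_rt(t):
    return tri(t, T_MIN - 1.4, T_MIN, T_MAX)


def long_rt(t):
    return tri(t, T_MIN, T_MAX, T_MAX + 1.4)


def infer_reaction_time(fatigue, distraction):
    """Map two normalized driver-state values to a crisp reaction time."""
    fatigue = min(max(fatigue, 0.0), 1.0)
    distraction = min(max(distraction, 0.0), 1.0)
    # 1) Fuzzification: crisp inputs -> membership degrees.
    f_low, f_high = low(fatigue), high(fatigue)
    d_low, d_high = low(distraction), high(distraction)
    # 2) Rule evaluation (min = AND, max = OR), two illustrative rules:
    #    IF fatigue is low AND distraction is low THEN reaction time is short
    #    IF fatigue is high OR distraction is high THEN reaction time is long
    w_short = min(f_low, d_low)
    w_long = max(f_high, d_high)
    # 3) Defuzzification: centroid of the clipped, max-aggregated output sets.
    num = den = 0.0
    for i in range(STEPS + 1):
        t = T_MIN + (T_MAX - T_MIN) * i / STEPS
        mu = max(min(w_short, short_rt(t)), min(w_long, long_rt(t)))
        num += t * mu
        den += mu
    return num / den if den > 0 else (T_MIN + T_MAX) / 2
```

Calling `infer_reaction_time(0.0, 0.0)` yields a shorter reaction time than `infer_reaction_time(1.0, 1.0)`, which is the qualitative behavior the text describes; the absolute values depend entirely on the assumed membership functions.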
Driving a vehicle is a traditional field of application for fuzzy control, since a human factor is involved and therefore no binary decisions can be assumed. Chattaraj et al. implemented a fuzzy logic for reaction time prediction that includes control factors such as age, experience, intensity of driving, speed of the vehicle and distance to the obstacle ahead [6]. Ruhai et al. analyzed the impact of the driver reaction time on the safety distance and determined the reaction time based on the theory of fuzzy mathematics, considering various main factors [7]. Arroyo et al. used a fuzzy classifier to derive the degree of inattentiveness and alertness of the driver via smartphones or an active infrared radiation (IR) emitter [8].
3 Concept and Realization
The forward collision warning algorithm is the primary component of a collision warning system. Since it assumes a deterministic response time for all drivers, the warning threshold cannot be individualized for different driver conditions, leading to false or disturbing warnings. In order to adapt the collision warning algorithm to different driver conditions, a safety distance model with a self-update algorithm is proposed. The factors with the most significant influence on the driver’s condition are age, emotion, fatigue and attention. These parameters are recorded by a camera system and transferred to the fuzzy control system. With these input parameters the fuzzy control system calculates a reaction time adapted to the driver’s condition. This reaction time is finally converted into a safe distance by the safe distance model and fine-tuned by the self-update algorithm. The described approach is shown in Fig. 2.
Fig. 2. Structure of the subsystems of the Forward Collision System (FCW)
3.1 Driver Monitoring Systems
Currently existing technologies for monitoring the driver’s condition include camera monitoring, microphone arrays, electrocardiograms, electromyograms, respiratory monitoring and monitoring of manipulation behavior. The comparison and evaluation between these approaches is shown in Table 2.
Table 2. Comparison of sensors for an environment perception system
Performance \ Approach | Bioelectric | Steering motion | Face monitoring
Age | No | No | Yes
Emotion | No | No | Yes
Fatigue | Yes | Yes | Yes
Attention | No | Yes | Yes
Accuracy | Very Good | Good | Moderate
Simplicity | Difficult | Moderate | Easy
Speed | Fast | Slow | Fast
Comparing the results in the table, facial monitoring was identified as the most promising method. In addition to the differences listed in the table, detection devices based on physiological signals can have an invasive effect on the driver’s comfort and operation, since they must be in direct contact with the driver. Furthermore, invasive methods create a strong feeling of being monitored. Therefore, the proposed system uses the camera monitoring approach. Further advantages are the easy installation of the camera system and its minimal influence on the driver’s behaviour during the measurement.
3.2 Analysis and Recognition of Driver Condition Factors
As face recognition is an emerging subject in many industries, a general approach for facial image processing has become established, which is shown in Fig. 3.
Fig. 3. Face recognition process
To process facial information further, specific landmark points of the face, such as the nose, eyes, mouth and facial contour points, need to be extracted. Kazemi and Sullivan [9] use an ensemble of regression trees to estimate the positions of the facial landmarks directly from a sparse subset of pixel intensities, thus achieving real-time performance with high-quality predictions. The method is integrated into the Dlib library [10], which allows the classifier to be implemented quickly and easily.
3.3 Driver State Recognition System
Helen et al. [11] divided the Fatality Analysis Reporting System (FARS) database from the National Highway Traffic Safety Administration (NHTSA) into several groups with the k-means clustering method. Factors like age, alcohol and drugs can be classified as long-term factors. Fatigue and carelessness due to distraction or emotional agitation can be classified as short-term factors. Since driving under the influence of alcohol or drugs is forbidden in many countries, the application considers age as the main long-term factor and fatigue, emotion and attention as short-term factors. Many researchers focus on the reaction time as the main indicator of the driver state. Because of the variance in physiological factors and the measurement difficulties, there is no mathematical model that describes the relationship between reaction time and driver state. However, the qualitative relationships are evident from the research conducted to date. According to [12], the braking reaction time increases progressively between the ages of 20 and 80, from about 0.4 s to 1.8 s. Guo et al. found that the average increase in reaction time from an alert state to a fatigued state is 16.72% [13]. Kaewken [14] stated that reaction times range from about 0.5 s to 1.8 s depending on the degree of distraction. Hu et al. found that emotions can negatively influence the driver’s behavior in terms of risk perception and attitude [15]. Therefore, the driver’s state can be analyzed from age, emotion, fatigue and attention.
3.4 Age and Emotion Recognition and Measurement
A camera is used as the sensor to monitor the driver. Consequently, age and emotion recognition becomes an image classification problem. The key to classifying images is the extraction of features of the human face; therefore a Convolutional Neural Network (CNN), in particular the ResNet, is suitable for recognition. ResNet is the residual network designed by He et al. [16]. It effectively addresses the vanishing/exploding gradients problem and is thus not limited by the number of hidden layers. The databases FER2013 [17] and CK+ [18] are used as the training basis for the emotion recognition. FER2013 is a face database with seven categories of emotions: anger, disgust, fear, happiness, sadness, surprise and neutrality. It consists of 28709 training examples and 3589 validation examples. CK+ is the extended Cohn-Kanade dataset, which includes 327 sequences of images with emotion labels. In order to enlarge the training set, the two data sets are merged into a bigger mixed dataset. For age training, the UTKFace dataset [19] is used. UTKFace is a large-scale face dataset with a long age span (from 0 to 116 years old). After training, the overall accuracy reaches 90% on the age validation dataset and about 80% on the emotion validation dataset. Figure 4 shows the emotion confusion matrix. The predicted values for the happy, angry and neutral emotions, which play an important role in the judgement of the driver’s state, range from 62% to a maximum of 90%. The age and emotion can be recognized directly through the trained ResNet model. To continuously monitor the driver’s condition, the system uses a cumulative counting method to count the different emotional states per time unit. The different emotions are labeled with different weights according to the emotion’s effect.

Fig. 4. Emotion Confusion Matrix

Positive emotions such as happiness are considered to reduce the reaction time and are therefore weighted with 0. The neutral emotion is weighted with 1, and negative emotions such as anger, fear and sadness are weighted with 2. By detecting the driver’s emotions and counting emotion labels per time unit, the driver’s emotional state can be quantified according to the following formula:

$$\text{AverageEmotionState} = \frac{(w_0, w_1, w_2)\begin{pmatrix} t_{\text{positive}} \\ t_{\text{neutral}} \\ t_{\text{negative}} \end{pmatrix}}{t_{\text{unit}}} \tag{1}$$

where $(w_0, w_1, w_2)$ is the weight vector of the different emotions and $(t_{\text{positive}}, t_{\text{neutral}}, t_{\text{negative}})^T$ is the time spent in each emotion state per unit time.
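Equation (1) is a weighted time average and can be sketched in a few lines of Python. The weights 0, 1 and 2 come from the text; the function name, the dictionary layout and the 60 s window in the usage note are illustrative assumptions.

```python
# Weights from the text: positive -> 0, neutral -> 1, negative -> 2.
WEIGHTS = {"positive": 0, "neutral": 1, "negative": 2}


def average_emotion_state(durations, t_unit):
    """Weighted emotion time per unit time, following Eq. (1).

    durations: seconds spent in each emotion class within the window,
               keyed by "positive", "neutral" and "negative" (assumed layout).
    t_unit:    length of the observation window in seconds.
    """
    if t_unit <= 0:
        raise ValueError("t_unit must be positive")
    weighted = sum(WEIGHTS[k] * durations.get(k, 0.0) for k in WEIGHTS)
    return weighted / t_unit
```

For an assumed 60 s window with 20 s positive, 30 s neutral and 10 s negative emotion, the call `average_emotion_state({"positive": 20, "neutral": 30, "negative": 10}, 60)` evaluates (0·20 + 1·30 + 2·10) / 60. The result lies between 0 (only positive emotion) and 2 (only negative emotion).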
3.5 Fatigue and Attention Recognition and Measurement
The difficulty of fatigue detection lies not only in the recognition itself but also in finding a suitable metric. The NHTSA conducted a project to assess the validity and reliability of various sleep deprivation detection measures. Among the fatigue detection measures and technologies evaluated in this study, the measure known as “PERCLOS” proved to be the most reliable and valid determination of a driver’s degree of alertness. PERCLOS is the percentage of eyelid closure over the pupil over time and reflects slow eyelid closures (“droops”) rather than blinks. A PERCLOS alertness metric was established in a driving simulator study as the proportion of time in a minute that the eyes are at least 80% closed. The Federal Highway Administration (FHWA) and NHTSA consider PERCLOS to be among the most promising known real-time measures of alertness for in-vehicle drowsiness-detection systems [20]. In order to recognize eye closure through the camera, the landmarks of the eyes are detected. By computing the longitudinal and lateral Euclidean distances between the eye landmarks, the aspect ratio of the eye can be calculated. An aspect ratio smaller than a threshold means that the eyes can be considered closed. The PERCLOS fatigue state is obtained as described in the following formula:

$$\text{FatigueState} = \text{PERCLOS} = \frac{\text{time of eyes at least 80\% closed}}{t_{\text{unit}}} \tag{2}$$
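A minimal Python sketch of this step is given below. The paper does not state the exact aspect-ratio formula, so the sketch assumes the commonly used six-landmark eye aspect ratio (two vertical distances over twice the horizontal distance); the landmark ordering, the threshold of 0.2 and the frame-based approximation of time are likewise assumptions.

```python
import math


def eye_aspect_ratio(eye):
    """Eye aspect ratio from six (x, y) landmarks ordered p1..p6.

    Assumed ordering: p1 and p4 are the eye corners, p2 and p3 lie on the
    upper lid, p6 and p5 on the lower lid directly below them.
    """
    p1, p2, p3, p4, p5, p6 = eye
    vertical = math.dist(p2, p6) + math.dist(p3, p5)
    horizontal = math.dist(p1, p4)
    return vertical / (2.0 * horizontal)


def perclos(ear_per_frame, closed_threshold=0.2):
    """Fraction of frames in which the eye counts as closed, as in Eq. (2).

    With a constant frame rate, the frame fraction approximates the time
    fraction. closed_threshold is a hypothetical value that would need
    calibration to correspond to "at least 80% closed".
    """
    if not ear_per_frame:
        raise ValueError("at least one frame is required")
    closed = sum(1 for ear in ear_per_frame if ear < closed_threshold)
    return closed / len(ear_per_frame)
```

In practice the per-frame ratios of both eyes would typically be averaged before thresholding; the sketch leaves that choice to the caller.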
Attention detection faces the same difficulty as fatigue detection: there is no official or generally accepted metric for quantifying it. Lee et al. [21] used an image-based real-time gaze zone estimator based on the driver’s head orientation. The head orientation, composed of yaw and pitch, is used to estimate the driver’s gaze zone in order to detect inattention. After measuring across 2000 images, they determined a viewing zone in terms of yaw and pitch angles. If the driver’s head orientation is outside the gaze zone, the driver is considered distracted:

$$\text{DistractionState} = \frac{t_{\text{abnormal angle}}}{t_{\text{unit}}} \tag{3}$$
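Equation (3) can be sketched in Python as the fraction of frames in which the head orientation leaves an allowed yaw/pitch zone. The rectangular zone and its limits of 30° yaw and 20° pitch are placeholders for illustration, not the values determined in [21], and frames are again used as a proxy for time.

```python
def distraction_state(head_poses, yaw_limit_deg=30.0, pitch_limit_deg=20.0):
    """Fraction of frames with head orientation outside the gaze zone, Eq. (3).

    head_poses: sequence of (yaw, pitch) angles in degrees, one per frame.
    The limits are hypothetical and symmetric around the forward direction.
    """
    if not head_poses:
        raise ValueError("at least one frame is required")
    outside = sum(
        1
        for yaw, pitch in head_poses
        if abs(yaw) > yaw_limit_deg or abs(pitch) > pitch_limit_deg
    )
    return outside / len(head_poses)
```

The sketch does not exclude intentional head turns such as mirror or shoulder checks; handling those would need additional logic on top of this count.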
The proposed system counts the distraction time per unit of time, based on the work of Lee et al. [21], to continuously monitor the driver’s head orientation. The system does not consider situations that require a turn of the head, such as looking over the shoulder or looking in the rear-view mirror, to be distractions. With the aforementioned methods, age, fatigue, emotion and attention can be recognized and displayed in real time.
3.6 Fuzzy Control System
This subsystem uses the Fuzzy Logic Toolbox provided by MATLAB® in order to calculate the reaction time in accordance with the fuzzy logic.
A young driver aged 20–35 years with no abnormal states is inferred to have a reaction time of 0.46 s. When the driver is heavily fatigued, the reaction time changes to 1.1 s. An old driver with all states bad is inferred to have a reaction time of 1.64 s. These results approximately match the measurement data in the literature. Figures 5 and 6 display the reaction time as a function of the corresponding distraction, fatigue and emotion values. According to Fig. 5, the three phases of fatigue are located in the intervals 0 to 0.06, 0.06 to 0.198 and 0.198 to 1, which mean no fatigue, mild fatigue and heavy fatigue. The distraction values are located in the intervals 0 to 0.2, 0.2 to 0.4 and 0.4 to 1, meaning no distraction, mild distraction and heavy distraction. According to the fuzzy rules, if the fatigue value is in the first interval and the distraction value is also in the first interval, the reaction time should be at its minimum. As the distraction and fatigue values increase, the reaction time also increases. An increase of the distraction and fatigue values to the third phase (heavy distraction and heavy fatigue) leads to a maximum reaction time, which can be seen in the top area of Fig. 5.
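The interval boundaries quoted above can be turned into a small lookup, sketched here in Python. This is a crisp simplification for illustration only: the fuzzy system itself uses overlapping membership functions, and assigning a boundary value to the higher phase is an assumption.

```python
from bisect import bisect_right

# Interval boundaries as stated in the text.
FATIGUE_BOUNDS = (0.06, 0.198)
DISTRACTION_BOUNDS = (0.2, 0.4)
LEVELS = ("none", "mild", "heavy")


def phase(value, bounds):
    """Return the phase label for a value in [0, 1].

    Values exactly on a boundary fall into the higher phase (assumed).
    """
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must lie in [0, 1]")
    return LEVELS[bisect_right(bounds, value)]
```

For example, `phase(0.1, FATIGUE_BOUNDS)` falls into the mild-fatigue interval, while the same value passed with `DISTRACTION_BOUNDS` still counts as no distraction.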
Fig. 5. Reaction time with emotion and fatigue degrees
Fig. 6. Reaction time with attention and emotion degrees
3.7 Improved Safe Distance Model
This paper presents an adaptive driver state algorithm based on a safe distance model and the fundamentals of the above-mentioned collision warning algorithms (see Sect. 2.2). The algorithm was combined with the multi-scene safety distance model given by Zhewen et al. [22]. The reaction time term was added to reflect the driver’s characteristics and risk level, so that the algorithm takes both the driver’s characteristics and the kinematics of the vehicle braking process into account. First of all, the following reasonable assumptions are made:
1. The time interval between hearing the signal and applying the brake is called reaction time.
2. The process of brake force growth is ignored.
3. During the driver’s reaction time, the ego vehicle and the lead vehicle remain in their current motion state. After hearing the warning signal, the driver does not brake during the reaction time, but right after it.

Fig. 7. Velocity over time diagram for the scenario described in Fig. 8
Fig. 8. Relevant distances for the computation of the safe distance logic algorithm for two vehicles with the vehicle in front (Lead Vehicle) being slower than the ego vehicle

The algorithm is analyzed with a v-t diagram, shown in Fig. 7, which illustrates the kinematic process. Let $a_1$ be the deceleration of the ego car, $a_2$ the deceleration of the lead car, $t_n$ the time at which $v_1 = v_2$, and $t_r$ the reaction time of the driver. The time $t_s$ until the ego vehicle stops is calculated as follows:

$$t_s = t_r + \frac{v_1}{a_1} \quad \text{if } a_1 \le a_2 \text{ and } t_r \le t_n \le t_s. \tag{4}$$
When the front sensors detect the deceleration of the lead car and the warning signal is issued, the driver needs the time $t_r$ to brake; the ego vehicle therefore moves at its current constant speed during the reaction time, while the lead car keeps decelerating. The most dangerous time is $t_n$, where $v_1 = v_2$ and the distance between the two vehicles is smallest. Assuming $d_0$ is the distance between the two cars when both have stopped (see Fig. 8), the safe distance can be calculated as follows:

$$d = s_1 - s_2 + d_0 = \int_0^{t_n} v_1\,dt - \int_0^{t_n} v_2\,dt + d_0 = \frac{(v_1 - v_2)^2 + 2 a_1 t_r (v_1 - v_2) + a_1 a_2 t_r^2}{2(a_1 - a_2)} + d_0. \tag{5}$$

If the lead car is moving at a constant speed or is fully stopped, i.e. $a_2 = 0$, formula (5) simplifies to:

$$d = s_1 - s_2 + d_0 = \int_0^{t_n} v_1\,dt - \int_0^{t_n} v_2\,dt + d_0 = \frac{(v_1 - v_2)^2 + 2 a_1 t_r (v_1 - v_2)}{2 a_1} + d_0. \tag{6}$$

If the condition in (4) is not satisfied, which means that the ego car is always faster than the lead car, the distance decreases throughout the braking process. The most dangerous time is again the point of closest approach; therefore the distance can be calculated through the formula:

$$d = s_1 - s_2 + d_0 = v_1 t_r + \frac{v_1^2}{2 a_1} - \frac{v_2^2}{2 a_2} + d_0. \tag{7}$$

By combining Eqs. (5) and (7), the whole algorithm can be written as:

$$d = \begin{cases} \dfrac{(v_1 - v_2)^2 + 2 a_1 t_r (v_1 - v_2) + a_1 a_2 t_r^2}{2(a_1 - a_2)} + d_0, & a_1 \le a_2,\ t_r \le t_n \le t_s \\[2ex] v_1 t_r + \dfrac{v_1^2}{2 a_1} - \dfrac{v_2^2}{2 a_2} + d_0, & \text{other conditions} \end{cases} \tag{8}$$

Since the vehicle performance and the environment do not change, the safe distance depends only on the driver, in particular on the reaction time $t_r$. Figure 9 shows that different reaction times $t_r$ and vehicle speeds lead to remarkably different safe distances in the scenario assumed above.
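The two branches of Eq. (8) can be written as plain Python functions to make the dependence on the reaction time concrete. This is a sketch under stated assumptions: speeds are in m/s, $a_1$ and $a_2$ are treated as positive deceleration magnitudes in m/s², and the branch condition is not evaluated here, so the caller chooses which formula applies. The function names are illustrative.

```python
def safe_distance_equal_speed(v1, v2, a1, a2, t_r, d0):
    """First branch of Eq. (8), i.e. Eq. (5): closest approach when v1 == v2.

    Assumes a1 != a2 (the formula divides by a1 - a2).
    """
    if a1 == a2:
        raise ValueError("Eq. (5) is undefined for a1 == a2")
    dv = v1 - v2
    numerator = dv ** 2 + 2 * a1 * t_r * dv + a1 * a2 * t_r ** 2
    return numerator / (2 * (a1 - a2)) + d0


def safe_distance_full_stop(v1, v2, a1, a2, t_r, d0):
    """Second branch of Eq. (8), i.e. Eq. (7): both vehicles brake to a stop.

    A stationary lead vehicle (v2 == 0) contributes no braking distance.
    A moving lead vehicle with a2 == 0 never stops, so the formula does
    not apply and Python raises ZeroDivisionError.
    """
    lead_braking = v2 ** 2 / (2 * a2) if v2 > 0 else 0.0
    return v1 * t_r + v1 ** 2 / (2 * a1) - lead_braking + d0
```

In the second branch the reaction time enters only through the term $v_1 t_r$. With the reaction times inferred in Sect. 3.6 (0.46 s and 1.64 s) and an assumed ego speed of 20 m/s, the two safe distances therefore differ by 20 · (1.64 − 0.46) = 23.6 m.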
4 Self-Update Algorithm
The self-update algorithm implements a procedure for updating the reaction time $t_r$ according to the driver’s condition in real time. Considering the alarm at time $t$ and the driver’s actual reaction after the warning at time $t + t_d$ (where $t_d$ is a short time after the warning), four different situations can be distinguished. Table 3 describes these situations. A true warning at $t$ means that the warning is issued; if the driver brakes at $t + t_d$, the reaction is considered positive. Hard braking is an indicator that the warning signal came too late. Consequently, the driver’s reaction time must be increased. For a warning after which the driver does not brake, the reaction time is decreased. As long as there is no warning signal and the driver does not brake, the update algorithm does not have to do anything.
Fig. 9. Relationship between safe distance, reaction time and speed
A driver in a bad state, for example distracted by a mobile phone or emotionally distracted, will certainly not brake in time, so the reaction time $t_r$ should not be changed and should remain at a value that yields an early warning. But when the driver’s state is good, the driver’s behavior also needs to be taken into account in the update algorithm, because the risk level is normal.

Table 3. Good driver state and according actions in the self-update algorithm
Driver at t + td | Warning at t: True | Warning at t: False
Positive | In case of hard braking the reaction time increases | In case of hard braking the reaction time increases with bigger step size
Negative | The reaction time decreases | Do nothing
Algorithm 1 describes the whole update algorithm. α is a small step size of the changing process and tr_avg is the average reaction time during the driving period, so that the reaction time changes step by step relative to the average reaction time. An undistracted, non-tired driver (good state) may prefer a short following distance and might not accept an early warning. Thus tr should be shorter, so that the warning is issued later. If no warning is generated but the driver brakes hard, tr should increase substantially with a bigger step size β so that the safe distance becomes larger, because the warning would come too late for the driver, even later than the driver’s own braking.
Algorithm 1. Self-Update Algorithm
if driver_state_good == true then
    if warning_flag(t) == true then
        if brake_flag(t + td) == true && |a| > 4 then
            tr = tr + α(tr − tr_avg)
        else
            tr = tr − α(tr − tr_avg)
        end
    else
        if brake_flag(t + td) == true && |a| > 4 then
            tr = tr + β(tr − tr_avg)
        end
    end
else
    Do nothing
end
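A direct Python transcription of Algorithm 1 might look as follows. It follows the printed update rule literally, including the sign of the term (tr − tr_avg); the default step sizes `alpha` and `beta` are placeholder values, and the hard-braking threshold of 4 is taken from the algorithm with its unit left as in the source.

```python
def update_reaction_time(t_r, t_r_avg, driver_state_good, warning_issued,
                         driver_braked, decel,
                         alpha=0.05, beta=0.2, hard_brake_threshold=4.0):
    """One step of the self-update rule, transcribed from Algorithm 1.

    alpha and beta are hypothetical step sizes (the text only requires
    beta to be the bigger one). decel is the measured acceleration a;
    only its magnitude is used.
    """
    if not driver_state_good:
        return t_r  # bad driver state: keep the early-warning setting
    hard_brake = driver_braked and abs(decel) > hard_brake_threshold
    if warning_issued:
        if hard_brake:
            return t_r + alpha * (t_r - t_r_avg)
        return t_r - alpha * (t_r - t_r_avg)
    if hard_brake:
        # No warning was issued but the driver braked hard: bigger step.
        return t_r + beta * (t_r - t_r_avg)
    return t_r
```

The function is meant to be called once per evaluated warning event, feeding the returned value back in as `t_r` for the next event.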
5 Simulation and Test
To test the performance of the system, test procedures are conducted with the simulation environment IPG CarMaker for Simulink. The National Highway Traffic Safety Administration (NHTSA) provides specifications for conducting tests to confirm the existence of a Forward Collision Warning (FCW) system on a passenger vehicle [23]. A Volkswagen Beetle was selected as the demo car. It is equipped with two millimeter-wave radars operating at 24 GHz and 77 GHz for short and long distance. The driver is captured and modeled with the described face recognition method in order to determine the reaction time of the driver. The test was divided into three groups: lead car stopped, lead car brakes, and both cars drive at constant speed [23]. The tests were conducted with a young driver in a good state and an old driver in a bad state. Test 1 was also conducted with a young driver who is distracted by looking at a phone. Figure 10 shows the young, distracted driver in test 1. The NHTSA has defined time-to-collision (TTC) criteria that must be observed. The TTC indicates the time it takes for the subject vehicle to collide with the lead vehicle if the current relative velocity (the velocity difference between the two vehicles) is maintained [24]. The measured TTC values of all three tests with different drivers and the minimum required TTC are displayed in Table 4. The tests show how the TTC value adapts to different drivers and the states they are in. The measured TTC values all satisfy the requirements of the NHTSA.
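The TTC definition above (gap divided by the current relative velocity) can be sketched in Python as follows. The function names are illustrative, and treating a non-closing gap as an infinite TTC is an assumption of this sketch.

```python
import math


def time_to_collision(gap_m, v_subject, v_lead):
    """TTC in seconds if the current relative velocity is maintained.

    gap_m:     distance between subject and lead vehicle in metres.
    v_subject: speed of the subject (ego) vehicle in m/s.
    v_lead:    speed of the lead vehicle in m/s.
    Returns math.inf when the gap is not closing (assumed convention).
    """
    closing_speed = v_subject - v_lead
    if closing_speed <= 0:
        return math.inf
    return gap_m / closing_speed


def meets_requirement(measured_ttc, required_ttc):
    """A warning passes if it is issued at or above the required TTC."""
    return measured_ttc >= required_ttc
```

For instance, a gap of 30 m that closes at 10 m/s corresponds to a TTC of 3 s, which would satisfy a required TTC of 2.1 s.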
Fig. 10. Procedure with a young driver with mild distraction and neutral emotion

Table 4. Test results for different drivers
Test | Measured TTC | Required TTC (NHTSA)
Lead car stopped, YD GS (a) | 3.03 s | 2.1 s
Lead car stopped, YD MD (b) | 3.36 s | 2.1 s
Lead car stopped, OD BS (c) | 4.2 s | 2.1 s
Lead car braking, YD GS | 5.8 s | 2.4 s
Lead car braking, OD BS | 6.64 s | 2.4 s
Both cars at constant speed, YD GS | 3.2 s | 2.0 s
Both cars at constant speed, OD BS | 5.06 s | 2.0 s
(a) YD GS = young driver in a good state
(b) YD MD = young driver mildly distracted by phone
(c) OD BS = old driver in a bad state
6 Summary
This paper designs an adaptive customization of vehicle warning signals through a camera-based driver monitoring system. Traditional FCW systems cannot adapt to different driver states, because the core algorithm (safe distance model or safe time model) adopts a deterministic reaction time for all drivers instead of an individualized reaction time for different driver states, which occasionally causes false warnings. One of the worst aspects of any deterministic FCW system is the generation of a large number of annoying false warnings. Thus, this system analyzes the factors affecting the driver and identifies the four main ones, namely age, emotion, fatigue and attention. Then the relationship between these factors and the driver’s reaction time is analyzed. The system uses the camera to collect facial information of the driver and recognizes and measures these states through a Deep Residual Network, PERCLOS and image processing methods. When the driver is fatigued, distracted or in a bad emotional state, the system issues corresponding warnings. The driver state is then passed to the fuzzy logic inference system to approximate a reasonable reaction time. The safe distance is calculated by the safe distance algorithm, which includes the reaction time. When the headway is smaller than the safe distance, a warning is generated to remind the driver to brake. The threshold is fine-tuned according to the real driving conditions through a learning-based self-update algorithm. Finally, the performance is tested and evaluated according to the NHTSA FCW standard test. The test is based on a simulation in Simulink with CarMaker. The results show that the system satisfies the test standards and can adapt to different driver states.
References
1. World Health Organization: Global status report on road safety 2018: Summary. World Health Organization, Technical report (2018)
2. Bekiaris, E., Petica, S., Brookhuis, K.: Driver needs and public acceptance regarding telematic in-vehicle emergency control aids. In: Mobility for everyone. In: 4th World Congress on Intelligent Transport Systems, 21–24 October 1997, Berlin (1997)
3. Blöcher, T., Schneider, J., Schinle, M., Stork, W.: An online PPGI approach for camera based heart rate monitoring using beat-to-beat detection. In: 2017 IEEE Sensors Applications Symposium (SAS), IEEE, pp. 1–6 (2017)
4. Eyben, F., Wöllmer, M., Poitschke, T., Schuller, B., Blaschke, C., Färber, B., Nguyen-Thein, N.: Emotion on the road: necessity, acceptance, and feasibility of affective computing in the car. In: Advances in Human-Computer Interaction 2010 (2010)
5. Mahmoud, M.S.: Fuzzy Control, Estimation and Diagnosis. Springer, Saudi Arabia (2018)
6. Chattaraj, U., Dhusiya, K., Raviteja, M.: Fuzzy inference based modelling of perception reaction time of drivers. Int. J. Comput. Inf. Eng. 11(1), 8–12 (2016)
7. Ruhai, G., Weiwei, Z., Zhong, W.: Research on the driver reaction time of safety distance model on highway based on fuzzy mathematics. In: 2010 International Conference on Optoelectronics and Image Processing, IEEE, vol. 2, pp. 293–296 (2010)
8. Arroyo, C., Bergasa, L.M., Romera, E.: Adaptive fuzzy classifier to detect driving events from the inertial sensors of a smartphone. In: 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), IEEE, pp. 1896–1901 (2016)
9. Kazemi, V., Sullivan, J.: One millisecond face alignment with an ensemble of regression trees. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1867–1874 (2014)
10. King, D.E.: Dlib-ML: a machine learning toolkit. J. Mach. Learn. Res. 10, 1755–1758 (2009)
11. Helen, W., Almelu, N., Nivethitha, S.: Mining road accident data based on diverted attention of drivers. In: 2018 Second International Conference on Intelligent Computing and Control Systems (ICICCS), IEEE, pp. 245–249 (2018)
12. Svetina, M.: The reaction times of drivers aged 20 to 80 during a divided attention driving. Traf. Inj. Prevention 17(8), 810–814 (2016)
13. Guo, M., Li, S., Wang, L., Chai, M., Chen, F., Wei, Y.: Research on the relationship between reaction ability and mental state for online assessment of driving fatigue. Int. J. Environ. Res. Public Health 13(12), 1174 (2016)
14. Kaewken, U.: Driving distraction effects on reaction time in simulated driving. PhD thesis (2016)
15. Hu, T.Y., Xie, X., Li, J.: Negative or positive? The effect of emotion and mood on risky driving. Transp. Res. Part F Traff. Psychol. Behav. 16, 29–40 (2013)
16. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
17. Goodfellow, I.J., Erhan, D., Carrier, P.L., Courville, A., Mirza, M., Hamner, B., Cukierski, W., Tang, Y., Thaler, D., Lee, D.H.: Challenges in representation learning: a report on three machine learning contests. In: International Conference on Neural Information Processing, Springer, pp. 117–124 (2013)
18. Lucey, P., Cohn, J.F., Kanade, T., Saragih, J., Ambadar, Z., Matthews, I.: The extended cohn-kanade dataset (ck+): a complete dataset for action unit and emotion-specified expression. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, IEEE, pp. 94–101 (2010)
19. Zhang, Z., Song, Y., Qi, H.: Age progression/regression by conditional adversarial autoencoder. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5810–5818 (2017)
20. Rau, R., Knipling, P.: A valid psychophysiological measure of alertness as assessed by psychomotor vigilance (2005)
21. Lee, S.J., Jo, J., Jung, H.G., Park, K.R., Kim, J.: Real-time gaze estimator based on driver’s head orientation for forward collision warning system. IEEE Trans. Intell. Transp. Syst. 12(1), 254–267 (2011)
22. Zhewen, T., Fengyu, L., Yadong, D., Hongwei, D., Shuai, L.: Study on the mathematics model for the forewarning system of automobile rear-end collision avoidance. School of Automobile Engineering, Wuhan (2009)
23. National Highway Traffic Safety Administration: Forward collision warning system confirmation test. Office of Vehicle Safety, Office of Crash Avoidance Standards, National Highway Traffic Safety Administration, Washington, DC (2013)
24. Seiffert, U., Wech, L.: Automotive safety handbook (2003)
JettSen: A Mobile Sensor Fusion Platform for City Knowledge Abstraction
Andres Rico, Yasushi Sakai, and Kent Larson
Massachusetts Institute of Technology, Cambridge, USA
{aricom,yasushis,kll}@mit.edu
Abstract. In recent years, mobility trends in cities around the world have been pushing for safer, greener, and more efficient transportation systems. This shift creates an opportunity for using mobile lightweight infrastructure, such as bicycles, as a generator of knowledge that will benefit commuters alongside the environmental and societal performance of cities. We propose a system architecture design for an open source mobile sensor fusion platform with a knowledge abstraction framework that enables citizens, urban planners, researchers, and city officials to better address the complex issues that are innate to cities. The system is mounted on a commercial electric assist bike and combines sensor input that describes the bicycle’s electro-mechanical, geospatial, and environmental states. The system proposes sensor flexibility and modularity as key characteristics, and the abstraction framework conceptualizes the way in which these characteristics can be best exploited for city improvement. We demonstrate the functionality of the system and framework through a use case implementation that clusters bike trip patterns using unsupervised learning techniques. This platform outlines a way to migrate focus from providing solutions to asking the right questions in order to satisfy citizens’ needs.
Keywords: Data systems · Sensor fusion · Knowledge abstraction · Machine learning

1 Introduction
In recent years, we have seen a clear shift in mobility patterns worldwide. Cities are becoming ever more interested in deploying systems that serve as a better mobility alternative than privately owned cars. Shared mobility systems like Uber, Lyft, Bird, Lime, and multiple bicycle sharing companies are attracting users at unprecedented rates [19]. These systems can sometimes complement mass transit. Studies show that utilization of bike sharing services also increases mass transit use [7], which will push cities to improve services for these public transportation options. In addition, ongoing research points to lightweight, electric, shared, and autonomous as the key characteristics for the mobility of the future [3,17,29]. This means that bicycles and e-bikes, both shared and privately owned, will increase in relevance in the upcoming years.

© Springer Nature Switzerland AG 2021. K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 773–791, 2021. https://doi.org/10.1007/978-3-030-63089-8_51

The increasing demand for these types of systems is a great opportunity for cities to develop decentralized mobile sensor networks. These networks can support intelligent decision making that positively impacts the development of infrastructure and communities and improves user experience in a way that was not possible until only a few years ago. City governments will be able to make commuting safer, healthier, and more sustainable while balancing privacy concerns [10].

In the realm of cycling and bicycle sharing systems, specifically, there are many ways in which mobile sensor devices have been used to improve a city’s and driver’s performance [22]. Accelerometer, ultrasound, and wind flow sensor data from bicycles has been used to improve user safety while driving [11,27]. Other approaches have added extra sensors and interfaces using smart helmets [28]. GPS location coordinates for stations and bicycles have been used by many to identify user behavior patterns [13] and to inform bike sharing companies about areas where stations are lacking, in need of maintenance or underutilized [2].

Although the idea of sensorizing bicycles has been extensively explored and implemented [30], there does not yet exist a system implementation that acts as a platform capable of enabling the fusion of multiple sensors. A platform with such characteristics would allow for faster and cheaper exploration of concepts and implementations that could bring continuous change for cities and users. Another important characteristic of bike related infrastructure is that it is reversible and easier to iterate on than heavy infrastructure [25]. These iterative qualities make bikes highly attractive for rapid testing of smart infrastructure within cities.
Sensor systems that cannot be quickly adapted to different needs are limited by the complex dynamics inherent to cities. Although mobile sensor networks are an attractive alternative for smart infrastructure, current deployments on bikes are based on systems designed with a specifically constrained problem in mind, and therefore lack the flexibility to be applied to a wider range of problems than the ones they were designed to solve. Flexibility in sensor systems enables a more humane exploration of improvements for cities, migrating focus from providing solutions to asking the right questions in order to satisfy citizens’ needs.

This paper proposes a system architecture design for an open source mobile sensor fusion system, specifically implemented on a prototype electric assist bicycle¹ manufactured by a large consumer electronics company. The system aims to offer a flexible platform that can enable citizens, companies and governments to extract information that can later be used within an iterative, citizen-centric decision making process.

¹ The bikes used in this study provide assist torque in addition to human pedalling. The bicycle does not drive solely from the motor unit.
The platform’s design incorporates an array of custom made and commercially available sensors that continuously save data containing a varied set of descriptors for a bike ride. The diversity of continuously recorded variables is what enables the system to act as a flexible platform for the development of numerous projects with varied goals.

The mobile sensor platform is accompanied by a knowledge abstraction framework that helps to exploit the flexible nature of the platform. The knowledge abstraction framework is a crucial pillar of the proposal because it gives a solid structure for understanding the ways in which the platform can have an impact at different scales and dimensions: community level, driver level or infrastructure level. When used together, the sensor system and the abstraction framework can empower citizens, researchers, urban planners, companies and governments to better explore urban necessities.

In order to test the system and the knowledge framework, a use case scenario which uses multiple components of the platform was constructed. Its goal is to identify distinct driving patterns that users can have on a bike with the use of well-known unsupervised machine learning techniques. In addition, it demonstrates the value of having a large set of sensors in helping to address questions that could not be well addressed or explored by only one sensor. The goal of the implementation is to understand different interventions that can be made to improve the experience of the driver as well as the performance of the city through the use of sensor fusion, localization and machine learning guided by the knowledge abstraction framework.

The rest of this publication is structured as follows: Sect. 2 gives details on the architecture and technical functions of the mobile sensor fusion system, capable of combining the input of four different sensor modules. Section 3 describes the knowledge abstraction framework of the system along with possible use cases that its built-in flexibility can enable. Section 4 presents a study carried out, using the proposed sensor system and abstraction framework, to cluster distinct driving patterns. Lastly, we discuss future uses for the system and future improvements to the platform.
2 Mobile Sensor Fusion Platform

2.1 System Architecture Overview
The mobile sensor fusion system was built as a platform with a modular nature to allow for multiple use case scenarios. The architecture was designed so that each type of sensor value can be accessed individually, which makes the system adaptable to various research projects or questions. The system is capable of continuously recording a total of 27 different variables.

Data acquisition is carried out on a commodity central computer (Raspberry Pi 3 B+) that also acts as a local storage unit for the data. The central unit communicates directly with four external sensor modules: (1) environmental sensor module, (2) camera module, (3) GPS module and (4) internal bicycle sensor. All modules use wired serial communication to transmit data to the central unit. The central unit is responsible for accessing the data of all four sensor modules, assigning timestamps, and merging and saving the data into three independent files. Figure 1 illustrates the different modules of the system as well as their data input and output interactions.

In order to collect data, the central unit starts running all sensor programs from start-up. Data logging is activated manually through a built-in button. When the button is pressed, three simultaneous data collection processes are activated. The first process collects values from the environmental sensor and the motor unit. The second process stores GPS values and the third process saves raw camera data. Each data process creates an individual file to store data, identifiable by a universally unique identifier (UUID). Each data instance is tagged with a timestamp to facilitate proper post processing of the data. The files are stored separately within the internal memory of the central unit and carry the same UUID as their name in order to match the different data sets that belong to a single trip, allowing for future fusion of data sets containing distinct sensor values. Each process runs individually in order to maintain a collection frequency that is favorable for each type of sensor module.
Fig. 1. Mobile Sensor Fusion System Architecture. Four sensor modules feed data into central unit. The central unit processes data, adds timestamps and generates three separate files labeled by a UUID. Data is later extracted and analyzed on an external computer.
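The per-process logging described above can be sketched as a small loop that writes one timestamped row per sample at a fixed rate. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `log_samples`, its parameters, and the column names in the usage note are hypothetical, and the real system would run three such loops as separate processes reading from actual sensor interfaces.

```python
import csv
import time
from typing import Callable, Sequence, TextIO


def log_samples(out: TextIO, header: Sequence[str],
                read_sample: Callable[[], Sequence[float]],
                rate_hz: float, n_samples: int,
                clock: Callable[[], float] = time.time,
                sleep: Callable[[float], None] = time.sleep) -> int:
    """Write n_samples timestamped rows to `out` at roughly `rate_hz`.

    `read_sample` stands in for a sensor read (hypothetical); `clock` and
    `sleep` are injectable so the loop can be exercised without real delays.
    """
    writer = csv.writer(out)
    writer.writerow(["timestamp", *header])
    period = 1.0 / rate_hz
    for _ in range(n_samples):
        # Tag every data instance with a timestamp so files can be fused later.
        writer.writerow([clock(), *read_sample()])
        sleep(period)
    return n_samples
```

For example, a bicycle and environmental logger could call `log_samples(f, ["torque", "speed"], read_fn, rate_hz=2.0, n_samples=...)`, while a GPS logger would use `rate_hz=1.0` in its own process.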
2.2 Variable Description and Definition
Variables collected by the system are grouped into three main description categories: (1) bicycle state variables, (2) environmental state variables and (3) geospatial state variables. The chosen classification gives a holistic description of a given bike ride, taking into account variables that may be of interest for distinct purposes and allowing for exploration of the relationship of a user with the bike, the environment and the city itself.

Bicycle State Variables. Bicycle state variables are variables that represent the current state of the ride. These variables describe the mechanical and electrical state of the bicycle and are directly related to the use of the bike itself. Most notably, the motor drive provides the “input torque”, which is a direct measure of the interaction of the user and the bike. Table 1 gives a description of each variable that is categorized as a bicycle state variable.

Table 1. Bicycle state variable description and units.

Variable                        Unit
Battery temperature             °C
Battery voltage                 mV
Output current                  mA
Remaining battery percentage    %
Remaining distance              0.1 km
Remaining time                  0.1 min
Input torque                    kgf
Rotation state                  Discrete
Crank RPM                       rpm
Power voltage                   mV
Motor temperature               °C
Motor duty                      Discrete
Motor RPM                       rpm
Motor speed                     km/h
Encoder count                   inc-dec
Bicycle speed                   km/h
Drive mode                      Discrete
X Axis acceleration             m/s²
Y Axis acceleration             m/s²
Z Axis acceleration             m/s²
Environmental State Variables. Environmental state variables are variables collected by a research oriented environmental sensor [23,26]. These variables describe the environment that surrounds the bicycle along the bike trip, mostly external factors related to the atmosphere around the bicycle. Table 2 lists the variables classified as environmental state variables along with their units.

Table 2. Environmental state variable description and units.

Variable       Unit
Light          lux
Temperature    °C
Humidity       Relative %
Pressure       % Pa
Geospatial State Variables. Geospatial state variables are the variables that describe the geolocation of the sensor system as well as the visual state (through a camera) of the bicycle. These variables are useful for locating and giving context to the bicycle state and environmental state variables as well as for mapping infrastructure that may be surrounding the bicycle at a given time within a trip. Table 3 lists the variables classified as geospatial state variables along with their units.

Table 3. Geospatial variable description and units.

Variable         Unit
Latitude         Decimal-Decimal
Longitude        Decimal-Decimal
Camera output    RGB values (0–255)
2.3 Component Description and Connections
This section describes the components used to build the sensor platform as well as the connectivity that exists between them. Figure 2 shows the physical implementation of the system and Fig. 3 shows the electric connections, highlighting power lines and data transmission lines. The next subsections give detailed information about the five main components of the system’s architecture.

Central Unit and Power Supply. Central collection and processing is done on a Raspberry Pi Model 3B+. The Raspberry Pi is connected to the rest of the modules through USB, UART and software serial communication. The system’s power is drawn directly from the e-bike’s battery pack, which operates at 26 V. The battery is connected to a DROK buck converter, a commercial voltage regulator that steps the 26 V from the motor battery down to the 5 V required by the Raspberry Pi and all sensor modules. The central unit, the motor unit, activation button, indicators and all sensor modules share the same digital ground reference.
GPS Module. The GPS board is a commercially available GPS breakout board, the Gowoops GPS Breakout Module, based on the GPS NEO-6M module. The board allows for direct communication through the use of software serial on the Raspberry Pi. The device is connected to power directly through the breakout pins and uses a regular GPIO pin on the Raspberry Pi to transmit unparsed NMEA strings to the central unit. The software process collecting the GPS data verifies that the GPS has a proper satellite fix, parses the NMEA strings and extracts the latitude and longitude variables expressed in Decimal-Decimal units. The variables can later be converted to Decimal-Degrees in order to make mapping and visualization easier in commonly used visualization programs.

Camera Module. A commercial Raspberry Pi Camera Module V2, based on Sony’s IMX219 8-megapixel sensor, is connected directly to the Raspberry Pi’s camera connector. The module was used to avoid adding multiple soldered connections to the system. The camera is placed on the bicycle’s front handlebar in order to capture the front facing view of the system².

Environmental Sensor Module. The environmental sensor is a custom research board based on the ESP WiFi modules (MIT terMITe) that is capable of measuring 3 axis accelerations, ambient temperature, relative humidity, light intensity, atmospheric pressure and infrared proximity [23,26]. The board is connected using a micro-USB cable. Communication is carried out using the hardware USB serial communication on the Raspberry Pi.

Internal Bicycle Sensor Module and Activation Button. The internal bicycle sensor module was custom developed by Panasonic Corporation and Panasonic Cycle Technology Corporation as part of an ongoing research collaboration with the authors of this publication. It is a unit that has control over all the bicycle state variables, excluding the x, y and z accelerations, which are recorded by the environmental sensor module. The motor unit is connected directly to the Raspberry Pi and uses UART communication to send data to the central unit. There is a constant two way communication between the central unit and the internal bicycle sensor module, because the module can receive commands from the central unit that are later passed directly to the motor in order to change and modulate electric assist. This two way communication is used by the activation button mounted on the handlebar to signal the beginning and ending of data logging. The system uses the data indicator (LED indicator) to signal to the user that data collection has begun or ended³.

² The camera module is not used in the use case implementation seen in Sect. 4 but has been installed for future work on the platform.
³ While the two way communication with the motor unit is only being used by the activation button, as the platform is implemented for different uses, the communication can become relevant to enhance the user’s interaction with the bicycle
Fig. 2. System Implementation. Installation of the mobile sensor fusion system on the prototype bicycle provided by Panasonic Cycle Technology. As the system components are placed at separate locations, cable lines run through the three main bike frame tubes (top, down and seat tubes) and connect directly to the central unit, placed in a case behind the bike’s seat.
Fig. 3. Power Supply and Data Transmission Connections. Diagram shows electric connections and flow of information for the system’s four sensor modules and the central unit. All devices have a common ground provided by the bicycle’s battery pack; ground connections are omitted from the diagram to make it simpler to read. Red lines represent power (V+) connections and blue lines are all data transmission lines.
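The GPS processing step described in the GPS Module paragraph (check for a fix, parse the NMEA string, extract latitude and longitude) can be sketched as follows. This is an illustrative sketch rather than the authors' parser: it assumes the receiver emits standard GGA sentences with coordinates in the usual NMEA (d)ddmm.mmmm format, the function names are hypothetical, and checksum validation is omitted for brevity.

```python
def nmea_to_decimal_degrees(value: str, hemisphere: str) -> float:
    """Convert an NMEA (d)ddmm.mmmm field plus N/S/E/W to signed decimal degrees."""
    raw = float(value)
    degrees = int(raw // 100)          # leading digits are whole degrees
    minutes = raw - degrees * 100      # remainder is minutes with a decimal fraction
    decimal = degrees + minutes / 60.0
    return -decimal if hemisphere in ("S", "W") else decimal


def parse_gga(sentence: str):
    """Return (latitude, longitude) from a GGA sentence, or None without a fix."""
    fields = sentence.split("*")[0].split(",")   # drop the checksum suffix
    if not fields[0].endswith("GGA") or len(fields) < 7:
        return None
    if fields[6] in ("", "0"):                   # fix quality 0 means no fix
        return None
    lat = nmea_to_decimal_degrees(fields[2], fields[3])
    lon = nmea_to_decimal_degrees(fields[4], fields[5])
    return lat, lon
```

Sentences without a fix are skipped instead of being logged with empty coordinates, which keeps the GPS file free of rows that would later have to be filtered out.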
2.4 Data Collection and Storage
The system allows for simultaneous collection of data points from a large set of modules. To deal with the challenge of storing distinct data units and accessing modules operating at different frequencies, data collection was divided into three independent processes. Timestamps are added to each data point in each process to make fusion of the data sets possible.

The first process accesses and saves data from the internal bicycle sensor and from the environmental sensor board. This process creates two new data points every second (f = 2 Hz), making it the fastest process of the three. This sampling frequency was chosen because the variables collected by this process have variations that are highly relevant in very short time spans. The collection frequency for this process can be increased if the research question being addressed requires it. The second parallel process collects data from the GPS unit. Data is collected at the average update rate of most commercial GPS boards, which is f = 1 Hz. The third process saves the data that comes from the camera module. The camera collects data at 24 fps and stores the raw RGB values of each captured frame.

Three separate CSV files are created every time the system is activated. Each of the files has the same name but is stored in a different directory. The names of the files are created using Python’s standard UUID library [8] in order to create unique file IDs that avoid naming conflicts in storage. The file name also includes a timestamp from the moment of activation and the specific name of the bike that is recording the data. The file naming system was built to enable the scalability of the sensor system by using a format that makes it easy to combine multiple sensor files built from multiple sensor sources coming from multiple bicycles.
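The naming scheme above can be illustrated with a short sketch. The directory names, the field order inside the file name, and the function name `trip_file_paths` are assumptions made for illustration; the text only states that the three files share one name built from a UUID, an activation timestamp and the bike name, and that they live in different directories.

```python
import uuid
from datetime import datetime
from pathlib import Path


def trip_file_paths(root: Path, bike_name: str, started: datetime) -> dict:
    """Build three paths that share one file name but sit in different directories."""
    trip_id = uuid.uuid4()  # random UUID from Python's standard library
    # Hypothetical layout: <bike>_<activation time>_<uuid>.csv
    name = f"{bike_name}_{started:%Y%m%dT%H%M%S}_{trip_id}.csv"
    # Hypothetical directory names, one per collection process.
    return {kind: root / kind / name for kind in ("bike_env", "gps", "camera")}
```

Because the name is identical across directories, the three files of one trip can be matched later by file name alone, even when files from several bicycles are pooled.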
3 JettSen Knowledge Abstraction Framework
This section presents a framework for understanding the value of a mobile platform that enables sensor fusion for the exploration of different research questions. Sensor fusion is the combination of sensors in order to obtain a data source that is more complete than individual sensor data collection. Existing sensor fusion studies have proposed a variety of methods, as well as ways to classify such systems [5]. In this publication, we consider what kind of system should be targeted for data acquisition in urban planning, specifically using e-bikes. In the case of a mobile sensor fusion system for bicycles, sensor fusion allows for a more holistic understanding of a specific bike trip and therefore enables the creation of models that can take into account a larger number of variables. A holistic data collection process is more likely to yield models that are more accurate and that approach reality in a less reductionist manner. Figure 1 shows the three data file outputs of the system. Processing and fusion of the data files is not done during the collection process in order to keep collection frequencies stable. Rather, it is done during the analysis and post processing of the data files.
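Post-collection fusion by timestamp can be sketched as follows. The text does not specify how rows from files recorded at different rates are aligned, so this example assumes one simple policy: each row of the faster file is paired with the most recent row of the slower file at or before its timestamp. The function name and the tuple layout are hypothetical.

```python
from bisect import bisect_right


def fuse_by_timestamp(fast_rows, slow_rows):
    """Attach to each fast row the latest slow row at or before its timestamp.

    Both inputs are lists of (timestamp, values) tuples sorted by timestamp,
    e.g. 2 Hz bicycle/environment rows and 1 Hz GPS rows.
    """
    slow_times = [t for t, _ in slow_rows]
    fused = []
    for t, values in fast_rows:
        i = bisect_right(slow_times, t) - 1      # index of last slow row with time <= t
        match = slow_rows[i][1] if i >= 0 else None
        fused.append((t, values, match))
    return fused
```

Rows recorded before the first GPS fix receive `None`, which makes the missing context explicit instead of borrowing a later position.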
The framework outlines how raw sensor data can be turned into valuable fusion knowledge, specifically in the context of bicycle trips and cities. To describe the knowledge that can be obtained from the system we use two different dimensions.

The first dimension describes the level of abstraction. The distinction between abstract and rigid knowledge is crucial in the context of cities because it helps to separate projects or goals that can have impact in the short term from those that can have impact in the long run, as well as projects that address technical questions from those that address societal ones. An example of rigid knowledge generation would be to map the exact spots of potholes in a given area so that governments have a detailed map of city infrastructure that needs maintenance. On the other hand, knowledge with a higher level of abstraction is needed for building consensus, such as master plans composed with community values. The fused sensor data may illustrate the behaviour of social demographics, which leads to value propositions that the community wants to push. This may involve discussions around data ownership and privacy. High levels of abstraction are an important factor when considering urban planning in conjunction with citizen participation.

The second dimension deals with the layer of the trip for which the knowledge is relevant. The three proposed layers are (1) the drive, (2) the community and (3) the infrastructure. As examples, a project that uses the system’s sensors to predict the driver’s likely next action would be on the drive layer. In contrast, a project that uses camera and GPS to map the types of lanes would be classified within the infrastructure layer. Figure 4 shows the relationship that different layers of abstraction can have with the different layers of the city, along with example uses placed on each one of the different dimensions.

Existing methods of data utilization in urban planning follow a Waterfall System [18], which has the drawback that a feedback system is not designed into the system itself. In the next section, we will consider how to create a model that progressively asks questions about urban planning issues.
4 Use Case: Clustering Driving Patterns
It can be highly beneficial to understand the different driving patterns of users in order to make bike rides and cities more comfortable, safe and goal adaptive. With this assumption in mind, a use case for the platform was developed with the aim of clustering distinct driving patterns shown on a bike trip. The goal of the implementation is (1) to demonstrate the value of sensor fusion within the context of a mobile sensor system for urban planning and (2) to show how the system, data and algorithm can yield results that are relevant at different levels of abstraction and at different trip layers, as illustrated in Fig. 4. Most implementations for identifying driving patterns and states rely on supervised machine learning techniques and commonly use accelerometer data as input for their models. In contrast, our implementation leverages a diverse set of sensors and unsupervised clustering algorithms that allow for a reduction of bias in the created models.
Fig. 4. Sensor variables Knowledge Impact. The diagram allows us to visualize the way in which different trip layers are related to the different levels of abstraction. The different layers of abstraction allow for exploration of highly specific and technical issues and their relationship with higher level values that the community might hold. Focusing on a specific area within the diagram can yield entirely different knowledge, even if the same data source and algorithm is used, as can be seen in Sect. 4.
Clustering algorithms, when used for anomaly and pattern recognition, leverage the reality that some patterns will not be regular and will in many ways be unknown and unpredictable [21]. For our implementation, we assume that the complexity surrounding cities is best addressed by algorithms that are not restricted to predefined classes like the ones commonly used in supervised machine learning techniques. We therefore chose to base our implementation, specifically, on the well-known K-means clustering algorithm.

4.1 Implementation and Input Matrix Description
The use case scenario was implemented in the following order. First, the mobile sensor fusion platform was installed on an electric bicycle prototype, the Jetter e-bicycle, provided by a large consumer electronics company. After the system installation, we collected data from 16 different bike trips. The data was stored locally on the bicycle and later extracted. Once the data was extracted, we experimented with different parameters for the clustering algorithm.

In order to show the different results along the abstraction gradient, we ran our clustering algorithm on three different matrix variations built from the same data source. The variations allow us to obtain different information from running the same algorithm on matrices that use different fusion variations. The first matrix only contained data describing the x axis acceleration of a single trip. The second matrix was built using data from the same single trip as the first matrix, but arrays describing the y and z axis accelerations, speed, torque, ambient temperature, light intensity, humidity and atmospheric pressure were added, yielding a matrix with 9 dimensions. Lastly, the third matrix contained the same variables as the second matrix but was sequentially concatenated with the data of 15 other trips, creating a matrix that contained 9 different dimensions and 16 different bike trips.

4.2 Clustering Algorithm Details
The K-means algorithm is one of the most accepted algorithms for creating unsupervised clusters in a multidimensional space [16]. For our use case we decided to use an unsupervised clustering algorithm because it allows us to reduce classification bias by keeping class label identification free of human error. For our K-means implementation, we took into account variables that can be found in the CSV file that mixes bicycle state data and environmental data. This allows us to create a data set that combines 9 different variables. The variables used are specified in Table 4 along with their units. We first access the individual sensor arrays and create a matrix that only contains the data arrays for the variables listed in Table 4.

Table 4. Selected variables for clustering analysis

Variable               Unit
X Axis acceleration    m/s²
Y Axis acceleration    m/s²
Z Axis acceleration    m/s²
Input torque           kgf
Bicycle speed          km/h
Light                  lux
Temperature            °C
Humidity               Relative %
Pressure               % Pa
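The construction of the three input matrices from the variables in Table 4 can be sketched with NumPy. The column names and the dictionary-of-arrays representation of a trip are hypothetical stand-ins for the CSV contents; only the resulting shapes ([n, 1], [n, 9] and [m, 9]) come from the text.

```python
import numpy as np

# Hypothetical column names for the nine variables of Table 4.
COLUMNS = ["acc_x", "acc_y", "acc_z", "torque", "speed",
           "light", "temperature", "humidity", "pressure"]


def trip_matrix(trip: dict, columns=COLUMNS) -> np.ndarray:
    """Stack the selected per-sample arrays of one trip into an [n, len(columns)] matrix."""
    return np.column_stack([np.asarray(trip[c], dtype=float) for c in columns])


def matrix_variations(trips: list):
    """Return the x-acceleration, single-trip fusion and multi-trip fusion matrices."""
    first = trips[0]
    x_only = trip_matrix(first, ["acc_x"])               # shape [n, 1]
    fusion = trip_matrix(first)                          # shape [n, 9]
    multi = np.vstack([trip_matrix(t) for t in trips])   # shape [m, 9], m = sum of n_i
    return x_only, fusion, multi
```

Stacking trips with `np.vstack` keeps the sample order of each trip intact, which matches the sequential concatenation described for the third matrix.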
After our variables of interest are isolated, we normalize their values so that all of them fall between 0 and 1. Normalization helps to balance the squared distances in distinct dimensions, which is highly relevant for minimizing the cost function of the K-means algorithm [24]. Once the data has been arranged and normalized, the algorithm is implemented using the scikit-learn Python library [14]. The library implements the algorithm with normal random initialization of centroids and looks to minimize the squared distance between points. It is well known that K-means can continuously reduce the sum of squared distances between points as more clusters are added. Nevertheless, if too many clusters are added, the probability of overfitting is increased. To avoid this, the optimal number of clusters was chosen through an elbow curve analysis. As can be seen in Fig. 5, iterating from one to 50 clusters gives an optimal number of clusters, k, of seven for the three matrix variations that were used.
Fig. 5. Normalized elbow method analysis for determination of optimal clustering parameters for each one of the matrix variations (x acceleration, fusion and multiple trip). Results indicate that seven is the optimal number of clusters for each one of the matrix variations. The point on each curve was chosen by analyzing the moment at which the error rate change decreases for the first time.
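The normalization and elbow sweep can be sketched as below. The paper uses scikit-learn for K-means; to stay self-contained, this sketch swaps in a plain NumPy version of Lloyd's algorithm, initialized from randomly chosen data points, so it should be read as an illustration of the procedure and not as the authors' code. All function names and default parameters are assumptions, and the rule used to pick the elbow is left to the reader.

```python
import numpy as np


def min_max_normalize(X: np.ndarray) -> np.ndarray:
    """Scale every column of X into [0, 1]; constant columns map to 0."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero
    return (X - lo) / span


def kmeans_inertia(X: np.ndarray, k: int, n_iter: int = 50, seed: int = 0) -> float:
    """Run Lloyd's algorithm from k random data points; return the sum of squared distances."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)           # assign each sample to its nearest centroid
        for j in range(k):
            members = X[labels == j]
            if len(members):                 # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())


def elbow_curve(X: np.ndarray, k_values) -> list:
    """Normalize X, then return the K-means cost for every k in k_values."""
    Xn = min_max_normalize(X)
    return [kmeans_inertia(Xn, k) for k in k_values]
```

Calling `elbow_curve(matrix, range(1, 51))` would reproduce the one-to-50 sweep; plotting the returned costs against k gives a curve of the kind shown in Fig. 5.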
4.3 Use Case Clustering Discussion
Single Trip x Acceleration Clustering. As previously explained, the first matrix variation only contains sequential data of the x axis acceleration for a single trip. The input matrix has a shape of [n, 1], where n represents the number of recorded samples. Figure 6 plots the x axis acceleration data against time; the color code represents the different clusters that were generated by the algorithm. Empirically, we can conclude that the algorithm manages to identify and cluster different acceleration intensities across time. We can see that high accelerations are less common and that the strongest group is centered around the average acceleration of the trip. Through the generation of definable clusters that detect intensities of acceleration, we demonstrate that using individual sensor data for a single trip can help to generate models that are able to respond to rigid issues. Such a model could be used, for example, to locate potholes, intersections or accidents in a city by combining the high value acceleration clusters with the GPS coordinates where they were saved.
Fig. 6. Single Trip X-Acceleration Clustering. The graph plots normalized acceleration data on the y axis and relative time (each unit represents 0.5 s, given the 2 Hz sampling rate) on the x axis. The K means algorithm is capable of effectively clustering different acceleration points.
Single Trip Fusion Data Clustering. For the next analysis, we turn to the second matrix variation. This matrix contains information for the same trip as the previous one-dimensional analysis alongside a wider range of variables, giving it a shape of [n, 9], where n is the number of recorded samples. The variables added to this matrix are the accelerations for the y and z axes, torque, bike speed, temperature, humidity, light intensity and atmospheric pressure. From Fig. 7, we can see that the clustering algorithm behaves in a more sequential way, meaning that clusters appear to be spread out horizontally over time, as opposed to Fig. 6 where clusters are spread out vertically. This sequentiality demonstrates the possibility of extracting information relevant to different states in a single bike trip. Clusters tend to describe different moments in the bike ride, meaning that questions revolving around driving patterns and driving states can be addressed.

Multiple Trip Fusion Data Clustering. The last matrix variation includes the same sensor data as the second variation. In addition to this data, 15 other trips were sequentially concatenated, meaning that the matrix includes information about 16 different trips and 9 different variables for each one of the trips. The matrix has a shape of [m, 9], where m is the sum of the vector [n1, n2, n3, ..., n16] and ni is the number of recorded samples for trip i. Figure 8 shows the results of running the clustering algorithm on the third variation matrix. We can observe that it maintains the sequential characteristics that the results in Fig. 7 show. Nevertheless, the results show a less granular or more abstract clustering pattern. Instead of clustering different states of a single trip, clusters seem to be dominated by entire trips.
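One way to put a number on the observation that clusters are dominated by entire trips is to measure, for each cluster, the share of its samples that come from a single trip. This check is not part of the paper; it is a hypothetical follow-up analysis, and the function names below are made up for illustration. It only needs the cluster label of each sample and the per-trip sample counts [n1, n2, ...].

```python
from collections import Counter


def trip_ids_from_lengths(lengths):
    """Expand [n1, n2, ...] into one trip index per concatenated sample."""
    return [trip for trip, n in enumerate(lengths) for _ in range(n)]


def cluster_trip_purity(labels, trip_ids):
    """For each cluster, the share of its samples that come from its most common trip."""
    per_cluster = {}
    for label, trip in zip(labels, trip_ids):
        per_cluster.setdefault(label, Counter())[trip] += 1
    return {label: max(c.values()) / sum(c.values())
            for label, c in per_cluster.items()}
```

Values close to 1.0 would indicate clusters that track whole trips, while lower values would indicate clusters that capture states shared across trips.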
Fig. 7. Single Trip Fusion Data Clustering. The graph plots normalized data for 3 axis acceleration, speed, torque, temperature, light, humidity and pressure on the y axis and relative time (each unit represents 0.5 ms) on the x axis. The K means algorithm is capable of effectively clustering different instances in a single trip in a sequential manner.
Fig. 8. Multiple Trip Fusion Data Clustering. The graph plots normalized data for 3 axis acceleration, speed, torque, temperature, light, humidity and pressure for 16 different trips on the y axis and relative time (fitted to start from zero) on the x axis. The K means algorithm is capable of effectively clustering different trips in a sequential manner.
Comparing the results from the second matrix and third matrix variations, we can show the value of sensor fusion for enabling an abstraction of different types of knowledge from the same data source and algorithm. Results from the second matrix variation speak about characteristics of different times within a bike trip while results from the third matrix speak about characteristics and differences between trips as a whole.
5 Discussion and Future Work
The combination of a mobile sensor fusion platform with a knowledge abstraction framework that embraces citizen and community input can become a powerful tool for the development of better cities. Our platform and framework give the flexibility that is needed to make cities more capable of managing their inherent complexity. In its simplest form, a project detecting potholes may only consider accelerometer readings. Yet for a city maintaining infrastructure, prioritizing which road to improve from a limited budget is nontrivial. By providing a platform with multiple sensors, we provide a way to incrementally accumulate knowledge for each step presented in the use case. Extracting insight from data has long been studied in the field of data mining, more formally under the term Knowledge Discovery in Databases [6]. For our use case, we chose clustering, an approach that is widely adopted to detect anomalies [4,9,20]. We built different clusters by using different portions of the same data source, illustrating different aspects of the data. The first step may show anomalies which could be tied to dangerous road conditions like potholes. This first type of knowledge can be used to define danger. By associating GPS data with this data, we will be able to know where these sensor readings occur. Yet it is too early to decide which occurrences have more impact. The second step gives context to the incidents and shows how pothole-like data influences the ride relative to the whole trip. We can now start to ask questions: Do we want to allocate maintenance efforts where people ride at high speed or near car traffic? The third step indicates which trips carry relative risk, leading to discussions on community preferences. The focus of the inquiries could shift to the time of the trip or the differences in the environment: Does the community place more weight on improving the bike riding experience during leisure time or when citizens ride their bikes aggressively?
At this point, the questions lean on community values to choose between improving recreation time or making rushed riding safer. In addition, for applications like urban planning, where interventions are interrelated with other issues [12], a platform that incrementally drives more questions is of great value. By questions we mean value judgments or moral perspectives such as those addressed in studies like [1]. This research examines a platform that not only acquires data, but also stacks up knowledge that leads to these conversations. The platform itself does not cover all the aspects that a city needs to take into account. For example, the city may need to use income levels to
balance equity. By designing a platform that bridges quantitative and qualitative data, we aim for data-oriented urban planning that avoids a reductionist approach imposing a single technical solution and instead provides options to explore possibilities, connecting other aspects that constrain human behavior, such as law enforcement and normative values [15]. Based on the discussion of results from the use case in Sect. 4, we point out the value of using the platform along with machine intelligence as tools for creating citizen consensus. Future work on the platform will be centered around making the system scalable to a larger number of bicycles; this will require improvements to the mechanical enclosures of the system as well as the electrical connection layouts. Scaling the system to more bikes will allow us to incorporate multi-user classification (classifying multiple trips for multiple users) and to further explore how different fusions and algorithms could help address key questions for enabling a more humane development of our cities.
Acknowledgments. The authors of this publication thank Life Solutions Company, Panasonic Corporation (Jin Yoshizawa, Nanako Yamasaki, Yoshio Shimbo) as well as Panasonic Cycle Technology Co., Ltd. (Hiroyuki Kamo) for the financial and technological support given for the development of this project, specifically the development and provision of the hackable micro unit (internal bicycle sensor), which acts as a pillar of the JettSen system, along with the Jetter e-bicycle, which was used for prototyping and testing the system.
References
1. Awad, E., Dsouza, S., Kim, R., Schulz, J., Henrich, J., Shariff, A., Bonnefon, J.-F., Rahwan, I.: The moral machine experiment. Nature 563(7729), 59–64 (2018)
2. Berke, A.: Income, race, bikes (2019). https://www.media.mit.edu/projects/income-race-bikes/overview/. Accessed 29 April 2020
3. Coretti, N., Pastor, L.A., Larson, K.: Autonomous bicycles: a new approach to bicycle-sharing systems. In: International IEEE Conference on Intelligent Transportation Systems IEEE ITSC (2020, Accepted)
4. Elbasiony, R.M., Sallam, E.A., Eltobely, T.E., Fahmy, M.M.: A hybrid network intrusion detection framework based on random forests and weighted k-means. Ain Shams Eng. J. 4(4), 753–762 (2013)
5. Elmenreich, W.: A review on system architectures for sensor fusion applications. In: IFIP International Workshop on Software Technologies for Embedded and Ubiquitous Systems, pp. 547–559. Springer (2007)
6. Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.: From data mining to knowledge discovery in databases. AI Mag. 17(3), 37–37 (1996)
7. Fishman, E., Washington, S., Haworth, N.: Bike share: a synthesis of the literature. Transp. Rev. 33(2), 148–165 (2013)
8. Python Software Foundation: uuid (2002–2020). https://docs.python.org/3/library/uuid.html#module-uuid. Accessed 10 Mar 2020
9. Gaddam, S.R., Phoha, V.V., Balagani, K.S.: K-means+ id3: a novel method for supervised anomaly detection by cascading k-means clustering and id3 decision tree learning methods. IEEE Trans. Knowl. Data Eng. 19(3), 345–354 (2007)
10. Green, B.: The Smart Enough City: Putting Technology in its Place to Reclaim our Urban Future. MIT Press, Cambridge (2019)
11. Gupta, P., Kumar, P.: Behavior study of bike driver and alert system using IoT and cloud. In: Proceedings of ICRIC 2019, Lecture Notes in Electrical Engineering, pp. 579–593 (2019)
12. Head, B.W., et al.: Wicked problems in public policy. Public Policy 3(2), 101 (2008)
13. Kiefer, C., Behrendt, F.: Smart e-bike monitoring system: real-time open source and open hardware GPS assistance and sensor data for electrically assisted bicycles. IET Intell. Transp. Syst. 10(2), 79–88 (2014)
14. Scikit-learn Developers: Clustering (2007–2019). https://scikit-learn.org/stable/modules/clustering.html. Accessed 27 April 2020
15. Lessig, L.: The new Chicago school. J. Legal Stud. 27(S2), 661–691 (1998)
16. Likas, A., Vlassis, N., Verbeek, J.J.: The global k-means clustering algorithm. Pattern Recogn. 36(2), 451–461 (2003)
17. Lin, M.: Persuasive electric vehicle (PEV) (2015–2020). https://www.media.mit.edu/projects/pev/overview/. Accessed 29 April 2020
18. Markin, M., Harris, C., Bernhardt, M., Austin, J., Bedworth, M., Greenway, P., Johnston, R., Little, A., Lowe, D.: Technology foresight on data fusion and data processing. Publication of The Royal Aeronautical Society (1997)
19. Millonig, A., Wunsch, M., Stibe, A., Seer, S., Dai, C., Schechtner, K., Chin, R.C.C.: Gamification and social dynamics behind corporate cycling campaigns. Transp. Res. Procedia 19, 33–39 (2016)
20. Muniyandi, A.P., Rajeswari, R., Rajaram, R.: Network anomaly detection by cascading k-means clustering and c4.5 decision tree algorithm. Procedia Engineering 30, 174–182 (2012)
21. Münz, G., Li, S., Carle, G.: Traffic anomaly detection using k-means clustering. In: GI/ITG Workshop MMBnet, pp. 13–14 (2007)
22. Namiot, D., Sneps-Sneppe, M.: On bikes in smart cities. Automat. Control Comput. Sci. 53(1), 67–71 (2019)
23. Nawyn, J., Smuts, C., Larson, K.: A visualization tool for reconstructing behavior patterns in built spaces. In: Proceedings of the 2017 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2017 ACM International Symposium on Wearable Computers, pp. 269–272 (2017)
24. Patel, V.R., Mehta, R.G.: Impact of outlier removal and normalization approach in modified k-means clustering algorithm. Int. J. Comput. Sci. Issues (IJCSI) 8(5), 331 (2011)
25. Sadik-Khan, J., Solomonow, S.: Streetfight: Handbook for an urban revolution. Penguin, New York (2017)
26. Smuts, C.: Termites (2018). http://termites.synthetic.space/. Accessed 20 April 2020
27. Swamy, U.B.M., Khuddus, A.: A Smart bike. In: 2019 1st International Conference on Advances in Information Technology (ICAIT), pp. 462–468 (2019)
28. Swathi, S.J., Raj, S., Devaraj, D.: Microcontroller and sensor based smart biking system for driver’s safety. In: 2019 IEEE International Conference on Intelligent Techniques in Control, Optimization and Signal Processing (INCOS), pp. 1–5 (2019)
29. Yao, J.W.-H.: IDK: An Interaction Development Kit to design interactions for lightweight autonomous vehicles. Master’s thesis, Massachusetts Institute of Technology (2019)
30. Zhang, Y., Chen, K., Yi, J.: Rider trunk and bicycle pose estimation with fusion of force/inertial sensors. IEEE Trans. Biomed. Eng. 60(9), 2541–2551 (2013)
No Jitter Please: Effects of Rotational and Positional Jitter on 3D Mid-Air Interaction
Anil Ufuk Batmaz, Mohammad Rajabi Seraji, Johanna Kneifel, and Wolfgang Stuerzlinger
Simon Fraser University, Vancouver, BC, Canada
{abatmaz,mrajabis,jkneifel,w.s}@sfu.ca
https://vvise.iat.sfu.ca
Abstract. Virtual Reality (VR) 3D tracking systems are susceptible to minor fluctuations in signal (jitter). In this study, we explored how different levels of jitter affect user performance for 3D pointing. We designed a Fitts’ Law experiment investigating target positional jitter and cursor rotational jitter at three different depth distances. Performance was negatively affected when up to ±0.5° rotational jitter was applied to the controller and up to ±0.375 cm positional jitter was applied to the target. At 2.25 m distance, user performance did not improve with decreasing positional jitter or rotational jitter compared to the no jitter condition. Our results can inform the design of 3D user interfaces, controllers, and interaction techniques in VR. Specifically, we suggest a focus on counteracting controller rotational jitter as this would globally increase performance for ray-based selection tasks.
Keywords: 3D pointing · Jitter · Input devices · Tracking devices · Fitts’ law
1 Introduction
Recent Virtual Reality (VR) applications designed for specific tasks, such as surgical training systems, typically require precise and accurate interaction between a user and the virtual environment (VE), including selection, positioning, and pointing tasks in 3D. However, such interaction might be negatively affected by jitter, which is defined as unintentional fluctuations in movement which overlap with the original information in the signal intended through the action of the user. When a signal is acquired by the sensors of a VR tracking device, such as an Inertial Measurement Unit, the data is affected by several noise sources, such as thermal, flicker, and coupled noise. When this data is transferred to the VR system, additional noise could be added in the transmission, e.g., due to slight delays. Similarly, the data received from optical sensors and cameras is also affected by the noise introduced by image processing.
© Springer Nature Switzerland AG 2021. K. Arai et al. (Eds.): FTC 2020, AISC 1289, pp. 792–808, 2021. https://doi.org/10.1007/978-3-030-63089-8_52
Moreover, when a user holds a controller, the data received by the VR system is affected by natural user behaviours, such as hand tremors, breathing, or body sway. For instance, hand tremor frequencies vary between 4 Hz and 12 Hz [1,15,19,33] in healthy humans, and this tremor can have detrimental effects on the tracking data. Specifically, these detrimental effects become more visible with increasing (depth) distances from the user. A 0.5° rotation at the controller can alter the position of a cursor by 0.65 cm at 75 cm, by 1.31 cm at 1.5 m, and by 1.96 cm at 2.25 m. Such small changes may also occur when a user selects a target through physically interacting with the controller, such as pulling a trigger. This kind of error is called the “Heisenberg effect” [9]. After the position and rotation data of the trackers are received by the VR system, they may be further processed to mitigate noise-related effects. Various filtering algorithms, such as the Extended Kalman Filter, e.g., [36], or the One-Euro filter [11], are frequently used to reduce signal noise in VR. However, such filters can introduce additional noise because of the phase shift caused by the filtering. Moreover, even after the filtering, the positional and rotational tracking data still exhibit fluctuations. Examples of rotational jitter and positional jitter are shown in Fig. 1(a) and Fig. 1(b), respectively. In these figures, the rotation of the cursor and the position of the target are expected to be at 0° and 0 cm, respectively. However, due to jitter, there is a notable deviation from the reference. Additionally, the figures also show substantial variation in the magnitude of the jitter. If we compare this with data for a 2D mouse on a desktop, there would be no visible jitter at this scale, due to a combination of substantially better sensors, surface friction, fewer degrees of freedom, and support for the hand holding the mouse.
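The link between a small rotation at the controller and the resulting cursor displacement at a given depth is basic trigonometry: displacement = distance × tan(angle). The short sketch below evaluates this for a 0.5° rotation at the three depth distances used in this study. The function name is ours and purely illustrative.

```python
import math

def cursor_offset_cm(distance_cm, rotation_deg):
    """Lateral cursor displacement caused by rotating a ray by rotation_deg."""
    return distance_cm * math.tan(math.radians(rotation_deg))

for distance_cm in (75, 150, 225):
    offset = cursor_offset_cm(distance_cm, 0.5)
    print(f"{distance_cm} cm -> {offset:.2f} cm")
# 75 cm -> 0.65 cm
# 150 cm -> 1.31 cm
# 225 cm -> 1.96 cm
```

For such small angles the displacement grows almost linearly with both the angle and the distance, which is why rotational jitter matters most for distal pointing.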
Previous work [31] has compared different input devices, including a VR controller and 2D mouse, but this topic is outside the scope of this project. In real-life VR systems, positional and rotational jitter can be found in all tracked objects, including the headset, the controllers, and other trackers¹, which all record the real-world position and rotation of the head, hands, or anything that the trackers are attached to, so that they can be used within the virtual environment. Both positional and rotational jitter have significant effects on VR system design. Especially for the design of novel VR input devices, jitter affects both user performance in the VE and the usability of the system. Recent work by Batmaz et al. [5] showed that the presence of jitter significantly decreases user performance for a novel pen-like input device. In their research, they also showed that pen-like controllers are affected by rotational jitter and hypothesized that user performance decreased due to that. The subjective results and the quantitative jitter data analysis for the input device supported their hypothesis. Thus, even though current hardware and software designs are improving in terms of decreasing jitter, research on the relationship between jitter and user performance allows our results to be used in system design and lets system
¹ A representative example of a current state-of-the-art tracker is the HTC VIVE system, https://www.vive.com/us/vive-tracker.
designers make more educated decisions on the various trade-offs they are faced with. Previous studies showed that user performance significantly decreases above ±0.5° rotational jitter [6]. Moreover, Batmaz and Stuerzlinger showed that using a second VR controller to perform the selection action, i.e., pressing a trigger button, does not mitigate the negative effects of the rotational jitter [7]. Here, we define positional jitter as the jitter that affects the 3D position of the target, and rotational jitter as the jitter that affects the 3D rotation of the VR input device. We chose to vary the target position, as jitter in the controller position has (relatively speaking) less effect on pointing. On the other hand, jitter in the controller rotation clearly affects pointing more than jitter in the target rotation [22]. With current VR controllers, the level of residual rotational jitter can easily be observed when pointing at distant objects, which has detrimental effects for distal pointing. Positional jitter is mostly observed in the position data of the trackers themselves, which is observable when the tracked device is static and/or if the user is trying to match real-world object positions with the virtual environment. With this work, we extend previous work on the effects of rotational jitter for targets at a single distance [6,7] by studying the effect of jitter on targets at different depth distances. Further, we also explore the effects of positional jitter and compare the effects of positional and rotational jitter. In this study, we investigate the following research questions: At which jitter level does user performance start to significantly decrease at different depth distances? And how much do different levels of jitter affect cursor positioning in VEs, in terms of time and throughput?
Research on the accuracy and precision of current state-of-the-art VR devices, e.g., [29], helps to identify new ways to improve the quality of the VR experience and to apply such innovations within new systems. We believe that the analysis of the effects of jitter on user performance presented here will inform the design of new input devices by manufacturers and decrease the adverse effects of tracking limitations on pointing precision and accuracy.
2 Previous Work
Here we review relevant previous work, including Fitts’ law, 3D selection methods for VR, and previous work on the effects of jitter.
2.1 Fitts’ Law
Fitts’ law [16] models human movement times for pointing. Equation 1 shows the Shannon formulation [23].

$$\text{Movement Time} = a + b \cdot \log_2\left(\frac{A}{W} + 1\right) = a + b \cdot ID \qquad (1)$$

In Eq. 1, a and b are empirical constants, typically identified by linear regression. A is the amplitude of the movement, which is the distance between two
Fig. 1. An example of (a) cursor jitter and (b) target jitter. For measuring cursor jitter, a user pointed the controller at a (distant) target in a VE. For target jitter, the HTC VIVE controller was placed on a table. (c) Experimental virtual environment.
targets, and W the target width. The logarithmic term in Eq. 1 represents the task difficulty and is called the index of difficulty, ID. We also use throughput (based on effective measures), as defined in ISO 9241-411:2012 [20]:

$$\text{Throughput} = \frac{ID_e}{\text{Movement Time}} \qquad (2)$$

In Eq. 2, movement time is the time between initiation of the movement and the selection of the target. The effective index of difficulty ($ID_e$) incorporates the user accuracy in the task [20]:

$$ID_e = \log_2\left(\frac{A_e}{W_e} + 1\right) \qquad (3)$$

In Eq. 3, $A_e$ represents the effective distance, the actual movement distance to the target position, and $W_e$ is the effective target width, the distribution of selection coordinates, calculated as $W_e = 4.133 \times SD_x$, where $SD_x$ is the standard deviation of selection coordinates along the task axis. $SD_x$ represents the precision of the task performance [24,25].
2.2 3D Pointing in Virtual Environments
Pointing is a fundamental task for users interacting with an environment [14]. Various studies in the literature have explored pointing tasks, e.g., in real life or on 2D desktops. However, 3D pointing in VEs is relatively more complex and less explored compared to other pointing tasks. A recent survey reviewed 3D pointing and investigated various devices and approaches [2]. Different mid-air selection methods have also been evaluated, e.g., [10,25].
2.3 Ray Casting
While selection with a virtual hand metaphor is easy in VR, it is challenging to select targets that are further away with this technique [22]. For the selection of a distant object, ray casting is the preferred interaction technique in many VR systems [14]. Still, as it requires accurate pointing, ray casting does not perform well for small and/or distant targets [32], similar to how a laser pointer behaves in the real world. Usually, a virtual ray is shown between the pointing device and the cursor position on the respective intersected surface of the virtual environment to facilitate keeping track of the pointing direction and to increase the visibility of the cursor [14].
2.4 Selection Method
To select an object in VR, the user has to interact with the system to activate the corresponding selection action. If that action is communicated by physical
interaction, such as pulling a trigger or pushing a button, this can affect the cursor position or ray rotation, and an error called the “Heisenberg effect” of spatial interaction [9] can occur. Especially for distant target selection, ray casting is prone to this effect, since the smallest noise at the origin of the ray is magnified at larger distances [6]. To reduce the Heisenberg effect, previous studies, e.g., [7,34], proposed to use asymmetric bi-manual interaction, where the user points with the dominant hand and presses the button to select with the non-dominant hand.
2.5 3D Tracking Noise in VR
While jitter and how it affects a signal has been studied in many domains, to our knowledge, how jitter affects user performance during 3D pointing tasks in VR has not been studied in detail. Previous work on rotational jitter showed that user performance significantly decreases with ±0.5° of jitter [6]. In this study, the authors used a Fitts’ task with a constant ray length, but previous studies showed that user performance differs between infinite and fixed ray lengths [8]. Batmaz and Stuerzlinger also explored White Gaussian Noise rotational jitter and tried to reduce the negative effects of jitter by using a second controller [7] to avoid the “Heisenberg effect” upon the button press [9]. However, interestingly, using a second controller did not decrease the effects of rotational jitter on pointing. Previous work on positional jitter in 2D positioning tasks with a mouse showed that 0.3 mm of positional jitter did not affect user performance [35]. Yet, larger levels of positional jitter significantly reduced user performance for smaller targets [30].
3 Motivation and Hypotheses
Previous work showed that 0.5° rotational jitter significantly reduces user performance, even when the distance between target and user is as small as 50 cm [6,7]. These studies did not investigate target jitter, i.e., signal fluctuations on a tracker attached to an object in the real world and represented as a virtual object in the VE. Since the user performance in VR is significantly affected by stereo display deficiencies, e.g., through the vergence and accommodation conflict [3,4], how such jitter affects user performance at different depth distances still needs to be investigated to guide both practitioners and developers. Based on these results, we formulated the following hypotheses:
H-1 When the distance between user and target increases, user performance significantly decreases above 0.5° rotational jitter for larger depth distances.
H-2 Similar to rotational jitter, user performance significantly decreases with increased target jitter in VR. Moreover, this detrimental effect is larger when the depth distance increases.
4 User Study
To investigate the above-mentioned hypotheses, we designed a user study as follows.
4.1 Participants
Eighteen participants (ten female, eight male) with ages ranging from 21 to 33 (mean 26 ± 4.16) took part in the experiment. All participants were right-handed. While most reported that their dominant eye is the right one, one of them was left-eye dominant. Sixteen of them indicated previous experience with VR environments. However, the majority of users (thirteen of them) reported using VR devices and environments fewer than four times a month and only three of them reported six times or more. Fourteen participants played computer games and/or used 3D CAD systems 0–5 h/week, and four of them 5–10 h/week.
4.2 Apparatus
We used a PC with an Intel(R) Core(TM) i7-4790 CPU @ 3.60 GHz, 16 GB of DDR4 RAM, and an NVIDIA GeForce(R) GTX 1080 Ti graphics card. We used an HTC Vive Pro with two V2 Lighthouses and two HTC Vive Pro controllers as input devices.
4.3 Procedure
After completing an informed consent form, participants first filled out a demographic questionnaire. The researchers then briefed the participants by explaining the tasks. To assess pointing performance in 3D, we used an ISO 9241-411 task [20]. To get used to the VR system and environment, subjects were allowed to practice the task before beginning trials. For the study, participants were asked to select targets as quickly and precisely as possible. After completing the tasks, participants filled out a post-questionnaire about their perceived pointing speed and accuracy with each condition and their preferences. The study lasted about 40 min. Before each task, participants were asked to fixate on a cross at eye level, which ensured that the targets would appear at a comfortable, yet consistent position. The targets appeared as grey spheres arranged in a circular pattern at the eye level of the subjects (Fig. 1 (c)). Participants were asked to point at the targets with the pointer ray emanating from the right controller and to click the trigger of the left controller to select a target, eliminating any potential “Heisenberg effect” [9]. When the cursor intersected the target, the target color changed to green. If the user selected the target while it was green, we recorded a successful “hit”. If the user “missed” the target, the target turned red and an error sound was played to ensure adequate feedback.
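Pointing performance in an ISO 9241-411 task is commonly summarised by the effective throughput of Eqs. 2 and 3. The following minimal sketch shows that computation for one block of selections. The trial values are invented, and the aggregation choices (mean movement distance for A_e, sample standard deviation for SD_x, mean movement time) are assumptions of this sketch rather than details reported in this paper.

```python
import math
import statistics

def index_of_difficulty(amplitude, width):
    """Shannon formulation of Eq. 1: ID = log2(A / W + 1), in bits."""
    return math.log2(amplitude / width + 1)

def effective_throughput(movement_distances, deviations, movement_times_s):
    """Effective throughput in bit/s for one block of selections.

    movement_distances: actual distance moved per trial (used for A_e)
    deviations: signed selection offsets along the task axis (used for SD_x)
    movement_times_s: movement time per trial, in seconds
    """
    a_e = statistics.mean(movement_distances)
    w_e = 4.133 * statistics.stdev(deviations)  # W_e = 4.133 * SD_x
    id_e = math.log2(a_e / w_e + 1)             # Eq. 3
    return id_e / statistics.mean(movement_times_s)  # Eq. 2

# Hypothetical block of five selections (units: cm and seconds).
distances = [19.6, 20.3, 20.1, 19.8, 20.2]
offsets = [-0.3, 0.2, 0.1, -0.2, 0.4]
times = [0.92, 1.05, 0.88, 1.10, 0.97]

print(f"nominal ID: {index_of_difficulty(20, 2.5):.2f} bits")
print(f"throughput: {effective_throughput(distances, offsets, times):.2f} bit/s")
```

Because W_e is derived from the observed spread of selections, a participant who is fast but imprecise is penalised through a lower effective index of difficulty.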
We selected our Target Distance (TD3) and Target Size (TS2) conditions based on previous work [4,7]. For the closest depth distance, we chose 0.75 m, since it is just beyond the edge of peri-personal space, i.e., the user could not reach the targets with a virtual hand. Other depth distances were chosen as linear increments of 0.75 m [17]. We artificially added ±0.5° and ±1° of rotational jitter to the starting point of the ray from the controller. Similarly, we added either ±0.375 cm of artificial positional jitter to the target position, which is 1/4 of the first target size (TS1/4), or ±0.625 cm, which is 1/4 of the second target size (TS2/4). All artificial jitter was generated with the Marsaglia Polar Method [27] as White Gaussian Noise and applied to all three dimensions. For rotational jitter, we artificially added noise to all three Euler axes of the VR controller rotation data received from the software. Analogously, we added artificial noise to the position of virtual targets along all three coordinate axes for positional jitter.
4.4 Experimental Design
The 18 participants selected 11 targets in 27 experimental conditions: three levels of positional Target Jitter (TJ3: 0, ±TS1/4, and ±TS2/4), three levels of rotational Cursor Jitter (CJ3: 0, ±0.5°, and ±1°), and three Depth Distances (DD3: 0.75, 1.5 and 2.25 m) in a TJ3 × CJ3 × DD3 within-subject design. We counterbalanced Target and Cursor Jitter conditions across the experiment. The Depth Distance condition was counterbalanced across participants. As is common in Fitts’ law experiments, and to enable us to analyze internal validity, we also varied the task difficulty ID by using three Target Distances (TD3: 10, 20, and 30 cm) and two Target Sizes (TS2: 1.5 and 2.5 cm), which means we evaluated 6 unique IDs between 1.94 and 4. Subjects’ movement time (ms), error rate (%), and (effective) throughput (bit/s) were measured as dependent variables. In total, each subject performed TJ3 × CJ3 × DD3 × ID6 × 11 repetitions, corresponding to a total of 1782 trials.
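The artificial jitter used in these conditions was generated as White Gaussian Noise with the Marsaglia polar method. A minimal sketch of that method is given below. How each nominal jitter level (e.g., ±0.5° or ±0.375 cm) maps onto the noise distribution is not specified here, so treating the level as the standard deviation of the noise is an assumption of this sketch, and the pose values are hypothetical.

```python
import math
import random

def marsaglia_polar(rng):
    """Return a pair of independent standard-normal samples."""
    while True:
        u = rng.uniform(-1.0, 1.0)
        v = rng.uniform(-1.0, 1.0)
        s = u * u + v * v
        if 0.0 < s < 1.0:  # reject points outside the unit circle (and the origin)
            factor = math.sqrt(-2.0 * math.log(s) / s)
            return u * factor, v * factor

def jitter_vector(rng, scale):
    """Three-axis Gaussian noise; `scale` is the ASSUMED standard deviation."""
    x, y = marsaglia_polar(rng)
    z, _ = marsaglia_polar(rng)
    return (x * scale, y * scale, z * scale)

rng = random.Random(42)

# Rotational (cursor) jitter: perturb the controller's Euler angles, in degrees.
controller_euler_deg = (10.0, 20.0, 0.0)   # hypothetical controller rotation
noisy_euler = tuple(a + n for a, n in
                    zip(controller_euler_deg, jitter_vector(rng, 0.5)))

# Positional (target) jitter: perturb the target position, in centimetres.
target_position_cm = (0.0, 150.0, 75.0)    # hypothetical target position
noisy_target = tuple(p + n for p, n in
                     zip(target_position_cm, jitter_vector(rng, 0.375)))

print(noisy_euler)
print(noisy_target)
```

In a VR application, fresh noise of this kind would be drawn every frame, so that the perturbation is uncorrelated over time, as white noise requires.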
5 Data Analysis
The results were analyzed using three-way repeated measures (RM) ANOVA with α = 0.05 in SPSS 24. For the normality analysis, we used Skewness and Kurtosis and, based on results from previous work [18,26], considered the data as normally distributed when Skewness and Kurtosis values were within ±1.5. We used the Sidak method for post-hoc analyses. We only report significant results here. Results are illustrated with *** for p < 0.001, ** for p < 0.01, and * for p < 0.05 in figures. One-way RM ANOVA results are shown in Table 1. We first present the results for the main factors and then mention interactions from the three-way RM ANOVA.
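The normality criterion described above can be sketched in a few lines. Note that this sketch uses plain population moments, whereas SPSS reports bias-corrected sample estimates, so values can differ slightly for small samples. The function names and the example data are ours and purely illustrative.

```python
import statistics

def skewness(values):
    """Population skewness (third standardised moment)."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def excess_kurtosis(values):
    """Population kurtosis minus 3, so a normal distribution scores 0."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 4 for v in values) / len(values) - 3.0

def roughly_normal(values, limit=1.5):
    """Apply the rule of thumb: |skewness| and |kurtosis| within the limit."""
    return abs(skewness(values)) <= limit and abs(excess_kurtosis(values)) <= limit

# Hypothetical movement times in milliseconds.
symmetric = [900, 950, 1000, 1050, 1100]
skewed = [900] * 9 + [4000]

print(roughly_normal(symmetric))  # True
print(roughly_normal(skewed))     # False
```

Data that fail such a screen are usually transformed (for example, log-transformed) before a parametric test is applied.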
Table 1. One-Way RM ANOVA Results
5.1
Depth distance
Cursor jitter
Target jitter
ID
Time
F(2,34)=17.085 p