Intelligent Computing: Proceedings of the 2020 Computing Conference, Volume 1 [1st ed.] 9783030522483, 9783030522490

This book focuses on the core areas of computing and their applications in the real world, presenting papers from the Computing Conference 2020.


English · Pages XI, 829 [841] · Year 2020


Table of contents:
Front Matter ....Pages i-xi
Demonstrating Advanced Machine Learning and Neuromorphic Computing Using IBM’s NS16e (Mark Barnell, Courtney Raymond, Matthew Wilson, Darrek Isereau, Eric Cote, Dan Brown et al.)....Pages 1-11
Energy Efficient Resource Utilization: Architecture for Enterprise Network (Dilawar Ali, Fawad Riasat Raja, Muhammad Asjad Saleem)....Pages 12-27
Performance Evaluation of MPI vs. Apache Spark for Condition Based Maintenance Data (Tomasz Haupt, Bohumir Jelinek, Angela Card, Gregory Henley)....Pages 28-41
Comparison of Embedded Linux Development Tools for the WiiPiiDo Distribution Development (Diogo Duarte, Sérgio Silva, João M. Rodrigues, Salviano Pinto Soares, António Valente)....Pages 42-53
FERA: A Framework for Critical Assessment of Execution Monitoring Based Approaches for Finding Concurrency Bugs (Jasmin Jahić, Thomas Bauer, Thomas Kuhn, Norbert Wehn, Pablo Oliveira Antonino)....Pages 54-74
A Top-Down Three-Way Merge Algorithm for HTML/XML Documents (Anastasios G. Bakaoukas, Nikolaos G. Bakaoukas)....Pages 75-96
Traceability Framework for Requirement Artefacts (Foziah Gazzawe, Russell Lock, Christian Dawson)....Pages 97-109
Haptic Data Accelerated Prediction via Multicore Implementation (Pasquale De Luca, Andrea Formisano)....Pages 110-121
Finding the Maximal Independent Sets of a Graph Including the Maximum Using a Multivariable Continuous Polynomial Objective Optimization Formulation (Maher Heal, Jingpeng Li)....Pages 122-136
Numerical Method of Synthesized Control for Solution of the Optimal Control Problem (Askhat Diveev)....Pages 137-156
Multidatabase Location Based Services (MLBS) (Romani Farid Ibrahim)....Pages 157-168
wiseCIO: Web-Based Intelligent Services Engaging Cloud Intelligence Outlet (Sheldon Liang, Kimberly Lebby, Peter McCarthy)....Pages 169-195
A Flexible Hybrid Approach to Data Replication in Distributed Systems (Syed Mohtashim Abbas Bokhari, Oliver Theel)....Pages 196-207
A Heuristic for Efficient Reduction in Hidden Layer Combinations for Feedforward Neural Networks (Wei Hao Khoong)....Pages 208-218
Personalized Recommender Systems with Multi-source Data (Yili Wang, Tong Wu, Fei Ma, Shengxin Zhu)....Pages 219-233
Renormalization Approach to the Task of Determining the Number of Topics in Topic Modeling (Sergei Koltcov, Vera Ignatenko)....Pages 234-247
Strategic Inference in Adversarial Encounters Using Graph Matching (D. Michael Franklin)....Pages 248-262
Machine Learning for Offensive Security: Sandbox Classification Using Decision Trees and Artificial Neural Networks (Will Pearce, Nick Landers, Nancy Fulda)....Pages 263-280
Time Series Analysis of Financial Statements for Default Modelling (Kirill Romanyuk, Yuri Ichkitidze)....Pages 281-286
Fraud Detection Using Sequential Patterns from Credit Card Operations (Addisson Salazar, Gonzalo Safont, Luis Vergara)....Pages 287-296
Retention Prediction in Sandbox Games with Bipartite Tensor Factorization (Rafet Sifa, Michael Fedell, Nathan Franklin, Diego Klabjan, Shiva Ram, Arpan Venugopal et al.)....Pages 297-308
Data Analytics of Student Learning Outcomes Using Abet Course Files (Hosam Hasan Alhakami, Baker Ahmed Al-Masabi, Tahani Mohammad Alsubait)....Pages 309-325
Modelling the Currency Exchange Rates Using Support Vector Regression (Ezgi Deniz Ülker, Sadik Ülker)....Pages 326-333
Data Augmentation and Clustering for Vehicle Make/Model Classification (Mohamed Nafzi, Michael Brauckmann, Tobias Glasmachers)....Pages 334-346
A Hybrid Recommender System Combing Singular Value Decomposition and Linear Mixed Model (Tianyu Zuo, Shenxin Zhu, Jian Lu)....Pages 347-362
Data Market Implementation to Match Retail Customer Buying Versus Social Media Activity (Anton Ivaschenko, Anastasia Stolbova, Oleg Golovnin)....Pages 363-372
A Study of Modeling Techniques for Prediction of Wine Quality (Ashley Laughter, Safwan Omari)....Pages 373-399
Quantifying Apparent Strain for Automatic Modelling, Simulation, Compensation and Classification in Structural Health Monitoring (Enoch A-iyeh)....Pages 400-415
A New Approach to Supervised Data Analysis in Embedded Systems Environments: A Case Study (Pamela E. Godoy-Trujillo, Paul D. Rosero-Montalvo, Luis E. Suárez-Zambrano, Diego H. Peluffo-Ordoñez, E. J. Revelo-Fuelagán)....Pages 416-425
Smart Cities: Using Gamification and Emotion Detection to Improve Citizens Well Fair and Commitment (Manuel Rodrigues, Ricardo Machado, Ricardo Costa, Sérgio Gonçalves)....Pages 426-442
Towards a Smart Interface-Based Automated Learning Environment Through Social Media for Disaster Management and Smart Disaster Education (Zair Bouzidi, Abdelmalek Boudries, Mourad Amad)....Pages 443-468
Is Social Media Still “Social”? (Chan Eang Teng, Tang Mui Joo)....Pages 469-490
Social Media: Influences and Impacts on Culture (Mui Joo Tang, Eang Teng Chan)....Pages 491-501
Cost of Dietary Data Acquisition with Smart Group Catering (Jiapeng Dong, Pengju Wang, Weiqiang Sun)....Pages 502-520
Social Engineering Defense Mechanisms: A Taxonomy and a Survey of Employees’ Awareness Level (Dalal N. Alharthi, Amelia C. Regan)....Pages 521-541
How Information System Project Stakeholders Perceive Project Success (Iwona Kolasa, Dagmara Modrzejewska)....Pages 542-554
Fuzzy Logic Based Adaptive Innovation Model (Bushra Naeem, Bilal Shabbir, Juliza Jamaludin)....Pages 555-565
A Review of Age Estimation Research to Evaluate Its Inclusion in Automated Child Pornography Detection (Lee MacLeod, David King, Euan Dempster)....Pages 566-580
A Comprehensive Survey and Analysis on Path Planning Algorithms and Heuristic Functions (Bin Yan, Tianxiang Chen, Xiaohui Zhu, Yong Yue, Bing Xu, Kai Shi)....Pages 581-598
Computational Conformal Mapping in Education and Engineering Practice (Maqsood A. Chaudhry)....Pages 599-608
Pilot Study of ICT Compliance Index Model to Measure the Readiness of Information System (IS) at Public Sector in Malaysia (Mohamad Nor Hassan, Aziz Deraman)....Pages 609-628
Preliminary Experiments on the Use of Nonlinear Programming for Indoor Localization (Stefania Monica, Federico Bergenti)....Pages 629-644
Improved Deterministic Broadcasting for Multiple Access Channels (Bader A. Aldawsari, J. Haadi Jafarian)....Pages 645-660
Equivalent Thermal Conductivity of Metallic-Wire for On-Line Monitoring of Power Cables (M. S. Al-Saud)....Pages 661-672
A Novel Speed Estimation Algorithm for Mobile UE’s in 5G mmWave Networks (Alawi Alattas, Yogachandran Rahulamathavan, Ahmet Kondoz)....Pages 673-684
In-App Activity Recognition from Wi-Fi Encrypted Traffic (Madushi H. Pathmaperuma, Yogachandran Rahulamathavan, Safak Dogan, Ahmet M. Kondoz)....Pages 685-697
A Novel Routing Based on OLSR for NDN-MANET (Xian Guo, Shengya Yang, Laicheng Cao, Jing Wang, Yongbo Jiang)....Pages 698-714
A Comparative Study of Active and Passive Learning Approaches in Hybrid Learning, Undergraduate, Educational Programs (Khalid Baba, Nicolas Cheimanoff, Nour-eddine El Faddouli)....Pages 715-725
Mobile Learning Adoption at a Science Museum (Ruel Welch, Temitope Alade, Lynn Nichol)....Pages 726-745
Conceptualizing Technology-Enhanced Learning Constructs: A Journey of Seeking Knowledge Using Literature-Based Discovery (Amalia Rahmah, Harry B. Santoso, Zainal A. Hasibuan)....Pages 746-759
Random Sampling Effects on e-Learners Cluster Sizes Using Clustering Algorithms (Muna Al Fanah)....Pages 760-773
Jupyter-Notebook: A Digital Signal Processing Course Enriched Through the Octave Programming Language (Arturo Zúñiga-López, Carlos Avilés-Cruz, Andrés Ferreyra-Ramírez, Eduardo Rodríguez-Martínez)....Pages 774-784
A Novel Yardstick of Learning Time Spent in a Programming Language by Unpacking Bloom’s Taxonomy (Alcides Bernardo Tello, Ying-Tien Wu, Tom Perry, Xu Yu-Pei)....Pages 785-794
Assessing and Development of Chemical Intelligence Through e-Learning Tools (E. V. Volkova)....Pages 795-805
Injecting Challenge or Competition in a Learning Activity for Kindergarten/Primary School Students (Bah Tee Eng, Insu Song, Chaw Suu Htet Nwe, Tian Liang Yi)....Pages 806-826
Back Matter ....Pages 827-829

Advances in Intelligent Systems and Computing 1228

Kohei Arai · Supriya Kapoor · Rahul Bhatia (Editors)

Intelligent Computing Proceedings of the 2020 Computing Conference, Volume 1

Advances in Intelligent Systems and Computing Volume 1228

Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

Advisory Editors
Nikhil R. Pal, Indian Statistical Institute, Kolkata, India
Rafael Bello Perez, Faculty of Mathematics, Physics and Computing, Universidad Central de Las Villas, Santa Clara, Cuba
Emilio S. Corchado, University of Salamanca, Salamanca, Spain
Hani Hagras, School of Computer Science and Electronic Engineering, University of Essex, Colchester, UK
László T. Kóczy, Department of Automation, Széchenyi István University, Gyor, Hungary
Vladik Kreinovich, Department of Computer Science, University of Texas at El Paso, El Paso, TX, USA
Chin-Teng Lin, Department of Electrical Engineering, National Chiao Tung University, Hsinchu, Taiwan
Jie Lu, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW, Australia
Patricia Melin, Graduate Program of Computer Science, Tijuana Institute of Technology, Tijuana, Mexico
Nadia Nedjah, Department of Electronics Engineering, University of Rio de Janeiro, Rio de Janeiro, Brazil
Ngoc Thanh Nguyen, Faculty of Computer Science and Management, Wrocław University of Technology, Wrocław, Poland
Jun Wang, Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong

The series “Advances in Intelligent Systems and Computing” contains publications on theory, applications, and design methods of Intelligent Systems and Intelligent Computing. Virtually all disciplines such as engineering, natural sciences, computer and information science, ICT, economics, business, e-commerce, environment, healthcare, life science are covered. The list of topics spans all the areas of modern intelligent systems and computing such as: computational intelligence, soft computing including neural networks, fuzzy systems, evolutionary computing and the fusion of these paradigms, social intelligence, ambient intelligence, computational neuroscience, artificial life, virtual worlds and society, cognitive science and systems, Perception and Vision, DNA and immune based systems, self-organizing and adaptive systems, e-Learning and teaching, human-centered and human-centric computing, recommender systems, intelligent control, robotics and mechatronics including human-machine teaming, knowledge-based paradigms, learning paradigms, machine ethics, intelligent data analysis, knowledge management, intelligent agents, intelligent decision making and support, intelligent network security, trust management, interactive entertainment, Web intelligence and multimedia.

The publications within “Advances in Intelligent Systems and Computing” are primarily proceedings of important conferences, symposia and congresses. They cover significant recent developments in the field, both of a foundational and applicable character. An important characteristic feature of the series is the short publication time and world-wide distribution. This permits a rapid and broad dissemination of research results.

** Indexing: The books of this series are submitted to ISI Proceedings, EI-Compendex, DBLP, SCOPUS, Google Scholar and Springerlink **

More information about this series at http://www.springer.com/series/11156

Kohei Arai · Supriya Kapoor · Rahul Bhatia
Editors

Intelligent Computing Proceedings of the 2020 Computing Conference, Volume 1


Editors

Kohei Arai
Saga University
Saga, Japan

Supriya Kapoor
The Science and Information (SAI) Organization
Bradford, West Yorkshire, UK

Rahul Bhatia
The Science and Information (SAI) Organization
Bradford, West Yorkshire, UK

ISSN 2194-5357  ISSN 2194-5365 (electronic)
Advances in Intelligent Systems and Computing
ISBN 978-3-030-52248-3  ISBN 978-3-030-52249-0 (eBook)
https://doi.org/10.1007/978-3-030-52249-0

© Springer Nature Switzerland AG 2020
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Editor’s Preface

On behalf of the Committee, we welcome you to the Computing Conference 2020. The aim of this conference is to give a platform to researchers with fundamental contributions and to be a premier venue for industry practitioners to share and report on up-to-the-minute innovations and developments, to summarize the state of the art and to exchange ideas and advances in all aspects of computer sciences and its applications.

For this edition of the conference, we received 514 submissions from 50+ countries around the world. These submissions underwent a double-blind peer review process. Of those 514 submissions, 160 submissions (including 15 posters) were selected for inclusion in this proceedings. The published proceedings have been divided into three volumes covering a wide range of conference tracks, such as technology trends, computing, intelligent systems, machine vision, security, communication, electronics and e-learning, to name a few.

In addition to the contributed papers, the conference program included inspiring keynote talks, streamed live during the conference, whose thought-provoking claims were anticipated to pique the interest of the entire computing audience. The authors also presented their research papers very professionally to a large international audience online. All this digital content prompted significant contemplation and discussion among the participants.

Deep appreciation goes to the keynote speakers for sharing their knowledge and expertise with us and to all the authors who have spent the time and effort to contribute significantly to this conference. We are also indebted to the Organizing Committee for their great efforts in ensuring the successful implementation of the conference. In particular, we would like to thank the Technical Committee for their constructive and enlightening reviews on the manuscripts within the limited timescale.


We are pleased to present the proceedings of this conference as its published record, and we hope that all the participants and interested readers benefit scientifically from this book and find it stimulating. We hope to see you in 2021 at our next Computing Conference, with the same amplitude, focus and determination.

Kohei Arai

Contents

Demonstrating Advanced Machine Learning and Neuromorphic Computing Using IBM’s NS16e . . . 1
Mark Barnell, Courtney Raymond, Matthew Wilson, Darrek Isereau, Eric Cote, Dan Brown, and Chris Cicotta
Energy Efficient Resource Utilization: Architecture for Enterprise Network . . . 12
Dilawar Ali, Fawad Riasat Raja, and Muhammad Asjad Saleem
Performance Evaluation of MPI vs. Apache Spark for Condition Based Maintenance Data . . . 28
Tomasz Haupt, Bohumir Jelinek, Angela Card, and Gregory Henley
Comparison of Embedded Linux Development Tools for the WiiPiiDo Distribution Development . . . 42
Diogo Duarte, Sérgio Silva, João M. Rodrigues, Salviano Pinto Soares, and António Valente
FERA: A Framework for Critical Assessment of Execution Monitoring Based Approaches for Finding Concurrency Bugs . . . 54
Jasmin Jahić, Thomas Bauer, Thomas Kuhn, Norbert Wehn, and Pablo Oliveira Antonino
A Top-Down Three-Way Merge Algorithm for HTML/XML Documents . . . 75
Anastasios G. Bakaoukas and Nikolaos G. Bakaoukas
Traceability Framework for Requirement Artefacts . . . 97
Foziah Gazzawe, Russell Lock, and Christian Dawson
Haptic Data Accelerated Prediction via Multicore Implementation . . . 110
Pasquale De Luca and Andrea Formisano
Finding the Maximal Independent Sets of a Graph Including the Maximum Using a Multivariable Continuous Polynomial Objective Optimization Formulation . . . 122
Maher Heal and Jingpeng Li
Numerical Method of Synthesized Control for Solution of the Optimal Control Problem . . . 137
Askhat Diveev
Multidatabase Location Based Services (MLBS) . . . 157
Romani Farid Ibrahim
wiseCIO: Web-Based Intelligent Services Engaging Cloud Intelligence Outlet . . . 169
Sheldon Liang, Kimberly Lebby, and Peter McCarthy
A Flexible Hybrid Approach to Data Replication in Distributed Systems . . . 196
Syed Mohtashim Abbas Bokhari and Oliver Theel
A Heuristic for Efficient Reduction in Hidden Layer Combinations for Feedforward Neural Networks . . . 208
Wei Hao Khoong
Personalized Recommender Systems with Multi-source Data . . . 219
Yili Wang, Tong Wu, Fei Ma, and Shengxin Zhu
Renormalization Approach to the Task of Determining the Number of Topics in Topic Modeling . . . 234
Sergei Koltcov and Vera Ignatenko
Strategic Inference in Adversarial Encounters Using Graph Matching . . . 248
D. Michael Franklin
Machine Learning for Offensive Security: Sandbox Classification Using Decision Trees and Artificial Neural Networks . . . 263
Will Pearce, Nick Landers, and Nancy Fulda
Time Series Analysis of Financial Statements for Default Modelling . . . 281
Kirill Romanyuk and Yuri Ichkitidze
Fraud Detection Using Sequential Patterns from Credit Card Operations . . . 287
Addisson Salazar, Gonzalo Safont, and Luis Vergara
Retention Prediction in Sandbox Games with Bipartite Tensor Factorization . . . 297
Rafet Sifa, Michael Fedell, Nathan Franklin, Diego Klabjan, Shiva Ram, Arpan Venugopal, Simon Demediuk, and Anders Drachen
Data Analytics of Student Learning Outcomes Using Abet Course Files . . . 309
Hosam Hasan Alhakami, Baker Ahmed Al-Masabi, and Tahani Mohammad Alsubait
Modelling the Currency Exchange Rates Using Support Vector Regression . . . 326
Ezgi Deniz Ülker and Sadik Ülker
Data Augmentation and Clustering for Vehicle Make/Model Classification . . . 334
Mohamed Nafzi, Michael Brauckmann, and Tobias Glasmachers
A Hybrid Recommender System Combing Singular Value Decomposition and Linear Mixed Model . . . 347
Tianyu Zuo, Shenxin Zhu, and Jian Lu
Data Market Implementation to Match Retail Customer Buying Versus Social Media Activity . . . 363
Anton Ivaschenko, Anastasia Stolbova, and Oleg Golovnin
A Study of Modeling Techniques for Prediction of Wine Quality . . . 373
Ashley Laughter and Safwan Omari
Quantifying Apparent Strain for Automatic Modelling, Simulation, Compensation and Classification in Structural Health Monitoring . . . 400
Enoch A-iyeh
A New Approach to Supervised Data Analysis in Embedded Systems Environments: A Case Study . . . 416
Pamela E. Godoy-Trujillo, Paul D. Rosero-Montalvo, Luis E. Suárez-Zambrano, Diego H. Peluffo-Ordoñez, and E. J. Revelo-Fuelagán
Smart Cities: Using Gamification and Emotion Detection to Improve Citizens Well Fair and Commitment . . . 426
Manuel Rodrigues, Ricardo Machado, Ricardo Costa, and Sérgio Gonçalves
Towards a Smart Interface-Based Automated Learning Environment Through Social Media for Disaster Management and Smart Disaster Education . . . 443
Zair Bouzidi, Abdelmalek Boudries, and Mourad Amad
Is Social Media Still “Social”? . . . 469
Chan Eang Teng and Tang Mui Joo
Social Media: Influences and Impacts on Culture . . . 491
Mui Joo Tang and Eang Teng Chan
Cost of Dietary Data Acquisition with Smart Group Catering . . . 502
Jiapeng Dong, Pengju Wang, and Weiqiang Sun
Social Engineering Defense Mechanisms: A Taxonomy and a Survey of Employees’ Awareness Level . . . 521
Dalal N. Alharthi and Amelia C. Regan
How Information System Project Stakeholders Perceive Project Success . . . 542
Iwona Kolasa and Dagmara Modrzejewska
Fuzzy Logic Based Adaptive Innovation Model . . . 555
Bushra Naeem, Bilal Shabbir, and Juliza Jamaludin
A Review of Age Estimation Research to Evaluate Its Inclusion in Automated Child Pornography Detection . . . 566
Lee MacLeod, David King, and Euan Dempster
A Comprehensive Survey and Analysis on Path Planning Algorithms and Heuristic Functions . . . 581
Bin Yan, Tianxiang Chen, Xiaohui Zhu, Yong Yue, Bing Xu, and Kai Shi
Computational Conformal Mapping in Education and Engineering Practice . . . 599
Maqsood A. Chaudhry
Pilot Study of ICT Compliance Index Model to Measure the Readiness of Information System (IS) at Public Sector in Malaysia . . . 609
Mohamad Nor Hassan and Aziz Deraman
Preliminary Experiments on the Use of Nonlinear Programming for Indoor Localization . . . 629
Stefania Monica and Federico Bergenti
Improved Deterministic Broadcasting for Multiple Access Channels . . . 645
Bader A. Aldawsari and J. Haadi Jafarian
Equivalent Thermal Conductivity of Metallic-Wire for On-Line Monitoring of Power Cables . . . 661
M. S. Al-Saud
A Novel Speed Estimation Algorithm for Mobile UE’s in 5G mmWave Networks . . . 673
Alawi Alattas, Yogachandran Rahulamathavan, and Ahmet Kondoz
In-App Activity Recognition from Wi-Fi Encrypted Traffic . . . 685
Madushi H. Pathmaperuma, Yogachandran Rahulamathavan, Safak Dogan, and Ahmet M. Kondoz
A Novel Routing Based on OLSR for NDN-MANET . . . 698
Xian Guo, Shengya Yang, Laicheng Cao, Jing Wang, and Yongbo Jiang
A Comparative Study of Active and Passive Learning Approaches in Hybrid Learning, Undergraduate, Educational Programs . . . 715
Khalid Baba, Nicolas Cheimanoff, and Nour-eddine El Faddouli
Mobile Learning Adoption at a Science Museum . . . 726
Ruel Welch, Temitope Alade, and Lynn Nichol
Conceptualizing Technology-Enhanced Learning Constructs: A Journey of Seeking Knowledge Using Literature-Based Discovery . . . 746
Amalia Rahmah, Harry B. Santoso, and Zainal A. Hasibuan
Random Sampling Effects on e-Learners Cluster Sizes Using Clustering Algorithms . . . 760
Muna Al Fanah
Jupyter-Notebook: A Digital Signal Processing Course Enriched Through the Octave Programming Language . . . 774
Arturo Zúñiga-López, Carlos Avilés-Cruz, Andrés Ferreyra-Ramírez, and Eduardo Rodríguez-Martínez
A Novel Yardstick of Learning Time Spent in a Programming Language by Unpacking Bloom’s Taxonomy . . . 785
Alcides Bernardo Tello, Ying-Tien Wu, Tom Perry, and Xu Yu-Pei
Assessing and Development of Chemical Intelligence Through e-Learning Tools . . . 795
E. V. Volkova
Injecting Challenge or Competition in a Learning Activity for Kindergarten/Primary School Students . . . 806
Bah Tee Eng, Insu Song, Chaw Suu Htet Nwe, and Tian Liang Yi
Author Index . . . 827

Demonstrating Advanced Machine Learning and Neuromorphic Computing Using IBM’s NS16e

Mark Barnell1(B), Courtney Raymond1, Matthew Wilson2, Darrek Isereau2, Eric Cote2, Dan Brown2, and Chris Cicotta2

1 Air Force Research Laboratory, Information Directorate, Rome, NY 13441, USA

[email protected]
2 SRC, Inc., 6225 Running Ridge Road, North Syracuse, NY 13212, USA

Abstract. The human brain can be viewed as an extremely power-efficient biological computer. As such, there have been many efforts to create brain-inspired processing systems to enable advances in low-power data processing. An example of brain-inspired processing architecture is the IBM TrueNorth Neurosynaptic System, a Spiking Neural Network architecture for deploying ultra-low power machine learning (ML) models and algorithms. For the first time ever, an advanced scalable computing architecture was demonstrated using 16 TrueNorth neuromorphic processors containing in aggregate over 16 million neurons. This system, called the NS16e, was used to demonstrate new ML techniques including the exploitation of optical and radar sensor data simultaneously, while consuming a fraction of the power compared to traditional Von Neumann computing architectures. The number of applications that have requirements for computing architectures that can operate in size, weight and power-constrained environments continues to grow at an outstanding pace. These applications include processors for vehicles, homes, and real-time data exploitation needs for intelligence, surveillance, and reconnaissance missions. This research included the successful exploitation of optical and radar data using the NS16e system. Processing performance was assessed, and the power utilization was analyzed. The NS16e system never used more than 15 W, with the contribution from the 16 TrueNorth processors utilizing less than 5 W. The image processing throughput was 16,000 image chips per second, corresponding to 1,066 image chips per second for each watt of power consumed.

Keywords: Machine vision · High Performance Computing (HPC) · Artificial Intelligence (AI) · Machine learning image processing · Deep Learning (DL) · Convolutional Neural Networks (CNN) · Spiking Neural Network (SNN) · Neuromorphic processors

Received and approved for public release by the Air Force Research Laboratory (AFRL) on 11 June 2019, case number 88ABW-2019-2928. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors, and do not necessarily reflect the views of AFRL or its contractors. This work was partially funded under AFRL’s Neuromorphic - Compute Architectures and Processing contract that started in September 2018 and continues until June 2020.

© Springer Nature Switzerland AG 2020
K. Arai et al. (Eds.): SAI 2020, AISC 1228, pp. 1–11, 2020. https://doi.org/10.1007/978-3-030-52249-0_1


1 Background

Background and insight into the technical applicability of this research is discussed in Sect. 1. Section 2 provides an overview of the hardware. Section 3 provides detail on our technical approach. Section 4 provides a concise summary of our results. Section 5 addresses areas of future research, and conclusions are discussed in Sect. 6.

We are currently in a period in which interest in, and the pace of, research and development in ML and AI is high. In part, the progress is enabled by increased investment by government, industry, and academia. ML algorithms, techniques, and methods are improving at an accelerated pace, with methods for recognizing objects and patterns now outpacing human performance. The community’s interest is supported by the number of applications that can use existing and emerging ML hardware and software technologies. These applications are supported by the availability of large quantities of data, connectivity of information, and new high-performance computing architectures.

Such applications are now prevalent in many everyday devices. For example, data from low-cost optical cameras and radars provide automobiles the data needed to assist humans. These driver assistants can identify road signs, pedestrians, and lane lines, while also controlling vehicle speed and direction. Other devices include smart thermostats, autonomous home floor cleaners, robots that deliver towels to hotel guests, systems that track and improve athletic performance, and devices that help medical professionals diagnose disease. These applications, and the availability of data collected on increasingly smaller devices, are driving the need for and interest in low-power neuromorphic chip architectures.

The wide applicability of information processing technologies has increased competition and interest in computing hardware and software that can operate within the memory, power, and cost constraints of the real world. This includes continued research into computing systems that are structured like the human brain. The research spans several decades of development, in part pioneered by Carver Mead [1]. Current examples of the more advanced neuromorphic chips include SpiNNaker, Loihi, BrainScaleS-1, NeuroGrid/Braindrop, DYNAP, ODIN and TrueNorth [2]. These systems improve upon traditional computing architectures, such as the Von Neumann architecture, where physical memory and logic are separated. In neuromorphic systems, the colocalization of memory and computation, as well as reduced-precision computing, increases energy efficiency and yields a product that uses much less power than traditional compute architectures.

IBM’s TrueNorth Neurosynaptic System represents an implementation of these newly available specialized neuromorphic computing architectures [3]. The TrueNorth NS1e, an evaluation board with a single TrueNorth chip, has the following technical specifications: 1 million individually programmable neurons, 256 million individually programmable synapses, and 4,096 parallel and distributed cores. Additionally, this chip uses approximately 200 mW of total power, resulting in a 20 mW/cm2 power density [4–6].


The latest iteration of the TrueNorth Neurosynaptic System includes the NS16e, a single board containing a tiled set of 16 TrueNorth processors assembled in a four-by-four grid. This state-of-the-art 16-chip computing architecture yields a 16-million-neuron processor, capable of implementing large, multi-processor models or parallelizing smaller models, which can then process 16 times the data. To demonstrate the processing capabilities of the TrueNorth, we developed multiple classifiers. These classifiers were trained using optical satellite imagery from the United States Geological Survey (USGS) [7]. Each image chip in the overall image was labeled by identifying the existence or non-existence of a vehicle in the chip. The chips were not centered and could include only a segment of a vehicle [8]. Figure 1 shows the raw imagery on the left and the processed imagery on the right. In this analysis, a single TrueNorth chip was able to process one thousand 32 × 32 pixel chips per second.

[Fig. 1 panels: raw imagery (left) and processed imagery (right); USGS imagery was chipped, human-labeled, and preprocessed to train, test, and validate a 14-layer neural network; results: accuracy 97.6%, probability of detection 89.5%, probability of false alarm 1.4%; vehicle detection over 24,336 total chips classified at 1,000 chips/sec on 3 watts.]

Fig. 1. Electro-optical (EO) image processing using two-class network to detect car/no car in scene using IBM’s neuromorphic compute architecture, called TrueNorth (using one chip)

Previous work was extended through the use of new networks and their placement on the TrueNorth chip. Additionally, results were captured, and analyses were completed to assess the performance of these new network models. The overall accuracy of the best model was 97.6%. Additional performance measures are provided at the bottom of Fig. 1.
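The reported measures (accuracy, probability of detection, and probability of false alarm) follow directly from the confusion matrix of the two-class car/no-car problem. The sketch below shows how such figures are computed; the counts used are hypothetical placeholders, not values from the paper.

```python
# Minimal sketch: deriving accuracy, probability of detection (PD), and
# probability of false alarm (PFA) from two-class confusion-matrix counts.
# The counts below are hypothetical placeholders, not the paper's data.

def detection_metrics(tp, fp, tn, fn):
    """Compute accuracy, PD (recall on the vehicle class) and PFA."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    pd = tp / (tp + fn)          # fraction of true vehicles detected
    pfa = fp / (fp + tn)         # fraction of non-vehicle chips flagged
    return accuracy, pd, pfa

if __name__ == "__main__":
    # Hypothetical chip counts for a 24,336-chip scene.
    acc, pd, pfa = detection_metrics(tp=850, fp=320, tn=22_900, fn=266)
    print(f"accuracy={acc:.3f}, PD={pd:.3f}, PFA={pfa:.3f}")
```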

2 Hardware Overview

Development of the TrueNorth architecture dates to the DARPA Systems of Neuromorphic Adaptive Plastic Scalable Electronics (SyNAPSE) project beginning in 2008. This project sought to develop revolutionary new neuromorphic processors and design tools. Each TrueNorth chip is made of 5.4 billion transistors, and is fabricated using


a 28 nm low-power complementary metal-oxide-semiconductor (CMOS) process technology with complementary pairs of logic functions. The TrueNorth chip is 4.3 cm2 and uses under 200 mW of power per chip. The NS16e board is configured with 16 TrueNorth chips in a 4 × 4 chip configuration. In aggregate, this board provides users with access to 16 million programmable neurons and over 4 billion programmable synapses. The physical size of the NS16e board is 215 mm × 285 mm [9]. To expand on this configuration, four of these NS16e boards were placed in a standard 7U space; this new neuromorphic system occupies about 19 by 23 by 7 inches of rack space. The four-board configuration, called the NS16e-4, results in a neuromorphic computing system with 64 million neurons and 16 billion synapses. Such a configuration enables users to extend the single-chip research described in Sect. 1 and implement inferencing algorithms and data processing in parallel. Additionally, the system uses a fraction of the processing power compared to traditional computing hardware occupying the same physical footprint. The Air Force Research Laboratory’s (AFRL’s) rack-mounted neurosynaptic system, called BlueRaven, is shown in Fig. 2.

[Fig. 2 callouts: BlueRaven Neurosynaptic System; rack mounted; four NS16e boards with an aggregate of 64 million neurons and 16 billion synapses; enables parallelization research and design; used to process 5000 × 5000 pixels of information every 3 seconds.]

Fig. 2. AFRL’s BlueRaven system – equivalent to 64 million neurons and 16 billion synapses

The BlueRaven High Performance Computer (HPC) also contains a 2U Penguin Computing Relion 2904GT. The Penguin server is utilized for training network models


before being deployed to the neuromorphic hardware, as well as for data pre-processing. Table 1 details BlueRaven’s specifications.

Table 1. BlueRaven system architecture configuration detail

Specification             Description
Form Factor               2U Server + 2U NS16e Sled
NS16e                     4× IBM NS16e PCIe Cards
Neurosynaptic Cores       262,144
Programmable Neurons      67,108,864
Programmable Synapses     17,179,869,184
PCIe NS16e Interface      4× PCIe Gen 2
Ethernet - Server         1× 1 Gbit
Ethernet - NS16e          1× 1 Gbit per NS16e
Training GPUs             2× NVIDIA Tesla P100
Volatile Memory           256 GB
CPUs                      2× 10-Core E5-2630

3 Approach

The NS16e processing approach includes the use of deep convolutional Spiking Neural Networks (SNNs) to perform classification inferencing on the input imagery. The deep networks were designed and trained using IBM’s Energy-efficient Deep Neuromorphic Networks (EEDN) framework [4]. The neurosynaptic resource utilization of the classifiers was purposely designed to operate within the constraints of the TrueNorth architecture. Specifically, they stayed within the limits of a single TrueNorth’s 1 million neurons and 256 million synapses. The benefit of this technical approach is that it immediately allowed us to populate an NS16e board with up to sixteen parallel image classifier networks, eight to process optical imagery and eight to process radar imagery. Specifically, the processing chain is composed of a collection of 8 duplicates of the same EEDN network trained on a classification task for each chosen dataset.

3.1 USGS Dataset

EO imagery happens to be a very applicable data set for exercising the BlueRaven system. It is applicable because it is freely available and of favorable quality (i.e., high-resolution). The quality of the data enabled us to easily identify targets in the imagery. Additionally, the data could be easily chipped and labeled to provide the information necessary for network model training and validation.


This overhead optical imagery includes all 3 color channels (red, green and blue). The scene analyzed included 5000 × 5000 pixels at 1-foot resolution. From this larger scene, image chips were extracted. Each image chip from the scene was 32 × 32 pixels. There was no overlap between samples, thereby sampling the input scene with a receptive field of 32 × 32 pixels and a stride of 32 pixels. This resulted in 24,336 (156 × 156) sub-regions. The USGS EO data was used to successfully build TrueNorth-based classifiers that contained up to six object and terrain classes (e.g., vehicle, asphalt, structure, water, foliage, and grass). For this multi-processor neurosynaptic hardware demonstration, a subset of the classes was utilized to construct a binary classifier, which detected the presence or absence of a vehicle within the image chip. The data set was divided into training and test/validation sets. The training set contained 60% of the chips (14,602 image chips). The remaining 40% of the chips (9,734) were used for test/validation. The multi-processor demonstration construct and corresponding imagery is shown in Fig. 3.
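As a concrete illustration of the chipping scheme described above (32 × 32 pixel chips, stride 32, 60/40 train/test split), the following sketch uses NumPy; the array names and random seed are illustrative assumptions, not artifacts of the authors' actual pipeline.

```python
# Minimal sketch of non-overlapping 32x32 chipping of a 5000x5000x3 scene,
# followed by a 60/40 train/test-validation split. Names and seed are
# illustrative; this is not the authors' actual pipeline.
import numpy as np

CHIP = 32          # receptive field and stride (no overlap)

def chip_scene(scene: np.ndarray) -> np.ndarray:
    """Cut an (H, W, 3) scene into (N, 32, 32, 3) non-overlapping chips."""
    h, w, c = scene.shape
    rows, cols = h // CHIP, w // CHIP          # 5000 // 32 = 156
    cropped = scene[: rows * CHIP, : cols * CHIP, :]
    chips = (cropped
             .reshape(rows, CHIP, cols, CHIP, c)
             .transpose(0, 2, 1, 3, 4)
             .reshape(rows * cols, CHIP, CHIP, c))
    return chips                                # 156 * 156 = 24,336 chips

if __name__ == "__main__":
    scene = np.zeros((5000, 5000, 3), dtype=np.uint8)   # placeholder imagery
    chips = chip_scene(scene)
    rng = np.random.default_rng(0)
    idx = rng.permutation(len(chips))
    n_train = int(0.6 * len(chips))                      # ~14,602 chips
    train_idx, test_idx = idx[:n_train], idx[n_train:]   # ~9,734 chips
    print(chips.shape, len(train_idx), len(test_idx))
```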

[Fig. 3 callouts: Multi-Processor Neurosynaptic Demonstration; optical imagery; binary classification of a vehicle/no vehicle within the image; example imagery and demonstration construct.]

Fig. 3. Example USGS tile and demonstration construct

3.2 Multi-chip Neurosynaptic Electro-Optical Classification

The content of the chip was defined during data curation/labeling. The label was in one of two categories: no vehicle or a vehicle. Additionally, the chips were not chosen with the targets of interest centered in the image chip. Because of this, many of the image chips contained portions of a vehicle, e.g., a chip may contain an entire vehicle, fractions of a vehicle, or even fractions of multiple vehicles. The process of classifying the existence of a vehicle in the image starts with object detection. Recognizing that a chip may contain a portion of a vehicle, an approach was developed to help ensure detection of the vehicles of interest. This approach created


multiple 32 × 32 × 3 image centroids. These centroids were varied in both the X and Y dimensions to increase the probability of capturing more of the target in the image being analyzed. A block diagram showing the processing flow from USGS imagery to USGS imagery with predicted labels is shown in Fig. 4. This includes NS16e implementations with 8 parallel classifier networks, one per TrueNorth chip on half the board.
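A minimal sketch of this multi-centroid idea is shown below: for a candidate location, several 32 × 32 × 3 crops are extracted at small X/Y offsets so that at least one crop is likely to contain most of a vehicle straddling chip boundaries. The offset values and helper names are illustrative assumptions, not the paper's parameters.

```python
# Minimal sketch (assumed offsets/names): extract several 32x32x3 crops around
# a candidate location, shifted in X and Y, so that at least one crop captures
# most of a vehicle that straddles chip boundaries.
import numpy as np

CHIP = 32
OFFSETS = (-8, 0, 8)   # illustrative shifts in pixels, not the paper's values

def centroid_crops(scene: np.ndarray, cx: int, cy: int) -> list:
    """Return up to 9 offset crops centered near (cx, cy)."""
    h, w, _ = scene.shape
    crops = []
    for dy in OFFSETS:
        for dx in OFFSETS:
            y0 = min(max(cy + dy - CHIP // 2, 0), h - CHIP)
            x0 = min(max(cx + dx - CHIP // 2, 0), w - CHIP)
            crops.append(scene[y0:y0 + CHIP, x0:x0 + CHIP, :])
    return crops

if __name__ == "__main__":
    scene = np.zeros((5000, 5000, 3), dtype=np.uint8)   # placeholder imagery
    crops = centroid_crops(scene, cx=100, cy=200)
    print(len(crops), crops[0].shape)   # 9 crops of shape (32, 32, 3)
```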

[Fig. 4 blocks: USGS 5000 × 5000 image → image chipper → image sub-regions → eight parallel TrueNorth USGS models → classification processor → USGS image with predicted labels.]

Fig. 4. NS16e USGS block diagram

The copies of the EO network were placed in two full columns of the NS16e, or eight TrueNorth processors in a 4 × 2 configuration, with one network copy on each processor. As a note, the remainder of the board was leveraged to study processing with additional radar imagery data.

3.3 Electro-Optical Classification Hardware Statistics

Analyses of the system’s power consumption were completed. The TrueNorth system operates at a rate of 1 kHz. This rate directly correlates to the number of image chips that can be processed per second. The USGS/radar network models were replicated across 16 TrueNorth chips and resulted in a processing speed of 16,000 inferences per second. At this rate, and using 8 TrueNorth chips, the new compute and exploitation architecture was able to process the full 5,000 × 5,000 pixel optical imagery with 24,336 image chips in 3 s.
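The throughput and efficiency figures quoted here and in the abstract follow from simple arithmetic (1,000 chips per second per TrueNorth chip, replicated across the board, divided by the board power); the short sketch below reproduces that calculation, taking 15 W as the board-level bound reported in the abstract.

```python
# Minimal sketch reproducing the reported throughput/efficiency arithmetic.
CHIP_RATE_HZ = 1_000          # TrueNorth tick rate -> chips per second per chip
N_CHIPS_BOARD = 16            # TrueNorth processors per NS16e board
SCENE_CHIPS = 24_336          # 156 x 156 chips in the 5000 x 5000 scene
BOARD_POWER_W = 15            # upper bound on NS16e board power (abstract)

board_rate = CHIP_RATE_HZ * N_CHIPS_BOARD            # 16,000 inferences/s
eo_rate = CHIP_RATE_HZ * 8                           # 8 chips assigned to EO imagery
scene_time_s = SCENE_CHIPS / eo_rate                 # ~3.0 s per full scene
chips_per_watt = board_rate / BOARD_POWER_W          # ~1,066 chips/s per watt

print(board_rate, round(scene_time_s, 2), round(chips_per_watt))
```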


The NS16e card’s power usage during inferencing is shown in Table 2. The total utilization of the board was less than 14 W. The runtime analyses included the measurement of periphery circuits and input/output (I/O) on the board.

Table 2. NS16e board power usage

Board power
Board                                Voltage (V, nominal)   Current (A, measured)   Power (W, computed)
Interposer (Including MMP)           +12                    0.528                   6.336
16-chip board (Including TN chips)   +12                    0.622                   7.462
Total                                                       1.150                   13.798

Table 3 details the power utilization of the TrueNorth chips without the board’s peripheral power use. The contribution from the TrueNorth accounted for approximately 5 W of the total 15 W.

Table 3. NS16e TrueNorth power usage

TrueNorth power only
Component               Voltage (V, measured)   Current (A, measured)   Power (W, computed)
TrueNorth Core VDD      0.980                   4.74                    4.636
TrueNorth I/O Drivers   1.816                   0.04                    0.063
TrueNorth I/O Pads      1.000                   0.00                    0.002
Total                                                                   4.701

Table 4 provides detail on the power utilization without loading on the system (idle).

4 Results

In Fig. 5, we see an example of predictions (yellow boxes) overlaid with ground truth (green tiles). Over the entirety of our full-scene image, we report a classification accuracy of 84.29%, or 3,165 of 3,755 vehicles found. Our misclassification rate, meaning the rate of false positives or false negatives, is 35.39%. Of that, 15.71% of targets are false negatives, i.e. target misses. This can be tuned by changing the chipping algorithm used, with a trade-off in the inference speed of a tile.


Table 4. Idle NS16e power usage

Board power
Board                                Voltage (V, nominal)   Current (A, measured)   Power (W, computed)
Interposer (Including MMP)           +12                    0.518                   6.216
16-chip board (Including TN chips)   +12                    0.605                   7.265
TOTAL                                                       1.123                   13.481

TrueNorth power only
Component               Voltage (V, measured)   Current (A, measured)   Power (W, computed)
TrueNorth Core VDD      0.978                   4.64                    4.547
TrueNorth I/O Drivers   1.816                   0.03                    0.051
TrueNorth I/O Pads      0.998                   0.00                    0.001
TOTAL                                                                   4.599

[Fig. 5 legend: targets positively identified; targets falsely identified; truth (human-labeled cars); detected 3,165 of 3,755 (84.29%) of the vehicles at 8,000 inferences per second.]

Fig. 5. Example USGS tile results

5 Future Research

Neuromorphic research and development continue at companies such as Intel and IBM. They are contributing to the community’s interest in these low-power processors. As an example, the SpiNNaker system consists of many ARM cores and is highly


flexible, since neurons are implemented at the software level, albeit somewhat more energy intensive (each core consumes ~1 W) [10, 11]. As new SNN architectures continue to be developed, new algorithms and applications continue to surface. This includes technologies such as bio-inspired vision systems [12]. Additionally, Intel’s Loihi neuromorphic processor [13] is a new SNN neuromorphic architecture which enables a new set of capabilities on ultra-low power hardware. Loihi also provides the opportunity for online learning. This makes the chip more flexible, as it allows various paradigms, such as supervised/unsupervised and reinforcement learning, as well as greater configurability. Additional research on these systems, data exploitation techniques, and methods will continue to enable new low-power and low-cost processing capabilities with consumer interest and applicability.

6 Conclusions

The need for advanced processing algorithms and methods that operate on low-power computing hardware continues to grow at an outstanding pace. This research has enabled the demonstration of advanced image exploitation on the newly developed NS16e neuromorphic hardware, i.e., a board with sixteen neurosynaptic chips on it. Together, those chips never exceeded 5 W power utilization. The neuromorphic board never exceeded 15 W power utilization.

References
1. Mead, C.: Neuromorphic electronic systems. Proc. IEEE 78(10), 1629–1636 (1990)
2. Rajendran, B., Sebastian, A., Schmuker, M., Srinivasa, N., Eleftheriou, E.: Low-power neuromorphic hardware for signal processing applications (2019). https://arxiv.org/abs/1901.03690
3. Barnell, M., Raymond, C., Capraro, C., Isereau, D., Cicotta, C., Stokes, N.: High-performance computing (HPC) and machine learning demonstrated in flight using Agile Condor®. In: IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA (2018)
4. Esser, S.K., Merolla, P., Arthur, J.V., Cassidy, A.S., Appuswamy, R., Andreopoulos, A., et al.: CNNs for energy-efficient neuromorphic computing. In: Proceedings of the National Academy of Sciences, p. 201604850, September 2016. https://doi.org/10.1073/pnas.1604850113
5. Service, R.F.: The brain chip. Science 345(6197), 614–615 (2014)
6. Cassidy, A.S., Merolla, P., Arthur, J.V., Esser, S.K., Jackson, B., Alvarez-Icaza, R., Datta, P., Sawada, J., Wong, T.M., Feldman, V., Amir, A., Rubin, D.B.-D., Akopyan, F., McQuinn, E., Risk, W.P., Modha, D.S.: Cognitive computing building block: a versatile and efficient digital neuron model for neurosynaptic cores. In: The 2013 International Joint Conference on Neural Networks (IJCNN), pp. 1–10, 4–9 August 2013
7. U.S. Geological Survey: Landsat Data Access (2016). http://landsat.usgs.gov/Landsat_Search_and_Download.php
8. Raymond, C., Barnell, M., Capraro, C., Cote, E., Isereau, D.: Utilizing high-performance embedded computing, Agile Condor®, for intelligent processing: an artificial intelligence platform for remotely piloted aircraft. In: 2017 IEEE Intelligent Systems Conference, London, UK (2017)


9. Modha, D.S., Ananthanarayanan, R., Esser, S.K., Ndirango, A., et al.: Cognitive computing. Commun. ACM 54(8), 62–71 (2011)
10. Furber, S.B., Galluppi, F., Temple, S., Plana, L.A.: The SpiNNaker project. Proc. IEEE 102(5), 652–665 (2014)
11. Schuman, C.D., Potok, T.E., Patton, R.M., Birdwell, J.D., Dean, M.E., Rose, G.S., Plank, J.S.: A survey of neuromorphic computing and neural networks in hardware. CoRR abs/1705.06963 (2017)
12. Dong, S., Zhu, L., Xu, D., Tian, Y., Huang, T.: An efficient coding method for spike camera using inter-spike intervals. In: IEEE DCC, March 2019
13. Tang, G., Shah, A., Michmizos, K.P.: Spiking neural network on neuromorphic hardware for energy-efficient unidimensional SLAM. CoRR abs/1903.02504, arXiv:1611.05141 (2019)

Energy Efficient Resource Utilization: Architecture for Enterprise Network
Towards Reliability with SleepAlert

Dilawar Ali1(B), Fawad Riasat Raja2, and Muhammad Asjad Saleem2

1 Ghent University, Ghent, Belgium

[email protected]
2 University of Engineering and Technology Taxila, Taxila, Pakistan

Abstract. Enterprise networks usually require all the computing machines to remain accessible (switched on) at all times, regardless of the workload, in order to entertain user requests at any instant. This comes at the cost of excessive energy utilization. Many solutions have been put forward; however, only a few of them have been tested in a real-time environment, where the energy saving is achieved by compromising the system’s reliability. Therefore, energy-efficient resource utilization without compromising the system’s reliability is still a challenge. In this research, a novel architecture, “Sleep Alert”, is proposed that not only avoids excessive energy utilization but also improves system reliability by using the Resource Manager (RM) concept. Contrary to traditional approaches, Primary and Secondary Resource Managers, i.e. RMP and RMS respectively, are used to avoid a single point of failure. The proposed architecture was tested on a network where active users were accessing distributed virtual storage and other applications deployed on desktop machines connected with each other through a peer-to-peer network. Experimental results show that the solution can save a considerable amount of energy while making sure that reliability is not compromised. This solution is useful for small enterprise networks, where saving energy is a big challenge besides reliability.

Keywords: Enterprise networks · Resource manager · Green computing · Sleep proxy · Energy-efficient computing

© Springer Nature Switzerland AG 2020
K. Arai et al. (Eds.): SAI 2020, AISC 1228, pp. 12–27, 2020. https://doi.org/10.1007/978-3-030-52249-0_2

1 Introduction

Efficient utilization of energy is one of the biggest challenges around the globe. The difference between demand and supply is always on the rise. For high-performance computing, a reliable, scalable, and cost-effective energy solution satisfying power requirements and minimizing environmental pollution will have a high impact. The biggest challenge in enterprise networks is how to manage power consumption. Data centers utilize a huge amount of energy in order to ensure the availability of data when accessed remotely. A major problem nowadays is that energy is scarce, which is why renewable energy, i.e. producing energy from wind, water, sunlight, geothermal and bio-energy sources, is a hot research issue. It is of equal importance how efficiently this limited energy would


be utilized, so that an investment in green technology is made that strengthens the economy besides reducing environmental pollution. In the US Department of Energy, the Office of Energy Efficiency and Renewable Energy (EERE) is also working on energy efficiency and renewable energy resources with an aim to reduce dependence on imported oil [1].

The number of internet users increased five times from 2000 to 2009 and currently exceeds 2.4 billion [2]. A Microsoft report shows it will be more than 4 billion in the coming years, while other sources say it will be more than 5 billion by the year 2020 [3]. The more users there are, the more energy is consumed and the more CO2 is emitted. The global Information and Communications Technology (ICT) industry accounts for approximately 2% of global carbon dioxide (CO2) emissions, a number equivalent to aviation, which means a rapid increase in environmental pollution. The power deficiency issue and energy crisis are currently main topics of debate on discussion forums and in professional conferences. It is considered a worldwide goal to optimize energy consumption and minimize CO2 emissions in all critical sectors of an economy [4]. Therefore, the major concern is to reduce the energy utilization of enterprise networks using our prescribed scheme. A contribution of this scheme is to widen the research area in green computing, where the main goal is not just to save the operational energy of a product but the overall energy consumed from product development until the completion of the recycling process [4].

Data centers contain servers and storage systems, which operate and manage enterprise resource planning solutions. Major components of a data center are environmental controls (e.g., ventilation/air conditioning), redundant or backup power supplies, multiple data communication connections and security devices (e.g., cameras). Large data centers are often considered a major source of air pollution in the form of CO2. By releasing plenty of heat, they contribute to global warming and consume as much energy as a small town [5].

In enterprise networks a machine (often a desktop) is usually in active mode so that it remains available whenever accessed remotely. This means that machines with no load (i.e., idle ones) may remain continuously in the active state. To cope with energy management issues, many hardware- and software-based solutions have been put forward. Vendors have manufactured devices, e.g., Dynamic Voltage and Frequency Scaling (DVFS) enabled devices, which consume less energy than non-DVFS-enabled devices. Many software solutions, too, have been proposed to reduce energy consumption. The most prominent is the one that takes machines into sleep mode when they are idle. However, in the latter scheme there are a number of issues, e.g., reliability, which will be addressed in the next section.

The proposed architecture not only addresses the shortcomings of existing architectures identified in this research work but also caters to most of the issues that occurred in a real-time test environment. These include the reliability of the sleep proxy architecture and its overhead - where the proxy node sends a periodic signal to the proxy server every five minutes, notifying its presence on the network - as these have not been addressed in the traditional approaches. The main benefit of the proposed scheme is that no extra or costly hardware is required to implement the solution.
Sleep Alert, a cost-effective solution, eliminates the single point of failure and the need for any extra hardware to make the network reliable and energy efficient. This


solution is useful for small enterprise networks, where saving energy is a big challenge besides reliability.

1.1 Major Contributions

The major contributions are as follows:
1. A simple architecture to cope with the challenge of single point failure and ensure service for as long as even a single node is available in the network.
2. A low-cost solution that saves a considerable amount of energy while making sure that reliability is not compromised.

2 Literature Review

The traditional approach to designing a computer network is to ensure the accessibility of each machine connected to the network at any cost. Energy consumption is not taken into account while designing such networks [6]. Generally, a network design is highly redundant for fault tolerance, and all network equipment stays powered on even if unused or lightly used. Many solutions [7–16] have been proposed to use energy efficiently. The Green Product Initiative [12] introduced the use of energy-efficient devices based on the Dynamic Voltage and Frequency Scaling (DVFS) technique, but this solution is not appropriate for jobs with a short deadline, because DVFS-enabled devices take more time than the normal execution time of a job. Hardware-based solutions [12, 17] require particular hardware, such as GumStix [17]: when the host sleeps, these low-powered devices become active. Sometimes a particular hardware device is required for each machine, besides operating system alterations on both the application and the host. Apple has come up with a sleep proxy that is compatible only with Apple-designed hardware and is appropriate for home networks only [15]. The sleep proxy architecture [13] is a common software-based solution for reducing energy consumption. The major flaw of this technique is that if the sleep proxy server is down, there is no other way to access any machine. Using a designated machine as the sleep proxy server is also not a good approach, as the danger of a single point of failure always lurks. A different, SOCKS-based approach considers the power state of a machine and shows how a Network Connectivity Proxy (NCP) could enable substantial energy savings by letting idle machines enter a low-power sleep state while still ensuring their presence in the network; however, this is only a design and prototype, not yet implemented. Furthermore, there is always a significant difference between simulation-based tests of a sleep proxy architecture and real-time testing in enterprise networks [14]. Software-based solutions to energy-efficient computing are normally based on a sleep approach; Wake on LAN (WOL) [18], for example, is a standard methodology used to make a sleeping machine in a network available when required by sending a magic packet. Cloud and fog computing are common trends introduced in the recent decade. Given the way Information and Communication Technology (ICT) and computing systems are growing, it is necessary to take into account the future challenges of energy consumption



due to these latest technologies [19]. Cloud computing involves large data sets held at different locations. Many companies with huge data centers, such as Google, Amazon, Yahoo and Microsoft, currently follow the cloud architecture to handle large numbers of data sets. The progression toward the cloud architecture results in more data centers, which leads to a huge increase in energy consumption. A survey shows that the energy consumed by data centers in the US was between 1.7% and 2.2% of the entire power consumption of the US in 2010 [20]. To control the huge energy consumption of cloud servers, the concepts of greening cloud computing [21] and Pico servers [22], i.e. power aware computing, were introduced. Common methods to achieve energy saving in green cloud computing data centers are adaptive link rate, implementing virtual network methodologies, putting lightly utilized servers to sleep, green/power aware routing, and server load/network traffic consolidation [23]. Efficient resource allocation is a newer direction for coping with the energy challenge. Research shows different methods of resource allocation, a common one being allocation based on priority [24]. This technique requires an intelligent system that predicts the behavior of the network based on different machine learning concepts [25]. In virtual machine based solutions for energy efficiency, priority allocation and resource sharing algorithms were designed to allocate resources so as to save maximum energy; the major flaw of this technique is the excessive load placed on the network by constant reallocation of resources [26]. An intelligent approach to minimizing energy consumption is to predict the network work load based on different models: the work load is divided into different classes and each class is then evaluated against several available models [27]. Most of the available software based solutions are not tested in real time environments; others are complex and show many shortcomings when tested in a real-time environment. Ensuring reliability while saving a sufficient amount of energy is a big challenge for this kind of application, and single point failure is one of the major points of concern in this research domain. Many solutions have been proposed, but some are complex to implement while others require a lot of infrastructure. Most companies in developing countries have major budgetary constraints that prevent them from setting up such a large infrastructure. Even if small enterprises establish such large and complex network solutions, the cost to operate, maintain and keep these solutions serviceable is too much to afford; the solution then costs more than the benefit gained from the energy saved. These organizations normally need a short, simple and low cost energy saving solution that still offers a high reliability rate and good performance. The solution that benefits this kind of organization, especially in under-developed countries where upgrading the whole network is not feasible or the cost of change far exceeds the annual revenue of the organization, is simply "Sleep Alert", a simple and cost effective solution.

3 System Architecture

The concept of a sleep proxy is much appreciated, but over time researchers have identified some major problems that cause deadlocks in the network and affect the availability of a sleeping machine to respond to a remote request. Some issues with current sleep proxies discussed in this research are as follows:



• For making the environment green we cannot risk leaving the network at the mercy of just one proxy.
• With only one proxy, if it goes down for some reason, the states of all sleeping machines are lost and they are unable to return to wake-up mode when required.
• A proxy is a dedicated machine that has to maintain the state of the sleeping machines, which is an extra overhead as it consumes extra energy.
• Deciding when to go into sleep mode is non-trivial.
• Some sleep approaches shut the system down and restart it when required; this requires a lot of energy at start-up and also takes much longer than resuming from the sleep state.

In contrast to previous sleep proxy approaches, the concept of a Resource Manager (RM) is proposed to avoid single point failure; the RM is further categorized into a Primary Resource Manager (RMp) and a Secondary Resource Manager (RMs). In the proposed architecture there is no dedicated machine acting as an RM as in traditional sleep proxy approaches: any ordinary machine can act as an RM when required. To avoid single point failure, two machines act as RMp and RMs at the same time. Whenever RMp stops working, RMs takes over as RMp and uses the next available machine as RMs, and this cycle continues as long as there is a machine available in the network. If RMp goes down or stops working, the intermediate device (router) updates its routing table and this update redirects the incoming traffic towards RMs. Receiving this traffic is the signal to RMs that RMp is down, so RMs updates its status from RMs to RMp. It then pings the whole network to obtain the current state of all desktop machines and allows the next available machine to act as RMs. Each desktop machine has two modes, Energy Saving (ES) Mode and Energy Consumption (EC) Mode. A machine acting as a Resource Manager can be in two states, either 'Working State' (W) or 'Down State' (D). To access a sleeping machine, the remote user's request is forwarded to RMp, or to RMs in case RMp is down. RMp sends a WOL packet to the particular machine to bring it back to the awake state, i.e. EC mode. Each machine holds a small database that contains the states of the sleeping machines; this database is usually updated while the machine acts as a Resource Manager. The information recorded in this database is the IP of each machine, its MAC address, connecting ports and the current state of the machine. The five states of a machine defined in the proposed architecture are: Primary Resource Manager [RMp], Secondary Resource Manager [RMs], Machine Active [A], Energy Saving Mode [S], and Machine not available in the network [N]. The proposed architecture is shown in Fig. 1. This is a software-based solution and no extra hardware is required to implement it.

3.1 General Terminologies

Status Indicator. The status indicator is an application running on each machine that conveys the present status (sleep or wake) of that machine to RMp.
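As a minimal illustration of the bookkeeping described above, the following sketch (Python; all class, field and function names are illustrative assumptions, since the paper does not publish its implementation) models the small per-machine database and the RMs-to-RMp takeover that is triggered when the router redirects traffic to the secondary Resource Manager.

```python
from dataclasses import dataclass

# The five states defined in the proposed architecture.
RMP, RMS, ACTIVE, SAVING, NOT_AVAILABLE = "RMp", "RMs", "A", "S", "N"

@dataclass
class MachineRecord:
    """One row of the small database kept on every machine."""
    ip: str
    mac: str
    port: int
    state: str   # one of RMP, RMS, ACTIVE, SAVING, NOT_AVAILABLE

class ResourceManager:
    def __init__(self, records):
        self.records = {r.ip: r for r in records}

    def on_redirected_traffic(self, own_ip):
        """Called on the RMs when the router starts forwarding traffic to it,
        which signals that the RMp is down (see Sect. 3)."""
        self.records[own_ip].state = RMP
        # Ping the network to refresh every machine's current state (ping
        # logic omitted), then appoint the next active machine as RMs.
        for rec in self.records.values():
            if rec.ip != own_ip and rec.state == ACTIVE:
                rec.state = RMS
                break
```

The cycle described in the text then repeats: the newly appointed RMs runs the same takeover routine if it, in turn, starts receiving redirected traffic.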



Fig. 1. Proposed system architecture

Sleep Status. When there is no activity on a machine for a specific amount of time, i.e. the machine is idle, it turns to the sleep state and, through the status indicator, sends a message about its (sleep) status to RMp. There is no need to keep track of which machine is acting as RMp: an ordinary machine broadcasts its status message in the network, only RMp saves the status of that particular machine, and the other machines discard the message. Wake Status. When a machine is in the sleep state and there is network traffic for it (a remote user accessing the machine), RMp sends a WOL message to that machine. If the sleeping machine acknowledges the WOL message, RMp updates the status of that machine (sleep to wake) in its database. If there is no response at all after three WOL messages, RMp considers the machine to be no longer in the network. Message Packet. Messages that are forwarded and received have the following format. The protocol defined for the identification of a message packet is presented below and shown in Fig. 2. A message contains the following five tokens, which are separated by [-]: Message Protocol: [Message_Type - Requested_Ip - Source_Ip - Destination_Ip - Termination]

• Message Type: code representing the task required from the requested PC.
• Requested IP: the IP of the targeted machine, which responds to the user request.
• Source IP: the IP of the sender machine.
• Destination IP: the IP of the receiving machine.
• Termination: a termination character, which in our case is '$'.
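A minimal sketch of how such a packet could be assembled and split back into its five tokens is shown below (Python; the token order, the '-' separator and the '$' terminator follow the protocol above, while the function names are illustrative assumptions).

```python
TERMINATOR = "$"  # termination character defined by the protocol

def build_message(message_type: str, requested_ip: str,
                  source_ip: str, destination_ip: str) -> str:
    """Assemble the five-token message defined in Sect. 3.1."""
    return "-".join([message_type, requested_ip, source_ip,
                     destination_ip, TERMINATOR])

def parse_message(packet: str):
    """Split a received packet back into its tokens and check the terminator."""
    message_type, requested_ip, source_ip, destination_ip, term = packet.split("-")
    if term != TERMINATOR:
        raise ValueError("malformed packet: missing '$' terminator")
    return message_type, requested_ip, source_ip, destination_ip

# Example: a WakeOnLan request (code 09 in Table 1) for machine 10.0.0.7.
pkt = build_message("09", "10.0.0.7", "10.0.0.2", "10.0.0.7")
```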

Message Type. Message types are predefined codes that help determine the type of request. The message codes are shown in Table 1:



Fig. 2. Message packet architecture

Table 1. Defined message type and message code

Message type      Message code
                  00
                  01
Connect           02
End Connection    03
Error             04
Ok                05
Sleep Status      06
TakeOver          07
Wake Status       08
WakeOnLan         09

Table 2 shows that RMp and RMs keep swapping when required (such as on failure of RMp). T1, T2, T3, …, Tn denote the time intervals after which a new Resource Manager takes over.

Table 2. Different machine states

Machine  T1    T2    T3    T4    T5    …  Tn
A        RMp   N     N     N     A     …  RMp
B        N     A     N     N     N     …  RMs
C        RMs   RMp   N     A     RMs   …  A
D        A     A     RMp   N     N     …  A
E        A     A     RMs   RMp   N     …  A
F        A     S     A     RMs   RMp   …  A
G        A     S     A     N     N     …  A
H        A     S     S     S     RMs   …  A
I        A     S     S     S     S     …  A



3.2 Architecture Explanation

Some use-cases of the proposed architecture are as follows:

Requested Machine is in EC Mode and RMp is in Working State. The remote user's request is forwarded directly to the particular machine without interrupting RMp, as shown in Fig. 3.

Fig. 3. Generalized flow of proposed architecture

Requested Machine is in EC Mode and RMp is in Down State. Whether RMp is down or active does not matter; the request is forwarded to the machine in EC mode as shown in Fig. 3.

Requested Machine is in ES Mode and RMp is in Working State. If a remote user wants to access a machine that is in ES mode, the remote user's request is forwarded to the requested machine via RMp, as shown in Fig. 3. If there is no activity on a machine for a specific amount of time, it switches its mode from EC to ES, but before entering ES mode it notifies RMp about its new status, and RMp saves the status in the database. The intermediate device updates its routing table and forwards the remote user's request to RMp, as it is not possible to access a machine that is in ES mode directly. When RMp receives a request for a machine that is in ES mode, it sends a WOL message to that machine so that it can switch its mode from ES to EC. If there is no response from the machine after three WOL messages, RMp considers the machine unavailable in the network and notifies the remote user. Otherwise, the machine shifts to EC mode and the remote user's request is forwarded to it.
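The WOL step in this use case can be sketched as follows (Python). The magic-packet layout, six 0xFF bytes followed by sixteen copies of the target MAC address sent over UDP broadcast, is the standard Wake-on-LAN scheme referenced in [18]; the three-attempt loop mirrors the behaviour described above, and all function names are illustrative assumptions rather than the paper's actual code.

```python
import socket
import time

def send_magic_packet(mac: str, broadcast: str = "255.255.255.255", port: int = 9):
    """Send a standard Wake-on-LAN magic packet to the given MAC address."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    payload = b"\xff" * 6 + mac_bytes * 16
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(payload, (broadcast, port))

def wake_machine(mac: str, is_awake, attempts: int = 3, wait_seconds: int = 5) -> bool:
    """Try up to three WOL messages, as in the proposed architecture; return
    False if the machine never acknowledges (treated as 'not in the network')."""
    for _ in range(attempts):
        send_magic_packet(mac)
        time.sleep(wait_seconds)
        if is_awake():   # e.g. the Status Indicator reports EC mode
            return True
    return False
```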



Requested Machine is in ES Mode, RMp is in Down State and RMs is in Working State. When RMp is down it cannot receive requests from remote users. If RMp does not respond to the remote user's request by the third attempt, RMs takes over, plays the role of RMp and responds to the user request, as shown in Fig. 3. The new RMp then pings all the machines in the network; machines that respond to the ping are considered to be in EC (active) mode and the rest are considered to be in ES (sleep) mode.

Requested Machine is in ES Mode and both RMp and RMs are in Down State. As also shown in Fig. 3, when the intermediate device receives no response from RMp and RMs, it sends the remote user's request to the next available machine in EC mode in the network. When that machine receives the request it concludes that both RMp and RMs are down, so it takes over as RMp and appoints a new RMs. If RMp and RMs are down and all machines in the network are in ES mode, there is a pause until a machine wakes up, in its turn, in the network. This delay is the only limitation of this research work; reducing it to a minimum by analyzing and predicting the network behavior is considered future work. Some algorithms of the proposed architecture are presented below.

Algorithm: Entertaining user requests
function SleepAlert(Request, Target_Machine) while request Pack

< /body > pair of "Opening" and "Closing" Tags. In their turn, both these sections are encapsulated within a < html >< /html > pair of Tags that indicate the type of the document, acting at the same time as the "Starting" and "Ending" landmarks of the entire document. Between the two fundamental sections of a HTML/XML document, the one of particular interest as far as merging algorithms are concerned is the < body >< /body > section. It is in this section that the information people see in their browsers is contained. Naturally, the target of every merging algorithm ever



Fig. 2. An example of a fully structured HTML/XML document.

developed in the past has been to achieve as accurate a synchronization of content as possible in exactly this section. Because of its hierarchically structured nature, a HTML/XML document provides the ideal ground for applying the fundamental rule of "Nesting". This rule is very significant because it extends the structural arrangement of a HTML/XML document into a logical, mathematical one, by dictating that when we introduce an "Opening" Tag A and then another "Opening" Tag B, we must later introduce first a "Closing" Tag for B and then a "Closing" Tag for A (Fig. 2). In subsequent sections we will demonstrate that modern browsers follow the "Nesting" rule only very loosely: they are capable of producing a result with very minor differences when the rule is violated between dissimilar Tags, and an identical result when the rule is violated between similar Tags.
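The "Nesting" rule can be checked mechanically with a stack, in the same way nested parentheses are matched. The short sketch below (Python) is only an illustration of the rule itself, not part of the paper's algorithm; the regular expression assumes simplified Tags, and "Unpaired" Tags such as < br > are not handled.

```python
import re

def nesting_violations(html: str):
    """Return (closing_tag, expected_tag) pairs where the Nesting rule is broken."""
    violations, stack = [], []
    for match in re.finditer(r"</?([a-zA-Z][a-zA-Z0-9]*)[^>]*>", html):
        tag = match.group(1).lower()
        if match.group(0).startswith("</"):
            if stack and stack[-1] == tag:
                stack.pop()                                   # correctly nested
            else:
                violations.append((tag, stack[-1] if stack else None))
                if tag in stack:
                    stack.remove(tag)                         # recover and continue
        else:
            stack.append(tag)                                 # "Opening" Tag
    return violations

# The fragment below violates the rule: </i> must appear before </b>.
print(nesting_violations("<p><b><i>text</b></i></p>"))        # -> [('b', 'i')]
```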

3 An Algorithm Oriented Classification of HTML/XML TAGs

In this section we formulate a clear classification of HTML/XML Tags according to two main criteria: a) the syntactical structure of the Tags and b) a hierarchical order of the Tags. The second of these criteria has not previously been introduced theoretically or used practically by any of the currently existing algorithms in the field. The categorization of HTML/XML Tags into "Paired" and "Unpaired" is well known and has been in use as a classification for many years. For the benefit of the reader we recall that a HTML/XML Tag is considered "Paired" when any other code must be placed between an "Opening" Tag and its corresponding "Closing" Tag, as shown in the simple example below for the case of a text block:



The Paragraph Tag is one of the most characteristic and well-known Tags, and one that belongs to the "Paired" category, since it requires an "Opening" Tag (< p >) and a "Closing" Tag (< /p >). The block of text introduced between the "Opening" and "Closing" Tags is then automatically formatted by the browser accordingly. The same pattern is followed by all Tags belonging to this category. The "Unpaired" category of Tags (also known as "Singular" or "Standalone") contains those Tags that require only an "Opening" part in their syntax and no corresponding "Closing" one:

Both the Line Break and the Thematic Change Tags do not syntactically require the existence of a "Closing" Tag, and no other block of code is expected to be found directly attached to them either. In the past, further breakdowns into categories have been proposed (such as "Formatting" Tags, "Page Structure" Tags, "Control" Tags, etc.), but for the purposes of this discussion we only consider these two fundamental ones. Further, for the algorithm introduced in this paper, extended forms of "Paired" and/or "Unpaired" Tags with attributes are treated in a rather binary way. That is, a) they are considered to consist of an "Opening" Tag and a "Closing" Tag when the extension text has not been modified in the T1 & T2 documents and, b) they are considered to consist of an "Opening" Tag, a Text part and a "Closing" Tag when the extension text has been modified in either one of the two documents T1 & T2. For example, the following extended form of the "Opening" Tag for < body > is considered to represent only one Tag (with the style part ignored by the algorithm and considered merged with the main Tag label) when no modifications have been applied, while it would be considered as consisting of two parts otherwise. According to this assumption, the following lines of HTML/XML code are considered to represent an "Opening" Tag, a Text part and a "Closing" Tag in the algorithm when no modifications have been introduced in the style part. When this is not the case and modifications have been introduced, they are considered to consist of two Text parts:

The Text parts in the HTML/XML code structure are assigned a unique label by the algorithm (for convenience in handling them algorithmically). The labeling takes place in a top-down fashion and has the following form:



Finally, for better utilization of the part of the algorithm that reads a HTML/XML document, we propose a structural arrangement of the code close to the practices followed in modern programming. In such practices a "New Line" is introduced for every new command, and "Tabs" are heavily used to classify lines of commands as belonging to the same logical block of commands that collectively achieve a particular task. The role of "Opening" and "Closing" directives for logical blocks of commands (e.g. the "{" and "}" characters in C++) is naturally assigned to the "Opening" and "Closing" Tags in a HTML/XML document, so there is no need to introduce another scheme. This structural arrangement is directly equivalent to the indentation way of representing a tree structure. A fully fledged example of the proposed structural methodology for a HTML/XML document to be used with the algorithm we propose here is presented below:

So, the parts of the HTML/XML code that actually need to be considered by the algorithm can be isolated as:

4 The Three-Way HTML/XML Programming Code Merging Algorithm

An explanation of the merging algorithm, down to its operational details, is provided in this section along with examples of its performance on HTML/XML



programming code that highlight its main operational features. In general, the new approach underlying this algorithm is based on the correspondence ratio between the branched arrangements in a document's programming code. More specifically, the merging procedure consists of the following five processing stages:

Stage 1: The "Original Document" O and the two "Current Versions" T1 & T2, which contain modifications with respect to O, are provided as input to the algorithm.

Stage 2: As is the case with all algorithms in the field, in the inner structure of the algorithm presented here every HTML/XML document's programming code is represented as a Node-labeled Ordered Tree, that is, a DOM (Document Object Model) tree, with each HTML/XML Tag corresponding to a node of that tree. Since the tree is ordered, the children of every parent node are ordered as well and can be uniquely identified by their path. An example of this practice is provided below:
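A minimal way to hold such a node-labelled ordered tree, with every node identified by the path of child indices leading to it, is sketched below (Python; the class and field names are illustrative assumptions, not the paper's internal representation).

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DomNode:
    label: str                               # Tag name, or a text label such as "Text_1"
    children: List["DomNode"] = field(default_factory=list)

def node_at(root: DomNode, path: List[int]) -> DomNode:
    """Follow a path of child indices, e.g. [0, 1] = second child of first child."""
    node = root
    for index in path:
        node = node.children[index]
    return node

# <html><body><p>...</p><hr></body></html> as an ordered, node-labelled tree:
tree = DomNode("html", [DomNode("body", [DomNode("p", [DomNode("Text_1")]),
                                         DomNode("hr")])])
assert node_at(tree, [0, 0]).label == "p"    # unique identification by path
```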

So, at this stage document’s O programming code and that of documents T1 & T2 are converted into a DOM tree representation. Stage 3: The algorithm compares document’s O DOM tree with each of those for documents T1 & T2 and marks the nature of the operation that caused every difference identified in them, i.e. as the result of an “Insert” operation, as a result of a “Delete” operation, etc. There are five operations that can be used to modify a HTML/XML document’s programming code and reflected in its DOM tree arrangement. These are the operations: Delete, Move, Copy, Update and Insert. In the algorithm’s code structure they take the following form:

Stage 4: The algorithm, considering operations' priorities and Tags' priorities, proceeds to merge the two documents' tree representations per tree level, per level branch and per node. In other words, the algorithm seeks an optimum



way to map the two trees on a Node-to-Node basis. The result of each merging instance is the creation of a new node for the output tree. Since achieving effective consolidation with a merging algorithm requires an effective method for handling conflicts, the algorithmic approach allows such conflicts to be identified and treated through operations' and Tags' priorities. The priorities are introduced so that critical cases can be handled inside the algorithm. A typical such case arises when one operation removes a sub-tree (Delete operation) while another concurrent one appends a node to this sub-tree (Update operation). Undeniably, this is a characteristic case of conflict. According to the operations' priorities that are discussed below in detail, the optimum solution to this specific problem (so extensive experimentation suggests) is to assign a higher priority to the Delete operation, so that the sub-tree is removed even if the concurrent changes performed by the Update operation on this very sub-tree are lost. In general, such a practice ensures data convergence in the grand majority of similar situations. Some of the existing algorithms in the field include a kind of "Undo" function in order to avoid the loss of data by restoring the lost changes through that function. The algorithm presented here does not, by choice, include such functionality, and the initiative for such an action is left with the user. The operations' priorities are user specified but initially set to default values decided on the basis of experimental results. The user can completely overwrite these default values by introducing changes in order to achieve specific results (e.g. by having a specific pair of operations map directly to another user specified operation). The default operations' priorities for the algorithm, with the bottom pair of operations representing the input and the top representing the output after the decision, are as follows:

The above priorities arrangement indicates that the Delete operation by default possesses a higher priority than any of the other operations.
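One simple way to realise such default priorities in code is an explicit ranking that orders any two concurrent operations, with Delete ranked highest as stated above; the concrete ranking below the Delete level is an illustrative assumption chosen so that it reproduces the pairwise outcomes explained in the text that follows, and the override parameter mirrors the user customisability described earlier.

```python
# Higher number = higher default priority; Delete always wins.
DEFAULT_PRIORITY = {"Delete": 5, "Update": 4, "Insert": 3, "Move": 2, "Copy": 1}

def resolve(op_a, op_b, overrides=None):
    """Order two concurrent operations on the same sub-tree.

    Returns the operations in the order the merge should apply them, e.g.
    ('Update', 'Copy'): update the existing line first, then copy it."""
    priority = {**DEFAULT_PRIORITY, **(overrides or {})}
    first, second = sorted([op_a, op_b], key=lambda op: priority[op], reverse=True)
    if first == "Delete":
        return ("Delete",)   # the sub-tree is removed; the concurrent change is lost
    return (first, second)

print(resolve("Copy", "Update"))    # -> ('Update', 'Copy')
print(resolve("Delete", "Update"))  # -> ('Delete',)
```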



Since, in far more cases duplication of information in a HTML document has no meaning.

Since, maximum accuracy is achieved if we keep in a HTML/XML document the new content and then perform the Move with the new content included.

Since, maximum accuracy is achieved if we keep in the HTML/XML document the newly inserted line of code and then perform the Move on the old one.

Since, maximum accuracy is achieved if we first Update the existing line of code in the HTML/XML document and then Copy it to the new location.

Since, maximum accuracy is achieved if we first Insert the new line of code in the HTML/XML document and then Copy the existing one to the new location.



Since, maximum accuracy is achieved if we first Update the existing line of code in the HTML/XML document and then Insert the new line at the new location. From what has been presented so far it is now apparent that the operations' priorities function in the algorithm in a way similar to the "Time-stamps" used by other algorithms in the field. The need for this arises from the fact that in a Three-way Merge situation we cannot attach any kind of Time-stamp to operations performed on the programming code in a meaningful way, but at the same time we need to keep a logical order (free of contradicting acts) between the operations performed by the merging algorithm during the merging procedure. As with the operations' priorities, Tags' priorities are assigned default values that the user can completely overwrite by introducing changes in order to achieve specific results. Again, a specific pair of Tags can be set to map directly to another user specified Tag, if this is what is required. The default set of Tags' priorities for those Tags we use later in the algorithmic examples is as follows:



- : result decision is taken according to the “Nested Parentheses” principle.

- : result decision is taken according to the "Nested Parentheses" principle.
- : result decision is taken according to the "Nested Parentheses" principle.
- : result decision is taken according to the "Nested Parentheses" principle.
The "Nested Parenthesis" principle resembles the one used in mathematics to introduce an order of execution between calculations in an unambiguous way. The only difference, when used in the ordered trees sector, is that the role of



“Opening” and “Closing” parentheses is taken over by “Opening” and “Closing” Tags. The following is an illustration of how the “Nested Parenthesis” principle is used in the logical structure of the algorithm to maintain a strictly hierarchical ordering between tree branches.

The employment of the "Nested Parenthesis" principle safeguards the algorithm against unorthodox matching between Tags, something that has been criticized as one of the major drawbacks of many of the currently existing algorithms ("Diff3", "XyDiff", "DaisyDiff", etc.). Since all priorities set in the algorithm are fully customizable and extendible, the user can introduce similar priorities for the full extent of HTML/XML Tags, which completely covers the case of dynamic Tags very commonly found in XML code. Despite the above stated priorities, in the event of a conflict still arising the algorithm treats the situation in one of three ways: a) deciding on an action by employing the "Nested Parentheses" principle on the Tags considered, b) automatically selecting an action on the Tags and, c) marking it to be handled later by the user (for really complicated situations). Finally, the only priority rules we need to set for the nodes in the ordered tree that contain text are: (a) all "Opening" Tags' nodes possess a higher priority than text nodes, (b) all "Closing" Tags' nodes possess a lower priority than text nodes, (c) nodes containing non-similar text that need to be merged are arranged in the output tree in the priority order dictated by their original locations in their corresponding, to be merged, documents T1 & T2 and, (d) "text duplication is not permitted"; that is, whenever two nodes from the modified versions of the original document are compared and these nodes contain the same text, this text is forwarded to the output document only once.

Stage 5: The output DOM tree representation is provided and converted into HTML/XML programming code, out of which the resulting document is constructed. To conclude the discussion of the algorithmic principles of the method, and for direct comparison purposes, the operational diagrammatic form of the algorithm is presented in Fig. 3 alongside the equivalent for the "Diff3" algorithm.

5 Experimental Results and Algorithm Evaluation

This section presents experimental results with the purpose of evaluating the proposed algorithm. The attempt here is not to provide a full list of results by considering every possible situation that can be encountered while merging HTML/XML documents, but rather to present representative examples that



Fig. 3. (a) The operational diagram for the algorithm presented in this paper, (b) The operational diagram for the “Diff3” algorithm.

illustrate the operational characteristics of the algorithm. For the rest of this section we will follow a “simple first, towards more complicated examples later” approach in how we proceed with the discussion of example cases. We first consider the following simple HTML document and its corresponding DOM tree:

and then the documents T1 & T2 , which are as shown in Fig. 4. In such a case, despite the apparent simplicity of the HTML code, the algorithms that do not incorporate the “Nested Parentheses” principle are bound to return a wrongly arranged programming code for the output document, similar to the following:



The logic behind a merging algorithm that can produce such a merged document is shown in Fig. 5. Despite the fact that this programming code, if viewed through a browser, will still produce acceptable display results, it remains clearly wrongly structured.

Fig. 4. (a) The modifications including document T1 , and (b) the modifications including document T2 .

By having the “Nested Parentheses” principle incorporated into its algorithmic structure, the algorithm proposed in this paper is capable of recognizing the



Fig. 5. The algorithmic logic diagram that results in a wrongly arranged structure for the merged document.

situation and of producing a programming code with the correct arrangement for the merged document:

The logic behind the operation of the proposed algorithm when dealing with this particular instance of documents' parameters is shown in Fig. 6. Proceeding to a more complicated example, the original document O and its corresponding DOM tree are shown in Fig. 7. The modified document T1 and its corresponding DOM tree are shown in Fig. 8. In the DOM tree arrangement for this document, for ease of comparison, the operations that were performed on the original during the first modification stage are marked in green. The modified document T2 and its corresponding DOM tree are shown in Fig. 9. In the DOM tree arrangement for this document, again, all the operations that were performed on the original during the second modification stage are marked in green.



Fig. 6. The algorithmic logic diagram that results in a correctly arranged structure for the merged document.

Fig. 7. (a) The original document O, and (b) its corresponding DOM tree arrangement.



Fig. 8. (a) The modified document T1 , and (b) its corresponding DOM tree arrangement.

Fig. 9. (a) The modified document T2 , and (b) its corresponding DOM tree arrangement.



Fig. 10. The output (merged document) of the algorithm proposed in this paper for the second, more complicated instance of an original document O.

As becomes apparent from observing Fig. 8 and Fig. 9, the modifications introduced produce a far more complicated situation this time, extending from the text fields of the original document to the extension parts of some of its Tags. The algorithm's final output is presented in Fig. 10. Finally, the logic behind the operation of the proposed algorithm on this last, more complicated document is shown in Fig. 11. Tags marked in red in document T2's programming code signify Tags that are ignored by the algorithm during the final stages of the merging procedure because, with all the earlier Tags matched and paired according to the "Nested Parentheses" principle, they correspond to Tags that produce no equivalent in the output document. At this point we highlight that not only is the merged document mapped onto correctly structured programming code, but also (as becomes apparent from Fig. 11) the merging approach materialized by the algorithm is simple, straightforward in its implementation and capable of producing decent results in a variety of different merging situations.



Fig. 11. The logic behind the algorithm’s decision in merging the second example document presented in structural diagrammatic form.

6 Conclusion

One of the most fundamental problems restricting the optimum operation of many of the currently available Three-way Merge algorithms for HTML/XML documents is their strong dependency, when it comes to decision making, on the original document O. In this paper, a new algorithm is proposed for the Three-way



Merge approach that completely eliminates the need for the original document O to be involved in the merging procedure. The proposed algorithm generates a merged document that renders well in the browser and is encapsulated within properly structured HTML/XML programming code. The algorithm proceeds by gathering all important information for each of the Tags directly from the modified input documents T1 & T2. This characteristic, fundamental to the algorithm, makes it capable of achieving far better results than those produced by the existing algorithms in its category, and also makes it suitable to be used primarily (but not exclusively) as a programming assisting tool. A key aspect of the algorithm presented in this paper is also its level of flexibility, provided by the configurable operations' and Tags' priorities together with the "Nested Parentheses" principle. In this way, the algorithm gives the user the ability to focus on either the HTML element or the XML element features (or both), whichever is more suitable according to the nature of the documents about to be merged.

Acknowledgment. The authors would like to thank the editors and anonymous reviewers for their valuable and constructive suggestions on this paper.

References
1. Khanna, S., Kunal, K., Pierce, B.C.: A formal investigation of Diff3. In: Arvind, V., Prasad, S. (eds.) Foundations of Software Technology and Theoretical Computer Science (FSTTCS), December 2007
2. IBM Alphaworks: XML Diff And Merge Tool Home Page. http://www.alphaworks.ibm.com/tech/xmldiffmerge
3. The "DeltaXML" Project. http://www.deltaxml.com. Accessed 29 Mar 2019
4. Lindholm, T.: A three-way merge for XML documents. In: Proceedings of the 2004 ACM Symposium on Document Engineering, pp. 1–10 (2004). https://doi.org/10.1145/1030397.1030399
5. Dinh, H.: A new approach to merging structured XML files. Int. J. Adv. Res. Comput. Eng. Technol. (IJARCET) 4(5) (2015)
6. Ba, M.L., Abdessalem, T., Senellart, P.: Merging uncertain multi-version XML documents, January 2013
7. Oliveira, A., Tessarolli, G., Ghiotto, G., Pinto, B., Campello, F., Marques, M., Oliveira, C., Rodrigues, I., Kalinowski, M., Souza, U., Murta, L., Braganholo, V.: An efficient similarity-based approach for comparing XML documents. Inf. Syst. 78, 40–57 (2018)
8. Document Object Model (DOM) Level 2 Core Specification v1.0, W3C Recommendation. http://www.w3.org/TR/DOM-Level-2-Core/Overview.html
9. Matthijs, N.: HTML, The Foundation of The Web. http://www.wpdfd.com/issues/86/html_the_foundation_of_the_web/
10. Rozinajová, V., Hluchý, O.: One approach to HTML wrappers creation: using document object model tree. In: Proceedings of CompSysTech, pp. 41–41 (2009)
11. Barnard, D.: Tree-to-tree correction for document trees. http://citeseer.ist.psu.edu/47676.html
12. Cobena, G.: A comparative study for XML change detection. http://citeseer.ist.psu.edu/696350.html



13. Chawathe, S.S., Rajaraman, A., Garcia-Molina, H., Widom, J.: Change detection in hierarchically structured information. In: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Canada, pp. 493–504 (1996)
14. Cobena, G., Abiteboul, S., Marian, A.: Detecting changes in XML documents. In: Proceedings of the 18th International Conference on Data Engineering, San Jose, CA, pp. 41–52 (2002)

Traceability Framework for Requirement Artefacts

Foziah Gazzawe(B), Russell Lock, and Christian Dawson

Department of Computer Science, Loughborough University, Loughborough, UK
{F.Gazzawe,R.Lock,C.W.Dawson1}@lboro.ac.uk

Abstract. In a bid to improve requirement traceability techniques, a framework is presented which aims to clarify the links between artefacts, the stakeholders who deal with the software, SDLC models, and their stages by providing definitions and classifications. By finding the missing links using a semantic traceability approach, risk within software projects is minimised and the process better understood. Identifying the links also improves traceability and thereby supports the software development lifecycle. The links found between the artefacts, stakeholders, and SDLC are stored in an ontology so that they can be put to use in a framework. This paper discusses why a conceptual framework is a suitable choice for clarifying the links found. It also discusses the design of this framework, including its features and process. A description of why, to whom, and how this framework is of benefit is provided. The potential contribution of the framework and its usefulness are also explained through interviews carried out at a target company, which highlight where the developed tool could be improved as well as the advantages it provides. This study thus provides an important asset applicable to all sectors of software development. Keywords: Requirement traceability · Design requirement artefacts · Traceability framework · Artefacts link · Mapping the requirement artefacts

1 Introduction

The main objective of this research is to build a framework for traceability to increase support for software developers. The creation of a traceability framework makes it easier to understand the relationships that exist between software design, implementation, and requirements. Requirements and architectural frameworks enable software developers to understand the links that exist between traceability and requirement artefacts [7]. The framework works by analysing and reasoning over information, and communicating between the different aspects through links [2]. The framework to be developed is aimed at supporting software development in smaller and medium sized organizations with poor traceability. Smaller organizations require more support due to their limited budgets, as purchasing a traceability tool would be very costly for them. Hence, this free and open-source tool will help minimize the risks, improve traceability, and deliver a high-quality project. The requirement artefacts software framework would be useful throughout all the SDLC stages. A clearer understanding of relationships aids with the understanding of:

• Requirement analysis
• Design
• Development and Implementation
• Testing
• Maintenance

The application of conceptual frameworks enables software engineers to understand the links that occur between artefacts and the iterations involved in any given design phase during software development processes. In [9], the author asserts that a traceability framework for requirements identification is an important tool for identifying the relationships that exist in software development processes and the components involved. It is a platform which identifies possible errors in traceability and requirement artefact recognition and corrects them in time, before the completion of the software development process [4]. Although this is advantageous, the tool is hard to apply in practice because of its limited scope and use boundaries. In [11], the authors showed that the dependencies between software development components and their subcomponents are usually outlined by traceability framework designs, so that stakeholders are able to filter the relationship information that meets their needs in software development. In [12], the authors are also of the view that for an effective traceability framework to be designed for the purpose of viewing traceability relationships, three main aspects must be considered. The first is the need to automate the maintenance of traceability relationships. Secondly, traceability relationships must be created based on their information needs. Third, traceability must be designed in such a manner that users are able to view it within familiar and common tools. Moreover, the evolution of artefacts' relationships should be made evaluable via a comprehensive framework that is easy to understand [4]. According to [4], the use of open information retrieval and integration provides a good conceptual framework for understanding the relationships that exist between traceability and requirement artefacts during software development processes. A conceptual framework presents an information integration environment guided by an open information retrieval system [12]. Using an information retrieval approach, the links between artefacts and traceability are outlined, creating a model that exhibits relationships between multiple sources. The role of analyzing design frameworks for software development and traceability is to identify the missing links between artefacts and the relevant stakeholders during the development and implementation processes, amongst other issues faced. Thus, according to [18], traceability frameworks are designed to outline relationships between different software development components as well as to overcome the problem of the heterogeneous artefacts that different tools provide [12]. The potential inconsistency often found after manually checking items for consistency leads to the recommendation of mapping mechanisms that clearly relate different artefacts. This is where the traceability framework is of benefit, as it rules out inconsistency by linking the heterogeneous artefacts together. It is important to recognize that an artefact can come in different types and forms, such as design, technical, and business requirement artefacts. The perspective proposed is to distinguish between technical implementations and business requirements, and



recognize the sub-categories of user oriented components and technical components for business requirements, and of design and implementation for technical implementations. The requirement artefacts are divided into two types based on the order of the SDLC phases, commencing with Business Requirements. The purpose of these is to adhere to the business aspects of the project, and they are further divided into User Oriented Components and Technical Components based on shared characteristics to help distinguish between them. According to [21], the User Oriented Components can include use cases or stories, for example, and the Technical Components can include system requirements. The capture of Business Requirements may take place during the Requirement Gathering (Specification Document) phase and depends on it. A second type of artefact also exists, namely Technical Software Implementation, which serves to satisfy the design aspects of the software; it is further divided into Design and Implementation components. Although [21] provides a basis for the artefact categorisation, it has a number of limitations, the main one being its ambiguity: it does not link each artefact but only groups of related artefacts, making the classification very general and lacking in detail. Hence the need for this research. The next section briefly reviews some of the main concepts and definitions gathered from secondary research in order to clarify the purpose of the framework. This links into the discussion of the framework and its design, such as its architecture and features, and makes it more comprehensible. The following section then explains the methodology used in this research, after which the findings are displayed. The paper ends with a discussion of the findings as well as a look at the evaluation carried out.

2 Literature Review

Traceability can be said to be a process used by stakeholders in a software development project to identify the relationships that exist between software artefacts. The association between these artefacts and the way they depend on each other is understood through the process of software development [15]. The ability to trace the evolution of an artefact or requirement, in both forward and backward directions, after the description of its links across its life cycle can be defined as software artefact traceability [20]. There is a strong, inseparable link between requirements and traceability, as traceability is exhibited by requirements. Requirements are the specific needs of a project that can be used to address an existing issue or the problem facing a software development project. Requirements can hardly be identified by developers without a set process to detect those needs, known as traceability. Traceability involves a series of traces, supported by tools, used to fulfil the desired properties [6]. The process of linking requirements to the issues affecting a project forms the relationship between requirements and traceability: the two are treated separately and then linked by commonly desired factors [14]. Requirement traceability is a process that is necessary in software development but has numerous setbacks and challenges in its implementation. In cases of unchecked scope creep in requirement traceability, projects tend to suffer an inevitable downstream impact on quality, cost and time management. The elicitation of software requirements helps the development team identify the challenges and problems that may block the quality



and effective completion of software. There are challenges related to link identification, designs, manual versus technological traceability, tool implementation and identification, and the process used [4]. Identifying the challenges and problems facing traceability is essential for filling the gaps of poor traceability processes as well as for ensuring that future processes improve in quality, time and cost management. Several traceability tools already exist. Rational Dynamic Object-Oriented Requirement Systems (DOORS) optimizes requirement communication, verification and collaboration in the supply chain and software management of an organization [2]. Requirements Tracing On-target (RETRO) is known to employ Information Retrieval (IR) methods in the tracing of requirements and software maintenance. The CRADLE Traceability Tool is another example; it makes the documentation process simple, defines product features, and is able to handle data modules. However, the use of these existing tools is limited. One of the major limitations is the inability to automatically define relationships between different RAs. Finding the links between the relationships manually is an iterative process which often requires a certain level of expertise, and not enough clarity is provided by the existing tools. Thus, current traceability tools do not provide clear guidelines across the different stages of the SDLC. Another limitation is that most traceability tools do not take into account risk assessment based on missing relationships between RAs, nor how it impacts the end-user product. Therefore, the tool developed within this study attempts to overcome these limitations and pave the way for a practical approach that incorporates traceability within smaller and medium sized organizations.

3 Research Methodology

This research, which is guided by the philosophy of Interpretivism, employs the qualitative methods of case study and interviews as a strategy to obtain the information required and to help develop a framework for modelling requirements traceability. The data used is cross-sectional, and prior to the primary research a comparison was made of existing techniques to help in developing the new framework and evaluating it. This research follows the inductive approach, starting with observation through the literature review and testing with the case study, and leading to the identification of the research gap. One of the objectives set includes the development of a traceability tool to aid developers in evaluating requirement artefacts, their relationships and level of risk, in line with satisfying the research's aim. This research contains a mixture of quantitative and qualitative data. One of the main contributions of the research is the experimental tool, which is a quantitative measure. However, this research also provides a literature review, case study, interviews, and an evaluation of the experimental tool using thematic analysis, which are all qualitative contributions. The methods used are summarised in Fig. 1.

4 Framework Design and Implementation

The categorization of artefacts, as mentioned, is fundamental to the basis of this framework. An example of how an artefact is linked is Technical Software Implementation, which serves to satisfy the design aspects of the software. It is further divided



Fig. 1. Flowchart of research methods

into the Design and Implementation components. The Design part has different tasks and models under it that may be used. A mock-up, for example, is a prototype that enables testing and therefore obtains feedback from users. After this is conducted, Implementation comes into play, where code is developed. All of these are different artefacts in the system, which link to each other as explained.

A. Framework Type
There are different types of frameworks that can be developed: conceptual and theoretical. According to [13], a theoretical framework is one where a single theory is implemented in order to clarify an ambiguous matter. A conceptual framework, on the other hand, is more suited to this research because it uses concepts from different theories to find an explanation for a particular subject; for instance, in this research different concepts about artefacts are gathered and then connected to each other to form relations. There are several ways in which a conceptual framework can be built, such as content analysis methods and the grounded theory method. According to [16], the content analysis method depends on setting a theory in place, as a hypothesis, before carrying out the testing for its validation. It also uses quantitative analysis, which is not suitable for this research, as conclusions need to be drawn from the data collected. In contrast, the grounded theory process starts by defining different categories through collecting information and then finding the links between them [5]. It also follows a qualitative analysis method, which is best for defining relationships. As the framework in this research aims to define the relationships between different artefacts and concepts in a software system, grounded theory is the method most fit for this purpose. In addition, [10] states that this theory is most commonly used due to its effectiveness. The aim of this paper is to present a framework which can support software developers in managing traceability. The aforementioned relationships in the ontology were created based on the analysis of secondary research to identify their types, the factors that influence them, and their common properties. An instance of these relationships can be seen in Fig. 2 below. Figure 2 shows the main classes and subclasses in the ontology. The requirement artefact class models fifteen types of requirement artefacts. The requirement artefacts are modelled as subclasses in the ontology, for instance, User Story, Data Model, and User Interface.



Fig. 2. Example of requirement artefacts and their subclasses

B. Framework Features
A feature can be defined as a characteristic that is advantageous to the user; it is one of the most critical components of a framework, according to [19]. The main features of the missing links framework are as follows:

• Defining roles for the people involved in each SDLC stage
• Categorisation of requirement artefacts
• Defining links between entities in the system
• Risk management through characteristics in the system
• Ability to customise the provided entities to allow for adaptation
• Providing guidelines for users to follow when tracing the artefact relationships

The framework presented can be used to assign functions to the people who play a key role in each stage of the SDLC, enabling the adaptability of the framework, and it is also able to set dependencies between the artefacts. This allows the categorization and linking of artefacts to be devised, providing the basis of the framework. Another essential feature of a conceptual framework is its ability to show the different types of links between entities and to allocate requirement artefacts to each stage of the SDLC. The aforementioned features include considering the different attributes and characteristics of each artefact to find the links between them, allowing the creation of more valid relations. Moreover, the framework has the ability to determine and manage risk, to identify characteristics that can improve the system framework, and to display warnings when there are unmapped entities (requirement artefacts) to help prevent issues. Furthermore, an additional key feature of the framework is the ability to reuse the entities already provided by customizing them according to the project. This allows the framework to be more adaptive and useful in different contexts.

C. Framework Development
As the framework is developed, in order to aid understanding, it is suggested that each artefact in each of the models in the ontology is broken down, decoupling them. The ontology was developed to store data and put it to use in the framework. This helps the developer distinguish between the constituent parts of the waterfall model and the constituent agile artefacts model, which makes defining the missing links easier for the software developer, thereby linking them. Some requirement artefacts aid with explaining the functions, building, and design of the software, while others are involved with the development and maintenance of the system itself. With the framework identifying the relations, developers can determine the importance of each requirement and hence prioritise their implementation [3]. The guidelines the framework provides are derived from the datasets of the ontology and assist the user in achieving tasks appropriately and minimising the errors that might be encountered. They also have a role in ensuring the types of relationships between different entities are understood, and thereby that the correct dependency properties are chosen. In addition, because the framework already provides the relationships between artefacts, the traceability process is quicker and less complicated. Furthermore, the artefacts determined at the start are turned into the deliverables of the system, achieving software development. As for the maintenance of the artefacts, the role of an artefact needs to be determined beforehand in order for it to be maintained; for example, practical artefacts need to be more heavily maintained [4]. The framework makes this easier by identifying the roles of the artefacts in advance.
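The warning behaviour for unmapped requirement artefacts described above can be pictured in a few lines of code; the sketch below (Python, with purely illustrative artefact and relationship names) simply flags every artefact that has no relationship defined to any other entity, which is the kind of missing link the framework treats as a risk.

```python
def unmapped_artefacts(artefacts, relationships):
    """Return artefacts that appear in no (source, link_type, target) relationship."""
    linked = {a for source, _, target in relationships for a in (source, target)}
    return [a for a in artefacts if a not in linked]

artefacts = ["User Story", "Data Model", "User Interface", "Test Case"]
relationships = [("User Story", "realisedBy", "User Interface"),
                 ("User Story", "informs", "Data Model")]

for artefact in unmapped_artefacts(artefacts, relationships):
    print(f"Warning: '{artefact}' is not mapped to any other entity (potential risk)")
```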



5 Findings and Results

As mentioned previously, the relationships in the ontology, which were devised from secondary research, are the basis of the framework. Hence, the process of the framework depends on retrieving information from the ontology. This was done based on ontological analysis using the Protégé software, which stores all the relationships between the artefacts of the system, ready to be retrieved through the use of the framework. The Protégé program was chosen because it was seen as the best fit to depict this framework, having many advantages such as being open source, its extensibility through plug-ins, and its ability to describe ontologies explicitly, defining where individuals belong and what the class hierarchies are [17]. As an example of how the framework process works, suppose a user wanted to find out the risks within a given configuration of requirement artefacts. They would select the entities they require, and the system would then highlight the potential risks based on the relationships existing in the ontology. This benefits the user by saving time and minimizing difficulties, since if the risk were to materialise they would be prepared. In order to delve into the actual process and understand in depth how it works, Fig. 3 provides a more detailed view.

Fig. 3. Framework process

Figure 3 shows the stages of the process of the framework. The user starts by choosing from a drop-down list of expected questions in the Web GUI. The system then continues by reasoning over the ontology in order to collect the results and send them back to the user.

A. Framework and Queries

To understand in more depth how the framework functions, the process is explained in this section (a small query sketch follows the list):

1) To start, the developer chooses from a list of questions, such as whether the requirement artefacts are complete, whether the relationships are correct, and whether the current relationships contain any risk. In addition, the user can choose from different requirement artefacts and a stakeholder list, then select a question related to the stakeholder.
2) Next, for example, the relationship between a stakeholder and requirement artefacts, or a stakeholder role, is chosen. The user then presses the simulation button and the system collects the results from the ontology.
3) If there is a relationship, the system shows the result of the user's choices and prints it as a graph or text, as appropriate.
4) If there is no relationship or link, the system gives a suggestion to the user. For example, it shows a message saying there is no direct link between your choices, please change them, or it may suggest specific requirement artefacts that do have a link. The system then returns to the earlier window (the lists page), and the user selects again and submits the choices.
5) The results are received and printed.
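To make the query steps concrete, the following is a hypothetical Python sketch of steps 2-4 using rdflib; the ontology file name, namespace and entity names are illustrative only and are not taken from the paper's tool, which drives the Protégé-built ontology from a Web GUI.

from rdflib import Graph

g = Graph()
g.parse("traceability.owl", format="xml")    # ontology exported from Protege (hypothetical file)

# Step 2: ask whether any relationship links a chosen stakeholder to a chosen artefact.
query = """
PREFIX art: <http://example.org/artefacts#>
SELECT ?relation WHERE {
    art:ProjectManager ?relation art:UseCaseDiagram .
}
"""
rows = list(g.query(query))
if rows:                                     # Step 3: show the relationship(s) found
    for (relation,) in rows:
        print("link found:", relation)
else:                                        # Step 4: suggest changing the selection
    print("There is no direct link between your choices, please change them.")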

6 Evaluation

The evaluation consisted of two parts: verification of the theory and testing of the tool. For the first, there was a need to verify the existence of traceability problems and the need for a tool. The target company was interviewed in a semi-structured format; the participants consisted of three developers and two software managers within the company. The findings demonstrate the need to develop a traceability tool to effectively utilise the theory put forward by this thesis, as the target company faces issues when attempting to make changes to a system: if something goes wrong or needs to be adjusted, they cannot go back and fix the issue and would need to start all over again. Hence, these results from the case study and interviews verify the need for a tool that would provide traceability and reduce cost and effort. The evaluation of the tool itself was carried out by having software developers in the target company test it. Interviews were conducted and the outputs thematically analysed to highlight the themes found. One of the important themes outlines the difficulties encountered while using the tool, such as trouble understanding the format and detailing of the output, as well as the sequence of the stages in the tool. With that in mind, the participants also identified a number of positives about the tool, including capturing different stakeholders' roles and responsibilities, the value of the provided guidelines and information, and the clear definition of the process. Table 1 shows more detail on the various themes found in the evaluation.

Table 1. Themes of the framework evaluation

Theme: Perspective and Concepts on Traceability
Participant (1): "It's a method for project that settle down relation between deliverables and requirement"
Participant (2): "It's a mechanisms and process to follow up and check project status and always being aware of any project changes"
Participant (3): "Traceability is a capability within a device to be allowed to trace the code during the developing any software"
Outcome: The participants pointed out different aspects of traceability; it is clear that the basis of traceability as a process was understood and that their responses could be useful towards evaluating the tool at hand.

Theme: Identification of Missing Links
Participant (1): "The tool helps me listing the project development cycle main points. It also provided me some useful information about stakeholders roles"
Participant (2): "It was useful explaining some parts really important for developing any project but I think the tool needs to make some structured process and put all details compared to the process with software verification to fill the data"
Participant (3): "Calculating the risk was not accurate when I miss important elements. Also, defining the importance of requirement artefacts was not clear for me though it calculated well the risk in the chosen elements"
Outcome: It was indicated that the process followed may need further clarification within each stage to explain the flow of the data. The last quotation also highlights specific difficulties faced when highlighting the missing requirement artefacts: first, the accuracy of the risks when important elements are missing; second, defining the importance of the requirement artefacts; and finally, the clarity of the risk calculation.

Theme: Expectation of Traceability Functions
Participant (1): "In my opinion, I expect a traceability tool is to detect all data types, the relationship between different elements, and to analyse the development team and requirements of a project"
Participant (2): "There are different functions that must be in any traceability tool such as: a way to collect requirement artefacts, and a way to include skills and experiences for people who involved in any project"
Participant (3): "My experience is to have a tool cover all phases and parts of project, such as identifying every stage, plus being dynamic in creation of LCPI, milestone, and stakeholder"
Outcome: They all had important functions to mention. The tool developed incorporates all the functions identified by the participants, and hence the responses were a helpful way of validating it.

7 Discussion

The framework developed in this research is intended to benefit both software developers and designers by identifying the links between artefacts, the people in control of the software, and SDLC models, thereby improving traceability. The framework would aid developers in implementing traceability and thus enhance the quality of the software developed. It aims to support the definition and classification of traceability and artefacts through an approach that enables semantic traceability to identify the missing links between them. The framework also provides guidelines for developers to follow when tracing the relationships between artefacts in each stage of the software lifecycle, verifying and validating them. These guidelines aim to help software designers by providing them with the necessary information. They are derived from the datasets of the ontology and assist the user in achieving the tasks appropriately and minimizing the errors that might be encountered. They also help to ensure that the types of relationships between different entities are understood, so that the correct dependency properties are chosen. In the ontology used, one of the most important features is the reasoner. As mentioned by [11], one of a reasoner's contributions is testing whether a class is a subclass of another. This makes it possible to find inconsistencies by ensuring that every class can have instances satisfying its conditions; a class that cannot have any instances is considered inconsistent. Another function of a reasoner is that it can compute the class hierarchy, so this does not need to be done manually. According to [1], a reasoner's use is vital because not only does it ensure there are no logical contradictions in the ontology, it can also use the information it has to infer more knowledge to add to its database. The restrictions placed on the classes or their properties are what allow the relations in an ontology to be inferred, as opposed to simply having a hierarchy of entities. The reasoner also aids the overall building and maintenance of an ontology by performing these tasks. In addition, because the framework already provides the relationships between artefacts, the traceability process is more efficient and less complicated than it would be without the framework. As to the maintenance of the artefacts, the role of an artefact needs to be determined beforehand in order for it to be maintained; for example, practical artefacts need to be more heavily maintained [8]. The framework makes this easier by already identifying the roles of the artefacts.

8 Conclusion and Future Work

In this paper, a conceptual framework was proposed to aid software developers in solving traceability problems and to ease the overall traceability process. This was done through the use of an ontology that stores the relations between the artefacts of the system, i.e. the people, the SDLC, and the requirement artefacts. The proposed framework is thus fundamental in gathering information about software development requirements and in specifying the links that exist between each artefact, the software, and the other artefacts. A limitation of the research is that the ontology developed is not linked to any other ontologies, so the features within it are limited to a specific type of user. In future research, there is the possibility of combining different ontologies to expand the tool and offer more features to the user, which could allow it to be applied in larger companies. In addition, this research covers a good number of artefacts and links them together; however, there may be further artefacts that could be defined and linked in future research. This would build on the ontology and allow for wider use of the framework.

References

1. Abburu, S.: A survey on ontology reasoners and comparison. Int. J. Comput. Appl. 57(17) (2012)
2. Angelopoulos, K., Souza, V.E.S., Pimentel, J.: Requirements and architectural approaches to adaptive software systems: a comparative study. In: 2013 ICSE Workshop on Software Engineering for Adaptive and Self-Managing Systems (SEAMS), pp. 23-32. IEEE, May 2013
3. Bashir, M.F., Qadir, M.A.: Traceability techniques: a critical study. In: 2006 IEEE Multitopic Conference, INMIC 2006, pp. 265-268. IEEE, December 2006
4. Bourque, P., Fairley, R.E.: Guide to the Software Engineering Body of Knowledge (SWEBOK), Version 3.0. IEEE Computer Society Press, Washington, DC (2014)
5. Charmaz, K.: Constructing Grounded Theory: A Practical Guide Through Qualitative Analysis. Sage, Thousand Oaks (2006)
6. Denney, E., Fischer, B.: Software certification and software certificate management systems. In: Proceedings of ASE Workshop on Software Certificate Management, SCM 2005, Long Beach, CA, November 2005, pp. 1-5 (2005). https://doi.org/10.1109/ase.2009.71
7. Elamin, R., Osman, R.: Towards requirements reuse by implementing traceability in agile development. In: 2017 IEEE 41st Annual Computer Software and Applications Conference (COMPSAC), vol. 2, pp. 431-436. IEEE, July 2017


8. Ghazi, P., Glinz, M.: Challenges of working with artifacts in requirements engineering and software engineering. Requirements Eng. 22(3), 359-385 (2017)
9. Stout, G.A.: Requirements traceability and the effect on the system development lifecycle (SDLC). In: Systems Development Process Research Paper, pp. 3-17 (2001)
10. Harris, I.: What does "the discovery of grounded theory" have to say to medical education? Adv. Health Sci. Educ. 8(1), 49-61 (2003)
11. Horridge, M., Simon, J., Georgina, M., Alan, R., Robert, S., Chris, W.: A practical guide to building OWL ontologies using Protégé 4 and CO-ODE tools, edition 1.3, p. 107. The University of Manchester (2011)
12. Helming, J., Maximilian, K., Helmut, N.: Towards traceability from project management to system models. In: 2009 ICSE Workshop on Traceability in Emerging Forms of Software Engineering (2009). https://doi.org/10.1109/tefse.2009.5069576
13. Imenda, S.: Is there a conceptual difference between theoretical and conceptual frameworks? J. Soc. Sci. 38(2), 185-195 (2014)
14. Jarke, M.: Requirements tracing. Commun. ACM 41(12), 32-36 (1998). https://doi.org/10.1145/290133.290145
15. Kannenberg, A., Hossein, S.: Why software requirements traceability remains a challenge. CrossTalk J. Defense Softw. Eng. 22(5), 14-19 (2009)
16. Neuendorf, K.A.: The Content Analysis Guidebook. Sage, Los Angeles (2016)
17. Noy, N.F., McGuinness, D.L.: Ontology development 101: a guide to creating your first ontology (2001)
18. Torkar, R., Gorschek, T., Feldt, R., Svahnberg, M., Raja, U.A., Kamran, K.: Requirements traceability: a systematic review and industry case study. Int. J. Softw. Eng. Knowl. Eng. 22(03), 385-433 (2012)
19. Trieloff, L.: Feature function benefit vs. feature advantage benefit (2014). https://medium.com/@trieloff/feature-function-benefit-vs-feature-advantage-benefit-4d7f29d5a70b. Accessed 12 Feb 2018
20. Flynt, J.P., Salem, O.: Software Engineering for Game Developers. Software Engineering Series. Course Technology PTR (2004)
21. Liskin, O.: How artifacts support and impede requirements communication. In: International Working Conference on Requirements Engineering: Foundation for Software Quality, pp. 132-147 (2015)

Haptic Data Accelerated Prediction via Multicore Implementation

Pasquale De Luca1(B) and Andrea Formisano2

1 Department of Computer Science, University of Salerno, via Giovanni Paolo II, Fisciano, Italy
[email protected]
2 Department of Electrical Engineering and Information Technology, University of Naples "Federico II", via Claudio, Naples, Italy
[email protected]

Abstract. The next generation of 5G wireless communications will provide very high data rates combined with low latency. Several applications in different research fields plan to exploit 5G wireless communications for their own aims. In particular, Teleoperation Systems (TS) expect to use this technology to obtain significant advantages for their architecture. The goal of these systems, as implied by their name, is to provide the user (human operator) with the feeling of presence in the remote environment where the teleoperator (robot) exists. This is achieved through the exchange of a thousand or more haptic data packets per second transmitted between the master and the slave devices. Since Teleoperation Systems are very sensitive to delays and data loss, the TS challenge is to obtain low latency and high reliability in order to improve the Quality of Experience (QoE) of users in real-time communications. For this reason data compression and reduction are required to ensure good system stability. A predictive-perceptive compression model based on the prediction error and human psychophysical limits has been adopted to reduce the data size. However, the large amount of haptic data to be processed requires very long times for compression. For this reason a parallel strategy and implementation are proposed, with experimental results confirming the gain in performance in terms of execution time.

Keywords: 5G · Teleoperation · Parallel algorithm · Prediction model

1 Introduction

In the last decade, the evolution of technological devices and the use of smartphones have caused an exponential increase in data traffic. According to AGCOM data, the traffic volume has grown from 350 PetaBytes to almost 2000 PetaBytes during the last 5 years, with an average monthly traffic per user that has increased from 1 to 4 GigaBytes. The 5th Generation Network (5G) has been developed to satisfy these demanding requirements. Several improvements achieved by this new


network technology are: high data rate, low latency, better mobility, high bandwidth, 100% coverage, and others [1]. The essential key of next-generation 5G wireless networks lies in exploiting an unused high-frequency band: the millimeter-wave (mmWave) band, ranging from 3 to 300 GHz. According to Shannon's second theorem, channel capacity grows with increasing bandwidth; consequently it will be possible to achieve a higher data rate (up to 10 Gb/s). The only concern about mmWaves relates to health: it is thought that radiation at such high frequencies can cause damage to health. Unlike ionizing radiation (ultraviolet, X-ray, and gamma radiation), which has been linked to cancer due to the displacement of electrons during exposure, mmWave radiation is non-ionizing because the photon energy (ranging from 0.1 to 1.2 meV) is not nearly sufficient to remove an electron from an atom or a molecule (typically 12 eV is required) [2]. Rather than creating problems, 5G can greatly contribute to improving human health. This work is based on one of the most important 5G applications in the biomedical field: telesurgery (Tactile Internet) [3]. The latter belongs to the teleoperation domain, a class of systems that allow a user's immersion in a remote environment through the exchange of multimodal information: a combination of audio, video and haptic signals. Multi-modal Telepresence and Teleaction (TPTA) systems for haptic telemanipulation, also known as telehaptic systems, are shown in Fig. 1. These systems usually consist of a human operator using a haptic interface (master device) to transmit motion data (position and velocity) over a bilateral communication channel. At the other side, a teleoperator receives the data and responds with feedback from the remote environment in the form of kinesthetic data (such as the reflected force). Teleoperation's main challenge is to provide ultra-low-delay communication over the network, enabling real-time interactions across wireless networks [4]. Communication of haptic data for teleoperation systems imposes strong demands on the communication network: large amounts of data have to be transmitted with a sampling rate of 1 kHz or even higher (1000 or more packets per second) [5]. Therefore, data processing and compression solutions, using perceptive and predictive methods, are employed to reduce the data size. A lossless compression is performed: after a small initial amount of data has been transmitted, the prediction error (the difference between the previous and the next sample) is calculated, quantized to 8 bits and transmitted. The accuracy of the predicted data is very important in order to obtain good results [6]. The acquired 3D haptic data are independently adapted, analyzed, and processed along each axis x, y, z. Therefore, the aim of this work is to transform this compression algorithm from sequential to parallel by processing the 3 data components separately. Hence, thanks to the power of modern high-performance architectures [7-10], more precisely in a multicore environment [11,12], a parallel strategy and implementation for solving the previous problem have been developed. Several experimental results will show the gain in performance achieved by our parallel strategy. The rest of the paper is organized as follows: in Sect. 2, the kinesthetic data reduction problem is presented; Sect. 3 deals with the description of both the sequential and parallel implementation details of


the algorithm; in Sect. 4 we provide tests and experimental results that confirm efficiency, in terms of performance, of the parallel implementation; finally, in Sect. 5, we draw conclusions.

Fig. 1. TPTA system

2 Kinesthetic Data Reduction

Kinesthetic data reduction techniques are mainly based on two different approaches:

1. Statistical schemes: the data's statistical redundancy is used to compress the packet size.
2. Perceptual schemes: human perception limits are exploited to reduce the packet rate over the communication network [14].

Firstly, statistical schemes follow two different kinds of compression: lossless or lossy. In a lossless compression, after the transmitter and receiver devices have exchanged enough raw data, a simple position prediction method is used. Compression is achieved by performing an exclusive-or operation between the predicted and previously predicted values, and the result is reduced to 8 important bits [15]. The whole process is shown by the block diagram in Fig. 2:

Fig. 2. Prediction error model


So this approach decorrelates the data, that is, it simply eliminates linear dependencies. It is very simple to implement and gives good results, which is why it is adopted in the most popular compression standards. Consider the following example to understand how important and useful data decorrelation is in compression algorithms. Suppose we want to transmit the following finite sequence of symbols:

x_n = \{10, 11, 11, 12, 13, 12, 13, 14, 15, 16, 17, 16\}, \quad 1 \le n \le 12

To transmit it in binary format it would be necessary to use at least 5 bits for each symbol, for a total of 60 bits. However, if you look closely at the data, you can see that they show a slowly increasing trend, that is, there is a linear dependence between them. So, following the method illustrated before, we calculate the sequence obtained as the difference between each sample and the previous one. The prediction error is given by the following sequence:

e_n = x_n - \hat{x}_n = \{1, 0, 1, 1, -1, 1, 1, 1, 1, 1, -1\}

It is possible to see that this sequence is composed of only three symbols. Thanks to this, only 2 bits are necessary to represent each element. This confirms a good compression rate with respect to the original sequence, which needs 5 bits per symbol. Lossy kinesthetic data compression has also been achieved by using the discrete cosine transform (DCT) [13], similarly to the JPEG codec, or the Wavelet Packet Transform (WPT). However, it is not a good idea to use lossy compression and data quality reduction in very sensitive processes (such as telesurgery) that require excellent accuracy and very low errors. Secondly, perceptual schemes use the data's psychophysical redundancy, that is, the human's perceptual limits. Following this method, only samples that change by more than a fixed threshold are transmitted [16]. When a value is not transmitted, the receiver reacts to the missing sample by holding the value of the most recently received one.
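As a small illustration of the differencing idea in the worked example above, the following Python sketch transmits the first sample raw and then only the prediction errors; clipping each error to a signed 8-bit range stands in for the 8-bit quantization mentioned earlier and is an assumption for illustration only.

def encode(samples):
    errors = [samples[0]]                      # first sample sent as-is
    for prev, cur in zip(samples, samples[1:]):
        e = cur - prev                         # prediction = previous sample
        errors.append(max(-128, min(127, e)))  # fit the error into 8 bits
    return errors

def decode(errors):
    samples = [errors[0]]
    for e in errors[1:]:
        samples.append(samples[-1] + e)        # add each error to the last value
    return samples

x = [10, 11, 11, 12, 13, 12, 13, 14, 15, 16, 17, 16]
enc = encode(x)
assert decode(enc) == x                        # the compression is lossless
print(enc[1:])   # [1, 0, 1, 1, -1, 1, 1, 1, 1, 1, -1]: only three symbols remain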

Fig. 3. Perceptive compression model

Figure 3 depicts the approach just explained. State-of-the-art methods of perceptual data reduction have shown that it is possible to represent the limitations of human perception through a difference threshold, called the Just Noticeable Difference (JND) [17].


The JND is the minimum amount of change in stimulus intensity needed for a perceptible increment. This threshold is based on Weber's law:

\frac{\Delta I}{I} = c

where I is the stimulation intensity, ΔI is the difference in stimulation intensity needed to be perceived (the JND), and c is a constant [18]. These thresholds differ depending on the movement scenario and the muscles involved. For example, the JND when a human operator perceives force feedback at the index finger is approximately 10%. In this paper we focus on the idea of mixing the two schemes (statistical and perceptual) and using a predictive-perceptive model to optimize the data reduction. On the sender side, the predictor generates the predicted haptic signal at every sample instant. If the prediction error is smaller than the corresponding JND, no update is triggered. Otherwise, the input sample is transmitted to the other side and is used for updating the prediction model. The model is illustrated in Fig. 4:

Fig. 4. Predictive-Perceptive Model
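The following minimal Python sketch illustrates the deadband rule with a fixed 10% Weber fraction; for brevity it uses the simple hold-last-value predictor of the perceptual scheme in Fig. 3, whereas the paper's predictive-perceptive model would use the AR predictor of Sect. 2.1 instead, and the sample values below are invented for illustration.

def deadband_stream(samples, jnd_fraction=0.10):
    sent = []
    last_tx = samples[0]
    sent.append((0, last_tx))                 # always transmit the first sample
    for n, value in enumerate(samples[1:], start=1):
        error = abs(value - last_tx)          # predictor = hold last transmitted value
        if error > jnd_fraction * abs(last_tx):
            last_tx = value                   # error exceeds the JND: trigger an update
            sent.append((n, value))
    return sent

forces = [1.00, 1.02, 1.05, 1.08, 1.25, 1.26, 1.50]
print(deadband_stream(forces))   # only the perceptually relevant updates are transmitted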

2.1 Prediction Model

The proposed algorithm for motion and force prediction is based on an autoregressive (AR) model. The autoregressive model is a linear prediction model that attempts to predict future outputs of a system based on previous output values [19]. The generic haptic sample is modeled by the vector v ∈ {P, F}, where P and F denote position and force haptic information respectively along the axes. So, each sample is a six-component vector:

v = (P_x, P_y, P_z, F_x, F_y, F_z)

Mathematically, the AR model is based on the following:

v(n) = \sum_{i=1}^{m} \Phi_{m,i}\, v(n-i) + \varepsilon_v(n)        (1)


where Φ_{m,i} are the coefficients of the AR model of order m and ε_v(n) denotes a zero-mean white Gaussian noise. We consider that the current vector v(n), independently of the force type, can be modeled from previous vectors and their rate of change. Therefore, the acceleration of the motion and the snap (second derivative of the force) have to be modeled with a first-order AR model (m = 1):

\ddot{v}(n) = \beta\, \ddot{v}(n-1) + \varepsilon_v(n)        (2)

where \ddot{v}(n) corresponds to the acceleration (or snap). The relation between \ddot{v}(n) and v(n) is given by the second derivative performed through the "first difference" between two consecutive samples:

\dot{v}(n) = v(n) - v(n-1) \;\rightarrow\; \ddot{v}(n) = \dot{v}(n) - \dot{v}(n-1) = v(n) - 2v(n-1) + v(n-2)

Therefore, the second derivative normalized to the time interval Δt between two samples is:

\ddot{v}(n) = \frac{v(n) - 2v(n-1) + v(n-2)}{\Delta t}        (3)

where Δt = 1/f_s, with sampling rate of data acquisition f_s = 1000 Hz (Δt = 1 ms). From expressions (2) and (3), the following third-order AR model is obtained:

v(n) = (2 + \beta)\, v(n-1) + (-1 - 2\beta)\, v(n-2) + \beta\, v(n-3) + \varepsilon_v(n)        (4)

Therefore, from Eq. (1) and Eq. (4), the third-order AR model coefficients are:

[\Phi_{3,1} \;\; \Phi_{3,2} \;\; \Phi_{3,3}] = [(2+\beta) \;\; (-1-2\beta) \;\; \beta]        (5)

As shown in Eq. (4), it is possible to calculate the current value v(n) through the previous three samples, the coefficient β and the Gaussian noise ε_v(n). Conditional maximum likelihood estimation is used to determine the adaptive AR model coefficients Φ_{m,i}. The objective is to compute the values Φ_{m,i} and the variance σ² of the zero-mean stationary Gaussian noise ε_v(n) that maximize the conditional likelihood function. The generic v(n) is modeled as a Gaussian random sequence:

v(n) \sim N\!\left(\sum_{i=1}^{m} \Phi_{m,i}\, v(n-i)\,;\; \sigma^2\right)

Therefore, the likelihood function is:

L\left(v \mid \Phi_{3,1}, \Phi_{3,2}, \Phi_{3,3}, \sigma^2\right) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{N-3} \prod_{n=4}^{N} \exp\!\left(-\frac{\left(v(n) - \sum_{i=1}^{3} \Phi_{3,i}\, v(n-i)\right)^2}{2\sigma^2}\right)        (6)

where N denotes the number of observations [20]. Consequently the logarithmic likelihood function is as follows:

\Lambda\left(v \mid \Phi_{3,1}, \Phi_{3,2}, \Phi_{3,3}, \sigma^2\right) = -\frac{N-3}{2}\ln(2\pi) - \frac{N-3}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{n=4}^{N}\left(v(n) - \sum_{i=1}^{3}\Phi_{3,i}\, v(n-i)\right)^2        (7)

In order to reduce the computational requirements of this procedure, as seen before in Eq. (5), it is possible to represent the coefficients of the third-order AR model in terms of the single coefficient β. Hence, the logarithmic likelihood function of the third-order AR model illustrated in Eq. (7) can be reduced to the following logarithmic likelihood function of a first-order AR model:

\Lambda\left(v \mid \beta, \sigma^2\right) = -\frac{N-3}{2}\ln(2\pi) - \frac{N-3}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{n=4}^{N}\left(\ddot{v}(n) - \beta\,\ddot{v}(n-1)\right)^2        (8)

Next, the parameters σ² and β that maximize this log-likelihood function have to be calculated. To achieve this, Eq. (8) is partially differentiated with respect to β and σ², setting the derivatives equal to zero. The estimated parameters are:

\hat{\beta} = \frac{\sum_{n=4}^{N} \ddot{v}(n)\, \ddot{v}(n-1)}{\sum_{n=4}^{N} \ddot{v}(n-1)^2}        (9)

\hat{\sigma}^2 = \frac{1}{N-3}\sum_{n=4}^{N}\left(\ddot{v}(n) - \hat{\beta}\,\ddot{v}(n-1)\right)^2        (10)

It can be noticed that the AR model requires a minimum of four observations in order to be initialized and used (in our case N is set to 4).
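A minimal numpy sketch of the estimators (9)-(10) and the one-step prediction of Eq. (4) is given below; it operates on a single haptic component, and the function and variable names are illustrative, not taken from the authors' code.

import numpy as np

def fit_ar_coefficient(v):
    """Estimate beta and sigma^2 from the second differences of the samples v
    (the estimators of Eqs. (9)-(10)); v is a 1-D array with len(v) >= 4."""
    dd = v[2:] - 2 * v[1:-1] + v[:-2]          # second differences v(n) - 2v(n-1) + v(n-2)
    num = np.dot(dd[1:], dd[:-1])
    den = np.dot(dd[:-1], dd[:-1])
    beta = num / den if den else 0.0
    resid = dd[1:] - beta * dd[:-1]
    sigma2 = np.dot(resid, resid) / max(len(resid), 1)   # divides by N - 3
    return beta, sigma2

def predict_next(v, beta):
    """One-step prediction of v(n) from the last three samples, Eq. (4),
    ignoring the zero-mean noise term."""
    return (2 + beta) * v[-1] + (-1 - 2 * beta) * v[-2] + beta * v[-3]

# toy usage on a slowly varying position track
v = np.array([0.00, 0.01, 0.03, 0.06, 0.10, 0.15])
beta, sigma2 = fit_ar_coefficient(v)
print(beta, sigma2, predict_next(v, beta))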

3 Implementation

In this section our sequential and parallel implementations are explained. In order to perform an effective implementation of the previous formulas, we first developed a sequential version of the code following this scheme:


Algorithm 1. Coefficient Prediction algorithm
1: STEP 0: load dataset - pos_x, pos_y, pos_z and related forces  % forces are stored in vectors
2: compute β and σ²
3: STEP 1: compute domain decomposition
4: for all threads do
5:   copy n_loc points chunk in local memory
6:   copy coefficient in thread cache
7: end for
8: STEP 2: evaluate local coefficient
9: for all threads do
10:   for each space component do
11:     β = compute second derivative of local components (9)
12:     σ² = compute second derivative with β (10)
13:   end for
14: end for
15: STEP 3: gather all local results
16: for all threads do
17:   master thread ← result_thread
18: end for
19: STEP 4: reshape final results
20: ε = Gaussian noise
21: for each entry point do
22:   e_i = compute prediction (4)
23: end for

Despite the good accuracy achieved in our experimental tests, shown in the related section, a parallel version has been developed in order to overcome the large overhead. More precisely, the sequential version requires a very long execution time due to the big number of entry points to compute. Moreover, an ad-hoc implementation for general multicore architectures has been developed.

3.1 Parallel Implementation

Our parallel implementation is based on a domain decomposition approach. More precisely, the data associated with the problem is decomposed, and each parallel task works on a portion of the data. In STEP 0 we load the dataset composed of spatial and force components. Each point is represented by three spatial components and the related forces (e.g. for P1 we have x1, y1 and z1 with fx1, fy1 and fz1). In the first execution the first three observations do not change, in order to preserve their state for the next prediction. In STEP 1 we perform a decomposition of the problem's domain by splitting the global domain among the threads. Every chunk is stored in the local memory of a thread by using an ad-hoc memory allocation strategy. Buddy memory allocation has been used as the allocation strategy on the local heap of each


thread. In this strategy the memory is divided into equal chunks in order to satisfy all requests. In order to achieve a good distribution of the sub-domains, both the start and end points are computed from the index of each thread multiplied by n_loc, the size of each subdomain. The starting position is given by:

start = thread_index * n_loc

and the last position by:

last = (thread_index + 1) * n_loc

Afterwards, a copy of the chunk into the local thread memory is made, in order to perform a safe computation on local data and preserve memory consistency. In STEP 2 we compute the evaluation of the coefficients by using the global coefficients σ² and β, which are computed for each thread. A suitable parallel work balancing has been adopted by exploiting the power of the OpenMP framework. Each thread builds locally the space component computed by formula (4). In STEP 3 we gather all local results of each thread to the master thread, which waits until the overall work has been completed by the other threads. Following this strategy we obtain good asynchronous parallelism. In the last STEP 4 we perform the final prediction by using the parallel routines of OpenMP that exploit the multicore architecture. More precisely, the for-loop computes the Gaussian noise for each entry point in parallel. Finally, the prediction is computed by using the first three observations with the last computed coefficients combined with the Gaussian noise. A minimal sketch of this decomposition is given below.
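The sketch below mirrors the domain-decomposition idea of STEPs 1-3 in Python with a process pool; the paper's actual implementation uses OpenMP threads on a multicore CPU, so this is only an analogous illustration with invented data and chunk sizes.

from concurrent.futures import ProcessPoolExecutor
import numpy as np

def estimate_beta(chunk):
    """Estimate the local AR coefficient beta from the second differences of one chunk."""
    dd = np.diff(chunk, n=2)                      # second differences v(n) - 2v(n-1) + v(n-2)
    num = np.dot(dd[1:], dd[:-1])
    den = np.dot(dd[:-1], dd[:-1])
    return num / den if den else 0.0

def parallel_betas(samples, n_threads=4):
    """STEP 1: split the samples into n_threads chunks of n_loc points each;
    STEP 2: estimate beta locally on each chunk; STEP 3: gather the local results."""
    n_loc = len(samples) // n_threads
    chunks = [samples[t * n_loc:(t + 1) * n_loc] for t in range(n_threads)]
    with ProcessPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(estimate_beta, chunks))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pos_x = np.cumsum(rng.normal(size=4000))      # synthetic 1-D position track
    print(parallel_betas(pos_x))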

4 Results

Our parallel algorithm was developed and its executions tested on two Intel Xeon E5-2609v3 CPUs with 6 cores each, 1.9 GHz, 32 GB of RAM, 4 memory channels and 51 Gb/s memory bandwidth. We use the OpenMP library [21] for UNIX systems: a multicore framework for executing parallel algorithms through a simple interface.

4.1 Performance Analysis

In this section, experimental results are shown in order to confirm our main contribution. Table 1 reports several executions of our software, varying the input size and the number of threads. The execution time of the parallel version has improved over the sequential version thanks to an ad-hoc memory allocation of each data structure. We observe that the execution time decreases even when using just two threads. The comparison of parallel execution times with the sequential version confirms the strong performance obtained by parallelizing the main algorithm. Figure 5 highlights the high speed-up obtained by executing the parallel software with respect to the sequential version.


Table 1. Execution times in seconds (s) achieved by varying both number of threads t and size of the problem N . N

Serial time (s) Parallel time (s) 2 4 8 2

6.2 × 10

2

4.1 × 10

1.3 × 103 3.19 × 103 2.5 × 103 4

1.2 × 10

2.8 × 103 5.1 × 103

12

64.12

35.32

20.17

92.38

73.15

52.12 12.38

8.24

127.53

98.94

73.96 25.41

201.98 145.26 101.37 43.34

Fig. 5. Speed-up evaluation

decomposition applied performs the reducing of execution times. Several bottleneck are again present. In order to delete several overhead problems and related issue, in 5 different solutions are proposed.

5

Conclusions

In this work a parallel strategy for motion prediction in haptic media has been proposed. We propose this approach in order to allow telehaptic communications until now not used due latency issues. The gain of performance between our implementation and sequential version based on canonical algorithm confirms our main contribution in this paper. For next aims and idea we suggest a GPU approach to improve the performance of this method useful in several teleoperation systems.

References 1. Agiwal, M., Roy, A., Saxena, N.: Next generation 5G wireless networks: a comprehensive survey. IEEE Commun. Surv. Tutor. 18(3), 1617–1655 (2016) 2. Wu, T., Rappaport, T.S., Collins, C.M.: Safe for generations to come: considerations of safety for millimeter waves in wireless communications. IEEE Microwave Mag. 16(2), 65–84 (2015)

120

P. De Luca and A. Formisano

3. de Mattos, W.D., Gondim, P.R.L.: M-health solutions using 5G networks and M2M communications. IT Prof. 18(3), 24–29 (2016) 4. Antonakoglou, K., Xu, X., Steinbach, E., Mahmoodi, T., Dohler, M.: Toward haptic communications over the 5G tactile internet. IEEE Commun. Surv. Tutor. 20(4), 3034–3059 (2018) 5. Lawrence, D.A.: Stability and transparency in bilateral teleoperation. IEEE Trans. Robot. Autom. 9(5), 624–637 (1993) 6. Cuomo, S., Farina, R., Galletti, A., Marcellino, L.: An error estimate of Gaussian recursive filter in 3Dvar problem. In: 2014 Federated Conference on Computer Science and Information Systems, FedCSIS 2014, art. no. 6933068, pp. 587–595 (2014) 7. De Luca, P., Galletti, A., Giunta G., Marcellino, L., Raei, M.: Performance analysis of a multicore implementation for solving a two-dimensional inverse anomalous diffusion problem. In: Proceedings of NUMTA2019, The 3rd International Conference and Summer School. Lecture Notes in Computer Science (2019) 8. De Luca, P., Galletti, A., Ghehsareh, H.R., Marcellino, L., Raei, M.: A GPU-CUDA framework for solving a two-dimensional inverse anomalous diffusion problem. In: Advances in Parallel Computing. IOS Press (2020) 9. De Luca, P., Fiscale, S., Landolfi, L., Di Mauro, A.: Distributed genomic compression in mapreduce paradigm. In: Montella R., Ciaramella A., Fortino G., Guerrieri A., Liotta A. (eds) Internet and Distributed Computing Systems, IDCS 2019. Lecture Notes in Computer Science, vol. 11874. Springer, Cham (2019) 10. Cuomo, S., De Michele, P., Galletti, A., Marcellino, L.: A GPU parallel implementation of the local principal component analysis overcomplete method for DW image denoising. In: Proceedings - IEEE Symposium on Computers and Communications, 2016-August, art. no. 7543709, pp. 26–31 (2016) 11. Cuomo, S., Galletti, A., Marcellino, L.: A GPU algorithm in a distributed computing system for 3D MRI denoising. In: 2015 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), pp. 557–562. IEEE (2015) 12. De Luca, P., Galletti, A., Marcellino, L.: A Gaussian recursive filter parallel implementation with overlapping. In: 2019 15th International Conference on Signal Image Technology & Internet Based Systems. IEEE (2019) 13. Damore, L., Campagna, R., Galletti, A., Marcellino, L., Murli, A.: A smoothing spline that approximates Laplace transform functions only known on measurements on the real axis. Inverse Prob. 28(2), 025007 (2012) 14. Nasir, Q., Khalil, E.: Perception based adaptive haptic communication protocol (PAHCP). In: 2012 International Conference on Computer Systems and Industrial Informatics (2012) 15. You, Y., Sung, M.Y.: Haptic data transmission based on the prediction and compression. In: 2008 IEEE International Conference on Communications (2008) 16. Steinbach, E., Hirche, S., Kammerl, J., Vittorias, I., Chaudhari, R.: Haptic data compression and communication. IEEE Sig. Process. Mag. 28(1), 87–96 (2011) 17. Nitsch, V., Farber, B., Geiger, L., Hinterseer, P., Steinbach, E.: An experimental study of lossy compression in a real telepresence and teleaction system. In: 2008 IEEE International Workshop on Haptic Audio visual Environments and Games (2008) 18. Weber, E., De Pulsu, R.: Annotationes Anatomicae et Physiologicae. In: Koehler, C.F. (ed.) Auditu et Tactu, Leipzig, Germany (1834). https://books.google.co.uk/ books?id=bdI-AAAAYAAJ

Motion and Force Accelerated Prediction

121

19. Sakr, N., Georganas, N.D., Zhao, J., Shen, X.: Motion and force prediction in haptic media. In: 2007 IEEE International Conference on Multimedia and Expo (2007) 20. Shanmugan, K.S., Breipohl, A.M.: Random Signals: Detection, Estimation, and Data Analysis. Wiley, New York (1988) 21. https://openmp.org

Finding the Maximal Independent Sets of a Graph Including the Maximum Using a Multivariable Continuous Polynomial Objective Optimization Formulation Maher Heal(B) and Jingpeng Li Department of Computing Science and Mathematics, University of Stirling, Stirling, UK {maher.heal,jli}@cs.stir.ac.uk

Abstract. We propose a multivariable continuous polynomial optimization formulation to find arbitrary maximal independent sets of any size for any graph. A local optima of the optimization problem yields a maximal independent set, while the global optima yields a maximum independent set. The solution is two phases. The first phase is listing all the maximal cliques of the graph and the second phase is solving the optimization problem. We believe that our algorithm is efficient for sparse graphs, for which there exist fast algorithms to list their maximal cliques. Our algorithm was tested on some of the DIMACS maximum clique benchmarks and produced results efficiently. In some cases our algorithm outperforms other algorithms, such as cliquer. Keywords: Independent set · Continuous optimization Maximal cliques · Sparse graphs

1

· MATLAB ·

Introduction

The maximum independent set problem and the maximal independent set problem of a certain size and the related problems of maximum/maximal cliques are important problems in combinatorial optimization. They have many applications in diverse range of domains such as computer vision/pattern recognition, information/coding theory, molecular biology and scheduling. As the problems are NP-hard, it is unlikely there will be an ultimate solution to the problems unless P = NP. However, many algorithms and heuristics were proposed to solve the problems for certain graphs [1–4]. In this paper we confine ourselves with quadratic and continuous programming formulations to find the maximum(maximal) independent set(s). We propose new quadratic and continuous polynomial formulations to find these graph invariants. Our formulations are most suitable for sparse graphs, since we need to list the maximal cliques of the graph first to find the maximum(maximal) independent set(s). Second the nonlinear optimization solvers - we used matlab - are more efficient c Springer Nature Switzerland AG 2020  K. Arai et al. (Eds.): SAI 2020, AISC 1228, pp. 122–136, 2020. https://doi.org/10.1007/978-3-030-52249-0_9

Polynomial Formulation

123

for sparse graphs. The paper is organized as follows: Sect. 1 is the introduction, Sect. 2 is the literature review of mainly continuous optimization algorithms to solve the independent set problem, Sect. 3.1 explains our formulation using a quadratic objective function, Sect. 3.2 extends that formulation to multivariable polynomial objective function formulation, Sect. 4 gives a second proof to our main result, Sect. 5 gives a geometric third proof to our formulation, Sect. 6 gives examples and extensions to our formulation, Sect. 7 are results obtained by applying our algorithm to DIMACS list of clique benchmarks and finally Sect. 8 is the conclusion.

2

Literature Review

First of all, assume we have a finite graph G(N , L). N is the set of N vertices and L is the set of L edges. Associate with each vertex i a continuous variable θi . Shor [5] proved that the binary formulation in Eq. 1 is equivalent to the following quadratic optimization formulation: max

θ1 + θ2 + ... + θN

θi θj = 0, ∀(i, j) ∈ L θi2

− θi = 0, i = 1, 2, ..., N

and reported good computational results. In [6] Motzkin and Straus found a noteworthy relation between the maximum clique of a graph and the following quadratic programming problem. They proved that the global optima of: 1 T θ AG θ 2 eT θ = 1

max f (θ) =

θ≥0 is given by 1 1 (1 − ) 2 ω(G) where ω(G) is the clique number of graph G, AG is the adjacency matrix of the graph and e is an N dimensional vector of all 1s. Harant et al. [7,8] proved the following continuous and quadratic formulations about the independence number of a graph. First the continuous polynomial formulation is given by: α(G) =

max

0≤θi ≤1,i=1,...,N

F (θ) =

max

0≤θi ≤1,i=1,...,N

N 

(1 − θi )

i=1

Π

(i,j)∈L

θj

and the quadratic formulation is given by: α(G) =

max

0≤θi ≤1,i=1,...,N

H(θ) =

max

0≤θi ≤1,i=1,...,N

N   ( θi − θi θj ) i=1

(i,j)∈L

124

M. Heal and J. Li

where α(G) is the independence number of the graph. It is clear the Harant formulations are optimizations over a unit hypercube. Other formulations reported in the literature over a hyper sphere, such as [9]. They proved that if θ¯ is a solution to the following optimization problem: N  1 θi − 1)2 V (k) = min θT AG θ + ( 2 i=1

subject to:

N 

θi2 ≤

i=1

1 k

θ≥0 then V (k) = 0 if there exists an independent set I in G such that |I| ≥ k.

3

The Formulation

We state here the main result that sets out the main body of our algorithm. First, we prove the result for a quadratic programming formulation, then we extend it to a multivariable continuous polynomial programming formulation. Again as stated in Sect. 2, assume we have a finite graph G(N , L). N is the set of N vertices and L is the set of L edges. Associate with each vertex i a continuous variable θi . Before we state our formulation, it is good to recall a binary programming formulation to find the maximum independent set of any graph. It is well known that the solution of the following binary programming optimization problem yields the maximum independent set:  max θ1 + θ2 + ... + θN θi ≤ 1 at each maximal clique θi ∈ {0, 1}i = 1, 2, ..., N 3.1

(1)

Quadratic Formulation

A global solution of the quadratic optimization problem 

θi ≤ 1

max Ψ at each maximal clique 0 ≤ θi ≤ 1

(2)

2 is the independence number of the graph and the where Ψ = θ12 + θ22 + ... + θN solution vector is a binary vector of 1s and 0s. Moreover, the vertices that form a maximum independent set, have their θ variable equals 1 and all other θs equal to 0. In addition, a local maxima of Ψ under the same conditions is a binary

Polynomial Formulation

125

vector such that the θ variable equals to 1 for vertices that form a maximal independent set and 0 for the rest. Proof: We will prove that a local maxima is a maximal independent set with a solution / the vector that has θi = 1 ∀i ∈ the maximal independent set and θi = 0 ∀i ∈ maximal independent set. Let i = 1, 2, ..., p be a maximal independent set. We need to prove θi = 1 for i = 1, 2, ..., p and θi = 0 for i = p + 1, ..., N is a local maxima. Recall the definition of a local maxima of a multivariable function f (x1 , x2 , ..., xn ). (x∗1 , x∗2 , ..., x∗n ) is a local maxima if ∃ > 0 such that f (x1 , x2 , ..., xn ) ≤ f (x∗1 , x∗2 , ..., x∗n )∀ |x1 − x∗1 | ≤ , |x2 − x∗2 | ≤ , ..., |xn − x∗n | ≤  As each 0 ≤ θi ≤ 1, we need to prove that ∃ such that for 0 <  ≤ 1 and at 1 −  ≤ θi ≤ 1, i = 1, ..., p 0 ≤ θi ≤ , i = p + 1, ..., N Ψ is less than that Ψ at θi = 1, i = 1, ..., p and θi = 0, i = p + 1, ..., N . However, since {1, ..., p} is a maximal independent set, then each j ∈ {p + 1, ..., N } must be connected to at least one of {1, ..., p} and each maximal clique contains one and only one of {1, ..., p}. Given that  θi ≤ 1 at each maximal clique and taking 0 ≤ θi ≤ 1 − Δ, i = 1, ..., p, 0 < Δ ≤ 1, we must have 0 ≤ θi ≤ Δ, i = p + 1, ..., N . Now 2 Ψ = θ12 + ... + θN ≤ (1 − Δ)2 + ... + (1 − Δ)2 + Δ2 + ... + Δ2       N-p times

p times

Three cases are considered now: Case I: p = N − p Ψ ≤ (1 − Δ)2 + Δ2 + ... + (1 − Δ)2 + Δ2    p times

Since (1 − Δ)2 + Δ2 ≤ 1 because a2 + b2 ≤ (a + b)2 for positive a and b, we have Ψ ≤ p. We can take  any value between 0 and 1 inclusive. Case II: N − p < p 2

2

2

2

2

2

Ψ ≤ (1 − Δ) + Δ + ... + (1 − Δ) + Δ + (1 − Δ) + ... + (1 − Δ) ≤ N − p + 2p − N ≤ p       N-p times

2p-N times

again we can take  any value between 0 and 1 inclusive.

126

M. Heal and J. Li

Case III: N − p > p Ψ ≤ (1 − Δ)2 + Δ2 + ... + (1 − Δ)+ Δ2 +(1 − Δ)2 + (N − 2p + 1)Δ2    p-1 times 2 for Δ ≤ N −2p+2 we have Ψ ≤ p. Taking the limit as Δ → 0, we have Ψ → p , θi → 1, i = 1, ..., p and θi → 0, i = p + 1, ..., N . Recalling the definition of the limit, for each  > 0 ∃δ such that |θi − 1| ≤ , i = 1, ..., p and |θi − 0| ≤ , i = p + 1, ..., N for all |Δ − 0| ≤ δ, or 1 −  ≤ θi ≤ 1, i = 1, ..., p and 0 ≤ θi ≤ , i = p + 1, ..., N 2 , so for some δ such that 0 ≤ Δ ≤ δ. Take the  that corresponds to δ = N −2p+2 accordingly we found the .

Clearly the global maxima is a maximum independent set. 3.2

Extension to Multivariable Polynomial Formulation

We show that if

r Ψ = θ1r + ... + θN

r > 1 the result proved in Sect. 3.1 is correct. This surely includes Ψ is a polynomial with r ≥ 2. Following the same logic of proof in Sect. 3.1 Case I and II are correct since xr + (1 − x)r ≤ x + (1 − x) ≤ 1 and (1 − x)r ≤ 1 for 0 ≤ x ≤ 1 and r > 1. For Case III Ψ = (1 − Δ)r + Δr + ... + (1 − Δ)r + Δr +(1 − Δ)r + QΔr    p-1 times

Q = N − 2p + 1 r

the function f (x) = (1−x) +Qxr is convex with one minimum at x0 =

1

1

1+Q r−1

.

This can be shown using calculus. At a local minima df (x) =0 dx and

d2 f (x) >0 dx2

df (x) = −r(1 − x)r−1 + rQxr−1 = 0 dx 1 solving for x we have x0 = 1 . Now 1+Q r−1

d2 f (x0 ) 1 1 r−2 r−2 = r(r − 1)(1 − + r(r − 1)Q( >0 1 ) 1 ) dx2 1 + Q r−1 1 + Q r−1 Figure 1 shows a sample graph of f (x). It is clear for Δ < x0 f (x) ≤ 1 and hence Ψ < p. Now by using the same logic as the proof in Sect. 3.1 as Δ → 0, we can find the .

Polynomial Formulation

127

Fig. 1. Graph of f (x), Q = 8 and r = 4. x0 = 0.3333

4

A Second Proof

It has been shown by Jain et al. [10] (please refer to that paper to understand his network flow model under interference) that if the maximal independent sets are I1 , I2 , ..., Ik then  fi = λj i∈Ij

λ1 + λ2 + ... + λk = 1 0 ≤ λj ≤ 1, j = 1, 2, ..., k and

N 

fi = independence number

i=1

for a two nodes network. λj is the time allocated to independent set Ij . It can be easily seen: M ax Ψ  θi ≤ 1 at each maximal clique (3) 0 ≤ θi ≤ 1 given that Ψ is the quadratic or polynomial function as in Sect. 3.1 or Sect. 3.2 is equivalent to: M ax Ψ θi ≤ 1 at each maximal clique 0 ≤ θi ≤ 1  λj 0 ≤ λj ≤ 1, j = 1, 2, ..., k. θi = 

i∈Ij

(4)

128

M. Heal and J. Li

However: λ1 + λ2 + ... + λk may or may not equal to 1. Now, we prove λ1 = 1, λ2 = λ3 = ... = λk = 0 which is equivalent to θi = 1∀i ∈ I1

and θi = 0∀i ∈ / I1

is a local maxima to (3). The same can be proved for I2 , I3 , ..., Ik . Let λ1 ≤ 1−Δ. Now each one of the other independent sets I2 , l3 , ..., lk either contains a vertex that is also a member of I1 and in that case λ of that independent set is ≤ Δ since θi ≤ 1; or the 2nd case is that the independent set contains a vertex that is not a member of the independent set I1 . Hence, this vertex must  be connected θi ≤ 1 at each to one of the vertices of independent set I1 . However, since maximal clique we have λ ≤ Δ for this independent set. Based on that 2 Ψ = θ12 + θ22 + ... + θN ≤ (1 − Δ)2 + (1 − Δ)2 + ... + (1 − Δ)2 +ZΔ2   

(5)

|I1 | times

for some integer Z. The proof now proceeds as in Sect. 3.1 for the three cases. Similarly, we can prove it for any power more than 1 in Ψ .

5

A Third Geometric Proof

We will illustrate the proof by considering a four vertices graph, and it is clear to see that can be extended to graphs of arbitrary size. Furthermore the proof suggests an extension of our model to find the maximal weighted independent sets including the maximum weighted independent set. Now, let’s us state a connection between the maximum independent sets and the capacity of flow of two nodes network. Consider a two nodes network, see Fig. 2. Node 1 is sending data to node 2 over l links(channels) in one unit of time and the capacity of each link is one data unit per time unit. The links are interfering in such way that the transmission from node 1 to node 2 is successful only when the data is transmitted over non-interfering links. Assuming there is a one-to-one mapping between the links in this two nodes network and the vertices of the graph G = (N , L), provided that two links in the network are interfering if and only if the corresponding vertices of the graph G are connected. It is not difficult to see the independence number of the graph G is equal to the maximum successful flow from node 1 to node 2. The maximum flow of the network can be carried on any set of links that maps to a maximum independent set of the graph, [10]. We assume θi , i = 1, 2, ...l, l = N , as the flow in data units/time unit for each link. Consider the graph in Fig. 3, it is clear there are two maximum independent sets (1, 2, 3) and (1, 2, 4) and one maximal clique (3, 4). Link (vertex) 3 is

Polynomial Formulation

129

Fig. 2. Two nodes networks of l links that conflict according to graph G.

interfering(connected) with link (vertex) 4. It is clear the maximum successful flow from node 1 to node 2 is equal to the independence number 3. Now when θ1 = θ2 = θ3 = 1 and θ4 = 0 we have maximum flow from node 1 to node 2, see Fig. 4(a); or it can be attained by splitting the flow of link (vertex) 3 between links (vertices) 3 and 4 since we have the sum of flows is less than or equal to one in each maximal clique, see Fig. 4(b). It is clear the total areas of the grey squares (each square has a side length of 1 unit) is equal to maximum transmitted data and maximum transmitted data equals to the independence number of the graph. It can be easily seen for any flows of θs we have θ12 + θ22 + ...θl2 , l = N , is less than the independence number of the graph or maximum transmitted data and we attain the maximum when links(vertices) form a maximum independent set and θs equal 1 for maximum independent links (vertices) set and zero otherwise since the area of the inner squares (black square which has a side length of θ) are less than 1, see Fig. 4(b). It is straightforward to see the logic is valid if we consider a set of links that forms a maximal independent set and not for only a maximum independent set.

Fig. 3. Four vertices graph example.

130

M. Heal and J. Li

Fig. 4. Schedule of flows for a network that has the four vertices graph as a conflict graph (a) an independent set carries all the flow, (b) link 3 flow is splitted over 2 links, also inner square area is less 1.

Now if we assume links capacities are different such as C1 , C2 and . . . Cl for links 1, 2, . . . l, respectively, then it is not difficult to see that the maximum of C1 θ12 + C2 θ22 + ... + Cl θl2 is a maximum weighted independent set or a maximal weighted independent set such that the capacities are links weights and depending on if it is a global or local maximum respectively. This should be the objective function in our optimization formulation. As an example if the weights of our four vertices graph in Fig. 3 are 1, 2, 3, 4 for vertices 1, 2, 3, 4, respectively, then the maximum of data sent (which is equivalent to maximum weighted independent set) will be 1 + 2 + 4 = 7 or the maximum weighted independent set will be links (vertices) 1, 2 and 4.

6

Examples and Extensions

As an example, we applied our algorithm to Hoffman Singleton graph [11]. After more than 7 h of computer crunching, WolfRam Mathematica 11.01 FindIndependentVertexSet function didn’t converge to a maximum independent set on MacBook Pro, 2.5 GHz Intel Core i5, 8 GB memory. We used Matlab R2017a to code our algorithm. The first phase is finding the maximal cliques of the graph. To that end we used a code from matlab File Exchange for Bron-Kerbosch algorithm to list the maximal cliques [12]. In spite of that this algorithm is not the best known algorithm for large sparse graphs, but it did serve our purpose. There are almost polynomial algorithms to list maximal cliques for sparse graphs reported in literature, such as [13]. The matlab code is shown below:

Polynomial Formulation

131

f u n c t i o n [ f i n a l x , f i n a l f v a l , e x i t f l a g , output ] = c o n v e r s i o n A ( Adjacency ) % c o n v e r t t h e Adjacency matrix i n t o t h e o p t i m i z a t i o n parameters of % fmincon %∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗∗ % LB 2 % Hessian r e q u i r e d h f =−2∗eye ( 2 0 8 ) ; end end The code is three functions: (1) throughput function which is the objective function with its gradient, (2) the maximalCliques function which is not shown here downloaded from matlab code sharing community, and (3) conversionA which prepares the optimization problem parameters to be in the format for fmincon function and solves the problem using 100 uniformly distributed random seeds. By using matlab Parallel Add-on, a maximum independent set was yielded in 0.1451 min where the main time is due to fmincon of 0.1326 min and 0.0124 min to list maximal cliques. Another important problem is finding an independent set of a certain size. This can be achieved by adding the constraint θ1 + θ2 + ... + θN = M where M is the required independent set size.

(6)

Polynomial Formulation

7

133

Results Obtained Form DIMACS Benchmarks

We see in Table 1 and Table 2 some of the DIMACS maximum cliques benchmarks [14]. KBA stands for Bron-Kerbosch algorithm and SQP stands for sequential quadratic matlab algorithm. To find the maximum clique in these graphs, we found the maximum independent set in the complement graph. As can be seen our algorithm is efficient especially for sparse graphs that have short time to list their maximal cliques. Sequential quadratic programming is more accurate than interior-point and needs less runs. Since fmincon sometimes halts while trying to find a global maximum, we used in some graphs the equality constraint to find an independent set of that size which happens to be a maximum independent set for the graphs listed such as Keller4. In the second table for Keller4 graph the parallel processing version of fmincon halts responding in spite of finding the independence number in the first runs; for this we used a nonparallel fmincon and the result shown in the table is for one such run. Figure 5 shows the average running time when we use different powers of our polynomial formulation for 5 runs, each with 100 seeds for C125.9 graph [14], that has clique number equals to 34. It is clear increasing the polynomial power hasn’t reduced execution time and as can be seen from Fig. 6 the accuracy is even less with more runs required to coverage to a global optima. We used a 500 seeds for a polynomial of degree 4 and we found independent sets of sizes 34 and 33, and with 1000 seeds for a polynomial of degree 10 we found an independent set of size 31. Now we come to some of the graphs where our algorithm performs better than cliquer [15]. One such graph is C250.9.clq [14] that has a clique number of 44. The maximal cliques listing time, quadratic programming algorithm time and total time were 0.1309, 125.6358 min (2.09393 h), 125.7669 min (2.096115 h) respectively by our algorithm to get a clique number of 44 while the maximum clique obtained by cliquer [15] is 39 after more than 7 h of running. Indeed cliquer halted at 184/250 (max 39) 26202.11 (3330.10 s /round) and stopped responding. Another graph is P Gamma U34 On Nonisotropic Points [11], our quadratic programming algorithm yielded the independence number (clique number in the Table 1. DIMACS results using interior-point algorithm for fmincon. Graph

Table 1. DIMACS results using interior-point algorithm for fmincon.

Graph           KBA time   No. of iterations   Interior-point time   ω
johnson8-2-4    0.0052     1                   0.0075                4
MANN-a9         0.0013     5                   0.0084                16
hamming6-2      0.0026     1                   0.0072                32
hamming6-4      0.4790     10                  23.2787               4
johnson8-4-4    0.0937     20                  0.1043                14
johnson16-2-4   0.0123     10                  0.0827                8
C125.9          0.0097     100                 0.6695                34
keller4 (a)     2.2866     10                  12.8564               11

(a) With equality constraint, sum of θs = 11.


Table 2. DIMACS results using sequential quadratic programming algorithm for fmincon.

Graph            KBA time   No. of iterations   SQP time   ω
johnson8-2-4     0.0035     1                   0.0095     4
MANN-a9          0.0020     1                   0.0062     16
hamming6-2       0.0030     1                   0.0086     32
hamming6-4 (a)   0.4933     1                   0.8052     4
johnson8-4-4     0.0103     20                  0.0206     14
johnson16-2-4    0.0130     10                  0.0251     8
C125.9           0.0099     50                  0.1152     34
keller4 (b)      1.8339     1                   2.8839     11

(a) With equality constraint, sum of θs = 4.
(b) The run function stops responding; we used fmincon without parallel processing.

Fig. 5. Average time of 5 runs, each with 100 seeds for different powers of the objective.


Fig. 6. Average maximum independent set found of 5 runs, each with 100 seeds for different powers of the objective.

8 Conclusion

We proposed a quadratic optimization formulation to find maximal independent sets of any graph, and we extended it to a polynomial optimization formulation. The approach first lists the maximal cliques of the graph and then solves a nonlinear optimization problem. Our formulation is efficient when tested on some of the DIMACS maximum clique benchmarks and proved more efficient than a popular maximum clique algorithm, cliquer, for some graphs. However, due to the time required to solve the quadratic or polynomial optimization problem, our algorithm works best when the maximal cliques of the graph can be listed quickly, i.e. for sparse graphs. The listing time could be reduced further by implementing an algorithm such as that in [13]; this is to be tested in the future on large sparse graphs. The model can easily be extended to find an independent set of a given size and to find maximal weighted independent sets in weighted graphs.

Acknowledgment. Maher Heal thanks Dr. Una Benlic for advice on the maximal independent set and maximal clique problems while working on this research.

References

1. Östergård, P.R.: A fast algorithm for the maximum clique problem. Discrete Appl. Math. 120(1), 197–207 (2002)


2. Fang, Z., Li, C.-M., Xu, K.: An exact algorithm based on maxsat reasoning for the maximum weight clique problem. J. Artif. Intell. Res. 55, 799–833 (2016)
3. Tavares, W.A., Neto, M.B.C., Rodrigues, C.D., Michelon, P.: Um algoritmo de branch and bound para o problema da clique máxima ponderada. In: Proceedings of XLVII SBPO, vol. 1 (2015)
4. Shimizu, S., Yamaguchi, K., Saitoh, T., Masuda, S.: Fast maximum weight clique extraction algorithm: optimal tables for branch-and-bound. Discrete Appl. Math. 223, 120–134 (2017)
5. Shor, N.Z.: Dual quadratic estimates in polynomial and Boolean programming. Ann. Oper. Res. 25(1), 163–168 (1990)
6. Motzkin, T.S., Straus, E.G.: Maxima for graphs and a new proof of a theorem of Turán. Can. J. Math. 17(4), 533–540 (1965)
7. Harant, J.: Some news about the independence number of a graph. Discussiones Math. Graph Theory 20(1), 71–79 (2000)
8. Harant, J., Pruchnewski, A., Voigt, M.: On dominating sets and independent sets of graphs. Comb. Probab. Comput. 11, 1–10 (1993)
9. Pardalos, P., Gibbons, L., Hearn, D.: A continuous based heuristic for the maximum clique problem. Technical Report, University of Michigan, Ann Arbor, MI, United States (1994)
10. Jain, K., Padhye, J., Padmanabhan, V.N., Qiu, L.: Impact of interference on multihop wireless network performance. Wireless Netw. 11(4), 471–487 (2005)
11. https://hog.grinvin.org/
12. https://uk.mathworks.com/matlabcentral/fileexchange/30413-bron-kerbosch-maximal-clique-finding-algorithm
13. Eppstein, D., Löffler, M., Strash, D.: Listing all maximal cliques in sparse graphs in near-optimal time. In: International Symposium on Algorithms and Computation, pp. 403–414. Springer (2010)
14. http://iridia.ulb.ac.be/~fmascia/maximum_clique/
15. https://users.aalto.fi/~pat/cliquer.html

Numerical Method of Synthesized Control for Solution of the Optimal Control Problem

Askhat Diveev 1,2 (B)

1 Federal Research Center "Computer Science and Control" of Russian Academy of Sciences, Vavilova str., 44, 119333 Moscow, Russia
[email protected]
2 RUDN University, Miklukho-Maklaya str., 6, 117198 Moscow, Russia
http://www.frccsc.ru/

Abstract. A new computational method for solving the classical optimal control problem is considered. The method requires a preliminary transformation of the right-hand side of the system of differential equations of the control object model into a contraction mapping. The solution is a piecewise constant vector function that switches the position of the fixed point of the contraction mapping. As a result, we obtain stable solutions to the optimal control problem, which are less sensitive to perturbations of the model and to the integration step. To ensure the contraction mapping property, an evolutionary method of symbolic regression is used. Numerical examples of the solution of an optimal control problem for a group of mobile robots with phase constraints are presented.

Keywords: Contraction mapping · Optimal control · Symbolic regression

Introduction

The optimal control problem is one of the main problems of control theory. Although this problem was formulated more than sixty years ago [1], effective numerical methods for its solution are still lacking. An important scientific result in optimal control theory is Pontryagin's maximum principle [1,2]. This principle has made it possible to find exact analytical solutions of some optimal control problems of low dimension, but it is of limited use for constructing a numerical method. Firstly, when Pontryagin's maximum principle is used, the dimension of the control object model doubles. Secondly, in a numerical solution, a maximum of the Hamiltonian must be found at each integration step, i.e. a nonlinear programming problem must be solved. It is usually supposed that the control delivering a maximum to the Hamiltonian can be found analytically and is unique; otherwise the numerical method becomes impractical, since the number of Hamiltonian-maximization subproblems grows as the integration step decreases.


Thirdly, the boundary value problem of finding initial conditions for the conjugate variables is hard, and its target function can be nonconvex and non-unimodal over the space of these initial conditions. The main drawback of using Pontryagin's maximum principle for applied optimal control problems is that the control found depends on the Hamiltonian, which includes the right-hand sides of the differential equation system of the control object model. In practice the model of the control object is never known exactly, so the obtained control is not optimal for the real object. It should also be noted that the solution of the differential equations with the found control, a function of time in the right-hand side, is not stable and is therefore sensitive both to small perturbations of the model and to the integration step. Searching within a class of unstable solutions significantly complicates the creation of a numerical method for the optimal control problem based on Pontryagin's maximum principle. From the computational point of view, it is much simpler to solve the optimal control problem by transforming it into a nonlinear programming problem [2,3]. In this approach, the time axis is broken into intervals, the control function on each interval is defined up to the values of some number of parameters, and the best values of these parameters are sought according to the quality functional of the initial problem. The increase in the dimension of the nonlinear programming problem due to the number of time intervals is not a big problem for modern high-performance computers. More important in this approach is the instability of the solutions of the differential equations and their sensitivity to the integration step: for different parameter values we may obtain solutions whose exact reproduction requires different integration steps. Note that the computation time of the optimization task is inversely proportional to the integration step, so choosing a very small step leads to substantial time expenditure. It must also be taken into account that for nonlinear control object models the goal functional of the nonlinear programming problem is neither convex nor unimodal. Here too we face the problem that, after a solution is found, it cannot be realized because it is not stable. To ensure stability of an optimal solution in practice, the object is often supplemented with special control systems that provide dynamic stability in the neighborhood of the found optimal trajectory. But these control systems change the model of the control object, so the previously found optimal control is no longer optimal for the new model. In this work we offer another approach to the numerical solution of the optimal control problem. First we provide stability of the control object with respect to some point in the region of the state space where motion along the optimal trajectory is expected. We control the object by changing the position of this stability point, and we solve the optimal control problem by finding the positions of the stability points. We call such control synthesized optimal control [4,5]. In this way the search for a solution is always carried out over stable solutions of the differential equations, which can easily be reproduced in real objects.


To solve the problem of ensuring stability, we use evolutionary methods of symbolic regression in this work. Further, we will show that providing stability of the object can be replaced by obtaining the contraction mapping property for the right-hand sides of the control object model.

1 The Optimal Control Problem

Consider the classical statement of the optimal control problem [1]. The mathematical model of the control object is given in the form of a system of ordinary differential equations

ẋ = f(x, u),   (1)

where x is a vector of the state space, x ∈ Rn, u is a control vector, u ∈ U ⊆ Rm, and U is a compact set. Initial conditions are given:

x(0) = x0.   (2)

Terminal conditions are given:

x(tf) = xf,   (3)

where tf is not fixed but is limited:

tf = t, if t < t+ and ||xf − x(t)|| ≤ ε; tf = t+ otherwise,   (4)

where ε and t+ are given positive values. A quality criterion is given in the form of an integral functional

J = ∫0^tf f0(x(t), u(t)) dt → min.   (5)

It is necessary to find the control as a function of time,

u = v(t) ∈ U,   (6)

such that the solution of the system

ẋ = f(x, v(t))   (7)

from the initial conditions (2) achieves the terminal conditions (3) with the optimal value of the quality criterion (5).

2 A Contraction Mapping

Let M be a complete metric space, and let ρ(a1, a2) be the distance between elements a1 and a2 of this space. A mapping A in the metric space M is a contraction [6] if there is a positive number α < 1 such that for any two elements a1, a2 of the space M

ρ(A(a1), A(a2)) < αρ(a1, a2).   (8)


According to the Banach theorem, any contraction mapping has exactly one fixed point [6]. Consider a system of differential equations in Cauchy form:

ẋ = f(x, t).   (9)

If a norm is defined in the vector space Rn, then the system of ordinary differential equations defines a one-parametric mapping in the metric space Rn. To obtain the value of x at any moment of time, the equation

x(t0 + δt) = ∫_{t0}^{t0+δt} f(x(t), t) dt   (10)

is used. The mapping (10) maps any element x of the space Rn into another element of this space, depending on the parameter δt. Let x′(tk) ∈ Rn and x′′(tk) ∈ Rn be such that x′(tk) ≠ x′′(tk), so that

ρ(x′(tk), x′′(tk)) = ||x′(tk) − x′′(tk)|| > 0.   (11)

If there exist α, 0 < α < 1, and δt such that

ρ(x′(tk + δt), x′′(tk + δt)) < αρ(x′(tk), x′′(tk)),   (12)

where x′(tk + δt) and x′′(tk + δt) are obtained by the mapping (10), then (10) is a one-parametric contraction mapping in the space Rn with parameter δt, and according to the Banach theorem this mapping has exactly one fixed point.
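To illustrate condition (12) numerically, the following Python sketch (ours, for illustration only) integrates a simple stabilized system ẋ = −k(x − x∗) from two different initial points over a step δt and checks that the distance between the two trajectories contracts; the system and the gain k are assumed for the example and are not part of the method itself.

import numpy as np

def step(x, x_star, k=2.0, dt=0.5, n_sub=100):
    # One application of mapping (10) with parameter dt, integrated by Euler
    # sub-steps for the assumed stabilized system xdot = -k (x - x_star).
    h = dt / n_sub
    for _ in range(n_sub):
        x = x + h * (-k * (x - x_star))
    return x

x_star = np.array([1.0, -0.5])
x1, x2 = np.array([4.0, 3.0]), np.array([-2.0, 0.0])
d0 = np.linalg.norm(x1 - x2)
d1 = np.linalg.norm(step(x1, x_star) - step(x2, x_star))
print(d1 < d0, d1 / d0)   # the ratio plays the role of alpha < 1 in (12)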

3 Synthesis of Control as Transformation to Contraction Mapping

To ensure the contraction mapping property for the control object model, it is necessary to find a control function

u = h(x∗ − x)   (13)

such that the system

ẋ = f(x, h(x∗ − x)),   (14)

where x∗ is some point in an area X∗ ⊆ Rn, has partial solutions x(t, x0) that, for any initial conditions x(0) = x0 ∈ X∗ from that area, reach the given neighborhood of the point x∗ in a limited time t1 > 0:

||x∗ − x(t, x0)|| ≤ ε1,  t ≤ t1,   (15)

where ε1 is a given small positive value.


In fact, the problem of finding such a control function corresponds to the general control synthesis problem. Below, numerical methods of symbolic regression are used to solve this general synthesis problem. In a computer search, condition (15) cannot be checked for the whole area X∗ ⊆ Rn, so the following assumption is used.

Proposition 1. For any area X∗ ⊆ Rn and system of differential equations (14) there is always a finite number of initial condition points such that, if the partial solutions of (14) for these initial conditions have some property A, then all points of the area have this property.

In the numerical solution of the control synthesis problem we therefore set a finite number of initial conditions and find a control function that provides achievement of the given target point from all of these initial values. Then we check that this also holds for other points of the area. The same approach is used when training an artificial neural network, which is a universal approximating function: during training it approximates a finite set of points, called the training set; the trained network is then checked on other points that were not included in the training set, and if it does not give correct results it is retrained with new points in the training set.

Let us formulate the numerical problem of the general synthesis of control. The system of ordinary differential equations (1) is given, together with an area X∗ ⊆ Rn, a target point x∗ ∈ X∗ in this area, and a set of initial condition points

X0 = {x0,1, ..., x0,K} ⊂ X∗.   (16)

It is necessary to find a control in the form (13) such that

h(x∗ − x) ∈ U,   (17)

and the solutions of the system (14) minimize the functional

J = tf + α Σ_{k=1}^{K} ||x∗ − x(tf, x0,k)|| → min,   (18)

where tf is determined from (4) and α is a positive weight coefficient. In the numerical solution of the synthesis problem we consider ε to be the smallest number; therefore, if

ρ(x′, x′′) ≤ ε, then x′ = x′′ and ρ(x′, x′′) = 0.   (19)
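A minimal Python sketch of how functional (18) can be estimated for a candidate control function h over the finite set of initial points (16) is given below. The dynamics f, the feedback h and all numerical settings are placeholders chosen only to make the sketch self-contained, and the treatment of tf over several trajectories (the largest stopping time) is one possible reading of (4).

import numpy as np

def synthesis_functional(f, h, x_star, X0, alpha=1.0, eps=0.05, t_plus=10.0, dt=0.01):
    # Numerical estimate of J = tf + alpha * sum_k ||x* - x(tf, x0k)||  (18).
    # Each trajectory stops as in (4); tf is the largest stopping time.
    tf, miss = 0.0, 0.0
    for x0 in X0:
        x, t = np.array(x0, dtype=float), 0.0
        while t < t_plus and np.linalg.norm(x_star - x) > eps:
            x = x + dt * f(x, h(x_star - x))      # Euler step of system (14)
            t += dt
        tf = max(tf, t)
        miss += np.linalg.norm(x_star - x)
    return tf + alpha * miss

# Assumed toy example: a single integrator with a saturated feedback.
f = lambda x, u: u
h = lambda e: np.clip(3.0 * e, -1.0, 1.0)
X0 = [[2.0, 2.0], [-1.0, 3.0], [0.0, -2.0]]
print(synthesis_functional(f, h, np.zeros(2), X0))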


Theorem 1. If the synthesis problem (1), (13)–(18) is solved, then the one-parametric mapping

x(t0 + δt) = ∫_{t0}^{t0+δt} f(x(t), h(x∗ − x(t))) dt   (20)

is a contraction mapping for some value of the parameter δt.

Proof. Let x1(t0) and x2(t0) be any two points in the area X∗ with ρ(x1(t0), x2(t0)) > ε. Then there exists δt ≤ tf such that ρ(x1(t0 + δt), x∗) ≤ ε and ρ(x2(t0 + δt), x∗) ≤ ε, so that, according to (19), ρ(x1(t0 + δt), x∗) = 0 and ρ(x2(t0 + δt), x∗) = 0. From here the inequality

ρ(x1(t0 + δt), x2(t0 + δt)) ≤ ρ(x1(t0), x2(t0))   (21)

is obtained. According to (12), the point x∗ is a fixed point of this contraction mapping.

4 Synthesized Optimal Control

Theorem 2. If the optimal control problem (1)–(6) has a solution for any initial conditions in some area X∗ ⊆ Rn, then the control synthesis problem (1), (13)–(18) has a solution too.

Proof. Let x∗ = xf. Then, by the conditions of the theorem, for all initial condition points (16) the optimal control problem (1)–(6) has a solution. The set of these solutions consists of a set of optimal controls

Ũ = {ṽ1(t), ..., ṽK(t)}   (22)

and a set of extremals

X̃ = {x̃(t, x0,1), ..., x̃(t, x0,K)}.   (23)

Both of these sets determine, at each moment of time, the control function (13):

h : X̃ → Ũ.   (24)

Theorem 3. Let the control synthesis problem (1), (13)–(18) be solved in such a way that the distance between the solutions x(t, x0,1) and x(t, x0,2) never exceeds a set positive value ε∗:

if ||x(0, x0,1) − x(0, x0,2)|| ≤ ε∗, then ||x(t, x0,1) − x(t, x0,2)|| ≤ ε∗ for all t > 0.   (25)

Then for the optimal control problem (1)–(6) it is always possible to find target points x∗,1, ..., x∗,S and a time interval Δt such that the solution x̄(t, x0) of the system

ẋ = f(x, h(w(t) − x)),   (26)

where w(t) is the piecewise constant function

w(t) = x∗,s, if (s − 1)Δt ≤ t < sΔt, s = 1, ..., S,   (27)

from the initial condition (2) does not differ from the optimal solution x̃(t, x0) of the optimal control problem (1)–(6) by more than ε∗.


Proof. When solving the synthesis problem (1), (13)–(18) we set ε < ε∗. Let xβ be a point of the optimal solution at the moment tβ, xβ = x̃(tβ, x0). We can always find a target point x∗,1 such that the solution of the system (26) does not differ from the point xβ at the moment tβ by more than ε∗, because the functional of minimum time is used in the synthesis problem (1), (13)–(18). If there is a moment tγ < tβ at which the solution of the system (26) differs from the optimal solution by more than ε∗, then we set x∗,2 = x∗,1 and find a new target point x∗,1 so as to reach the neighborhood of the point xγ of the optimal solution x̃(tγ, x0) at the moment tγ. If there is no such moment, we look for the next target point. We continue to find target points x∗,s until the solution of the system (26) reaches the terminal point (3). In the worst case we look for target points to hit points of the optimal trajectory located no more than ε∗ from each other.

The method of synthesized optimal control allows an approximate solution of the optimal control problem (1)–(6) to be found that is not very different from the exact solution. The obtained solution is also not very sensitive to inaccuracy in the initial conditions. Indeed, let the initial conditions be known with accuracy Δx,

Δx = ρ(x0, x0,δ) = ||x0 − x0,δ||,   (28)

where x0,δ is the known initial condition. Then, according to the contraction mapping property (12), the following inequality holds:

ρ(x(t1, x0), x(t1, x0,δ)) < Δx.   (29)

Synthesized optimal control is thus a new method for solving the optimal control problem and a new statement of the search for an optimal control.
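A minimal Python sketch of the switching construction (26)–(27) is given below: a previously synthesized feedback h drives the object toward whichever target point x∗,s is active on the current interval Δt. The dynamics and the feedback used in the demo are simple stand-ins chosen for the example; only the switching logic follows the construction above.

import numpy as np

def simulate_synthesized_control(f, h, targets, x0, dt_switch=1.0, dt=0.01):
    # Integrate xdot = f(x, h(w(t) - x)) with the piecewise constant target
    # schedule w(t) = x^{*,s} for (s-1)*dt_switch <= t < s*dt_switch, as in (27).
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    t = 0.0
    for target in targets:                        # s = 1, ..., S
        t_end = t + dt_switch
        w = np.asarray(target, dtype=float)
        while t < t_end:
            x = x + dt * f(x, h(w - x))           # Euler step of (26)
            t += dt
            traj.append(x.copy())
    return np.array(traj)

# Assumed toy setup: single integrator with a saturated proportional feedback.
f = lambda x, u: u
h = lambda e: np.clip(2.0 * e, -1.0, 1.0)
targets = [[1.0, 0.0], [2.0, 1.0], [3.0, 3.0]]    # x^{*,1}, x^{*,2}, x^{*,3}
path = simulate_synthesized_control(f, h, targets, x0=[0.0, 0.0], dt_switch=3.0)
print(path[-1])                                    # final state lies close to the last target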

5 Symbolic Regression Methods for the Control Synthesis Problem

Symbolic regression methods appeared for the problem of automatic programming at the end of the twentieth century [7]. In the synthesis problem it is necessary to find a mathematical expression for the control function (13). If a numerical method can find the code of a program, then it can find a formula, so symbolic regression methods can be used for the solution of the synthesis problem. More than ten symbolic regression methods are now known. All of them encode a possible solution in a special form and search for the optimal solution in the space of codes; the methods differ in the form of the code. For each code, crossover and mutation operations for a genetic algorithm have been developed. The first symbolic regression method was genetic programming, which encodes a possible solution in the form of a computational tree; to implement the crossover operation, parent trees exchange branches.


In this operation the lengths of the codes can change, which is inconvenient for programming; that is why other symbolic regression methods were created. Consider the network operator method [8,9]. This symbolic regression method encodes a mathematical expression in the form of an oriented graph. For the encoding, only sets of functions with one and two arguments are used. Functions with two arguments must be associative and commutative and have unit elements, and the set of functions with one argument must include the identity function. The set of arguments of the mathematical expression additionally includes a number of constant parameters, whose values are determined during the search; the parameters are included among the arguments to expand the range of values the functions in the mathematical expression can take. Let us consider the encoding of the mathematical expression y = q1 sin(cos(x1)) + exp(−cos(q2 x2)). For this expression the set of arguments has the form F0 = (f0,1 = x1, f0,2 = x2, f0,3 = q1, f0,4 = q2), where x1, x2 are variables and q1, q2 are constant parameters. The set of functions with one argument has the form F1 = (f1,1(z) = z, f1,2(z) = −z, f1,3(z) = sin(z), f1,4(z) = cos(z), f1,5(z) = exp(z)). The set of functions with two arguments has the form F2 = (f2,1(z1, z2) = z1 + z2, f2,2(z1, z2) = z1 z2). In the designation of the functions, the first index indicates the number of arguments and the second index the item number in the set. Let us write down the mathematical expression with the help of the sets' elements: y = f2,1(f2,2(f0,3, f1,3(f1,4(f0,1))), f1,5(f1,2(f1,4(f2,2(f0,4, f0,2))))). To construct an oriented graph of the mathematical expression, functions with one argument must alternate with functions with two arguments or with arguments of the mathematical expression. To satisfy this requirement, identity functions with one argument, and some functions with two arguments with the unit element as the second argument, are added to the record of the mathematical expression. The following expression is obtained: y = f2,1(f1,1(f2,2(f1,1(f0,3), f1,3(f2,1(f1,4(f0,1), 0)))), f1,5(f2,1(f1,2(f2,1(f1,4(f2,2(f1,1(f0,4), f1,1(f0,2))), 0)), 0))).


Now let us build the oriented graph of the mathematical expression. In the graph, functions with two arguments are associated with the nodes, functions with one argument with the arcs, and the arguments of the mathematical expression with the source nodes. The graph of the mathematical expression is shown in Fig. 1.

Fig. 1. An oriented graph of the mathematical expression

In the source nodes of the graph the numbers of elements from the set of arguments are located, near the arcs the numbers of functions with one argument are written, and in the nodes the numbers of functions with two arguments are given. Unit elements of the functions with two arguments are not shown in the graph. In the upper part of each node its number is written. The nodes are numbered so that the number of the node an arc comes out of is less than the number of the node it comes into; if the oriented graph has no loops, its nodes can always be numbered in this way. In computer memory the network operator is presented in the form of an integer network operator matrix. The adjacency matrix of the graph has the following form:

A =
[0 0 0 0 1 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 0 1 0 0 0]
[0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 0 1 0 0 0]
[0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 0 0 0 1 0]
[0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 0 0 0 0 0]


In the adjacency matrix, the ones are replaced with the corresponding numbers of the functions with one argument, and the numbers of the functions with two arguments are put on the main diagonal in the row with the corresponding node number. Thus the network operator matrix is obtained:

Ψ =
[0 0 0 0 4 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 0 1 0 0 0]
[0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 1 0 3 0 0 0]
[0 0 0 0 0 2 0 4 0 0]
[0 0 0 0 0 0 2 0 0 1]
[0 0 0 0 0 0 0 1 2 0]
[0 0 0 0 0 0 0 0 1 5]
[0 0 0 0 0 0 0 0 0 1]

For decoding the network operator, the network operator matrix and a nodes' vector z = [z1 ... z10]T are used. Each component of the nodes' vector is associated with the corresponding node of the graph. First the nodes' vector is initialized: a component corresponding to a source node is set equal to the corresponding argument of the mathematical expression, and a component associated with a function with two arguments is set equal to the unit element of that function:

z(0) = [x1 x2 q1 q2 0 1 1 0 0 0]T.

Then the components of the nodes' vector are changed by the following equation:

zj(i) = f2,ψj,j(f1,ψi,j(zi(i−1)), zj(i−1)) if ψi,j ≠ 0, and zj(i) = zj(i−1) otherwise,   (30)

where i = 1, ..., L − 1, j = i + 1, ..., L, and L is the dimension of the nodes' vector, i.e. the number of nodes; for the example L = 10. Calculating the values of the nodes' vector components for the example gives the following result:

z(1) = [x1 x2 q1 q2 cos(x1) 1 1 0 0 0]T,
z(2) = [x1 x2 q1 q2 cos(x1) x2 1 0 0 0]T,
z(3) = [x1 x2 q1 q2 cos(x1) x2 q1 0 0 0]T,
z(4) = [x1 x2 q1 q2 cos(x1) q2x2 q1 0 0 0]T,
z(5) = [x1 x2 q1 q2 cos(x1) q2x2 q1 sin(cos(x1)) 0 0 0]T,
z(6) = [x1 x2 q1 q2 cos(x1) q2x2 q1 sin(cos(x1)) cos(q2x2) 0 0]T,
z(7) = [x1 x2 q1 q2 cos(x1) q2x2 q1 sin(cos(x1)) cos(q2x2) 0 q1 sin(cos(x1))]T,


z(8) = [x1 x2 q1 q2 cos(x1) q2x2 q1 sin(cos(x1)) cos(q2x2) −cos(q2x2) q1 sin(cos(x1))]T,
z(9) = [x1 x2 q1 q2 cos(x1) q2x2 q1 sin(cos(x1)) cos(q2x2) −cos(q2x2) q1 sin(cos(x1)) + exp(−cos(q2x2))]T.
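The decoding rule (30) is easy to mechanize. The short Python sketch below (ours, for illustration) rebuilds the nodes' vector for the example matrix Ψ above and compares the value in the last node with a direct evaluation of y = q1 sin(cos(x1)) + exp(−cos(q2 x2)); the numeric test values are arbitrary, and indices are 0-based in the code while the text uses 1-based numbering.

import math

F1 = {1: lambda z: z, 2: lambda z: -z, 3: math.sin, 4: math.cos, 5: math.exp}
F2 = {1: lambda a, b: a + b, 2: lambda a, b: a * b}
UNIT = {1: 0.0, 2: 1.0}             # unit elements of the binary operations

PSI = [
    [0,0,0,0,4,0,0,0,0,0],
    [0,0,0,0,0,1,0,0,0,0],
    [0,0,0,0,0,0,1,0,0,0],
    [0,0,0,0,0,1,0,0,0,0],
    [0,0,0,0,1,0,3,0,0,0],
    [0,0,0,0,0,2,0,4,0,0],
    [0,0,0,0,0,0,2,0,0,1],
    [0,0,0,0,0,0,0,1,2,0],
    [0,0,0,0,0,0,0,0,1,5],
    [0,0,0,0,0,0,0,0,0,1],
]

def decode(psi, x1, x2, q1, q2):
    L = len(psi)
    # Initialization: source nodes hold the arguments, inner nodes hold the
    # unit element of their binary operation (taken from the main diagonal).
    z = [x1, x2, q1, q2] + [UNIT[psi[j][j]] for j in range(4, L)]
    for i in range(L - 1):                       # rule (30)
        for j in range(i + 1, L):
            if psi[i][j] != 0:
                z[j] = F2[psi[j][j]](F1[psi[i][j]](z[i]), z[j])
    return z[-1]                                 # the result sits in the last node

if __name__ == "__main__":
    x1, x2, q1, q2 = 0.3, -0.7, 11.7282, 2.0271
    direct = q1 * math.sin(math.cos(x1)) + math.exp(-math.cos(q2 * x2))
    print(decode(PSI, x1, x2, q1, q2), direct)   # the two values agree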

The result of the calculation of the mathematical expression is located in the last component of the nodes' vector. To search for the optimal network operator, a variational genetic algorithm is used. This algorithm uses the principle of small variations of a basic solution [10]: the code of a basic solution is set, and the codes of other possible solutions are determined by sets of variation vectors. A variation vector codes one small variation and consists of four components,

w = [w1 w2 w3 w4]T,   (31)

where w1 is the kind of small variation, w2 is the row number of the network operator matrix, w3 is the column number, and w4 is the new value of the element of the network operator matrix. For the network operator the following small variations are used: w1 = 0 is a change of a nonzero nondiagonal element; w1 = 1 is a change of a nonzero diagonal element; w1 = 2 is the zeroing of a nonzero nondiagonal element, provided its row and column have more than one nonzero nondiagonal element; w1 = 3 is the insertion of a new value of a nondiagonal element in place of a zero element. Let us consider an example of a variation vector, w = [3 5 8 3]T. It is the insertion of the new value 3 in row number 5, column number 8 of the network operator matrix. The new matrix has the following form:

w ∘ Ψ =
[0 0 0 0 4 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 0 0 1 0 0 0]
[0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 1 0 3 3 0 0]
[0 0 0 0 0 2 0 4 0 0]
[0 0 0 0 0 0 2 0 0 1]
[0 0 0 0 0 0 0 1 2 0]
[0 0 0 0 0 0 0 0 1 5]
[0 0 0 0 0 0 0 0 0 1]


This network operator matrix corresponds to the following mathematical expression: ỹ = q1 sin(cos(x1)) + exp(−(cos(q2 x2) + sin(cos(x1)))). Each possible solution is coded by a set of variation vectors

Wi = (wi,1, ..., wi,R),   (32)

where i ∈ {1, ..., H}, R is the length of a set, and H is the number of possible solutions in a population. Genetic operations are carried out on the sets of variation vectors. For the crossover operation, two sets of variation vectors Wα, Wβ are selected and a crossover point kc is determined. The crossover is an exchange of tails after the crossover point of the sets of variation vectors:

WH+1 = (wα,1, ..., wα,kc, wβ,kc+1, ..., wβ,R),   (33)
WH+2 = (wβ,1, ..., wβ,kc, wα,kc+1, ..., wα,R).   (34)

After that the mutation operation is applied. A mutation point kμ ∈ {1, ..., R} is determined randomly for each new possible solution, and a new variation vector wH+1,kμ is generated. A new network operator matrix wH+1,R ∘ ... ∘ wH+1,1 ∘ Ψ is calculated, and this matrix is estimated by the goal function. If the new possible solution is better than the worst possible solution in the population, it replaces the worst possible solution. The same operations are carried out for the second new possible solution. Together with the formula of the mathematical expression, the optimal value of the vector of parameters is sought by the same genetic algorithm. The crossover operation for structures of mathematical expressions and vectors of parameters leads to four new possible solutions: two with new structures and new values of parameters, and two with the structures of the parents and only new values of parameters.
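The following Python sketch (an illustration, not the authors' implementation) shows the basic mechanics just described: applying a set of variation vectors to the basic matrix, the tail-exchange crossover (33)–(34), and a mutation that replaces one randomly chosen variation vector. The feasibility check required for the w1 = 2 variation is omitted for brevity.

import random

def apply_variation(psi, w):
    # w = [kind, row, column, value]; rows and columns are 1-based as in the text.
    kind, i, j, val = w
    m = [row[:] for row in psi]                  # do not modify the basic matrix
    if kind in (0, 3):                           # change / insert a nondiagonal element
        m[i - 1][j - 1] = val
    elif kind == 1:                              # change a diagonal element
        m[i - 1][i - 1] = val
    elif kind == 2:                              # zero a nondiagonal element
        m[i - 1][j - 1] = 0
    return m

def decode_candidate(psi_basic, variations):
    # w_R o ... o w_1 o Psi: apply the set of variation vectors in order.
    m = psi_basic
    for w in variations:
        m = apply_variation(m, w)
    return m

def crossover(W_alpha, W_beta, kc):
    # Exchange of tails after the crossover point kc, as in (33)-(34).
    return W_alpha[:kc] + W_beta[kc:], W_beta[:kc] + W_alpha[kc:]

def mutate(W, random_variation):
    k_mu = random.randrange(len(W))              # random mutation point
    W = list(W)
    W[k_mu] = random_variation
    return W

if __name__ == "__main__":
    psi = [[0] * 10 for _ in range(10)]
    print(apply_variation(psi, [3, 5, 8, 3])[4][7])   # -> 3, as in the example above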

6 Parametrical Optimization for Searching of Target Points

After the synthesis problem is solved, it is necessary to find the optimal positions of the target points. This is a nonlinear programming problem. Let us present the statement of the problem of the search for target points. The system of differential equations (14) is given and the control function (13) is known. The initial conditions (2), the terminal conditions (3) and the integral quality functional (5) are given.


It is necessary to find the coordinates of K target points

x∗,1, ..., x∗,K ∈ X∗ ⊆ Rn.   (35)

If these points are inserted into the control function instead of x∗, each for a given time interval Δt, then the partial solution of the system (14) from the initial condition (2) leads to the terminal condition (3) with the optimal value of the quality criterion (5). The problem consists in finding a finite number of parameters

q = [q1 ... qp]T,   (36)

where p = nK and n is the dimension of the state space of the target points. An evolutionary algorithm is applied to search for the coordinates of the target points, because the goal function of this problem can be nonconvex and multimodal in the space of parameters. Many evolutionary algorithms exist today, and all of them have similar stages: generation of an initial set of possible solutions, called a population; calculation of an estimate for each possible solution; and then small changes of some possible solutions on the basis of the received estimates. These changes are called an evolution; if an evolution improves the estimate of a possible solution, the new possible solution replaces the old one. To solve the parametrical optimization problem, the particle swarm optimization (PSO) algorithm is used. The PSO algorithm is one of the most popular evolutionary algorithms today [11,12]. In this algorithm the evolution of each possible solution is carried out on the basis of the best possible solution found so far and the best among several randomly selected ones; information about the best solutions found earlier is also used. The algorithm contains the following steps. Generation of a set of possible solutions:

qij = ξ(qi+ − qi−) + qi−, j = 1, ..., H, i = 1, ..., p,   (37)

where ξ is a random number in the interval (0, 1), qi+, qi− are the top and bottom restrictions on the values of the parameters qi, H is the number of possible solutions in the initial population, and p is the dimension of the vector of parameters. Generation of historical vectors, or velocity vectors, whose initial values are zeros:

vij = 0, j = 1, ..., H, i = 1, ..., p.   (38)

Calculation of the values of the goal function (5) for each possible solution in the population:

fj = J(qj), j = 1, ..., H.   (39)

The best possible solution qj− found so far is determined:

fj− = min{fj : j = 1, ..., H}.   (40)


Next, for each possible solution qj the best possible solution qr(j) among k randomly selected ones is determined:

fr(j) = min{fj1, ..., fjk}.   (41)

After that the historical vector is changed:

vij = αvij + ξβ(qij− − qij) + ξγ(qir(j) − qij), i = 1, ..., p,   (42)

where α, β, γ are parameters of the algorithm. Each possible solution qj is then changed according to the rules of evolution:

q̃ij = qi+, if qij + σvij > qi+;  q̃ij = qi−, if qij + σvij < qi−;  q̃ij = qij + σvij otherwise.   (43)

The value of the goal function for the new possible solution obtained by evolution is calculated:

f̃j = J(q̃j).   (44)

If the value of the goal function for the new possible solution is better than that of the old possible solution, then the new possible solution replaces the old one:

qj ← q̃j, if f̃j < fj.   (45)

The evolution (41)–(45) is performed for each possible solution in the population, and this process, together with the search for the best current possible solution (40), is repeated P times. After all cycles, the best found solution is taken as the solution of the problem.
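For concreteness, here is a compact Python sketch of the PSO variant described by (37)–(45). The goal function in the demo is a simple placeholder; the actual goal function of this paper requires simulating the system (26) and evaluating the quality criterion, which is not reproduced here.

import numpy as np

def pso(J, q_min, q_max, H=200, P=500, alpha=0.7, beta=0.85,
        gamma=0.1, sigma=1.0, k=4, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    q_min, q_max = np.asarray(q_min, float), np.asarray(q_max, float)
    p = q_min.size
    Q = rng.uniform(q_min, q_max, size=(H, p))      # (37) initial population
    V = np.zeros((H, p))                            # (38) historical (velocity) vectors
    F = np.array([J(q) for q in Q])                 # (39)
    for _ in range(P):
        j_best = int(np.argmin(F))                  # (40) best solution found so far
        for j in range(H):
            sample = rng.integers(0, H, size=k)     # (41) best of k random picks
            r = sample[int(np.argmin(F[sample]))]
            xi1, xi2 = rng.random(p), rng.random(p) # xi ~ U(0, 1)
            V[j] = (alpha * V[j]                    # (42) velocity update
                    + xi1 * beta * (Q[j_best] - Q[j])
                    + xi2 * gamma * (Q[r] - Q[j]))
            q_new = np.clip(Q[j] + sigma * V[j], q_min, q_max)   # (43)
            f_new = J(q_new)                        # (44)
            if f_new < F[j]:                        # (45) keep the improvement
                Q[j], F[j] = q_new, f_new
    j_best = int(np.argmin(F))
    return Q[j_best], F[j_best]

if __name__ == "__main__":
    # Placeholder goal function: a shifted sphere standing in for the real criterion.
    J = lambda q: float(np.sum((q - 1.5) ** 2))
    q_best, f_best = pso(J, q_min=[-20] * 3, q_max=[20] * 3, H=40, P=100)
    print(q_best, f_best)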

7 Computation Experiment

Let us consider a problem of optimal control of the group interaction of two mobile robots moving one cart on hinged rigid leashes. Group interaction of robots means the execution of a single task by all robots together, a task that cannot be divided into parts executed by one or several robots. The problem statement is as follows. A mathematical model of each robot is given:

ẋi = 0.5(u1,i + u2,i) cos(θi),   (46)
ẏi = 0.5(u1,i + u2,i) sin(θi),   (47)
θ̇i = 0.5(u1,i − u2,i),   (48)

where xi, yi are the coordinates of the mass center of robot i on the plane, θi is the angle between the axis of robot i and the coordinate axis {0, x}, and u1,i, u2,i are the components of the control vector of robot i, i = 1, 2.


Initial conditions for the robots are set:

xi(0) = xi0, yi(0) = yi0, θi(0) = θi0, i = 1, 2.   (49)

Restrictions on the control are set:

u−i,j ≤ ui,j ≤ u+i,j, i, j = 1, 2.   (50)

The scheme of movement of the cart by the robots is shown in Fig. 2.

Fig. 2. Scheme of movement of the cart by two robots

In Fig. 2, x0, y0 are the coordinates of the mass center of the cart. The coordinates of the cart are calculated by the following formulas. If x2 − x1 > ε0, where ε0 is a set small positive value, then

x0 = x′ cos(γ) − y′ sin(γ), y0 = x′ sin(γ) + y′ cos(γ),

where

x′ = √(l1² − y′²), y′ = 2√(c(c − l1)(c − l2)(c − Δ))/Δ, γ = arctan((y2 − y1)/(x2 − x1)),
c = 0.5(l1 + l2 + Δ), Δ = √((x1 − x2)² + (y1 − y2)²),

and l1, l2 are the lengths of the cart leashes. If |x2 − x1| < ε0 and y1 < y2, then

x0 = x′ + x1, y0 = y′ + y1,

where

x′ = 2√(c(c − l1)(c − l2)(c − Δ))/Δ, y′ = √(l1² − x′²).

If |x2 − x1| < ε0 and y2 ≤ y1, then

x0 = x′ + x1, y0 = y′ + y1,   (51)

where

x′ = 2√(c(c − l1)(c − l2)(c − Δ))/Δ, y′ = −√(l1² − x′²).

If x2 − x1 < −ε0, then

x0 = x̃ cos(γ) − ỹ sin(γ) + x1, y0 = x̃ sin(γ) + ỹ cos(γ) + y1,

where

γ = arctan((y2 − y1)/(x2 − x1)), ỹ = 2√(c(c − l1)(c − l2)(c − Δ))/Δ, x̃ = √(l1² − ỹ²).

Static phase restrictions are given in the form of limited areas in R2 into which the robots, the cart and the leashes should not get:

ϕi(xj, yj) = ri² − (xi∗ − xj)² − (yi∗ − yj)² ≤ 0,
ϕi(xh, yh) = ri² − (xi∗ − xh)² − (yi∗ − yh)² ≤ 0,

where ri is the size of the restriction, i = 1, ..., M, j = 0, 1, 2, M is the number of obstacles, and xh, yh are the coordinates of the leash centers. If

(0.5(x1 + x2) − xi∗)² + (0.5(y1 + y2) − yi∗)² < (x1 − xi∗)² + (y1 − yi∗)²

or

(0.5(x1 + x2) − xi∗)² + (0.5(y1 + y2) − yi∗)² < (x2 − xi∗)² + (y2 − yi∗)²,

then

xh = (yi∗ − y1 + k1x1 − k2xi∗)/(k1 − k2), yh = k1(xh − x1) + y1,
k1 = (y2 − y1)/(x2 − x1), k2 = (x1 − x2)/(y2 − y1),

otherwise xh = x1, yh = y1. The dynamic restriction defining the lack of collisions between the robots is

χ(x1, x2, y1, y2) = r² − (x2 − x1)² − (y2 − y1)² ≤ 0,

where r is the size of a robot. The terminal condition is

√((xf − x0)² + (yf − y0)²) ≤ ε1,

where ε1 is a small positive value. The quality criterion of control is set as


J = tf + a1 ∫0^tf Σ_{i=1}^{M} Σ_{j=0}^{2} ϑ(ϕi(xj, yj)) dt + a1 ∫0^tf Σ_{i=1}^{M} ϑ(ϕi(xh, yh)) dt + a1 ∫0^tf χ(x1, y1, x2, y2) dt + a2 √((xf − x0)² + (yf − y0)²),   (52)

where a1, a2 are weight coefficients and ϑ(A) is the Heaviside step function: ϑ(A) = 1 if A > 0, and ϑ(A) = 0 otherwise.

Firstly, the control synthesis problem was solved for one robot. As a result the following control function was obtained:

ui = ui+ if ũi > ui+;  ui = ui− if ũi < ui−;  ui = ũi otherwise,  i = 1, 2,   (53)

where

ũ1 = A^−1 + ∛A + sgn(q3(θ∗ − θ)) exp(−|q3(θ∗ − θ)|) + sgn(θ∗ − θ) + μ(B),   (54)

ũ2 = ũ1 + sin(ũ1) + arctan(H) + μ(B) + C − C³,   (55)

A = tanh(0.5D) + B + ∛(x∗ − x) + C + sin(q3(θ∗ − θ)),
B = G + sgn(sgn(x∗ − x)q2(y∗ − y)) exp(−|sgn(x∗ − x)q2(y∗ − y)|) + sin(x∗ − x) + tanh(0.5G) + x∗ − x,
C = G + sgn(sgn(x∗ − x)q2(y∗ − y)) exp(−|sgn(x∗ − x)q2(y∗ − y)|) + sin(x∗ − x),
D = H + C − C³ + sgn(q1(x∗ − x)) + arctan(q1) + ϑ(θ∗ − θ),
G = sgn(x∗ − x)q2(y∗ − y) + q3(θ∗ − θ) + tanh(0.5q1(x∗ − x)),
H = arctan(q1(x∗ − x)) + sgn(W)√|W| + W + V + 2 sgn(W + tanh(0.5V)) ∛(W + tanh(0.5V)) + ∛(x∗ − x) + sgn(x∗ − x)√|x∗ − x| + ∛(x∗ − x) + tanh(0.5V),
W = sgn(x∗ − x) + sgn(q2(y∗ − y)) sgn(x∗ − x) tanh(0.5(x∗ − x)),
V = q3(θ∗ − θ) + sgn(x∗ − x)q2(y∗ − y) + tanh(0.5(x∗ − x)),
μ(a) = sgn(a) min{1, |a|}, tanh(a) = (1 − exp(−2a))/(1 + exp(−2a)),
q1 = 11.7282, q2 = 2.0271, q3 = 4.0222.


In the experiments the following values of the parameters were used: ui+ = 10, ui− = −10, i = 1, 2, qj+ = 20, qj− = −20, j = 1, 2, 3. After that the optimal synthesized control problem was solved. In this problem four fixed points in the state space were sought for each robot; the positions of these points have to provide movement of the cart from the initial condition to the terminal condition with the optimal value of the quality criterion (52). The PSO algorithm was used for the solution of this problem with the following parameters: number of possible solutions in the initial population H = 200, number of evolution cycles P = 500, α = 0.7, β = 0.85, γ = 0.1, σ = 1, and number of randomly selected possible solutions k = 4. In the search for the positions of the fixed points the following restrictions were used: −1 ≤ x∗ ≤ 12, −1 ≤ y∗ ≤ 12, −1.57 ≤ θ∗ ≤ 1.57. The initial position of the cart was x0 = −2, y0 = 0.5, and the terminal position was x0f = 10, y0f = 10. There were M = 4 phase restrictions in the problem, with parameters r1 = 1, r2 = 1, r3 = 1.5, r4 = 1.5, x1∗ = 7.5, y1∗ = 7.5, x2∗ = 2.5, y2∗ = 2.5, x3∗ = 2, y3∗ = 8, x4∗ = 8, y4∗ = 2. The weight coefficients were a1 = 2.5, a2 = 1. The size of a robot was r = 2, and the lengths of the leashes were l1 = l2 = 2. The following optimal value of the vector of parameters was found:

q = [0.1754 3.5241 0.6046 7.5281 4.6063 0.4181 6.0444 6.0224 −0.4171 2.7332 6.3575 1.3160 9.6510 8.9384 0.2588 9.9981 9.7186 0.0218 8.7641 8.8070 0.9473 1.2345 11.768 0.6302]T.

The result of simulation with the found optimal solution is presented in Fig. 3. The value of the functional for the found optimal solution is 4.4380. In Fig. 3 the black lines are the trajectories of the robots, the blue line is the trajectory of the cart, the small black squares are the fixed points for the robots, and the red spheres are the phase restrictions. Further, the sensitivity of the functional to changes of the initial values was studied. The results of the experiment are shown in Table 1. The first line of the table shows the level of disturbance; the last line shows the average values of the functional over the five tests.

Table 1. Values of the functional for optimal control at perturbations of the initial conditions.

Disturbance level   0.05     0.1      0.2
                    4.7621   8.8651   6.5329
                    5.0743   7.7054   9.7300
                    4.7483   8.8319   6.3117
                    4.9611   5.5275   12.1343
                    5.0685   6.0576   10.5120
Average             4.9229   7.3975   9.0442


Fig. 3. Optimal trajectories

8 Conclusions

A new method for the solution of the optimal control problem, the method of synthesized optimal control, has been considered. The method includes an initial solution of the control system synthesis problem, as a result of which stability of the control object model with respect to a point in the state space is provided. It is shown that, as a result of the solution of the control synthesis problem, the model of the control object acquires the contraction mapping property. This property allows a control system to be created that is insensitive to changes of the initial conditions of the object. At the second stage the optimal positions of the fixed points are sought according to the criterion of the initial optimal control problem. An example of the application of the developed method to an optimal control problem of the group interaction of two robots transporting one cart on rigid leashes is given.

Acknowledgment. The theoretical parts of the work, Sects. 1–4, were performed with support from the Russian Science Foundation (project No 19-11-00258). The experimental and computational parts of the work, Sects. 5–7, were performed with support from the Russian Foundation for Basic Research (project No 18-29-03061-mk).


References 1. Pontryagin, L.S., Boltyanskii, V.G., Gamkrelidze, R.V., Mishchenko, E.F.: L. S. Pontryagin Selected Works: The Mathematical Theory of Optimal Process, vol. 4, p. 360. Gordon and Breach Science Publishers, New York (1985) 2. Lee, E.B., Markus, L.: Foundation of Optimal Control Theory, p. 576. Wiley, New York (1967) 3. Tabak, D., Kuo, B.C.: Optimal Control by Mathematical Programming, p. 280. Prentice - Hall Inc., Englewood Cliffs (1971) 4. Diveev, A.I., Shmalko E.Yu.: Self-adjusting control for multi robot team by the network operator method. In: Proceedings of the 2015 European Control Conference (ECC), Linz, Austria, 15-17 July 2015, pp. 709–714 (2015) 5. Diveev, A.I., Sofronova E.A.: Automation of synthesized optimal control problem solution for mobile robot by genetic programming. In: Be, Y., Bhatia, R., Kapoor, S. (eds.) Intelligent Systems and Applications. Proceedings of the 2019 Intelligent Systems Conference (Intellisys) Volume 2. Advances in Intelligent Systems and Computing, vol. 1038, pp. 1054–1072. Springer (2019) 6. Kolmogorov, A.N., Fomin, S.V.: Elements of the Theory of Functions and Functional Analysis. Volume 1. Metric and Normed Spaces, p. 130. Graylock Press, Rochester (1957) 7. Koza, J.R.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge (1992) 8. Diveev, A.I.: Numerical method for network operator for synthesis of a control system with uncertain initial values. J. Comp. Syst. Sci. Int. 51(2), 228–243 (2012) 9. Diveev, A.I., Sofronova, E.A.: Numerical method of network operator for multiobjective synthesis of optimal control system. In: Proceedings of Seventh International Conference on Control and Automation (ICCA 2009), Christchurch, New Zealand, 9–11 December 2009, pp. 701–708 (2009) 10. Diveev, A.I.: Small variations of basic solution method for non-numerical optimization. IFAC-PapersOnLine 48(25), 28–33 (2015) 11. Kennedy, J., Eberhart, R.: Particle swarm optimization. In: Proceedings of IEEE International Conference on Neural Networks, Perth, Australia, vol. IV, pp. 1942– 1948 (1995) 12. Diveev, A.I., Konstantinov, S.V.: Study of the practical convergence of evolutionary algorithms for the optimal program control of a wheeled robot. J. Comput. Syst. Sci. Int. 57(4), 561–580 (2018)

Multidatabase Location Based Services (MLBS)

Romani Farid Ibrahim(B)

Mohammed Bin Naïf Academy for Marine Science and Security Studies, Jeddah, Saudi Arabia
[email protected]

Abstract. Can LBS be an added feature of many types of applications? It would add value to those applications and provide more services for their users, for example in police work, the digital battlefield, emergency services, and tourism services. In the security field, for instance, security officers can use it to get the information they need in a particular place about a particular person: using license plate reading technology, they can retrieve the car data, the owner's data, and the validity of the car license and driver license, as well as whether the car owner is wanted by security or not. Connecting different databases can facilitate many tasks in all areas by providing data access to authorized users. In this research, we propose a new framework for processing location-based services queries that access a multidatabase system (traditional databases, a geodatabase, and spatiotemporal databases). We view a user query as the initiator of a transactional workflow that accesses the multidatabase system, and we propose a technique for processing it. The query analyzer analyzes the query to identify the type of database required and the records, attributes, and constraints to be retrieved. We implemented a simulation prototype as a multiphase and multipart application to test the validity of the proposed framework.

Keywords: Relational database · Spatial database · Spatiotemporal database · Multi-database · GIS · Location-Based Services (LBS) · Transaction · Concurrency control · Caching · Workflow · Data mining · Actionability rules

1 Introduction

The terms location-based service (LBS), location-aware service, location-related service, and location service are often used interchangeably [1]. Connecting existing systems increases their added value and provides more services for the users of the systems. Users can easily perform tasks that require access to different systems, such as LBS users who need information from traditional database systems, or traditional system users who need information from LBS systems. From the perspective of its structure, LBS can be defined as a service that supports the determination of the location of an object (human, car, mobile phone, etc.) by integrating the technologies of positioning systems, GIS, the Internet, and mobile computing. There are many classifications of mobile queries, such as [2, 3]. Queries can be classified into two general categories: non-location-dependent queries and location-dependent queries.


Traditional relational database queries are examples of non-location-dependent queries. They either do not include location attributes, such as "list the names and grades of fourth-year students in the Computer Science Department", or they include location attributes whose values are static and explicitly changed by users. Usually these relational databases are not connected to other spatial databases, and no additional spatial or temporal processing is performed on these location attributes, as in "list names and addresses of all suppliers in Cairo from relational_supplier_database"; even if the query issuer is a mobile user, the result is valid wherever he goes. A location-dependent query is any query whose answer depends on the locations of certain objects (the mobile user and/or other objects of interest) [4]. Location-dependent queries retrieve data about moving or static objects, and the user can be a mobile user or a stationary user. A location-dependent query is continuous if it is reevaluated continuously until it is canceled by the user [5], or a snapshot query if the result is computed only once and transferred to the user. Location-dependent queries can also be classified into local queries if they retrieve data from the user's current cell server, non-local queries if they retrieve data from a server outside the user's current cell, or hybrid queries if they retrieve data from both local and non-local servers. For simplicity, our concern in this research is the management of local queries. We view a transaction as a program in execution in which each write-set satisfies the ACID properties, and we view the program that updates the database as consisting of four phases in general: a reading phase, an editing phase, a validation phase, and a writing phase [6]. A mobile transaction is a transaction in whose execution at least one mobile host takes part [7]. A workflow is a collection of tasks organized to accomplish some business process (e.g., processing purchase orders over the phone, provisioning telephone service, processing insurance claims); a task can be performed by one or more software systems, one person or a team of humans, or a combination of these [8]. The rest of this paper is organized as follows: Sect. 2 describes the related work. Section 3 presents motivating scenarios. Section 4 describes the proposed framework, which includes the system environment and architecture and the proposed transactional workflow processing technique. The remaining sections provide the implementation prototype and conclude the paper with suggestions for future work.

2 Related Work

The authors of [9] proposed an architecture for a middleware, the Location Dependent Services Manager (LDSM), between the mobile user and the service content providers to serve location-dependent applications that access location-dependent data, together with location leveling algorithms to adjust the location granularities and solve the location mismatch problem. The authors of [10] describe location-dependent data management and provide definitions of its basic concepts. The authors of [11] present a system that performs semantic keyword interpretation on different data repositories. It discovers the meaning of the input keywords by consulting a generic pool of ontologies and applying different disambiguation techniques, then


generates the queries expected to be intended by the user and presents them to the user, who selects his/her intended query from them. The authors of [12] suggest using data warehousing technologies for LBS and discuss the research challenges posed by location-based services. The authors of [13] suggest a flexible query approach that takes the context of the user into account, and the use of semantic techniques to enable the development of intelligent LBS. The authors of [14] present the system SHERLOCK as a common framework for LBS; ontologies and semantic techniques are used to share knowledge among devices, which enables the system to guide the user in selecting the service that best fits his/her needs in the given context. Many researchers have focused on continuous query processing where location data are distributed in a network, such as [15, 16]. The authors of [17, 18] proposed an extension of SQL with a SKYLINE operator to answer skyline queries. The skyline is defined as the set of points that are not dominated by any other point; a point dominates another point if it is as good or better in all dimensions and better in at least one dimension. They gave an example of determining the cheapest and closest hotel to a user based on the calculation of skyline points, assuming the user issues the query to a traditional travel agent database system, without the support of location-dependent services, and with all data available in a single database. Most researchers assume that the query results will be returned from the geodatabase or the spatiotemporal database of the LBS system and do not take organizations' databases into consideration.
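To make the skyline notion concrete, the following small Python sketch (ours, not from [17, 18]) filters a list of hotels described by (price, distance) pairs and keeps only the non-dominated ones, where lower is better in both dimensions.

def skyline(points):
    # Keep the points not dominated by any other point; a point dominates
    # another if it is as good or better in all dimensions and strictly
    # better in at least one (here: lower price and lower distance).
    def dominates(a, b):
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))
    return [p for p in points if not any(dominates(q, p) for q in points if q is not p)]

hotels = [(120, 0.4), (90, 1.2), (150, 0.2), (95, 0.9), (200, 2.0)]  # (price, km)
print(skyline(hotels))   # -> [(120, 0.4), (90, 1.2), (150, 0.2), (95, 0.9)]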

3 Motivating Scenarios

Connecting different databases can facilitate many tasks in all areas by providing data access to authorized users. In the security field, security officers can use it to get the information they need in a particular place about a particular person. Police patrol teams can identify where they are and share their data as well as the data of the persons present in each patrol. Ambulance staff can retrieve the data of an injured or deceased person at the scene, which can help them by providing the patient's medical history as well as the contact details of relatives, in addition to storing their notes about the patient, the medication given, and the treatment required by performing their transactions online. In other emergencies, such as fires, landslides, earthquakes, and floods, rescue teams can retrieve data about the people, objects, or buildings on site and perform their transactions from their locations. In the field of sales, a salesperson can identify the customers in the area he is currently in and arrange them according to their proximity to him. In the field of tourism, tourists can retrieve data on available services from their sources, for which there is no data in the LBS system. A tourist can retrieve data about scammers in the area where he is currently located and thus avoid them. Also, a person who wants to buy or rent a property can retrieve data about criminals or scammers near him and find out whether the area is safe or suspicious. The above are examples of services that need to retrieve data from the LBS system and other traditional systems; services in other fields are similar.


4 Proposed Framework

In this section, we describe the system environment and architecture, and the proposed transactional workflow processing technique.

4.1 System Environment

We made the following assumptions about the system environment:

• The mobile environment consists of a group of adjacent cells (Fig. 1). Each cell includes a base station (BS) covering a specific geographical area, and a location-based services server (LBSS) that performs location-dependent services for cell users and responds to users' queries. Each base station manages the communications between the mobile units and the LBSS of the cell.

Fig. 1. Mobile environment.

• We assume a large city that includes more than one cell, like New York City. A cell's range may also intersect with other cell ranges in the same city, or with cell ranges in neighboring cities.
• We assume that the mobile units (MU) are advanced mobile devices (4G mobile phones or more advanced), tablets, or laptop computers, and that they have the capability to communicate with GPS and determine the location of the user based on the signals received from GPS satellites. The mobile devices can access GIS systems such as Google Maps.

4.2 System Architecture

The system architecture (Fig. 2) consists of:

• Cellular network: an advanced cellular network with strong signals and high bandwidth in all cells, connecting all components of the system.
• Organization servers: each cell includes many traditional organization systems (assumed to be relational database servers), which are independent of each other. On each organization server there is a public view query agent that accepts queries about the organization's open data (data that can be accessed by non-local users), such as object descriptions and price lists.


Fig. 2. System architecture

These servers respond to queries that access data from traditional databases. The public view query agent can be designed to resolve attribute name mismatches between the query text and the organization database.
• Client side: the mobile application on the user's mobile unit, which sends the user query and the current location data of the mobile unit to the cell location-based services server and receives the response from it.
• Geodatabase: stores geographic data about the objects in the cell. It can be a standalone system or a subsystem of the location-based services server.
• GPS and other satellite communications: send location information to the mobile unit so that it can calculate its location.
• Cell location-based services server: includes the following main components:
  • Communications manager: manages the communication between the LBS server and the other components of the system.
  • LBS query analyzer: analyses the user query by performing the following tasks:
    • Recognizes the type of query (non-location-dependent or location-dependent).
    • Identifies the target databases (spatial, spatiotemporal, and/or organization databases).
    • Recognizes the words of group functions in the query, such as minimum, average, etc.
    • Converts the query to standard formats accepted by the target databases (such as SQL).


    • Writes the execution plan and calls the query distributor and response collection manager to execute it.
  • Query distributor and response collection manager: executes the execution plan written by the query analyzer. It sends the subqueries to the target databases, receives the results returned as datasets in cache memory, and executes the required group function on the returned results. It also calls the sorting function if the group function requires sorted data.
  • Preparing results manager: prepares the final report that will be sent to the user's mobile unit.
• Spatiotemporal database: stores location and time data about moving objects in the cell, such as cars, mobile user units, etc.

4.3 Transactional Workflow Processing Technique

The application acts as a transactional workflow consisting of a group of independent subtransactions: each subtransaction is committed when it completes, and if a subtransaction fails, the other subtransactions continue and complete their processing. Most of the subtransactions are read-only subtransactions that retrieve data from the different databases; the exception is the reservation subtransaction, an update subtransaction that can be performed using the optimistic concurrency control technique and the actionability rules if the mobile user wants to make a reservation from the application.

As an example, assume a tourist is walking on a street and looking for an average-price hotel near him (within cell coverage). He runs a location-dependent services application and initiates a transactional workflow by writing a query such as: find an average price hotel for a single room near me.

• The mobile unit calculates the user's location from the information received from the GPS.
• The application sends the query with the mobile unit's location to the LBS server of the user's current cell.
• The query analyzer analyzes the query text to determine the target database type, the records, attributes and constraints, and the group function required.
• Target database type: the query text does not include any time adverbs, but includes the location adverb (near) and the pronoun me, so the application determines that the geodatabase will be used.
• Target database records: the query text includes the keyword (hotel), so the application determines that spatial information about hotels in the current cell will be retrieved. Assume the keywords are (hotel, hospital, restaurant, ambulance, school, gas station, etc.).
• Attributes & constraints: the query text includes the words price and single room, so the application determines that the retrieved data will include price and rooms.
• Group functions: the query text includes the group-function word (average), so the application determines that the average of the returned results will be calculated.


• Execution plan: after the query analyzer determines the previous parts, it writes the execution plan as a transactional workflow to be read and executed by the query distributor and response collection manager.
• The query distributor and response collection manager executes the execution plan as a transactional workflow and calls the preparing results manager, which prepares the results report to be sent by the communications manager.
• The communications manager looks up the current location of the user's mobile unit and sends the report to it.

Pseudocode for the functions of the query analyzer is given in queryAnalyzer_Procedure below.

queryAnalyzer_Procedure (string query_text) {

  queryAnalyzer_DatabaseType_Function() {
    If the query_text includes a (time adverb) word such as
       (time, at, day, year, month, week, hour, minute, second, night, now,
        summer, winter, evening, morning, noon,
        after + timeValue, before + timeValue, between + timeValue and timeValue)
       use the spatiotemporal DB.
    Else If the query_text includes a (location adverb) word such as
       (near, front, behind, center, north, south, west, above, up, down,
        between, before, after, on, inside, outside)
       use the geodatabase.
    Else
       use the specified organization database (Noun + Keyword)   // example: Cairo Hotel
  }

  queryAnalyzer_DatabaseRecords_Function() {
    If the query_text includes the keyword
      Hotel:         query_stmt = 'Select * from objects_table where object_type = "hotel"'
      Restaurant:    query_stmt = 'Select * from objects_table where object_type = "restaurant"'
      Hospital:      query_stmt = 'Select * from objects_table where object_type = "hospital"'
      Ambulance:     query_stmt = 'Select * from objects_table where object_type = "ambulance"'
      Police patrol: query_stmt = 'Select * from objects_table where object_type = "police patrol"'
      etc.
  }


  queryAnalyzer_Attribute_Constraints_Function() {
    // We assume, for simplicity, that the only operators used in constraints
    // are the equal sign '=' and the 'and' operator.
    For each (adjective + noun) in the query_text {
      Attribute(i) = noun
      Cond(i) = noun
      Cond(i)_value = adjective
      No_attribute = No_attribute + 1
    }
    For each (noun) in the query_text {
      If noun exists in Attribute[] then loop
      Else {
        Attribute(i) = noun
        No_attribute = No_attribute + 1
      }
    }
    If No_attribute = 1 {
      Interested_attributes = Attribute(i)
      Interested_constraints = Cond(i) + '=' + Cond(i)_value
    }
    Else {
      Interested_attributes = Attribute(i) + ',' + Interested_attributes
      Interested_constraints = Cond(i) + '=' + Cond(i)_value + ' and ' + Interested_constraints
    }
  }

  queryAnalyzer_GroupFunction() {
    If the group_function_word = "average" {
      For each record in query_stmt2_result
        sum = sum + attribute(i)
      average = sum / n           // n = number of returned records
      Return average
    }
    If the group_function_word = "sum" {
      For each record in query_stmt2_result
        sum = sum + attribute(i)
      Return sum
    }
    If the group_function_word = "minimum" or "cheapest" or "lowest" {
      Call find_min()
      Return minimum
    }
    If the group_function_word = "maximum" or "highest" {
      Call find_max()
      Return maximum
    }
  }


  queryAnalyzer_ExecutionPlan_Function() {
    Begin transactional workflow
      Begin transaction t1
        use returned_Value_DatabaseType                 // in this example: geodatabase
        query_stmt1 = 'Select * from objects_table where object_type = "hotel"'
        query_stmt1_result = exec query_stmt1
      End transaction t1
      For each record in query_stmt1_result {
        Begin transaction t2
          Connect_to_server(string server_name)
          query_stmt2 = 'Select ' + Interested_attributes + ' from public_db where ' + Interested_constraints
          // in this example: SELECT room_type, price FROM rooms1 WHERE room_type = 'single'
          query_stmt2_result = exec query_stmt2
        End transaction t2
      }
      Call queryAnalyzer_GroupFunction()
  }
}
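As a complement to the pseudocode above, the following is a minimal, illustrative sketch (in Python, not the Visual C++ used for the prototype) of how the query distributor and response collection manager could execute such a plan as a workflow of independent subtransactions and apply the "average" group function. The connection helpers and table names are assumptions for illustration only.

def run_workflow(geodb, connect_to_hotel_server):
    # t1: read-only subtransaction against the geodatabase, committed on its own
    hotels = geodb.execute(
        "SELECT name, server_name FROM objects_table WHERE object_type = 'hotel'"
    ).fetchall()

    prices = []
    for name, server in hotels:
        # t2..tn: one independent read-only subtransaction per hotel server;
        # if one server fails, the remaining subtransactions still complete
        try:
            conn = connect_to_hotel_server(server)
            rows = conn.execute(
                "SELECT price FROM rooms1 WHERE room_type = 'single'"
            ).fetchall()
            prices.extend(price for (price,) in rows)
        except ConnectionError:
            continue

    # group function requested in the user query: average
    return sum(prices) / len(prices) if prices else None

Each per-hotel retrieval commits on its own, so a failed hotel server does not abort the overall workflow, matching the independence property described in Sect. 4.3.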

5 Prototype Implementation

To test the applicability of our architecture, we implemented a simulation prototype as a multiphase, multipart application using Visual C++ 2017 and Microsoft SQL Server 2012. We built a function that analyzes the user query text by searching for adverb words, keywords, and group-function words to determine the database type and the database records relevant to the user's request, and produces the corresponding SQL statements. It also writes a notice to the user about the group function "average" that will be used. Figure 3 shows the output of the function. In Fig. 3, the function divided the user text into chunks and found the adverb word "near", so it determined that the geodatabase will be used and wrote "use geodatabase" in the execution plan file. It found the keyword "hotel", so it wrote: query_stmt1 = "Select * from objects_table where object_type = 'hotel';".

Fig. 3. Text analysis output

We built the geodatabase using ArcGIS 10.2, which stores data about objects (hotels, restaurants, hospitals, schools, etc.) in the cell. The query returns hotel names and addresses from the geodatabase and stores them as a dataset. A loop then reads each


row in the returned dataset and connects to the hotel database to retrieve the hotel's price list. To determine the attributes that should be returned from the hotel database and their constraints, we used a text analysis tool [19] (TextPro from the FBK organization, http://hltservices2.fbk.eu/textpro-demo/textpro.php) to identify nouns and adjectives in the user query text. Figure 4 shows the result of this analysis.

Fig. 4. Text analysis output from TextPro tool.

We assumed that the analysis tool is a part of the system. We built a function that reads this output and searches for nouns and adjective + noun pairs (see queryAnalyzer_Attribute_Constraints_Function()). If it finds (adjective + noun), it sets the noun as an attribute and a constraint, and the adjective as the constraint value. If it finds a noun without an adjective, it sets it as an attribute only. Then it writes an SQL statement for retrieving the data. The output from this function is shown in Fig. 5.

Fig. 5. Result of identifying attributes and their constraints.
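The (adjective + noun) rule described above can be sketched as follows. This is an illustrative Python outline rather than the prototype's C++ code, and the "ADJ"/"NOUN" tag labels are generic placeholders, not necessarily TextPro's actual tag set.

def build_constraints(tagged_tokens):
    # tagged_tokens: list of (word, tag) pairs produced by the POS tagger
    attributes, conditions = [], []
    i = 0
    while i < len(tagged_tokens):
        word, tag = tagged_tokens[i]
        nxt = tagged_tokens[i + 1] if i + 1 < len(tagged_tokens) else (None, None)
        if tag == "ADJ" and nxt[1] == "NOUN":
            # adjective + noun: the noun becomes both an attribute and a
            # constraint, the adjective becomes the constraint value
            attributes.append(nxt[0])
            conditions.append(f"{nxt[0]} = '{word}'")
            i += 2
        elif tag == "NOUN" and word not in attributes:
            attributes.append(word)      # noun alone: attribute only
            i += 1
        else:
            i += 1
    sql = f"SELECT {', '.join(attributes)} FROM public_db"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql

# e.g. [("single", "ADJ"), ("room", "NOUN"), ("price", "NOUN")]
# -> "SELECT room, price FROM public_db WHERE room = 'single'"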

The query distributor and response collection manager connects to the hotel servers and retrieves the required data, then calls queryAnalyzer_GroupFunction() to calculate the average price. Figure 6 shows the final results of the user query.


Fig. 6. Final results of the user query.

6 Conclusion and Future Work

In this research, we have proposed a framework for multidatabase location-based services (MLBS) that connects different types of databases (traditional, geographic, spatiotemporal). It adds value to existing systems and provides more services for their users. We view a user query as the initiator of a transactional workflow that accesses a multidatabase system. The system architecture was presented, and the query analyzer performs the major tasks by analyzing the user query text to determine the target database and the attributes with their constraints. To test the applicability of our architecture, we implemented a simulation prototype as a multiphase, multipart application using Visual C++ 2017 and Microsoft SQL Server 2012. In this paper, we focused on queries that access traditional databases and the geodatabase; in future work, we will expand the application to access the spatiotemporal database as well, and improve the query analyzer to generate a more semantic interpretation of the user query and achieve more accurate results. A distributed transactional workflow that accesses more than one LBS server, and the selection of the appropriate LBS server, will also be investigated. Data management (fragmentation, replication, caching, and indexing), transaction recovery, and performance improvement of query processing for location-dependent data are interesting directions for future work.

References

1. Kupper, A.: Location-Based Services - Fundamentals and Operation. Wiley, Hoboken (2005)
2. Diya, T., Thampi, S.M.: Mobile query processing - taxonomy, issues and challenges. In: Proceedings of the 1st International Conference on Advances in Computing and Communication, pp. 64–77 (2011)
3. Gratsias, K., Frentzos, E., Dellis, E., Theodoridis, Y.: Towards a taxonomy of location-based services. In: 5th International Workshop on Web and Wireless Geographical Information Systems: W2GIS 2005, pp. 19–30. Springer, Heidelberg (2005)
4. Börzsönyi, S., Kossmann, D., Stocker, K.: The skyline operator. In: Proceedings of the IEEE Conference on Data Engineering, Heidelberg, Germany, pp. 421–430 (2005)


5. Ilarri, S., Mena, E., Illarramendi, A.: Location-dependent query processing: where we are and where we are heading. ACM Comput. Surv. 42(3), 12:1–12:73 (2010)
6. Ibrahim, R.F.: Mobile transaction processing for a distributed war environment. In: Proceedings of the 14th International Conference on Computer Science & Education (IEEE ICCSE 2019), Toronto, Canada, pp. 856–862 (2019)
7. Serrano-Alvarado, P., Roncancio, C.L., Adiba, M.: Analyzing mobile transactions support for DBMS. In: Proceedings of the 12th International Workshop on Database and Expert Systems Applications (DEXA 2001), pp. 596–600 (2001)
8. Georgakopoulos, D., Hornick, M., Sheth, A.: An Overview of Workflow Management: From Process Modeling to Workflow Automation Infrastructure. Distributed and Parallel Databases, pp. 119–153. Kluwer Academic Publishers, Dordrecht (1995)
9. Seydim, A., Dunham, M., Kumar, V.: An architecture for location dependent query processing. In: Proceedings of the 12th International Conference on Database and Expert Systems Applications, Munich, Germany, pp. 548–555, September 2001
10. Dunham, M., Kumar, V.: Location dependent data and its management in mobile databases. In: Proceedings of the 9th International Conference on Database and Expert Systems Applications, Vienna, Austria, pp. 414–419, August 1998
11. Bobed, C., Mena, E.: QueryGen: semantic interpretation of keyword queries over heterogeneous information systems. Inf. Sci. 412–433 (2016)
12. Jensen, C.S., Friis-Christensen, A., Pedersen, T.B., Pfoser, D., Saltenis, S., Tryfona, N.: Location-based services—a database perspective. In: Proceedings of the 8th Scandinavian Research Conference on Geographical Information Science, pp. 59–68, June 2001
13. Ilarri, S., Illarramendi, A., Mena, E., Sheth, A.: Semantics in location-based services. IEEE Internet Comput. 15(6), 10–14 (2011)
14. Yus, R., Mena, E., Ilarri, S., Illarramendi, A.: SHERLOCK: semantic management of location-based services in wireless environments. Perv. Mob. Comput. 15, 87–99 (2014)
15. Ilarri, S., Mena, E., Illarramendi, A.: Location-dependent queries in mobile contexts: distributed processing using mobile agents. IEEE Trans. Mob. Comput. 5(8), 1029–1043 (2006)
16. Vidyasankar, K.: On continuous queries in stream processing. In: The 8th International Conference on Ambient Systems, Networks and Technologies (ANT 2017), Procedia Computer Science, pp. 640–647. Elsevier (2017)
17. Börzsönyi, S., Kossmann, D., Stocker, K.: The skyline operator. In: Proceedings of the IEEE Conference on Data Engineering, Heidelberg, Germany, pp. 421–430 (2001)
18. Papadias, D., Tao, Y., Fu, G., Seeger, B.: Progressive skyline computation in database systems. TODS 30(1), 41–82 (2005)
19. FBK organization: TextPro online demo. http://hlt-services2.fbk.eu/textpro-demo/textpro.php

wiseCIO: Web-Based Intelligent Services Engaging Cloud Intelligence Outlet Sheldon Liang(B) , Kimberly Lebby, and Peter McCarthy Lane College, 545 Lane Ave, Jackson, TN 38301, USA {sliang,klebby,pmccarthy}@lanecollege.edu

Abstract. A web service usually serves web documents (HTML, JSON, XML, images) in interactive (request) and responsive (reply) ways for specific domain problem solving over the web (WWW, Internet, HTTP). However, as more and more websites are shifted onto the cloud computing environment, customers tend to lose interest in "dry" content and instead favor "live" informative content. Furthermore, customers may pursue anything from the Cloud that is disclosed as intelligence for business, education, or entertainment (iBEE) for the sake of decision-making. WiseCIO was created to provide web-based intelligent services (WISE) via human-centered computing technology throughout computational thinking. As part of the intelligent service, a timely born "iBee" runs on individual devices as a Just-in-time Agent System (JAS, an imaginary "intelligent bee" helping with iBEE) to access, assemble and synthesize in order to propagate iBEE over the big database. Imaginatively, tens of thousands of concurrent and distributed "iBee" JAS agents run on devices with a universal interface (Un I) and user-centered experience (Uc X) via ubiquitous web-intensive sections (Uw S). This is how wiseCIO works as a wise CIO: it involves Cloud computing over multidimensional databases, Intelligence synthesized via context-aware pervasive service, and an Outlet that enables sentimental presentation out of information analytical synthesis. In particular, wiseCIO eliminates organizational and experiential issues that may be seen with traditional websites across a variety of devices. As a result, wiseCIO has achieved Un I without being programmed in HTML/CSS, Uc X by not driving users like "a chasing after webpages" (a saying borrowed from Ecclesiastes 1:14), and Uw S of analytical synthesis via failover and load-balancing in a feasible, automated, scalable, and testable (FAST) approach toward novel networking operations via logical organization of web content and relational information groupings, which are vital steps in the ability of an archivist or librarian to recommend and retrieve information for a researcher. More importantly, wiseCIO also plays a key role as a delivery system and platform in web content management and web-based learning, with the capacity to host 10,000+ traditional webpages with great ease.

Keywords: Multidimensional distributed docBases (DdB) · Multi-agent systems · Platform-as-a-Service (PaaS) · Web-intensive section (WiSec) · Content-aware pervasive computing · Human-centered computing · User-centered experience

© Springer Nature Switzerland AG 2020 K. Arai et al. (Eds.): SAI 2020, AISC 1228, pp. 169–195, 2020. https://doi.org/10.1007/978-3-030-52249-0_12


1 Introduction

1.1 Desktop Applications Versus Web Services

Traditional computer users have become accustomed to standalone desktop applications [1, 2] because their useful controls (buttons, dialogues, dropdown lists, etc.) are all instantly actionable without changing the current context. To some degree, desktop applications are more advantageous than web services, at least until the user realizes that the app needs to be installed or upgraded frequently on his standalone computer. Strickland writes in his article "Desktop Applications Versus Web Services" [2]: "Users install desktop applications on their local computers by taking advantage of the computer's resources," while "Web services exist either wholly or partially on the Internet with resources residing in the cloud." So "desktop apps are more robust than web services", meaning that productivity software often gives users more options than its web service counterparts. This was largely true when the comparison was made between an (excellent) desktop app and an (exhausted) web service before Ajax was introduced: a web service serves web documents (HTML, JSON, XML, images) in interactive and responsive ways for specific domain problem solving over the web. Here the "exhausted-ness" refers to the fact that each user action required a complete new page to be loaded from a remote server. This process is inefficient, as reflected in the poor user experience: all page content disappears, then the new page appears [3–6].

1.2 Web Services Versus Web-Based Intelligent Services (WISE)

Without "intelligent" services in the discussion, a web service would disappoint users because of this inefficient process, which in turn motivates innovations addressing web service issues [7]. In particular, a web service is a standards-based, language-agnostic software entity that accepts specially formatted requests from other software entities on remote machines via vendor- and transport-neutral communication protocols. However, as more and more websites are shifted onto the cloud, users may become less satisfied with traditional websites for the following reasons:

Fixed & Inflexible Layouts (FIL) – A website usually consists of a number of webpages. For the sake of consistent accessibility, all webpages have to be organized in a "flat" layout of header-body-footer [4], which not only complicates web design but also confuses human cognition and human-centered logic due to a lack of computational thinking [8, 9].

Dry & Inactive Content (DIC) – If most of the web documents are presented as viewable but not much actionable, they are said to be dry & inactive. A good example of a "DIC" document is an image map without actionable spots for users to operate on. A web document is broadly counted as "dry & inactive" if it only has anchored links, so that "each user action requires that a complete new page be loaded from a remote server" [3, 4].


WISE services, on the contrary, in great favor of the “live” web content (against FIL & DIC) via web-intensive sections (WiSec). A WiSec promotes an intelligent service that is renderable, actionable and assemblable through a FAST approach [7, 10, 11]. A WiSec, different from traditional webpages that currently popular in the layout of header-body-footer, reflects user-centered experience throughout context-sensitive extensibility, hierarchical extensibility, and analytical synthesis effectively via computational management thinking, which involves abstraction and decomposition, pattern recognition and algorithmic fulfillment [7–9]. In general, a WISE service is basically a web service because of common ingredients that are utilized for algorithmic fulfillment of web services, but it seldom swaps the current context by simply downloading a complete new page from the server. Significantly, a WISE service, much like a desktop app, enables many kinds of “live” or actionable elements on renderable web documents; however, unlike a desktop app, it does not require installation on a local computer. According to wiseCIO [7], a user just needs a browser on his device to enjoy user-centered experience browsing that is context-aware pervasive, human-centered, and efficient due to the low latency [7, 10, 11]. 1.3 Hands-on FAST1 Approach Toward WISE Services Imaging two poles with desktop apps on one end and web services on the opposite, the productive software installed on a desktop computer would cause constant upgrading due to CICD (continuous integration, continuous delivery) [12, 13]. On the other hand, the available service on a remote server has to go through an inefficient process that reflects poor user experience without context-sensitive and hierarchical extensibility. The novelty of WISE is bridging the divide between two poles by exploiting advantages (productivity and availability) and eliminating disadvantages (installing or inefficient) throughout productivity-extensibility-availability, illustrated as Fig. 1. WiseCIO works with universal interface so that the user only needs a browser on his device when anything is available as a service in the cloud, assemblable in the browsing, and actionable on the client-side. Adoption of wiseCIO is central to creating an efficient networking environment to discover and disclose useful and usable iBEE. 1.4 Major Contributions of WiseCIO The mission of wiseCIO is to innovate web-based intelligent services of actionable productivity (on the client-side), assemblable extensibility (in browsing), and ubiquitous availability (in the cloud), in which the computational management thinking [8, 9] is used for conceptual abstraction, hierarchical decomposition, well-recognized patterns via fulfillable algorithms, and analytical synthesis over multiple dimensional databases. The novelty of wiseCIO is to provide holistic organization, rapid assembly, and logical presentation to encourage migrating desktop apps onto the cloud, to enable web 1 FAST stands for a computational solution that is feasible, automated, scalable and testable,

throughout continuous integration and continuous delivery, and more specifically, test-driven and agile development via pattern recognition for machine learning.


Fig. 1. The WISE thinker is thinking of “two birds with one stone”: exploiting advantages and eliminating disadvantages from both sides via hierarchical extensibility.

The following "Us" account for the major contributions:

Universal Interface (Un I) – The universal interface is innovated for logical presentation with contextual (breadth), hierarchical (depth) and direct usability, without trivial programming involved [7, 11, 12]. On the client-side, the actionable productivity underlying human-computer interaction (HCI) [14, 15] enables renderable (interface) and actionable (interaction) features different from traditional hyperlinks. Accompanied by productive HCI, a website just requires a browser to suit a variety of devices (desktop, laptop and smartphone) via well-recognized patterns across various operating environments (iOS, Windows, Android).

User-Centered Experience (Uc X) – User-centered experience fulfills rapid assembly in computational management thinking via abstraction, decomposition, well-fulfilled patterns, and analytical synthesis. In the browsing, human-centered computing technology is used to engage users within a context-aware pervasive subject without being driven like "a chasing after webpages". According to Designing of UI & UX, "UX focuses on a deep understanding of users through the best practice of improving the quality of user's interaction with and perceptions of any related services" [14, 15]; wiseCIO's "user-centeredness" is context-aware pervasive and embodied as rapid assembly via hierarchical depth and contextual breadth.

Ubiquitous Web-intensive Sections (Uw S) – Ubiquitous WiSecs feasibly enable holistic organization via scalable document storage and analytical synthesis from distributed docBases (DdB) across a number of servers [16]. In cloud computing, ubiquitous availability means being queryable, locatable and assemblable via failover (redundancy) and load-balancing (duplicates). The "ubiquitousness" means feasibly unfailing solutions to logical presentation via rapid assembly over the DdB. In addition to web-intensive features (renderable interface and actionable interaction), the ubiquitous WiSec


promotes information synthesis through DdB shards² in support of aggregating and associating multiple shardings of content as a whole. Apparently, wiseCIO carries particular practical significance and application value, so a ubiquitous WiSec acts as a smart "aggregator" that makes fine use of web content that is flexible, intelligent, natural and explorable via logical organization of web content and relational information groupings, just as most archival repositories are digitizing original records to make them widely available to the public [16–20].

1.5 Organization of This Paper

The rest of the paper is organized as follows:

Section 2. WISE conceptualizes the novel ideas wiseCIO advocates, such as:
WiSec: Web-intensive Section. A composite in support of human-computer interaction.
aJSON: Actionable JSON [21]. A set of data in support of data-driven process automation.
DdB: Distributed docBase. Sharding of docBases for scalable and analytical synthesis.
MLa: Machine Learning Automata. Automated interpretation in accordance with sample data.

Section 3. CMT derives types of application model for web content management, such as:
xCMT: Anything in computational management thinking. Extensible depth & contextual breadth.
CMS: Content management system. Online services and apps of digital content.
eCMS: Educational content management. Education-specific content management.
wCMS: Worship content management. Worship-specific content management.

Section 4. CIO dedicates algorithmic fulfillment of iBEE for decision-making, such as:
OOWiS: Object-oriented web-intensive section. Granting web entities an object lifecycle.
REAP: Rapid extensible and assemblable presentation. Pairing presentation and assembly.
CSSA: Context-sensitive screening aggregation. Resume-like screening via responsive linkages.
OLAS: Online analytical synthesis. Multiple shards of DdB are synthesizable as a whole.

² DdB shards differ from fragments in favoring distributed storage and analytical synthesis; they were first introduced for the course "Advanced Database Applications" at Oklahoma Christian University in support of retail distribution and product management. DdB with shards was deepened in the newly initiated course "Software Engineering for UI & UX" offered at Miami University, where a hands-on project promoted user-centred experience by turning seven colleges into a revised user interface design via rapid assembly and analytical synthesis.


Section 5. Conclusion: wiseCIO represents practical significance and application value; the scope of future work and the future plan are also discussed.

2 WISE: Striving for Computational Productivity and Availability

WISE is central to a Platform-as-a-Service (PaaS) that embraces both productivity (from desktop apps) and availability (from web services), and acts like a computational thinker serving in a FAST approach throughout generalized abstraction, divide & conquer via decomposition, well-recognized patterns, algorithmic fulfillment, and analytical synthesis, as illustrated in Fig. 2.

Fig. 2. WISE, instead of just a web service, strives for usability, usefulness and ubiquitousness in computational management thinking.

2.1 WiSec: Interactive, Experiential and Synthesizing

Web-intensive sections (WiSec) enhance human-computer interaction (actor & action), operational experience (contextual & extensible) and rapid assembly (ubiquitous synthesis) of web content. Therefore, a WiSec can be used as a playable service to represent a top-level website hosting thousands of traditional webpages, or assembled as a subordinate add-on to a larger context, both of which embody user-centered experience engaging the user in a computational management process. A WiSec is a playable service, characterized as interactive, experiential and synthesizing. A set of playable commanding interfaces (PCI) has been defined in the format of a RESTful API [27, 28] for easy calls to a playable service, as described in Table 1. The web content is represented as documents and stored in distributed docBases (DdB) across a number of servers, which is said to be scalable due to the transparency of being physically distributed (across a number of servers) yet logically a whole.


Table 1. A general PCI command with key(s), docBase, and key-value pairs

?cMd=key@path/DdB&kvParams

A general PCI command usually consists of the following aspects:
● cMd represents the functional command, e.g., jAs;
● key is used for queries and retrieval of content from the DdB;
● path is a remote location referring to distributed docBases;
● kvParams are key-value pairs used for operational modification.
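To make the command shape concrete, the following short Python sketch splits such a command into its parts. It is an assumption about the concrete syntax for illustration only, not wiseCIO's actual parser.

from urllib.parse import parse_qsl

def parse_pci(command):
    # Split a PCI command like '?cMd=key@path/DdB&k1=v1' into its aspects.
    query = command.lstrip("?")
    pairs = dict(parse_qsl(query))
    # the first pair carries the functional command, e.g. 'jAs', 'cEd', 'tVn'
    cmd, target = next(iter(pairs.items()))
    key, _, path = target.partition("@")          # key used for retrieval
    kv_params = {k: v for k, v in pairs.items() if k != cmd}
    return {"cMd": cmd, "key": key, "path": path, "params": kv_params}

# parse_pci("?cEd=lane@uSvr/cEvents&mo=7&da=27")
# -> {'cMd': 'cEd', 'key': 'lane', 'path': 'uSvr/cEvents',
#     'params': {'mo': '7', 'da': '27'}}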

Typical playable services in wiseCIO are webJas, calEndar, triVision, cabiNet, and so on. Among them is the Just-in-time Agent Service (JAS), nicknamed iBee, one of the most essential WiSecs, which initiates wiseCIO on the client-side in collaboration with the other playable services, as illustrated in Fig. 3.

Fig. 3. JAS is symbolized as gears denoting “get the ball rolling” of human-computer interaction through access, assembly, add-on, and activation.

webJas (?jAs=). A Just-in-time Agent Service resides on the client device to watch the user's actions and then bring out a WiSec as a standalone application, or assemble it as a subordinate add-on to a larger context. The JAS is completely distributed, concurrent and autonomous on the client-side through calls to a series of well-fulfilled functions in support of user-centered experience. According to Fig. 3, the JAS agent conducts the following just-in-time actions:

1) Access & retrieve. Asynchronous retrieval of ubiquitous WiSecs from distributed docBases;
2) Assemble & synthesize. Synthesis of multiple shards from the DdB into a WiSec ready to present;
3) Add-on & render. Presentation of a WiSec either as an add-on to a larger context or as a standalone context;


4) Activate actionables. Activation of actionable features via event handlers ready for HCI.

There are other playable services in collaboration with JAS, as follows:

triVision (?tVn=). A multi-view banner service enhances a large organization/institution with multiple branches in a decompositional hierarchy. Each banner denotes a branch under the institution, for instance, the College of Engineering and Computing (CEC) under Miami University. The multi-view banner service acts as a composite of WiSecs that promotes rapid assembly from the DdB, as illustrated in Fig. 4.

Fig. 4. Rapid assembly via Actor (Interface) > Action (Interaction) > Assembly (WiSec)

All banners are extensible via rapid assembly, but only one can be opened at a time for feasible content management without overwhelming the user. As described by use case diagrams [22] via actor, action and assembly, together with design patterns [23], when a banner is opened, the other banners are forced to close. The opening/closing action embodies the object lifecycle: constructed on opening and destructed on closing. A banner with underlying content to be assembled also acts like an "IKEA furniture box"³ to be transmitted with low latency thanks to the distributed JAS via concurrent processing. While opening a banner, the JAS in the background helps to assemble the WiSec and activate actionable elements via event-handler processing; all actionable elements, including image-based buttons/icons, advertising flyers, and text words as well, are "clickable" and make a call to a related playable service via rapid assembly of WiSecs with great ease.

calEndar (?cEd=). An eventful calendar service may have events associated with specific dates. By navigating dates on the calendar, the user can explore daily events that act as intelligent composites, as illustrated in Table 2.

³ An "IKEA furniture box" is more about the user interface in very brief ways than about the real furniture itself. That is to say, a WiSec is like a brief box in transmission, assembled on the user's device to fulfill the furniture's function as a whole.


Table 2. A PCI service for calEndar

?cEd=path/cEvents&mo=7&da=#27

?cEd= usually has the following aspects:
● cEd expects to bring out an eventful calendar, e.g. path/cEvents.cal;
● path could be a specific URL or a uSvr (ubiquitous Svr) for rapid assembly;
● mo=7 denotes a given month;
● da=#27 expects to assemble the multiple events of that day from the DdB.

A PCI allows two types of path: a specific URL or a ubiquitous uSvr. The specific path performs direct access to the given URL, while the uSvr path enables ubiquitous access to the clustered servers for analytical synthesis as a whole. The events-associated calendar (calEndar) is another useful WiSec, where an event could be as simple as a message or as comprehensive as a web-intensive section to be assembled or embedded via extensibility in depth and contextuality in breadth. In practice, an eventful calEndar can be used for attendance check-in or for coordinating a sign-up sheet. For instance, a community service center may require sign-ups on a remote basis: once somebody has signed up for one role, any further sign-up for that role is denied, and the user is encouraged to fill in other roles instead (a minimal sketch of this rule follows Fig. 5). One of the most complex WiSecs is the composite cabinet, which can organize digital archives like a dashboard of data centers with "shelves," "boxes," and "folders," comparable to "situating archives (physical records) in a specially prepared storage room on a shelf in an environmentally appropriate box, and further in individual folders, designated to hold their very own part of the whole." [Evelyn Keele ~ certified archive manager].

cabiNet (?cNt=). A cabiNet service acts like a digitized archive for a complex content management system (CMS), as shown in Fig. 5.

Fig. 5. cabiNet can work as a dashboard with multiple views monitoring data centers
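Returning to the sign-up scenario mentioned for the eventful calEndar above, a minimal sketch of the one-participant-per-role rule could look like the following (purely illustrative; the data layout and names are assumptions):

signups = {}   # (event_date, role) -> participant

def sign_up(event_date, role, participant, all_roles):
    # deny a second sign-up for a taken role, but point to roles still open
    key = (event_date, role)
    if key in signups:
        open_roles = [r for r in all_roles if (event_date, r) not in signups]
        return False, f"role taken; still open: {open_roles}"
    signups[key] = participant
    return True, "signed up"

# sign_up("Jul28-2019", "greeter", "Ann", ["greeter", "usher"])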


The typical playable services discussed above are embodied as composites of highly assemblable, harmonically subordinate, and smartly presentable WiSecs as a whole, with the flexibility to organize complex content online.

2.2 aJSON: Data-Driven Process Automation

JavaScript Object Notation (JSON) is an open-standard file format and data interchange format [20, 21] from which actionable JSON (aJSON) has been derived in support of data-driven process automation via data interchange between the client and the playable services. Beyond pure data interchange, aJSON represents intelligent units with parameters, or programmable JSON consisting of semantic ingredients. aJSON can be seen as "advanced and extensible TAGs"⁴ that comprise a WiSec with well-fulfilled patterns. JAS helps to turn aJSON aggregated in a WiSec into context-aware pervasive human-computer interaction in a FAST process [7, 10, 11]. WiseCIO denotes a platform-as-a-service that provides a set of WiSecs via PCI as playable services using programmable aJSON, which is embodied in a set of well-fulfilled patterns, as shown in Fig. 6.

Fig. 6. Programmable aJSONs comprise playable services in support of wiseCIO as PaaS

The following programmable aJSONs, appearing as advanced and extensible "TAGs", comprise WiSecs, and their "body" may recursively refer to other WiSecs, which supports rapid assembly via extensible context-aware pervasiveness.

bioGraph ~ a biography unit is useful for representing a person with a photo, title, brief description, playable multimedia, and a bio-body for potential assembly in extensible depth.

⁴ An HTML tag is commonly defined as a set of characters constituting a formatted command for a Web page, while the "advanced & extensible TAG" is defined as an intelligent unit of well-recognized patterns.


newSec ~ a news section helps to announce a news item, including a headline, brief description, playable multimedia, and the news-body for potential assembly in extensible depth.

litEms ~ a list of items acts like a bulletin, collecting statements and playable multimedia with or without potential assembly in extensible depth.

picSlide ~ a picture slideshow collects a group of pictures and related brief statements that are shown when activated.

pipeLine ~ a set of playables embodies polymorphic features, including multimedia, documents, and ubiquitous WiSecs, that take rotation one at a time.

As an example, a news section (newSec) is used to announce the "President's Message" with a key-value pair, as interpreted in Fig. 7.

@newSec : { President's Message : [img, playable, headline, eyesight] }@

Semantics of @newSec: actionable ingredients are assigned and integrated as follows:

Fig. 7. The news section is presented with multimedia associated (such as video, audio), headline, and detailed message (as body) via the “eyesight” button.

Image icon. The image icon acts as a button paired with a playable, bringing out a playable item (e.g. a video, audio, … another WiSec).

Eyesight. The eyesight icon makes the news-body extensible in depth and shrinkable in breadth according to the user's action.

Propagating WiSec(s). The news-body is mapped to a WiSec stored in the ubiquitous DdB via failover, load-balancing and rapid assembly, as illustrated in Fig. 8.

Because of limited space in this paper, there are more playable services and programmable aJSONs in wiseCIO that cannot be introduced here. As a PaaS, wiseCIO is open for more to be added as codified design patterns [23].

2.3 DdB: Ubiquitous, Sharding and Assemblable

Distributed docBases (DdB) provide a ubiquitously available and highly scalable solution for complex, database-centered business applications. DdB plays a key role in spreading databases across a number of servers to enable a new level of scalability for (big) database performance. The following QoS aspects are supported via ubiquitous WiSecs:

Failover. Failover is a networking technique that enables automatic switching to a redundant or standby DdB when the primary system fails. Failover is achieved by redundant shards in the ubiquitous DdB, which embodies the fault-tolerance function of mission-critical systems that rely on constant accessibility and availability [17].


Fig. 8. The newSection with the underlying body from Fig. 7 embedded.

Load-Balancing. Balanced services improve the distribution of workloads across clustered servers, aiming at high throughput and low latency (reduced response time) in order to avoid overloading any single docBase; this is achieved by duplicate shards in the DdB [18, 26]. The ubiquitous availability via failover and load-balancing is established by duplicate and/or redundant shards across the clustered servers, as illustrated in Fig. 9.

Information Analytical Synthesis. This provides many ways of gaining adequate comprehension through polymorphic shards: the same topic (identified via keywords) may be represented in several different forms because of the multiple facets of a thing, which may lead to better comprehensiveness; this is what information synthesis pursues. A playable service may denote web-intensive sections by a key-value pair (key & path) for accessibility to the DdB. From the perspective of analytical synthesis, the following key-path relationships support variations of analytical synthesis:

1:1 relationship. Propagating one-on-one retrieval via the key from the pathed DdB.
1:m relationship. Propagating one-on-multi retrieval via the key from multiple pathed DdBs.


Fig. 9. The path @uSvr/ denotes the ubiquitous WiSec across clustered servers via redundant and duplicate shardings for failover and load-balancing.

m:1 relationship. Propagating multi-on-one retrieval via multiple keys from the pathed DdB.
m:n relationship. Propagating multi-on-multi retrieval is theoretically applicable, but not encouraged due to its complexity; it is left open for potential use in the future.

Considering the above variations, a typical PCI may take the following forms, e.g.:

?cMd=key@uSVR/DdB&kvParams              // general format for a PCI command
?cMd=CS@uSVR/[lane|ucla|mu]&kvParams    // 1:m ~ look up CS from multiple DdBs
?cMd=[CS|MA]@uSVR/mDdB&kvParams         // m:1 ~ look up CS & MA from mDdB

According to P. G. Goldschmidt [18], information synthesis requires four steps: defining the topic, gathering the relevant information, assessing the validity of such information, and presenting validated information in a way useful to the target audience. The JAS agent is enhanced (from Fig. 3) for ubiquitous WiSecs with a series of well-fulfilled patterns to promote information synthesis, as shown in Fig. 10. Analytical synthesis in collaboration with WiSecs follows these match-ups:

Accessing the defined topic. The topic can be defined by keywords via access & retrieval in asynchronous networking communication;
Assembling the gathered information. The relevant information is gathered via rapid assembly and synthesis of the same topic (defined by the keyword) from multiple DdBs;
Adding on the validity-assessed information. Validity is assessed to determine whether the related information makes sense as an add-on to the context;
Activating the presentation in usable ways. The presentation should be activated with actionable ingredients in pursuit of context-aware pervasive computing.
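The four match-ups above, combined with the failover and load-balancing behaviour of the ubiquitous DdB, can be outlined as follows. This Python sketch is illustrative only: fetching, rendering and event wiring are stubbed out as callables, whereas in wiseCIO they would be asynchronous, browser-side operations.

import random

def lookup(key, replicas, fetch):
    # Fetch one shard for `key`, load-balancing over duplicate replicas and
    # failing over to the next replica if one is unreachable.
    for replica in random.sample(replicas, len(replicas)):
        try:
            return fetch(key, replica)
        except ConnectionError:
            continue
    return None

def propagate_wisec(key, docbases, fetch, render, attach_handlers, context=None):
    # 1) access the defined topic: retrieve shards for the key (1:m over DdBs)
    shards = [lookup(key, replicas, fetch) for replicas in docbases]
    # 2) assemble the gathered information into one presentable WiSec
    wisec = {"key": key, "content": [s for s in shards if s is not None]}
    # 3) add-on: attach the assembled WiSec to a larger context (or stand alone)
    node = render(wisec, parent=context)
    # 4) activate: wire actionable elements to playable services
    attach_handlers(node)
    return node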


Fig. 10. The JAS agent propagates a WiSec through a series of well-established actions via rapid assembly coordinated with analytical synthesis

As a result, the JAS, acting as an intelligent agent, enables data-driven automation rather than an application-oriented process, through interactive action, experiential operation, and rapid assembly via information analytical synthesis.

2.4 MLa: Automated Data Analytical Synthesis

Machine learning uses analytical models to automate data analytics, pattern identification, and decision making with minimal human intervention. In other words, computer systems are created to perform specific tasks without explicit instructions, relying on patterns and inference instead [24–26, 29]. The JAS agent plus the machine learning method gives birth to JIT-MLa, which automates the task of renderable presentation by using well-fulfilled patterns in aJSON without being explicitly programmed, as illustrated in Fig. 11.

Fig. 11. JIT-MLa assists JAS to fetch sample data, retrieve identified patterns in aJSON, and make decisive use of them for automated renderable presentation of web-intensive sections.

According to Sect. 2.1, the JAS agent resides on the client-side, serving in data preparation and turning aJSON into a rendered presentation via iterative processes. Using a


preaching message as an example, a sermon is presented online from sample data in the following format:

@MP4: [vPlay.png, {0}]@@MP3: [{1}, {2}]@ {3}{4}{5}



The machine learning enables the automaton by retrieving the following, as stated in Table 3.

Table 3. The JIT-MLa works via a defined sample, well-fulfilled patterns, and usage

sample#=>  …{0}…{1}…{2}…{3}…{4}…{5}…

pattern#@>  @PCHh : [Playable, Message, Passage, Date, Speaker]@        // head of table
            while (;) { @PCHr : [ {0}, {1}, {2}, {3}, {4}, {5} ]@ }     // row pattern
            @PCHe : []@                                                 // end of table

usage#:>    playV, playA, the relationship… , Psalm 119, Aug04-2019, Justin Wainscott ;
            playV, playA, our dwelling… , Psalm 90, Jul28-2019, Douglas ;
            playV, playA, god-centered… , Psalm 90, Jul23-2019, Justin

MLa automates renderable presentation according to the sample data, turning "dry" content into "live" items such as playable video and audio, passage references, eventful dates, and the speaker's social media.

Sample data in HTML #=> .{0}.{1}.{2}.{3}.{4}.{5}.
Sample data is initiated in HTML/CSS in order to render a given preaching message featured with six parameters as the "DNA" discussed above in {0}, … {5}.

Identified pattern in aJSON #@> @PCHr: [{0},{1},{2},{3},{4},{5}]@
The codified pattern is usually represented in aJSON with actionable features integrated in support of human-computer interaction.


Decisive & iterative usage #:> 0-playV, 1-playA, 2-message, 3-passage, 4-date, 5-speaker
Decisive and iterative usage of the codified pattern means learning the pattern once via the strategic machine learning process and then applying it many times without explicit programming. The JAS agent invokes its subordinate assistant MLa in a timely way while retrieving the required WiSecs; only the "DNA" ingredients from the DdB need to be transmitted. In the meantime, MLa automates the rapid assembly of the "DNA" ingredients, via iterative usage of codified patterns, into a renderable WiSec, and the last step is to activate actionable elements, turning "dry" documents into "live" WiSecs. The beauty of MLa is that it turns "dry" elements into "live" ones via the sample data. For instance, MLa enables sermons as "live" WiSecs with multimedia (A/V), biblical references, date-related events, and the speaker's biography. Furthermore, multiple sermons can be posted by going through the iterative process without any programming involved.
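The learn-once, apply-many behaviour summarized in Table 3 can be illustrated with a short sketch: a row pattern with slots {0}..{5} is filled from each usage row to turn the "dry" DNA ingredients into renderable rows. The pattern below is a plain HTML placeholder for illustration, not wiseCIO's actual aJSON syntax.

ROW_PATTERN = ("<tr><td>{0}</td><td>{1}</td><td>{2}</td>"
               "<td>{3}</td><td>{4}</td><td>{5}</td></tr>")

def render_rows(rows, pattern=ROW_PATTERN):
    # each row carries six "DNA" ingredients:
    # playable video, playable audio, message, passage, date, speaker
    return "\n".join(pattern.format(*row) for row in rows)

rows = [
    ("playV", "playA", "the relationship...", "Psalm 119", "Aug04-2019", "Justin Wainscott"),
    ("playV", "playA", "our dwelling...",     "Psalm 90",  "Jul28-2019", "Douglas"),
]
html_table = "<table>" + render_rows(rows) + "</table>"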

3 Case Study: Content Management via Computational Thinking

According to Jeannette Wing [5], computational thinking is a fundamental skill for everyone: it is to solve problems using abstraction & decomposition, to design systems by transforming a difficult problem into a solution (pattern recognition & algorithm), and to understand by thinking in terms of prevention, protection, and recovery through redundancy, damage containment, and error correction (big data & fault tolerance). wiseCIO serves like a strategic CIO via PaaS that helps with decision-making by discovering iBEE (intelligence for Business, Education and Entertainment). It is reasonable for wiseCIO to manage and deliver complex web content via computational management thinking throughout abstraction & decomposition, pattern recognition & algorithm, and big data & fault tolerance [17, 18, 26].

3.1 xCMT: Anything in Computational Management Thinking

Web content management and comprehensive delivery aim at anything in computational management involving network resources and information technologies, such as clustered (ubiquitous) servers and remote storage, big data (DdB), and information analytical synthesis for intelligence for Business, Education, and Entertainment (iBEE). "A practical usage of this Computational Management could network subject matter within a document repository and simplify searches across a wide variety of material types, or Series, aiding the researcher with a range of information inclusive of many parameters [30]." [Evelyn Keele ~ certified archive manager]. From the viewpoint of xCMT, managing the rich content of an institution, e.g., Miami University, means using computational management thinking through abstraction (of an educational institution) and decomposition (into multiple colleges), pattern recognition (of repeating sequences of faculty, curriculum and classes) and algorithmic fulfillment (of web-intensive services), as illustrated in Fig. 12. The xCMT proceeds with problem solving, designing, understanding and application in "bidirectional" approaches (top-down analysis and bottom-up synthesis) throughout decomposition, pattern recognition, abstraction and algorithm:


Fig. 12. Computational management thinking throughout a process of problem solving, designing and understanding through four steps and two approaches.

Top-down analysis. Breaking down a single big entity into smaller, easier-to-manage parts or fragments, which is necessary for improving understanding. This top-down process starts with the big entity, e.g., Miami University, then breaks it into several colleges, such as the College of Liberal Arts (CLAS), the College of Engineering and Computing (CEC), the Farmer School of Business (FSB), etc., all of which are abstracted as components under educational content management (eCMS). Each eCMS may consist of such playable services as webJas, triVision, calEndar, cabiNet, etc. The breakdown continues with reference to aJSONs: bioGraph, newSec, litEm, picSlide, pipeLine, and so on. Recursively, an aJSON may have a body that is itself embodied as WiSecs.

Bottom-up synthesis. This refers to the process of combining the fragmented parts into an aggregated whole, which is comprehensive for improving management. The bottom-up process reflects rapid assembly by synthesizing the web content onto a larger context to present an overall view to the user. For instance, a professor has his bioGraph (in aJSON) under a department, and to the faculty, this department is the larger context onto which his biography is assembled. The process may also support aggregating synthesis if the body has multiple resources in the DdB.

wiseCIO, differing from traditional websites, provides a rationale in support of a universal interface (interactive view) via well-abstracted patterns, user-centered experience via top-down analysis (hierarchical depth, zoom-in) and bottom-up synthesis (contextual breadth, zoom-out), and ubiquitous web-intensive sections (rapid assembly of renderable presentation) via failover and load-balancing. Some examples are presented in the following parts of this section.
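As a rough illustration of the two directions (not the authors' implementation), a content hierarchy can be decomposed top-down and a fragment assembled bottom-up onto its larger context; the Python tree below is illustrative only.

content_tree = {
    "Miami University": {
        "CEC": {"CSE Department": ["faculty bioGraphs", "course WiSecs"]},
        "CLAS": {},
        "FSB": {},
    }
}

def decompose(node, depth=0):
    # Top-down: enumerate the subordinate parts of a larger entity.
    if isinstance(node, dict):
        for name, child in node.items():
            print("  " * depth + name)
            decompose(child, depth + 1)
    else:
        for leaf in node:
            print("  " * depth + leaf)

def assemble(fragment, context):
    # Bottom-up: attach a fragment (e.g. a bioGraph) onto its larger context.
    return {**context, "add-on": fragment}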


3.2 Content Management Service

wiseCIO offers a platform as a service covering the creation, design, operation, stewardship and governance of an enterprise-level digital content management system (CMS), typically consisting of two components: the content management application (CMA) and the content delivery platform (CDP) [9, 30, 31]. Equivalently, WISE acts as the CMA, responsible for the creation, management, and manipulation of digital content, while CIO acts as the CDP, aiming at delivery in a feasible, automated, scalable and testable (FAST) approach, as illustrated in Fig. 13.

Fig. 13. eCMS and wCMS derived from CMS via wiseCIO (content management and delivery)

In practice, both eCMS (educational content management) and wCMS (worship content management) are specialized and hold a wealth of digital content, so wiseCIO applies computational management thinking to comprehensive, complete and complex content management. On the other hand, content delivery should also be able to reflect the organizational structure throughout abstraction and decomposition, well-fulfilled patterns and analytical synthesis. As a result, wiseCIO favors context-aware and hierarchical extensibility in the FAST approach.

3.3 eCMS: Educational Content Management Service

wiseCIO starts with eCMS, moving from a "dry-inactive" course syllabus to an enhanced web-based intelligent service with a dynamic course syllabus that gains "live" ingredients as the class progresses, in the following categories:

Course Description. Under this category there may be: course prerequisites, course objectives and outcomes, the general plan for the course, the required textbook, and so on.

Course Modules and/or Topics. Under this category there may be scheduled plans on a modular or weekly basis in a hierarchical structure, presumably with lots of "live" ingredients as the class progresses, such as:


Lecture notes. The main lecture notes are supposed to be accessible online, for instance in Google Slides or Microsoft PowerPoint, with diagrammatic explanations via animations, etc.

Quizzes and assignments. Quizzes and assignments should be accessible and downloadable via buttons, icons, or text anchors, gradually planned with playable commands.

Scheduled projects. Initially, several projects can be planned for students to choose via group sign-ups; the progressing project should then be visual (demoable).

Student Roster. Under this category there may be basic contact information and progressive submissions according to the course topics covered.

For the sake of a visual case study, let us start with the playable table-of-contents service, via PCI (?tOc=), based on the following assumptions:

Readability. The table of contents is such a readable "tree" that everybody knows how to operate it and find whatever he is interested in, thanks to book-reading experience. This playable is powerful and unlimited (like a very thick book with a table of contents), which helps completely eliminate the hardship of layout via header and footer, as illustrated in Fig. 14.

Fig. 14. The table of contents enables contextual access (without using forward and backward), direct access (click to enter), hierarchical access (items underneath components).

Extensibility. In addition to hierarchical access within the current context, there are other means of enabling hierarchical extensibility, such as bulletins, tabs, and drawers. After all, it is the playable that supports assembly of ubiquitous web-intensive sections. The hierarchical extensibility/contextuality offers dual features in browsing: an assembled add-on via the object constructor when exploring in depth, or destruction (after being viewed) via the object destructor when exploring in breadth, as illustrated in Fig. 15.


Fig. 15. Hierarchical extensibility enables the access-assemble-add-on-and-activate process from ubiquitous WiSecs, constructing when being built up, and destroying after shut-down.

As a good example of such an aJSON, bioGraph supports hierarchical extensibility with the following actionable features:

@bioGraph{ Bryan Afadzi : [photo, playable, briefing, bafadzi@uSvr/[CSC,BIO] … }@

Photo acts as a button. When clicked, the extensible WiSec of the bio-body is brought out according to the key (bafadzi) from the given DdB;

Photo explicates underwritten synthesis. When clicked, the extensible WiSec of the bio-body is synthesized from both CSC and BIO if Bryan has profiles in both the CSC major and the BIO minor in two separate DdBs, so the synthesized presentation is rapidly assembled according to the given key (bafadzi), which has multiple resources, in the following format:

bafadzi@uSvr/[CSC,BIO]

There are more charming features in eCMS, e.g., the live course schedule, where the "live" ingredients provide user-centered experience without being programmed, as in Fig. 16.

3.4 wCMS: Worship Content Management

Worship content management involves calendar-based events, preaching messages represented in video/audio on a weekly basis, Bible study plans, and so on. As a derived CMS, wCMS has a large number of similarities to eCMS. Highlights here are the universal interface and user-centered experiences achieved by using the ML-automaton (see Table 3), as demonstrated in Fig. 17. With ubiquitous WiSecs, rapid assembly enables an excellent user-centered experience in different ways; for instance, the check-in sheet for attendance is associated with a date for the participant to check in and check out, as shown in Fig. 18. The check-in event associated with the calEndar is well demonstrated with hierarchical access (under a given date) in depth, dynamic presentation (photo related), rapid assembly (personal info) and analytical synthesis (across different events).


Fig. 16. The lively progressive schedule will have more “live” ingredients activated as the class goes on.

Fig. 17. Hierarchical extensibility and contextuality are broadly used via the JIT ML-automaton in collaboration with programmable aJSON, whose “DNA” ingredients are assembled into presentable form on the client.

4 CIO: Trustworthy Cloud Intelligence Outlet

The Cloud Intelligence Outlet (CIO) assists web-based intelligent services with iBEE discovered and disclosed via ubiquitous WiSecs. If we see wiseCIO as a novel network operating system with multiple file sharing via DdB sharding, then the CIO can similarly be seen as an I/O management system with a focus on how to present WiSecs on the client side via a FAST approach. From the viewpoint of I/O management, the following aspects will be discussed in support of the outlet for iBEE in an object-oriented construction paradigm.


Fig. 18. Associating events with a calendar also makes great sense for tracking participation.

4.1 OOWS: Object-Oriented Web-Intensive Sections

A WiSec has advantageous features, such as being accessible, assemblable as an add-on, and activatable with an integrated interface, so object-oriented design patterns are utilized to manipulate WiSecs as objects.

Object lifecycle. This treats a web-intensive section as an object with an actionable constructor when loaded via rapid assembly and analytical synthesis, and a destructor when closed (released).

Object search. Object search is connected to a PCI that may represent a relationship between keys and pathed DdBs, for instance a 1:1, 1:m, or m:1 relationship.

Object sensitivity. Object sensitivity represents context-awareness to specific keywords, which drives the user-centered experience via contextual exploration. For instance, a web-based learning service would be conducted according to user-chosen keywords to different extents of depth or breadth, with different context involved accordingly.

Object synthesis. Object synthesis embodies aggregation or composition from the simpler to the more complex. For instance, a university is composed of several colleges, so object synthesis means a college can be brought out as part of the university. In the logical presentation, hierarchical extensibility promotes object composition for the user to explore within the context in depth or breadth.

4.2 REAP: Rapid Extensible and Assemblable Presentation

wiseCIO provides a universal interface in support of hierarchical, contextual and direct accessibility to web content assembled as web-intensive sections. In addition, wiseCIO's ubiquitous WiSecs promote rapid extensible and assemblable presentations for better exploration via the universal interface and user-centered experience. As discussed in Sect. 2.2 (Fig. 7), ubiquitous WiSecs are stored as shards in the DdB in support of rapid assembly of the rendering on client-side devices.


In terms of web content management, extensibility and assemblability are keys to REAP:

Extensible presentation. From a user's perspective, any subordinate branches are extensible underneath a larger organization. For instance, standing at the level of a university, the user usually likes to explore one of the subordinate colleges; hierarchical extensibility allows the user to look into the college without losing the current context of the university, which brings out a user-centered experience in content delivery.

Assemblable propagation. From the point of view of ubiquitous WiSecs, duplicate and redundant shards of web content can be rapidly assembled for propagation as web-intensive sections via failover and load-balancing, which brings out a user-centered experience in content management.

REAP is made possible because of shards that hold semantic ingredients for web-intensive sections, and the JIT agent running on the client-side device in collaboration with distributed and parallel systems, as illustrated in Fig. 19.

Fig. 19. Flexible REAP via client-side extensibility and server-side assemblability
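As a purely illustrative sketch combining the object-oriented view of Sect. 4.1 with the extensible and assemblable presentation of Sect. 4.2, a client-side WiSec wrapper might look as follows. All class, method, and attribute names here (doc_base.lookup, shard.render, shard.tags, and so on) are hypothetical and are not part of wiseCIO.

```python
class WiSec:
    """Hypothetical client-side wrapper treating a web-intensive section as
    an object with an explicit lifecycle (constructor/destructor)."""

    def __init__(self, key, doc_base):
        # constructor: rapid assembly / analytical synthesis from DdB shards
        self.key = key
        self.shards = doc_base.lookup(key)        # object search via a PCI key
        self.view = self.assemble(self.shards)    # renderable presentation

    def assemble(self, shards):
        # combine shards into a presentable add-on within the current context
        return "".join(shard.render() for shard in shards)

    def explore(self, keyword):
        # object sensitivity: extend the context in depth for a chosen keyword
        return [s for s in self.shards if keyword in s.tags]

    def close(self):
        # destructor: release the section after it has been viewed
        self.shards, self.view = [], None
```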

4.3 CSSA: Context-Sensitive Screening Aggregation

Context-sensitive screening was inspired by resume screening, in which a recruiter just needs to input a few keywords and can then evaluate whether an applicant is qualified for the job. This process is very subjective, but the recruiter may feel more freedom to use his judgment instead of reviewing the whole resume. An online courseware via a web-based learning service with CSSA can be prepared as thoroughly as possible, and also kept as brief as natural, which gives the user more flexibility in self-taught classes. That is to say, when presented with the course content, he may find some terms too unfamiliar and would like to explore by typing in a few keywords (as questions) at the beginning. As the ideal outcome, the associated keywords are screened out with playable services being activated accordingly, so he is happy to go into more depth in that direction, as illustrated in Fig. 20.


Fig. 20. Context sensitivity leads to dynamically-customized context accordingly

The online instructional courseware (OLIC) points in the right direction toward excellent online courses, offered with plenty of hierarchical materials supplied via the playable commanding interface according to the course designer. A user-centered experience via CSSA gives the user the feeling of sitting in the driver's seat, “always ready to learn”, because the dynamically-customized context helps meet his special needs in his studies. On the contrary, there may nowadays be “perfect” courseware online, but with too many “live” ingredients for everybody to look into, it could make the learner feel as if in the passenger's seat, because he might not “like being taught” in a passive way. Let the learner experience as much as possible in the “driver's seat” so that he is “always ready to learn” via context-sensitive screening aggregation, which allows the user to choose keywords to explore, and let the learner feel as little as possible in the “passenger's seat” so that he won't feel “I don't always like being taught.” OLIC, or online instructional courseware, should thus reflect Winston Churchill's ideology on learning.


In short, wiseCIO is all about web-based intelligent services engaging a Cloud Intelligence Outlet: a PaaS with a universal interface and user-centered experience built on the ubiquitous synthesis of intelligence for business, education and entertainment (iBEE)!

5 Conclusion: Thriving for OLAPs in a FAST Approach

This paper presents wiseCIO, a PaaS enabling a FAST approach throughout playable services and programmable aJSON. Central to wiseCIO is the web-intensive section (WiSec), which embodies playable services as renderable (view), actionable (interactivity), assemblable (contextuality) and analytically synthesizable (extensibility).

Innovative solution. wiseCIO innovates solutions for the fulfillment of UnI (universal interface) via browsing on a variety of electronic devices across different operating systems without being explicitly programmed; UcX (user-centered experience) via extensibility in depth and contextuality in breadth in computational management thinking throughout conceptual abstraction, hierarchical decomposition, well-recognized patterns and algorithmic fulfilment; and UwS (ubiquitous web-intensive sections/services) via rapid assembly of renderable presentations, actionable human-computer interaction (HCI), and analytical synthesis of information in a feasible, automated, scalable and technological approach.

aJSON: programmable JSON. With well-recognized and smartly codified patterns, aJSON is derivable, with the capability of enhancement, in the first place. The derivable aJSON is capable of fetching meaningful sample data (patterns) in support of the machine learning automaton, which makes significant sense of turning “dry & inactive content” (DIC) into “live” content with super ease. Secondly, aJSON is a DNA-like briefing, but with rich semantic ingredients underwritten, which enables a reduction of bandwidth usage of up to 90% with more automated features handled flexibly on the client side. Thirdly, aJSON parameterizes and visualizes the usage of UI & UX design without programming skills required (please see Fig. 7 and Fig. 11 in Sect. 2).

WiSec: playable services. With the support of programmable aJSON, the WiSec embodies playable services without much coding effort or the layout pain of header and footer, due to the user-centered experience of freedom via extensibility in depth and contextuality in breadth. Also, a WiSec can be used either as an individual “webpage” (propagating direct access) or as an extensible add-on to a larger context (assembled hierarchically), which promotes computational management thinking in a human-centered computing experience.

wiseCIO: platform as a service. wiseCIO is endowed with well-recognized patterns, from playable services to programmable aJSON. In addition, wiseCIO's scalability enables WISE creation, management and manipulation of web content, and makes a strategic CIO for intelligent content delivery by rapid assembly in “playable” usage and analytical synthesis from the ubiquitous distributed docBase (DdB) via failover and load-balancing. Furthermore, wiseCIO's computationality favors improved understanding in computational management thinking, and engages the user to explore iBEE for decision-making throughout conceptual abstraction, hierarchical decomposition, pattern recognition and algorithmic fulfilment.


Future work will devote more effort to online analytical processing and synthesis (OLAPs) for the cloud intelligence outlet, as follows:

Feasible. Generally, our plan is to focus on how to define topics in a feasible approach under categories of various application fields, using categorized keywords to denote such topics as enterprise (business), education, and entertainment.

Automated. Specifically, we will prioritize educational topics and how to gather the relevant information stored in distributed docBases in an actionable approach for analytical processing.

Scalable. Importantly, we will also conduct an assessment of the validity of collected information to determine whether the topic-related information makes sense, in a scalable approach, for failover via redundancy and load-balancing via duplicates.

Testable. Last, machine learning and automaton techniques will be applied throughout a test-driven agile development for overall online analytical processing and synthesis (OLAPs).

wiseCIO thrives on a practical significance and application value of its own: “As most archival repositories are digitizing original records to make them widely available to the public, complex content management (WISE) and web content delivery (CIO) as organized within cabiNet (Fig. 5) seem to correspond to a logical arrangement.” [Evelyn Keele, certified archive manager]

The novelty of wiseCIO is to promote a universal interface, user-centered experience and ubiquitous synthesis in a FAST approach toward novel network operating systems.


A Flexible Hybrid Approach to Data Replication in Distributed Systems Syed Mohtashim Abbas Bokhari(B) and Oliver Theel Department of Computer Science, University of Oldenburg, Oldenburg, Germany {syed.mohtashim.abbas.bokhari,oliver.theel}@uni-oldenburg.de

Abstract. Data replication plays a very important role in distributed computing because a single replica is prone to failure, which is devastating for the availability of the access operations. High availability and low cost for the access operations as well as maintaining data consistency are major challenges for reliable services. Failures are often inevitable in a distributed paradigm and greatly affect the availability of services. Data replication mitigates such failures by masking them and makes the system more fault-tolerant. It is the concept by which highly available data access operations can be realized while keeping their cost reasonably low. There are various state-of-the-art data replication strategies, but there exist infinitely many scenarios which demand designing new data replication strategies. In this regard, this work focuses on this problem and proposes a holistic hybrid approach towards data replication based on voting structures.

Keywords: Distributed systems · Fault tolerance · Data replication · Quorum protocols · Operation availability · Operation cost · Hybrid data replication · Voting structures · Optimization

1 Introduction

Data replication is a means by which failures in a distributed paradigm can be masked to gain high availability and fault tolerance within the system. But even in the case of replicated data, one could easily succumb to incorrect values when one replica is updated and other replicas do not reflect it. Moreover, conflicting operations, too, need to be managed to prevent them from affecting correctness. These problems are known as consistency issues; the data should always be consistent so as to meet the one-copy serializability (1SR) property [1]. 1SR ensures that the access operations in a replicated system behave as they would in a non-replicated system. There are strategies known as data replication strategies (DRSs) to ensure such a property and make distributed systems highly available.

Let us consider a simple example of a component that detects the altitude of a plane. Altitude is a critical measurement, and the judgment could easily be clouded by a failure of the component. For this reason, instead of relying on one component, a system can rather be comprised of three functionally identical components. This way, the failure of one component can easily be compensated by the other two components when the decision is made by majority to come to a conclusive value. The presence of this redundancy among the components, along with this majority protocol, can easily compensate for the failure of a single component. Hence, we achieve fault tolerance in the system and, thereby, higher availability.

Many DRSs enforce a quorum mechanism (a threshold on the minimal number of replicas), comprised of a read quorum (Qr) and a write quorum (Qw), for the access operations to be performed. A read operation reads values from the set of replicas of a read quorum and a write operation writes values to the set of replicas of a write quorum. These strategies work in such a way that the intersection between the read and write quorums suffices to meet the 1SR property. The DRSs may enforce certain topologies and patterns to access replicas, which thereby indicate their diversity. The decision for a suitable DRS to be chosen for an environment surely is a tradeoff between different qualities, i.e. load, capacity, availability [2], scalability, and cost [3]. The availability of a read operation is point-symmetrical to the availability of the write operation [4] in optimized DRSs, which means both cannot be optimized independently. An increase in the read availability therefore often results in sacrificing the write availability, and the same holds for the cost of the access operations. The conflicting goals imply that there is no single best solution, and compromises must be made depending on the scenario and the suitable choices for the concerned application. Furthermore, there are infinitely many such scenarios, but existing DRSs do not cover them entirely.

This paper is structured as follows: Sect. 2 postulates the research question. Section 3 discusses contemporary data replication approaches and their limitations. Section 4 presents an approach to address the research question. Section 5 presents the results of the proposed approach, followed by the conclusion and prospective future work in Sect. 6.

2 Research Question

The quality metrics under focus are the cost and availability of the access operations, which also depend upon the number of replicas and their individual availability. The availability of the access operations is the probability by which a user can successfully perform an operation from anywhere within the system. The operation cost reflects the average minimal number of replicas that must be accessed to get the correct value. There exist numerous cases between these quality metrics, represented by the white boxes shown in Fig. 1. The aim in general is to increase availability and reduce cost. But these white boxes represent infinitely many scenarios between the cost and the availabilities of the access operations, despite consistency being secured as 1SR. These scenarios cannot be addressed entirely by contemporary strategies. There is no single best solution, and many scenarios may easily be left uncovered. This raises the need for new replication strategies to be identified, and the questions of how to design unknown DRSs out of the existing ones and what the hindrances of doing so are.


Fig. 1. Trade-off scenarios.

3 Related Work

There are various replication strategies known from the literature, such as Read-One Write-All (ROWA) [5], the Majority Consensus Strategy (MCS) [6], the Tree Quorum Protocol (TQP) [7], the Weighted Voting Strategy (WVS) [8], Hierarchical Quorum Consensus (HQC) [9], the Grid Protocol (GP) [10], and the Triangular Lattice Protocol (TLP) [11]. These strategies constitute different properties, costs, and availabilities for the access operations. As already mentioned, there are tradeoffs, leading to indefinitely many scenarios between these quality metrics to cover, which in turn leads to the need for designing new replication strategies. This paper uses a so-called hybrid approach where different strategies are glued together to form a new DRS. Not much work has been done on such a hybrid approach yet; there exist only a few attempts in the literature, such as [12] and [13], which merely combine TQPs and GPs. Figure 2 shows different topologies and patterns of replication strategies, indicating their diverse nature. This diversity makes the task of developing a hybrid approach very cumbersome to pursue. The problem easily gets out of hand because of these varied patterns for accessing replicas.

Fig. 2. Contemporary replication strategies [16].

In this regard, this paper uses so-called voting structures to achieve a holistic hybrid approach to DRSs, which abstracts away their imposed logical structures in anticipation of making the task easier to achieve. There exist a few papers, i.e., [14] and [15], which use voting structures. In this work, we follow the same line of designing DRSs using voting structures, particularly focusing on MCS, GP, and TLP. This could be a roadmap towards the automatic design of such strategies in the future.


4 An Idea of Holistic Hybridity in DRSs

The research uses voting structures [17] as a unified representation of replication strategies to eliminate the hindrances in achieving the hybrid nature of new solutions. Figure 3 elaborates on a quorum system of TQP modeled as a voting structure (here being a directed acyclic graph, DAG). It consists of virtual and actual replicas represented by Vi and pj, respectively. The quorums are the minimal thresholds of replicas for the read and write operations. Furthermore, every node has votes (indicated on the right) reflecting the weight of that node in the collection of quorums. In some cases, the order in which to access the replicas is also important for reducing the cost. It is shown as edge priorities, with 1 being the highest and ∞ being the lowest priority. Considering the total number of votes V on every level, the following conditions are normally met for 1SR to hold:

Qr + Qw > V   (to avoid read-write conflicts)   (1)

Qw > V/2   (to avoid write-write conflicts)   (2)
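As a small illustration of these two intersection conditions (our own sketch, using an MCS over four replicas with one vote each as the example):

```python
def satisfies_1sr(V, Qr, Qw):
    """Check quorum conditions (1) and (2) for one-copy serializability."""
    return Qr + Qw > V and Qw > V / 2

# e.g. MCS over four replicas, one vote each: Qr = 2, Qw = 3
assert satisfies_1sr(V=4, Qr=2, Qw=3)
```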

Fig. 3. Voting structure.

For instance, the voting structure presented in Fig. 3 produces the following read (RQ) and write (WQ) quorum sets for performing the respective access operations:

RQ = {{p1}, {p2, p3}, {p2, p4}, {p3, p4}}
WQ = {{p1, p2, p3}, {p1, p2, p4}, {p1, p3, p4}}

Figure 4 demonstrates MCS, TQP, ROWA, GP, and TLP (from top left to bottom right), each comprising four replicas, modeled in the unified representation of voting structures. The respective quorums and votes are set in the instances; for simplicity, edge priorities are not shown in the figure, and the interpreting algorithm takes care of the order of replicas inherently. These voting structures forming quorum systems evidently eliminate the diversity between the replication strategies. The same quorums would be derived recursively here as they would be in an orthodox representation. But this representation is immensely powerful and is key to our hybrid approach.

Fig. 4. DRSs being represented as voting structures.
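To make the recursive derivation of quorums from a voting structure concrete, the following is a minimal Python sketch (our own illustration, not the authors' tool). The vote values and thresholds assigned to the example nodes are our assumption, chosen so that the derived quorums match the RQ and WQ sets listed above; the actual Fig. 3 may use different values.

```python
from itertools import combinations

class Node:
    """A voting-structure node: a physical replica (leaf) or a virtual
    replica with children, one vote value per child, and thresholds qr/qw."""
    def __init__(self, name=None, children=(), votes=(), qr=0, qw=0):
        self.name, self.children = name, list(children)
        self.votes, self.qr, self.qw = list(votes), qr, qw

def minimal(sets):
    # keep only minimal quorums (drop supersets)
    return {s for s in sets if not any(t < s for t in sets)}

def quorums(node, write=False):
    """Recursively derive the minimal read or write quorums of a node."""
    if node.name is not None:                       # physical replica
        return {frozenset([node.name])}
    threshold = node.qw if write else node.qr
    result = set()
    idx = range(len(node.children))
    for r in range(1, len(node.children) + 1):
        for chosen in combinations(idx, r):
            if sum(node.votes[i] for i in chosen) >= threshold:
                partial = [frozenset()]
                for i in chosen:                    # cross product of children
                    partial = [p | q for p in partial
                               for q in quorums(node.children[i], write)]
                result.update(partial)
    return minimal(result)

# A structure in the spirit of Fig. 3 (assumed votes and thresholds).
p1, p2, p3, p4 = (Node(name=f"p{i}") for i in range(1, 5))
v2 = Node(children=[p2, p3, p4], votes=[1, 1, 1], qr=2, qw=2)
v1 = Node(children=[p1, v2], votes=[1, 1], qr=1, qw=2)
print(sorted(map(sorted, quorums(v1))))              # read quorums
print(sorted(map(sorted, quorums(v1, write=True))))  # write quorums
```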

Figure 5 shows a more complex (computer-generated) example of a voting structure that models a TLP-like strategy consisting of six replicas. A manual example of this strategy can be found in [18]. In the same manner, any quorum-based replication strategy can be modeled to represent the respective quorums in the form of these DAGs, with the aim of easily combining them with others afterwards.

Fig. 5. Modeling of a TLP as a voting structure.


5 Experiments and Results

In this section, the hybrid approach is practically applied to cutting-edge DRSs by exploiting the concept of voting structures. MCS is the superior strategy in terms of its operation availabilities, since the protocol freely chooses any of the replicas to form the quorum by majority. TLP fairly competes with MCS in terms of the availability of the access operations but yields a better cost by slightly compromising its availability. There are many possible ways of combining these DRSs to inherit the qualities of both to some extent. In the simplest case, they can be combined either way: MCS on top of TLP or vice versa, as shown in Fig. 6. Figure 7 presents a hybrid DRS with MCS on top of TLP. The MCS imposes a logical structure over TLP, and TLP is then applied to the physical replicas through MCS. Any two of the child substructures of the root node can be selected to form a read or write quorum. Hence, the respective quorums for the access operations can easily be derived by traversing this structure recursively.

Fig. 6. Hybrid DRSs of MCS & TLP.

Fig. 7. Voting structure: MCS on top of TLP.

Figure 8 represents the hybrid DRS in the form of a voting structure with TLP on top of MCS. An MCS with three replicas is attached to every physical node of TLP, so that the system consists of a total of 12 replicas. The leaf nodes, hence, are all physical replicas, while the rest are virtual replicas representing groupings of virtual and actual replicas.

Fig. 8. Voting structure: TLP on top of MCS.

Figure 9 illustrates the availability comparison of the two above-given hybrid strategies in their respective order (Fig. 7 as Strategy 1 and Fig. 8 as Strategy 2). The x-axis represents the availability of replicas and the y-axis denotes the availability of the access operations. The availability is calculated by adding up the probabilities of all the possible cases of the quorums [19]. It can be seen that they exhibit different properties and availabilities. The latter strategy, which has MCS at the bottom, has better write availability than the former. This improved write availability comes at the expense of read availability, but for larger values of p, the second strategy is better for the operations holistically, as it is comparatively harder to increase the write availability.

Fig. 9. Availability of hybrid DRS.
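To illustrate the availability computation principle described above, the sketch below sums the probabilities of all replica-failure cases that still contain a quorum, shown for a flat MCS (our own illustration of the principle, not the authors' evaluation tool for hybrid strategies).

```python
from math import comb

def quorum_availability(n, p, quorum):
    """P(at least `quorum` of n replicas, each up with probability p, are up)."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(quorum, n + 1))

n = 12
qw = n // 2 + 1        # majority write quorum (Qw = 7)
qr = n - qw + 1        # smallest read quorum with Qr + Qw > n (Qr = 6)
for p in (0.80, 0.90, 0.99):
    print(p, quorum_availability(n, p, qr), quorum_availability(n, p, qw))
```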

Figure 10 displays the cost comparison of the two above-mentioned hybrid strategies in their respective order (Fig. 7 as Strategy 1 and Fig. 8 as Strategy 2). The cost is calculated by adding up the probabilities of the respective cases multiplied by their minimal quorums; the resulting value is then divided by the respective access operation's availability [19]. The operations appear to have a quite economical cost: the values differ in the middle and later converge onto the same value of four replicas each for the best cases.

Fig. 10. Cost of hybrid DRS.

Similarly, Fig. 11 is an endeavor to combine TLP and GP (which can also be combined in several ways). The given instance focuses on the possibilities of either GP being on top of TLP or TLP being on top of GP. GP has a better read availability, while TLP has an edge in write availability and cost values, particularly for the best cases.

Fig. 11. Hybrid DRSs of GP & TLP.

Figure 12 depicts the above-mentioned hybrid DRS in the form of a voting structure with GP on top of TLP. The structure itself is rather complex and, unfortunately, too large to fit in well, but it can be noticed that it comprises 16 actual replicas (leaf nodes) and further virtual replicas to support the quorum mapping of the protocols. Figure 13 represents the mentioned hybrid DRS (see Fig. 11, right DRS) in the form of a voting structure with TLP on top of GP. This type of hybrid approach results in a huge increase in read availability and outclasses MCS both in read availability and in the cost of the access operations (as MCS is costly in general) but, unfortunately, compromises the write availability.


Fig. 12. Voting structure: GP on top of TLP.

Fig. 13. Voting structure: TLP on the top of GP.

As another example, Fig. 14 shows the availability comparison between the hybrid approach (MCS at the bottom of TLP, Fig. 8) and a flat MCS of 12 replicas. Strategy 1 represents the MCS, while Strategy 2 represents the hybrid one. The red and pink lines indicate the read and write availabilities of the flat MCS, respectively, whereas the blue and green lines depict the operation availabilities of the hybrid strategy, respectively. It can be seen that the operation availabilities of both strategies are very close and converge to basically the same values for larger values of p, which is good enough considering today's quite reliable hardware.

Fig. 14. Availability: MCS vs. hybrid DRS.


As shown in Fig. 15, in terms of cost, the hybrid DRS is far cheaper than the flat MCS, while, as seen above, availability is not much compromised either. The blue and green lines represent the read and write operations of the hybrid strategy, respectively. For the best case, it takes merely four replicas to perform a read or a write operation, while the flat MCS takes a constant cost of 13 replicas in total to perform both access operations. The goal here is to achieve low operational costs while at the same time not sacrificing too much availability. Here, we have significantly decreased the cost of the access operations while not compromising much on the availabilities.

Fig. 15. Cost: MCS vs. hybrid DRS.

As described, the representation of quorum protocols as voting structures allows replication strategies to be easily merged with other strategies in many possible ways. The resulting new DRSs can then be used to fulfill the requirements of scenarios, considering the quality metrics of operation cost and operation availabilities while still guaranteeing 1SR with a certain number of replicas, which would not have been that easily possible with homogeneous strategies.

6 Conclusion and Future Work

The paper demonstrated the usefulness of a hybrid approach based on voting structures for introducing new hybrid DRSs which may potentially fulfill uncovered application scenarios for replicated data. It utilizes voting structures as a key element of this hybrid approach to data replication. Such a unified representation of DRSs makes it very convenient to perform “crossovers” and merge any quorum-based DRS with other quorum-based DRSs in anticipation of adopting the best properties of the two. The paper models cutting-edge replication strategies as unified voting structures, evaluates their performance, and compares the results with contemporary strategies. This idea may lead to the generation of new application-optimized DRSs satisfying the needs of given application-specific scenarios.


As a part of future work, we plan to develop a framework to automatically design replication strategies with the help of genetic programming. Moreover, automatically finding optimal combinations by which to merge those DRSs is also in our focus.

References
1. Bernstein, P., Hadzilacos, V., Goodman, N.: Concurrency Control and Recovery in Database Systems, p. 370. Addison Wesley, Boston (1987). ISBN-13 978-0201107159
2. Naor, M., Wool, A.: The load, capacity, and availability of quorum systems. SIAM J. Comput. 27(2), 423–447 (1998)
3. Jimenez-Peris, R., Patiño-Martínez, M., Alonso, G., Kemme, B.: How to select a replication protocol according to scalability, availability, and communication overhead. In: Proceedings of the 20th IEEE Symposium on Reliable Distributed Systems (SRDS) (2001)
4. Theel, O., Pagnia, H.: Optimal replica control protocols exhibit symmetric operation availabilities. In: Proceedings of the 28th International Symposium on Fault-Tolerant Computing (FTCS-28), pp. 252–261 (1998)
5. Bernstein, P., Goodman, N.: An algorithm for concurrency control and recovery in replicated distributed databases. ACM Trans. Database Syst. (TODS) 9(4), 596–615 (1984)
6. Thomas, R.H.: A majority consensus approach to concurrency control for multiple copy databases. ACM Trans. Database Syst. 4(2), 180–207 (1979)
7. Agrawal, D., Abbadi, A.: The tree quorum protocol: an efficient approach for managing replicated data. In: Proceedings of the 16th International Conference on Very Large Data Bases (VLDB), pp. 243–254 (1990)
8. Gifford, D.: Weighted voting for replicated data. In: Proceedings of the 7th ACM Symposium on Operating Systems Principles (SOSP), pp. 150–162 (1979)
9. Kumar, A.: Hierarchical quorum consensus: a new algorithm for managing replicated data. IEEE Trans. Comput. 40(9), 996–1004 (1991)
10. Cheung, S., Ammar, M., Ahamad, M.: The grid protocol: a high performance scheme for maintaining replicated data. IEEE Trans. Knowl. Data Eng. 4(6), 582–592 (1992)
11. Wu, C., Belford, G.: The triangular lattice protocol: a highly fault tolerant and highly efficient protocol for replicated data. In: Proceedings of the 11th Symposium on Reliable Distributed Systems (SRDS). IEEE Computer Society Press (1992)
12. Arai, M., Suzuki, T., Ohara, M., Fukumoto, S., Iwasak, K., Youn, H.: Analysis of read and write availability for generalized hybrid data replication protocol. In: Proceedings of the 10th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC), pp. 143–150 (2004)
13. Choi, S., Youn, H.: Dynamic hybrid replication effectively combining tree and grid topology. J. Supercomput. 59(3), 1289–1311 (2012)
14. Theel, O.: Rapid replication scheme design using general structured voting. In: Proceedings of the 17th Annual Computer Science Conference, Christchurch, New Zealand, pp. 669–677 (1994)
15. Pagnia, H., Theel, O.: Priority-based quorum protocols for replicated objects. In: Proceedings of the 2nd International Conference on Parallel and Distributed Computing and Networks (PDCN), Brisbane, Australia, pp. 530–535 (1998)
16. Lee, Y.-J., Kim, H.-Y., Lee, C.-H.: Cell approximation method in quorum systems for minimizing access time. Cluster Comput. 12(4), 387–398 (2009)
17. Theel, O.: General structured voting: a flexible framework for modelling cooperations. In: Proceedings of the 13th International Conference on Distributed Computing Systems (ICDCS), pp. 227–236 (1993)


18. Storm, C.: Specification and Analytical Evaluation of Heterogeneous Dynamic Quorum-Based Data Replication Schemes. Springer Vieweg, Wiesbaden (2012). ISBN 978-3-8348-2380-9
19. Schadek, R., Theel, O.: Increasing the accuracy of cost and availability predictions of quorum protocols. In: Proceedings of the 22nd IEEE Pacific Rim International Symposium on Dependable Computing (PRDC), pp. 98–103 (2017)

A Heuristic for Efficient Reduction in Hidden Layer Combinations for Feedforward Neural Networks Wei Hao Khoong(B) Department of Statistics and Applied Probability, National University of Singapore, Singapore, Singapore [email protected]

Abstract. In this paper, we describe the hyperparameter search problem in the field of machine learning and present a heuristic approach in an attempt to tackle it. In most learning algorithms, a set of hyperparameters must be determined before training commences. The choice of hyperparameters can affect the final model's performance significantly, yet determining a good choice of hyperparameters is in most cases complex and consumes a large amount of computing resources. In this paper, we show the differences between an exhaustive search of hyperparameters and a heuristic search, and show that there is a significant reduction in the time taken to obtain the resulting model with marginal differences in evaluation metrics when compared to the benchmark case.

Keywords: Heuristic · Combinatorics · Neural networks · Hyperparameter optimization

1 Preliminaries

Much research has been done in the field of hyperparameter optimization [1–3], with approaches such as grid search, random search, Bayesian optimization, gradient-based optimization, etc. Grid search and manual search are the most widely used strategies for hyperparameter optimization [3]. These approaches leave much room for reproducibility and are impractical when there are a large number of hyperparameters. For example, grid search suffers from the curse of dimensionality when the number of hyperparameters grows very large, and manual tuning of hyperparameters requires considerable expertise, which often leads to poor reproducibility, especially with a large number of hyperparameters [4]. Thus, the idea of automating hyperparameter search is increasingly being researched upon, and these automated approaches have already been shown to outperform manual search by numerous researchers across several problems [3].

A multilayer perceptron (MLP) is a class of feedforward artificial neural network (ANN) which can be viewed as a logistic regression classifier, where the inputs are transformed using a learnt non-linear transformation and stored in an input layer. Every element which holds an input is called a “neuron”. A MLP typically consists of at least three layers of nodes: an input layer, a hidden layer and an output layer. With the exception of the input layer, each node in a layer is a neuron that utilizes a nonlinear activation function. In training the MLP, backpropagation, a supervised learning technique, is used.

In our experiments, we only have test cases consisting of one to three hidden layers, each consisting of up to 10 neurons. The reason for this number is that our objective is to illustrate the effects of the heuristic using a small toy example that does not take too long to run in the test cases. Moreover, we found that for the datasets used, the best results from our experiments with grid search and a limit of 10 neurons in each layer involved far fewer than 10 neurons for each layer.

2 Related Work

There is related work involving search heuristics with similar methods of search, such as the comparison of the error metrics at each iteration of the algorithm. To the best of our knowledge, there has yet to be similar work which involves the result of a grid search as input to the search algorithm, especially in the area of neural architecture search. Similar work was done for time series forecasting [5], where the authors employed a local search algorithm to estimate the parameters for the non-linear autoregressive integrated moving average (NARMA) model. In that paper, the algorithm's improvement method progressively searches for a new value in a small neighborhood of the underlying solution until all the parameter vector's elements have been analyzed. The error metric of interest was the mean square error (MSE), where a new direction of search was created when the new vector produces a smaller MSE. Another work [6] which utilized a similar improvement method presented a hybrid global search algorithm for supervised learning of feedforward neural networks, which combines a global search heuristic on a sequence of points and a simplex local search.

3 Experiment Setting and Datasets

3.1 Programs Employed

We made use of Scikit-Learn [7], a free software machine learning library for the Python programming language. Python 3.6.4 was used for formulating and running the algorithms, plotting the results and preprocessing the data.

3.2 Resources Utilized

All experiments were conducted on the university's High Performance Computing (HPC) machines, where we dedicated 12 CPU cores and 5 GB of RAM in the job script. All jobs were submitted via SSH through the atlas8 host, which has the following specifications: HP Xeon two-socket 12-core 64-bit Linux cluster, CentOS 6. We utilized Snakemake [8], a workflow management tool, to conduct our experiments.

3.3 Data

We perform the experiments on two datasets: the Boston house-prices dataset and the MNIST handwritten digit dataset [9]. The Boston house-prices dataset is available from Scikit-Learn's sklearn.datasets package. This package contains a few small standard datasets that do not require downloading any file(s) from an external website. The MNIST dataset can be downloaded from http://yann.lecun.com/exdb/mnist/.

3.4 Notations and Test Cases

We perform our experiments on feedforward neural networks with one, two and three hidden layers, with Scikit-Learn's MLPRegressor and MLPClassifier from the sklearn.neural_network package. The MLPRegressor is used for predicting housing prices for the Boston dataset and the MLPClassifier is used for classification with the MNIST dataset. The models optimize the squared loss using the Limited-memory Broyden–Fletcher–Goldfarb–Shanno (LBFGS) algorithm [10], an optimizer in the family of quasi-Newton methods. The hyperparameter values were all fixed except for the number of neurons at each hidden layer, denoted by hidden_layer_sizes, which depends on the current input at each iteration of the algorithm. The other (fixed) hyperparameters are: activation (relu), solver (lbfgs), alpha (0.0001), batch_size (auto), learning_rate (constant), learning_rate_init (0.001), max_iter (500) and random_state (69). All other hyperparameters not stated here use their default values. Further information about the hyperparameters can be found in the sklearn.neural_network documentation. Our maximum number of neurons in any hidden layer is set at 10, as preliminary experiments show that the best stratified 5-fold cross-validation score is obtained when the number of neurons at any hidden layer is under 10.

1 See https://nusit.nus.edu.sg/services/hpc/about-hpc/ for more details about the HPC.
2 See documentation at https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html.
3 See documentation at https://scikit-learn.org/stable/modules/classes.html#module-sklearn.neural_network.
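For reference, a minimal sketch of the fixed hyperparameter setting described above (the tuple passed to hidden_layer_sizes is just an illustrative value, not one prescribed by the paper):

```python
from sklearn.neural_network import MLPRegressor, MLPClassifier

# Fixed hyperparameters from the text; only hidden_layer_sizes varies
# between iterations of the search.
fixed = dict(
    activation="relu",
    solver="lbfgs",
    alpha=0.0001,
    batch_size="auto",
    learning_rate="constant",
    learning_rate_init=0.001,
    max_iter=500,
    random_state=69,
)

reg = MLPRegressor(hidden_layer_sizes=(3, 4, 3), **fixed)   # Boston house prices
clf = MLPClassifier(hidden_layer_sizes=(3, 4, 3), **fixed)  # MNIST digits
```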


Define α to be the minimum fraction of improvement required for each iteration of the algorithm. We also define H(i), i = 0, 1, ..., to be the set of hidden-layer combination(s) at iteration i. H(0) is the starting set of hidden-layer combinations used as input, with |H(0)| = 1. Let N be the number of hidden layers and Hmodel(j) be the set containing the best hidden-layer combination obtained from fitting with GridSearchCV from Scikit-Learn's sklearn.model_selection package. For example, if H(0) = {(3, 4, 3)}, it means that there are 3 neurons in the first and third hidden layers and 4 neurons in the second layer. We also define Hprev as the set containing all previously fitted hidden-layer combinations. Hbest is then the set containing the best combination at any iteration of the algorithm, i.e. |Hbest| ≤ 1. Scikit-Learn's sklearn.preprocessing.StandardScaler was also used to standardize features by removing the mean and scaling to unit variance. We also denote the Root Mean Square Error (RMSE) from fitting the model with validation on the test dataset, at the end of the current iteration, and from the previous iteration as RMSEmodel, RMSEcurr and RMSEprev, respectively. In our experiments, α ∈ {0.01, 0.05, 0.10}. We also set the initial upper-bound threshold K on the RMSE to an arbitrarily large value, for the purpose of passing the first iteration of the loop. Finally, we define Combinations(·) as a function that returns the set of all possible hidden layers Lhls without duplicates.
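As one possible reading of these definitions (our own sketch, not the authors' code), Combinations(·) and the standardization step could be expressed as follows; X_train_raw and X_test_raw are assumed to be the raw feature matrices.

```python
from itertools import product
from sklearn.preprocessing import StandardScaler

def combinations_fn(n_min, n_max, n_layers):
    """Combinations(·): all hidden-layer tuples with `n_layers` layers and
    between n_min and n_max neurons per layer, without duplicates."""
    return set(product(range(n_min, n_max + 1), repeat=n_layers))

# Standardize features by removing the mean and scaling to unit variance.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_raw)
X_test = scaler.transform(X_test_raw)
```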

4 Methods Employed

4.1 Method 1 - Benchmark

In this method, all possible hidden-layer sizes (with repetition allowed) are used as the hyperparameter values. Let Lhls denote the set of all possible hidden layers. For example, if there are 2 hidden layers and each layer can have between 1 and 3 neurons, then Lhls = {(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (3, 3), (3, 2), (3, 1)}.

4 See documentation at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html.
5 Further documentation can be found at https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html.
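As an illustration of the exhaustive search in Method 1 (a sketch under the assumption that the `fixed` dictionary and the preprocessed X_train, y_train from above are available; not the authors' code):

```python
from itertools import product
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

# Benchmark: grid-search every hidden-layer combination for N hidden layers
# with 1 to 10 neurons per layer.
N, max_neurons = 2, 10
L_hls = list(product(range(1, max_neurons + 1), repeat=N))

benchmark = GridSearchCV(MLPRegressor(**fixed),
                         {"hidden_layer_sizes": L_hls},
                         cv=5, n_jobs=-1)
benchmark.fit(X_train, y_train)
print(benchmark.best_params_)
```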

4.2 Method 2 - Heuristic

In this method, a heuristic is used to iteratively explore the hidden-layer combinations, subject to the condition that the absolute value of the relative change in RMSE is greater than or equal to α and that RMSEcurr > RMSEprev. To be precise, the algorithm optimizes within a fixed number of hidden layers. In our experiments, we obtain the input H(0) by performing a grid search on an initialization of hidden-layer combinations of the form L0 = {(1, 1, ..., 1), ..., (10, 10, ..., 10)}, and the 'best' hidden-layer combination {(i, i, ..., i)}, i ∈ {1, ..., 10}, is assigned as H(0). Let nmax(input) be the maximum number of neurons across all hidden layers in L0; in the above example, nmax(input) = 10. The sequence of steps of the heuristic and its pseudocode are as follows:

1. Initialize a set of hidden-layer combinations of the form
   L0 = {(1, 1, ..., 1), ..., (i, i, ..., i), ...}, i ≥ 1.     (1)
   Also set an arbitrarily large value for RMSEprev and RMSEcurr.
2. Perform a grid search with L0 to obtain H(0), which corresponds to the set that contains the combination with the lowest test-set RMSE from grid search with stratified K-fold cross-validation.
3. Generate the current iteration's set of hidden-layer combinations without duplicates, with:
   a. nmin as the minimum number of neurons in any hidden-layer combination in the previous iteration's set of hidden-layer combinations, minus the current iteration's index. If nmin < 1, set it to 1.
   b. nmax as the maximum number of neurons in any hidden-layer combination in the previous iteration's set of hidden-layer combinations, incremented by 1. If nmax > nmax(input), set nmax to nmax(input).
4. If the set of hidden-layer combinations for the current iteration is empty, the algorithm terminates. Otherwise, obtain the best hidden-layer combination of the set, and set it as the current iteration's best combination. Add the iteration's set of hidden-layer combinations to the set of previously fitted hidden-layer combinations and record the current iteration's best combination as the overall best hidden-layer combination.
5. Repeat steps 3 and 4. If the algorithm terminates as a consequence of step 4, return the last found best hidden-layer combination.

Algorithm 1: Heuristic

Input: α, H(0), nmax(input), N, K, Combinations(·)
Output: RMSEcurr, Hbest

Hprev, Hbest ← {}, {}
RMSEprev, RMSEcurr ← K, K²
i ← 0
Δ = |(RMSEcurr − RMSEprev) / RMSEprev|
while Δ ≥ α and RMSEcurr > RMSEprev do
    RMSEprev ← RMSEcurr
    if i = 0 then
        H(i+1) ← H(0)
    else
        nmin ← min(H(i)) − i
        if nmin < 1 then nmin ← 1 end if
        nmax ← max(H(i)) + 1
        if nmax > nmax(input) then nmax ← nmax(input) end if
        H(i+1) ← Combinations(nmin, nmax, N)
    end if
    if length(H(i+1)) = 0 then break end if
    model ← fit(H(i+1))
    Hcurr,best ← Hmodel(i+1)
    Hprev ← Hprev ∪ H(i+1) \ (H(i+1) ∩ Hprev)
    if length(Hbest) = 0 then
        Hbest ← Hbest ∪ Hcurr,best
    else
        Hbest ← {}
        Hbest ← Hbest ∪ Hcurr,best
    end if
    if i = 0 then RMSEprev ← ½ RMSEmodel end if
    RMSEcurr ← RMSEmodel
    i ← i + 1
end while
Return RMSEcurr, Hbest
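For concreteness, the following is a minimal Python sketch of the heuristic under the setting described above. It is an illustration rather than the authors' implementation: it assumes the `fixed` hyperparameter dictionary and the preprocessed X_train, y_train, X_test, y_test arrays introduced earlier, and it simplifies a few bookkeeping details of Algorithm 1.

```python
from itertools import product
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

def fit_best(combos):
    """Grid-search the given hidden-layer combinations and return the best
    combination together with its RMSE on the held-out test set."""
    gs = GridSearchCV(MLPRegressor(**fixed),
                      {"hidden_layer_sizes": sorted(combos)}, cv=5, n_jobs=-1)
    gs.fit(X_train, y_train)
    rmse = mean_squared_error(y_test, gs.predict(X_test)) ** 0.5
    return gs.best_params_["hidden_layer_sizes"], rmse

def heuristic(alpha, h0, n_max_input, N, K=1e6):
    """Sketch of Algorithm 1; h0 is the best 'diagonal' combination found by
    the preliminary grid search over L0, e.g. (4, 4, 4)."""
    fitted, h_best = set(), None
    rmse_prev, rmse_curr = K, K ** 2     # arbitrary large values (K, K²)
    h_set = {h0}
    i = 0
    while (abs((rmse_curr - rmse_prev) / rmse_prev) >= alpha
           and rmse_curr > rmse_prev):
        rmse_prev = rmse_curr
        if i == 0:
            h_next = {h0}
        else:
            # widen the neuron range around the previous iteration's set
            n_min = max(1, min(min(c) for c in h_set) - i)
            n_max = min(n_max_input, max(max(c) for c in h_set) + 1)
            h_next = set(product(range(n_min, n_max + 1), repeat=N)) - fitted
        if not h_next:                   # nothing new to explore: terminate
            break
        h_best, rmse_model = fit_best(h_next)
        fitted |= h_next
        h_set = h_next
        if i == 0:
            rmse_prev = 0.5 * rmse_model
        rmse_curr = rmse_model
        i += 1
    return h_best, rmse_curr
```

Under these assumptions, a call such as heuristic(0.05, (4, 4, 4), 10, 3) would correspond to one run of Method 2 with α = 0.05 and three hidden layers.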

5 Experiment Results

For each dataset, we illustrate the results of Method 1 (Benchmark) in Figs. 1 and 5, and of Method 2 (Heuristic) for each α side-by-side in Figs. 2, 3 and 4 and Figs. 6, 7 and 8. We then show their overall results in a table for each method: Method 1 (Benchmark) in Tables 1 and 5, and Method 2 (Heuristic) for each α in Tables 2, 3 and 4 and Tables 6, 7 and 8. The time elapsed for Method 2 is the time taken to perform the grid search that produces the initial input plus the time taken by the algorithm to obtain its result from that input. For Method 1, it is the time taken to perform a grid search on all possible combinations of neurons to obtain the best result.

5.1 Boston Dataset

Fig. 1. Method 1 results: (a) Score, (b) Test RMSE, (c) Time Elapsed.

Fig. 2. Method 2 results, α = 0.01: (a) Score, (b) Test RMSE, (c) Time Elapsed.

Fig. 3. Method 2 results, α = 0.05: (a) Score, (b) Test RMSE, (c) Time Elapsed.

Fig. 4. Method 2 results, α = 0.10: (a) Score, (b) Test RMSE, (c) Time Elapsed.

Table 1. Summary of results for method 1

Hidden layers            1      2      3
Median score             0.83   0.85   0.86
Median RMSE              3.76   3.68   3.63
Median time elapsed (s)  2.92   12.63  56.00

Table 2. Summary of results for method 2, α = 0.01

Hidden layers            1      2      3
Median score             0.83   0.83   0.84
Median RMSE              3.65   3.50   3.70
Median time elapsed (s)  2.95   7.19   7.22

Table 3. Summary of results for method 2, α = 0.05

Hidden layers            1      2      3
Median score             0.83   0.84   0.83
Median RMSE              3.50   3.56   3.57
Median time elapsed (s)  2.84   5.07   7.03

Table 4. Summary of results for method 2, α = 0.10

Hidden layers            1      2      3
Median score             0.83   0.83   0.83
Median RMSE              3.62   3.49   3.71
Median time elapsed (s)  3.04   5.32   6.69

5.2 MNIST Dataset

Fig. 5. Method 1 results: (a) Score, (b) Test RMSE, (c) Time Elapsed.

Fig. 6. Method 2 results, α = 0.01: (a) Score, (b) Test RMSE, (c) Time Elapsed.

Fig. 7. Method 2 results, α = 0.05: (a) Score, (b) Test RMSE, (c) Time Elapsed.

Fig. 8. Method 2 results, α = 0.10: (a) Score, (b) Test RMSE, (c) Time Elapsed.

Table 5. Summary of results for method 1

Hidden layers            1        2        3
Median score             0.92     0.93     0.93
Median RMSE              1.08     1.05     1.09
Median time elapsed (s)  1014.20  9498.86  36336.83

Table 6. Summary of results for method 2, α = 0.01

Hidden layers            1        2        3
Median score             0.92     0.92     0.93
Median RMSE              1.08     1.08     1.09
Median time elapsed (s)  1011.10  2763.15  2556.78

Table 7. Summary of results for method 2, α = 0.05

Hidden layers            1        2        3
Median score             0.92     0.92     0.93
Median RMSE              1.08     1.08     1.09
Median time elapsed (s)  1019.95  2764.92  2539.61

Table 8. Summary of results for method 2, α = 0.10

Hidden layers            1        2        3
Median score             0.92     0.92     0.93
Median RMSE              1.08     1.08     1.09
Median time elapsed (s)  1016.91  2786.68  2553.89

6 Conclusion and Future Work

The main takeaway from the results obtained is the significant reduction in median time taken to run a test case, with an almost equally high median score and equally low RMSE compared to the benchmark case, when the heuristic is used in Method 2 for larger numbers of hidden layers. For example, a quick comparison of the results in Tables 5 and 6 shows that for the same number of hidden layers (2 resp. 3), there is a significant reduction in the median run-time of the search with Method 2 as compared with Method 1. Moreover, it is also worthwhile to note that there is actually an improvement of the median score and a reduction in median run-time as the number of hidden layers increased from 2 to 3 in Method 2 for the MNIST dataset. To the best of our knowledge, such a heuristic with the result of a grid search as input has not been properly documented and experimented with, though it is highly possible that it has been formulated and implemented by others given its simple yet seemingly naive nature. The heuristic can be generalized and applied to other hyperparameters in a similar fashion, and other models may be used as well. We use the MLPRegressor and MLPClassifier models in Scikit-Learn as we find that they illustrate the underlying idea of the algorithm most clearly. Due to time constraints we were not able to run experiments for other models and values of α, but we strongly encourage others to explore other models and variants of the heuristic.

References
1. Claesen, M., De Moor, B.: Hyperparameter search in machine learning. CoRR, abs/1502.02127 (2015)
2. Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., Hutter, F.: Efficient and robust automated machine learning. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems 28, pp. 2962–2970. Curran Associates, Inc. (2015)
3. Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13(1), 281–305 (2012)
4. Claesen, M., Simm, J., Popovic, D., Moreau, Y., De Moor, B.: Easy hyperparameter search using Optunity. CoRR, abs/1412.1114 (2014)
5. da Silva, C.G.: Time series forecasting with a non-linear model and the scatter search meta-heuristic. Inf. Sci. 178(16), 3288–3299 (2008). Including Special Issue: Recent advances in granular computing
6. Jordanov, I., Georgieva, A.: Neural network learning with global heuristic search. IEEE Trans. Neural Netw. 18, 937–942 (2007)
7. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
8. Köster, J., Rahmann, S.: Snakemake – a scalable bioinformatics workflow engine. Bioinformatics 28(19), 2520–2522 (2012)
9. LeCun, Y., Cortes, C.: MNIST handwritten digit database (2010)
10. Fletcher, R.: Practical Methods of Optimization, 2nd edn. Wiley, New York (1987)

Personalized Recommender Systems with Multi-source Data Yili Wang1,2 , Tong Wu1,2 , Fei Ma1,2(B) , and Shengxin Zhu1,2(B) 1

Department of Mathematical Sciences, Xi’an Jiaotong-Liverpool University, Suzhou 215123, Jiangsu, People’s Republic of China {Yili.Wang16,Tong.Wu1602}@student.xjtlu.edu.cn, {Fei.Ma,Shengxin.Zhu}@xjtlu.edu.cn 2 Laboratory for Intelligent Computing and Finance Technology, Xi’an Jiaotong-Liverpool University, Suzhou 215123, Jiangsu, People’s Republic of China

Abstract. Pervasive applications of personalized recommendation models aim to seek a targeted advertising strategy for business development and to provide customers with personalized suggestions for products or services based on their personal experience. Conventional approaches to recommender systems, such as Collaborative Filtering (CF), use direct user ratings without considering latent features. To overcome this limitation, we develop a recommendation strategy based on so-called heterogeneous information networks. This method can combine two or more source datasets and thus reveal more latent associations/features between items. Compared with the well-known 'k Nearest Neighborhood' model and the 'Singular Value Decomposition' approach, the new method produces substantially higher accuracy under the commonly used measurement, the mean absolute deviation.

Keywords: Recommender systems · Heterogeneous information networks · Singular value decomposition · Collaborative filtering · Similarity

1 Introduction

Recommender systems (RS) aim to find a customer-oriented advertising strategy in business operations, based on customers' own experiences and traits. By digging out the patterns of users' interests and the features of products or services, highly personalized RS match people with the most appropriate products that satisfy their special needs and tastes. Generally, such RS utilize either of the following two strategies. One is known as the content-based approach, which uses profiles of each user and item to match a user's preference with associated items. However, only a few of these profiles can be obtained directly through online histories, such as users' historical records of movie ratings; most of the information might be missing or unavailable for public collection. The alternative approach is called Collaborative Filtering (CF), which depends only on existing users' behaviors, without explicit profiles [1].

CF measures correlations between known users' online activities and associated items, and then predicts the missing ones. Two common models used for CF are neighborhood models and latent factor models. Neighborhood models, also known as 'k Nearest Neighborhood' (kNN), focus on figuring out similarity among users or items [2]. The user-based method evaluates a user's preference for items based on neighboring users who tend to have similar interests in the same items. Similarly, the item-based method assesses a user's appetite for items by discovering neighboring items liked by the same user. Latent factor models are another promising CF approach that reveals latent features from observed data, trying to explain the data by factorizing it into lower-dimensional factors [3].

The above-mentioned approaches for CF normally depend on structured rating data, while a large amount of information in texts cannot be utilized. For example, movie-user profiles describe movies' genres, cast lists and movie reviews, apart from the user-movie rating matrix. Recently, we have considered how to combine heterogeneous information from one database for recommendation with linear mixed models [4-11]. In this paper, we design a recommendation method which can combine multi-source databases. We apply a technique called 'Heterogeneous Information Networks' (HIN) that capitalizes on the advantages of both numerical data and information in texts. This technique improves prediction accuracy by connecting various types of data that reflect users' preferences [12]. Because more interconnected data are mined and applied, it becomes easier to dig out users' true needs, which in turn gives more opportunities for business development.

The rest of the paper is organized as follows. We begin with some preliminaries and related works in Sect. 2. Then, in Sect. 3, we describe a new, more accurate recommendation model based on heterogeneous information networks. This model is subject to different choices of meta-path in movie-user profiles, and therefore defines a new similarity metric. In Sect. 4, we apply this new model to movie recommendations and compare its experimental results with the methods mentioned in Sect. 2. Finally, we give the conclusion and some discussion of future work.

2 Related Works

A brief review of two popular algorithms for collaborative filtering is given below. The general rationale behind them is to estimate users' unknown preferences only through their existing behaviors. Such behaviors build up connections between users and items, which can be items' ratings given by users. We reserve special indexing letters to differentiate users from items: u, v for users, and i, j for items. Structured rating data are given in an m x n matrix R = {r_ui}, 1 <= u <= m, 1 <= i <= n, with m users and n items. r_ui denotes the degree of preference of user u with respect to item i. In the case of movie recommendations, r_ui can take values from 0.5 to 5, indicating user u's interest in movie i from weak to strong, correspondingly. We use the notation r̂_ui to distinguish predicted ratings from known ones, and R̂ = {r̂_ui} denotes the predicted rating matrix.


The observed pairs (u, i) are stored in the set O = {(u, i) | r_ui is known}. Moreover, regularization is used to neutralize overfitting for the sparse rating matrix, and the associated regularization parameters are denoted by λ1 and λ2. These constants are computed through cross-validation algorithms [13].

2.1 Baseline Estimates

For a given user u and item i, the baseline estimate [13] is denoted by b_ui. It is defined as

b_{ui} = \mu + b_u + b_i,   (1)

where μ denotes the overall mean rating, and b_u and b_i are parameters that signify the deviations of user u and item i from the mean, respectively. Baseline estimates are needed in order to combat large user and item effects. These global effects indicate systematic tendencies for some users or items. For example, user X tends to rate action movies with higher ratings than other movie genres, and some popular movies are more likely to receive higher ratings than others. On account of these effects, we adjust the data by means of baseline estimates. Figuring out the parameters b_u and b_i is equivalent to solving the optimization problem [14]:

\min_{b_u, b_i} \sum_{(u,i) \in O} (r_{ui} - \mu - b_u - b_i)^2 + \lambda_1 \left( \sum_u b_u^2 + \sum_i b_i^2 \right).   (2)

The first term, \sum_{(u,i) \in O} (r_{ui} - \mu - b_u - b_i)^2, aims to find the b_u and b_i that fit the rating data best. The second term, \lambda_1 \left( \sum_u b_u^2 + \sum_i b_i^2 \right), controls the complexity of the model by regularization. This allows the model to be rich where there is sufficient data, and to shrink aggressively where data are scarce.
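As a rough illustration of how the biases in Eq. (2) can be fitted, the following sketch alternates closed-form coordinate updates for b_u and b_i. The toy rating triples, the value of λ1 and the number of sweeps are illustrative assumptions, not the paper's setup.

```python
# Hedged sketch: coordinate-descent solver for the baseline model of Eq. (2).
import numpy as np

ratings = [  # (user, item, rating) triples, i.e. the observed set O (toy data)
    (0, 0, 4.0), (0, 1, 3.5), (1, 0, 2.0), (1, 2, 5.0), (2, 1, 3.0), (2, 2, 4.5),
]
n_users, n_items, lam1 = 3, 3, 10.0

mu = np.mean([r for _, _, r in ratings])          # overall mean rating
b_u = np.zeros(n_users)
b_i = np.zeros(n_items)

for _ in range(20):                               # alternate until (approximate) convergence
    for u in range(n_users):                      # update user biases with item biases fixed
        res = [r - mu - b_i[i] for uu, i, r in ratings if uu == u]
        b_u[u] = sum(res) / (lam1 + len(res)) if res else 0.0
    for i in range(n_items):                      # update item biases with user biases fixed
        res = [r - mu - b_u[u] for u, ii, r in ratings if ii == i]
        b_i[i] = sum(res) / (lam1 + len(res)) if res else 0.0

print(mu + b_u[0] + b_i[2])                       # baseline estimate b_ui for user 0, item 2
```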

2.2 k Nearest Neighborhood (kNN)

Neighborhood models, also known as 'k Nearest Neighborhood', are the most common collaborative filtering approach. In order to predict unobserved relationships between users and items, they identify items or users that can be regarded as similar. These similar users or items are the 'neighbors' of the given user or item. There are two methods for the kNN approach. The earlier one is the user-based method, where r̂_ui is calculated as a weighted average of neighboring users' ratings [1,15]:

\hat{r}_{ui} = \frac{\sum_{v \in N^k(u;i)} s_{uv} r_{vi}}{\sum_{v \in N^k(u;i)} s_{uv}}.   (3)

The neighborhood set N^k(u;i) represents the top k neighbors of user u given item i. This neighborhood is comprised of the top k users with the highest similarity to user u who have rated item i. Analogously, the item-based method is induced by taking the weighted average of neighboring items' ratings [1,2,16]:

\hat{r}_{ui} = \frac{\sum_{j \in N^k(i;u)} s_{ij} r_{uj}}{\sum_{j \in N^k(i;u)} s_{ij}},   (4)

where N^k(i;u) is the set of items rated by user u that tend to be rated similarly to item i. The user-based and item-based methods are dual, which means that the performance of the two methods should be similar. However, the item-based method outperforms the user-based method in general and allows greater efficiency [2]. This is because the number of items is significantly smaller than the number of users and, therefore, the pre-computation of item-item similarities takes less time. In addition, items are 'simpler' than users since they belong to fewer genres, while users' tastes vary. The varied preferences of each user also make similarities between users less meaningful. Hence, the item-based method has become more popular.

It is common to normalize the data before implementing kNN. The reason for normalization is to bring the data closer together to ensure a better mixing. kNN with means is a usual way to achieve this normalization. Moreover, the regularization term should also be considered. The adjusted kNN then takes the form [3,17]:

\hat{r}_{ui} = \mu_u + \frac{\sum_{v \in N^k(u;i)} s_{uv} \cdot (r_{vi} - \mu_v)}{\sum_{v \in N^k(u;i)} s_{uv}}   (5)

or

\hat{r}_{ui} = \mu_i + \frac{\sum_{j \in N^k(i;u)} s_{ij} \cdot (r_{uj} - \mu_j)}{\sum_{j \in N^k(i;u)} s_{ij}}.   (6)

Equation (5) corresponds to the user-based method; μ_u and μ_v represent the average ratings of all the items rated by user u and user v, respectively. Similarly, Eq. (6) can be induced for the item-based method, where μ_i and μ_j are the means of all the ratings with respect to item i and item j.

The similarity metric, denoted by s_uv or s_ij, is the core of kNN. Its role is to select appropriate neighbors - similar users or items. Three classical and frequently used similarity metrics for the user-based method are introduced below. The corresponding metrics for the item-based method can be inferred in an analogous way.

A) Cosine Similarity. The cosine similarity is computed by finding the cosine of the angle between vectors u and v, where these are two vectors indicating user profiles [18]. It is defined as

s_{uv} = \cos(u, v) = \frac{u \cdot v}{\|u\|_2 \|v\|_2} = \frac{\sum_{i \in I_{uv}} r_{ui} r_{vi}}{\sqrt{\sum_{i \in I_{uv}} r_{ui}^2} \sqrt{\sum_{i \in I_{uv}} r_{vi}^2}},   (7)

where I_uv denotes the set of items that are co-rated by user u and user v. This metric measures the similarity of two users (the angle between u and v) on a normalized space. As the angle gets smaller, s_uv gets closer to 1, which means that user u and user v tend to be more similar. The cosine similarity metric is easy to calculate, although the magnitude of the vectors is not taken into consideration [19]. For example, a pair of ratings (2,2) from user u has the same cosine value as a pair of ratings (5,5) from user v. However, we cannot conclude that these two users have the same preference for such two items.

B) Mean Squared Difference (MSD). The Mean Squared Difference (MSD) similarity metric is defined as [13]:

s_{uv} = \frac{1}{\mathrm{MSD}(u, v) + 1},   (8)

where

\mathrm{MSD}(u, v) = \frac{1}{|I_{uv}|} \cdot \sum_{i \in I_{uv}} (r_{ui} - r_{vi})^2   (9)

is the mean squared difference. MSD is a basic numerical metric based on the geometrical principles of the Euclidean distance [14,20,21]. The +1 term is there only to avoid division by zero. This metric is able to achieve a good prediction accuracy, while it is less powerful in coverage [18].

C) Pearson Correlation Coefficient. The Pearson correlation coefficient [22] measures the linear relationship of two variables, and can be seen as a cosine similarity normalized by means. It is defined as

s_{uv} = \frac{\sum_{i \in I_{uv}} (r_{ui} - \mu_u)(r_{vi} - \mu_v)}{\sqrt{\sum_{i \in I_{uv}} (r_{ui} - \mu_u)^2} \sqrt{\sum_{i \in I_{uv}} (r_{vi} - \mu_v)^2}}.   (10)

It can be used as a similarity metric since the linear relationship indicates the degree of correlation between the two users [23]. In our use, it evaluates to what degree user u and user v will have similar preferences for movies.
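The similarity metrics of Eqs. (7)-(10) and the mean-centred prediction of Eq. (5) can be sketched as follows. The small rating matrix, the choice of k and the use of Pearson weights are illustrative assumptions rather than the authors' implementation.

```python
# Hedged sketch: user-user similarities and a user-based prediction on a toy matrix.
import numpy as np

R = np.array([[4.0, 3.0, 0.0, 5.0],     # rows: users, columns: items; 0 marks "unrated"
              [5.0, 4.0, 4.0, 4.0],
              [1.0, 2.0, 2.0, 1.0]])

def co_rated(u, v):
    """Indices of items rated by both users (the set I_uv)."""
    return np.where((R[u] > 0) & (R[v] > 0))[0]

def cosine(u, v):                                        # Eq. (7)
    idx = co_rated(u, v)
    a, b = R[u, idx], R[v, idx]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def msd_sim(u, v):                                       # Eqs. (8)-(9)
    idx = co_rated(u, v)
    return 1.0 / (np.mean((R[u, idx] - R[v, idx]) ** 2) + 1.0)

def pearson(u, v):                                       # Eq. (10), centring by each user's mean
    idx = co_rated(u, v)
    a = R[u, idx] - R[u, R[u] > 0].mean()
    b = R[v, idx] - R[v, R[v] > 0].mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_user_based(u, i, k=2):
    """Mean-centred prediction of Eq. (5), with Pearson similarities as weights."""
    neigh = sorted(((pearson(u, v), v) for v in range(R.shape[0])
                    if v != u and R[v, i] > 0), reverse=True)[:k]
    mu_u = R[u, R[u] > 0].mean()
    num = sum(s * (R[v, i] - R[v, R[v] > 0].mean()) for s, v in neigh)
    den = sum(s for s, _ in neigh)                       # denominator exactly as in Eq. (5)
    return mu_u + num / den if den else mu_u

print(cosine(0, 1), msd_sim(0, 1), pearson(0, 1))
print(predict_user_based(0, 2))                          # r_hat for user 0 on item 2
```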

2.3 Latent Factor Models

We focus on the latent factor model yielded by Singular Value Decomposition (SVD) of the rating matrix. SVD is a matrix factorization model which maps both users and items to a particular latent factor space [24]. Mathematically, it gives the best lower-rank approximation to the original user-item rating matrix. The SVD algorithm primarily aims to find two matrices P_{k x n} and Q_{k x m} such that [3]

\hat{R} = Q^T P,   (11)

where k is the number of latent factors we want to dig out, typically chosen between 20 and 100. Each element in P and Q indicates the weight (or relevance) of each factor with respect to a certain user or item. With the global effects also taken into account, the prediction r̂_ui is defined as [25,26]

\hat{r}_{ui} = \mu + b_u + b_i + q_i^T p_u.   (12)

The complexity of SVD grows dramatically as the amount of data increases, since the missing values are filled in after decomposition. This may also result in less accurate predictions. Therefore, it is suggested to apply SVD only on the observed data [14]. To figure out the unknown q_i, p_u, b_u and b_i, we minimize the following regularized squared error [27]:

\sum_{r_{ui} \in O} (r_{ui} - \hat{r}_{ui})^2 + \lambda_2 \left( b_i^2 + b_u^2 + \|q_i\|^2 + \|p_u\|^2 \right).   (13)

Several techniques, such as stochastic gradient descent, have been used; the detailed ideas are introduced in the articles [28] and [29]. Through dimensionality reduction, such a latent factor model discovers hidden correlations, removes redundant and noisy features, and accelerates the processing of data.
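A minimal sketch of the stochastic-gradient-descent loop commonly used for the objective of Eq. (13) is given below. The toy triples, the number of latent factors, the learning rate and λ2 are assumed values and are not taken from the paper.

```python
# Hedged sketch: SGD for the biased matrix-factorization model of Eqs. (12)-(13).
import numpy as np

rng = np.random.default_rng(0)
ratings = [(0, 0, 4.0), (0, 1, 3.5), (1, 0, 2.0), (1, 2, 5.0), (2, 1, 3.0), (2, 2, 4.5)]
n_users, n_items, k = 3, 3, 2
lr, lam2, epochs = 0.01, 0.05, 200

mu = np.mean([r for _, _, r in ratings])
b_u, b_i = np.zeros(n_users), np.zeros(n_items)
P = 0.1 * rng.standard_normal((n_users, k))        # user factors p_u
Q = 0.1 * rng.standard_normal((n_items, k))        # item factors q_i

for _ in range(epochs):
    for u, i, r in ratings:
        pred = mu + b_u[u] + b_i[i] + Q[i] @ P[u]  # Eq. (12)
        e = r - pred
        b_u[u] += lr * (e - lam2 * b_u[u])
        b_i[i] += lr * (e - lam2 * b_i[i])
        # simultaneous update of the factor vectors (right-hand sides use the old values)
        P[u], Q[i] = (P[u] + lr * (e * Q[i] - lam2 * P[u]),
                      Q[i] + lr * (e * P[u] - lam2 * Q[i]))

print(mu + b_u[0] + b_i[2] + Q[2] @ P[0])          # predicted rating for user 0, item 2
```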

3 Methodology

3.1 Heterogeneous Information Networks (HIN) and Network Schema

G_T = (E, L) denotes the network schema shown in Fig. 1, representing all the entity types and relation types in this heterogeneous information network, where E is the set of entity types and L is the set of relation types. In G_T = (E, L), entity types are framed with rounded rectangles, while relation types are denoted by the arrows between entity types.

3.2 Meta-Path

Generally, meta-paths are considered as types of paths in the network schema, of the form P = E_0 \xrightarrow{L_1} E_1 \xrightarrow{L_2} \cdots \xrightarrow{L_k} E_k, where E_0, E_1, ..., E_k are elements of E and L_1, L_2, ..., L_k are elements of L [30]. Three symmetric meta-paths, whose beginning and ending entity types are the same, can be derived from G_T = (E, L):

P_1: movie - actor - movie
P_2: movie - director - movie   (14)
P_3: movie - genre - movie


Fig. 1. Network schema of heterogeneous information networks in this paper

3.3 Adjacency Matrix and Commuting Matrix

W_{E_i E_j} denotes the adjacency matrix of two entity types E_i and E_j, whose elements represent the relationship between E_i and E_j. For example, the elements can be binary in the movie-actor adjacency matrix: a '1' at position (i, j) represents that the j-th actor has participated in the i-th movie, and '0' means no participation (Table 1 is an example of a movie-actor adjacency matrix). In another case, the user-movie adjacency matrix, elements can be numeric, representing the rating on a movie from a user, where '0' means that the corresponding user did not rate the corresponding movie (Table 2 is an example of a user-movie adjacency matrix).

Table 1. A possible movie-actor adjacency matrix

          Actor a  Actor b  Actor c  Actor d  Actor e
Movie a   0        1        1        0        0
Movie b   1        1        0        0        1
Movie c   1        0        1        0        1
Movie d   1        1        0        0        0

Based on the assumption above, the formula for the commuting matrix C of a meta-path P = (E_1 E_2 ... E_k) is C = W_{E_1 E_2} W_{E_2 E_3} \cdots W_{E_{k-1} E_k}. If all of the elements in the adjacency matrices W_{E_i E_j} are binary, C(i, j), denoted as C_ij, is the number of path instances between two entities x_i in E_1 and y_j in E_k. For example, under the movie - actor - movie (P_1) path, C_ij represents the number of actors that movies i and j have in common.

Table 2. A possible user-movie adjacency matrix

          User a  User b  User c  User d  User e
Movie a   2       3       2.5     0       0
Movie b   1       4.5     0       0       0
Movie c   0       2       3       3.5     0
Movie d   0       3       5       0       0

3.4 PathSim Similarity

The PathSim similarity of entities x and y under a symmetric meta-path P is defined as

s(x, y) = \frac{2 \times |\{p_{x \sim y} : p_{x \sim y} \in P\}|}{|\{p_{x \sim x} : p_{x \sim x} \in P\}| + |\{p_{y \sim y} : p_{y \sim y} \in P\}|},   (15)

where p_{x \sim y} is a path instance between x and y, and |\{p_{x \sim y} : p_{x \sim y} \in P\}| is the number of path instances between x and y under P [31]. Based on the commuting matrix, it can be further deduced as

sim(x_i, x_j) = \frac{2 C_{ij}}{C_{ii} + C_{jj}},   (16)

where x_i and x_j belong to E_1 and follow the symmetric meta-path P = (E_1 E_2 ... E_2 E_1).
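The commuting matrix and the PathSim score of Eq. (16) can be computed directly from the binary adjacency matrix of Table 1. The sketch below assumes the movie-actor-movie meta-path P_1 and is only meant to illustrate the calculation.

```python
# Hedged sketch: commuting matrix and PathSim under the movie-actor-movie path.
import numpy as np

W_ma = np.array([[0, 1, 1, 0, 0],   # Table 1: rows are movies a-d, columns are actors a-e
                 [1, 1, 0, 0, 1],
                 [1, 0, 1, 0, 1],
                 [1, 1, 0, 0, 0]])

C = W_ma @ W_ma.T                   # commuting matrix: C[i, j] = number of shared actors

def pathsim(C):
    """Pairwise PathSim of Eq. (16): 2*C_ij / (C_ii + C_jj)."""
    d = np.diag(C).astype(float)
    return 2.0 * C / (d[:, None] + d[None, :])

S = pathsim(C)
print(np.round(S, 3))               # movie-movie similarity matrix under P1
```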

3.5 Calculation of Users' Predicted Scores

The three selected meta-paths are extended to

P_1: user - movie - actor - movie
P_2: user - movie - director - movie   (17)
P_3: user - movie - genre - movie

in order to figure out users' predicted ratings. The set of users is denoted as U = {u_1, u_2, ..., u_c} and that of movies is defined as I = {m_1, m_2, ..., m_n}. Then the predicted score of user u_i on movie m_j under a meta-path P = (E_1 E_2 ... E_k) can be calculated as

s(u_i, m_j | P) = \frac{\sum_{m \in I} R_{u_i, m} \, sim(m, m_j | P')}{\sum_{m \in I} R'_{u_i, m} \, sim(m, m_j | P')},   (18)

where R_{u_i, m} denotes the score of user u_i on movie m, R'_{u_i, m} is the binary representation of the score of user u_i on movie m (with value 1 if user u_i rated movie m and 0 if not), and P' = (E_2 ... E_k) is a sub-path of P. The formula can be further simplified as

s(u_i, m_j | P) = \frac{\sum_{k=1}^{n} R_{u_i, m_k} \cdot \frac{2 C_{jk}}{C_{jj} + C_{kk}}}{\sum_{k=1}^{n} R'_{u_i, m_k} \cdot \frac{2 C_{jk}}{C_{jj} + C_{kk}}} = \frac{\sum_{k=1}^{n} R_{u_i, m_k} \times X_{kj}}{\sum_{k=1}^{n} R'_{u_i, m_k} \times X_{kj}},   (19)

where

X_{jk} = X_{kj} = \frac{2 C_{jk}}{C_{jj} + C_{kk}}   (20)

is a symmetric matrix, owing to the symmetric commuting matrix C that follows the symmetric meta-path P' in our model. Finally, the predicted rating matrix of c users and n movies under each of the three meta-paths can be generated as

\hat{R}^{(t)} = \begin{bmatrix} s(u_1, m_1 | P_t) & \cdots & s(u_1, m_n | P_t) \\ \vdots & \ddots & \vdots \\ s(u_c, m_1 | P_t) & \cdots & s(u_c, m_n | P_t) \end{bmatrix}, \quad t = 1, 2, 3,   (21)

where \hat{R}^{(t)} \in R^{c \times n}. The three paths could be given different weights \omega_t, and the idealized predicted rating matrix is

\hat{R} = \sum_{t=1}^{3} \omega_t \hat{R}^{(t)}.   (22)

Nevertheless, there is insufficient evidence to determine how much influence each path exerts on the prediction process, which results in the indetermination of the precise values of the weights. Also, if detailed user information could be acquired, personalised weights could be worked out based on each user's profile individually. However, the scarcity of available data limited the possibility of selecting proper values of the weights. Therefore, in our experiments we stopped at the previous step and only calculated the three predicted rating matrices (\hat{R}^{(t)}) under the three selected meta-paths.
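Putting Eqs. (19)-(21) together, a sketch of the prediction step under the single meta-path P_1 is shown below. It reuses the toy matrices of Tables 1 and 2 and treats unrated entries as zeros; this is an illustrative simplification, not the authors' exact pipeline.

```python
# Hedged sketch: predicted scores s(u_i, m_j | P1) of Eq. (19) from Tables 1 and 2.
import numpy as np

W_ma = np.array([[0, 1, 1, 0, 0],          # Table 1: movies x actors
                 [1, 1, 0, 0, 1],
                 [1, 0, 1, 0, 1],
                 [1, 1, 0, 0, 0]])
R_um = np.array([[2.0, 1.0, 0.0, 0.0],     # Table 2 transposed: users x movies (0 = unrated)
                 [3.0, 4.5, 2.0, 3.0],
                 [2.5, 0.0, 3.0, 5.0],
                 [0.0, 0.0, 3.5, 0.0],
                 [0.0, 0.0, 0.0, 0.0]])

C = W_ma @ W_ma.T                          # commuting matrix under movie-actor-movie
d = np.diag(C).astype(float)
X = 2.0 * C / (d[:, None] + d[None, :])    # symmetric PathSim matrix, Eq. (20)

R_bin = (R_um > 0).astype(float)           # binary indicator R'
num = R_um @ X                             # numerators of Eq. (19) for all (user, movie) pairs
den = R_bin @ X
with np.errstate(invalid="ignore", divide="ignore"):
    S_hat = np.where(den > 0, num / den, 0.0)   # predicted rating matrix, Eq. (21) for t = 1
print(np.round(S_hat, 2))
```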

4 Experimental Study

4.1 Dataset

There are many datasets on GroupLens with different amounts and types of information. The MovieLens dataset [32] called ml-latest-small is used in the experiments in this paper; it contains 610 users, 9,742 movies and 100,836 ratings ranging from 0.5 to 5 given by users to movies. Links and genres of movies are also included in this dataset, where Links contains movie IDs on MovieLens and their corresponding movie IDs on IMDb. We need the director and actor information of the movies in ml-latest-small from the IMDb website [33]. Then, the director, actor and genre information of most of the movies in ml-latest-small can be acquired from MovieLens and IMDb via the file Links. The process is briefly shown in Fig. 2.

Fig. 2. Combination of movie information from MovieLens and IMDb

Nonetheless, there are a few movies in ml-latest-small whose information could not be found on IMDb. Therefore, we adjusted the datasets from the two sources and retained the movies which could be found both in ml-latest-small and on IMDb (the schematic map is shown in Fig. 3).

Fig. 3. Venn diagram of overlapped datasets

Finally, the adjusted dataset used in the experiments contains 610 users, 8,860 movies, 96,380 ratings, 19 genres, 12,473 actors and 4,035 directors (only the single most important genre of each movie is retained).

4.2 Experiments on the Heterogeneous Information Networks Model

In the experiments on the Heterogeneous Information Networks model, we focus on the accuracy of the model on the existing ratings after the series of prediction steps, measured by the Mean Absolute Error (MAE) described below:

\mathrm{MAE} = \frac{1}{|R|} \sum_{R_{u,m} \in R} \left| R_{u,m} - \hat{R}_{u,m} \right|,   (23)

where R is the set of all existing ratings in the adjusted dataset and \hat{R}_{u,m} denotes the predicted score of user u on movie m. From the formula above, it can be inferred that MAE is the average of the differences between the existing ratings and the corresponding predicted ratings. Therefore, a low value of MAE indicates that the existing ratings are still retained precisely after the series of prediction steps, which means that the accuracy of the model is high.

Then, we separate the adjusted dataset into a training set and a testing set. To get more comparable results, we set the testing set to 5%, 10%, 15% and 20% of the whole adjusted dataset, with the remaining part considered as the training set. For instance, if 20% of the adjusted dataset is taken as the testing set, then the remaining 80% is the training set in that single experiment. The outcome of these four experiments is the line chart shown in Fig. 4.
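A minimal sketch of the evaluation protocol (random split plus the MAE of Eq. (23)) is given below. The toy triples, the placeholder predictor and the 20% split are assumptions for illustration only.

```python
# Hedged sketch: hold-out split and MAE computation as in Eq. (23).
import random

ratings = [(0, 0, 4.0), (0, 1, 3.5), (1, 0, 2.0), (1, 2, 5.0), (2, 1, 3.0), (2, 2, 4.5)]
random.seed(0)
random.shuffle(ratings)
test_share = 0.2
split = int(len(ratings) * (1 - test_share))
train, test = ratings[:split], ratings[split:]

def predict(u, m):
    """Placeholder predictor; in the paper this would be the HIN, SVD or kNN model fit on `train`."""
    return sum(r for _, _, r in train) / len(train)

mae = sum(abs(r - predict(u, m)) for u, m, r in test) / len(test)
print(round(mae, 3))
```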

Fig. 4. MAE of HIN under three paths P1 (Actor), P2 (Director), P3 (Genre)

The comparison of HIN, SVD and kNN in accuracy by MAE is recorded in Table 3.

Table 3. MAE under the HIN, SVD and kNN models

Test set size  HIN                         SVD      kNN
               director  actor   genre              MSD     Cosine  Pearson
5%             0.1853    0.2712  0.5102    0.4009   0.6583  0.6647  0.6773
10%            0.1839    0.2645  0.5480    0.4036   0.6684  0.6661  0.6899
15%            0.1903    0.2637  0.6327    0.4152   0.6702  0.6817  0.6889
20%            0.1934    0.2709  0.6403    0.4074   0.6788  0.6852  0.6966

The number of factors chosen in the SVD model is 25; the value of MAE tended to be stable when more factors were added. In the kNN model, the top 70 neighbors are selected, giving a relatively lower MAE compared with other values of k.

4.3 Evaluation

As the results above show, in the HIN method, the predicted rating matrices following the movie-actor (P_1) and movie-director (P_2) paths are much more accurate than the one under the movie-genre (P_3) path. This is probably because the number of genre types is much smaller than the numbers of actors and directors. When compared with the kNN and SVD models, the accuracy of HIN under the P_1 and P_2 paths is much higher, although the performance of HIN following P_3 is barely satisfactory. Overall, given relatively rich entity types, HIN improves the prediction accuracy significantly compared with kNN and SVD, and its accuracy is quite stable under datasets of different sizes.

4.4 Limitations

In order to improve the performance of HIN following P_3, more genres per movie, with priorities, should be taken into consideration. For instance, for the movie 'The Godfather', the first genre label is 'Crime' and the second is 'Drama'; the genre 'Crime' should then account for a higher weight than the genre 'Drama' when we look for movies similar to 'The Godfather'. Moreover, the total set of genre types should be refined and extended. There are only 19 genres in total in our experimental dataset. Hence, given a certain movie, it is possible to find some neighboring ones that show little similarity in content, considering the coarse classification and limited genre types. Therefore, each movie in the dataset should be allocated more refined genre labels. Taking 'The Godfather' as an example again, besides 'Crime' and 'Drama', 'trickery' could be considered as a third genre label while 'family' might be a fourth. In this way, the movies similar to 'The Godfather' found by the similarity calculation will be fewer and more accurate, because fewer movies satisfy all four genres than satisfy only two general genres. Finally, as mentioned for formula (22), the indetermination of the weights for the three paths is also one of the limitations of our experiments. If each user's profile were accessible, we would still need to find a way to work with these data and apply them to acquire personalised weights. All the limitations mentioned above are left for future work.

5 Conclusion

In this paper, we develop a recommendation model based on multi-source datasets. We have shown that the proposed model enjoys a smaller MAE compared with conventional CF methods on a standard benchmark problem. With the use of abundant information in texts, the HIN model mostly achieves a much higher accuracy in rating predictions compared to the results of SVD and of kNN with the user-based method. Further research on SVD and the linear mixed model is also presented in our work [34]. The success of this outperformance depends on the choice of meta-paths, which are used to explain the explicit relationships between users' interests and movies' features.


In the HIN model, the movie-director path and the movie-actor path achieve better accuracy compared to the movie-genre path. This indicates that two movies sharing the same genre is only weak evidence that they are similar; the movie-genre path could be more powerful if more genres were considered for each movie. Moreover, users may have different appetites for items depending on their own attributes, such as age and occupation. If users' demographic information were incorporated into the meta-paths, HIN should be able to achieve even better accuracy with precisely personalized weights for each meta-path.

For future work, we plan to (1) further investigate methods to allocate different weights to each meta-path proposed in our model; (2) introduce more attributes to refine the limited types of genres; and (3) if possible, utilize explicit users' profiles to better explore users' potential interests and improve the performance. If more data are available, the HIN model can be a promising tool for personalized recommender systems with a high degree of accuracy.

Acknowledgment. The research is supported by Laboratory of Computational Physics (6142A05180501), Jiangsu Science and Technology Basic Research Programme (BK20171237), Key Program Special Fund in XJTLU (KSF-E-21, KSF-P-02), Research Development Fund of XJTLU (RDF-2017-02-23), and partially supported by NSFC (No. 11571002, 11571047, 11671049, 11671051, 61672003, 11871339), XJTLU SURF project (No. 2019-056).

References

1. Bell, R.M., Koren, Y.: Scalable collaborative filtering with jointly derived neighborhood interpolation weights. In: ICDM, vol. 7, pp. 43-52. Citeseer (2007)
2. Sarwar, B.M., Karypis, G., Konstan, J.A., Riedl, J., et al.: Item-based collaborative filtering recommendation algorithms. WWW 1, 285-295 (2001)
3. Koren, Y.: Factor in the neighbors: scalable and accurate collaborative filtering. ACM Trans. Knowl. Discov. Data (TKDD) 4(1), 1 (2010)
4. Gao, B., Zhan, G., Wang, H., Wang, Y., Zhu, S.: Learning with linear mixed model for group recommendation systems. In: Proceedings of the 2019 11th International Conference on Machine Learning and Computing, pp. 81-85. ACM (2019)
5. Chen, Z., Zhu, S., Niu, Q., Lu, X.: Censorious young: knowledge discovery from high-throughput movie rating data with LME4. In: 2019 IEEE 4th International Conference on Big Data Analytics (ICBDA), pp. 32-36. IEEE (2019)
6. Chen, Z., Zhu, S., Niu, Q., Zuo, T.: Knowledge discovery and recommendation with linear mixed model. IEEE Access 8, 38304-38317 (2020)
7. Zhu, S., Gu, T., Xu, X., Mo, Z.: Information splitting for big data analytics. In: International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, CyberC 2016, Chengdu, China, 13-15 October 2016, pp. 294-302 (2016). https://doi.org/10.1109/CyberC.2016.64
8. Zhu, S.: Fast calculation of restricted maximum likelihood methods for unstructured high-throughput data. In: 2017 IEEE 2nd International Conference on Big Data Analysis (ICBDA), pp. 40-43, March 2017
9. Zhu, S., Wathen, A.J.: Essential formulae for restricted maximum likelihood and its derivatives associated with the linear mixed models (2018)


10. Zhu, S., Wathen, A.J.: Sparse inversion for derivative of log determinant. arXiv preprint arXiv:1911.00685 (2019)
11. Zhu, S., Gu, T., Liu, X.: AIMS: average information matrices splitting. Math. Found. Comput. https://doi.org/10.3934/mfc.2020012
12. Lu, X., Zhu, S., Niu, Q., Chen, Z.: Profile inference from heterogeneous data - fundamentals and new trends. In: Business Information Systems - 22nd International Conference, BIS 2019, Seville, Spain, 26-28 June 2019, Proceedings, Part I, pp. 122-136 (2019). https://doi.org/10.1007/978-3-030-20485-3_10
13. Jahrer, M., Töscher, A., Legenstein, R.: Combining predictions for accurate recommender systems. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 693-702. ACM (2010)
14. Koren, Y.: Factorization meets the neighborhood: a multifaceted collaborative filtering model. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 426-434. ACM (2008)
15. Adeniyi, D.A., Wei, Z., Yongquan, Y.: Automated web usage data mining and recommendation system using k-nearest neighbor (KNN) classification method. Appl. Comput. Inform. 12(1), 90-108 (2016)
16. Park, Y., Park, S., Jung, W., Lee, S.G.: Reversed CF: a fast collaborative filtering algorithm using a k-nearest neighbor graph. Expert Syst. Appl. 42(8), 4022-4028 (2015)
17. Bobadilla, J., Ortega, F., Hernando, A., Gutiérrez, A.: Recommender systems survey. Knowl.-Based Syst. 46, 109-132 (2013)
18. Sanchez, J., Serradilla, F., Martinez, E., Bobadilla, J.: Choice of metrics used in collaborative filtering and their impact on recommender systems. In: 2008 2nd IEEE International Conference on Digital Ecosystems and Technologies, pp. 432-436. IEEE (2008)
19. Singhal, A., et al.: Modern information retrieval: a brief overview. IEEE Data Eng. Bull. 24(4), 35-43 (2001)
20. Miller, B.N., Albert, I., Lam, S.K., Konstan, J.A., Riedl, J.: Movielens unplugged: experiences with an occasionally connected recommender system. In: Proceedings of the 8th International Conference on Intelligent User Interfaces, pp. 263-266. ACM (2003)
21. Shardanand, U., Maes, P.: Social information filtering: algorithms for automating "word of mouth". In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 210-217. Chi (1995)
22. Benesty, J., Chen, J., Huang, Y., Cohen, I.: Pearson correlation coefficient. In: Noise Reduction in Speech Processing, pp. 1-4. Springer, Heidelberg (2009)
23. Ahlgren, P., Jarneving, B., Rousseau, R.: Requirements for a cocitation similarity measure, with special reference to Pearson's correlation coefficient. J. Am. Soc. Inf. Sci. Technol. 54(6), 550-560 (2003)
24. Sarwar, B., Karypis, G., Konstan, J., Riedl, J.: Application of dimensionality reduction in recommender system - a case study. Technical report, Minnesota Univ Minneapolis Dept of Computer Science (2000)
25. Luo, X., Zhou, M., Xia, Y., Zhu, Q.: An efficient non-negative matrix-factorization-based approach to collaborative filtering for recommender systems. IEEE Trans. Ind. Inform. 10(2), 1273-1284 (2014)
26. Mehta, R., Rana, K.: A review on matrix factorization techniques in recommender systems. In: 2017 2nd International Conference on Communication Systems, Computing and IT Applications (CSCITA), pp. 269-274. IEEE (2017)
27. Zheng, S., Ding, C., Nie, F.: Regularized singular value decomposition and application to recommender system. arXiv preprint arXiv:1804.05090 (2018)


28. Shamir, O.: Fast stochastic algorithms for SVD and PCA: convergence properties and convexity. In: International Conference on Machine Learning, pp. 248-256 (2016)
29. Koren, Y., Bell, R., Volinsky, C.: Matrix factorization techniques for recommender systems. Computer 8, 30-37 (2009)
30. Kong, X., Yu, P.S., Ding, Y., Wild, D.J.: Meta path-based collective classification in heterogeneous information networks. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pp. 1567-1571. ACM (2012)
31. Sun, Y., Han, J., Yan, X., Yu, P.S., Wu, T.: Pathsim: meta path-based top-k similarity search in heterogeneous information networks. Proc. VLDB Endow. 4(11), 992-1003 (2011)
32. Harper, F.M., Konstan, J.A.: The movielens datasets: history and context. ACM Trans. Interact. Intell. Syst. (TIIS) 5(4), 19 (2016)
33. Melville, P., Mooney, R.J., Nagarajan, R.: Content-boosted collaborative filtering for improved recommendations. In: AAAI/IAAI, vol. 23, pp. 187-192 (2002)
34. Zuo, T., Zhu, S., Lu, J.: A hybrid recommender system combing singular value decomposition and linear mixed model. In: Arai, K., Bhatia, R., Kapoor, S. (eds.) SAI 2020. AISC, vol. 1228, pp. 347-362. Springer, Cham (2020)

Renormalization Approach to the Task of Determining the Number of Topics in Topic Modeling

Sergei Koltcov and Vera Ignatenko

National Research University Higher School of Economics, 55/2 Sedova Street, St. Petersburg, Russia 192148
{skoltsov,vignatenko}@hse.ru

Abstract. Topic modeling is a widely used approach for clustering text documents; however, it possesses a set of parameters that must be determined by the user, for example, the number of topics. In this paper, we propose a novel approach for fast approximation of the optimal topic number that corresponds well to human judgment. Our method combines renormalization theory and the Renyi entropy approach. The main advantage of this method is its computational speed, which is crucial when dealing with big data. We apply our method to the Latent Dirichlet Allocation model with the Gibbs sampling procedure and test our approach on two datasets in different languages. Numerical results and a comparison of computational speed demonstrate a significant gain in time with respect to standard grid-search methods.

Keywords: Renormalization theory · Optimal number of topics

1 Introduction

Nowadays, one of the widely used instruments for the analysis of large textual collections is probabilistic topic modeling (TM). However, when using topic modeling in practice, the problems of selecting the number of topics and the values of the hyperparameters of the model arise, since these values are not known in advance by practitioners in most applications, for instance, in many tasks of sociological research. Also, the results of TM are significantly influenced by the number of topics, and inappropriate hyperparameters may lead to unstable topics or to topic compositions that do not accurately reflect the topic diversity in the data. The existing methods that deal with this problem are based on grid search. For instance, one can use standard metrics such as log-likelihood [1] or perplexity [2], calculate the values of these metrics for different values of the model parameters, and then choose the parameters which lead to the best values of the considered metrics. Another popular metric is semantic (topic) coherence [3]. A user has to select the number of most probable words in a topic to be used for the topic coherence calculation; topic coherence is then calculated for individual topics.


Let us note that there is no clear criterion for selecting the number of words, and the authors of [3] propose to consider 5-20 terms. Values of individual topic coherence are then aggregated to obtain a single coherence score [4,5]. After that, one can apply a grid search for determining the best values of the model parameters with respect to the coherence score. However, the above methods are extremely time-consuming for big data, which is why optimization of the procedure of topic number selection is important. The computational complexity of the existing grid-search-based methods calls for greedy solutions that can speed up the process without substantial loss of TM quality.

In this work, we propose a significantly faster solution for approximating the optimal number of topics for a given collection. We refer to the number of topics determined by human coders as the 'optimal number' of topics. Our approach is based on renormalization theory and on the entropic approach [7], which, in turn, is based on the search for a minimum of Renyi entropy under variation of the number of topics. Details of the entropic approach are described in Subsect. 2.3. The author of [7] demonstrated that the minimum point of Renyi entropy lies approximately in the region of the number of topics identified by users. This approach also requires a grid search for model optimization, but the search itself is optimized based on previous research and theoretical considerations. In work [8], it was demonstrated that the density-of-states function (to be defined further) inside individual TM solutions with different topic numbers is self-similar in relatively large intervals of the number of topics, and there are multiple such intervals. Based on these facts, and taking into account that big data allow applying methods of statistical physics, we conclude that it is possible to apply renormalization theory for fast approximation of the optimal number of topics for large text collections. This means that the calculation reduction in our approach is based on the mentioned self-similarity.

We test our approach on two datasets, in English and Russian, and demonstrate that it allows us to quickly locate the approximate value of the optimal number of topics. While on the dataset consisting of 8,624 documents our approach takes eight minutes, the standard grid search takes about an hour and a half. Therefore, for huge datasets the gain in time can vary from days to months. We should especially note that renormalization-based methods are suitable for finding approximate values of T only. However, the exact value can be found afterwards by a grid search on a significantly smaller set of topic solutions, which compensates for the approximate character of the renormalization-based search.

Our paper consists of the following sections. Subsection 2.1 describes the basic assumptions of probabilistic topic modeling and the formulation of the task of topic modeling. Subsection 2.2 gives an idea of renormalization theory, which is widely used in physics. Subsection 2.3 reviews the Renyi entropy approach, which was proposed in [7,9]. Subsection 2.4 describes the findings of work [8] which are necessary for the application of renormalization theory to topic modeling. Section 3 describes the main ideas of our approach and its application to the Latent Dirichlet Allocation model (LDA). Section 4 contains numerical experiments on the renormalization of topic models and a comparison of the obtained approximations of the optimal number of topics to the ground truth. Section 5 summarizes our findings.

2 Background

2.1 Basics of Topic Modeling

Topic modeling takes a special place among machine learning methods since this class of models can effectively process huge data sets. In the framework of TM, several assumptions are expected to be met. First, the dataset contains a fixed number of topics. It means that a large matrix of occurrences of words in documents can be represented as a product of two matrices of smaller size which represent the distribution of words by topics and the distribution of topics by documents, correspondingly. Second, documents and words are the only observable variables; hidden distributions are calculated based on these variables. Thus, a document collection can be characterized by three numbers: D, W, T, where D is the number of documents, W is the number of unique words in the dataset, and T is the number of topics, which is usually selected by the user of TM. Third, TM is currently constructed on the basis of the 'bag of words' concept. It means that topic models do not take into account the order of words in documents. Thus, the probability of a word w in a document d can be written in the following form [10,11]: p(w|d) = \sum_t p(w|t) p(t|d) \equiv \sum_t \phi_{wt} \theta_{td}, where {p(w|t) \equiv \phi_{wt}} refers to the distribution of words by topics and {p(t|d) \equiv \theta_{td}} is the distribution of topics by documents. A more detailed description of the TM formalism can be found in [7,9]. In fact, finding the hidden distributions in a large text collection is equivalent to understanding what people write about without reading a huge number of texts, that is, to identifying the topics that are discussed in the collection.
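The factorization assumption can be illustrated with a short sketch that builds p(w|d) from randomly generated Φ and Θ; the matrix sizes and Dirichlet priors here are arbitrary choices for demonstration only.

```python
# Hedged sketch: reconstructing p(w|d) from toy word-topic and topic-document matrices.
import numpy as np

rng = np.random.default_rng(1)
W, T, D = 6, 2, 3
Phi = rng.dirichlet(np.ones(W), size=T).T      # (W x T), each column is p(w|t) and sums to 1
Theta = rng.dirichlet(np.ones(T), size=D).T    # (T x D), each column is p(t|d) and sums to 1

P_wd = Phi @ Theta                             # p(w|d) = sum_t phi_wt * theta_td
print(P_wd.sum(axis=0))                        # every column sums to 1, as expected
```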

2.2 Basics of Renormalization Theory

Renormalization is a mathematical formalism that is widely used in different fields of physics, such as percolation analysis and phase transition analysis. The goal of renormalization is to construct a procedure for changing the scale of the system under which the behavior of the system is preserved. Theoretical foundations of renormalization were laid in works [12,13]. Renormalization was widely used and developed in fractal theory, since fractal behavior possesses the property of self-similarity [14,15].

To start a brief description of renormalization theory, let us consider a lattice consisting of a set of nodes. Each node is characterized by its spin direction, or spin state. In turn, a spin can have one of many possible directions; here, the number of directions is determined by a concrete task or model. For example, in the Ising model, only two possible directions are considered; in the Potts model, the number of directions can be 3-5 [16]. Nodes with the same spin directions constitute clusters. The procedure of scaling, or renormalization, follows the block-merge principle, where several nearest nodes are replaced by one node. The direction of the new spin is determined by the direction of the majority of spins in the block. The block-merge procedure is conducted on the whole lattice and, correspondingly, we obtain a new configuration of spins. The procedure of renormalization can be conducted several times. Following the requirement of equivalence between the new and the previous spin configurations, it is possible to construct a procedure for the calculation of parameters and values of critical exponents, as described in [17]. Let us note that consistent application of renormalization to the initial system leads to approximate results; however, despite this fact, the method is widely used since it allows one to obtain estimates of critical exponents in phase transitions, where standard mathematical models are not suitable.

Renormalization is applicable if scale invariance is observed. Scale invariance is a feature of power-law distributions. Mathematically, self-similarity (or scale invariance) is expressed in the following way. Assume that f(x) = c x^\alpha, where c and \alpha are constants. If we transform x \to \lambda x (which corresponds to a scale transformation), then f(\lambda x) = c (\lambda x)^\alpha := \beta x^\alpha, where \beta = c \lambda^\alpha, i.e., the scale transformation leads to the same original functional dependence but with a different coefficient. In concrete applications, the parameter of the power law, \alpha, can be found by different algorithms, such as 'box counting' or others.

2.3 Entropy-Based Approach

The entropic approach for the analysis of topic models was proposed in [7,9] and is based on a set of principles. A detailed discussion of these principles can be found in [7,9]. In this work, we would like to briefly discuss some important observations related to the entropic approach which are necessary for the formulation of our renormalization procedure. First, a document collection is considered as a statistical system, for which the free energy can be determined. Let us note that the free energy is equivalent to Kullback-Leibler divergence. Further, the free energy (and, correspondingly, Kullback-Leibler divergence) can be expressed in terms of Renyi entropy through the partition function Z_q = \rho (\tilde{P})^q [9], where q = 1/T is a deformation parameter,

\rho = N / (W T)   (1)

is the 'density-of-states' function of the whole topic solution, N is the number of highly probable words with p(w|t) > 1/W, and \tilde{P} = \sum_{w,t} p(w|t) \cdot 1_{\{p(w|t) > 1/W\}}, with 1_{\{\cdot\}} being an indicator function. Thus, the Renyi entropy of a topic solution can be expressed in the following form:

S_q^R = \frac{\ln(Z_q)}{q - 1}.   (2)

We would like to note that the above expression of Renyi entropy is in Beck notation [18].

Since the procedure of TM shifts the information system from a state of high entropy to a state of low entropy, the calculation of the deformed Renyi entropy after TM allows estimating the effect of the model hyperparameters and of the number of topics on the results of TM. It was demonstrated [7,9] that the minimum entropy corresponds to the number of topics which was selected by users in the process of dataset labeling. This allows us to link the procedure of searching for a minimum of deformed entropy with the process of data labeling, which plays a crucial role in machine learning models. However, searching for the minimum Renyi entropy demands an exhaustive search over the set of hyperparameters and numbers of topics, which is a time-consuming process. A partial solution to this problem can be found through the analysis of self-similarity in topic solutions under variation of the number of topics.
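A direct transcription of Eqs. (1)-(2) into code might look as follows. The random word-topic matrix is a stand-in for a real topic solution, so the printed value only illustrates the computation, not the paper's results.

```python
# Hedged sketch: Renyi entropy of a topic solution from a word-topic matrix Phi.
import numpy as np

rng = np.random.default_rng(2)
W, T = 1000, 20
Phi = rng.dirichlet(np.ones(W), size=T).T      # toy p(w|t), columns sum to 1

q = 1.0 / T                                    # deformation parameter
mask = Phi > 1.0 / W                           # highly probable words
N = mask.sum()                                 # number of such word-topic entries
P_tilde = Phi[mask].sum()                      # sum of their probabilities
rho = N / (W * T)                              # density-of-states, Eq. (1)
Z_q = rho * P_tilde ** q                       # partition function
S_renyi = np.log(Z_q) / (q - 1.0)              # Renyi entropy, Eq. (2)
print(S_renyi)
```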

2.4 Self-similar Behaviour in Topic Models

As was shown in [8], topic models have the property of self-similar behavior under variation of the number of topics. This behavior is expressed in the fact that the 'density-of-states' function satisfies \rho(\lambda \cdot 1/T) = \beta (1/T)^\alpha for some \beta and \alpha, and therefore is linear in bi-logarithmic coordinates. However, such behavior is observed only in some ranges of the number of topics. Moreover, the inclination angles of the linear pieces of the 'density-of-states' function are different in different regions, which corresponds to different fractal dimensions. The determination of the inclination angles was implemented according to the following steps: 1) the multidimensional space of words and topics is covered by a grid of fixed size (matrix \Phi = \{\phi_{wt}\}); 2) the number of cells satisfying \phi_{wt} > 1/W is calculated; 3) the value of \rho for the fixed number of topics T is calculated according to Eq. (1); 4) steps 1, 2 and 3 are repeated with the cell size (i.e., the number of topics) being changed; 5) a graph showing the dependence of \rho in bi-logarithmic coordinates is plotted; 6) using the method of least squares, the slope of the curve on this plot is estimated; the value of the slope is equal to the fractal dimension, calculated according to the relation D = \ln(\rho) / \ln(1/T).

In work [8], two datasets in different languages were tested under variation of the number of topics, and it was demonstrated that there are large regions where the density-of-states function reproduces itself, i.e., fractal behavior is observed. The areas between such regions of self-similarity are transition regions. In such regions, the density-of-states function changes, i.e., the character of self-similarity changes. Work [8] demonstrates that the transition regions correspond to the human mark-up. Regions of self-similarity do not lead to changes in the structure of the solutions of TM; therefore, it is sufficient to find the transition regions in order to determine the optimal topic number of a collection. The disadvantage of this approach is its computational complexity, both in terms of time and computational resources: to find the transition regions one needs to run topic modeling many times with multiple values of the topic number. Since there are regions of self-similarity, we propose to apply renormalization theory to speed up the search for the topic number optimum.
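The slope estimation in step 6 can be sketched as a least-squares fit of ln(ρ) against ln(1/T); the ρ values below are synthetic and serve only to show the procedure.

```python
# Hedged sketch: estimating the fractal dimension as the slope of log(rho) vs log(1/T).
import numpy as np

T_values = np.array([10, 20, 40, 80])                 # hypothetical topic numbers
rho_values = np.array([0.30, 0.21, 0.15, 0.10])       # hypothetical density-of-states values

x = np.log(1.0 / T_values)
y = np.log(rho_values)
D, intercept = np.polyfit(x, y, 1)                    # slope D = d ln(rho) / d ln(1/T)
print(round(D, 3))
```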

3 Method

3.1 Application of Renormalization in Topic Modeling

In this subsection, we explain the main idea of renormalization for the task of topic modeling (its application to the Latent Dirichlet Allocation model with the Gibbs sampling procedure is demonstrated in Subsect. 3.2). Recall that the output of TM contains the matrix \Phi = \{\phi_{wt}\} of size W x T. Here, we consider a fixed vocabulary of unique words; therefore, the scale of renormalization depends only on the parameter q = 1/T. The renormalization procedure is a procedure of merging two topics into one new topic. As a result of the merging procedure, we obtain a new topic t̃ with its topic-word distribution satisfying \sum_w \phi_{w\tilde{t}} = 1. Since the calculation of matrix \Phi depends on the particular topic model, the mathematical formulation of the renormalization procedure is model-dependent. Also, the results of merging depend on how the topics for merging are selected. In this work, we consider three principles of selecting topics for merging:

- Similar topics. The similarity measure can be calculated according to the Kullback-Leibler divergence [19]: KL(t_1, t_2) = \sum_w \phi_{wt_1} \ln(\phi_{wt_1} / \phi_{wt_2}) = \sum_w \phi_{wt_1} \ln(\phi_{wt_1}) - \sum_w \phi_{wt_1} \ln(\phi_{wt_2}), where \phi_{wt_1} and \phi_{wt_2} are topic-word distributions and t_1, t_2 are topics. The two topics with the smallest value of KL divergence are then chosen (a selection sketch is given after this list).
- Topics with the lowest Renyi entropy. Here, we calculate the Renyi entropy for each topic individually according to Eq. (2), where only the probabilities of words in one topic are used. Then we select a pair of topics with the smallest values of Renyi entropy. As large values of Renyi entropy correspond to the least informative topics, minimum values characterize the most informative topics. Thus, we choose informative topics for merging.
- Randomly chosen topics. Here, we generate two random numbers in the range [1, T̂], where T̂ is the current number of topics, and merge the topics with these numbers. This principle leads to the highest computational speed.
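As an illustration of the first selection principle, the following sketch picks the pair of topics with the smallest KL divergence between their word distributions. The toy Φ matrix and the numerical smoothing constant are assumptions added for the example.

```python
# Hedged sketch: choosing a topic pair to merge by minimal KL divergence.
import numpy as np

rng = np.random.default_rng(3)
W, T = 500, 10
Phi = rng.dirichlet(np.ones(W), size=T).T            # toy topic-word distributions (columns)

def kl(p, q, eps=1e-12):
    """KL(p || q) with a small epsilon to avoid log(0); eps is an added assumption."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

best = min((kl(Phi[:, t1], Phi[:, t2]), t1, t2)
           for t1 in range(T) for t2 in range(T) if t1 != t2)
print(best)                                          # (smallest KL value, t1, t2)
```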

3.2 Renormalization for Latent Dirichlet Allocation Model

Let us consider the Latent Dirichlet Allocation model with the Gibbs sampling algorithm. This model assumes that the word-topic and topic-document distributions are described by symmetric Dirichlet distributions with parameters \alpha and \beta [20], correspondingly. Matrix \Phi is estimated by means of the Gibbs sampling algorithm. Here, the values of \alpha and \beta are set by the user. The calculation of \Phi consists of two phases. The first phase includes sampling and calculation of a counter c_{wt}, where c_{wt} is the number of times word w is assigned to topic t. The second phase contains the recalculation of \Phi according to

\phi_{wt} = \frac{c_{wt} + \beta}{\left( \sum_w c_{wt} \right) + \beta W}.   (3)

For our task of renormalization, we use the values of the counters c_{wt} and Eq. (3). Notice that the counters c_{wt} form matrix C = \{c_{wt}\}, and this is the matrix which undergoes renormalization. Based on matrix C, the renormalized version of matrix \Phi is then calculated. The algorithm of renormalization consists of the following steps:

1. We choose a pair of topics for merging according to one of the principles described in Subsect. 3.1. Let us denote the chosen pair of topics by t_1 and t_2.
2. Merging of the selected topics. We aim to obtain the distribution of the new topic t̃, resulting from merging topics t_1 and t_2, which would satisfy Eq. (3). Merging for matrix C means summation of the counters c_{wt_1} and c_{wt_2}, namely c_{w\tilde{t}} = c_{wt_1} + c_{wt_2}. Then, based on the new values of the counters, we calculate \phi_{w\tilde{t}} in the following way (analogous to Eq. (3)):

\phi_{w\tilde{t}} = \frac{c_{wt_1} + c_{wt_2} + \beta}{\left( \sum_w (c_{wt_1} + c_{wt_2}) \right) + \beta W}.   (4)

One can easily see that the new distribution \phi_{\cdot\tilde{t}} satisfies \sum_w \phi_{w\tilde{t}} = 1. Then, we replace column \phi_{wt_1} by \phi_{w\tilde{t}} and delete column \phi_{wt_2} from matrix \Phi. Note that this step decreases the number of topics by one, i.e., at the end of this step we have T - 1 topics.

Steps 1 and 2 are repeated until there are only two topics left. At the end of each step 2, we calculate the Renyi entropy for the current matrix \Phi according to Eq. (2). Then we plot Renyi entropy as a function of the number of topics and search for its minimum to determine the approximation of the optimal number of topics. Thus, our proposed method incorporates the Renyi entropy-based approach and renormalization theory. Moreover, it does not require the calculation of many topic models with different topic numbers, but only one topic solution with a large enough T.
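One renormalization step of the algorithm above can be sketched as follows. The counter matrix is randomly generated and the choice of which topics to merge is hard-coded, so this only demonstrates Eqs. (3)-(4), not the full procedure with entropy tracking.

```python
# Hedged sketch: merging two LDA topics by summing their count columns in C,
# then recomputing the merged column of Phi via Eq. (4).
import numpy as np

rng = np.random.default_rng(4)
W, T, beta = 500, 10, 0.1
C = rng.integers(0, 20, size=(W, T)).astype(float)   # toy c_wt counters from Gibbs sampling

def merge_topics(C, t1, t2, beta):
    """Return (C_new, Phi_new) after merging topic t2 into t1."""
    C = C.copy()
    C[:, t1] += C[:, t2]                              # c_{w,t~} = c_{w,t1} + c_{w,t2}
    C = np.delete(C, t2, axis=1)                      # one topic fewer
    Phi = (C + beta) / (C.sum(axis=0) + beta * W)     # Eq. (3)/(4), applied column-wise
    return C, Phi

C, Phi = merge_topics(C, 0, 1, beta)
print(Phi.shape, Phi[:, 0].sum())                     # (W, T-1); merged column sums to 1
```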

4 Numerical Experiments

For our numerical experiments, the following datasets were used:

- Russian dataset (RD) from the Lenta.ru news agency [21]. Each document of the dataset was assigned to one of ten topic classes by the dataset provider. We consider a subset of this dataset which contains 8,624 documents with a total of 23,297 unique words (available at [22]).
- English dataset (ED), the well-known '20 Newsgroups' dataset [23]. It contains 15,404 English documents with a total of 50,948 unique words. Each of the documents was assigned to one or more of 20 topic groups. Moreover, it was demonstrated [24] that 14-20 topics can represent this dataset.

These datasets were used for topic modeling in the range of [2, 100] topics in increments of one topic. The hyperparameters of the LDA model were fixed at the values \alpha = 0.1 and \beta = 0.1. Research on the optimal values of the hyperparameters for these datasets was presented in work [9]; therefore, we do not vary the hyperparameters in our work. For both datasets, the topic solution with 100 topics underwent renormalization with successive reduction of the number of topics. Based on the results of the consecutive renormalization, curves of Renyi entropy were plotted as functions of the number of topics. Further, the obtained Renyi entropy curves were compared to the original Renyi entropy curves [7] obtained without renormalization.

4.1 Russian Dataset

Figure 1 demonstrates curves of Renyi entropy, where the original Renyi entropy curve was obtained by successive topic modeling with different topic numbers (black line) and the other Renyi entropy curves were obtained from five different runs of the same 100-topic model by means of renormalization with a random selection of topics for merging. Here and further, the minima are denoted by circles in the figures. The minimum of the original Renyi entropy corresponds to 8 topics; the minima of the renormalized Renyi entropy correspond to 12, 11, 11, 17 and 8 topics, depending on the run. Accordingly, the average minimum of the five runs corresponds to 12 topics. As demonstrated in Fig. 1, renormalization with merging of random topics, on the one hand, provides correct values of Renyi entropy at the boundaries, i.e., for T = 2 and T = 100; on the other hand, the minimum can fluctuate in the region of [8, 17] topics. However, on average, random merging leads to a result which is quite similar to that obtained without renormalization.

Figure 2 demonstrates the renormalized Renyi entropy based on merging topics with the lowest Renyi entropy. It can be seen that for this principle of selecting topics for merging, the renormalized Renyi entropy curve is flat around its minimum (unlike the original Renyi entropy curve), which complicates finding this minimum. The flat area around the global minimum is located in the region of 10-18 topics. At the same time, at the endpoints of the considered range of topics the renormalized Renyi entropy curve has values similar to those of the original Renyi entropy, i.e., for T = 2 and T = 100.

Figure 3 demonstrates the behavior of the renormalized Renyi entropy when the principle of selecting topics for merging is based on KL divergence. It shows that this principle leads to the worst result: the renormalized Renyi entropy curve has a minimum that does not correspond to the optimal number of topics. However, just like all other versions of the renormalized entropies, it behaves 'correctly' at the boundaries, i.e., it has maxima for T = 2 and T = 100.

4.2 English Dataset

The results obtained on this dataset are similar to those based on the Russian dataset. Figure 4 demonstrates five runs of renormalization with randomly selected topics for merging on the English dataset. One can see that the curves are very similar to each other and to the original Renyi entropy curve. The minimum of the original Renyi entropy corresponds to 14 topics, minima of renormalized Renyi entropy correspond to 17, 11, 14, 23 and 12 topics, depending on the run of renormalization. Accordingly, the average minimum of five runs corresponds to 15 topics. Figure 5 demonstrates the renormalized Renyi entropy

242

S. Koltcov and V. Ignatenko

Fig. 1. Renyi entropy vs the number of topics T (RD). Original Renyi entropy – black. Renormalized Renyi entropy with random merging of topics: run 1 – red; run 2 – green; run 3 – blue; run 4 – magenta; run 5 – yellow. ()

Fig. 2. Renyi entropy vs the number of topics T (RD). Original Renyi entropy – black; renormalized Renyi entropy (topics with the lowest Renyi entropy merged) – red.

curve, where topics with the lowest Renyi entropy were merged. The minimum of the renormalized entropy corresponds to 16 topics. On average, this type of renormalization leads to slightly lower values of Renyi entropy compared to the original Renyi entropy. Figure 6 demonstrates renormalized Renyi entropy,


Fig. 3. Renyi entropy vs the number of topics T (RD). Original Renyi entropy – black; renormalized Renyi entropy (similar topics with the lowest KL divergence merged) – red.

Fig. 4. Renyi entropy vs the number of topics T (ED). Original Renyi entropy – black. Renormalized Renyi entropy with random merging of topics: run 1 – red; run 2 – green; run 3 – blue; run 4 – magenta; run 5 – yellow.

Figure 6 demonstrates the renormalized Renyi entropy where topics were merged based on the KL divergence between them. Again, we can see that this type of merging leads to the worst result. The renormalized Renyi entropy has a minimum at T = 43, which corresponds neither to the human mark-up nor to the minimum of the original Renyi entropy.


Fig. 5. Renyi entropy vs the number of topics T (ED). Original Renyi entropy – black; renormalized Renyi entropy (topics with the lowest Renyi entropy merged) – red.

Fig. 6. Renyi entropy vs the number of topics T (ED). Original Renyi entropy – black; renormalized Renyi entropy (similar topics with the lowest KL divergence merged) – red.

4.3 Comparison of Computational Speed of Original and Renormalized Models

Table 1 demonstrates the computational speed for a sequence of topic models and for renormalization. All calculations were performed on the following equipment: an Asus notebook with an Intel Core i7-4720HQ CPU (2.6 GHz) and 12 GB RAM, running Windows 10 (64-bit).


Table 1. Computational speed.

Dataset         | TM simulation and calculation of Renyi entropy | Renormalization (random) | Renormalization (minimum Renyi entropy) | Renormalization (minimum KL divergence)
Russian dataset | 90 min                                         | 8 min                    | 16 min                                  | 140 min
English dataset | 240 min                                        | 23 min                   | 42 min                                  | 480 min

Calculations on both datasets demonstrate that renormalization with randomly selected topics for merging is the fastest. Moreover, this type of renormalization produces the renormalized Renyi entropy curve whose behavior is most similar to that of the original Renyi entropy, and its computation is almost 11 times faster than that of the original Renyi entropy. Renormalization based on merging topics with the lowest KL divergence is the slowest: such a calculation is even more time-consuming than a regular grid-search calculation with a reasonable number of iterations. Renormalization in which topics with the lowest Renyi entropy are merged ranks second: its computation is five times faster than that of the original Renyi entropy. Summarizing the obtained results, we conclude that renormalization with randomly selected topics for merging can be an efficient instrument for approximating the optimal number of topics in document collections. However, it is worth mentioning that one should run such renormalization several times and average the obtained number of topics.
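To make the procedure concrete, the following minimal Python sketch illustrates one renormalization pass with random merging of topics, the fastest of the three principles compared above. The topic-word matrix phi, the stand-in entropy function, and the merging rule are illustrative assumptions only; the entropy actually used in the paper is the density-of-states Renyi entropy construction of [7].

    import numpy as np

    def renyi_entropy(phi, q=0.5):
        # Stand-in Renyi entropy of the flattened topic-word matrix; the paper's
        # metric follows [7] and would replace this placeholder.
        p = phi.flatten()
        p = p / p.sum()
        return float(np.log(np.sum(p ** q)) / (1.0 - q))

    def renormalize_random(phi, seed=None):
        # One renormalization pass: repeatedly merge two randomly chosen topics
        # down to T = 2, recording the entropy at every intermediate topic number.
        rng = np.random.default_rng(seed)
        phi = np.asarray(phi, dtype=float).copy()   # rows = topics, columns = words
        curve = {phi.shape[0]: renyi_entropy(phi)}
        while phi.shape[0] > 2:
            i, j = rng.choice(phi.shape[0], size=2, replace=False)
            merged = phi[i] + phi[j]                # merge the two word distributions
            merged /= merged.sum()                  # renormalize to a distribution
            phi = np.delete(phi, [i, j], axis=0)
            phi = np.vstack([phi, merged])
            curve[phi.shape[0]] = renyi_entropy(phi)
        return curve                                # number of topics -> entropy

    # As recommended above, run several passes and average the location of the minimum:
    # curves = [renormalize_random(phi, seed=s) for s in range(5)]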

5 Conclusion

In this work, we have introduced renormalization of topic models as a method of fast approximate search for the optimal range of T in text collections, where T is the number of topics into which a topic modeling algorithm is supposed to cluster a given collection. This approach is introduced as an alternative to the computationally intensive grid-search technique, which has to obtain solutions for all possible values of T in order to find the optimum of any metric being optimized (e.g., entropy). We have shown that, indeed, our approach allows us to estimate the range of optimal values of T for large collections faster than grid search and without substantial deviation from the "true" values of T, as determined by human mark-up. We have also found that some variants of our approach yield better results than others.


Renormalization involves a procedure of merging groups of topics, initially obtained with an excessive T, and the principle by which topics are selected for merging has turned out to significantly affect the final results. In this work, we considered three different merge principles that selected: 1) topics with minimum Kullback-Leibler divergence, 2) topics with the lowest Renyi entropy, or 3) random topics. We have shown that the latter approach yielded the best results in terms of both computational speed and accuracy, while Renyi-based selection produced an inconveniently wide flat region around the minimum, and the KL-based approach was slower than the non-renormalized calculation. Since random merging produced a speed gain of more than one hour on our collections, corpora with millions of documents are expected to benefit far more, with savings amounting to hundreds of hours. A limitation of the renormalization approach is that it is model-dependent, i.e., the procedure for merging selected topics depends on the model with which the initial topic solution was obtained. However, although we have tested our approach only on topic models with a Gibbs sampling procedure, there seem to be no theoretical obstacles to applying it to other topic models, including those based on the Expectation-Maximization algorithm. This appears to be a promising direction for future research, deserving a separate paper.

Acknowledgments. The study was implemented in the framework of the Basic Research Program at the National Research University Higher School of Economics (HSE) in 2019.

References

1. Wallach, H.M., Mimno, D., McCallum, A.: Rethinking LDA: why priors matter. In: Proceedings of the 22nd International Conference on Neural Information Processing Systems, pp. 1973–1981. Curran Associates Inc., USA (2009)
2. Manning, C.D., Schutze, H.: Foundations of Statistical Natural Language Processing. MIT Press, Cambridge (1999)
3. Mimno, D., Wallach, H.M., Talley, E., Leenders, M., McCallum, A.: Optimizing semantic coherence in topic models. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 262–272. Association for Computational Linguistics, Stroudsburg (2011)
4. Röder, M., Both, A., Hinneburg, A.: Exploring the space of topic coherence measures. In: Proceedings of the 8th ACM International Conference on Web Search and Data Mining, pp. 399–408. ACM, New York (2015)
5. Stevens, K., Kegelmeyer, P., Andrzejewski, D., Buttler, D.: Exploring topic coherence over many models and many topics. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 952–961. Association for Computational Linguistics, Stroudsburg (2012)
6. Teh, Y.W., Jordan, M.I., Beal, M.J., Blei, D.M.: Hierarchical Dirichlet processes. J. Am. Stat. Assoc. 101(476), 1566–1581 (2006). https://doi.org/10.1198/016214506000000302
7. Koltsov, S.: Application of Rényi and Tsallis entropies to topic modeling optimization. Phys. A 512, 1192–1204 (2018). https://doi.org/10.1016/j.physa.2018.08.050


8. Ignatenko, V., Koltcov, S., Staab, S., Boukhers, Z.: Fractal approach for determining the optimal number of topics in the field of topic modeling. J. Phys.: Conf. Ser. 1163, 012025 (2019). https://doi.org/10.1088/1742-6596/1163/1/012025
9. Koltsov, S., Ignatenko, V., Koltsova, O.: Estimating topic modeling performance with Sharma-Mittal entropy. Entropy 21(7), 1–29 (2019). https://doi.org/10.3390/e21070660
10. Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 50–57. ACM, New York (1999)
11. Vorontsov, K., Potapenko, A.: Additive regularization of topic models. Mach. Learn. 101, 303–323 (2015). https://doi.org/10.1007/s10994-014-5476-6
12. Kadanoff, L.P.: Statistical Physics: Statics, Dynamics and Renormalization. World Scientific, Singapore (2000)
13. Wilson, K.G.: Renormalization group and critical phenomena. I. Renormalization group and the Kadanoff scaling picture. Phys. Rev. B 4(9), 3174–3183 (1971). https://doi.org/10.1103/PhysRevB.4.3174
14. Olemskoi, A.I.: Synergetics of Complex Systems: Phenomenology and Statistical Theory. Krasand, Moscow (2009)
15. Carpinteri, A., Chiaia, B.: Multifractal nature of concrete fracture surfaces and size effects on nominal fracture energy. Mater. Struct. 28(8), 435–443 (1995). https://doi.org/10.1007/BF02473162
16. Essam, J.W.: Potts models, percolation, and duality. J. Math. Phys. 20(8), 1769–1773 (1979). https://doi.org/10.1063/1.524264
17. Wilson, K.G., Kogut, J.: The renormalization group and the ε expansion. Phys. Rep. 12(2), 75–199 (1974). https://doi.org/10.1016/0370-1573(74)90023-4
18. Beck, C.: Generalised information and entropy measures in physics. Contemp. Phys. 50, 495–510 (2009). https://doi.org/10.1080/00107510902823517
19. Steyvers, M., Griffiths, T.: Probabilistic topic models. In: Handbook of Latent Semantic Analysis, 1st edn. Lawrence Erlbaum Associates, Mahwah (2007)
20. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
21. News dataset from Lenta.ru. https://www.kaggle.com/yutkin/corpus-of-russian-news-articles-from-lenta
22. Balanced subset of news dataset from Lenta.ru. https://yadi.sk/i/RgBMt7lJLK9gfg
23. 20 Newsgroups dataset. http://qwone.com/~jason/20Newsgroups/
24. Basu, S., Davidson, I., Wagstaff, K.: Constrained Clustering: Advances in Algorithms, Theory, and Applications, 1st edn. Chapman and Hall, New York (2008)

Strategic Inference in Adversarial Encounters Using Graph Matching

D. Michael Franklin(B)

Kennesaw State University, Marietta, GA, USA
[email protected]
http://ksuweb.kennesaw.edu/~dfrank15/

Abstract. There are many situations where we need to determine the most likely strategy that another team is following. Their strategy dictates their most likely next actions by selecting the optimal policy from their set of policies. In this scenario, there is a hierarchical, multi-agent, multi-team environment where the teams are built on layers of agents working together at each level to coordinate their behaviors, such as SiMAMT. We can think of a strategy as a hierarchically layered policy network that allows teams to work together as a group while maintaining their own personalities. They can also shift from one policy to another as the situation dictates. SiMAMT creates an environment like this where sets of such teams can work together as allies or team up against others as adversaries. In this context, we wish to have a set of teams working as allies facing another set of teams as adversaries. One alliance should be able to analyze the actions of another alliance to determine the most likely strategy that it is following, thus predicting its next actions as well as the next best actions for the current alliance. To accomplish this, the algorithm builds graphs that represent the alignment (i.e., the constellation) of the various agents' policies and their movement dependency diagrams (MDDs). These graphs are a clear way to represent the individual agents' policies and their aggregation into a strategy. In this instance, the edges of the graphs represent choices that the policy can make while the vertices represent the decision junctures. This creates a map of the various agents as they move through a progression of decisions, where each decision is made at a decision juncture, and each edge shows the probabilistic progression from each of those decisions. These graphs can show the likelihood of actions taken at each level of the hierarchy, thus encoding the behaviors of the agents, their groups, the teams, and the alliances. We wish to demonstrate that these graphs can represent large sets of teams or alliances and that each alliance can use these representations to coordinate its own behavior while analyzing the behaviors of other alliances. Further, we wish to show that an alliance can decide in interactive time which policy from within its strategy set should be in place, based on its observations of the other alliances' strategies. To do so, the algorithm builds a probabilistic graph based on the observed actions of the other alliances by observing the actions taken by each agent within those alliances. It can then compare that probabilistic graph with known graph strategies or those that it has learned along the way. We present this methodology and verify it with experimentation, with the results confirmed in the conclusions of this paper.

Keywords: Artificial intelligence · Multi-agent systems · Tree matching · Strategy inference

1 Introduction

We wish to show that we can utilize the SiMAMT simulation environment [5] to analyze large-scale, team-based, strategic interactions in interactive time. To do so, we will use the graph matching algorithm presented in [4] and implemented within SiMAMT. It has been shown in [3] that strategic reasoning is possible in multi-agent AI, and that using a methodology such as the one presented here allows these strategies to be recognized. These tools provide an excellent testing environment for strategic inference. We will extend this work into the large-scale multi-agent environment. The graph matching problem is at least NP-Complete (e.g., [7]), indicating the inherent complexity of the graphing challenge. Further, it is complicated even more when we consider that we need to distinguish isomorphic (and, for that matter, homeomorphic) variants of known graphs [10]. While the previously cited paper, [4], has shown that this challenge can be met in interactive time, we need to examine how the algorithm will handle graphs of a larger scale, both in terms of the number of nodes and edges and in complexity. Mapping hierarchical strategies for multi-agent teams is exactly the scenario that will test this system. These multi-agent team strategies have many more branches, a higher level of subgraph homeomorphism and isomorphism, and an inherently layered subgraph challenge. This complexity arises because the overall graph is an aggregation and agglomeration of the individual agents' strategy graphs melded into a subgroup graph; these subgroup graphs are joined to form a group graph, the group graphs are combined to build the team graph, and the team strategy maps are united to arrive at the alliance graph that controls the entire team of teams (the alliance). Within these alliance graphs, however, there may be a large amount of repetition, redundancy, or slight variation (i.e., strongly similar subgraphs). To remain efficient, the algorithm needs to be modified to include the ability to recognize and identify subgraphs, with memory, as it proceeds with its analysis. To compound this challenge, these subgraphs are likely homeomorphs or isomorphs of each other. This means that the strategy recognition needs to be hierarchical, and within each layer and at each level there needs to be a 'memory' of recognized strategies such that this information is fed forward to the next layer of recognition. To complicate this further, this algorithm is meant to run in parallel and in a distributed fashion, so this 'memory' must be a shared memory. These factors have a tremendous impact on the runtime of the algorithm as well as its scalability.


In this paper, we will build the strategy graphs starting from the agent level and rising to the top of the hierarchy. These graphs have been described in other work [5], but will be shown here for completeness. These graphs can indicate a flow of decision making, a movement graph, or many other forms of organized flow process. We will envision these as team movement strategies for the sake of this paper. Once we have built these large-scale graphs, we will then seek to analyze them at the global and local level (as a whole, and in parts) to determine the most likely policy, thus the most likely strategy that the other alliance is using. Finally, we will compare these most-likely strategies with the current strategy for our alliance to see if we need to switch our own strategy to better suit the circumstances and the actions of the other alliances.

2 Related Works

The work of Vazirani [10] describes the complexity classes of finding complete matchings in variations of complete graphs. It documents how to decompose the problem of K3,3 into subsets of examination using K5 discrete elements. This decomposition brings the problem from the complexities of intractability into the tractable realm. Namely, considering the decomposition as a set of parallel sub-optimizations that can be discretely considered provides a lower NP bound, though it requires parallel processing. Further, the paper introduces the exponential increase in complexity when considering homeomorphic graphs. The insight of graph decomposition is utilized herein to inform the process of culling the list of viable candidate trees, and it confirms the intractability of isomorphic graphs. In [1], the authors investigate paths and matchings within k-trees. They analyze the complexity classes of such matchings and searches in these trees. Their experiments substantiate the claims of the complexity of matching trees and offer motivation for approximate solutions to this class of problems. Their insight into tree decomposition is also helpful in organizing solutions to large-scale problems in the realm of tree matching. Their work stops short of larger-scale graphs and does not consider approximate solutions to the more intractable issues of homeomorphism and isomorphism. Some additional and informative expansion of this analysis is provided by [11], where the authors examine the combinatorial nature of these complex graph matching calculations. Datta et al. study the effects of moving the graph matching problem into the area of bipartite graphs. In [2] they established an algorithm to find matches within and between graphs and analyzed the relevant complexities, assigning simplex matching to NC. This lays the groundwork for the work herein, where the limits of this complexity are pushed into higher bounds by matching homeomorphic and isomorphic graphs. Their work also helped to inform the baseline algorithm used herein for comparison to the approximate solution (along with several other works and the author's own research). This work builds on the preliminary and related work found in [4] and expands to complete the promise of that work.


The work herein starts from there and expands to include the fulfillment of the multi-agent ramifications, the handling of increased complexity, and the breadth of application areas from this foundational work. The work of Fukuda et al. further explores the complexity of matchings within bipartite graphs and aims to make improvements to the process. Their work [6] elucidates the difficulties and complexities of such matchings; in particular, they move from O((c + 1)n^2) to O(c(n + m) + n^2.5). Additionally, [8] extends the analysis of graph matching and graph complexity. While the authors do not deal directly with homeomorphic and isomorphic matchings, they do further describe the relationship between complexity and memory management. The algorithm they propose does, in fact, lower the computational complexity, but at the cost of increased memory utilization. Additionally, they recognize that computing such difficult matchings pushes the limits of computability. As a result, the entire problem of performing these matchings in real time is further clarified as intractable and in need of the approximate solution proposed in this research.

3 Methodology

The graphs are built from foundational elements, and these form the nodes. The edges are allowable transitions from one element to another. Compiling these elements and mappings, we can form the strategy graph for an agent, such as the one shown in Fig. 1. These figures show a soccer or football field and an example of the movements through that field that an agent may take (according to their individual policy). These individual agents, shown by each graph, move throughout the field based on the team’s current strategy (which selects the current policy for each agent in accordance with the overarching team goals, respecting the alliance strategy that is in force for the entire alliance). While these strategic graphs could represent many operations, functionalities, or other elements, this particular example represents a soccer team’s strategy for back line defense. There are several ‘teams’ in soccer that make up the overall Team. Here we see the backline (defense) team, but there are also the frontline (offense) and midline (mixed offense and defense) teams. There are two other players, typically, in soccer (the goalie and a sweeper), and they are accounted for in one of these lines depending on the strategy. In most scenarios the goalie is integrated into the backline and the sweeper is incorporated into the frontline. The alliance strategy would dictate the strategies for each team as well as the memberships for the goalie and the sweeper. Continuing this example, we can see the fullbacks and their areas of coverage (the graph). Figure 1a shows the defensive coverage area for the left fullback, Fig. 1b shows the center fullback, and Fig. 1c shows the right fullback, each of them showing the policies assigned to these agents from their backline team strategy, shown collectively in Fig. 1d. To be clear, as is seen in the referenced papers that led to this work, this is just one possible set of policies from this particular strategy, and this represents just one such strategy from the set of strategies in the alliance strategy.
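As an illustration of the data structure described above, the following Python sketch (using networkx) builds an agent's policy as a directed graph whose vertices are decision junctures and whose weighted edges are the allowable probabilistic transitions. The zone names and probabilities are invented for the example and are not taken from the SiMAMT implementation.

    import networkx as nx

    def build_policy_graph(transitions):
        # Vertices are decision junctures (field zones here); weighted, directed
        # edges are the probabilistic choices the policy allows at each juncture.
        g = nx.DiGraph()
        for src, dst, prob in transitions:
            g.add_edge(src, dst, prob=prob)
        return g

    left_fullback = build_policy_graph([
        ("LB_deep", "LB_wide", 0.5), ("LB_deep", "CB_cover", 0.3), ("LB_deep", "LB_push", 0.2),
        ("LB_wide", "LB_deep", 0.6), ("LB_wide", "LB_push", 0.4),
    ])

    # A team (e.g., backline) strategy graph is the union of its agents' policy
    # graphs; alliance graphs are built the same way from the team graphs.
    backline = nx.compose_all([left_fullback])  # add the centre and right fullbacks similarly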


Fig. 1. Building a strategy graph from policies


To further this example without belaboring the point, Fig. 2 shows a compilation leading to a sample alliance strategy. Figure 2a shows a sample midline strategy and Fig. 2b shows a sample frontline strategy. The complexity of even a simple graph like this becomes clear when you see the overlay of the three team strategies into the alliance strategy, shown in Fig. 2c. While this image may be tough to see, it should suffice to show the complexity, and the progression of the complexity, as strategies interleave and combine. Further, there would be a similar alliance strategy for the other alliance, and both alliances would be shifting their strategies for each team as the game progressed, based on the data gathered from their agents' observations. The complexity of another scenario, such as the battlefield shown in Fig. 3, would be significantly greater. We will use this soccer scenario for our initial experimentation in this paper. When these graphs are combined to form an alliance strategy in a similar fashion, the resulting graph becomes larger and more complex. This structure is harder to visualize in two dimensions, but can be thought of as a three-dimensional (or, in larger cases, an n-dimensional) tree or a multi-layered graph. These graphs are also capable of being analyzed (compared for similarity, for example) by our proposed algorithm. The study will be extended to measure the effect on the algorithm's runtime and efficiency when these larger and more complex graphs are compared for similarity. The algorithm takes the target graph (the one it is inspecting) and compares it to a candidate graph (one from the set of known graphs). The algorithm processes each candidate graph from the set, comparing the target graph with each one; this can be done in serial or in parallel. This comparison process can be considered in two ways: first, it can be a complete match; second, it can be a partial or probabilistic match. In the first case, the matching requires a walk through the graph multiple times to verify that the connectivity of each node (its set of neighbors) is identical. This is an arduous and rigorous process [10], especially when considering homeomorphs, isomorphs, and other variations of similar types. In the second case, the matching is challenging because of partial information. This partial information means that the candidates have to be held in memory according to their probability of matching, and the set of possible matches can be narrowed down as more information becomes available, as illustrated in Fig. 4. The converse advantage, though, is that many of the candidates can be eliminated with very little information because they do not contain any matching information [4]. When this algorithm is expanded to the more complex and voluminous examples in this paper, these considerations become even more difficult. More details of this matching are discussed in the experimentation section. As Fig. 4 shows, the algorithm processes these matchings and builds a belief network of the candidates. The belief network holds each candidate that has a greater-than-zero probability of matching and then updates the beliefs with each new observation reported into the system. The SiMAMT system gathers observations from each agent as a part of its processing and reports these observations according to the scenario (e.g., only to itself, to its immediate teammates,


Fig. 2. Building an alliance graph from strategies


Fig. 3. Battlefield strategies example


Fig. 4. Approximate matching

to the whole team, etc.). As these observations are gathered, each informed agent forms its belief network and acts according to this data. SiMAMT also provides a module to infer the derived strategies based on these observations, and that module is powered by our proposed algorithm. The proposed algorithm drives the belief networks that reside with each agent and at each level of the hierarchy. This explains the need for the algorithm to scale, to be distributed, and to work in parallel. To process the graphs, they are first sorted. The sorting process orders the vertices by degree, from highest to lowest. The edges need not be sorted, as they are pairings of vertices. Sorting is, of course, not a cheap process, but two considerations help: we need only sort the target graph, as the candidate graphs can be sorted ahead of time. As is written in the reference literature, we compare the observed graph to all known strategy graphs and look for the closest match; if none is found, the new graph being built from the observation is added to the known set of strategies and the set grows. The algorithm benefits greatly from this sorting because it can now quickly reject any candidate graph whose first sorted vertex is of a lower or


higher degree than the target graph. This 'quick no' philosophy does not reduce the worst-case scenario, but it has a significant impact on the average runtime of the algorithm. This process, and the resultant culling, have been written about in [4]. We now wish to prove that those observations hold, and in fact continue to improve, as the complexity and size of the graphs increase.
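The belief-network mechanism described above can be sketched as follows; the dictionary of candidate graphs, the consistency predicate, and the renormalization step are illustrative assumptions rather than the actual SiMAMT code.

    def update_beliefs(beliefs, observation, consistent):
        # beliefs: {candidate_graph: probability}. Candidates that cannot explain
        # the new observation drop to zero and the rest are renormalized.
        # `consistent(graph, observation)` is an assumed predicate, e.g. "does this
        # graph contain the observed decision juncture and the edge just taken?".
        updated = {g: (p if consistent(g, observation) else 0.0) for g, p in beliefs.items()}
        total = sum(updated.values())
        if total == 0.0:
            # No known strategy matches: the graph built from observations is new
            # and can be added to the known strategy set, as described above.
            return updated
        return {g: p / total for g, p in updated.items()}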

3.1 Homeomorphism and Isomorphism

Two graphs which contain the same graph vertices connected in the same way, perhaps with additional vertices, are said to be homeomorphic. Formally, two graphs G and H with graph vertices Vn = (1, 2, ..., n) are said to be homeomorphic if there is a permutation p of Vn such that the resultant graph is a subgraph of the former. Figure 5 provides an example from ([9]).

Fig. 5. Homeomorphic graphs

Two graphs which contain the same number of graph vertices connected in the same way are said to be isomorphic. Formally, two graphs G and H with graph vertices Vn = (1, 2, ..., n) are said to be isomorphic if there is a permutation p of Vn such that (u, v) is in the set of graph edges E(G) ⇐⇒ (p(u), p(v)) is in the set of graph edges E(H). Figure 6 provides an example from ([12]).

Fig. 6. Isomorphic graphs

The cost of calculating isomorphism is much higher than the cost of calculating homeomorphism, so the approximation algorithm should consider these in this order. There may be graphs that are homeomorphic and not isomorphic, but there are no reasonable graphs that are isomorphic but not homeomorphic. This intuition is used in the intelligent pruning section of the approximation algorithm to cull the target list. The culling process involves an overall reduction in the set of candidate graphs. As mentioned earlier, the sorting process allows for a quick check on the most complex vertex and its related edges. This eliminates many candidates.


Additionally, the graph storage memory model holds the number of vertices and edges, so another 'quick no' can come from simply checking whether the target graph and the candidate graph have the same number of vertices and edges. Naturally, when matching with incomplete information, it is not as clear how the eliminations might work (since the highest-degree vertex may not have been observed yet, for example). However, the process is quite similar: if a particular observed node does not exist in the candidate graph (with the prescribed edges), then that graph cannot match the target graph. This optimization offers increased performance as the algorithm runs and allows the decision-making process to happen within the interactive-time requirement. The experiments section describes the experiments that were designed to test these hypotheses and gather performance metrics.
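A minimal sketch of the culling described in this section, written with networkx as a stand-in for the actual implementation: cheap 'quick no' checks on vertex counts, edge counts, and sorted degree sequences are applied before the expensive isomorphism test.

    import networkx as nx

    def quick_reject(target, candidate):
        # Cheap checks that can discard a candidate without a full matching walk.
        if target.number_of_nodes() != candidate.number_of_nodes():
            return True
        if target.number_of_edges() != candidate.number_of_edges():
            return True
        def degrees(g):
            # Degree sequence sorted highest-to-lowest, so the most complex
            # vertices are compared first.
            return sorted((d for _, d in g.degree()), reverse=True)
        return degrees(target) != degrees(candidate)

    def match_candidates(target, known_graphs):
        # Survivors of the culling are handed to the expensive isomorphism test.
        survivors = [g for g in known_graphs if not quick_reject(target, g)]
        return [g for g in survivors if nx.is_isomorphic(target, g)]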

4 Implementation

The complete matching algorithm has to, as previously mentioned, walk through each vertex and trace each edge to verify the matching neighbors. The previous work we published shows the progression from the complete matching to the approximate matching, but in general it follows the aforementioned culling process. This approximate matching algorithm is further optimized by pruning the graphs to keep from pursuing branches that should be abandoned because they contain invalid (non-matching) elements. These optimizations bring the runtime of the algorithm within the constraints of interactive time for the examples and experiments given in the reference literature, but we wish to expand these experiments to work with much larger and much more complex graphs. These same optimizations should have a flattening effect on efficiency, meaning that they should be relatively more efficient as the size and complexity increase. The experiments will validate this hypothesis.

5 Experiments

The experiments will look first at the aforementioned soccer problem, analyzing the strategies of alliances to see how the algorithm scales. Table 1 shows the results from the initial runs and confirms that the algorithm can process large-scale data. These first runs use a minimal configuration for graph complexity, with 3 teams per alliance. There are a number of strategies to consider: 3 per team, 3 teams per alliance, and the overall alliance, for a total of 13 policies dictated by 4 strategies. There are two alliances in the scenario, so there are twice as many overall policies to coordinate and strategies to analyze (SiMAMT has each team analyze its own strategy along with all of the others). However, the reality of the strategies is that there are very large graphs associated with them, because there are many options for each player on the team. For comparison, there are potentially 3k nodes with up to 10k edges for each strategy network graph. The results show the number of graph matches found, including the number of homeomorphs and isomorphs.


We can see that the algorithm is more than capable of handling graphs this large within the constraint window of interactive-time decision making. The times are presented only for scale; the absolute time is irrelevant in and of itself, so long as the decision can be made in interactive time. For this example, to make the time growth visible, the algorithm is run in serial mode (much slower) instead of in parallel. These times are from an i7, 3.2 GHz, 16 GB RAM computer without using the GPU for calculations.

Table 1. Full and approximate tree matching (n = 1000, d = 7)

Alg. | Category   | Trial1 | Trial2 | Trial3 | Avg
Full | H-morphs   | 3784   | 3924   | 3856   | 3854.67
Full | I-morphs   | 1254   | 1148   | 1305   | 1235.67
Full | Time (sec) | 1654   | 1743   | 1770   | 1722.33
Aprx | H-morphs   | 3784   | 3924   | 3856   | 3854.67
Aprx | I-morphs   | 1254   | 1148   | 1305   | 1235.67
Aprx | Time (sec) | 205    | 212    | 198    | 205

In the second experiment the algorithm was put to the test in a large-scale battle scenario. In this scenario there are soldiers at the base (leaf) level, each with their own policy. The soldiers are grouped into teams of 10 to 12 (the strategy dictates the team makeup), each team having its own strategy for disseminating policies to each soldier. 4 to 8 teams form a squad, again with strategies at each level. These squads can be grouped into a platoon, then platoons into companies, companies into battalions, and battalions into brigades. These brigades are quite large, containing 5k soldiers or more. They can be folded into divisions, and divisions into corps. With the numbers we are using, that could mean 40k–50k soldiers in the scenario for each alliance, and we want 4 alliances in all. The algorithm is able to handle the much larger scale and the corresponding increase in complexity, and the results are shown in Table 2.

Table 2. Full and approximate tree matching (n = 50000, d = 100)

Alg. | Category   | Trial1  | Trial2  | Trial3  | Avg
Full | H-morphs   | 126784  | 134850  | 127504  | 129712.67
Full | I-morphs   | 3452    | 3875    | 3603    | 3643.33
Full | Time (sec) | 112856  | 132756  | 122953  | 122855
Aprx | H-morphs   | 126784  | 134850  | 127504  | 129712.67
Aprx | I-morphs   | 3452    | 3875    | 3603    | 3643.33
Aprx | Time (sec) | 507     | 518     | 531     | 518.67

The main result from these tables is that the approximation algorithm does not lose any data in these experiments, though the claim is only that the loss is negligible, not that it is zero.


The second important result is that the approximation algorithm shows a significant time reduction even while maintaining (here) perfect accuracy, though it is not theorized to maintain 100% accuracy in general. More importantly, the approximation algorithm continues to derive time-saving benefits even as the complexity grows, as shown in Table 3. In this table, complexity is measured in dimensionality, the higher-order term of the scale (as is true with big-O calculations, the constants, here the size of the graphs by node count, are less important than the complexity). We see the odd-power growth that covers both of the described scenarios and beyond, showing that the speed of the overall execution (here, serial) does not slow down in proportion to the complexity. While there is some increase in run time, it falls below a linear growth scale. It should be noted, again, that this is with the algorithm running in serial mode without assistance so that the times were measurable. With the algorithm running in parallel, the conclusion is reached within seconds at most, and sub-second in the average case.

Table 3. Growth of algorithmic methods by complexity

Tree complexity | Homeomorphs | Isomorphs | Full match (time in sec) | Approx match (time in sec)
n               | 21          | 1         | 427                      | 9
n^3             | 380         | 10        | 24310                    | 111
n^5             | 1340        | 30        | 68934                    | 173
n^7             | 3875        | 51        | 128943                   | 189
n^9             | 10694       | 78        | 412985                   | 205

One final element of this process is to bring the graph matching to its intended conclusion and ask: can it recognize the other alliance's strategy? And, if so, how long does it take? We present the answers in tabular format to show the trials that were run for each scenario (each scenario is its own i.i.d. dataset) and how many steps of movement or action it took for SiMAMT, utilizing the proposed algorithm, to correctly recognize the other alliance's strategy. It may encounter the correct strategy early in its run, but the steps continue to count so long as it has not made its final concluding guess. This makes it a true reflection of the strategy inference engine concluding from the data what the most likely adversarial strategy is, without accepting guesses. Table 4 shows the results of the experiments, listing how many steps it took to recognize the correct strategy.


Table 4. Moves to recognize correct strategy

Trial number | Recognition steps
1            | 27
2            | 32
3            | 25
4            | 31
5            | 30

6 Conclusions

The experiments have shown that the proposed algorithm can be applied to a much larger and more complex dataset and remain reliable and efficient. The growth of the dataset (here, the graphs and their connections) does not adversely affect the utility of the algorithm, and the efficiency remains even with this growth. Further, the data supports the intractability of using a complete matching algorithm, as it begins to take hours, if not longer, to compare the graphs and find matches. We have applied the algorithm to multi-agent, multi-team scenarios, like soccer and the modern battlefield, and have found that SiMAMT can handle the load of running such simulations because of the efficiency of this graph matching algorithm.

7 Future Work

We wish to expand this algorithm further in the future. For now, it operates and works within the confines of the system, but if we wish to utilize even larger and even more complex systems, we would need to optimize the parallel processing for such growth in scale and complexity so that it could run on smaller and more widely available hardware platforms (i.e., not requiring a large cluster or HPC environment). There are many more scenarios to include, and we wish to expand the battlefield simulation to include even more cooperating and competing alliances.

References

1. Das, B., Datta, S., Nimbhorkar, P.: Log-space algorithms for paths and matchings in k-trees. Theory Comput. Syst. 53(4), 669–689 (2013)
2. Datta, S., Kulkarni, R., Roy, S.: Deterministically isolating a perfect matching in bipartite planar graphs. Theory Comput. Syst. 47(3), 737–757 (2010)
3. Franklin, D.M.: Strategy inference in stochastic games using belief networks comprised of probabilistic graphical models. In: Proceedings of FLAIRS (2015)


4. Franklin, D.M.: Strategy inference via real-time homeomorphic and isomorphic tree matching of probabilistic graphical models. In: Proceedings of FLAIRS (2016)
5. Franklin, D.M., Hu, X.: SiMAMT: a framework for strategy-based multi-agent multi-team systems. Int. J. Monit. Surveill. Technol. Res. 5, 1–29 (2017)
6. Fukuda, K., Matsui, T.: Finding all the perfect matchings in bipartite graphs. Appl. Math. Lett. 7(1), 15–18 (1994)
7. Kumar, R., Talton, J.O., Ahmad, S., Roughgarden, T., Klemmer, S.R.: Flexible tree matching. In: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Three, IJCAI 2011, pp. 2674–2679. AAAI Press (2011)
8. Lotker, Z., Patt-Shamir, B., Pettie, S.: Improved distributed approximate matching. J. ACM 62(5), 1–17 (2015)
9. Sepp, S.: Homeomorphic Graph Images, December 2015
10. Vazirani, V.V.: NC algorithms for computing the number of perfect matchings in K3,3-free graphs and related problems. Inf. Comput. 80(2), 152–164 (1989)
11. Wang, T., Yang, H., Lang, C., Feng, S.: An error-tolerant approximate matching algorithm for labeled combinatorial maps. Neurocomputing 156, 211 (2015)
12. Windsor, A.: BOOST Library: Planar Graphs, December 2015

Machine Learning for Offensive Security: Sandbox Classification Using Decision Trees and Artificial Neural Networks

Will Pearce1(B), Nick Landers1, and Nancy Fulda2

1 SilentBreak Security, Lehi, UT 84043, USA
{will,nick}@silentbreaksecurity.com
2 Brigham Young University, Provo, UT 84602, USA
[email protected]

Abstract. The merits of machine learning in information security have primarily focused on bolstering defenses. However, machine learning (ML) techniques are not reserved for organizations with deep pockets and massive data repositories; the democratization of ML has led to a rise in the number of security teams using ML to support offensive operations. The research presented here will explore two models that our team has used to solve a single offensive task, detecting a sandbox. Using process list data gathered with phishing emails, we will demonstrate the use of Decision Trees and Artificial Neural Networks to successfully classify sandboxes, thereby avoiding unsafe execution. This paper aims to give unique insight into how a real offensive team is using machine learning to support offensive operations.

Keywords: Neural networks · Malware · Detection · Offensive · Machine learning · Information security

1 Introduction

The composite set of problems an offensive team needs to solve in order to gain and keep access to a network is quite complex, especially when there are one or more defensive products at each phase of an attack. At a very high level, the process [8] is as follows:

1. External Reconnaissance – Gathering emails, footprinting external infrastructure.
2. Initial Access – Exploiting a technical vulnerability, or landing a phish.
3. Foothold Stabilization – Installing persistence and ensuring access to the network is safe and stable.
4. Privilege Escalation – Gaining elevated privileges in the network.
5. Action on Objectives – Pivoting to relevant servers/hosts, exfiltrating data, installing additional malware, etc.


Given the gauntlet of products and configurations each network presents, it is important that offensive teams take steps to reduce exposure of their Intellectual Property (aka Tools, Tactics, and Procedures) at all phases of an attack. The cost of not doing so can be high – ask any team that has needed to re-roll its entire infrastructure, lost a useful technique to carelessness, or had to rewrite a piece of malware. One important way to protect offensive IP is by preventing the detection of phishing payloads. Phishing is a common technique to gain initial access to organizations' networks. A typical phishing email will emulate correspondence from a trusted entity, with the aim of convincing the user to access a malicious web link or attachment. When clicked, the link or attachment will download a payload onto the user's system, ready for the user to execute. After execution, malware is deployed, giving access to the user's host and potentially compromising the security of an entire network. This is particularly dangerous when the user works for a large corporation or government entity that safeguards critical information. To combat the rise in phishing emails that contain malicious documents, security vendors have integrated sandbox environments into their products. Because sandboxes provide a controlled environment for security analysts to observe malware, it is in the best interest of attackers to keep their malware from executing in a sandbox. To evade analysis, malicious payloads often contain checks against the properties of the host that would indicate whether or not the payload is being executed in a sandbox. If a host fails a check, the payload simply exits, or executes benign code. In this way, a payload might evade scrutiny from more skilled human analysts. Sandboxes also provide a pipeline of threat intelligence data. This data further helps defenders by providing clues about new trends in phishing techniques and malware authorship. In this paper, we present a novel sandbox detection method based on process list data. We compare the performance of two ML algorithms on this task: a Decision Tree classifier based on [3] and [2], and a two-layer Artificial Neural Network (ANN) implemented in Keras [5]. Empirical results show that both models perform well, and are both accurate enough to trust with automated malware deployment decisions. We conclude by highlighting several operational considerations that govern potential deployment of this technology in production settings, including attribution, sandbox drift, and adversarial inputs.

1.1 Basic Sandbox Evasion

Successful execution of a payload on target largely depends on successful sandbox evasion. This can be accomplished via inline evasion techniques such as extended sleep times, logic that executes only when specific conditions are met, or calls to trusted domains for special key exchanges. However, many of these behaviors are well known and can be detected by sufficiently advanced sandboxes [1,11]. Further, some sandbox checks are used so frequently in the context of malware that the checks themselves become classified as malicious.


Common sandbox detection techniques include checking for recently used files, virtualization MAC addresses, the presence of a keyboard, domain membership, and so forth. However, in the escalating arms race of cybersecurity, security vendors clamp down on detectable information almost as quickly as malware developers begin to exploit it [4]. Moreover, as information for attackers grows harder to come by, the best checks are those that provide additional information about a host or its environment. When attempting to gain initial access via phishing, one of the first pieces of information our team gathers from a host is a process list (gathered with tasklist.exe or ps). We manually review each process list in order to identify the security products installed on the host, the architecture, domain-joined status, and user context. This information is used to determine whether it is safe to proceed with the next phase of the attack: installing persistence and deploying other tools. Additionally, a process list can be helpful when troubleshooting deployment failures: by knowing more about the execution environment, we can more confidently adjust our technique or payload for the next execution opportunity.

2 Related Work

To our knowledge, no previous research has explored the application of machine learning to sandbox detection via process list data. As it turns out, applications of ML to any form of offensive strategy are difficult to come by, primarily due to data scarcity within the execution environment and the inability of offensive teams to share rich datasets. Accordingly, ML for penetration testing generally focuses on already-available open-source data, such as discovering vulnerabilities by using a Naive Bayes classifier to identify the active web server [6] or by using a convolutional neural network to identify outdated (and hence likely vulnerable) web sites via visual inspection of screenshots [7]. ML has also been used for payload optimization by the penetration test tool DeepExploit [9], which examines system configurations and selects payloads via the asynchronous model A3C [10].

3 Methodology

In this research, our goal was to accurately classify sandboxes using the aforementioned process list data. A key advantage of this approach is its ability to generalize and automate sandbox checks, such that network operators aren't responsible for manually exiting malware that has been executed in a sandbox. Additionally, rather than gathering information from multiple sources on the host, thereby increasing activity on the target, simply collecting a process list is sufficient to gather the model's input data. How the information is gathered is not part of this paper, but common execution vectors include command execution, Win32 API calls, or shellcode injection.

tasklist.exe/ps.

266

W. Pearce et al.

external server for collection and processing. There were no sandbox detection checks used during data collection. Each process list was manually labelled as target (0), or sandbox (1). 3.1

Process List Data

A process list is a valuable piece of information. Not only is it a reflection of the security posture of the host, it is a reflection of the user, and a standard corporate image. Experience of the authors suggests that differences between “safe” (Appendix: Table 5) and “unsafe” (Appendix: Table 6) hosts are consistent across multiple organizations and sandboxes from multiple vendors, such that process list data could be generalized sufficiently to make an accurate classification. Since both experience and empirical evidence (Appendix: Table 5, Table 6) suggest that process count and user context are major contributors to successful human classification2 , the following features were selected: – Process Count – User Count – Process Count/User Count Ratio Process Count. End user workstations typically have a lot of processes outside of the default Windows processes running: Multiple office products, security products, etc. Sandboxes just boot up, run the payload, and close. Sandboxes aren’t using Excel, Word, Spotify, etc. all at the same time. Additionally, most end user workstations are Windows 10, while most sandboxes still run Windows 7. It’s an important note, because the number of default processes on Windows 10 is significantly higher. User Count. User count is a proxy for administrative privileges. On Windows, if the user is running in a medium-integrity context, the process list will only contain ownership information for processes the user owns (or has access to read). Otherwise, if the user is running in a high-integrity context, the process list will contain the owners for all processes. The difference in user count can be attributed to the fact that most sandboxes run payloads in a high-integrity context, but most organizations have removed administrative rights from their users. Process/User Ratio. This feature is simply a combination of the previous features, and only seeks to support the inductive bias of the authors. Model accuracy was lower when this feature was omitted from training. Other features such as average process id, or a boolean indication of known safe processes running on the host would be alternative options not explored in our research.

2

Even the casual observer will notice stark differences between each process list (Appendix: Table 5, Table 6).

Machine Learning for Offensive Security

267

Table 1. Process list data collected via server uploads executed by phishing payloads. Environments were labeled “Safe” if a human analyst identified the process list as indicative of a production environment and “Unsafe” if the analyst classified it as a Sandbox environment. All Data (384 Samples) Feature Process Count User Count Ratio

Min Max

Mean Std Dev

9.0 305.0 80.0 1.0 17.0 2.5 2.1 305.0 50.3

59.9 1.7 64.6

Safe (324 Samples) Process Count User Count Ratio

9.0 305.0 89.2 1.0 17.0 2.5 2.1 305.0 57.2

60.7 1.8 68.0

Unsafe (60 Samples) Process Count 11.0 1.0 User Count 2.7 Ratio

3.2

56.0 30.6 4.0 2.9 44.0 12.9

11.65 1.0 8.8

Data Analysis and Limitations

An overview of the collected process lists and feature statistics is presented in Table 1. Of 384 collected process lists, 60 were judged to be sandbox environments and 324 were judged to be production environments. Process list length ranged from 9.0–305.0 active processes, with a mean length of 80.0 and a standard deviation of 59.9. A number of observations can be made about the data depicted in Table 1. Firstly, user count is not a robust way of checking whether the payload was run as an administrator. While most sandboxes run payloads as an administrator, some do not, and this reflects in the user count due to the variance in Windows permissions. Secondly, the max users in safe hosts is 17.0 (Table 1). It is unusual for a phishing target to have 17 users logged in, and is more indicative of a server of some sort, or would indicate a particular organization’s remote access implementation. This data point could be confidently removed. Finally, the data set is small by machine learning standards, and could affect the model’s ability to generalize for successful classification. Additionally, the dataset was not cleaned or processed to remove oddities or potentially erroneous data points.

4

Algorithms

We applied two ML algorithms to our sandbox detection task: A Decision Tree Classifier and an Artificial Neural Network. We will see in Sect. 5 that both

268

W. Pearce et al.

models performed well, but the Decision Tree performed best overall. These two algorithms were selected for their simplicity. It is of utmost importance that ML models deployed in offensive tasks are able to operate without much human oversight, particularly when many offensive teams lack data scientists. 4.1

Decision Trees

Decision Trees [3] are a non-parametric supervised learning method that can be easily visualized and interpreted, as seen in Fig. 1. They are able to handle both numerical and categorical data, and do not require data normalization or other data preparation techniques. Potential drawbacks of Decision Trees include their susceptibility to overfitting and their difficulty representing XOR, parity, or multiplexor problems. They are also prone to overfitting in some contexts.

Fig. 1. Graphical depiction of a sandbox-detection Decision Tree Classifier generated using scikit-learn [12]. A Decision Tree learns to classify each input vector using if-then decision rules extracted from direct observations of the data.

Of the machine learning algorithms considered by our team, Decision Trees are closest to current human-driven methods of detection through a series of true/false checks. We trained our Decision tree using the features depicted in Table 1. Training was completed with a 80/20 split. No alterations were made to the raw features, such that the features of the process lists found in Appendix Tables 5 and 6 were: hosts = [(40, 4, 10),(220, 1, 220), ...]

Machine Learning for Offensive Security

269

We found the Decision Tree Classifier to be data efficient, quick to train, and effective at our sandbox detection task. Additionally, team members preferred the Decision Tree due to its implicit explainability. Further details can be found in Sect. 5 of this paper. 4.2

Artificial Neural Network

An Artificial Neural Network (ANN) is a biologically-inspired method of detecting predictable relationships between input samples and their associated training labels [13,14]. ANNs can be difficult to train, but are able to represent complex functions and generalize well to previously unseen input configurations. For sandbox detection, we used a 3 by 3 artificial neural network built with Keras using binary cross-entropy loss. Raw inputs were scaled with min-max, and a sigmoid activation function was used. The model was trained for 500 epochs, as depicted in Fig. 2.

Fig. 2. ANN Mean Squared Error during training. Minimum loss was achieved within a few hundred learning cycles, not surprising given the relatively small size of the training data.

One challenge faced by our ANN was the fact that our dataset was both small and relatively “dirty”. This is not unexpected given our use case, but it does create a challenge for a learning algorithm that usually requires thousands of training examples in order to generalize well. In an ideal scenario, we would have liked to collect a larger dataset, but real-world constraints of extracting

270

W. Pearce et al.

process list information from initial access payloads in production environments made this infeasible.

5

Results

Results from our experiments are shown in Tables 2 and 3. The ANN achieved a classification accuracy of 92.71%. The Decision Tree Classifier was able to improve accuracy by 2.09% over the ANN, obtaining an overall classification accuracy of 94.80%. Table 2. Sandbox detection accuracy for a two-layer network with clamped inputs and sigmoid activation function using binary cross-entropy loss. This model was able to generalize well to unseen data, identifying sandbox environments with high likelihood. ANN

Results

Mean Absolute Error 0.1188 Mean Squared Error 0.0573 Accuracy

92.71%

Table 3. Sandbox detection accuracy for a Decision Tree Classifier trained on the process list data depicted in Table 1. This model was able to outperform the ANN by 2.09%, and was preferred by our team because its decisions were transparent to humans. Decision Tree

Results

True Positive False Positive False Negative True Negative

66 1 3 7

Accuracy

94.80%

Table 4 shows the calculated precision, recall, and F1-score for Safe and Unsafe execution environments respectively, along with macro and weighted averages accounting for the class imbalance in the dataset. Examination of the data reveals that safe hosts had a high F1-score of 0.97, but unsafe hosts had a lower-than-acceptable F1-score of 0.78. This is likely due to the small sample size of unsafe hosts. From an operational security perspective, avoiding unsafe execution is far more important for the longevity of a piece of malware than achieving safe execution.
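For reference, per-class metrics and averages of the kind reported in Table 4 can be produced from test labels and predictions with scikit-learn; the label encoding below (0 = Unsafe, 1 = Safe) is an assumption.

```python
# Hedged sketch: y_true and y_pred are assumed arrays of test labels and
# Decision Tree predictions, encoded 0 = Unsafe (sandbox), 1 = Safe.
from sklearn.metrics import classification_report

def report_metrics(y_true, y_pred):
    # Prints precision, recall, F1-score and support per class, plus the
    # macro and weighted averages.
    print(classification_report(y_true, y_pred,
                                target_names=["Unsafe", "Safe"]))
```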


Table 4. Decision Tree Metrics. Precision is the ratio of true positives to total predicted positives. Recall is the ratio of true positives to total actual positives. The F1-score is defined as 2 · (precision · recall)/(precision + recall). A “Safe” label indicates a non-sandbox production environment in which malicious code can be safely deployed. An “Unsafe” label indicates a sandbox environment. Support indicates the number of samples contributing to the calculations.

Metrics | Precision | Recall | F1-score | Support
Safe | 0.96 | 0.99 | 0.97 | 67
Unsafe | 0.88 | 0.70 | 0.78 | 10
Macro Average | 0.92 | 0.84 | 0.87 | 77
Weighted Average | 0.95 | 0.95 | 0.95 | 77

6 Discussion and Future Work

There are several operational factors that must be considered before deploying models to a production setting.

1. Attribution - If machine-learning models were embedded into malicious documents, these files would be easy to attribute to a particular group, as this technique is not well known or widespread.
2. Sandbox drift - Overnight, all sandboxes could change, making all models and data irrelevant. The difference between Windows 10 and Windows 7 from a process-count standpoint is significant. Or worse, sandboxes could change slowly, leading to inconsistent predictions.
3. Adversarial inputs - Depending on how payloads are subsequently staged, an analyst could submit false inputs to the dropper server, gaining access to payloads.
4. NLP techniques - Tokenizing a process list rather than using a regex could be a more robust way of parsing process lists, as malware deployment methods will change. Additionally, NLP techniques such as document classification become possible.
5. Data Collection - A separate data collection effort would be ideal, such that production payloads would be separate from collection payloads. This would allow data collection efforts to support offensive operations without interfering with production deployments. Additionally, once access to a network has been gained, any host in the network becomes a potential phishing target. Therefore, any user workstation process list gathered would be a legitimate data point even if it did not come from the initial access payload.

7 Further Research

Further research in this area should focus on the collection of a larger and more balanced dataset in order to improve classification accuracy. We would also like to explore methods for client-side classification. Because a trained model consists of static weights, it could be embedded into phishing payloads. This strategy, among others, should be examined for soundness and practicality. Machine learning for offensive operations has typically been confined to a priori vulnerability analysis, such as detecting specific server software or identifying outdated web sites. Our research demonstrates that machine learning can also be useful in situ. Even with limited training data, ML enables the detection of sandbox environments with high accuracy, thus improving the likelihood that malicious payloads will remain hidden. Going forward, we hope to see more applications of machine learning for such purposes.

8 Conclusion

In this paper we have explored the use of machine learning for in situ offensive operations, and have shown that both a Decision Tree Classifier and an Artificial Neural Network are able to detect safe environments with high accuracy and with a strong F1-score, even given limited data. We have outlined several operational considerations that will affect the use of this technology in production settings, and have suggested promising avenues for future exploration. As part of this research, a tool called ‘Deep-Drop’ was developed as a machine-learning enabled dropper server. Deep-Drop contains all code and data mentioned in this paper.3

Appendix

Table 5. Safe Process List

PID | PPID | ARCH | SESS | NAME | OWNER
0 | 0 | | | System Process |
4 | 0 | | | System |
456 | 4 | | | smss.exe |
596 | 536 | | | csrss.exe |
784 | 536 | | | wininit.exe |
792 | 776 | | | csrss.exe |
864 | 784 | | | services.exe |
872 | 784 | | | lsass.exe |
1020 | 864 | | | svchost.exe |

3 https://github.com/moohax/Deep-Drop.


Table 5. (continued)

PID | PPID | ARCH | SESS | NAME | OWNER
416 | 864 | | | svchost.exe |
496 | 784 | | | fontdrvhost.exe |
860 | 864 | | | svchost.exe |
1032 | 864 | | | svchost.exe |
1116 | 776 | | | winlogon.exe |
1172 | 1116 | | | fontdrvhost.exe |
1268 | 864 | | | svchost.exe |
1316 | 864 | | | svchost.exe |
1324 | 864 | | | svchost.exe |
1332 | 864 | | | svchost.exe |
1484 | 864 | | | svchost.exe |
1496 | 864 | | | svchost.exe |
1504 | 864 | | | svchost.exe |
1588 | 864 | | | svchost.exe |
1660 | 864 | | | svchost.exe |
1712 | 864 | | | svchost.exe |
1732 | 1116 | | | dwm.exe |
1788 | 864 | | | svchost.exe |
1900 | 864 | | | svchost.exe |
1912 | 864 | | | svchost.exe |
2032 | 864 | | | svchost.exe |
1168 | 864 | | | svchost.exe |
2072 | 864 | | | nvvsvc.exe |
2080 | 864 | | | nvscpapisvr.exe |
2180 | 864 | | | svchost.exe |
2216 | 2072 | | | nvxdsync.exe |
2296 | 864 | | | svchost.exe |
2328 | 864 | | | svchost.exe |
2336 | 864 | | | svchost.exe |
2364 | 864 | | | svchost.exe |
2384 | 864 | | | svchost.exe |
2460 | 864 | | | svchost.exe |
2512 | 864 | | | svchost.exe |
2540 | 864 | | | svchost.exe |
2552 | 1268 | | | WUDFHost.exe |
2580 | 1168 | | | dasHost.exe |
2676 | 864 | | | svchost.exe |
2684 | 864 | | | svchost.exe |
2732 | 864 | | | svchost.exe |
2940 | 864 | | | svchost.exe |
3004 | 864 | | | svchost.exe |
2472 | 864 | | | svchost.exe |
3068 | 864 | | | igfxCUIService.exe |
3092 | 864 | | | svchost.exe |
3180 | 864 | | | svchost.exe |
3192 | 864 | | | svchost.exe |
3544 | 1268 | | | WUDFHost.exe |
3656 | 864 | | | svchost.exe |
3820 | 864 | | | svchost.exe |
3876 | 864 | | | RtkAudioService64.exe |
3888 | 864 | | | SavService.exe |
4000 | 864 | | | svchost.exe |
3908 | 864 | | | SearchIndexer.exe |
4124 | 3876 | | | RAVBg64.exe |
4148 | 864 | | | svchost.exe |
4156 | 864 | | | svchost.exe |
4164 | 864 | | | PulseSecureService.exe |
4200 | 864 | | | svchost.exe |
4308 | 864 | | | svchost.exe |
4428 | 864 | | | svchost.exe |
4532 | 864 | | | svchost.exe |
4620 | 864 | | | SCFManager.exe |
4636 | 864 | | | spoolsv.exe |
4892 | 864 | | | SACSRV.exe |
4952 | 864 | | | SCFService.exe |
5160 | 864 | | | mDNSResponder.exe |
5172 | 864 | | | armsvc.exe |
5184 | 864 | | | OfficeClickToRun.exe |
5196 | 864 | | | svchost.exe |
5236 | 864 | | | AppleMobileDeviceService.exe |
5260 | 864 | | | svchost.exe |
5268 | 864 | | | IntelCpHDCPSvc.exe |
5276 | 864 | | | AdminService.exe |
5284 | 864 | | | LogiRegistryService.exe |
5324 | 864 | | | svchost.exe |
5340 | 864 | | | esif uf.exe |
5352 | 864 | | | svchost.exe |
5360 | 864 | | | FoxitConnectedPDFService.exe |
5400 | 864 | | | ALsvc.exe |
5412 | 864 | | | svchost.exe |
5420 | 864 | | | swc service.exe |
5448 | 864 | | | ManagementAgentNT.exe |
5480 | 864 | | | SecurityHealthService.exe |
5492 | 864 | | | RouterNT.exe |
5600 | 864 | | | SAVAdminService.exe |
5648 | 864 | | | swi service.exe |
5692 | 864 | | | SntpService.exe |
5708 | 864 | | | svchost.exe |
5720 | 864 | | | sqlwriter.exe |
5732 | 864 | | | ssp.exe |
5760 | 864 | | | swi filter.exe |
5776 | 864 | | | TBear.Maintenance.exe |
5788 | 864 | | | WavesSysSvc64.exe |
5796 | 864 | | | TeamViewer Service.exe |
5804 | 864 | | | svchost.exe |
5812 | 864 | | | svchost.exe |
5832 | 864 | | | svchost.exe |
5944 | 864 | | | svchost.exe |
5952 | 864 | | | svchost.exe |
6132 | 5760 | | | swi fc.exe |
6432 | 4 | | | Memory Compression |
6716 | 864 | | | svchost.exe |
7152 | 864 | | | IntelCpHeciSvc.exe |
7220 | 4164 | | | PulseSecureService.exe |
8764 | 864 | | | sdcservice.exe |
9076 | 864 | | | svchost.exe |
5316 | 864 | | | svchost.exe |
9084 | 864 | | | csia.exe |
3636 | 864 | | | svchost.exe |

Table 5. (continued)

Safe Process List PID

PPID ARCH SESS NAME

3148 4688 5728 8020 7968 6448 7488 1412 6212 3848 1048 9236 9532 9768 10132 10192 10244 11596 9996 10744 12316 12464 12624 13176 13208 13224 13236 2776 780 1544 13332 15048 15224 15332 14540 800 14648

864 5340 2032 864 864 1660 1660 1660 864 864 5884 3520 416 416 1900 10132 864 10396 11596 11596 416 864 416 11596 11596 11596 11596 11596 11596 11596 11596 1048 1048 1048 1048 1048 1048

x64 x64 x64 x64 x64 x64 x64

1 1 1 1 1 1 1

x64 x64 x64 x64 x64 x86

1 1 1 1 1 1

x64 x64 x64 x64

1 1 1 1

x64 x64 x64 x64 x64 x64 x64 x64 x64 x64 x64 x64 x64 x64

1 1 1 1 1 1 1 1 1 1 1 1 1 1

svchost.exe esif assist 64.exe sihost.exe svchost.exe svchost.exe itype.exe ipoint.exe taskhostw.exe svchost.exe PresentationFontCache.exe explorer.exe igfxEM.exe ShellExperienceHost.exe RuntimeBroker.exe TabTip.exe TabTip32.exe svchost.exe chrome.exe chrome.exe chrome.exe SystemSettingsBroker.exe svchost.exe WmiPrvSE.exe chrome.exe chrome.exe chrome.exe chrome.exe chrome.exe chrome.exe chrome.exe chrome.exe MSASCuiL.exe RtkNGUI64.exe RAVBg64.exe WavesSvc64.exe SACMonitor.exe LCore.exe

OWNER CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ CORP\

CORP\ CORP\ CORP\ CORP\

CORP\ CORP\ CORP\ CORP\

CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ (continued)


Table 5. (continued) Safe Process List PID

PPID ARCH SESS NAME

OWNER

13152 15444 16144 9116 15840 16316 16124 15980 15928 14760 15412 13648 9932 13656 14700 14416 14428 15308 9404 12400 11324 16096 11140 2964 13392 9844 14880 2980 1152 8964 15488 16836 4616 9416 7336 11724

1048 1048 1048 1048 10652 864 10652 10652 1660 416 864 1048 13648 13648 1048 13648 416 14428 864 416 864 864 416 11596 1048 13392 864 13648 11596 416 11596 864 1268 416 416 864

CORP\ CORP\ CORP\ CORP\ CORP\

x64 x64 x64 x86 x86

1 1 1 1 1

x86 x86 x64 x64

1 1 1 1

x64 x64 x64 x86 x64 x64 x86 x64 x64

1 1 1 1 1 1 1 1 1

x64 x86 x64

1 1 1

x64 x64 x64 x64

1 1 1 1

x64 x64

1 1

RtkUGui64.exe iTunesHelper.exe DellSystemDetect.exe ONENOTEM.EXE Pulse.exe iPodService.exe ALMon.exe jusched.exe RAVBg64.exe unsecapp.exe svchost.exe Slack.exe Slack.exe Slack.exe OUTLOOK.EXE Slack.exe iexplore.exe iexplore.exe svchost.exe dllhost.exe svchost.exe svchost.exe dllhost.exe chrome.exe HprSnap6.exe TsHelper64.exe svchost.exe Slack.exe chrome.exe CertEnrollCtrl.exe chrome.exe svchost.exe WUDFHost.exe InstallAgent.exe InstallAgentUserBroker.exe svchost.exe

CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ CORP\

CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ CORP\

CORP \ CORP \ (continued)

Table 5. (continued)

Safe Process List PID

PPID ARCH SESS NAME

17420 17364 15372 10752 5672 18392 476 11656 6672 13044 9920 12768 1968 10004 13716 17216 14400 12348 10496 16428 1632 16156 16540 13948 11752 16568 11200 5684 16856 6416

864 11596 11596 864 11596 416 416 11596 11596 416 416 11596 11596 11596 11596 11596 11596 11596 1116 416 416 416 416 416 416 864 3592 3820 1660 16856

x64 x64

1 1

x64 x64 x64 x64 x64 x86 x64 x64 x64 x64 x64 x64 x64 x64

1 1 1 1 1 1 1 1 1 1 1 1 1 1

x64 x64 x64 x64 x64 x64

1 1 1 1 1 1

x64 x64 x64

0 1 1

svchost.exe chrome.exe chrome.exe svchost.exe chrome.exe Microsoft.StickyNotes.exe SkypeHost.exe chrome.exe chrome.exe OneDrive.exe SearchUI.exe chrome.exe chrome.exe chrome.exe chrome.exe chrome.exe chrome.exe chrome.exe LogonUI.exe LockAppHost.exe LockApp.exe ApplicationFrameHost.exe SystemSettings.exe Calculator.exe Microsoft.Photos.exe svchost.exe .exe audiodg.exe .exe conhost.exe

OWNER CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ CORP\ CORP\

CORP\ CORP\


Table 6. Unsafe Process List

PID | PPID | ARCH | SESS | NAME | OWNER
0 | 0 | | | System Process |
4 | 0 | x86 | 0 | System |
244 | 4 | x86 | 0 | smss.exe | NT AUTHORITY\SYSTEM
352 | 304 | x86 | 0 | csrss.exe | NT AUTHORITY\SYSTEM
400 | 304 | x86 | 0 | wininit.exe | NT AUTHORITY\SYSTEM
408 | 392 | x86 | 1 | csrss.exe | NT AUTHORITY\SYSTEM
440 | 392 | x86 | 1 | winlogon.exe | NT AUTHORITY\SYSTEM
504 | 400 | x86 | 0 | services.exe | NT AUTHORITY\SYSTEM
512 | 400 | x86 | 0 | lsass.exe | NT AUTHORITY\SYSTEM
520 | 400 | x86 | 0 | lsm.exe | NT AUTHORITY\SYSTEM
628 | 504 | x86 | 0 | svchost.exe | NT AUTHORITY\SYSTEM
696 | 504 | x86 | 0 | svchost.exe | NT AUTHORITY\NETWORK SERVICE
744 | 504 | x86 | 0 | svchost.exe | NT AUTHORITY\LOCAL SERVICE
864 | 504 | x86 | 0 | svchost.exe | NT AUTHORITY\SYSTEM
932 | 504 | x86 | 0 | svchost.exe | NT AUTHORITY\LOCAL SERVICE
972 | 504 | x86 | 0 | svchost.exe | NT AUTHORITY\SYSTEM
1140 | 504 | x86 | 0 | svchost.exe | NT AUTHORITY\NETWORK SERVICE
1344 | 504 | x86 | 0 | spoolsv.exe | NT AUTHORITY\SYSTEM
1380 | 504 | x86 | 1 | taskhost.exe | bea-chi-t-7pr01\John Doe
1408 | 504 | x86 | 0 | svchost.exe | NT AUTHORITY\LOCAL SERVICE
1512 | 972 | x86 | 1 | taskeng.exe | bea-chi-t-7pr01\John Doe
1548 | 504 | x86 | 0 | mfemms.exe | NT AUTHORITY\SYSTEM
1636 | 1548 | x86 | 0 | mfevtps.exe | NT AUTHORITY\SYSTEM
1696 | 1512 | x86 | 1 | cmd.exe | bea-chi-t-7pr01\John Doe
1708 | 1548 | x86 | 0 | mfehcs.exe | NT AUTHORITY\SYSTEM
1940 | 504 | x86 | 0 | sppsvc.exe | NT AUTHORITY\NETWORK SERVICE
2016 | 504 | x86 | 0 | svchost.exe | NT AUTHORITY\NETWORK SERVICE
340 | 408 | x86 | 1 | conhost.exe | bea-chi-t-7pr01\John Doe
256 | 1696 | x86 | 1 | cmd.exe | bea-chi-t-7pr01\John Doe
308 | 256 | x86 | 1 | GoatCasper.exe | bea-chi-t-7pr01\John Doe
1780 | 864 | x86 | 1 | dwm.exe | bea-chi-t-7pr01\John Doe
1008 | 1748 | x86 | 1 | explorer.exe | bea-chi-t-7pr01\John Doe
1436 | 1008 | x86 | 1 | jusched.exe | bea-chi-t-7pr01\John Doe
264 | 504 | x86 | 0 | svchost.exe | NT AUTHORITY\LOCAL SERVICE
648 | 504 | x86 | 0 | svchost.exe | NT AUTHORITY\SYSTEM
220 | 504 | x86 | 0 | SearchIndexer | NT AUTHORITY\SYSTEM
2328 | 1008 | x86 | 1 | | bea-chi-t-7pr01\John Doe
2700 | 628 | x86 | 1 | Setup.exe | bea-chi-t-7pr01\John Doe
2240 | 504 | x86 | 0 | msiexec.exe | NT AUTHORITY\SYSTEM
3272 | 2240 | x86 | 1 | msiexec.exe | bea-chi-t-7pr01\John Doe
3056 | 768 | x86 | 0 | MpCmdRun.exe | NT AUTHORITY\NETWORK SERVICE


References

1. Agrawal, H., Alberi, J., Bahler, L., Micallef, J., Virodov, A., Magenheimer, M., Snyder, S., Debroy, V., Wong, E.: Detecting hidden logic bombs in critical infrastructure software. In: 7th International Conference on Information Warfare and Security, ICIW 2012, pp. 1–11 (2012)
2. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
3. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Statistics/Probability Series. Wadsworth Publishing Company, Belmont (1984)
4. Chailytko, A., Skuratovich, S.: Defeating sandbox evasion: how to increase the successful emulation rate in your virtual environment (2017)
5. Chollet, F., et al.: Keras: deep learning for humans (2015). https://github.com/fchollet/keras
6. Esage, A.: Gyoithon: tool to make penetration testing with machine learning (2018). https://www.securitynewspaper.com/2018/06/02/gyoithon-toolmake-penetration-testing-machine-learning/
7. Fox, B.: Eyeballer (2019). https://github.com/bishopfox/eyeballer
8. Hutchins, E.M., Cloppert, M.J., Amin, R.M.: Intelligence-driven computer network defense informed by analysis of adversary campaigns and intrusion kill chains. Lead. Issues Inf. Warfare Secur. Res. 1(1), 80 (2011)
9. Takaesu, I.: Deepexploit (2019). https://github.com/13o-bbr-bbq/machine_learning_security/tree/master/DeepExploit
10. Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., Kavukcuoglu, K.: Asynchronous methods for deep reinforcement learning. In: International Conference on Machine Learning, pp. 1928–1937 (2016)
11. Mourad, H.: Sleeping your way out of the sandbox (2015). https://www.sans.org/reading-room/whitepapers/malicious/sleeping-sandbox-35797
12. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
13. Priddy, K.L., Keller, P.E.: Artificial Neural Networks: An Introduction, vol. 68. SPIE Press, Bellingham (2005)
14. Rosenblatt, F.: The perceptron: a probabilistic model for information storage and organization in the brain. Psychol. Rev. 65(6), 386 (1958)

Time Series Analysis of Financial Statements for Default Modelling

Kirill Romanyuk(B) and Yuri Ichkitidze

National Research University Higher School of Economics, St. Petersburg, Russia
[email protected]

Abstract. Credit rating agencies evaluate corporate risks and assign ratings to companies. Each rating grade corresponds to certain boundaries of default probability. KMV is a popular model to assess the default probability of a company. In this paper, a method to predict the default probability of a company is proposed. This method is based on the main concept of the KMV model; however, financial statements are applied instead of stock prices, i.e. time series of EBIT (earnings before interest and taxes), net debt, sales, and the last-year value of WACC (weighted average cost of capital). Default probabilities for 150 companies are evaluated. Results and limitations are discussed.

Keywords: Probability of default · Financial statements · Credit rating · Time series · Monte Carlo method

1 Introduction

Credit rating agencies play a vital role in modern economies by reducing the informational gap between lenders and borrowers, in order to make the financial system more efficient. More specifically, a credit rating agency needs to evaluate the probability of default, which is important for assessing risk premiums and prices of financial instruments [1]. A broad range of data is applied for default modelling. Standard information is represented by stock and option prices [2,3], accountancy data [4,5], and macroeconomic factors [6,7]. Combinations of these types of data can also be applied [8–10]. Which option is better is debatable [10–12]. Nevertheless, there are attempts to use different sources of information such as linguistic information [13], media reports [14], and even the ways in which monetary policy and corporate governance affect default risk [15,16]. As a result, researchers try to employ various sources of data to achieve additional predictive power in default modelling. KMV is a popular model for corporate default probability estimation, which requires stock prices. The fundamental idea behind the application of stock prices is that such prices reflect the information available to the market [17]. However, many countries do not have well-developed stock markets, and some shares are non-tradable, which undermines the main idea of the KMV model [18]. In this article, a modification of the KMV model is proposed: financial statements are applied instead of stock prices. This generally allows the problems of the KMV model with non-tradable stocks to be avoided.

2 Methods

The proposed method is based on the KMV model, according to which PD is mainly defined by the following elements: asset value (V^a), asset risk, and debt value. A default is a situation in which the estimated value of assets is lower than the default point. Two aspects of the proposed method should be highlighted. Firstly, the value of assets is estimated as the current EBIT value obtained through simulation of sales (S) and the EBIT-to-Sales ratio (1). This eliminates a bias in asset pricing usually caused by overreactions and underreactions, leading to momentum effects, i.e. excess volatility because of returning to the average value. Secondly, the value of net debt is the default point. As a result, PD over t years (p_t) is calculated using formulas (2, 3), i.e. n simulations are handled by application of the Monte Carlo method.

$$V^a_{i,t} = \frac{S_{i,t}\,\hat{R}_{i,t}\,(1+\hat{g}_{i,t})}{WACC - \hat{g}_{i,t}} \qquad (1)$$

$$x_{i,t} = \begin{cases} 0 & \text{if } x_{i,t-1}=0 \,\cap\, V^a_{i,t} \ge ND_{i,t} \\ 1 & \text{if } x_{i,t-1}=1 \,\cup\, V^a_{i,t} < ND_{i,t} \end{cases} \qquad (2)$$

$$p_t = \frac{\sum_{i=1}^{n} x_{i,t}}{n} \qquad (3)$$

Time series analysis is applied for assessing sales (S_t), the rate of return (R_t), the growth rate (g_t), and net debt (ND_t). The assessment of sales is based on the AR(0) model, i.e. the first differences of sales in adjusted prices, with analysis of non-linear dynamics that is conducted by checking whether the growth rate in the last years (m_s) is statistically different from the average value during the whole period (m_l) (4–6).

$$t_s = \frac{\left|\sum_{k=1}^{m_s} \ln(S^r_{1-k}/S^r_{-k})/m_s - \sum_{k=1}^{m_l} \ln(S^r_{1-k}/S^r_{-k})/m_l\right|}{\sqrt{\sigma^2_{m_l}/m_l + \sigma^2_{m_s}/m_s}} \qquad (4)$$

$$c = \begin{cases} \sum_{k=1}^{m_l} \ln(S^r_{1-k}/S^r_{-k})/m_l & \text{if } t_s < t_k \\ \sum_{k=1}^{m_s} \ln(S^r_{1-k}/S^r_{-k})/m_s & \text{if } t_s \ge t_k \end{cases} \qquad (5)$$

$$S_t = \frac{P_t}{P_b}\, S^r_0\, e^{\sum_{k=1}^{t} (c + \epsilon_k)} \qquad (6)$$

The EBIT-to-Sales ratio is a trend-stationary model with the historical average as the trend; the average rate of return obtained from the simulation model over the whole period (m_l) serves as the rate of return estimate (R̂) (7, 8). The estimate of the average growth rate of sales (ĝ) is the average logarithmic growth rate of sales obtained from the simulation model, considering non-linear dynamics by assessing whether the growth rate in the last years (m_s) is statistically different from the average value during the whole period (m_l) (9, 10). In addition, if the sales increase, then the dynamics of net debt is evaluated as a constant cointegration ratio with the moving average value of sales; otherwise, the net debt remains at the same level.

$$\tilde{R}_{t-k} = a_1 \tilde{R}_{t-k-1} + a_0 + \epsilon_{t-k} \qquad (7)$$

$$\hat{R}_t = \frac{\sum_{k=1}^{m_l} \tilde{R}_{t-k}}{m_l} \qquad (8)$$

$$t_s = \frac{\left|\sum_{k=1}^{m_s} \ln(\tilde{S}_{t+1-k}/\tilde{S}_{t-k})/m_s - \sum_{k=1}^{m_l} \ln(\tilde{S}_{t+1-k}/\tilde{S}_{t-k})/m_l\right|}{\sqrt{\sigma^2_{\tilde{S},m_l}/m_l + \sigma^2_{\tilde{S},m_s}/m_s}} \qquad (9)$$

$$\hat{g}_t = \begin{cases} \sum_{k=1}^{m_l} \ln(\tilde{S}_{t+1-k}/\tilde{S}_{t-k})/m_l & \text{if } t_s < t_k \\ \sum_{k=1}^{m_s} \ln(\tilde{S}_{t+1-k}/\tilde{S}_{t-k})/m_s & \text{if } t_s \ge t_k \end{cases} \qquad (10)$$
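A strongly simplified sketch of the Monte Carlo procedure implied by Eqs. (1)–(3) is shown below. The sales simulation is reduced to a single log-growth step with drift c and an assumed volatility sigma, the net debt is held constant, and all inputs (s0, nd, wacc, c, sigma, r_hat, g_hat) are assumed to be estimated per company with the time-series models of Eqs. (4)–(10); the sketch is illustrative rather than a reproduction of the authors' implementation.

```python
import numpy as np

def simulate_pd(s0, nd, wacc, c, sigma, r_hat, g_hat,
                horizon=30, n=10_000, seed=0):
    rng = np.random.default_rng(seed)
    pd_t = np.zeros(horizon)
    for i in range(n):
        sales, defaulted = s0, False
        for t in range(horizon):
            if not defaulted:
                # Simplified Eq. (6): log-growth with drift c and a normal shock
                sales *= np.exp(c + sigma * rng.standard_normal())
                # Eq. (1): asset value from simulated sales and EBIT margin
                v_a = sales * r_hat * (1.0 + g_hat) / (wacc - g_hat)
                # Eq. (2): default is absorbing once V^a falls below net debt
                defaulted = v_a < nd
            pd_t[t] += defaulted
    return pd_t / n  # Eq. (3): default probability over t years
```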

3 Results

The initial data are represented by the 2018 value of WACC (weighted average cost of capital) and time series of financial statements, i.e. fiscal-year values of net debt, sales, and EBIT from 1983 to 2018. The inflation rate over this period was applied to adjust the financial statements. The analysis was conducted on 150 US companies within Moody's B rating group, i.e. the Baa, Ba, and B ratings. The probability of default was computed for these companies, and the average PD among them was calculated. At this point, the predicted and historical PD can be compared for different ratings (Fig. 1 and Fig. 2). The gap between the predicted and historical PD turned out to be significant. However, the shapes of the curves representing the historical PD and the predicted PD are similar after application of a scaling coefficient. Suppose we approximate the predicted PD as a linear function of the historical PD without a constant term; the optimal coefficients are approximately 50% (Baa), 38% (Ba), and 32% (B). It should be noted that the method allows the probability of default to be predicted for up to 30 years, whereas the historical probability of default can only be tracked for up to 20 years in the Moody's report, i.e. "Annual default study: Defaults will rise modestly in 2019 amid higher volatility", published by Moody's Investors Service in 2019.
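The no-intercept coefficient mentioned above has a closed form; the sketch below assumes two arrays holding the historical and predicted PD curves of a rating class, and its variable roles follow the description in the text.

```python
import numpy as np

def slope_through_origin(pd_historical: np.ndarray, pd_predicted: np.ndarray) -> float:
    # beta minimizing sum_t (pd_predicted[t] - beta * pd_historical[t])^2
    return float(pd_historical @ pd_predicted / (pd_historical @ pd_historical))
```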


Fig. 1. Predicted and historical PD for Baa rating

Fig. 2. Predicted and historical PD for Ba rating

4 Conclusions

This paper presents a method to evaluate the corporate probability of default. The proposed method is basically a modification of the KMV model, which applies financial statements instead of stock prices. Time series of financial statements are, therefore, evaluated through the Monte Carlo method. The results show that the lower a company's credit rating, the lower the percentage of PD explained by the model. This is reasonable because, when a credit rating is lower, it becomes harder to find companies with financial histories longer than 15 years. In the meantime, the historical probability of default was calculated based on Moody's statistics for all companies in the given credit rating group. The conditional probability of having companies with at least 15 years of financial history is reduced when the credit rating decreases, so we are dealing with much more stable companies than the average of the given group. The longer the time series of financial statements, the better the predictions that can generally be expected, which leads to limitations of the method. It will technically work if the time series are only five years long; however, it is better to have 15 or more years of data for reliable predictions. Quarterly statements can be used to make the minimum appropriate length of the time series four times shorter, but seasonal fluctuations must then be considered and the prediction horizon also becomes shorter, i.e. 7.5 years instead of 30. This method can be valuable for agents in the financial system because it provides an additional option to assess the probability of default through financial statements instead of stock prices. It can be especially valuable when evaluating companies with non-tradable stocks.

5 Further Research Outlook

Further research can be focused on tuning different elements in the method. For example, the applied default point in the proposed method comes from theoretical intuition. However, different default points can lead to more precise predictions of PD. Machine learning techniques are applied in the KMV model by researchers to discover the optimal default point [19] and for other purposes [20]. Such techniques can be further applied in the proposed method to calculate more precise default probabilities.

References 1. Heynderickx, W., Cariboni, J., Schoutens, W., Smits, B.: The relationship between risk-neutral and actual default probabilities: the credit risk premium. Appl. Econ. 48(42), 4066–4081 (2016) 2. Charitou, A., Dionysiou, D., Lambertides, N., Trigeorgis, L.: Alternative bankruptcy prediction models using option-pricing theory. J. Bank. Finance 37(7), 2329–2341 (2013) 3. Camara, A., Popova, I., Simkins, B.: A comparative study of the probability of default for global financial firms. J. Bank. Finance 36(3), 717–732 (2012) 4. Li, L., Faff, R.: Predicting corporate bankruptcy: what matters? Int. Rev. Econ. Finance 62, 1–19 (2019) 5. Altman, E.I., Iwanicz-Drozdowska, M., Laitinen, E.K., Suvas, A.: Financial distress prediction in an international context: a review and empirical analysis of Altman’s Z-score model. J. Int. Financial Manag. Account. 28(2), 131–171 (2017) 6. Xing, K., Yang, X.: Predicting default rates by capturing critical transitions in the macroeconomic system. Finance Res. Lett. (2019, in press)


7. Figlewski, S., Frydman, H., Liang, W.: Modeling the effect of macroeconomic factors on corporate default and credit rating transitions. Int. Rev. Econ. Finance 21(1), 87–105 (2012) 8. Tinoco, M.H., Wilson, N.: Financial distress and bankruptcy prediction among listed companies using accounting, market and macroeconomic variables. Int. Rev. Financial Anal. 30, 394–419 (2013) 9. Bellalah, M., Zouari, S., Levyne, O.: The performance of hybrid models in the assessment of default risk. Econ. Model. 52, 259–265 (2016) 10. Li, M.Y.L., Miu, P.: A hybrid bankruptcy prediction model with dynamic loadings on accounting-ratio-based and market-based information: a binary quantile regression approach. J. Empir. Finance 17(4), 818–833 (2010) 11. Hillegeist, S.A., Keating, E.K., Cram, D.P., Lundstedt, K.G.: Assessing the probability of bankruptcy. Rev. Account. Stud. 9(1), 5–34 (2004) 12. Sanjiv, R.D., Hanouna, P., Sarin, A.: Accounting-based versus market-based crosssectional models of CDS spreads. J. Bank. Finance 33(4), 719–730 (2009) 13. Lu, Y.C., Shen, C.H., Wei, Y.C.: Revisiting early warning signals of corporate credit default using linguistic analysis. Pac.-Basin Finance J. 24, 1–21 (2013) 14. Lu, Y.C., Wei, Y.C., Chang, T.Y.: The effects and applicability of financial media reports on corporate default ratings. Int. Rev. Econ. Finance 36, 69–87 (2015) 15. Malovana, S., Kolcunova, D., Broz, V.: Does monetary policy influence banks’ risk weights under the internal ratings-based approach? Econ. Syst. 43(2), 100689 (2019) 16. Ali, S., Liu, B., Su, J.J.: Does corporate governance quality affect default risk? The role of growth opportunities and stock liquidity. Int. Rev. Econ. Finance 58, 422–448 (2018) 17. Jovan, M., Ahcan, A.: Default prediction with the Merton-type structural model based on the NIG Levy process. J. Comput. Appl. Math. 311, 414–422 (2017) 18. Zhang, Y., Shi, B.: Non-tradable shares pricing and optimal default point based on hybrid KMV models: evidence from China. Knowl.-Based Syst. 110, 202–209 (2016) 19. Lee, W.C.: Redefinition of the KMV models optimal default point based on genetic algorithms evidence from Taiwan. Expert Syst. Appl. 38, 10107–10113 (2011) 20. Yeh, C.C., Lin, F., Hsu, C.Y.: A hybrid KMV model, random forests and rough set theory approach for credit rating. Knowl.-Based Syst. 33, 166–172 (2012)

Fraud Detection Using Sequential Patterns from Credit Card Operations

Addisson Salazar(B), Gonzalo Safont, and Luis Vergara

Institute of Telecommunications and Multimedia Applications, Universitat Politècnica de València, Valencia, Spain
{asalazar,lvergara}@dcom.upv.es, [email protected]

Abstract. This paper presents a novel method for detection of frauds that uses the differences in temporal dependence (sequential patterns) between valid and nonlegitimate credit card operations to increase the detection performance. A two-level fusion is proposed from the results of single classifiers. The first fusion is made in low-dimension feature spaces from the card operation record and the second fusion is made to combine the results obtained in each of the low-dimension spaces. It is assumed that sequential patterns are better highlighted in low-dimension feature spaces than in the high-dimension space of all the features of the card operation record. The single classifiers implemented were linear and quadratic discriminant analyses, classification tree, and naive Bayes. Alpha integration was applied to make an optimal combination of the single classifiers. The proposed method was evaluated using a real dataset with a great disproportion between nonlegitimate and valid operations. The results were evaluated using the area under the receiver operating characteristic (ROC) curve of each of the single and fused results. We demonstrated that the proposed two-level fusion combining several low-dimension feature analyses outperforms the conventional analysis using the full set of features. Keywords: Pattern recognition · Credit card fraud detection · Sequential patterns · Optimal decision fusion

1 Introduction

Nowadays, the increase of technological resources has empowered the capabilities of fraudsters to attack bank credit card operations. This is an important economic and operational problem for financial firms. It has been approached from different perspectives of machine learning, including single [1–7] and combined classification methods [8–13]. Another important issue is the lack of publicly available databases, due to confidentiality, which inhibits direct comparisons between approaches [14]. This work focuses on finding sequential patterns in credit card operations that allow the valid ones to be discerned from the non-legitimate (fraud) ones. Most previous works on dynamic features and patterns have considered sophisticated statistical models [15–17], prior knowledge [18], or transforms such as wavelet analysis [19, 20]. The analysis of credit card operations is ill-suited for these methods, given the extremely large number of credit cards, the small number of operations per card, and the relatively high dimensionality. Considering that the number of operations of a credit card is limited in a short time period, we should focus the analysis on low-dimension feature spaces. This allows us to implement methods that prevent overfitting and whose results can be more easily interpreted [21]. We search for features extracted from the credit card data that show the dynamics of sequential patterns from users. The proposed method includes a first stage of processing in low-dimension feature spaces using the following classifiers: linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), classification tree (Tree), and naive Bayes (NB). The second stage is a combination of the results from each individual classifier for every low-dimension feature space, and the third stage consists of combining the results of the second stage to obtain a global classification result. The combination of the results is made using Alpha Integration, which provides a set of weights to fuse the posterior probabilities of the individual classifiers. This method has recently been developed to optimize the linear combination of a posteriori probabilities assuming minimum error probability or least mean squares [22–26]. The rest of the paper is organized as follows. The proposed method and the results obtained from a real dataset are explained in Sect. 2 and Sect. 3, respectively. Section 4 contains the conclusions and future lines of research derived from this work.

2 Proposed Method

A pipeline diagram of the different stages of the proposed method is shown in Fig. 1. In this work, the considered low-dimension (2D) feature spaces were defined by choosing card operation variables together with bank professionals. Although it falls outside the scope of this work, future work will consider defining the low-dimension subspaces using methods such as hierarchical clustering [27–30], feature selection [31], and knowledge discovery methods [32, 33]. We hypothesize that the temporal sequential patterns of bank credit card use usually differ between fraudsters and normal users. Examples of suspicious behaviors are the following: rapid use of the card in successive operations with increasing amounts, and use of the card in widely separated locations within a short period of time. Different low-dimension feature spaces were defined using six selected variables, including operation speed and location shifts. A total of 1,479,852 operations of a global bank firm were processed. Those operations came from 133,500 credit cards, of which 0.22% were fraudulent. Some instances of the dynamic behavior of card operations in one particular low-dimension feature space are shown in Fig. 2. Temporal sequential patterns are depicted by directed graphs (blue for valid operations and red for non-legitimate operations). Figure 2(a) shows the differences between the two patterns more clearly.


Fig. 1. A pipeline diagram of the different stages of the method: definition of the low-dimension feature spaces (LFS 1 ... LFS N); feature extraction and classification (LDA, QDA, Tree, NB) in each LFS; classifier combination producing a combined result per LFS; and decision fusion of the combined results into a global classification result.


Fig. 2. Some instances of dynamic behavior of card operations in one particular low- dimension feature space. Each arrow represents the change from one operation (the beginning of the arrow) to the next (the head of the arrow). Red arrows point to fraud operations and blue arrows point to legitimate operations. Given the wide diversity of strategies adopted by fraudsters, the dynamics of fraud operations in a particular low-dimension subspace can become obvious (a) or remain subtle (b).


2.1 Feature Extraction

The definitions of the features extracted from the credit card data records are shown in Table 1, where $x_1(t)$, $x_2(t)$, $t = 1 \ldots n$ correspond to the data of variables 1 and 2 of the 2D low-dimension subspace, respectively. The features are extracted per card.

Table 1. Features extracted in low-dimension spaces (2D).

Feature | Definition
Axial mean | $\mu_1 = \sum_{t=1}^{n} x_1(t)/n$, $\mu_2 = \sum_{t=1}^{n} x_2(t)/n$
Axial variance | $var_1 = \sum_{t=1}^{n} (x_1(t) - \mu_1)^2$, $var_2 = \sum_{t=1}^{n} (x_2(t) - \mu_2)^2$
Mean axial velocity | $\bar{v}_1 = \sum_{t=1}^{n-1} v_1(t)/(n-1)$, where $v_1(t) = x_1(t+1) - x_1(t)$; $\bar{v}_2 = \sum_{t=1}^{n-1} v_2(t)/(n-1)$, where $v_2(t) = x_2(t+1) - x_2(t)$
Velocity sign changes | $vsc_1 = \sum_{t=1}^{n-2} \left[\operatorname{sign}(v_1(t)) \neq \operatorname{sign}(v_1(t-1))\right]$; $vsc_2 = \sum_{t=1}^{n-2} \left[\operatorname{sign}(v_2(t)) \neq \operatorname{sign}(v_2(t-1))\right]$
Mean total velocity | $mtv = \sqrt{\bar{v}_1^2 + \bar{v}_2^2}$
Rotation angle | $ra = \operatorname{atan}(\bar{v}_1, \bar{v}_2)$
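A sketch of how the Table 1 features could be computed for a single card in one 2D subspace is given below; the arrays x1 and x2 (the card's operation values for the two chosen variables, in temporal order) are assumed inputs, and the sketch follows the definitions above rather than any released implementation.

```python
import numpy as np

def card_features(x1: np.ndarray, x2: np.ndarray) -> np.ndarray:
    v1, v2 = np.diff(x1), np.diff(x2)                              # axial velocities
    mu1, mu2 = x1.mean(), x2.mean()                                # axial means
    var1, var2 = ((x1 - mu1) ** 2).sum(), ((x2 - mu2) ** 2).sum()  # axial variances
    mv1, mv2 = v1.mean(), v2.mean()                                # mean axial velocities
    vsc1 = np.sum(np.sign(v1[1:]) != np.sign(v1[:-1]))             # velocity sign changes
    vsc2 = np.sum(np.sign(v2[1:]) != np.sign(v2[:-1]))
    mtv = np.hypot(mv1, mv2)                                       # mean total velocity
    ra = np.arctan2(mv1, mv2)                                      # rotation angle
    return np.array([mu1, mu2, var1, var2, mv1, mv2, vsc1, vsc2, mtv, ra])
```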

2.2 Alpha Integration Information Fusion Technique

Let us consider the 2-class classifier scenario (detection). A detector provides a normalized score which is a measure of the probability of class 1 (i.e., 1 minus the probability of class 2). Assuming a given number of detectors, the individual scores are to be fused to obtain a single score. Most fusion methods use some predefined operator to combine the scores provided by every single detector. The mean, the median, the maximum or the minimum are typical examples. These simple operators do not take into account either the separate performance of every detector or the statistical dependence among the whole set of classifiers. The so-called alpha integration (see [22, 23]) allows an optimized combination of the individual scores. It was adapted to the fusion of detectors in [24] and to the multi-class problem in [25]. Let us consider that each detector provides a score $s_i$. Alpha integration linearly combines a non-linear transformation of the scores $\mathbf{s} = [s_1 \ldots s_D]^T$, as indicated by Eq. (1):

$$s_\alpha(\mathbf{s}) = \begin{cases} \left(\sum_{i=1}^{D} w_i\, s_i^{\frac{1-\alpha}{2}}\right)^{\frac{2}{1-\alpha}}, & \alpha \neq 1 \\ \exp\left(\sum_{i=1}^{D} w_i \log(s_i)\right), & \alpha = 1 \end{cases} \qquad w_i \ge 0,\; \sum_{i=1}^{D} w_i = 1 \qquad (1)$$

As we can see in (1), the non-linear transformation of every single score is controlled by a unique parameter α, which gives the method its name. On the other hand, the weights $\mathbf{w} = [w_1 \ldots w_D]^T$ are used to linearly combine the transformed scores. Notice that $s_\alpha(\mathbf{s})$ is a normalized score. Also notice that for the specific values α = −1, α = ∞, and α = −∞ we respectively obtain the mean, the minimum and the maximum operators. By optimizing the parameters α and $\mathbf{w}$ from training data, we can implement an optimal fusion of scores. Different optimization criteria were proposed in [24].

3 Experimental Results

The dataset described in Sect. 2 was divided into three subsets: training (50%), validation (25%), and testing (25%). A total of 100 Monte Carlo experiment iterations was performed, randomly changing the records assigned to the training, validation, and testing sets. All the operations of the same card were always assigned to the same subset.
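One such Monte Carlo split that keeps all operations of a card in the same subset could be produced with scikit-learn's GroupShuffleSplit, as sketched below; the array card_ids (the card identifier of each operation) is an assumed input, and splitting at the card level is an interpretation of the 50/25/25 scheme described above.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def split_by_card(n_ops: int, card_ids: np.ndarray, seed: int):
    idx = np.arange(n_ops)
    # Assign 50% of the cards (and all their operations) to training.
    gss1 = GroupShuffleSplit(n_splits=1, train_size=0.5, random_state=seed)
    train_idx, rest_idx = next(gss1.split(idx, groups=card_ids))
    # Split the remaining cards evenly into validation and test (25% each).
    gss2 = GroupShuffleSplit(n_splits=1, train_size=0.5, random_state=seed)
    val_rel, test_rel = next(gss2.split(rest_idx, groups=card_ids[rest_idx]))
    return train_idx, rest_idx[val_rel], rest_idx[test_rel]
```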

Fig. 3. ROC curves for several low dimension feature spaces.

The parameters of the classifiers were set as follows: (i) LDA and QDA: the inversion of the covariance matrix was implemented using the pseudo-inverse, and the priors were estimated empirically; (ii) Classification Tree: minimum number of leaf-node observations = 1, no maximum number of decision splits, and Gini's diversity index as the split criterion; (iii) Naive Bayes: each variable was assumed to follow a normal (Gaussian) distribution.

Figure 3 shows the average results obtained. The standard deviation over the Monte Carlo iterations is not shown since it was very small: below 1% in all cases, with a maximum of 0.16%. Notice that the global classification results outperform the ones obtained in each of the low-dimension feature spaces in the range of false alarm rates from 0% to 10%.

A comparison of the distribution of scores provided by every single classifier and the fused result for a particular 2D low-dimension feature space is shown in Fig. 4. It can be seen that the scores of the fused result are spread over the full range of score values [0, 1], whereas the single classifier results are concentrated at extreme score values.

Fig. 4. Comparison of the distribution of scores provided by every single classifier and the fused result for a particular 2D low dimension feature space.


Fig. 5. Comparison of results (ROC curves) for all the classification methods analyzed.

In addition, the concentration of false alarms observed for QDA and NB is eliminated in the fused result. Besides, the variance of the fused results over the Monte Carlo iterations, in terms of classification accuracy and area under the curve (AUC), was lower than that of the single classifier results. Figure 5 shows the following results: the global classification for the low-dimension feature spaces, the individual classifiers, and the classifier combination using the full set of features. For the single classifiers, a pre-processing step of principal component analysis (PCA) was applied in order to reduce the dimension of the problem and filter noise; 99.99% of the explained variance was retained. The figure demonstrates that the sequential patterns of interest are best discerned in low-dimension spaces of features defined to measure the dynamics of the credit card operations. The AUCs for the ROC curves of Fig. 3 and Fig. 5 are shown in Table 2. Besides the improvement in AUC, the standard deviation of the proposed 2D low-dimension subspace fusion method (0.07%) is much lower than that of the whole-feature-space fusion result (0.80%). All the single classifiers, except LDA (0.60%), are above 1%, with the classification tree showing the maximum standard deviation.
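The PCA pre-processing step mentioned above can be expressed as in the sketch below; retaining 99.99% of the explained variance is taken from the text, while the use of scikit-learn for this step is an assumption.

```python
from sklearn.decomposition import PCA

def reduce_features(X_train, X_test):
    # Keep the number of components that explains 99.99% of the variance.
    pca = PCA(n_components=0.9999, svd_solver="full")
    return pca.fit_transform(X_train), pca.transform(X_test)
```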


Table 2. Results of AUC estimated for the single classifier and fused results using the full set of features, and the overall fused result obtained by the proposed two-level fusion method.

Method | AUC average (%) | AUC std (%)
LDA | 23.34 | 0.60
QDA | 23.28 | 1.02
Tree | 45.89 | 1.80
NaiveBayes | 22.71 | 1.60
Whole space fusion | 47.22 | 0.80
Worst 2D subspace | 10.01 | 0.16
Best 2D subspace | 52.65 | 0.09
Proposed 2D subspace fusion | 59.88 | 0.07

4 Conclusions

In this paper we have presented a new method for fraud detection which uses several low-dimension subspaces and single classifiers to produce multiple classification results. A two-level fusion was proposed to combine all these results: the first fusion is made in low-dimension feature spaces derived from the card operation record, and the second fusion combines the results obtained in each of the low-dimension spaces. An optimal decision fusion was made using the alpha integration algorithm. The results show the improvement in detection performance of the proposed subspace processing compared with processing the whole set of variables without extracting dynamic features to distinguish the sequential behavior patterns of fraudsters from those of legitimate customers. The experimental results demonstrate the improvement achieved by the new method both in the area under the receiver operating characteristic curve and in a reduced variability of classification accuracy.

There is scope for future research. A more extensive set of experiments, as well as theoretical analysis, will provide stronger foundations for the method. Besides, in this work the subspaces of interest were defined with the help of an expert; another future line of work will be the automatic definition of the low-dimension subspaces of interest using hierarchical clustering and knowledge discovery methods.

Acknowledgment. The Generalitat Valenciana supported this work under grant PROMETEO/2019/109.

References 1. Ahmeda, M., Mahmooda, A.N., Islam, R.: A survey of anomaly detection techniques in financial domain. Future Gener. Comput. Syst. 55, 278–288 (2016)


2. Bhattacharyya, S., Jha, S., Tharakunnel, K., Westland, J.C.: Data mining for credit card fraud: a comparative study. Decis. Supp. Syst. 50, 602–613 (2011) 3. Bolton, R.J., Han, D.J.: Statistical fraud detection: a review. Stat. Sci. 17(3), 235–255 (2002) 4. Panigrahi, S., Kundu, A., Sural, S., Majumdar, A.K.: Credit card fraud detection: a fusion approach using Dempster-Shafer theory and Bayesian learning. Inf. Fusion 10, 354–363 (2009) 5. Phua, C., Lee, V., Smith, K., Gayler, R.: A comprehensive survey of data mining-based fraud detection research. Comput. Res. Repos. 1–14 (2010) 6. Raj, S.B.E., Portia, A.A.: Analysis on credit card fraud detection methods. In: IEEE International Conference on Computer, Communication and Electrical Technology – ICCCET 2011, Tamil Nadu (India), pp. 152–156 (2011) 7. Wongchinsri, P., Kuratach, W.: A survey - data mining frameworks in credit card processing. In: 13th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, ECTI-CON 2016, Chiang Mai, Thailand, pp. 1–6 (2016) 8. Salazar, A., Safont, G., Soriano, A., Vergara, L.: Automatic credit card fraud detection based on non-linear signal processing. In: International Carnahan Conference on Security Technology (ICCST), Boston, MA (USA), pp. 207–212 (2012) 9. Salazar, A., Safont, G., Vergara, L.: Surrogate techniques for testing fraud detection algorithms in credit card operations. In: International Carnahan Conference on Security Technology (ICCST), Rome (Italy), pp. 1–6 (2014) 10. Vergara, L., Salazar, A., Belda, J., Safont, G., Moral, S., Iglesias, S.: Signal processing on graphs for improving automatic credit card fraud detection. In: International Carnahan Conference on Security Technology (ICCST), Madrid (Spain), pp. 1–6 (2017) 11. Salazar, A., Safont, G., Rodriguez, A., Vergara, L.: Combination of multiple detectors for credit card fraud detection. In: International Symposium on Signal Processing and Information Technology (ISSPIT), Limassol (Cyprus), pp. 138–143 (2016) 12. Salazar, A., Safont, G., Vergara, L.: Semi-supervised learning for imbalanced classification of credit card transaction. In: International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro (Brazil), pp. 1–7 (2018) 13. Salazar, A., Safont, G., Rodriguez, A., Vergara, L.: New perspectives of pattern recognition for automatic credit card fraud detection. In: Encyclopedia of Information Science and Technology, 4th edn., pp. 4937–4950, IGI Global (2018) 14. Dal Pozzolo, A., Caelen, O., Le Borgne, Y.-A., Waterschoot, S., Bontempi, G.: Learned lessons in credit card fraud detection from a practitioner perspective. Expert Syst. Appl. 41, 4915–4928 (2014) 15. Safont, G., Salazar, A., Rodriguez, A., Vergara, L.: On recovering missing ground penetrating radar traces by statistical interpolation methods. Remote Sens. 6(8), 7546–7565 (2014) 16. Safont, G., Salazar, A., Vergara, L., Gomez, E., Villanueva, V.: Probabilistic distance for mixtures of independent component analyzers. IEEE Trans. Neural Netw. Learn. Syst. 29(4), 1161–1173 (2018) 17. Safont, G., Salazar, A., Vergara, L., Gómez, E., Villanueva, V.: Multichannel dynamic modeling of non-Gaussian mixtures. Pattern Recogn. 93, 312–323 (2019) 18. Llinares, R., Igual, J., Salazar, A., Camacho, A.: Semi-blind source extraction of atrial activity by combining statistical and spectral features. Digit. Sig. Process.: Rev. J. 21(2), 391–403 (2011) 19. 
Abry, P., Didier, G.: Wavelet estimation for operator fractional Brownian motion. Bernoulli 24(2), 895–928 (2018) 20. Wendt, H., Abry, P., Jaffard, S.: Bootstrap for empirical multifractal analysis. IEEE Sig. Process. Mag. 24(4), 38–48 (2007)


21. Dubuisson, S.: Tracking with Particle Filter for High-Dimensional Observation and State Spaces. Digital Signal and Image Processing Series. Wiley, Hoboken (2015) 22. Amari, S.: Integration of stochastic models by minimizing α-divergence. Neural Comput. 19, 2780–2796 (2007) 23. Choi, H., Choi, S., Choe, Y.: Parameter learning for alpha integration. Neural Comput. 25(6), 1585–1604 (2013) 24. Soriano, A., Vergara, L., Bouziane, A., Salazar, A.: Fusion of scores in a detection context based on alpha-integration. Neural Comput. 27, 1983–2010 (2015) 25. Safont, G., Salazar, A., Vergara, L.: Multiclass alpha integration of scores from multiple classifiers. Neural Comput. 31(4), 806–825 (2019) 26. Amari, S.: Information Geometry and Its Applications. Springer (2016) 27. Igual, J., Salazar, A., Safont, G., Vergara, L.: Semi-supervised Bayesian classification of materials with impact-echo signals. Sensors 15(5), 11528–11550 (2015) 28. Salazar, A., Igual, J., Vergara, L., Serrano, A.: Learning hierarchies from ICA mixtures. In: IEEE International Joint Conference on Artificial Neural Networks, Orlando, FL (USA), pp. 2271–2276 (2007) 29. Salazar, A., Igual, J., Safont, G., Vergara, L., Vidal, A.: Image applications of agglomerative clustering using mixtures of non-Gaussian distributions. In: International Conference on Computational Science and Computational Intelligence (CSCI), Las Vegas, NV (USA), pp. 459–463 (2015) 30. Lesot, M.-J., d’Allonnes, A.R.: Credit-card fraud profiling using a hybrid incremental clustering methodology. In: International Conference on Scalable Uncertainty Management, Marburg (Germany), pp. 325–336 (2012) 31. Lui, H., Motoda, H.: Computational Methods of Feature Selection. CRC Press, Boca Rató (2007) 32. Maimon, O., Rokach, L.: Data Mining and Knowledge Discovery Handbook. Springer (2005) 33. Salazar, A., Gosalbez, J., Bosch, I., Miralles, R., Vergara, L.: A case study of knowledge discovery on academic achievement, student desertion and student retention. In: IEEE ITRE 2004 - 2nd International Conference on Information Technology: Research and Education, London (United Kingdom), pp. 150–154 (2004)

Retention Prediction in Sandbox Games with Bipartite Tensor Factorization

Rafet Sifa1(B), Michael Fedell2, Nathan Franklin2, Diego Klabjan2, Shiva Ram2, Arpan Venugopal2, Simon Demediuk3, and Anders Drachen3

1 Fraunhofer IAIS, Sankt Augustin, Germany
[email protected]
2 Northwestern University, Evanston, USA
{stevenfedell2019,nathanfranklin2019,ShivaVenkatRamanan2020,ArpanVenugopal2019}@u.northwestern.edu, [email protected]
3 DC Labs, York, UK
{simon.demediuk,anders.drachen}@york.ac.uk

Abstract. Open world video games are designed to offer free-roaming virtual environments and agency to the players, providing a substantial degree of freedom to play the games in the way the individual player prefers. Open world games are typically either persistent or, for single-player versions, semi-persistent, meaning that they can be played for long periods of time and generate substantial volumes and variety of user telemetry. Combined, these factors can make it challenging to develop insights about player behavior to inform design and live operations in open world games. Predicting the behavior of players is an important analytical tool for understanding how a game is being played and why players depart (churn). In this paper, we discuss a novel method of learning compressed temporal and behavioral features to predict players that are likely to churn or to continue engaging with the game. We have adopted the Relaxed Tensor Dual DEDICOM (RTDD) algorithm for bipartite tensor factorization of temporal and behavioral data, allowing for automatic representation learning and dimensionality reduction.

Keywords: Tensor factorization · Behavioral analytics · Business intelligence

1

· Behavioral analytics · Business

Introduction

Game Analytics research has in recent years advanced rapidly. In the span of a decade, analytics has moved from a supporting role to a cornerstone of game development. Despite the commercial and academic interest, the domain is still in its explorative phase, with the maturity of the knowledge, technology and models applied varying across business models, game genres and platforms [1–3]. Two key challenges in Game Analytics are player profiling and churn prediction. These are important for different reasons: Behavioral profiling is an important process in game development as it allows the complexity space of player behavior to be condensed into a specific set of profiles, which showcase how a game is being played. Behavioral profiling is notably important for persistent and semi-persistent games, where live operations utilize profiling to understand how the community is playing the game [3,4]. Churn prediction is a key process in Game Analytics for many different types of games, not least those that use a freemium revenue model [5–9]. Churn prediction basically attempts to predict when a specific player will stop playing the game. With accurate churn prediction, it is possible for analytics teams to pinpoint players who may not be enjoying the game or who are experiencing problems progressing. Understanding when a player might leave the game provides the ability to explore why that might be happening via behavioral analysis [5,6,8,10]. Churn prediction in open-world games, whether single-player or massively multi-player online (MMOG), is virtually unexplored, with very few publications on this problem [11,12].

1.1 Objective and Contribution

While methodologically there are different possible approaches towards building profiles and classification models (see e.g. [3,4,8]), the approach adopted here is bipartite tensor factorization, due to the prior successful application of tensor models in OWGs [12] and freemium games [13]. This paper is, to the best knowledge of the authors, the first to propose the use of bipartite tensor factorization for learning temporal representations for behavior prediction in games. The work presented directly extends prior Game Analytics research on churn prediction by showing that incorporating low-dimensional, automatically extracted temporal features can provide similar, and for some metrics better, prediction performance than models trained solely on aggregate behavioral data (which the majority of previous work adopts), omitting the temporal information. The test case used here is the open world game (OWG) Just Cause 2 (JC2). JC2 features a massive, freely navigable environment with missions, objectives and other activities spread across the environment. While AAA (major commercial) OWG titles like JC2 vary in their design (e.g. Skyrim, Grand Theft Auto), in general these feature spatio-temporal navigation and tactical combat. Freedom is a characteristic of OWGs, and space and time are both important dimensions for assessing the user experience, and thus for behavioral analysis [1,4]. On a final note, while MMOGs are typically also OWGs, the presence of many players within the same virtual world, as compared to just one for JC2, means that the analyses presented here may not translate directly to these types of games.

1.2 Related Work

The work presented here builds on previous research in Game Analytics on prediction modeling, behavioral profiling and spatio-temporal behavioral analytics (e.g. [4–6,10,11,14–16]). With respect to prediction in games, previous work has primarily targeted either predicting future behavior [10,11,14,15] or sought to inform situations related to agent modeling in Game AI [17].


Fig. 1. Our bipartite tensor factorization based representation framework to compress multidimensional player information for future behavior prediction. Each slice of the extracted tensor encodes a player’s observations and for training our predictors we consider the vectorized version (denoted by vec(·)) of the low dimensional factors. In real life scenarios the learned basis matrices A and B can be easily used to infer the low dimensional representations from new players (for orthogonal basis matrices this boils down to matrix multiplication), which can be fed to the trained classifiers for future behavior predictions.

In terms of the former, the emphasis has been on persistent or semi-persistent games where live operations are important to the financial success of a title. A variety of machine learning-based approaches have been adopted, including pattern recognition, regression, decision trees [6], support vector machines [18], Hidden Markov Models [8,11] and deep learning [10]. Runge et al. [8] and Hadiji et al. [6] benchmarked multiple methods in churn prediction. Behavioral profiling in games has a substantial history, recently summarized by Sifa et al. [4], and will therefore not be covered in detail here. The key objective of profiling is to act as a means for managing complex user data and building meaning from them, discovering underlying patterns in the behavior of the players [4,19]. Profiling allows for a condensation and modeling of a complex behavioral space, exemplified in MMOGs and OWGs. Spatio-temporal analytics is comparatively infrequent, notably compared to the strong tradition in Game AI where, e.g., agent models require consideration of both dimensions [17]; several papers do, however, exist on the topic of visualizing behavioral data from games, e.g. Wallner et al. [20]. Another key precursor paper is Sifa et al. [12], who adapted different tensor models that factorize asymmetric waypoint matrices to learn spatio-temporal features for churn prediction at the individual player level, achieving up to 81% accuracy for Just Cause 2, using the same dataset as applied here. The work by Sifa et al. [12] also highlighted the importance of spatio-temporal features in predicting retention in OWGs, possibly because these dimensions are integral to the user experience of these games.

300

R. Sifa et al.

these dimensions are integral to the user experience of these games. This result contrasted prediction work in mobile games where highly successful classification work has shown that spatial features are not important to predicting retention (see [5,6,10,11,14,16]). The work presented here extends [12] by considering a more general factorization model to be able to automatically learn temporal features from a set of bipartite player matrices. To this end, we will present how we can design a bipartite temporal behavioral player tensor and introduce the use of Relaxed Tensor Dual DEDICOM (RTDD) to extract features that can be later used in further analytics applications, which for our case will be about predicting the future arrival behavior of a set of JC2 players

2 Relaxed Tensor Dual DEDICOM

We devote this section to explaining the tensor factorization model called Relaxed Tensor Dual DEDICOM (RTDD) [13,21,22], which generalizes the matrix and tensor factorization models INDSCAL and DEDICOM [12,13,23] to decompose bipartite tensors into combinations of low-rank matrices. We will use this model in our experiments to factorize a data tensor that encodes temporal player interactions in order to learn compact player representations for retention prediction. Formally, given a bipartite data tensor Y ∈ R^{m×n×d} containing a collection of d bipartite m × n matrices or slices (i.e. Y = {Y_1, . . . , Y_d} for Y_i ∈ R^{m×n}) and the dimensionalities of the hidden components p and q, the RTDD model yields a left basis matrix A ∈ R^{m×p}, a right basis matrix B ∈ R^{n×q} and a coefficient tensor W ∈ R^{p×q×d} to represent each slice Y_i of the data tensor as

Y_i = A W_i B^T,    (1)

where W_i ∈ R^{p×q} is the ith slice of W. It is worth mentioning that, akin to two-factor matrix factorization models (see [3]), for a given set of factors {A, B, W} and model parameters p and q, which are typically chosen such that p, q ≪ min(m, n), the representation in (1) compresses the factorized tensor: the space complexities of the data and of the factorized representation are O(mnd) and O(mp + nq + dpq), respectively, where the latter reduces to O(dpq) when only the coefficient tensor is used in further analytics applications (as we will show in our case study). The RTDD factors of a given bipartite tensor Y can be found by minimizing the sum of the reconstruction errors of the slices, defined as

E(A, B, W) = Σ_{i=1}^{d} ‖Y_i − A W_i B^T‖²,    (2)

where ‖·‖ denotes the Frobenius norm [13]. The objective in (2) cannot be minimized directly due to its non-convexity in the factors. A popular alternative for such problems is a set of iterative optimization updates, in which the objective function is optimized for each factor (for RTDD: A, B and each slice of W) independently while keeping the other factors fixed (see examples in [3,21,24]). In summary, an alternating algorithm to obtain the RTDD factors starts with random factors (the initial factors also have to satisfy any imposed constraints for the optimization process to converge) and iteratively decreases the objective E from (2) by consecutively
– minimizing E for A with fixed B and W,
– minimizing E for B with fixed A and W, as well as
– minimizing E for each slice of W with fixed A and B,
until a predefined stopping condition (e.g. a maximum number of iterations or stabilization of E) is met. Another important aspect of tensor factorization is imposing constraints on the factors, for reasons such as representability, interpretability or speed-up [3]. In this work we consider the constraints introduced in [13] that force the basis matrices A and B of RTDD to be column orthonormal (i.e. A^T A = I_p and B^T B = I_q, where I_h is the h × h identity matrix); we used the authors' original Python implementation from https://tinyurl.com/rtddcode in our experiments. These constraints not only speed up the factorization process (as empirically shown in [25]) but also allow us to easily obtain coefficient matrices for a new set of players from previously trained models. The latter is particularly beneficial in continuous profiling and prediction environments, in which behavioral representations are learned from a (typically) large player base and can then be inferred for newly observed data units. For the case of RTDD, once a data tensor Y is decomposed into a combination of the factors {A, B, W} as in (1), we can obtain a coefficient tensor for an unseen data tensor Ŷ ∈ R^{m×n×d̂} (e.g. containing d̂ players) by considering the global minimizers of (2) for each slice as Ŵ_i ← A^T Ŷ_i B, where Ŵ_i is the ith slice of the new coefficient tensor Ŵ corresponding to the ith slice of Ŷ (see Fig. 1 for more details).
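The alternating updates themselves depend on the chosen constraints and are not reproduced here; the following is only a minimal NumPy sketch (not the authors' implementation) of the reconstruction error in (2) and of the orthonormal-basis inference step Ŵ_i ← A^T Ŷ_i B together with the vectorization used later as classifier features.

```python
import numpy as np

def rtdd_reconstruction_error(Y, A, B, W):
    """Sum of squared Frobenius reconstruction errors over all d slices, Eq. (2)."""
    # Y: (m, n, d) data tensor, A: (m, p), B: (n, q), W: (p, q, d)
    return sum(np.linalg.norm(Y[:, :, i] - A @ W[:, :, i] @ B.T, "fro") ** 2
               for i in range(Y.shape[2]))

def infer_coefficients(Y_new, A, B):
    """For column-orthonormal A and B, the slice-wise minimizer of Eq. (2)
    is W_i = A^T Y_i B; returns the coefficient tensor for unseen players."""
    m, n, d_new = Y_new.shape
    p, q = A.shape[1], B.shape[1]
    W_new = np.empty((p, q, d_new))
    for i in range(d_new):
        W_new[:, :, i] = A.T @ Y_new[:, :, i] @ B
    return W_new

def vectorized_features(W):
    """vec(W_i) for every player slice -> (d, p*q) feature matrix for the classifiers."""
    p, q, d = W.shape
    return W.reshape(p * q, d).T
```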

3 Data and Pre-processing

In this section we briefly explain the game whose players we analyzed in this work, the main steps we took to preprocess our dataset, and the way we designed our tensor for factorization with the goal of predicting future player behavior.

3.1 Just Cause 2: Gameplay

Just Cause 2 is a third-person action-adventure game which allows players to explore an open world map, with the overarching goal of overthrowing the dictatorial government of the fictional nation of Panau (see Fig. 2). The playable world covers about 1000 virtual square kilometers. The game allows players to use weapons from a vast arsenal while giving them access to different kinds of sea, air, and land modes of transportation. To advance through the game, players can earn chaos points by completing missions or destroying select government properties, eventually causing the government to collapse. The chaos system gives players the freedom to progress through the game in a number of different ways besides main mission completion.

Fig. 2. Just Cause 2 is an action-adventure sandbox (open world) game that takes place on the fictional island of Panau, with a total coverage of about 1000 km² ((a) a look at the game world; (b) a fighting scene). Images are copyright of Square Enix (2009).

3.2 Behavioral Features and Temporal Aggregations

The data set used in this analysis consists of in-game statistics of 5331 randomly sampled players. The dataset has more than 10 million records with actions, timestamps, and locations that were normalized and stored in a relational database for easier querying and processing. Based on an initial exploratory data analysis, nine unique players were removed from the dataset as they consisted of erroneous records with abnormal values of in-game statistics. Users are allowed to play the game at four different levels of difficulty. Since the dataset contains player statistics for each of the difficulty levels in which a player can engage the game, we considered the composite key of player id and difficulty level as a unique player. Thus, unlike the previous work analyzing this game, this analysis comprises 6598 individual data points. Similar to much of the previously mentioned work in behavioral analytics in games, we extracted 93 features from our player base, which comprised our entire expanded behavioral space of interest. These features were recorded in the database as either cumulative statistics over a player's lifetime or descriptions of events that take place during gameplay. The features can be categorized into four distinct groups. The largest group is comprised of the lifetime counter statistics of the game, which include different kinds of kills, structures destroyed, chaos caused, missions completed, and many more; these values are recorded as lifetime totals at each increment. The next set of features consists of player actions not included in the statistics, such as entry and exit of vehicles and parachutes; these actions are both geo- and timestamped and are given as single, point-in-time observations. The third group of features pertains to causes of death, and the fourth group provides extraction (a form of transport) information. To aggregate the behavioral features for each player, we required a common temporal feature space. Since the amount of time spent by a player in each session can vary considerably, we sought to design a temporal unit which would hold equivalence across players with minimal loss of information. To accomplish this, we divided the data by playtime (seconds played since starting) into Time Buckets: periods of 1000 s for the first 10,000 s of game play, followed by periods of 50,000 s up to 1,000,000 s in total.
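A minimal sketch of this bucketing, assuming a column of cumulative per-record playtime in seconds and the bucket edges just described (the helper name and frame layout are illustrative, not from the paper):

```python
import numpy as np
import pandas as pd

# Bucket edges: 1,000-second buckets up to 10,000 s, then 50,000-second buckets onwards.
EDGES = np.concatenate([np.arange(0, 10_000, 1_000),
                        np.arange(10_000, 1_000_001, 50_000)])

def assign_time_bucket(playtime_seconds: pd.Series) -> pd.Series:
    """Map cumulative playtime (seconds since the first session) to a Time Bucket index.

    Playtimes beyond the last edge simply fall into the final bucket.
    """
    return pd.Series(np.digitize(playtime_seconds, EDGES, right=False) - 1,
                     index=playtime_seconds.index, name="time_bucket")

# Example: records["time_bucket"] = assign_time_bucket(records["playtime"])
```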

3.3 Labeling for Retention Analysis

Previous studies covering churn and retention analysis usually defined the prediction setting as observing a player within a predefined time interval and predicting their future behavior in a subsequent predefined interval (see [6,12] for examples). Similar to [12], we define a churning player as one who, after an observation period of 14 days beginning with their first session, failed to return to the game in the 7 days following (days 15–21). Accordingly, we created a churn flag assigned to any player ID that fits this definition. By this measure, roughly 30% of the unique players in our dataset were retained, and the other 70% churned. We note that, since the number of churners is significantly high, it is of considerable benefit if we can correctly identify returners and churners using their gameplay data from the initial gaming sessions.
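A small pandas sketch of this labeling rule, assuming a session log with player identifiers and session start timestamps (column names are assumptions for illustration):

```python
import pandas as pd

def churn_labels(sessions: pd.DataFrame) -> pd.Series:
    """Label players as churned (1) or retained (0) from a session log.

    Assumes columns 'player_id' and 'session_start' (datetime); a player churns if
    they do not return to the game during days 15-21 after their first session.
    """
    first = sessions.groupby("player_id")["session_start"].min()
    day = (sessions["session_start"] - sessions["player_id"].map(first)).dt.days
    # Day offsets 14..20 correspond to calendar days 15-21 of a player's lifetime.
    came_back = sessions.loc[(day >= 14) & (day <= 20), "player_id"].unique()
    churn = ~first.index.isin(came_back)
    return pd.Series(churn.astype(int), index=first.index, name="churn")
```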

3.4 Final Tensor Design

After pre-processing the data, extracting behavioral features, and aggregating over the temporal units described above, the next step was to design the tensor for bipartite tensor factorization using the Relaxed Tensor Dual DEDICOM model, which is the focus of this paper. The processed dataset had been aggregated to a long matrix format (or matricized), consisting of the records at the temporal unit (session/time bucket) for all players. Following that, the data were scaled using standard and minmax scaling. The former normalizes each feature to have zero mean and a standard deviation of one, whereas the latter normalizes every feature to lie in the same predefined range (we chose the most standard method of transforming every feature to reside in the unit hypercube). This scaled long-format dataset was then converted back into a tensor with m × n × d dimensions, where d is the number of unique players, m is the number of temporal units, and n is the number of behavioral features. Another important aspect of our tensor design was related to censoring: the decomposition model requires each of its d slices to have the same dimensions, but the amount of time played varied widely between players. To remedy this, the value of m was chosen sufficiently large to capture the longest-playing player, and all other player matrices were padded with rows of zeros for time units which exceeded their maximum playtime.

Table 1. Cross validation prediction performance of our baseline setting that omits the temporal features and only incorporates the behavioral features. We obtain up to 0.63 F-Score for predicting future player retention.

Model                Precision  Recall  F-Score
Random Forest        0.605      0.657   0.631
Logistic Regression  0.450      0.675   0.541
Gradient Boosting    0.635      0.626   0.617
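As a rough illustration of the tensor construction just described, the zero-padded player tensor can be assembled as follows (column names, and the choice of minmax scaling shown in the comment, are assumptions for the sketch, not taken from the paper):

```python
import numpy as np
import pandas as pd

def build_player_tensor(long_df: pd.DataFrame, feature_cols: list) -> np.ndarray:
    """Turn a long (player, time_bucket, features) frame into an m x n x d tensor.

    Assumes columns 'player_id' and 'time_bucket'; m is set by the longest-playing
    player and shorter players are zero-padded, as described in Sect. 3.4.
    """
    m = int(long_df["time_bucket"].max()) + 1          # temporal units
    n = len(feature_cols)                              # behavioral features
    players = long_df["player_id"].unique()
    Y = np.zeros((m, n, len(players)))
    for k, pid in enumerate(players):
        rows = long_df[long_df["player_id"] == pid]
        Y[rows["time_bucket"].to_numpy(), :, k] = rows[feature_cols].to_numpy()
    return Y

# Minmax-scale each feature column to the unit interval before building the tensor:
# long_df[feature_cols] = (long_df[feature_cols] - long_df[feature_cols].min()) / \
#                         (long_df[feature_cols].max() - long_df[feature_cols].min())
```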

Table 2. A more detailed comparison of the retention prediction results of our player base for different parametrizations of the tensor factorization model and normalization methods, where we obtained results that are better than our baselines (see Table 1).

RTDD                                        Random Forest               Logistic Regression         Gradient Boosting
p   q   Reconstruction error  Iterations    Precision  Recall  F-Score  Precision  Recall  F-Score  Precision  Recall  F-Score
(a) prediction results incorporating minmax scaling for Y
25  15  1367.01               19            0.613      0.756   0.677    0.599      0.677   0.636    0.639      0.678   0.658
25  25  954.19                20            0.613      0.766   0.681    0.584      0.691   0.633    0.643      0.670   0.656
25  50  362.26                54            0.609      0.767   0.679    0.552      0.723   0.626    0.648      0.668   0.658
50  25  919.18                13            0.603      0.777   0.679    0.57       0.694   0.626    0.638      0.679   0.658
50  50  299.97                100           0.610      0.778   0.684    0.544      0.729   0.623    0.646      0.683   0.664
(b) prediction results incorporating standard scaling for Y
25  5   71959.72              6             0.582      0.793   0.671    0.632      0.543   0.584    0.602      0.689   0.643
25  25  49984.64              13            0.579      0.797   0.671    0.568      0.667   0.613    0.629      0.625   0.627
50  5   72793.53              35            0.601      0.761   0.672    0.643      0.526   0.579    0.614      0.672   0.642
50  25  49714.19              100           0.593      0.791   0.678    0.568      0.672   0.616    0.621      0.646   0.634
50  50  21707.29              55            0.587      0.805   0.679    0.533      0.707   0.608    0.642      0.628   0.635

4 Prediction Results

In this section we present our retention prediction results, first explaining the setting we considered for our baseline and then examining the prediction results using RTDD-based input features from the perspectives of data normalization and parametrization. In order to set a baseline, we analyze the performance of predicting retention using only the behavioral features, ignoring the temporal axis. For this analysis we create a matrix with PlayerID as rows and the behavioral features aggregated over the 14-day activity period of a player as columns. We trained 5-fold cross-validated Logistic Regression [26], Random Forest [27] and Gradient Boosting Classification [28] models with the 93 aggregated behavioral features as predictors and the retention flag as response to predict player retention. Among these models, Random Forest predicted retention best, with a precision of 0.605, recall of 0.657 and F-Score of 0.63 (see Table 1). In the following we use these results as our baseline. We incorporated the temporal behavior of the players over the 14-day period by creating a tensor Y as described above, with time periods (temporal units) as its rows, behavioral features as its columns and individual players as its slices; the behavioral features are aggregated across time periods from days 1–14 of each player's playing life. Following that, we used RTDD to factorize the tensor Y into a temporal basis matrix A, a behavioral basis matrix B and a coefficient tensor W. RTDD embeds the temporal and behavioral dimensions into their respective loading matrices. Each slice of the coefficient tensor W is then vectorized and used as compact temporal-behavioral features to predict future user behavior. In order to evaluate the improvement in retention prediction brought about by the compact features, we predict the retention probability of players using Random Forest, Logistic Regression and Gradient Boosting classifiers.
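A compressed sketch of this evaluation protocol with scikit-learn is given below; it is a hedged outline under assumed variable names, not the authors' exact experimental code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def evaluate(features: np.ndarray, churn: np.ndarray) -> dict:
    """5-fold cross-validated F-scores for the three classifiers used in the paper."""
    models = {
        "Random Forest": RandomForestClassifier(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Gradient Boosting": GradientBoostingClassifier(),
    }
    return {name: cross_val_score(m, features, churn, cv=5, scoring="f1").mean()
            for name, m in models.items()}

# Baseline: 93 aggregated behavioral features per player (shape: players x 93).
# baseline_scores = evaluate(aggregated_features, churn)

# RTDD features: vectorize each player's coefficient slice W_i (shape: players x p*q),
# e.g. with vectorized_features(W) from the earlier sketch.
# rtdd_scores = evaluate(vectorized_features(W), churn)
```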

Fig. 3. Retention prediction quality in terms of F-Score for two popular scaling techniques and different parametrizations of RTDD using the Random Forest classifier: panel (a) shows minmax scaling for Y and panel (b) standard scaling for Y, each as a grid of cross-validated F-Scores over the RTDD parameters p, q ∈ {5, 10, 15, 25, 50}. Standard scaling normalizes the data to have zero mean and a standard deviation of one, while minmax scaling compresses all features to a predefined range (here the unit hypercube). Although for both normalization methods the best results are obtained for larger values of p and q, minmax normalization yielded more stable prediction results than standard scaling.


As the choice of the number of latent factors affects the representational power [3], and thus the follow-up applications using the latent representations, we utilized a grid search over the RTDD parameters p and q. We chose the values p ∈ [5, 10, 15, 25, 50] and q ∈ [5, 10, 15, 25, 50], while ensuring p, q ≪ min(m, n) in order to obtain a compressed factor representation for each player. We trained a Random Forest classifier with the compressed features as predictors and the retention flag as response to predict the retention probability of players. 5-fold cross validation was used to evaluate the Random Forest model over various settings of tree depth and the maximum number of features used to split the decision nodes while building the ensemble, and all resulting models were evaluated based on their cross-validated F-score. We present the prediction results for different values of p and q in Fig. 3, and note that models using compact features from minmax scaled tensors predicted retention slightly better than those using standard scaled tensors for the same p and q setting. After this explorative analysis, we created a smaller subset of optimal p and q settings based on the five settings with the highest cross-validated F-score from the Random Forest output (shown in Table 2). For p = 50 and q = 50, Random Forest predicted player retention best, with precision, recall and F-score of 0.61, 0.778 and 0.684, respectively. Following that, to compare the Random Forest results against other standard classification models, we trained Logistic Regression and Gradient Boosting classifiers on the smaller subset of optimal p-q settings (see Table 2 for cross-validated precision, recall and F-score comparisons). Overall, our results indicate that, compared to the baseline model, the compact temporal behavioral features learned with RTDD improved the retention prediction results substantially, with observed improvements of more than 0.8%, 14% and 5% for precision, recall and F-score, respectively. This implies that bipartite tensor factorization not only allows for learning compact temporal representations, but also yields informative representations that help us predict future behavior better than the non-temporal behavioral features.
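A schematic version of this grid search is sketched below; the factorization, feature and scoring routines are passed in as callables because the RTDD optimization itself is not reproduced here (all three callables are assumptions, e.g. an RTDD routine plus the vectorized_features and evaluate helpers from the earlier sketches).

```python
from itertools import product
from typing import Callable, Dict, Tuple

def grid_search_rtdd(Y, churn,
                     factorize: Callable,        # assumed RTDD routine returning (A, B, W)
                     make_features: Callable,    # e.g. vectorized_features from the sketch above
                     score: Callable,            # e.g. a cross-validated F-score function
                     ps=(5, 10, 15, 25, 50),
                     qs=(5, 10, 15, 25, 50)) -> Dict[Tuple[int, int], float]:
    """Score every (p, q) pair of the grid with the supplied callables."""
    scores = {}
    for p, q in product(ps, qs):
        A, B, W = factorize(Y, p, q)
        scores[(p, q)] = score(make_features(W), churn)
    return scores

# Usage sketch: scores = grid_search_rtdd(Y, churn, rtdd_factorize, vectorized_features, rf_f1_cv)
#               best_pq = max(scores, key=scores.get)
```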

5 Conclusion and Future Work

In this work we presented a novel approach, based on the work of [12], to automatically learn useful representations from temporal and behavioral features in sandbox games by utilizing bipartite tensor factorization. Unlike the static feature definitions mostly utilized in previous work (e.g. [6,10]), our approach allows us to easily incorporate temporal player behavior into any prediction framework. Our case study with JC2 empirically showed that incorporating the coefficient matrices that are automatically learned for each player by factorizing our tensor (encoding information about players, time periods and behavioral features) can improve the prediction of future arrivals. Our future work involves evaluating our behavior prediction models when enforcing constraints other than orthogonality of the basis matrices on our tensor factorization model. In addition, we will explore how adding such constraints impacts the interpretability of the resulting factors.


Acknowledgments. Part of this work was jointly funded by the Audience of the Future programme by UK Research and Innovation through the Industrial Strategy Challenge Fund (grant no. 104775) and supported by the Digital Creativity Labs (digitalcreativity.ac.uk), a jointly funded project by EPSRC/AHRC/Innovate UK under grant no. EP/M023265/1. Additionally, part of this research was funded by the Federal Ministry of Education and Research of Germany as part of the competence center for machine learning ML2R (01—S18038A).

References

1. El-Nasr, M., Drachen, A., Canossa, A.: Game Analytics: Maximizing the Value of Player Data. Springer, Cham (2013)
2. Drachen, A., Mirza-Babaei, P., Nacke, L.: Games User Research. Oxford University Press, Oxford (2018)
3. Sifa, R.: Matrix and tensor factorization for profiling player behavior. LeanPub (2019)
4. Sifa, R., Drachen, A., Bauckhage, C.: Profiling in games: understanding behavior from telemetry (2017)
5. Drachen, A., Lunquist, E., Kung, Y., Rao, P., Klabjan, D., Sifa, R., Runge, J.: Rapid prediction of player retention in free-to-play mobile games. In: Proceedings of the AAAI AIIDE (2016)
6. Hadiji, F., Sifa, R., Drachen, A., Thurau, C., Kersting, K., Bauckhage, C.: Predicting player churn in the wild. In: Proceedings of the IEEE CIG (2014)
7. Kim, K.-J., Yoon, D., Jeon, J., Yang, S.-I., Lee, S.-K., Lee, E., Bertens, P., Periáñez, Á., Hadiji, F., Müller, M., Joo, Y., Lee, J., Hwang, I.: Game data mining competition on churn prediction and survival analysis using commercial game log data. In: Computational Intelligence and Games (CIG) (2017)
8. Runge, J., Gao, P., Garcin, F., Faltings, B.: Churn prediction for high-value players in casual social games. In: Proceedings of the IEEE CIG (2014)
9. Sifa, R., Hadiji, F., Runge, J., Drachen, A., Kersting, K., Bauckhage, C.: Predicting purchase decisions in mobile free-to-play games. In: Eleventh Artificial Intelligence and Interactive Digital Entertainment Conference (2015)
10. Periáñez, Á., Saas, A., Guitart, A., Magne, C.: Churn prediction in mobile social games: towards a complete assessment using survival ensembles. In: Proceedings of the IEEE DSAA (2016)
11. Demediuk, S., Murrin, A., Bulger, D., Hitchens, M., Drachen, A., Raffe, W.L., Tamassia, M.: Player retention in league of legends: a study using survival analysis. In: Proceedings of the ACSW IE. ACM (2018)
12. Sifa, R., Srikanth, S., Drachen, A., Ojeda, C., Bauckhage, C.: Predicting retention in sandbox games with tensor factorization-based representation learning. In: Proceedings of the IEEE CIG (2016)
13. Sifa, R., Yawar, R., Ramamurthy, R., Bauckhage, C.: Matrix and tensor factorization based game content recommender systems: a bottom-up architecture and a comparative online evaluation. In: Proceedings of the AAAI AIIDE (2018)
14. Lee, S.K., Hong, S.J., Yang, S.I., Lee, H.: Predicting churn in mobile free-to-play games. In: Proceedings of the IEEE ICTC (2016)
15. Liu, X., Xie, M., Wen, X., Chen, R., Ge, Y., Duffield, N., Wang, N.: A semi-supervised and inductive embedding model for churn prediction of large-scale mobile games. In: 2018 IEEE International Conference on Data Mining (ICDM). IEEE (2018)
16. Viljanen, M., Airola, A., Pahikkala, T., Heikkonen, J.: Modelling user retention in mobile games. In: Proceedings of the IEEE CIG (2016)
17. Yannakakis, G.N., Togelius, J.: A panorama of artificial and computational intelligence in games. IEEE Trans. Comput. Intell. AI Games 7(4), 317–335 (2015)
18. Xie, H., Devlin, S., Kudenko, D., Cowling, P.: Predicting player disengagement and first purchase with event-frequency based data representation. In: Proceedings of Computational Intelligence in Games (CIG) (2015)
19. Sifa, R., Bauckhage, C.: Online k-maxoids clustering. In: 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 667–675. IEEE (2017)
20. Wallner, G., Kriglstein, S.: PLATO: a visual analytics system for gameplay data. Comput. Graph. 38, 341–356 (2014)
21. Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Rev. 51(3), 455–500 (2009)
22. Tucker, L.R.: Some mathematical notes on three-mode factor analysis. Psychometrika 31(3), 279–311 (1966)
23. Harshman, R.A.: Models for analysis of asymmetrical relationships among N objects or stimuli. In: Proceedings of the Joint Meeting of the Psychometric Society and the Society for Mathematical Psychology (1978)
24. Bader, B.W., Harshman, R., Kolda, T.G.: Temporal analysis of semantic graphs using ASALSAN. In: Proceedings of the IEEE ICDM (2007)
25. Sifa, R., Yawar, R., Ramamurthy, R., Bauckhage, C., Kersting, K.: Matrix- and tensor factorization for game content recommendation. Springer German J. Artif. Intell. 34, 57–67 (2020)
26. McCullagh, P., Nelder, J.A.: Generalized Linear Models. Chapman and Hall/CRC Monographs on Statistics and Applied Probability Series, 2nd edn. Chapman & Hall, Boca Raton (1989)
27. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
28. Friedman, J.H.: Stochastic gradient boosting. Comput. Stat. Data Anal. 38(4), 367–378 (2002)

Data Analytics of Student Learning Outcomes Using Abet Course Files

Hosam Hasan Alhakami1(B), Baker Ahmed Al-Masabi2, and Tahani Mohammad Alsubait3

1 College of Computer and Information Systems, Department of Computer Science, Umm Al-Qura University, Makkah, Kingdom of Saudi Arabia. [email protected]
2 College of Computer and Information Systems, Department of Computer Science and Engineering, Umm Al-Qura University, Makkah, Kingdom of Saudi Arabia. [email protected]
3 College of Computer and Information Systems, Department of Information Science, Umm Al-Qura University, Makkah, Kingdom of Saudi Arabia. [email protected]

Abstract. ABET accreditation requires direct assessment of Student Outcomes (SOs) in various courses, and outcome-based learning is what students learn as a result of these assessments. Nevertheless, most universities find it difficult to prepare and approve ABET course files. Students' scores will not truly reflect their outcome-based learning if the ABET course file is not designed in a manner that addresses the relevant SOs; conversely, a well-designed ABET course file relates directly to the course contents, without which outcome-based learning is not evident. As such, the aim of this project is to analyze students' performance and achievements with respect to ABET course files using data mining approaches. The project also tests various data mining methods, including Naïve Bayes and Decision Trees, and recommends an appropriate method for predicting student performance. The accuracy of student performance prediction is higher with the Decision Tree than with other algorithms such as Naïve Bayes.

Keywords: ABET accreditation · Student Outcomes (SO) · Course Learning Outcomes (CLOs) · K-Nearest Neighbors (K-NN) · Data Mining or Knowledge Discovery in Databases (KDD) · Database Management Systems (DBMS) · WEKA · Decision Tree · Neural networks · Naïve Bayesian classification · Support vector machines

1 Introduction

The performance of students is key for higher learning institutions, since most high-quality universities' requirements include records of high academic performance. Prediction of student performance is very helpful in the educational system for both learners and educators. Students' performance is obtainable with the measurement of the learning

co-curriculum, along with assessment, according to Shahiri and Husain [1]. Previous academic works have defined students' performance in several ways. According to Usamah et al. [2], academic analytics is a proven approach to predicting and monitoring students' performance and is useful in improving success in the educational system. Nevertheless, the majority of research works discussed measuring the success of students at graduation. Most Malaysian higher learning institutions usually assess the performance of students using their final grades; extracurricular activities, final exam score, assessment marks and the course framework constitute the final grades according to Shahiri and Husain [1]. It is necessary for the learning process to be effective and for the performance of students to be maintained, and it is critical to organize a tactical program while students study at an institution by assessing their performance, according to Ibrahim and Rusli [3]. Students' performance is presently being assessed using various methods proposed by scholars, and data mining is among the most common methods used to analyze it; most aspects of education have used it in recent times according to Romero and Ventura [4]. Applied to education, data mining is known as educational data mining and is used for extracting valuable information and patterns from large educational databases as well as for predicting the performance of students, according to Angeline [5]. Students' performance can be predicted using this valuable information and these patterns; consequently, instructors can use it to provide effective teaching techniques and to monitor the academic progress made by students, and its use enhances the learning activities of students because it allows the administration to enhance the performance system. Therefore, the use of data mining methods helps to focus on the particular needs of various bodies. This project aims to analyze the performance and achievements of students with respect to ABET (Accreditation Board for Engineering and Technology) course files using data mining techniques. The objectives of this project are to: (1) identify the classification methods that can be used to analyze and predict the performance and achievements of students with respect to ABET course files, and (2) perform the classification using Weka.

2 Literature Review

There are many studies on data mining, both theoretical and practical, that have dealt with the classification methods used on educational data and with the factors affecting student performance. This section provides the complete background study conducted for this project. The chapter contains four main sections: Data Mining, Classification, Factors involved in predicting student performance, and Related works. These sections cover research in the field of education in terms of methods for obtaining data, classification methods, influencing factors, and the classification tools used. What is covered in this chapter is important for achieving one of the main research objectives: identifying the most effective classification methods for educational data.


2.1 Data Mining The concept of Data mining is relatively new. Various fields have used data mining successfully because of its importance in decision making process. The field of education is also inclusive in it. In recent times several researchers have developed an interest in studying the use of data mining in education and they referred it as Educational Data Mining (EDM). Educational Data Mining is the application of Data Mining techniques to educational data, and its objective is to analyze these types of data in order to resolve educational research issues according to Romero, and Ventura [4]. There are several factors that affect the learning environment and have been studied. According to Baker et al. [6] if we consider that improving motivation of students directly improves learning. After investigating what type of motivation strongly affects the learning, then we can specifically focus on those specific behaviors and motivations to make a better learning environment for students. The factors affecting student performance, including the most influential elements of “grades” and “absence” according to fernandes [7]. With the help of educational data set we can study and visualized all the factors which affect student’s performance in an educational system. Educational Data Mining aims at developing innovative techniques for exploring educational data to determine how valuable it is in the system of learning according to Kotsiantis [8]. It also aims at analyzing the academic performance of learners, as well as developing a system that warns ahead of time to avoid educational problems according to Mueen et al. [9]. They used Naive Bayes classifier and obtained 86% prediction accuracy which assist teachers to early detect students who are expected to fail a particular course and they can provide special attention and warning to these students. This study also reveals that different factors such as personal interest, family influence, instructors’ attention toward students or combination of these factors effect students’ performance and it vary from place to place and applied differently in different field of studies. The academic process of students requires their academic performance to be predicted and analyzed to help the student and the university according to Romero, and Ventura [10]. This task is very complex because students’ performances are affected by various factors which include the interactions between teachers and classmates, prior academic performance, previous schooling, psychological profile, as well as a family factor That primarily affect student performance. According to Araque et al. [11] the rates of dropouts of students are higher and factors associated with dropouts are multi-casual in nature and they are much related to students’ psychological and educational characteristics. Baker [12] stated that there is a difference between the standard data mining algorithms and educational data mining algorithms, their difference lies in the fact that their educational hierarchy is not the same. Academics have suggested new techniques for mining educational data recently. Educational data mining is different from data mining because there is explicit exploitation in multiple level of hierarchy in educational data. That is why educational data mining has emerged as an independent research area and the annual International Conference on Educational Data Mining has been established in 2008 to study this field of data science. 
Educational data mining is considered as an independent aspect of the research area. When obtaining educational data before starting to predict the use of this data must know the data, repair errors and cleaned complete to get the correct results [13]. After obtaining the data, we must know the content of this data and remove the unimportant in the database, and we must add


elements that are important in the analysis. There are five broad categories of current techniques in educational data mining; prediction is among them, and it typically utilizes input data to predict the value of the output. The classifications of prediction include classification, regression, as well as density estimation. 2.2 Classification Classification is among the popular methods used in mining data. The purpose of classification in data mining is to accurately predict the target class for each instance in the data set. A pre classified prototype is used for building a model in classification, for allocating a class or label to a record. The classification method comprises the training aspect, as well as the testing aspect. The training aspect involves the use of an aspect of the data referred to as a training set, to construct the model. The known aspect knows every characteristic including the classes. As soon as the model is built, it is used for describing a class or record to a new record, in a case whereby the class characteristic is not known. A model or classifier such as a Kk-NN, Naïve Bayes, Neural Network (NN), and Decision Tree (DT). • A Decision Tree: Decision Tree is among the most conventional methods used for prediction. Decision tree is a supervised machine learning classification technique where data is continuously split according to a certain parameter. Several academics have utilised the method as it is easy-to-use and understands, and it helps in uncovering large or small data structure, predict the value and Its prediction accuracy is very high compared to other algorithms according to Natek, and Zwilling [14]. They found that on small data sets using decision tree algorithms the prediction rate was high (above 80%). According to Romero et al. [15], comprehending the decision tree models is easy. This is as a result of the way they reason, and it is possible to convert them into an IF THEN rules’ set directly. A decision tree can easily be converted to a set of classification rules. The feature that can use student data for academic performance to predict drop out is an instance where the Decision Tree technique is used in previous studies. It has also been used to predict MCA students’ thirdsemester performance according to Mishra et al. [16] as well as using the behavioural patterns of students to predict their suitable career. They focused on evaluating the factors that affect students’ third semester performance. They compare two different decision tree classifier and visualized the results. These two classifiers were Random Tree classifier and j48. In result they found that Random Tree gave higher accuracy in predicting students’ behavior than j48. The assessment of the performance of the student uses features obtained from data logged in a web-based system of education. In different research the marks gotten in particular courses, the final cumulative grade point average (CGPA), and the students’ final grades are examples of the dataset Used to predict student performance according to Minaei-Bidgoli et al. [17]. Romero et al. [15] carried out some experiments in order to test the performance of different classification algorithms for predicting students’ final marks based on information in the students’ usage data in an e-learning system. They classified students with equal or nearly equal marks are placed in one group and they create different groups depending on the activity carried out on online system. Depending upon their activity


they classified results into four groups which predicted their marks in final exams. For example, if student score is less than 5 then he will fail and if his score is greater than 9, he will perform exceptionally well. An examination, as well as analysis of the entire datasets, were carried out to discover major factors affecting the performance of the students according to Bunkar et al. [18]. After knowing the factors affecting student performance, we would study the most appropriate data mining algorithms for predicting students’ performance in terms of accuracy and speed according to Ramesh et al. [19]. A comparison of the methods of classification used to predict the students’ academic performance to find out the most appropriate among these methods was made by Mayilvaganan and Kapalnadevi [20]. They used decision tree algorithm with some other algorithms to analysis the performance of the students which can be experimented in Weka tool. In the meantime, Gray et al. [21] researched on the classification models’ accuracy for predicting how learners fare in tertiary education, they concluded that models trained on younger students had good prediction accuracy, and were applicable to a new student data, while it was less accurate on older students, and were not applicable to a new data. • Neural Networks: This is a common method used in mining educational data. Neural networks are computing systems inspired by biological neural networks and these systems learn through examples to perform actions. They are generally not programmed with task specific rules. Its benefits include its capability of detecting every possible communication that takes place among according to Gray et al. [21]. The neural network may be able to identify doubtlessly; it does not matter if it is a dependent relates with the independent variables in a complicated nonlinear a manner according to Arsad, and Buniyamin [22]. Hence, it can be argued that the neural network method is among the best technique used in predicting. Students’ performance is predicted using an Artificial Neural Network Model. They came up with Neural Network prediction model to predict the academic performance of students. They concluded with the idea that there is a direct relation with fundamental (prerequisite) subjects in early semesters to the final grade of a student. The behavior of students concerning independent learning, admission data are the characteristics that Neural Network analyses according to Kumar [23]. The others constitute a similar report, as well as Decision Tree technique that allowed academics to utilised the two methods for comparison to ascertain the best prediction technique, to be used to analyze the performance of students. • Naïve Bayes: Four reports out of thirty have utilised Naïve Bayes algorithms for estimating students’ performance. Naive Bayes classifiers are probabilistic classifiers which use Bayes’ theorem and having strong independence assumption between features. These four reports aim at discovering the prediction technique that is most suited for the method of prediction that can be used to predict the students’ performance using comparisons between algorithms and find out the most appropriate among them and the most accurate according to Zhou et al. [24]. It is evident from their investigation that the entire qualities that the data contains have been used Naïve Bayes. 
After that, it examined them one by one to demonstrate the significance, as well as the independence of the individual characteristic according to Gao et al. [25]. Mueen et al. [9] used Naive Bayes classifier and obtained 86% prediction accuracy which assist


teachers to early detect students who are expected to fail a particular course and they can provide special attention and warning to those students. • K-NN: KNN stands for K nearest neighbors and it is a classification algorithm which first store all training data and then classifies testing data based on training data. It is frequently used to test students’ performance in academia. K Nearest neighbor had excellent precision. Bigdoli et al. [26], stated that the time required for identifying the performance of students as excellent learner, good learner, average learner, and the slow learner is lesser with K-Nearest Neighbors technique. The use of K-Nearest Neighbor provides an excellent precision when it comes to predicting the progress made by learners in the in tertiary education according to Yu et al. [27]. 2.3 Related Works ABET accreditation is the goal of every university or college, As the results of this accreditation are great in scientific and practical terms, the methods of assessment of the students’ results differed, but they all show a better result for the student and educational entity [33]. The development of Abet standards is beneficial to the educational institution and contributes to increase the effectiveness of programs and ease of dealing with it [34]. There is no doubt, that the educational process needs technology in many evaluations to contribute in development of education, through knowing of these evaluations and objectives and fully assembled, to combine the AI with education management [35]. Accrediting entities are requiring, as well as recognizing the things that are necessary to process the PLOs’ Assessment enhanced, useful, as well as sustainable education, to a great extent according to Buzzetto-More & Alade [36]. They also explain that technology and e-learning strategies are very helpful in students’ assessment and increase human productivity. Effectiveness of implementation depends on institutional support and long range of consistence effort that includes technology will help to improve results in this field. Assessment describes the process through which the data containing students’ achievements are identified, collected, as well as analyzed, for measuring the result of each learning achievement. The most suitable result being measured utilizes qualitative, quantitative and direct or indirect measures to assess effectively according to ABET [37]. Furthermore, it is fundamental to assess the students, to ascertain the compliance of the educational institution in satisfying stipulate requirements. It also ascertains that the institution can provide the relevant resources for quality education according to Love & Cooper [38]. The following are required by the majority of the accreditation entities to: 1. Specification of information and competencies that students are expected to be proficient in by the time they graduate (take note of a set of program learning outcomes, 2. Establish assessment processes that will help decide the success of the program that helps the students attain these learning outcomes and 3. Perform a process that will constantly improve the students; it is usually called the closing loop. It is a process that enhances the teaching and learning experiences of courses, as well as program levels according to Alzubaidi [39]. The emphasis of the students’ learning outcomes is centered around developing the students’ collaborative, interactive, as well as rational abilities, as they help them carry


out learning activities successfully. Educational institutions have student-specific educational data (grades, achievements, etc.) that predict student performance and future behavior [40]. The educational authorities (schools/universities) have huge databases that have not been fully used so that the educational authorities began to analyze this data and use it to know the behavior of students and their academic level. The learning outcome includes the requirements of students in the area of behaviors, as well as competencies they should acquire when they complete a learning experience according to Asheim, Gowan and Reichgelt [41]. The impact of learning outcomes on the design of curriculum and quality assurance is direct. The learning outcomes signify the changes that take place from the regular perception that is based on the teacher’s thinking process to the use of the method that is based on students’ thinking process. This enables the relationship between the teaching-learning-assessment to be focused on, as well as facilitating the basic connection between the measurement, delivery, and design of learning according to Adam [42]. The most concern found among the faculty when handling the process of accreditation was created according to Jones and Price [43]. Nevertheless, the program level have confirmed that there are advantages in using the learning outcomes based technique according to Clarke and Reichgelt [44], library level, which makes it possible for students to attain the required skills according to Gowan, MacDonald and Reichgelt [45], and individual courses level according to Rigby and Dark [46]. Imam et al. [47] presented introduced a technique that can be used to obtain data that satisfies ABET “Student Outcomes” (SOs), which transforms Course Learning Outcomes (CLOs), meeting the required data attained using CLOs’ assessment to satisfy SO data. When the characteristics of fuzzy are considered, regarding SOs, as well as CLOs’ metrics, there has been a proposition to use Fuzzy Logic algorithm for the extraction of SO satisfaction data from the CLO satisfaction data for specific courses. An executable process that is most appropriate for resolving issues was used to describe fuzzy variables’ membership functions, which include the SOs, CLOs, as well as the relationship between both of them. The fuzzy logic algorithm created a set of 24 rules to form the rule base. MATLAB has been utilized to implement and test the algorithm, and it has resulted in the presentation of an application example of a real-world problem. The intention of Zacharis [48] is developing a practical model that can be used to predict the students at risk of not performing well in blended learning courses. The proposal from past investigations states that teachers can develop prompt, interventions that are evidence-based to help students that are struggling or at risk, by analyzing the usage data that is stored in the log files contained in the modern Learning Management Systems (LMSs). This report aims at examining the tracking data of students from a Moodle LMS- supported blended learning course to point out relevant connections between course grade and various online activities. Only fourteen LMS usage variables were discovered to be relevant, out of twenty-nine of them. 
They were inputted in a stepwise multivariate regression which demonstrated that the prediction was 52% of the difference in the final student grade, using four variables such as number of files viewed, quiz efforts, contribution towards creating content, as well as reading and posting messages. The major difference between the above discussed journal and this research is that the journals did not use the classification algorithm to perform the performance analysis of the system.


2.4 Summary In the past, most of the contributions describe the ABET accreditation at the abstract level. Some of them did not analyze ABET assessment criteria in depth. Over time there has been an immense improvement in this domain and recent research is contributing toward practical implementation which is providing the guideline in specific setting for educational entities as Cook et al. [49] has proposed a generic model based on ABET criteria to highlight deficiencies in the academic programs, intending to apply for ABET accreditation. Eventually, there has been a significant interest in improving the educational system and to get better performance from students. It became possible because of the work which has been done in this field. But somehow there has been a limitation of this contribution and it will require a lot more effort to get positive results. There is a gap in academia on how to find assessment methodologies to satisfy the ABET guidelines for a specific institution because a lot of factors affecting these assessments in various ways and there is much diversity in student’s behavior in different academic institutions depending upon their background, schooling, ethnicity, culture and there are many other factors as well. Data collection is also a major challenge. Because of this diversity in the educational system, models trained on one set of students could not be applied to the students from another corner, which are scarcely distinguishable in terms of factors affecting them. It has also been visualized the lack of available literature on designing assessment strategies to measure student’s performance. Most of the existing literature has worked on different classification techniques while in this paper we have tried to analyze all classification methods which are used in the classification. Unlike previous papers, this paper takes an aggregate approach to guide all key factors of the assessment and we focused on analysis, design, and practical model which can be used in real-life academic environments for countries like Saudi Arabia. Previous studies helped us to know the factors which are influencing the student’s performance. It also helps us to suggest some other factors that have a major impact on student’s performance. We found that Internet assessment is a major feature that is commonly used in previous studies. Another important feature is the CGPA. Based on this knowledge we collected data from Umm Al Qura University (students of the Faculty of Computer Science and Engineering). We used classification methods with educational data and knowledge of factors that affect the performance of the student. We Perform classifications using Weka and identify different classes that can be used to analyze the performance of the student. We analyzed the results received from the classification to identify the suitable classification algorithm to use for analyzing the student performance. Then we evaluate the accuracy of the classification to ensure that recommendations are valid. A lot of factors affect student’s performance and different classification methods are used to predict student’s educational data. More work is needed to find the factors affecting student’s performance and appropriate classification techniques are also required. We received data set from Umm Al-Qura University and performed primary research to analyze the student performance using the WEKA data mining tool and predict their future. 
We found that all factors had an impact on the student’s success or failure in the final exam.


3 Methods

The steps involved in the prediction of student performance and achievements at the university are provided below:

Step 1: Identify the research questions that need to be addressed in this research.
Step 2: Download the Umm Al-Qura University dataset regarding the university assessments and the student results.
Step 3: Clean and arrange the data using Excel.
Step 4: Identify the different classes that can be used to analyze student performance.
Step 5: Perform the classifications using Weka.
Step 6: Analyze the results received from the classification to identify the classification algorithm best suited to analyzing student performance.
Step 7: Evaluate the accuracy of the classification to ensure the recommendation is valid.

3.1 Factors Involved in Predicting Student Performance

The significant features used to predict students' performance can be identified through a systematic literature review. The prediction of student performance in previous analytical studies has focused on specific factors that are the most commonly analyzed and whose impact on student performance is very high [28]. Knowing the factors influencing student performance from previous studies provides the knowledge needed to suggest other factors that also have an impact. Internal assessment and the cumulative grade point average (CGPA) are among the most commonly used features. CGPA has been used by ten out of thirty reports as the major feature for predicting student performance, according to Angeline and to Mayilvaganan and Kalpanadevi [5, 20]. CGPA is the feature academics use most often, since its value is substantial with regard to future career as well as career and academic mobility, and it can be seen as a symbol of achieved academic potential according to bin Mat et al. [2]. It is evident that CGPA is the most important input variable, with a coefficient of 0.87 when contrasted with other variables using coefficient correlation analysis, according to Ibrahim and Rusli [3]. Apart from that, according to research conducted by Christian and Ayub [29], the most influential feature for determining whether students can survive in their studies, regardless of whether they complete them, is CGPA. That report grouped the internal evaluation into attendance, class tests, lab work and marks, with all of these features categorized as one internal assessment; investigators mostly used these features for predicting student performance according to Minaei-Bidgoli et al. [26]. External assessments, as well as student demographics, are also among the most common features. Disability, family background, age, and gender constitute student demographics according to bin Mat et al. [2], while the mark obtained in a particular subject's final exam is an external assessment according to Angeline [5]. Investigators usually utilize student demographics such as gender because male and female students differ in their learning processes. bin Mat et al. [2] discovered that the majority of female students have different, positive styles of learning. Their learning attitudes also differ from those of the male students according to


Simsek and Balaban [30]. Female students study more than their male counterparts and are more focused, persevering, self-directed, obedient and disciplined; the learning tactics of female students are also more effective according to Simsek and Balaban [30]. Therefore, there is evidence that gender is among the important attributes that affect student performance. The social interaction network, according to Szolnoki and Perc [31], and high school background, according to Shahiri and Husain [1], are among the other characteristics commonly used to predict students' performance according to Mishra et al. [16]; five out of thirty studies use each of these attributes. In other investigations, several researchers have predicted student performance using psychometric factors according to Wook et al. [32]: family support, engagement time, study behavior and student interest have been identified as psychometric factors. The resulting systems are well defined, straightforward and easy to use, and lecturers can use them for assessing what students have achieved academically, as determined by their behavior and interest, according to Mayilvaganan and Kalpanadevi [20]. Nevertheless, investigators seldom use these features to predict student performance, since their major focus is on qualitative data and getting participants to provide valid data is difficult.

3.2 Dataset

The dataset used for this project is the Umm Al-Qura University dataset (students of the Faculty of Computer Science and Engineering). The dataset information is derived from several different courses and consists of several CSV files; this project makes use of the two major files, programming languages 382.csv and Web developer 392.csv. Each course contains several important assessments of student performance in addition to an evaluation of attendance. Each department has an independent assessment in terms of student results, and this assessment is based on student data. In total the dataset has 126 student assessment records. Data cleanup also includes arranging the data to prepare it for analysis; therefore, the programming languages 382.csv and Web developer 392.csv files were used as the starting point for data preparation, and duplicate data were deleted for best results in the analysis. Due to redundancy in the datasets, data cleaning was required in order to detect and correct (or remove) corrupt or inaccurate records and to be able to include numerical data, ensuring accuracy during the analysis (textual and empty entries were padded with the value 0, representing no entry available). In addition, 5 columns containing data related to the students' pass and fail status were added:
Final: if Adjusted Final Grade Numerator ≥ 60 then "Pass", else "Fail".
Attendees, Total for Quizzes and Attendees, and Mid-term: if the score is above 50% the value is "High", otherwise "Low".
Grade: if Adjusted Final Grade Numerator ≥ 90 then "A"; ≥ 80 then "B"; ≥ 70 then "C"; ≥ 60 then "D"; else "H".
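A small pandas sketch of these labeling rules (the column names follow the rules above; the frame layout is otherwise an assumption):

```python
import pandas as pd

def add_status_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Derive the Pass/Fail, High/Low and letter-grade columns described in Sect. 3.2."""
    out = df.copy()
    final = out["Adjusted Final Grade Numerator"]
    out["Final"] = final.apply(lambda x: "Pass" if x >= 60 else "Fail")
    for col in ["Attendees", "Total for Quizzes and Attendees", "Mid-term"]:
        # Assumes each of these columns is stored as a percentage score.
        out[col + " level"] = out[col].apply(lambda x: "High" if x > 50 else "Low")
    out["Grade"] = pd.cut(final, bins=[-float("inf"), 60, 70, 80, 90, float("inf")],
                          labels=["H", "D", "C", "B", "A"], right=False)
    return out
```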


4 Results and Discussion

4.1 Basic Excel Analysis

After studying the factors affecting student performance through the database used in this research, all of the factors had an impact on the student's success or failure in the course. The results also showed that students with high attendance and a high mid-term score passed the programming languages 382 course. Table 1 summarizes the results of the programming languages 382 course analysis (see Fig. 1).

Table 1. Results of analysis of programming languages 382.

        Attendees      Attendees & Quizzes   Mid-term
        Pass   Fail    Pass   Fail           Pass   Fail
High    16     1       18     28             8      0
Low     14     29      12     2              22     30


Fig. 1. Results of analysis of programming languages 382.

The web developer 392 course differs from programming languages 382 in that all students whose attendance, mid-term score and quiz total were high passed the course. Table 2 summarizes the results of the web developer 392 course analysis (see Fig. 2). Regarding grades, the general level of students in both courses was (D), which indicates the weak level of the students. Table 3 shows the students' grades in both courses.

Table 2. Results of analysis of web developer 392.

        Attendees      Attendees & Quizzes   Mid-term
        Pass   Fail    Pass   Fail           Pass   Fail
High    32     3       34     0              36     4
Low     15     16      13     19             11     15


Fig. 2. Results of analysis of web developer 392.

Table 3. Average students in both courses.

Course name                  A  B  C   D
Programming languages 382    1  3  6   20
Web developer 392            0  4  19  24

4.2 Weka Analysis

The Weka analysis includes a separate part for each course, using the J48 and NaiveBayes classifiers. The results are important for understanding how accurately each algorithm predicts with the given data and the factors affecting student performance. In this section, the accuracy of the J48 algorithm is higher for all of the data. As shown in Table 4 and Table 5, the accuracy rates are high.


Table 4. Results of analysis of programming languages 382 using Weka.

Algorithm    TP     FP     Recall  F-Measure
NaiveBayes   0.983  0.017  0.983   0.983
J48          1.00   0.00   1.00    1.00

Table 5. Results of analysis of web developer 392 using Weka.

Algorithm    TP     FP     Recall  F-Measure
NaiveBayes   0.970  0.012  0.970   0.970
J48          1.00   0.00   1.00    1.00

• Correctly classified instances: After testing the algorithms on the student data of the programming languages 382 and web developer 392 courses from Umm Al-Qura University for the previous year, the results above show the high accuracy of the J48 algorithm on both courses: 100% correct classification, with a test time of 0 s.
• Incorrectly classified instances: The Naïve Bayes algorithm was the least accurate in dealing with the data of courses 382 and 392. In course 382 there is one false prediction, while course 392 shows 2 wrong predictions. Its test time of 0.03 s is also high compared to the other algorithm used. The confusion matrices show that in course 382 the algorithm predicted that a student would pass while the correct label is fail, and in course 392 it predicted that two students would fail while the correct label is pass.
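The paper runs these classifiers inside Weka; as a rough, self-contained illustration of the same workflow, the sketch below uses scikit-learn stand-ins (DecisionTreeClassifier for the C4.5-style J48 tree and GaussianNB for Naïve Bayes) on the prepared course table. The file name, feature columns and the pass/fail target are assumptions carried over from the data-preparation step, not the authors' actual Weka setup.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Load the prepared course file; the column names are assumed for illustration.
data = pd.read_csv("programming languages 382.csv").fillna(0)
features = ["Attendees", "Total for Quizzes and Attendees", "Mid-term"]
X = data[features]
y = data["Adjusted Final Grade Numerator"] >= 60   # Pass / Fail target

# Decision tree (analogous to Weka's J48) vs. Gaussian Naive Bayes,
# compared by 10-fold cross-validated accuracy, as Weka reports it.
for name, clf in [("Decision tree", DecisionTreeClassifier(random_state=0)),
                  ("Naive Bayes", GaussianNB())]:
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```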

5 Conclusion

Educational data has a great impact on students and educational entities: it has contributed to the growth of the educational field, addresses the problems of students and educational bodies, challenges old traditional methods and reduces educational abuses. A lack of interest from students and instructors in providing correct data has held back development in the educational field. The advantage of conducting data analysis lies in helping students, understanding their performance and predicting their academic future. The study of the factors affecting student performance gives guidance to educational entities on the most important factors influencing it. We used Excel as an assistant to see how the factors affect student performance, and Weka was utilized as a data mining tool to perform the analysis with different classification algorithms such as Naïve Bayes and J48. From the analysis it is identified that J48 is a suitable approach for predicting student performance.


5.1 Recommendation

Based on the literature on predicting student performance, on the factors influencing it, on the classifications used for educational data, and on the analysis of the data used in this research, and in line with the objectives of this research, the recommendations are:

a. More research is needed to understand student performance and the factors affecting it; finding useful educational data is essential.
b. Student results have a very high impact on passing the course.
c. After testing the algorithms on the data and measuring the prediction accuracy, J48 turned out to have a very high accuracy in dealing with educational data.

5.2 Future Works

The future work for this project is provided below:

1. Gather the course outcomes and ABET files for the specific course to perform the analysis.
2. Gather the assignments and the learning outcomes of the assignments to perform the analysis.
3. Perform neural network prediction.

5.3 Limitations

The major limitation of this study is the availability of the dataset. The dataset used in this research was received from Umm Al-Qura University. However, to get more accurate results, primary research should be performed to gather more data from other resources for the analysis.

References 1. Shahiri, A.M., Husain, W., et al.: A review on predicting student’s performance using data mining techniques. Procedia Comput. Sci. 72, 414–422 (2015) 2. bin Mat, U., et al.: An overview of using academic analytics to predict and improve students’ achievement: a proposed proactive intelligent intervention. In: 2013 IEEE 5th Conference on Engineering Education (ICEED), pp. 126–130. IEEE (2013) 3. Ibrahim, Z., Rusli, D.: Predicting students’ academic performance: comparing artificial neural network, decision tree and linear regression. In: 21st Annual SAS Malaysia Forum, 5th September (2007) 4. Romero, C., Ventura, S.: Educational data mining: a review of the state of the art. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 40(6), 601–618 (2010)


5. Magdalene Delighta Angeline, D.: Association rule generation for student performance analysis using apriori algorithm. SIJ Trans. Comput. Sci. Eng. Appl. (CSEA) 1(1), 12–16 (2013) 6. Baker, R.S., Corbett, A.T., Koedinger, K.R.: Detecting student misuse of intelligent tutoring systems. In: International Conference on Intelligent Tutoring Systems, pp. 531–540. Springer, Heidelberg (2004) 7. Fernandes, E., et al.: Educational data mining: predictive analysis of academic performance of public-school students in the capital of Brazil. J. Bus. Res. 94, 335–343 (2019) 8. Kotsiantis, S.B.: Use of machine learning techniques for educational proposes: a decision support system for forecasting students’ grades. Artif. Intell. Rev. 37(4), 331–344 (2012) 9. Mueen, A., Zafar, B., Manzoor, U.: Modeling and predicting students’ academic performance using data mining techniques. Int. J. Modern Educ. Comput. Sci. 8(11), 36 (2016) 10. Romero, C., et al.: Predicting students’ final performance from participation in online discussion forums. Comput. Educ. 68, 458–472 (2013) 11. Araque, F., Roldán, C., Salguero, A.: Factors influencing university dropout rates. Comput. Educ. 53(3), 563–574 (2009) 12. Baker, R.S.J.D., et al.: Data mining for education. Int. Encyclopedia Educ. 7(3), 112–118 (2010) 13. Al-Nadabi, S.S., Jayakumari, C.: Predict the selection of mathematics subject for 11 th grade students using Data Mining technique. In: 2019 4th MEC International Conference on Big Data and Smart City (ICBDSC), pp. 1–4. IEEE (2019) 14. Quadri, M.M.N., Kalyankar, N.V.: Drop out feature of student data for academic performance using decision tree techniques. Glob. J. Comput. Sci. Technol. (2010) 15. Romero, C., et al.: Data mining algorithms to classify students. Educ. Data Mining (2008) 16. Mishra, T., Kumar, D., Gupta, S.: Mining students’ data for prediction performance. In: 2014 Fourth International Conference on Advanced Computing & Communication Technologies, pp. 255–262. IEEE (2014) 17. Minaei-Bidgoli, B., et al.: Predicting student performance: an application of data mining methods with an educational web-based system. In: 33rd Annual Frontiers in Education, FIE, vol. 1, p. T2A–13. IEEE (2003) 18. Bunkar, K., et al.: Data mining: prediction for performance improvement of graduate students using classification. In: 2012 Ninth International Conference on Wireless and Optical Communications Networks (WOCN), pp. 1–5. IEEE (2012) 19. Vamanan, R., Parkavi, P., Ramar, K.: Predicting student performance: a statistical and data mining approach. Int. J. Comput. Appl. 63(8) (2013) 20. Mayilvaganan, M., Kalpanadevi, D.: Comparison of classification techniques for predicting the performance of students’ academic environment. In: 2014 International Conference on Communication and Network Technologies, pp. 113–118. IEEE (2014) 21. Gray, G., McGuinness, C., Owende, P.: An application of classification models to predict learner progression in tertiary education. In: 2014 IEEE International Advance Computing Conference (IACC), pp. 549–554. IEEE (2014) 22. Arsad, P.M., Buniyamin, N., et al.: A neural network students’ performance prediction model (NNSPPM). In: 2013 IEEE International Conference on Smart Instrumentation, Measurement and Applications (ICSIMA), pp. 1–5. IEEE (2013) 23. Anupama Kumar, D.M.S., Vijayalakshmi, M.N., Anupama Kumar, D.V.M.N.S.: Appraising the significance of self-regulated learning in higher education using neural networks. Int. J. Eng. Res. Dev. 1(1), 09–15 (2012) 24. 
Zhou, X., et al.: Detection of pathological brain in MRI scanning based on wavelet entropy and naive Bayes classifier. In: International Conference on Bioinformatics and Biomedical Engineering, pp. 201–209. Springer (2015)


25. Gao, C., et al.: Privacy-preserving Naive Bayes classifiers secure against the substitution then comparison attack. Inf. Sci. 444, 72–88 (2018) 26. Minaei-Bidgoli, B.: Data mining for a web-based educational system. Ph.D. thesis. Michigan State University. Department of Computer Science and Engineering (2004) 27. Yu, Z., et al.: Hybrid k-nearest neighbor classifier. IEEE Trans. Cybern. 46(6), 1263–1275 (2015) 28. Saa, A.A., Al-Emran, M., Shaalan, K.: Factors affecting students’ performance in higher education: a systematic review of predictive data mining techniques. Technol. Knowl. Learn. 1–32 (2019) 29. Christian, T.M., Ayub, M.: Exploration of classification using NB Tree for predicting students’ performance. In: 2014 International Conference on Data and Software Engineering (ICODSE), pp. 1–6. IEEE (2014) 30. Simsek, A., Balaban, J.: Learning strategies of successful and unsuccessful university students. Contemp. Educ. Technol. 1(1), 36–45 (2010) 31. Szolnoki, A., Perc, M.: Conformity enhances network reciprocity in evolutionary social dilemmas. J. Roy. Soc. Interface 12(103), 20141299 (2015) 32. Wook, M., et al.: Predicting NDUM student’s academic performance using data mining techniques. In: 2009 Second International Conference on Computer and Electrical Engineering, vol. 2, pp. 357–361. IEEE (2009) 33. Shafi, A., et al.: Student outcomes assessment methodology for ABET accreditation: a case study of computer science and computer information systems programs. IEEE Access 7, 13653–13667 (2019) 34. Hadfield, S., et al.: Streamlining computer science curriculum development and assessment using the new ABET student outcomes. In: Proceedings of the Western Canadian Conference on Computing Education. WCCCE 2019. Calgary, AB, Canada, pp. 10:1–10:6 (2019). ISBN 978-1-4503-6715-8. http://doi.acm.org/10.1145/3314994.3325079 35. U˘gur, S., Kurubacak, G.: Technology management through artificial intelligence in open and distance learning. In: Handbook of Research on Challenges and Opportunities in Launching a Technology Driven International University, pp. 338–368. IGI Global (2019) 36. Buzzetto-More, N.A., Alade, A.J.: Best practices in e assessment. J. Inf. Technol. Educ.: Res. 5(1), 251–269 (2006) 37. ABET. ABET Accreditation Board for Engineering and Technology. (2010). Computing Accreditation Commission. Criteria for accrediting computing programs (2010). http://www. abet.org. Accessed 17 Oct 2019 38. Love, T., Cooper, T.: Designing online information systems for portfolio-based assessment: Design criteria and heuristics. J. Inf. Technol. Educ.: Res. 3(1), 65–81 (2004) 39. Alzubaidi, L.: Program outcomes assessment using key performance indicators. In: Proceedings of 62nd ISERD International Conference (2017) 40. Ahuja, R., et al.: Analysis of educational data mining. In: Harmony Search and Nature Inspired Optimization Algorithms, pp. 897–907. Springer (2019) 41. Nachouki, M.: Assessing and evaluating learning outcomes of the information systems program. World 4(4) (2017) 42. Adam, S.: Using learning outcomes. In: Report for United Kingdom Bologna Seminar, pp. 1–2 (2004) 43. Jones, L.G., Price, A.L.: Changes in computer science accreditation. Assoc. Comput. Mach. Commun. ACM 45(8), 99 (2002) 44. Clarke, F., Reichgelt, H.: The importance of explicitly stating educational objectives in computer science curricula. ACM SIGCSE Bull. 35(4), 47–50 (2003) 45. Gowan, A., MacDonald, B., Reichgelt, H.: A configurable assessment information system. 
In: Proceedings of the 7th Conference on Information Technology Education, pp. 77–82. ACM (2006)


46. Rigby, S., Dark, M.: Designing a flexible, multipurpose remote lab for the IT curriculum. In: Proceedings of the 7th Conference on Information Technology Education, pp. 161–164. ACM (2006) 47. Imam, M.H., et al.: Obtaining ABET student outcome satisfaction from course learning outcome data using fuzzy logic. Eurasia J. Math. Sci. Technol. Educ. 13(7), 3069–3081 (2017) 48. Zacharis, N.Z.: A multivariate approach to predicting student outcomes in web enabled blended learning courses. In: The Internet and Higher Education, vol. 27, pp. 44–53 (2015) 49. Cook, C., Mathur, P., Visconti, M.: Assessment of CAC self-study report. In: Proceedings of the 34th Annual Frontiers Education (FIE), vol. 1, pp. T3G/12–T3G/17 (2004)

Modelling the Currency Exchange Rates Using Support Vector Regression Ezgi Deniz Ülker1 and Sadik Ülker2(B) 1 Computer Engineering Department, European University of Lefke, Gemikonagi, Mersin 10,

KKTC, Turkey [email protected] 2 Electrical and Electronics Engineering Department, European University of Lefke, Gemikonagi, Mersin 10, KKTC, Turkey [email protected]

Abstract. In this work, we modelled currency exchange rates using Support Vector Regression (SVR). The currency selected was the Turkish Lira (TRY), and we considered its rates against the United States Dollar (USD), the Euro (EUR), and the British Pound (GBP). The modelling was done for the period from January 2018 to October 2019, with the aim of investigating three different issues. The difference between using ν-regression and ε-regression was investigated, the effect of using different kernel functions was examined, and the effect of using multiple exchange rates together (multi-rate) was compared with modelling only one rate (single-rate). The results showed very successful modelling of the currencies, producing very accurate results with ν-regression, a radial kernel and multi-rate modelling. Keywords: Support vector regression · Economic forecasting · Kernel

1 Introduction

Exchange rate prediction is one of the more difficult types of forecasting and is particularly important in economic studies. Although the rates are classified as noisy, non-stationary and chaotic [1], it is believed that with historical data and observations it is possible to carry out this time series forecasting and to analyze exchange rates successfully, because historical data, which in fact incorporates most of the non-stationary and chaotic behavior, can lead to a strong model for the currency value. In 1976, the auto regressive integrated moving average (ARIMA) technique was developed and has since been widely used for time series forecasting [2]. Weigend, Rumelhart and Huberman used a curve fitting technique with weight elimination in forecasting currency exchange rates [3]. Zhang and Wan used a statistical fuzzy interval neural network for currency exchange rate time series prediction; the study used exchange rates between the US Dollar and three other currencies, the Japanese Yen, British Pound and Hong Kong Dollar [4]. Trafalis and Ince used support vector machines for regression and


applied the method to financial forecasting, specifically stock prediction [5]. Yang, Chan and King used support vector machine regression for volatile stock market prediction, specifically using the regression model to predict the Hang Seng Index [6]. As early as 1993, a multi-layer perceptron network was used by Refenes et al. for predicting exchange rates [7]. Galeschuk recently explored the effect of artificial neural networks in foreign currency exchange rate prediction [8]. Kamruzzaman, Sarker and Ahmad used support vector machine based models for predicting foreign currency exchange rates using different kernel functions [9]. Waheeb and Ghazali recently used a tensor product functional link neural network enhanced with a genetic algorithm to forecast the daily exchange rate for Euro/USD and Japanese Yen/USD [10].

2 Support Vector Regression

The support vector machine is a supervised machine learning algorithm used for classification and regression. The main idea in support vector regression is to find the best possible regression model using the support vectors while tolerating some of the errors that may occur when predicting the result. Support vector regression originates from Vapnik's original work in statistical learning [11]. The regression problem is formulated as a quadratic optimization problem. A kernel function is used in the algorithm to map the input data to a higher dimension, and the regression is done in the transformed space. This process is detailed and explained in the tutorial by Smola and Schölkopf [12]. Neural networks have successfully been applied to many different fields in economics and finance. Zhang and Hu used a neural network model to forecast the GBP/USD exchange rate [13]. Zimmermann used neural networks in modeling FX-markets [14]. Kamruzzaman and Sarker used artificial neural networks to forecast currency rates [15]. Nagpure applied a deep learning model using a support vector regressor and an artificial neural network to predict various currency exchange rates [16]. Support vector regression has been used in many different research works in economics and finance. Xiang-rong et al. used multiple kernel support vector regression for economic forecasting [17]. Chen and Wang used support vector regression with genetic algorithms in forecasting tourism demand [18]. Bahramy and Crone used support vector regression in forecasting exchange rates [19]. Li et al. used a support vector regression network with two layers to give online prediction of financial time series and applied it to the USD/Japanese Yen exchange rate [20].
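For reference, the standard ε-insensitive formulation described in the tutorial by Smola and Schölkopf [12] can be written as the following quadratic program; it is reproduced here only as a reminder of the general technique, since the paper itself does not restate it:

```latex
\min_{w,\,b,\,\xi,\,\xi^{*}} \;\; \frac{1}{2}\lVert w \rVert^{2}
   + C \sum_{i=1}^{\ell} \left( \xi_i + \xi_i^{*} \right)
\quad \text{subject to} \quad
\begin{cases}
  y_i - \langle w, \phi(x_i) \rangle - b \le \varepsilon + \xi_i, \\
  \langle w, \phi(x_i) \rangle + b - y_i \le \varepsilon + \xi_i^{*}, \\
  \xi_i, \, \xi_i^{*} \ge 0,
\end{cases}
```

where φ is the feature map induced by the chosen kernel. In ν-regression the fixed tube width ε is replaced by a parameter ν that bounds the fraction of support vectors, which is the variant compared against ε-regression later in this paper.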

3 Model

In our model, we used the currency exchange rates of the Turkish Lira (TRY) against three other currencies: the United States Dollar (USD), the European Union Euro (EUR) and the British Pound (GBP). The rates were considered between January 2018 and October 2019. The reason for choosing the Turkish Lira was the highly unpredictable behavior of the currency, especially due to economic fluctuations in the country. In our model we took the currency exchange rates for the 1st and 15th day of each month in the indicated time period.


A total of 44 data points were taken. The model used the currency rates for USD/TRY, GBP/TRY and EUR/TRY. Of these data, 40 points were used for training and 4 for testing. In addition, all of the data used in training and testing were recalculated from the obtained model to check the correctness of the model and to see the overall predicted behavior.
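A minimal sketch of this setup is given below, using scikit-learn's NuSVR/SVR with a radial kernel. The file name, the column names, the use of the other two rates as inputs for the multi-rate variant and the choice of which four points are held out are all assumptions for illustration; the paper does not publish its exact feature construction or software.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.svm import SVR, NuSVR

# 44 semi-monthly observations of the three TRY rates (assumed CSV layout).
rates = pd.read_csv("try_rates.csv")           # columns: date, USD, EUR, GBP
X_multi = rates[["USD", "EUR"]].values          # multi-rate inputs for GBP/TRY
y = rates["GBP"].values

# 40 points for training, the remaining 4 for testing, as in the paper.
X_train, X_test = X_multi[:40], X_multi[40:]
y_train, y_test = y[:40], y[40:]

# nu-regression and epsilon-regression with a radial (RBF) kernel.
for name, model in [("nu-regression", NuSVR(kernel="rbf")),
                    ("eps-regression", SVR(kernel="rbf"))]:
    model.fit(X_train, y_train)
    err = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{name}: average error on test points = {err:.6f}")
```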

4 Experimental Results

A support vector regression model was established to test the effectiveness of the modelling in several ways. In short, various regression models were compared. Specifically, a comparison between ν-regression and ε-regression was done, a comparison of different kernels using ν-regression was performed, and finally the prediction of an exchange rate using only its own currency data was compared with a model also using the exchange rates of the other currencies. Although similar examples exist for predicting currency exchange rates, to the authors' knowledge, prediction of an exchange rate using different exchange rates at the same time has not been applied before. The first comparison was done using the radial kernel, comparing the data obtained with ν-regression and ε-regression. The result is shown graphically in Fig. 1.


Fig. 1. GBP/TRY currency rate model using ε-regression and ν-regression.

From the figure it can be observed that both models gave very accurate results, closely following the real values. In terms of numbers, the average error between the real values and the predicted values is 0.079078 for ν-regression and 0.121888 for ε-regression. The next comparison was done to observe the effect of different kernels on predicting the GBP/TRY currency rate. The responses of the different kernels are shown in Fig. 2.



Fig. 2. GBP/TRY currency rate model comparison of different kernels.

Table 1. Average error in test data using different kernels.

Regression     Kernel      Average error
Nu-regression  Radial      0.079078
Nu-regression  Sigmoid     0.758699
Nu-regression  Polynomial  0.419214
Nu-regression  Linear      0.140910

As can be seen in the figure, the better responses were observed using the radial and linear kernels. The average error for each kernel is tabulated in Table 1. The last set of comparisons was done for each of the currency rates separately. In these comparisons the data were modelled in two different ways. The first was the standard approach, using only one currency rate and modelling it with the data set for that currency alone. In the second approach, the data from the three different currency rates were used at the same time. The comparison for each of the currencies was as follows. For the GBP/TRY currency exchange rate, it is clearly observed that when only one currency rate was used, the model was more subject to error; however, when all three currency rates were used in modelling the data, the fit was much better, as shown in Fig. 3.



Fig. 3. GBP/TRY model comparison with multi-rate and single-rate in model.

When the USD/TRY currency rate was considered, the trend was similar to that of the GBP/TRY exchange rate: the multi-rate model gave a noticeably better prediction than the single-rate model. The response is shown in Fig. 4.


Fig. 4. USD/TRY model comparison with multi-rate and single-rate in model.


The EUR/TRY response is shown in Fig. 5. A similar observation to the other two exchange rates was made here: multi-rate modelling produced more accurate results than single-rate modelling.


Fig. 5. EUR/TRY model comparison with multi-rate and single-rate in model.

Table 2 shows the result summary comparing the multi-rate model and the single-rate model.

Table 2. Average error in test data using multi-rate and single-rate modelling.

Regression  Multi-rate modelling  Single-rate modelling
GBP/TRY     0.079078              0.515692
USD/TRY     0.043888              0.138459
EUR/TRY     0.084335              0.156793

5 Conclusions

In this work the aim was to use support vector regression to model currency exchange rates. While this modelling was done, three sets of comparisons also took place. One of the comparisons was to see whether the ε-regression model and the ν-regression model


have distinctive results in predicting the values. It was observed that both the ε-regression model and the ν-regression model gave very similar responses and both seem to work equally well. The second comparison was between different kernels: the radial and linear kernels produced much better models than the sigmoid and polynomial kernels. The third comparison used the data of three different currencies while predicting the model for the currency rate of interest. It was very noticeable that for all the currency rates studied, the multi-rate model with three different currency exchange rates produced more accurate predictions of the real values than the single-rate model. Overall, it can be concluded that support vector regression is a very powerful regression method for predicting currency exchange rates, and the model is even stronger when multiple currency exchange rates are used in the modelling rather than one.

References 1. Yao, J., Tan, C.: A case study on using neural networks to perform technical forecasting of forex. Neurocomputing 34, 79–98 (2000) 2. Box, G.E.P., Jenkins, G.M., Reinsel, G.C., Ljung, G.M.: Time Series Analysis: Forecasting and Control, 5th edn. Wiley, Hoboken (2016) 3. Weigend, A.S., Rumelhart, D.E., Huberman, B.A.: Generalization by weight-elimination applied to currency exchange rate prediction. In: International Joint Conference on Neural Networks (IJCNN), Seattle (1991) 4. Zhang, Y.-Q., Wan, X.: Statistical fuzzy interval neural networks for currency exchange rate time series prediction. Appl. Soft Comput. 7(4), 1149–1156 (2007) 5. Trafalis, T.B., Ince, H.: Support vector machine for regression and applications to financial forecasting. In: Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IEEE, Como, Italy (2000) 6. Yang, H., Chan, L., King, I.: Support vector machine regression for volatile stock market prediction. In: Yin, H., Allinson, N., Freeman, R., Keane, J., Hubbard, S. (eds.) International Conference on Intelligent Data Engineering and Automated Learning. LNCS, vol. 2412, pp. 391–396. Springer, Heidelberg (2002) 7. Refenes, A.N., Azema-Barac, M., Chen, L., Karoussos, S.A.: Currency exchange rate prediction and neural network design strategies. Neural Comput. Appl. 1(1), 46–58 (1993) 8. Galeshchuk, S.: Neural networks performance in exchange rate prediction. Neurocomputing 172, 446–452 (2016) 9. Kamruzzaman, J., Sarker, R.A., Ahmad, I.: SVM based models for predicting foreign currency exchange rates. In: Third IEEE International Conference on Data Mining. IEEE, Melbourne (2003) 10. Waheeb, W., Ghazali, R.: A new genetically optimized tensor product functional link neural network: an application to the daily exchange rate forecasting. Evol. Intell. 12(4), 593–608 (2019) 11. Vapnik, V.N.: An overview of statistical learning theory. IEEE Trans. Neural Netw. 10(5), 988–998 (1999) 12. Smola, A.J., Schölkopf, B.: A tutorial on support vector regression. Stat. Comput. 14, 199–222 (2004)


13. Zhang, G., Hu, M.Y.: Neural network forecasting of the British Pound/US Dollar exchange rate. OMEGA: Int. J. Manag. Sci. 26, 495–506 (1998) 14. Zimmermann, H., Neuneier, R., Grothmann, R.: Multi-agent modeling of multiple FXmarkets by neural networks. IEEE Trans. Neural Netw. 12(4), 735–743 (2001) 15. Kamruzzaman, J., Sarker, R.A.: ANN-based forecasting of foreign currency exchange rates. Neural Inf. Process. – Lett. Rev. 3(2), 49–58 (2004) 16. Nagpure, A.R.: Prediction of multi-currency exchange rates using deep learning. Int. J. Innov. Technol. Explor. Eng. 8(6), 316–322 (2019) 17. Xiang-rong, Z., Long-ying, H., Zhi-sheng, W.: Multiple kernel support vector regression for economic forecasting. In: 2010 International Conference on Management Science and Engineering. IEEE, Melbourne (2010) 18. Chen, K.-Y., Wang, C.-H.: Support vector regression with genetic algorithms in forecasting tourism demand. Tour. Manag. 28(1), 215–226 (2007) 19. Bahramy, F., Crone, S.F.: Forecasting foreign exchange rates using support vector regression. In: 2013 IEEE Conference on Computational Intelligence for Financial Engineering & Economics. IEEE, Singapore (2013) 20. Li, B., Hu, J., Hirasawa, K.: Financial time series prediction using a support vector regression network. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence, Hong Kong, China (2008)

Data Augmentation and Clustering for Vehicle Make/Model Classification

Mohamed Nafzi1(B), Michael Brauckmann1, and Tobias Glasmachers2

1 Facial & Video Analytics, IDEMIA Identity & Security Germany AG, Bochum, Germany {mohamed.nafzi,michael.brauckmann}@idemia.com
2 Institute for Neural Computation, Ruhr-University Bochum, Bochum, Germany [email protected]

Abstract. Vehicle shape information is very important in Intelligent Traffic Systems (ITS). In this paper, we present a way to exploit a training data set of vehicles released in different years and captured under different perspectives. The efficacy of clustering to enhance the make/model classification is also presented. Both steps led to improved classification results and greater robustness. A deeper convolutional neural network based on the ResNet architecture was designed for the training of the vehicle make/model classification. The unequal class distribution of the training data produces an a priori probability. Its elimination, obtained by removing the bias and through hard normalization of the centroids in the classification layer, improves the classification results. A developed application was used to test vehicle re-identification on video data manually, based on make/model and color classification. This work was partially funded under the grant.

Keywords: Vehicle shape classification · Clustering · CNN

1 Introduction

The aim of this work is to improve the vehicle re-identification module based on shape and color classification from our previous work [1], and also to present methods that can help other researchers improve the accuracy of their vehicle classification modules. To reach this goal, we had to overcome several challenges. The appearance of a vehicle varies not only with its make and model but can also differ strongly depending on the year of release and the perspective. For this reason, we created a data augmentation process employing a web crawler that queries vehicles with different models, release years and views, such as front, rear or side. The training data labels contain only make and model information, with no year of release or view information. To obtain a refined underlying representation for the two missing labels, a clustering approach with a hierarchical structure was developed to generate data-driven subcategories for each make and model label pair. In this work, we used a convolutional neural network (CNN) based on the ResNet architecture to train the vehicle make/model classifier. We performed a threshold optimization


for the make/model and color classification to suppress false classifications and detections. Compared with the works of other researchers, our module classifies more classes: it covers most of the known makes and models with release years between 1990 and 2018 worldwide.

2 Related Works

Some research has been performed on make/model classification of vehicles. Most of it operated on a small number of make/models, because it is difficult to get a labeled data set spanning all existing make/models. Manual annotation is almost impossible, because one needs an expert for each make who is able to recognize all its models, and it is a very tedious and time-consuming process. The authors in [11] presented a vehicle make/model classification trained on just 5 classes; three different classifiers were tested, based on a one-class k-nearest-neighbor, a multi-class approach and a neural network, operating on frontal images. The authors in [7] used 3D curve alignment and trained just 6 make/models. The authors in [15] developed a classifier based on a probabilistic neural network using 10 different classes, working on frontal images. The authors in [5] applied image segmentation and combined local and global descriptors to classify 10 classes. The authors in [3] used a CNN-based classifier to classify 13 different make/models. The authors in [8] used symmetrical speeded-up robust features to separate 29 classes. The authors in [14] showed a classification based on 3D models using 36 classes in a Bayesian approach supporting multiple views. The authors in [6] presented a make/model identification of 50 make/models based on oriented contour points; two strategies were tested, a discriminant function combining several classification scores and voting spaces. The authors in [10] used the geometry and the appearance of car emblems from rear-view images to identify 52 make/models; different features and classifications were tested. The authors in [12] investigated two different classification approaches, a k-nearest-neighbour classifier and a naive Bayes classifier, and worked on 74 different classes with frontal images. The authors in [13] developed a make/model classification based on a feature representation for rigid structure recognition using 77 different classes; two distances were tested, the dot product and the Euclidean distance. The authors in [4] tested different methods for make/model classification of 86 different classes on images with a side view; the best one was HoG-RBF-SVM. [16] used 3D boxes of the image with its rasterized low-resolution shape and information about the 3D vehicle orientation as CNN input to classify 126 different make/models. The module of [9] is based on 3D object representations using linear SVM classifiers and was trained on 196 classes. In a real video scene all existing make/models could occur. Considering that there are more than 2000 models worldwide, a make/model classification trained on just a few classes will not succeed in practical applications. The authors in [2] increased the number of trained classes; their module is based on a CNN and was trained on 59 different vehicle makes as well as on 818 different models, and their solution seems to be closer to commercial use. The module developed in our previous work [1] was trained on 1447 different classes and could recognize 137 different vehicle makes as well as 1447 different models with release years between 2016 and


2018. In this paper we performed a data augmentation and clustering. Our current module was trained on 4730 different classes and can recognize 137 different vehicle makes as well as 1447 different models with release years between 1990 and 2018. It shows better results on video data than [1] and [2]. We believe that our module gives an appropriate solution for commercial use.

3 Data Augmentation

We used a web crawler to download data with different, older model years from 1990 to 2018 and with different perspectives. This was an augmentation of our training data. First of all, we removed images with the same content to reduce redundancy, followed by a data cleansing step based on a vehicle quality metric to remove images with bad detections. This data augmentation was an important step towards classifying more classes.

4 Clustering

First, we want to explain why we need clustering. Many vehicles with the same model but a different release year or a different perspective have less similarity than some vehicles with a different model. When all elements with the same make/model but different views and different release years are placed in the same class, the shape feature vectors obtained after training are not very close to their centroids, which often leads to misclassification. Therefore, clustering is necessary, because the labels of the model year and the perspective are not available. We developed a hierarchical clustering method to cluster our data. This method comprises the following processing steps (a minimal sketch follows Fig. 1):

1. Calculation of the distances between all feature vectors within the same make and model category, obtained by the networks described in [1].
2. Iteratively, determination of the elements that form the maximum feature vector density using a given constant threshold, and creation of a new cluster from these elements.
3. In the last step, only classes with a minimum of 20 elements are accepted, to avoid label errors.

This data augmentation and clustering increased the number of our classes from 1447 to 4730. For example, for the Mercedes-Benz C class we had just one class in our previous work [1]; as shown in Fig. 1, seven classes with different release years and/or different perspectives were obtained.

Fig. 1. Examples of the generated classes after data augmentation and clustering by Mercedes-Benz C.
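A minimal sketch of this greedy procedure is given below. It assumes the shape feature vectors of one make/model category have already been extracted as a NumPy array and uses cosine distance; the distance measure, the threshold value and the helper names are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def cluster_category(features: np.ndarray, threshold: float = 0.35,
                     min_size: int = 20) -> list:
    """Greedy clustering of one make/model category, as described above.

    Repeatedly picks the feature vector whose neighbourhood (all vectors
    within `threshold` cosine distance) is densest, turns that neighbourhood
    into a new cluster and removes it; clusters smaller than `min_size`
    are discarded to avoid label errors.
    """
    # Normalise so that cosine distance = 1 - dot product.
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    remaining = list(range(len(feats)))
    clusters = []
    while remaining:
        sub = feats[remaining]
        dist = 1.0 - sub @ sub.T                       # pairwise distances
        neighbours = dist <= threshold
        seed = int(np.argmax(neighbours.sum(axis=1)))  # densest element
        members = [remaining[j] for j in np.where(neighbours[seed])[0]]
        if len(members) >= min_size:
            clusters.append(members)
        member_set = set(members)
        remaining = [i for i in remaining if i not in member_set]
    return clusters
```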


Fig. 2. Density of client and impostor matching scores without applying clustering. Client score is the matching score of two feature vectors with the same model. Impostor score is the matching score of two feature vectors with different model.

Fig. 3. Density of client and impostor matching scores without applying clustering. Separation of client score density in two densities. Client score is the matching score of two feature vectors with the same model. Impostor score is the matching score of two feature vectors with different model.

Figure 2 presents the densities of client and impostor matching scores without applying clustering. Matching of vehicles with the same model but a different cluster often leads to low client scores, as shown in Fig. 3. Same model but different cluster means the same model with different release years or different views. After applying clustering and retraining the shape network, the feature vectors become very close to their centroids, which decreases the number of low client scores, as can be seen in Fig. 4.

Fig. 4. Density of client and impostor matching scores with applying clustering. Client score is the matching score of two feature vectors with the same cluster. Impostor score is the matching score of two feature vectors with different model.

5 CNN-Architectures

We developed, optimized and tuned our standard CNN network for make/model classification. It is based on the ResNet architecture and shows very good results on the controlled data set. Its coding time is 20 ms (CPU, 1 core, i7-4790, 3.6 GHz). This net was trained on 4730 classes using about 4 million images.

6 Removing of a Priori Probability

Our training data does not have an equal class distribution, which leads to an a priori probability. The class distribution of the testing data differs from that of the training data and is usually unknown. For this reason, we removed this a priori probability by removing the bias from the classification layer and by normalizing the centroids. To train the modified network faster, we initialized the convolutions and the first IP-layer with our pre-trained network and defined a new classification layer. In this way the training converges after two or three days instead of three or four weeks.
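A minimal PyTorch-style sketch of such a bias-free classification layer with hard-normalized centroids is shown below. It is an illustration of the idea rather than the authors' implementation; the feature dimension and the way the layer is plugged into the network are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormalizedClassifier(nn.Module):
    """Classification layer without bias and with unit-norm centroids.

    Each row of `weight` is the centroid of one class; normalizing the rows
    removes the influence of unequal class frequencies (the a priori
    probability) picked up during training.
    """
    def __init__(self, feature_dim: int = 512, num_classes: int = 4730):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feature_dim))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        centroids = F.normalize(self.weight, dim=1)   # hard normalization
        return features @ centroids.t()               # logits, no bias term

# Usage: replace the last fully connected layer of the pre-trained ResNet
# with this module and fine-tune briefly, as described in the text.
logits = NormalizedClassifier()(torch.randn(8, 512))
```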


Fig. 5. Error rates of controlled and video data set with different qualities by color, make and make/model classification. This plot helps to set a threshold. FAR: False Acceptance Rate. FRR: False Reject Rate.

7 Threshold Optimization

In practice, it is recommended to set a threshold for the classification. In this way, false classifications can be eliminated; for example, elements with unknown classes, bad image quality, occlusion or false detections lead to misclassification. We calculated the error rates shown in Fig. 5 using different testing data with different qualities in order to optimize the thresholds for the make/model, make and color classification. We used only similarities with rank 1.
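For illustration, error-rate curves like those in Fig. 5 can be computed from the rank-1 matching scores as sketched below. The arrays of client and impostor scores are assumed inputs (random placeholders here), and choosing the crossover point as the operating threshold is one common convention rather than the authors' stated rule.

```python
import numpy as np

def far_frr(client_scores: np.ndarray, impostor_scores: np.ndarray,
            thresholds: np.ndarray):
    """False acceptance / false rejection rates over a grid of thresholds."""
    far = [(impostor_scores >= t).mean() for t in thresholds]  # accepted impostors
    frr = [(client_scores < t).mean() for t in thresholds]     # rejected clients
    return np.array(far), np.array(frr)

thresholds = np.linspace(0.0, 1.0, 101)
# Client / impostor rank-1 scores would come from the evaluation data;
# the beta samples below are placeholders only.
client = np.random.beta(8, 2, size=1000)
impostor = np.random.beta(2, 8, size=1000)
far, frr = far_frr(client, impostor, thresholds)
operating_point = thresholds[np.argmin(np.abs(far - frr))]  # near the EER
print(f"suggested threshold: {operating_point:.2f}")
```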

8 Experiments Related to the Clustering of the Training Data and to the Removing of a Priori

We tested two controlled data sets with good and bad quality. Each data set contains 3306 images with different views and about two images per class. Additionally, we tested the Stanford data set and video data. The following steps led to the improvement of our classification module:

– Using a large scale data set for training.
– Cleansing of the data.
– Clustering.
– Optimization of our CNN-Net.
– Removing of a priori.
– Using the best-shot for video data.

8.1 Experiments on Controlled Data

We tested the effect of removing the a priori probability. Figures 6 and 7 show the results on the controlled data sets. The comparison of the results of our make/model classification with the state of the art on the Stanford data set is presented in Table 1.

Fig. 6. Results of make/model classification on controlled data set with good quality and different views. ResNet-1 is the trained ResNet in [1]. ResNet-2 is ResNet-1 + removing of a priori probability.


Fig. 7. Results of make/model classification on controlled data set with bad quality and different views. ResNet-1 is the trained ResNet in [1]. ResNet-2 is ResNet-1 + removing of a priori probability. Table 1. Accuracy of our shape module and of the shape module presented by [2] on Stanford data set.

                  Our shape module   [2]
Accuracy (top1)   93.8%              93.6%

8.2 Experiments on Video Data

The data augmentation and the clustering lead to better classification than [1], especially on video data, since our video testing data contain some classes that are not included in our old training data [1]. The results in Fig. 8 show the effect of the data augmentation with and without the use of clustering. Without clustering, all elements with the same make/model are placed in the same class during training. This is not beneficial, because vehicles with the same make/model but different model years and/or different views look different. Therefore, the clustering of the data was necessary. The selection and classification of the best-shot ROI image gives better results than classifying each detection. The tested video data 1 is very challenging because it contains images with some views which are not included in our training data, such as top/frontal or top/rear. The results in Fig. 9 show the effect of the data augmentation using clustering. We used the API referred to in [2] (https://www.sighthound.com/products/cloud) to test our BCM-Parking1 video data. The API detected the vehicles correctly. The comparison of our results with the results of [2] on this video data set, shown in Table 2, is absolutely fair because we did not tune our training on this data set. It contains 111 best-shot images.
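The best-shot selection mentioned above can be illustrated with a few lines of code. The sketch below keeps, for every track-id, only the detection with the highest quality score and classifies that single ROI; the dictionary keys and the quality metric are assumptions for illustration, not the authors' implementation.

```python
def select_best_shots(detections):
    """Pick one best-shot ROI per track-id: the detection with the highest
    quality score, which is then classified instead of every detection.

    `detections` is assumed to be an iterable of dicts with keys
    'track_id', 'quality' and 'roi'; the key names are illustrative.
    """
    best = {}
    for det in detections:
        tid = det["track_id"]
        if tid not in best or det["quality"] > best[tid]["quality"]:
            best[tid] = det
    return list(best.values())
```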


Fig. 8. Results of make classification on traffic video data set with hard quality and different views. ResNet-1 is the trained ResNet in [1]. ResNet-3 is the new trained ResNet using data augmentation (without clustering). ResNet-4 is the new trained ResNet using data augmentation and clustering.

Fig. 9. Results of make/model and make classification on traffic video data set with medium quality. Sample images shown in Fig. 10. ResNet-1 is the trained ResNet in [1]. ResNet-4 is the new trained ResNet using data augmentation and clustering.

8.3 Tool for Vehicle Re-identification on Video Data

To validate the re-identification module, we developed a stand-alone tool, as shown in Figs. 10 and 11. This tool allows the re-identification of a vehicle in a video data set with the help of its make or its model. Within the search, the combination of shape and color is possible. We used the color classification module of the previous work [1]. The results show the best-shot images sorted by their classification scores. We used a threshold to remove false classifications and false detections. Selecting a best-shot image allows all ROIs with the same track-id to be shown. Further, either the original or the ROI images can be shown. We tested two video data sets.


Table 2. Accuracy of our shape module and of the shape module presented by [2] on our BCM-Parking1 video data. Sample images shown in Fig. 11. ResNet-1 is the trained ResNet in [1]. ResNet-4 is the new trained ResNet using data augmentation and clustering.

                                   ResNet-1  ResNet-4  [2]
Make/Model classification (top1)   29.0%     71.1%     20.5%
Make classification (top1)         78.9%     56.4%     42.1%

Fig. 10. Vehicle re-identification based on shape (Volkswagen/Scirocco) and color (white) classification (video data).


Fig. 11. Vehicle re-identification based on shape (SEAT/Ibiza) and color (white) classification (video data).

9 Conclusion and Future Work

We presented two methods that will help other researchers improve their vehicle shape classification. The first method, removing the a priori probability, generally brings a 1% to 2% absolute improvement and is very easy to apply. The second method, clustering, is necessary and very useful when data augmentation is done with vehicles of different release years and/or different perspectives and only the make and model labels are available. In practice we recommend setting a threshold in order to eliminate bad classifications and bad detections. At the moment we are focussing on vehicle re-identification based on the shape/color feature vector. In this way we could re-identify each vehicle


without knowing its make/model or its color, because the features of vehicles with the same shape and color will be close to each other and will produce higher matching scores if they have similar views. With this method, a probe image of the searched vehicle is needed for the re-identification.

Acknowledgment.
– Victoria: Funded by the European Commission (H2020), Grant Agreement number 740754, for Video analysis for Investigation of Criminal and Terrorist Activities.
– Florida: Funded by the German Ministry of Education and Research (BMBF).

References 1. Nafzi, M., Brauckmann, M., Glasmachers, T.: Vehicle shape and color classification using convolutional neural network. CoRR, abs/1905.08612, May 2019. http:// arxiv.org/abs/1905.08612 2. Dehghan, A., Masood, S.Z., Shu, G., Ortiz, E.G.: View independent vehicle make, model and color recognition using convolutional neural network. CoRR, abs/1702.01721 (2017). http://arxiv.org/abs/1702.01721 3. Satar, B., Dirik, A.E.: Deep learning based vehicle make-model classification. CoRR, abs/1809.00953 (2018). http://arxiv.org/abs/1809.00953 4. Boyle, J., Ferryman, J.: Vehicle subtype, make and model classification from side profile video. In: 2015 12th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–6, August 2015. https://doi.org/10.1109/ AVSS.2015.7301783 5. AbdelMaseeh, M., Badreldin, I., Abdelkader, M.F., El Saban, M.: Car make and model recognition combining global and local cues. In: 2012 21st International Conference on Pattern Recognition (ICPR), ICPR 2012, pp. 910–913, November 2013. doi.ieeecomputersociety.org/ 6. Clady, X., Negri, P., Milgram, M., Poulenard, R.: Multi-class vehicle type recognition system, vol. 5064, pp. 228–239, July 2008. https://doi.org/10.1007/978-3540-69939-2 22 7. Ramnath, K., Sinha, S.N., Szeliski, R., Hsiao, E.: Car make and model recognition using 3D curve alignment. In: IEEE Winter Conference on Applications of Computer Vision, pp. 285–292, March 2014. https://doi.org/10.1109/WACV.2014. 6836087 8. Hsieh, J., Chen, L., Chen, D.: Symmetrical surf and its applications to vehicle detection and vehicle make and model recognition. IEEE Trans. Intell. Transp. Syst. 15(1), 6–20 (2014). https://doi.org/10.1109/TITS.2013.2294646 9. Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3D object representations for finegrained categorization. In: 2013 IEEE International Conference on Computer Vision Workshops, pp. 554–561, December 2013. https://doi.org/10.1109/ICCVW. 2013.77 10. Llorca, D.F., Col´ as, D., Daza, I.G., Parra, I., Sotelo, M.A.: Vehicle model recognition using geometry and appearance of car emblems from rear view images. In: 17th International IEEE Conference on Intelligent Transportation Systems (ITSC), pp. 3094–3099, October 2014. https://doi.org/10.1109/ITSC.2014.6958187 11. Munroe, D.T., Madden, M.G.: Multi-class and single-class classification approaches to vehicle model recognition from images (2005)


12. Pearce, G., Pears, N.: Automatic make and model recognition from frontal images of cars. In: 2011 8th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 373–378, August 2011. https://doi.org/10.1109/ AVSS.2011.6027353 13. Petrovic, V., Cootes, T.F.: Analysis of features for rigid structure vehicle type recognition. In: Proceedings of the British Machine Vision Conference, BMVA, United Kingdom, vol. 2 (2004) 14. Prokaj, J., Medioni, G.: 3-D model based vehicle recognition. In: 2009 Workshop on Applications of Computer Vision (WACV), pp. 1–7, December 2009. https:// doi.org/10.1109/WACV.2009.5403032 15. Psyllos, A., Anagnostopoulos, C.N., Kayafas, E.: Vehicle model recognition from frontal view image measurements, vol. 33, pp. 142–151, Amsterdam, The Netherlands. Elsevier Science Publishers B.V., February 2011. https://doi.org/10.1016/j. csi.2010.06.005 16. Sochor, J., Herout, A., Havel, J.: BoxCars: 3D boxes as CNN input for improved fine-grained vehicle recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3006–3015 (2016)

A Hybrid Recommender System Combing Singular Value Decomposition and Linear Mixed Model

Tianyu Zuo1, Shenxin Zhu1,2(B), and Jian Lu3(B)

1 Department of Mathematics, Xi’an Jiaotong-Liverpool University, Suzhou 215123, Jiangsu, People’s Republic of China [email protected]
2 Laboratory for Intelligent Computing and Finance Technology, Xi’an Jiaotong-Liverpool University, Suzhou 215123, Jiangsu, People’s Republic of China [email protected]
3 Guangdong Institute of Aeronautics and Astronautics Equipment & Technology, Zhuhai, People’s Republic of China [email protected]

Abstract. We explain the basic idea of the linear mixed model (LMM), including parameter estimation and model selection criteria. Moreover, the algorithm of singular value decomposition (SVD) is also described. After introducing the related R packages that implement these two models, we compare the mean absolute errors of LMM and SVD when different numbers of historical ratings are given. Then a hybrid recommender system is developed to combine these two models. Such a system is shown to have higher accuracy than a single LMM or SVD model, and it might have practical value in different fields in the future.

Keywords: Hybrid system · Singular value decomposition · Linear mixed model

1 Introduction

With the emergence of electronic commerce and Big Data, a variety of machine learning algorithms and models, such as collaborative filtering (CF), singular value decomposition (SVD) and heterogeneous information networks, have been put into practice to predict users' evaluations of a certain product. The pervasive application of recommender systems not only provides much convenience to people's consumption choices in their daily lives, but also helps internet corporations gain a lot of commercial interest through successful advertising strategies. In light of the huge practical value of recommender systems (RS), much research has been done to analyze users' traits and historical ratings. In an attempt to deal with cold start problems, where users' historical ratings are not provided, Gao [1] and Chen [2,3] use linear mixed models (LMM)


to make predictions about movie ratings, based on new users' personal information such as age, gender and occupation. Lu [4] and Wang [16] create several meta-paths to combine different movies' features, such as directors and genres, with users' historical ratings, following the basic principles of heterogeneous information networks. In addition, Koren's [5] article introduces a new SVD-based latent factor model which largely improves the quality of the original recommender system. However, when it comes to the prediction accuracy of ratings, those RS models perform differently depending on the size of the available historical rating data. A traditional recommender system based on CF or SVD may not predict well when not enough user rating data is given, and it is really necessary to improve those RS so that they retain their prediction accuracy when confronted with data sparsity. This article aims to show how the accuracy of traditional algorithms like SVD plummets when the number of users' historical ratings is small, while introducing a new hybrid system combining SVD and LMM which shows better quality than the original RS.

2 Methodology: Linear Mixed Model

2.1 Introduction of LMM

The linear mixed model (LMM) is a special case of the hierarchical linear model (HLM), which divides all the observations into multiple levels according to different grouping factors, with each observation belonging to one certain level of each factor [19]. What makes LMM different from an ordinary HLM is that it introduces the concepts of 'fixed' and 'random' effects when analyzing those grouping factors. A fixed factor is a categorical factor which includes all the levels that we are interested in during our research, whereas random effects are those factors whose levels are just randomly sampled from the entire population, so that not all possible levels may appear in the data set we have [6]. The analysis of random effects may not be a main objective of our experiments; their existence just helps to account for the variation of the dependent values in the LMM. The equation of the LMM with respect to a given level i of a grouping factor is shown below:

$$ y_i = X_i \beta + Z_i b_i + \varepsilon_i \qquad (1) $$

Here, yi is a ni × 1 vector of responses, β and bi are p×1 vector of fixed effect and q×1 vector of random effect (i.e. the number of fixed factors and random factors are p and q respectively) εi represents random errors of all the observations. Xi and Zi are ni ×p and ni ×q design matrices which show whether each observation is in a certain level of group factors. It should be noted that bi and εi are random variables which follow bivariate normal distribution [7]: bi ∼ Nq (0, α2 D), εi ∼ Nni (0, α2 Ri )

(2)


σ²D and σ²R_i are the covariance matrices of b_i and ε_i, and σ² is an unknown scale parameter. Moreover, although the observations in y_i in equation (1) are not necessarily independent of each other, the vector b_i is assumed to be independent of ε_i. Combining all the observations at the different levels (i.e., n = Σ_{i=1}^{N} n_i), we obtain the classical formula for the LMM:

y = Xβ + Zb + ε   (3)

where

b ∼ N_{Nq}(0, σ²D̄), ε ∼ N_n(0, σ²R̄)   (4)

In equations (3) and (4), b = (b_1', b_2', ..., b_N')' and ε = (ε_1', ε_2', ..., ε_N')'. D̄ and R̄ are extended block-diagonal matrices with D and R_i (i = 1, 2, ..., N) positioned on their diagonals, respectively.

2.2 Estimation of Parameters

In order to obtain the parameters β, D and R of the linear mixed model, some estimation techniques are needed. First, for simplicity, the unknown elements of the covariance matrices D and R are collected into parameter vectors θ_D and θ_R, respectively. Then maximum likelihood estimation (MLE) is applied to estimate the parameters. As its name implies, MLE finds the values of β, θ_D and θ_R at which the likelihood function of the LMM attains its maximum. Since the observations y_i are normally distributed, the likelihood function can be written as:

L_ML(β, σ², θ) = ∏_{i=1}^{N} (2πσ² det(V_i))^{−1/2} exp(−(y_i − X_i β)² / (2σ² det(V_i)))
             = (2π)^{−N/2} (σ²)^{−N/2} ∏_{i=1}^{N} det(V_i)^{−1/2} exp(−(y_i − X_i β)² / (2σ² det(V_i)))   (5)

where V_i = Z_i D(θ_D) Z_i' + R_i(θ_R). Taking the logarithm and ignoring the constant term (2π)^{−N/2}, the log-likelihood is given by:

l_ML(β, σ², θ) = −(N/2) log σ² − (1/2) Σ_{i=1}^{N} log[det(V_i)] − (1/(2σ²)) Σ_{i=1}^{N} (y_i − X_i β)' V_i^{−1} (y_i − X_i β).   (6)

When θ is fixed and equation (6) is maximized with respect to β, the estimate of β is obtained:

β̂(θ) = (Σ_{i=1}^{N} X_i' V_i^{−1} X_i)^{−1} Σ_{i=1}^{N} X_i' V_i^{−1} y_i.   (7)


Then, plugging (7) into (6) and maximizing the expression with respect to σ², the estimate of σ² and the corresponding log-likelihood function of θ are:

σ̂²_ML(θ) = Σ_{i=1}^{N} r_i' V_i^{−1} r_i / n.   (8)

l*_ML(σ², θ) = −(N/2) log(σ²) − (1/2) Σ_{i=1}^{N} log[det(V_i)] − (1/(2σ²)) Σ_{i=1}^{N} r_i' V_i^{−1} r_i.   (9)

where r_i = r_i(θ) = y_i − X_i β̂(θ). It should be noted that the ML estimate of σ² is biased, since both σ̂²_ML and l*_ML(σ², θ) depend heavily on the estimate of β. In light of this, an improved method called restricted maximum likelihood estimation (REML) is used. The detailed idea is provided in Burzykowski's book [6]; the expressions for σ̂²_REML and l*_REML(θ) are as follows:

σ̂²_REML = Σ_{i=1}^{N} r_i' V_i^{−1} r_i / (n − p).   (10)

l*_REML(θ) = −((n − p)/2) log(Σ_{i=1}^{N} r_i' V_i^{−1} r_i) − (1/2) Σ_{i=1}^{N} log[det(V_i)] − (1/2) log[det(Σ_{i=1}^{N} X_i' V_i^{−1} X_i)].   (11)
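For a concrete illustration of ML and REML fitting, the minimal R sketch below uses the lme4 package; the data frame ratings_df and its columns (rating, age, gender, user) are hypothetical placeholders, not the data set analyzed later in this paper.

# Minimal sketch: fitting an LMM in R with lme4 (hypothetical movie-rating data).
library(lme4)

# Fixed effects: age and gender; a random intercept for each user (grouping factor).
fit_reml <- lmer(rating ~ age + gender + (1 | user), data = ratings_df)               # REML (default)
fit_ml   <- lmer(rating ~ age + gender + (1 | user), data = ratings_df, REML = FALSE) # ML

summary(fit_reml)   # fixed-effect estimates and variance components
VarCorr(fit_reml)   # estimated random-effect variances/covariances and residual standard deviation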

2.3 Model Selection Criteria

After several LMM models have been built, it is essential to choose the optimal model, i.e., the one that fits the data well while keeping a relatively simple structure. Two main criteria for comparing LMM models are the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) [3]:

AIC = 2p − 2 l(Θ̂),   (12)

BIC = p · log n − 2 l(Θ̂),   (13)

where p is the number of parameters, n is the number of observations in the data set, and l(Θ̂) is the maximized log-likelihood for the parameter vector. The lower the AIC and BIC values, the better the LMM model. The main difference between AIC and BIC is that the latter places more emphasis on model simplicity, so BIC imposes a larger penalty on a complex model when it is trained on a big data set.
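Continuing the hypothetical lme4 sketch above, AIC and BIC for ML-fitted candidate models can be obtained directly in R; the simpler candidate dropping gender is again purely illustrative.

fit_ml2 <- update(fit_ml, . ~ . - gender)   # simpler candidate without 'gender'
AIC(fit_ml, fit_ml2)                        # 2p - 2*logLik for each model; lower is better
BIC(fit_ml, fit_ml2)                        # p*log(n) - 2*logLik for each model; lower is better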


Moreover, when deciding which parameters to include in the model, we have to test whether a particular factor is statistically significant. A common way is the likelihood ratio test (LRT). Its main principle is to compare the likelihood functions of a nested model and a reference model (i.e., all the parameters of the nested model are included in the reference model) [6]. The distribution of the test statistic is shown in the following equation:

−2 log(L_nested / L_reference) = −2 log(L_nested) − (−2 log(L_reference)) ∼ χ²_df   (14)

L_nested and L_reference stand for the likelihood functions of the nested and reference models; minus twice the log of their ratio follows a chi-square distribution with df degrees of freedom, where df is the difference between the numbers of parameters in the reference and nested models. If the p-value of the LRT is small enough (i.e., smaller than the significance level, normally 10%), then we can conclude that the tested factor is statistically significant and should be retained in the model. For more details on the computing issues, the reader is referred to our work [17–21].
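Reusing the hypothetical fits fit_ml and fit_ml2 from the sketches above, an LRT for a single factor can be carried out with anova(); for models fitted by REML, anova() refits them by maximum likelihood before comparing.

# LRT: is 'gender' statistically significant?
anova(fit_ml2, fit_ml)   # prints AIC, BIC, the chi-square statistic, df and the p-value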

3 Methodology: Singular Value Decomposition

3.1 Introduction of SVD

Singular value decomposition (SVD) is a matrix factorization model that maps both users and items to a latent factor space [8]. Consider a rating matrix R_{m×n} whose rows stand for users and whose columns represent items. Each element r_ij of the matrix is the rating of user i on item j. Some positions in R are of course vacant, and these are the values we want to predict. The SVD algorithm aims to find two matrices P_{k×n} and Q_{k×m} such that

R ≈ Q^T P = R̂   (15)

Here k is the number of latent factors we consider, and the complexity of the SVD model increases as k grows larger. R̂ is a full matrix whose element r̂_ij is the predicted value of the rating r_ij. The predicted rating r̂_ij can be expressed as follows:

r̂_ij = q_i^T p_j   (16)

where q_i and p_j in Eq. (16) are the i-th and j-th columns of Q and P, respectively. In particular, P and Q are obtained when the regularized squared error over the set K of known ratings attains its minimum [8]:

min_{P,Q} Σ_{(i,j)∈K} [(r_ij − q_i^T p_j)² + λ(‖q_i‖² + ‖p_j‖²)]   (17)

In order to solve the minimization problem in Eq. (17), several techniques such as gradient descent have been used; the detailed ideas are included in the articles [8] and [9]. Lastly, the main difference between SVD and LMM is that the prediction of the former largely depends on users' historical ratings, while the latter depends on user and item features such as gender, age and genre.
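To illustrate the gradient-descent idea, the following is a minimal R sketch of stochastic gradient descent for the regularized objective in Eq. (17); the function name sgd_mf and all tuning values are our own toy choices, not the implementation used in the packages cited above.

# Toy stochastic gradient descent for Eq. (17).
# R: m x n rating matrix with NA for missing entries (rows = users, columns = items).
sgd_mf <- function(R, k = 10, lambda = 0.1, lr = 0.01, epochs = 50) {
  m <- nrow(R); n <- ncol(R)
  Q <- matrix(rnorm(k * m, sd = 0.1), nrow = k)    # k x m user factors (columns = users)
  P <- matrix(rnorm(k * n, sd = 0.1), nrow = k)    # k x n item factors (columns = items)
  known <- which(!is.na(R), arr.ind = TRUE)        # the set K of observed ratings
  for (epoch in seq_len(epochs)) {
    for (idx in sample(nrow(known))) {             # visit observed ratings in random order
      i <- known[idx, 1]; j <- known[idx, 2]
      qi <- Q[, i]; pj <- P[, j]
      e <- R[i, j] - sum(qi * pj)                  # prediction error r_ij - q_i^T p_j
      Q[, i] <- qi + lr * (e * pj - lambda * qi)   # gradient step with L2 shrinkage
      P[, j] <- pj + lr * (e * qi - lambda * pj)
    }
  }
  list(Q = Q, P = P, Rhat = t(Q) %*% P)            # full predicted matrix R_hat = Q^T P
}

Each observed rating pulls q_i and p_j in the direction that reduces the prediction error, while λ shrinks the factors to avoid overfitting.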

3.2 Evaluation of SVD by Recommenderlab

Recommenderlab is an R package for building and evaluating a variety of recommendation algorithms, such as collaborative filtering and SVD [10]. In order to evaluate an SVD model, we first use the function evaluationScheme to split the whole data set into two parts: a training set and a test set. Normally the ratio of their sizes is 9:1. The related code snippet is shown below: eval_sets
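The listing above is truncated in the source; as a hedged sketch of a typical recommenderlab evaluation pipeline (using the MovieLense data shipped with the package and assumed values for given and goodRating, which are not taken from the paper):

library(recommenderlab)
data(MovieLense)   # example realRatingMatrix shipped with recommenderlab

# 90% training / 10% test split; 'given' ratings per test user are revealed,
# the rest are held out; ratings >= goodRating count as positive.
eval_sets <- evaluationScheme(MovieLense, method = "split", train = 0.9,
                              given = 10, goodRating = 4)

rec  <- Recommender(getData(eval_sets, "train"), method = "SVD")
pred <- predict(rec, getData(eval_sets, "known"), type = "ratings")
calcPredictionAccuracy(pred, getData(eval_sets, "unknown"))   # RMSE, MSE, MAE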