Lecture Notes in Networks and Systems 711
Kohei Arai Editor
Intelligent Computing Proceedings of the 2023 Computing Conference, Volume 1
Series Editor Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Advisory Editors Fernando Gomide, Department of Computer Engineering and Automation—DCA, School of Electrical and Computer Engineering—FEEC, University of Campinas—UNICAMP, São Paulo, Brazil Okyay Kaynak, Department of Electrical and Electronic Engineering, Bogazici University, Istanbul, Türkiye Derong Liu, Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, USA Institute of Automation, Chinese Academy of Sciences, Beijing, China Witold Pedrycz, Department of Electrical and Computer Engineering, University of Alberta, Alberta, Canada Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland Marios M. Polycarpou, Department of Electrical and Computer Engineering, KIOS Research Center for Intelligent Systems and Networks, University of Cyprus, Nicosia, Cyprus Imre J. Rudas, Óbuda University, Budapest, Hungary Jun Wang, Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong
The series “Lecture Notes in Networks and Systems” publishes the latest developments in Networks and Systems—quickly, informally and with high quality. Original research reported in proceedings and post-proceedings represents the core of LNNS. Volumes published in LNNS embrace all aspects and subfields of, as well as new challenges in, Networks and Systems. The series contains proceedings and edited volumes in systems and networks, spanning the areas of Cyber-Physical Systems, Autonomous Systems, Sensor Networks, Control Systems, Energy Systems, Automotive Systems, Biological Systems, Vehicular Networking and Connected Vehicles, Aerospace Systems, Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power Systems, Robotics, Social Systems, Economic Systems and other. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution and exposure which enable both a wide and rapid dissemination of research output. The series covers the theory, applications, and perspectives on the state of the art and future developments relevant to systems and networks, decision making, control, complex processes and related areas, as embedded in the fields of interdisciplinary and applied sciences, engineering, computer science, physics, economics, social, and life sciences, as well as the paradigms and methodologies behind them. Indexed by SCOPUS, INSPEC, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science. For proposals from Asia please contact Aninda Bose ([email protected]).
Editor Kohei Arai Faculty of Science and Engineering Saga University Saga, Japan
ISSN 2367-3370 ISSN 2367-3389 (electronic) Lecture Notes in Networks and Systems ISBN 978-3-031-37716-7 ISBN 978-3-031-37717-4 (eBook) https://doi.org/10.1007/978-3-031-37717-4 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
With profound pride and privilege, we present the proceedings of the Computing Conference 2023. It was held over two days, from 22 to 23 June 2023, in London, UK, in hybrid mode. The conference was hugely successful, attended by 200 delegates from more than 60 countries across the globe. It covered a wide range of topics, including the Internet of Things, Artificial Intelligence, Ambient Intelligence, e-Learning and Machine Vision. The conference provided a coveted platform for renowned and budding researchers and industry experts to present their innovative and insightful research. The synergy of studies by academia and industry experts will give a great thrust to technological advancement. The conference had four keynote addresses, paper presentations and engaging networking breaks, which allowed delegates to build long-term associations. We received 539 paper submissions, out of which we selected 193 papers on the criteria of originality, applicability and presentation. The selected papers provide a vast pool of knowledge and expertise in solving routine, repetitive and rigorous tasks. They are also a window onto future living trends. The studies also offer important threads for future research and beckon bright minds to venture into those fields. The conference, without doubt, ignited a spark of great interest amongst its distinguished audience. The success of the conference would not have been possible without the contribution of many people. The key stakeholders were the authors, who contributed such thought-provoking studies. The arduous task of review and evaluation by the Technical Committee members cannot be overlooked. The session chairs' role was noteworthy. We extend our heartfelt gratitude to all of the above contributors.
This note of thanks would be incomplete without mention of our esteemed keynote speakers, who enthralled everyone with their unique research. The organizing committee's efforts cannot go unnoticed, as they seamlessly managed such a huge event, and in hybrid mode at that. Our special thanks to them as well. We have sincerely endeavoured to publish the cherry-picked studies for our avid scientific readers. The encouraging response from our authors, participants and readers is indeed our dose of motivation. We hope to continue bringing the most unique and path-breaking research in future as well, with your enthusiastic support.

Kohei Arai
Contents
Bit-Level Operation-Based MAC Unit for Vector Multiplications
Feng Yan, Xiaochen Wang, and Tiantai Deng

New Routines for Faster Balancing of AVL Trees
Orieh Destiny Anyiawe, Austin Ramsey, and David Fawcett

Tensor Algebra on an Optoelectronic Microchip
Sathvik Redrouthu and Rishi Athavale

Evaluation of Accuracy: A Comparative Study Between Finger Pointing and Stylus Pointing Using Mid-Air Input
Zeeshan Haider Malik, Bassam Siddiqui, Afsheen Sabir, and Ismail Afzal

Exploring the Application of Gamification in the Software Development Process
Lutendo Lesley and Ernest Mnkandla

FCA-SAPO: A New Comprehensive Fog Computing Adoption Model for Saudi Arabian Public Organisations
Mohammed Alyami, Natalia Beloff, and Martin White

A Review of Computational Load-Balancing for Mobile Edge Computing
Michael Wilson, Henry Nunoo-Mensah, and Kwame Osei Boateng

Simultaneous Estimation Method of Mutually Correlated Geophysical Parameters and Its Application of Aerosol Parameter Estimation
Kohei Arai and Xing Ming Lian

Petri Net Parallel Computing Theory and Applications
Marshall Rawson and Michael G. Rawson

Computing the Performance of a New Adaptive Sampling Algorithm Based on The Gittins Index in Experiments with Exponential Rewards
James K. He, Sofía S. Villar, and Lida Mavrogonatou

Towards Data-Effective Educational Question Generation with Prompt-Based Learning
Yongchao Wu, Jalal Nouri, Beáta Megyesi, Aron Henriksson, Martin Duneld, and Xiu Li

Building an Artist’s Profile in Javascript: Integrating Data Analytics and the New York Metropolitan Museum of Art Open Dataset to Visualize Elements of an Artist’s Life
Suzanna Schmeelk

From Traditional Data Models to Blockchain Technology: A Polyglot Persistence Approach to Store the Electronic Health Record
André Araújo, Henrique Couto, Valéria Times, and Rendrikson Soares

On Some Confidence Intervals for Estimating the Population Process Capability Index Cp: An Empirical Comparison
B. M. Golam Kibria and Shipra Banik

Learning from Few Examples with Nonlinear Feature Maps
Ivan Y. Tyukin, Oliver Sutton, and Alexander N. Gorban

Retention of Computing Students in a London-Based University During the Covid-19 Pandemic Using Learned Optimism as a Lens: A Statistical Analysis in R
Alexandros Chrysikos, Indrajitrakuraj Ravi, Dimitrios Stasinopoulos, Robert Rigby, and Stephen Catterall

Abbreviation Disambiguation: A Review of Modern Techniques to Improve Machine Reading Comprehension
Vince Sing Choi and Kazem Taghva

On the Use of Generative Adversarial Networks to Generate Face Images from Voice Neural Embeddings
Christian Salamea-Palacios, Edison Zumba-Narváez, and Fernando Zumba-Narváez

The Active Nonsmooth Manifolds of a Neural Network Classifier: A Tool for Confidence Assessment
Stéphane Chrétien, Volodimir Mitarchuk, and Julien Velcin

Benchmarking TPU and GPU for Stock Price Forecasting Using LSTM Model Development
T. O. Kehinde, S. H. Chung, and Felix T. S. Chan

Early Plant Disease Detection Using Infrared and Mobile Photographs in Natural Environment
Malithi De Silva and Dane Brown

Finding Eulerian Tours in Mazes Using a Memory-Augmented Fixed Policy Function
Mahrad Pisheh Var, Michael Fairbank, and Spyridon Samothrakis

Deep Learning Based Shadow Removal: Target to Current Methodology Flaws
Shi-Jinn Horng and Cheng-En Zhuang

Machine Learning Techniques for Accurately Detecting the DNS Tunneling
Mouhammd Alkasassbeh and Mohammad Almseidin

Gradient Descent-Based Optimization Algorithms for Batch-Normalized Convolutional Neural Networks
Charles Usigbe and Xiao Perry

Landslide Prediction Using Multi-Layer Perceptron Model
Geetanjali Mahamunkar, Arvind Kiwelekar, and Laxman Netak

Amenable Sparse Network Investigator
Saeed Damadi, Erfan nouri, and Hamed Pirsiavash

Comparison of Adversarial and Non-Adversarial LSTM Music Generative Models
Moseli Mots’oehli, Anna Sergeevna Bosman, and Johan Pieter De Villiers

Deep Reinforcement Learning for Heat Pump Control
Tobias Rohrer, Lilli Frison, Lukas Kaupenjohann, Katrin Scharf, and Elke Hergenröther

Efficient Training of Foosball Agents Using Multi-agent Competition
Adriatik Gashi, Elke Hergenröther, and Gunter Grieser

Managing Expectations of Energy and Technology Transitions: The Role of Observation in Stability and Instability
J. Kasmire

Encounter-Based Density Approximation Using Multi-step and Quantum-Inspired Random Walks
Robert S. Wezeman, Niels M. P. Neumann, Frank Phillipson, and Robert E. Kooij

Self-organizing and Load-Balancing via Quantum Intelligence Game for Peer-to-Peer Collaborative Learning Agents and Flexible Organizational Structures
Ying Zhao, Gabe Mata, and Charles Zhou

Crowd-Sourcing High-Value Information via Quantum Intelligence Game
Charles C. Zhou and Ying Zhao

Teaming Humans with Virtual Assistants to Detect and Mitigate Vulnerabilities
Fitzroy D. Nembhard and Marco M. Carvalho

Spaces of Interpretations: Personal, Audience and Memory Spaces
Yehuda Roth

An Application Based on the Concept of Gamification to Promote Cultural Tourism in the Municipality of San Diego in the Department of Cesar, Colombia
Paola Patricia Ariza-Colpas, Marlon Alberto Piñeres-Melo, Roberto-Cesar Morales-Ortega, Andres Felipe Rodriguez-Bonilla, Shariq But-Aziz, Diego Armando Rodriguez-Parra, Ileana Rodriguez-Bonilla, and Leidys del Carmen Contreras Chinchilla

Platform Based on Augmented Reality to Support Cultural Tourism in the Department of Cesar, Colombia
Paola Patricia Ariza-Colpas, Marlon Alberto Piñeres-Melo, Roberto-Cesar Morales-Ortega, Andres Felipe Rodriguez-Bonilla, Shariq But-Aziz, Leidys del Carmen Contreras Chinchilla, Maribel Romero Mestre, and Ronald Alexander Vacca Ascanio

Assessment of Human Personality Traits Using Smartphone Sensing
Sehrish Rafique, Muhammad Ehatisham-ul-Haq, Kainat Ibrar, Amanullah Yasin, Fiza Murtaza, and Muhammad Awais Azam

An Approach to Mobile App Design and Development Combining Design Thinking, User Experience, and Iterative-Incremental Development
Iris Iddaly Méndez-Gurrola, Ramón Iván Barraza-Castillo, Abdiel Ramírez Reyes, and Alejandro Israel Barranco-Gutiérrez

Deploying Digital Twin in Manufacturing Systems: Scope and Requirements
Nada Ouahabi, Ahmed Chebak, Mouna Berquedich, Oulaid Kamach, and Mourad Zegrari

Toward the Selection of a Lightweight Authentication Technique for the Security of Smart Homes: Framework Architecture Based on a User Centric Design
Tanya Koohpayeh Araghi, David Megías, and Andrea Rosales

Evaluating the User Experience of Music Streaming Services
Roko Fumić, Mateo Rumac, and Tihomir Orehovački

Using Drone and AI Application for Power Transmission Line Inspection and Maintenance: A Case Study in Vietnam
Dinh Cong Nguyen, Le Nhan Tam, Dinh Hung Phan, The Cuong Nguyen, Dung Nguyen Duy, and Quang Nguyen Xuan

Artificial Intelligence Traffic Analysis Framework for Smart Cities
Monther Tarawneh, Faisal AlZyoud, and Yousef Sharrab

Predictability and Comprehensibility in Post-Hoc XAI Methods: A User-Centered Analysis
Anahid Jalali, Bernhard Haslhofer, Simone Kriglstein, and Andreas Rauber

Multi-sensor Failure Recovery in Aero-Engines Using a Digital Twin Platform: A Case Study
A. Manuja, Saurav Anilkumar, V. V. Varun, A. Mathew, S. P. Sureshkumar, and R. George

Hierarchical Joint Entity Recognition and Relation Extraction of Contextual Entities in Family History Records
Daniel Segrera, Chetan Joshi, Lawry Sorenson, Stephen Hood, Timothy Brown, Mark Clement, Joseph Price, Eric Burdett, and Stanley Fujimoto

Coupled-Tensor Generated Word Embeddings and Their Composition
Matej Cibula and Radek Marik

Credibility Analysis for Social Media Content Using Sentence Transformer Based Machine Learning
Sanjeev Roka and Danda B. Rawat

Cryptocurrency Valuation: An Explainable AI Approach
Yulin Liu and Luyao Zhang

The Path to Autonomous Learners
Hanna Abi Akl

Can Mobile Device Use in the Classroom Facilitate Student Engagement in Higher Education
Michael Bass and Perry Hessenauer

An e-Learning Course on Artificial Intelligence in Production – Development of a Target Group-Oriented Continuing Education Format for Technical Innovations
Erik Voigt, Marietta Menner, and Julia Thurner-Irmler

Design and Implementation of a Postgraduate Micro-credential in Software Development
David Parsons

Social Media in Support of Higher Education Teaching and Learning: A Systematic Literature Review
Lily Schoeman and Sunet Eybers

Designing an Interactive Learning Suite for Children: Results from a Usability Study with a Multidisciplinary Research Team
Arash Soleimani

Towards the Establishment of E-Assessment at the University of Mauritius
Abdool Qaiyum Mohabuth

Inquiring Minds Want to Know What HBCU Students Say About a STEM Master Course Model
D’Nita Andrews Graham

On the Use of Blogging in the Classroom of English for Specific Purposes in Times of COVID-19 to Promote Written Skills: A Collaborative Approach
Ana Ibañez Moreno

Using Virtual Reality Learning Environments to Improve Success for Online Students
Evelyn R. Sowells-Boone

A RabbitMQ-Based Framework to Deal with Naval Sensor Systems Design Complexity
Paul Quentel, Yvon Kermarrec, Pierre Le Berre, Ludovic Grivault, and Laurent Savy

A Novel Method of Automatic Modulation Classification with an Optimised 1D DBSCAN
Bill Gavin, Edward Ball, and Tiantai Deng

A Secure Information Transmission Scheme for the Cluster Blockchain of the Internet of Vehicles
Hua Yi Lin, Meng-Yen Hsieh, and Kuan-Ching Li

Techniques for Long-Term Service Prioritization and Optimization Using IEEE 802.11 Technologies (Email and HTTP)
Ali Mohd Ali, Mohammad R. Hassan, Ahmed Abu-Khadrah, and Ahmad Al-Qerem

An Improved WRR Scheduling Algorithm for MANETs
Mukakanya Abel Muwumba, Odongo Steven Eyobu, and John Ngubiri

Blockchain Network Analysis: A Comparative Study of Decentralized Banks
Yufan Zhang, Zichao Chen, Yutong Sun, Yulin Liu, and Luyao Zhang

Evaluating Self-supervised Transfer Performance in Grape Detection
Michael Woodson and Jane Zhang

HINAY: A Mobile Application for Real-Time Traffic Sign Detection
Daniel Pete M. Aguilar and Reginald Neil C. Recario

Challenges of the Creation of a Dataset for Vision Based Human Hand Action Recognition in Industrial Assembly
Fabian Sturm, Elke Hergenroether, Julian Reinhardt, Petar Smilevski Vojnovikj, and Melanie Siegel

A Novel Features Selection Model for Fire Detection and Fire Circumstances Recognition by Considering Fire Texture: MIC-RF-RFE
Jittarin Jetwiriyanon, Ziheng Feng, and Kanoksak Wattanachote

A Study on Artificial Intelligence Techniques for Automatic Fish-Size Estimation
Rajarshi Biswas, Marcel Mutz, Nisha George, and Dirk Werth

Falling People Detection in Real Time Video Using Convolution Neural Network
Sathit Prasomphan, Earn Suriyachay, Satayu Samonothai, and Jiratchakit Tamasri

Some Guidelines for Cybersecurity Governance in the Internet of Medical Things
Basie von Solms and Jaco du Toit

Recovering from Memory the Encryption Keys Used by Ransomware Targeting Windows and Linux Systems
Xosé Fernández-Fuentes, Tomás F. Pena, and José C. Cabaleiro

De-anonymising Individuals Through Unique Patterns in Movement Data
Nikolai J. Podlesny, Anne V. D. M. Kayem, and Christoph Meinel

MOTP – a Microservice Approach to Provide One-Time Pins (OTPs) as a Service
Mkhululi Mtshemla, Jaco Du Toit, and Carl van der Westhuizen

Recent Advances in Cyberattack Detection and Mitigation Techniques for Renewable Photovoltaic Distributed Energy CPS
Jessica Whitaker and Danda B. Rawat

Insider Threat Detection on an Imbalanced Dataset Using Balancing Methods
Keir Dinardo, Mouad Lemoudden, and Jawad Ahmad

Stack-Based Buffer Overflow Implementation Using Python 3
Jewel Donkor Apeko and Claude Turner

Design and Development of a Comprehensive Cyber Security Competition Visualization System
Thomas Chapman, Claude Turner, Dwight Richards, Rolston Jeremiah, Jie Yan, Ruth Agada, Mohammed Abdulai, and Tricia Camaya

Teaching Cybersecurity with Experiential Learning: The Case of the Phishing and Deviance Module in Social Science Courses
Carlene Buchanan Turner, Claude Turner, and Austin Ashe

Investigating Instagram Privacy Through Memory Forensics
Ahmad Ghafarian and Justin Fredy

Internet of Things Security Threats and Attacks: Vulnerability Assessment
Zubeir Izaruku Dafalla and Andrews Samraj

Application of Machine Learning in Intrusion Detection Systems
Milena Gjorgjievska Perusheska and Vesna Dimitrova

Nature Inspired Metaheuristic Techniques of Firefly and Grey Wolf Algorithms Implemented in Phishing Intrusion Detection Systems
Sandra Kopecky and Catherine Dwyer

Classification of Gas Sensor Data Using Multiclass SVM
M. Jaleel, A. Amira, and H. Malekmohamadi

Blockchain Technology Approach on Securing Smart Water Metering Networks Toward Anomaly Free: An Overview and Future Research Directions
M. N. Kanyama, F. Bhunu Shava, A. M. Gamundani, and A. Hartmann

Smart Environment: Using a Multi-agent System to Manage Users and Spaces Preferences Conflicts
Pedro Filipe Oliveira, Paulo Novais, and Paulo Matos

Carpediem: Investigating the Interactions of Health Pillars to Design Holistic Recommendations for Achieving Long-Term Changes in Lifestyle Behaviours
Carolina Migliorelli, Laura Ros-Freixedes, Meritxell Gomez-Martinez, Laura Sistach-Bosch, and Silvia Orte

Multimedia Georeferenced Contents for Climate Events: The MAGIS Approach
Mariagrazia Fugini, Jacopo Finocchi, Elisa Rossi, and Sara Comai

The Effect of ASR Apps on Monophthong Pronunciation Improvement and Generalization to New Words in English
Haiyan Xiao, Keting Ou, Hongyan Wang, and Jeroen van de Weijer

IoT Secure Cloud Enabled Model for Soil Nutrition Monitoring and Fertilizer Suggestion for Agricultural Industry of Sri Lanka
U. H. D Thinura Nethpiya Ariyaratne, V Diyon Yasaswin Vitharana, L. H Don Ranul Deelaka, H. M Sumudu Maduranga Herath, Anuradha Jayakody, and Narmada Gamage

Modeling Internet-of-Things (IoT) Behavior for Enforcing Security and Privacy Policies
Anubhav Gupta, Daniel Campos, Parth Ganeriwala, Siddhartha Bhattacharyya, TJ OConnor, and Adolf Dcosta

Author Index
Bit-Level Operation-Based MAC Unit for Vector Multiplications

Feng Yan¹, Xiaochen Wang²(B), and Tiantai Deng²

¹ Electrical and Electronic Department, University of Leeds, Leeds LS2 9JT, West Yorkshire, UK
² Department of Electronic and Electrical Engineering, University of Sheffield, Sheffield S1 3JD, UK
{xiao.chen,t.deng}@sheffield.ac.uk
Abstract. Computing power is of great significance in supporting various applications in daily life. With the continuous increase in demand for computing power in recent years, it is vital to keep increasing computing power through specialized architectures to meet the demands of modern complex algorithms and applications, given that fabrication technology is reaching its physical limits. This research proposes a novel MAC unit based on a novel computing theory to reduce the latency of parallel multiplication-accumulation processes, such as matrix multiplication, convolution in deep learning and the DFT. The novel MAC unit is implemented using Xilinx Vivado and Vitis IDE. With data collected from FPGA implementation and simulation in the design software, the results show that a processing element consisting of 16/32 MAC units uses 4–6 times more resources than the traditional implementation but has 9 times lower latency.

Keywords: Parallel MAC Unit · Processing Element (PE) · FPGA Implementation · Computing Theory
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): SAI 2023, LNNS 711, pp. 1–6, 2023. https://doi.org/10.1007/978-3-031-37717-4_1

1 Introduction

The multiplication-accumulation (MAC) unit is a basic but key component for supporting matrix multiplication, which has many applications, for example in Digital Signal Processing (DSP) [1], Artificial Intelligence (AI) [2], and Image Processing [3]. Modern architectures for matrix multiplication usually require multiple MAC units that share data with their neighbouring MACs to reduce memory transactions. For example, the Google Tensor Processing Unit (TPU) uses a systolic array architecture, the most mainstream architecture for matrix multiplication, to reuse data and reduce one dimension of the computing complexity [4]. In a systolic array, a number of MAC units are connected into a Processing Element (PE) [5]. In the Huawei DaVinci architecture, an AI accelerator that also uses a systolic array architecture, each PE has 16 MACs with an adder tree to accumulate the results of the MACs [6]. Modern applications involving matrix multiplication usually require large-scale matrix multiplication. For example, in Convolutional Neural Networks (CNN), the accelerator usually relies on the “img2col” function to transform 3D convolution into matrix multiplication, and the size of the matrix can be several hundred K by several hundred K [7] because of the growing size of images and feature maps. To accelerate large-scale matrix multiplication, systolic array-based architectures are usually used, for example, TPU and NVDLA [8]. Thus, a MAC unit with lower latency or a higher degree of integration would be very useful to the larger architecture [9].

In this paper, we propose a novel MAC unit based on a new computing theory. Our integrated MAC units are scalable and work naturally in parallel. Several MAC operations are performed efficiently in one component instead of using separate multipliers in the PE. Our detailed contributions are as follows:

1. A new computing theory for designing MAC units based on the bit-level math of the parallel MAC operation.
2. An FPGA implementation of the novel MAC unit that is compatible with the mainstream systolic array architecture.
3. A lower-latency MAC array that can be used for low-latency applications when area is less important.

The rest of the paper is organized as follows: in Sect. 2, we provide a detailed explanation of the computing theory and how the hardware is built; in Sect. 3, we compare and discuss our novel MAC unit against implementations in other papers; in the last section, we conclude with an analysis of the potential of our novel MAC unit and summarize our future work.
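2 Computing Theory and Hardware Architecture
2.1 Computing Theory
The starting point of our new computing theory is the definition of vector multiplication. A and B are two vectors of 16 elements, each element being a binary number with a word length of 8, and C is the result of the vector multiplication. Writing $a_{m,i}$ and $b_{m,i}$ for bit $i$ of the elements $A_m$ and $B_m$:

$$A_m = \sum_{i=0}^{7} a_{m,i} \cdot 2^{i} \tag{1}$$

$$B_m = \sum_{i=0}^{7} b_{m,i} \cdot 2^{i} \tag{2}$$

$$C = \sum_{m=0}^{15} A_m \cdot B_m = \sum_{m=0}^{15} \left( \sum_{i=0}^{7} a_{m,i} \cdot 2^{i} \right) \left( \sum_{j=0}^{7} b_{m,j} \cdot 2^{j} \right) = \sum_{m=0}^{15} \sum_{i=0}^{7} \sum_{j=0}^{7} a_{m,i} \cdot b_{m,j} \cdot 2^{i+j} \tag{3}$$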
It is a simple transformation from the original result to what we have in Eq. (3). The first accumulation sign with the index (m) could be translated into an adder tree in the hardware implementation. The second and third accumulation signs with the index (i, j) can be represented by using a simple counter. The power of 2 can be translated into
shifters, and multiplication in binary form can be implemented with AND gates. There is no data dependency among the accumulation signs, and the scheme scales easily to numbers with different word lengths and to vectors with different numbers of elements. Instead of multiplying independent numbers and adding the products together, our computing theory integrates the whole MAC process. It handles the MAC operation well because counters, AND gates and shifters replace the original multipliers and increase the efficiency.
2.2 Hardware Architecture
The hardware is implemented in a way that corresponds to Eq. (3). Figure 1 shows the general architecture of the MAC unit. The architecture can be easily pipelined. If we follow Eq. (3) literally, there will be 64 counters and fixed shifters in the hardware, which might consume too much area. However, for 8-bit numbers the shift amount only ranges from 0 to 14. That means we can merge the results that share a shift amount and reduce the number of counters and shifters from 64 to 15. Table 1 and Fig. 2 show how the counters and shifters are merged. The first column of Table 1 shows how many bits need to be shifted. The second column shows the bit positions from A and B (8-bit numbers), and the third column is the number of 1-bit AND gates needed. The total number of AND gates is generally the same as before the merging, but the merging reduces the number of counters and shifters from 64 to 15.
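The bit-level formulation and the merging step can be checked in software. The following is a minimal Python sketch, not the FPGA design itself: it assumes unsigned 8-bit operands, and the function names (`bit`, `mac_direct`, `mac_merged`, `pairs_per_shift`) are hypothetical names chosen for illustration. `mac_direct` follows Eq. (3) literally with one counter per (i, j) pair, while `mac_merged` keeps one counter per shift amount.

```python
def bit(x, k):
    """Return bit k of the non-negative integer x."""
    return (x >> k) & 1


def mac_direct(A, B, width=8):
    """Eq. (3) taken literally: one counter and one fixed shift per (i, j) pair."""
    total = 0
    for i in range(width):
        for j in range(width):
            # AND gates give a_{m,i} & b_{m,j}; the counter sums them over all m.
            count = sum(bit(a, i) & bit(b, j) for a, b in zip(A, B))
            total += count << (i + j)  # fixed shift by i + j
    return total


def mac_merged(A, B, width=8):
    """Merged form: one counter per shift amount s = i + j."""
    counters = [0] * (2 * width - 1)  # shift amounts 0..14 for 8-bit operands
    for i in range(width):
        for j in range(width):
            counters[i + j] += sum(bit(a, i) & bit(b, j) for a, b in zip(A, B))
    return sum(c << s for s, c in enumerate(counters))


def pairs_per_shift(width=8):
    """Number of (i, j) bit-position pairs that share each shift amount."""
    groups = [0] * (2 * width - 1)
    for i in range(width):
        for j in range(width):
            groups[i + j] += 1
    return groups


# Two arbitrary 16-element vectors of unsigned 8-bit values (illustrative data).
A = [3, 255, 17, 0, 128, 64, 9, 200, 1, 2, 4, 8, 16, 32, 77, 250]
B = [5, 255, 2, 99, 128, 3, 11, 100, 7, 6, 5, 4, 3, 2, 1, 0]
reference = sum(a * b for a, b in zip(A, B))
print(mac_direct(A, B) == reference, mac_merged(A, B) == reference)
print(pairs_per_shift())
```

Both functions reproduce the ordinary dot product, and `pairs_per_shift` gives the group sizes behind the second column of Table 1: 64 pairs collapse into 15 shift groups.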
Fig. 1. General Architecture of the MAC unit based on our Novel Computing Theory
3 Result We implement our architecture on the Xilinx FPGA Evaluation Board (Zedboard) to test the performance, area consumption and power consumption. The whole design is done through the Xilinx Vivado Toolchain.
Table 1. Merge Table for Reducing the Number of AND Gates, Counters, and Shifters

| Shift | Bit position from A and B | Number of AND Gates (in Bits) |
|---|---|---|
| 0 | 0/0 | 7 |
| 1 | 1/0, 0/1 | 14 |
| 2 | 0/2, 1/1, 2/0 | 21 |
| 3 | 0/3, 1/2, 2/1, 3/0 | 28 |
| 4 | 0/4, 1/3, 2/2, 3/1, 4/0 | 35 |
| 5 | 0/5, 1/4, 2/3, 3/2, 4/1, 5/0 | 42 |
| 6 | 0/6, 1/5, 2/4, 3/3, 4/2, 5/1, 6/0 | 49 |
| 7 | 0/7, 1/6, 2/5, 3/4, 4/3, 5/2, 6/1, 7/0 | 56 |
| 8 | 1/7, 2/6, 3/5, 4/4, 5/3, 6/2, 7/1 | 49 |
| 9 | 2/7, 3/6, 4/5, 5/4, 6/3, 7/2 | 42 |
| 10 | 3/7, 4/6, 5/5, 6/4, 7/3 | 35 |
| 11 | 4/7, 5/6, 6/5, 7/4 | 28 |
| 12 | 5/7, 6/6, 7/5 | 21 |
| 13 | 6/7, 7/6 | 14 |
| 14 | 7/7 | 7 |
Fig. 2. Architecture after Optimization
To show the advancement of our MAC unit, we also implemented a comparison version built from traditional multipliers that performs an 8-bit vector multiplication with 16 elements in each vector. The two groups of PEs have very similar power consumption (0.096 W vs 0.097 W), but the PE based on our new MAC unit shows 9 times lower latency (6 clk vs 54 clk) while using about 6 times more resources (LUTs and FFs). Detailed results can be found in Table 2.
Table 2. Resource/Latency Comparison for 16 MACs Design

| | Our work | Normal MAC |
|---|---|---|
| LUTs | 420 | 74 |
| FFs | 324 | 53 |
| DSPs | 0 | 0 |
| BRAMs | 0 | 0 |
| Latency | 6 clk | 54 clk |
| Power | 0.096 W | 0.097 W |
When we scale up the design from 16 MACs to 32 MACs, the LUT overhead of our novel MAC unit drops from about 6 times to about 4 times (1956 vs 472 LUTs), while the latency stays the same as for 16 MACs. The results for 32 MACs can be found in Table 3.

Table 3. Resource/Latency Comparison for 32 MACs Design

| | Our work | Normal MAC |
|---|---|---|
| LUTs | 1956 | 472 |
| FFs | 1140 | 190 |
| DSPs | 0 | 0 |
| BRAMs | 0 | 0 |
| Latency | 6 clk | 54 clk |
| Power | 0.141 W | 0.115 W |
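The ratios quoted in the text follow directly from the tables. As a small worked check, the Python sketch below recomputes them from the reported values; the dictionary layout and the function name `overheads` are illustrative assumptions, and only the numbers come from Tables 2 and 3.

```python
# Values copied from Tables 2 and 3 as (our work, normal MAC).
results = {
    16: {"LUTs": (420, 74), "FFs": (324, 53),
         "latency_clk": (6, 54), "power_W": (0.096, 0.097)},
    32: {"LUTs": (1956, 472), "FFs": (1140, 190),
         "latency_clk": (6, 54), "power_W": (0.141, 0.115)},
}


def overheads(n_macs):
    """Resource and power ratios (ours / normal) and latency speed-up (normal / ours)."""
    r = results[n_macs]
    return {
        "LUT_ratio": r["LUTs"][0] / r["LUTs"][1],
        "FF_ratio": r["FFs"][0] / r["FFs"][1],
        "latency_speedup": r["latency_clk"][1] / r["latency_clk"][0],
        "power_ratio": r["power_W"][0] / r["power_W"][1],
    }


for n in (16, 32):
    print(n, {name: round(value, 2) for name, value in overheads(n).items()})
```

The latency speed-up is 9 in both configurations, while the LUT ratio falls from roughly 5.7 to roughly 4.1 and the FF ratio stays close to 6.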
4 Conclusion and Future Work
This paper presents a novel bit-level computing theory that can guide the design of PEs in systolic array architectures with a higher degree of integration. The experimental results show that with 16 MACs, our design has 9 times lower latency while using only about 6 times the resources and consuming very similar power. With 32 MACs, compared to the normal MAC unit, our design uses only around 4 times the resources and keeps the 9 times lower latency. This work is currently in the form of a standalone MAC unit and is not yet integrated into a larger architecture. Thus, the next step will be to integrate our novel MAC unit into larger architectures, such as a systolic array, for evaluation purposes.
Acknowledgment. This project is funded by a Royal Society Research Grant.
References
1. Sohl, J., Wang, J., Liu, D.: Large matrix multiplication on a novel heterogeneous parallel DSP architecture. In: Dou, Y., Gruber, R., Joller, J.M. (eds.) APPT 2009. LNCS, vol. 5737, pp. 408–419. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-03644-6_32
2. Wu, D., Fan, X., Cao, W., Wang, L.: SWM: A high-performance sparse-Winograd matrix multiplication CNN accelerator. IEEE Trans. Very Large Scale Integr. (VLSI) Systems 29(5), 936–949 (2021)
3. Al-Qadi, Z., Aqel, M.: Performance analysis of parallel matrix multiplication algorithms used in image processing. World Appl. Sci. J. 6(1), 45–52 (2009)
4. He, X., et al.: Sparse-TPU: adapting systolic arrays for sparse matrices. In: Proceedings of the 34th ACM International Conference on Supercomputing, pp. 1–12 (2020)
5. Wei, X., et al.: Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs. In: Proceedings of the 54th Annual Design Automation Conference 2017, pp. 1–6 (2017)
6. Liao, H., Tu, J., Xia, J., Zhou, X.: DaVinci: a scalable architecture for neural network computing. In: Hot Chips Symposium, pp. 1–44 (2019)
7. Tao, W.Z., Wang, Y., Zhang, H.: Overview of tensor layout in modern neural network accelerator. In: 2021 18th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP), pp. 368–371. IEEE (2021)
8. Feng, S., Wu, J., Zhou, S., Li, R.: The implementation of LeNet-5 with NVDLA on RISCV SoC. In: 2019 IEEE 10th International Conference on Software Engineering and Service Science (ICSESS), pp. 39–42. IEEE (2019)
9. Shao, Y.S., et al.: Simba: scaling deep-learning inference with multi-chip-module-based architecture. In: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 14–27 (2019)
New Routines for Faster Balancing of AVL Trees
Orieh Destiny Anyiawe, Austin Ramsey, and David Fawcett
Lawrence Technological University, Southfield, MI 48075, USA
[email protected]
https://destinyanyaiwe.net
Abstract. The longer running times of insert and delete operations in Adelson-Velsky and Landis (AVL) trees, caused by the complexity of their balancing technique and the associated overhead, continue to impede their ubiquitous use and versatility compared to other binary search trees such as red-black trees. In this paper, we present a non-geometrically inspired implementation for faster balancing of AVL trees. The run-time performance of the proposed routines is consistently better than that of four other publicly available algorithms of AVL trees.
Keywords: Sling · Swing · Geometric Description · Rotation · Balancing
1 Introduction
The left-to-right numerical ordering of nodes in a binary search tree (BST) does not guarantee a shallow BST, even for complete binary trees (BTs) [1]. In theory and in application, when the ordering alone fails to keep the tree shallow, the resulting tree degenerates into a linked list. Thus, the cost of search and access operations becomes linear (O(n)). The AVL tree, introduced by Adelson-Velsky and Landis in 1962, is a self-balancing BST in which all operations have a worst-case run-time of O(log n). However, it generally has longer running times for insert and delete operations, as a result of the rebalancing needed when the left-to-right numerical ordering of nodes leaves the tree unbalanced. In fact, AVL trees have been observed to perform better than unbalanced BSTs, splay trees, and red-black trees only if insertions occur in sorted order and later accesses are random [2].
The AVL tree has a strict balancing ability that stems from its balancing property: for every node in the tree, the heights of its left and right subtrees differ by at most 1. This condition prevents the tree from degenerating into a linked list by automatically maintaining the height (h) of the tree. Thus, the average internal path length of balanced BSTs (e.g., AVL and red-black trees) is O(log n). Moreover, AVL trees are more balanced than red-black trees (RBTs), as reflected in their lookup constants, 1.44 and 2 for AVL and RBT respectively, while deletions are cheaper in RBTs than in AVL trees [3–6,15]. See Eqs. 1 and 2, where n is the number of nodes in the tree. The AVL tree is popular in databases where fast retrieval is required and in many other application areas.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): SAI 2023, LNNS 711, pp. 7–15, 2023. https://doi.org/10.1007/978-3-031-37717-4_2
Examples include the management of resources and efficiency in programmable chips [7].

$$h \le 1.44 \log(n + 1) - 1.33 \in O(\log n) \quad \text{(AVL)} \tag{1}$$

$$h \le 2 \log(n + 1) \in O(\log n) \quad \text{(Red-Black Trees)} \tag{2}$$
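To give a feel for the two bounds, the following Python sketch evaluates Eqs. (1) and (2) for a few tree sizes. It assumes the logarithm is base 2, which the text does not state explicitly, and the function names are hypothetical.

```python
import math


def avl_height_bound(n):
    """Right-hand side of Eq. (1), assuming a base-2 logarithm."""
    return 1.44 * math.log2(n + 1) - 1.33


def rbt_height_bound(n):
    """Right-hand side of Eq. (2), assuming a base-2 logarithm."""
    return 2 * math.log2(n + 1)


for n in (10, 1_000, 1_000_000):
    print(n, round(avl_height_bound(n), 2), round(rbt_height_bound(n), 2))
```

Under this assumption the AVL bound stays below the red-black bound for every n ≥ 1, which is the sense in which the text calls AVL trees more balanced.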
The traditional algorithm for AVL trees keeps track of several parameters associated with each node, for instance the height of an empty subtree, thereby increasing the complexity of its balancing technique as well as the overhead workload. Consequently, the ubiquitous use and versatility of the tree are impeded. We record some real-life applications. Lázaro et al. [7] chose the AVL tree as the underlying architecture for Industrial Ethernet networks, which gives network parameters the ability to search networks with known and fixed latency [8], for faster real-time performance. Based on its structure, the AVL tree has been used to propose diverse real-life application models. For instance, in [9] multiple AVL trees were used to construct point-to-point links to boost gateway node authentication and prioritized threshold networks for smart cities, based on a security measure for cloud-integrated data management.
As indicated earlier, AVL trees perform fast retrievals but have complex insertion and removal operations. The algorithm keeps track of several parameters associated with each node and has more lookups than other balanced data structures (e.g., red-black trees), which makes it more expensive to rebalance. Traditionally, the mechanism for rebalancing an AVL tree is either a single or a double rotation, depending on the orientation of the subtree with the structure violation. In this paper, we show the possibility of optimizing the processing cost of balancing AVL trees during insertion/deletion based on new balancing geometries and associated implementations. Similar works in this direction are not specific to AVL trees. Rather, they address self-balancing trees in general, with a focus on parallelism [11,12], memory layout [13], or, in particular, enhancing the simplicity and efficiency of the red-black tree [14]. Balancing AVL trees by single or double rotation is costly in running time and geometrically heavy (Fig. 1).
This contributes to why AVL trees are complex to implement. We propose two new mechanisms, the sling and swing operations, to rebalance AVL trees.
Definition 1 (Balanced Tree). A balanced tree is a tree whose height (h) is minimal (i.e., h = log n) and that is not skewed with respect to the heights of its left and right children or subtrees.
Definition 2 (Unbalanced Cases). Let α be a parent node in an AVL tree (T). When a node is inserted into or deleted from T, it could lead to the violation of the balancing property of the tree at node α in any of the following cases. That is, we say that T is unbalanced if a new node is inserted into
1. left-left: the left subtree of the left child of α
2. left-right: the right subtree of the left child of α
3. right-left: the left subtree of the right child of α
4. right-right: the right subtree of the right child of α
Example 1. Double rotation to rebalance a Case 2 structure: Fig. 1 presents the steps and stages of the traditional double rotation involved in rebalancing a left-right structure (Case 2, Fig. 1.I). Node B has been inserted into the right subtree of the left subtree of C. This makes C an unbalanced node, which calls for a left-right rotation. Fig. 1.II shows the left rotation on the left subtree of C, which temporarily turns the tree into a left-left structure (Case 1), Fig. 1.III. Since C is still unbalanced, we right-rotate the tree, Fig. 1.IV, to make B the new root node. C is now the right child of its former left child, and the tree is balanced, Fig. 1.V.
Fig. 1. Images Showing (I) Case 2, Node C is the Alpha Node; Intermediate Structure (III) after the First Rotation (II) and the Final Balanced Tree (V) Following the Second Rotation (IV). (Images Copied from [10])
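The traditional double rotation of Example 1 can be sketched in a few lines of Python. This is the textbook rotation scheme, not the authors' code; the `Node` class and the function names are hypothetical, and height bookkeeping is omitted to keep the sketch short.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right


def rotate_left(x):
    """Promote x's right child and return it as the new subtree root."""
    y = x.right
    x.right, y.left = y.left, x
    return y


def rotate_right(x):
    """Promote x's left child and return it as the new subtree root."""
    y = x.left
    x.left, y.right = y.right, x
    return y


def rebalance_left_right(c):
    """Case 2: left rotation on the left child, then right rotation on c."""
    c.left = rotate_left(c.left)  # Fig. 1.II -> 1.III, the path is now left-left
    return rotate_right(c)        # Fig. 1.IV -> 1.V


# Case 2 of Example 1: C has left child A, and B sits in A's right subtree.
root = Node("C", left=Node("A", right=Node("B")))
root = rebalance_left_right(root)
print(root.key, root.left.key, root.right.key)  # prints: B A C
```

After the two rotations B is the root with A and C as its children, matching Fig. 1.V.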
2 New Mechanisms
An AVL tree is, first of all, a binary search tree (BST). Thus, the leftChild ≤ root ≤ rightChild ordering principle of BSTs is maintained whenever nodes are inserted into or deleted from the tree. In addition, the tree must remain shallow, which the balancing property of AVL trees ensures. First, we note that cases 1 and 4 are mirror images of each other and that a similar relationship exists between cases 2 and 3. Traditionally, the latter cases require a more exhaustive routine, the double (left-right or right-left) rotation, to rebalance the tree (as in Example 1), while the former pair is rebalanced with a single rotation. Steps III–V of Fig. 1 (of Example 1) describe the single rotation for a left-left structure. We now propose two balancing mechanisms that balance AVL trees with the non-geometric concept of rotation.
2.1 Sling Routine
Fig. 2. Typical Case 4 on the Left Hand Side: The AVL Tree is Fixed using Sling() Routine (Right Hand Side)
The result of the sling mechanism is equivalent to what is obtainable using a single rotation, and it is applicable to cases 1 and 4. Consider Fig. 2, which has two images: (a) the geometric representation of case 4 (on the left) and (b) the sling operation (on the right). The sling routine can take three parameters: the alpha (α) node, alpha's left or right child, and alpha's grandChild. Depending on the specific case at hand, the sling operation proceeds by shooting alpha's child up by a level while pushing alpha down by a level, into either the right subtree of alpha's child (in case 1) or the left subtree of alpha's child (in case 4); see Algorithm 1.
Algorithm 1. Sling Algorithm
procedure sling(α, child)
    height(α) > 1?
    if case 1/4
        make α right/left Child of child
    update(height)
end procedure
Along the unbalanced path, the sling() function treats the child of the alpha node as an arrow on a bow, with the alpha node and alpha's grandChild as the ends of the bow. It slings the alpha node's child. The sling can be a slingLeft() or a slingRight() for case 1 and case 4 respectively. Both operations promote the level of the child node along the unbalanced path while reducing the height of alpha (alpha is detached from its parent, if one exists).
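One possible reading of Algorithm 1 is sketched below in Python rather than the authors' C++. The names `sling_left` and `sling_right` mirror slingLeft() and slingRight() from the text, but the `Node` class, the helper functions, and the handling of the child's inner subtree are assumptions made for illustration, since the pseudocode does not spell them out.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right
        self.height = 1 + max(height(left), height(right))


def height(node):
    """Height of a subtree; an empty subtree has height 0."""
    return node.height if node else 0


def update(node):
    node.height = 1 + max(height(node.left), height(node.right))


def sling_left(alpha):
    """Case 1 (left-left): the left child moves up, alpha becomes its right child."""
    child = alpha.left
    alpha.left = child.right  # assumption: the child's old right subtree moves to alpha
    child.right = alpha
    update(alpha)
    update(child)
    return child              # the caller links this to alpha's former parent


def sling_right(alpha):
    """Case 4 (right-right): mirror image of sling_left."""
    child = alpha.right
    alpha.right = child.left  # assumption: the child's old left subtree moves to alpha
    child.left = alpha
    update(alpha)
    update(child)
    return child


# Case 4 as in Fig. 2: a right-leaning chain 1 -> 2 -> 3.
root = sling_right(Node(1, right=Node(2, right=Node(3))))
print(root.key, root.left.key, root.right.key, root.height)  # prints: 2 1 3 2
```

Written this way, the sling produces the same final shape as a single rotation, which is the equivalence the text claims; any cost difference in the authors' implementation would come from details not shown in the pseudocode.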
2.2 Swing Routine
Most programming languages provide one or more of the swap, move, or copy functions in their standard libraries. Thus, the choice of the primary function called by the proposed swing routine is programming-language dependent. We implemented the swing function by combining the swap function (in C++) and the sling function to generate a result equivalent to a double rotation. A call to the swap function swaps alpha's grandChild and alpha's child on the applicable path before the sling function is called to complete the rebalancing. This swap operation transforms a case 2 structure into a case 1 form, and a case 3 structure into a case 4 form. The rebalancing of the tree is then completed with a call to the appropriate sling operation. The swing routine takes the same parameters as the sling routine, and it is an admissible rebalancing mechanism for case 2 and case 3 structures.
Algorithm 2. Swing Algorithm
procedure swing(α, leftChild, grandChild)
    height(α) > 1?
    if case 2
        function swap(leftChild, grandChild)
            call SLING(α, Child)
        end function
end procedure
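A sketch of the swing idea is given below, again in Python and under stated assumptions. The authors use the C++ swap function; here the swap of child and grandchild is expressed by relinking references, which is an illustrative substitute and not their implementation. The explicit `case` argument, the `Node` class and all function names are hypothetical, and height updates are left out for brevity.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right


def sling_left(alpha):
    """Case 1: promote the left child; alpha becomes its right child."""
    child = alpha.left
    alpha.left, child.right = child.right, alpha
    return child


def sling_right(alpha):
    """Case 4: promote the right child; alpha becomes its left child."""
    child = alpha.right
    alpha.right, child.left = child.left, alpha
    return child


def swing(alpha, case):
    """Cases 2 and 3: swap child and grandchild on the unbalanced path, then sling."""
    if case == 2:                                    # left-right
        child, grand = alpha.left, alpha.left.right
        child.right, grand.left = grand.left, child  # grandchild takes the child's place
        alpha.left = grand                           # the path is now left-left
        return sling_left(alpha)
    if case == 3:                                    # right-left
        child, grand = alpha.right, alpha.right.left
        child.left, grand.right = grand.right, child
        alpha.right = grand                          # the path is now right-right
        return sling_right(alpha)
    raise ValueError("swing applies to cases 2 and 3 only")


case2 = swing(Node("C", left=Node("A", right=Node("B"))), case=2)
case3 = swing(Node("A", right=Node("C", left=Node("B"))), case=3)
print(case2.key, case2.left.key, case2.right.key)  # prints: B A C
print(case3.key, case3.left.key, case3.right.key)  # prints: B A C
```

In both cases the middle key ends up at the root, which is the same final shape a double rotation produces.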
3 Experimental Operations and Benchmarks
We developed an AVL tree program based on the proposed sling and swing mechanisms. In this section, we first establish their applicability and benchmark their performance on case 1 to case 4 structures. Thereafter, we compare the performance of the new code with publicly available AVL tree algorithms using randomly generated inputs of varying sizes. Algorithms from the GeeksForGeeks (Geeks) and Programiz (ProgZ) websites were chosen, along with the AVL tree algorithm found in the Data Structures and Algorithm Analysis textbook by Mark A. Weiss [1] (Weiss). We also compared the performance of the new code to a red-black tree (RBT) implementation, since it is generally faster than AVL trees, i.e., A = {NewCode, Geeks, ProgZ, Weiss, RBT}. The programming language used is C++.
3.1 Case-by-Case Experiment
To validate the proposed mechanisms and to develop a benchmark comparison, small AVL trees of height 2 were constructed to represent the cases (e.g., the tree on the left-hand side of Fig. 2). These trees were sequentially introduced to the selected tree algorithms as inputs. Table 1 presents their running times (in milliseconds). We benchmarked the run-times of each algorithm for the four cases and recorded the average run-time of three consecutive runs. In comparison, the total run-time to execute all the cases by the code inspired by the proposed mechanism beats those of the other AVL trees. It is consistently (50%) faster than the other traditional AVL trees and has equivalent performance with the RBT algorithm used (shaded cyan in the original table). None of the other AVL tree implementations is faster than the red-black tree in more than 25% of the cases. The average running times that are shaded pink in the original Table 1 are instances where the new code is the second-best algorithm when the entries for the red-black tree are not considered.

Table 1. Considering Cases 1–4 Structures, this Table Presents the Bench-Mark Performance of the Newly Inspired Code and the other AVL Tree Implementations Including a Column with Average Running Times for a Red-Black Tree Algorithm. The Newly Inspired Code is Consistently (50%) Faster than Traditional AVL Trees and it has Equal Performance Rating (50%) with Red-Black Tree Implementation

| | NewCode | Geeks | ProgZ | Weiss | RBT |
|---|---|---|---|---|---|
| Case 1 | 0.07815 | 0.07848 | 0.07851 | 0.07502 | 0.06968 |
| Case 2 | 0.06912 | 0.06975 | 0.06997 | 0.08141 | 0.07580 |
| Case 3 | 0.07846 | 0.07538 | 0.07690 | 0.08285 | 0.07136 |
| Case 4 | 0.07288 | 0.08945 | 0.08279 | 0.0843 | 0.07856 |
| Total | 0.29861 | 0.31306 | 0.30817 | 0.32358 | 0.2954 |
Figure 3 is a bar chart representation of the information captured in Table 1. The height of a bar represents the average running time of the corresponding algorithm, so the algorithm with the shortest bar has the best performance in that case category. For example, when the codes were executed for case 4, the new code performed best.
3.2 Random Input and Variant Data Sizes
Upon validation and benchmarking of the proposed mechanisms, we created arrays of different sizes with randomly generated values (see the InputSize column of Table 2) to further test and compare the selected algorithms. Each algorithm was tested with the same input array. The elements of the arrays (which are not ordered) were inserted into the trees one at a time to create ideal situations/violations (that entail the occurrence of the four violation cases at different points in time). The average time of three consecutive executions was then recorded. Table 2 is the performance summary, in milliseconds, of all the algorithms with respect to 6 randomly generated arrays of different sizes. The array sizes are 10, 100, 1,000, 10,000, 100,000 and 1,000,000. The NewCode tree, i.e., the code implementation of the proposed mechanisms, has faster run-times than the three other arbitrarily chosen AVL tree algorithms and the red-black tree algorithm we worked with. This outcome is shaded green (five of the six array trials) in the original Table 2. The NewCode performance was topped only once, as shaded yellow. Figure 4 is a graphical representation of Table 2. With NewCode having the best result overall, the graph readily shows additional details, such as the range of array sizes (N = 100 ... 1000) where NewCode most clearly tops the other algorithms in efficiency.

Fig. 3. Bar Chart Graph of Benchmark Performance of the Newly Inspired Code, the Three other AVL Tree Implementations and Red-Black Tree Algorithm for Cases 1–4 Structures. The Newly Inspired Code is Consistently (50%) Faster than the other Traditional AVL Trees and it has Equal Performance Rating (50%) with Red-Black Tree Implementation.

Table 2. Performance Summary, in Milliseconds, of All the Algorithms with Respect to Six Randomly Generated Arrays of Different Sizes. The Array Sizes were 10, 100, 1,000, 10,000, 100,000 and 1,000,000. The NewCode's Tree, i.e., the Code Implementation of the Proposed Mechanisms had Faster Running Time, for 5 out of the 6 Arrays

| InputSize | NewCode | Geeks | ProgZ | Weiss | RBT |
|---|---|---|---|---|---|
| 10 | 0.072798 | 0.097037 | 0.092807 | 0.09172 | 0.0991 |
| 100 | 0.165487 | 0.165345 | 0.168576 | 8.94057 | 0.1632 |
| 1,000 | 0.875788 | 22.5496 | 15.0326 | 11.8101 | 21.7271 |
| 10,000 | 21.45675 | 28.5639 | 24.7441 | 15.2638 | 27.3862 |
| 100,000 | 453.987 | 486.586 | 492.179 | 556.751 | 475.358 |
| 1,000,000 | 44034.8 | 45185.9 | 45608.2 | 47737.7 | 45046.1 |
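The measurement procedure described in this section (the same random array for every candidate, one insertion at a time, the average of three consecutive runs) can be sketched as a small timing harness. The Python below is only an outline of that procedure: the function name `time_inserts` is hypothetical, and the single stand-in candidate is a sorted Python list rather than any of the trees benchmarked in the paper.

```python
import random
import time
from bisect import insort


def time_inserts(insert, make_container, values, runs=3):
    """Average wall-clock time in milliseconds over `runs` consecutive executions."""
    elapsed = []
    for _ in range(runs):
        container = make_container()      # fresh, empty structure for every run
        start = time.perf_counter()
        for v in values:                  # insert one element at a time
            insert(container, v)
        elapsed.append((time.perf_counter() - start) * 1000.0)
    return sum(elapsed) / runs


random.seed(0)                            # fixed seed: every candidate sees the same arrays
sizes = [10, 100, 1_000, 10_000]
inputs = {n: [random.randrange(10 * n) for _ in range(n)] for n in sizes}

# Stand-in candidate only; a real comparison would register each tree's insert here.
candidates = {"sorted_list": (insort, list)}

table = {name: {n: time_inserts(ins, make, inputs[n]) for n in sizes}
         for name, (ins, make) in candidates.items()}
for name, row in table.items():
    print(name, {n: round(ms, 4) for n, ms in row.items()})
```

Registering several candidates in `candidates` yields one row per algorithm, in the same layout as Table 2.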
Fig. 4. Graphical Representation of Table 2: Performance of Different AVL Tree Algorithms with Different Arrays of Random Inputs and Sizes.
4 Concluding Analysis
Among other properties, the efficiency of an algorithm reflects the theoretical technicality of the problem definition or subject at hand. If care is taken, simple definitions lead to simple implementations. The most demanding part is the need to cleverly break down complex problems into simple and efficient algorithms. The result presented in this work aggregated empirical knowledge from the geometric phases, orientations and weight of the intermediate steps between the start phase and end phase of balancing operations, akin to Big O (Table 2). Dynamic assignment of the alpha node, alpha's child and alpha's grandchild upon identification of an alpha node based on the height condition reduces the proposed algorithm's overhead cost when compared to rotation-based algorithms. In lieu of double rotation, the elegance of the sling operation is that it combines the STL's built-in functions to balance the tree, which makes the sling operation easy, cheap and fast. We submit that understanding the new geometries could inform easier, smarter and cheaper implementation and wider application of AVL trees. The literature and algebra of determining the heights of nodes in a tree remain the same, so nothing was done about them in this paper (though, per implementation, the routine of the sling function can update the heights internally). Our focus was on reducing the overhead costs of rebalancing AVL trees, to attract more users and applications to this powerful data structure.
Acknowledgement. Funding information: Howard Hughes Medical Institute, Grant/Award Number: 52008705
References
1. Weiss, M.A.: Data Structures and Algorithm Analysis in C++. Pearson Education India, Bengaluru (2007)
2. Pfaff, B.: Performance analysis of BSTs in system software. In: ACM SIGMETRICS '04/Performance '04: Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems, June 2004, pp. 410–411 (2004). https://doi.org/10.1145/1005686.1005742
3. Adel'son-Vel'skii, G.M., Landis, E.M.: An algorithm for the organization of information. Soviet Math. Doklady 3, 1259–1262 (1962)
4. Knuth, D.E.: Sorting and Searching, vol. 3 of The Art of Computer Programming, section 6.2.3, 2nd edn, p. 460. Addison-Wesley, Reading (1997)
5. Sedgewick, R.: Algorithms in C, Parts 1-4, section 13.4, 3rd edn, p. 556. Addison-Wesley, Reading (1998)
6. Ásványi, T.: Lecture Notes: Trees. http://aszt.inf.elte.hu/~asvanyi/ds/AlgDs2/AlgDs2trees.pdf. Accessed May 2022
7. Lázaro, J., Bidarte, U., et al.: Fast and efficient address search in system-on-a-programmable-chip using binary trees. Comput. Electr. Eng. 96, 107403 (2021). https://doi.org/10.1016/j.compeleceng.2021.107403
8. Wu, X., Xie, L.: Performance evaluation of industrial Ethernet protocols for networked control application, vol. 84, pp. 208–217 (2019). https://doi.org/10.1016/j.conengprac.2018.11.022
9. Rehman, A., et al.: M-SMDM: a model of security measures using green internet of things with cloud integrated data management for smart cities. Environ. Technol. Innov. 24, 101802 (2021). https://doi.org/10.1016/j.eti.2021.101802
10. Web page https://www.tutorialspoint.com/data_structures_algorithms/avl_tree_algorithm.htm. Accessed Nov 2022
11. Sun, Y., Ferizovic, D., Blelloch, G.E.: PAM: parallel augmented maps. SIGPLAN Not. 53(1), 290–304 (2018). https://doi.org/10.1145/3200691.3178509
12. Blelloch, G.E., Ferizovic, D., Sun, Y.: Just join for parallel ordered sets. In: Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures (Pacific Grove, California, USA) (SPAA '16). Association for Computing Machinery, New York, NY, USA, pp. 253–264 (2016). https://doi.org/10.1145/2935764.2935768
13. Iannetta, P.: A new memory layout for self-rebalancing trees. In: CGO'21, February 2021, Seoul, South Korea (2021). hal-03123491
14. Bounif, L., Zegour, D.E.: A revisited representation of the red-black tree. Int. J. Comput. Aided Eng. Technol. 16(1), 95–118 (2021). https://doi.org/10.1504/IJCAET.2022.119541
15. Jacoby, C., King, A.: Comparing Implementations of Optimal Binary Search Trees (2017). https://pdfs.semanticscholar.org/a722/b66e9571d82fb973574c33a54dba719ed1ce.pdf. Accessed 11 December 2018
Tensor Algebra on an Optoelectronic Microchip
Sathvik Redrouthu and Rishi Athavale
Procyon Photonics, Ashburn, VA, USA
[email protected]
Abstract. Tensor algebra lies at the core of computational science and machine learning. Due to its high usage, entire libraries exist dedicated to improving its performance. Conventional tensor algebra performance boosts focus on algorithmic optimizations, which in turn lead to incremental improvements. In this paper, we describe a method to accelerate tensor algebra a different way: by outsourcing operations to an optical microchip. We outline a numerical programming language developed to perform tensor algebra computations that is designed to leverage our optical hardware's full potential. We introduce the language's current grammar and go over the compiler design. We then show a new way to store sparse rank-n tensors in RAM that outperforms conventional array storage (used by C++, Java, etc.). This method is more memory-efficient than Compressed Sparse Fiber (CSF) format and is specifically tuned for our optical hardware. Finally, we show how the scalar-tensor product, rank-n Kronecker product, tensor dot product, Khatri-Rao product, face-splitting product, and vector cross product can be compiled into operations native to our optical microchip through various tensor decompositions.
Keywords: Data Analytics · Machine Learning · Optical Computing · Scientific Computing · Tensor Algebra
1 Introduction
1.1 Tensor Algebra
Tensor algebra has numerous applications in scientific disciplines. For example, widely used multiphysics simulation software (e.g. COMSOL Multiphysics, Ansys Lumerical, etc.) must perform large-scale numerical computations to solve problems in numerous fields such as fluid dynamics, structural mechanics, heat transfer and electromagnetics [2–4]. Many of these computations are streamlined through chained tensor algebra expressions [18]. In addition, advances in machine learning (ML) due to large neural networks (e.g., DALL-E 2, GPT-3, PaLM, etc.) also make use of massive tensor algebra computations [5]. Optimizing tensor algebra becomes exceedingly important when ML models must meet time constraints (e.g., high-frequency stock trading bots) [6]. Tensors themselves can be thought of as n-dimensional numerical arrays for the purposes of this paper. Each dimension of a tensor is referred to as a mode.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): SAI 2023, LNNS 711, pp. 16–33, 2023. https://doi.org/10.1007/978-3-031-37717-4_3
A tensor's rank is the number of modes it has and therefore the number of indices needed to access a specific value [31]. Rank-0 tensors, having 0 modes, require no indices to access values and thus represent a single number, or a scalar. Similarly, rank-1 tensors are simply vectors and rank-2 tensors are matrices. Tensors of rank n > 0 are very useful in representing indexed data. For example, a search engine tracking page URLs, keywords, and backlinks can store collected data in a rank-3 tensor. Typically, however, not every element in this tensor is useful. It is rarely the case that a given website contains every keyword and backlink ever indexed by the search engine. In the frequent scenario where a page URL does not map to a specific keyword-backlink combination, a 0 can simply be placed at tensor[URL][keyword][backlink]. This results in most of the tensor's entries becoming 0; such a tensor is referred to as a sparse tensor [32]. We discuss efficient storage methods for sparse tensors in Sect. 5.
Of course, search giants such as Google collect much more information than described in the example. Other companies are in the same boat; in fact, according to [19], a specific rank-3 Facebook tensor has dimensions 1591 × 63891 × 63890. Huge computations are performed constantly on tensors like these; such is the case for most large-scale graph applications [13]. Even after numerous algorithmic optimizations, however, such computation is far too slow to keep up with increasing demands [30]. For example, animation firms like Pixar can take up to 39 hours of computing time to render a single frame [21]. It is therefore apparent that some form of optimization that remains sustainable in the future is necessary.
Fig. 1. Example of a 5 × 3 × 2 Tensor. The Tensor is Referred to as “sparse” as most of the Entries are 0. Most Tensors Encountered are Sparse
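To make the notions of rank and sparsity concrete, the sketch below stores a small 5 × 3 × 2 tensor in a generic coordinate style, keeping only the non-zero entries. This is neither the storage method of Sect. 5 nor the CSF format; the entries, the dictionary layout and the helper names are all made up for illustration.

```python
# Coordinate-style storage: only non-zero entries are kept, keyed by their indices.
shape = (5, 3, 2)  # three modes, so the tensor has rank 3
nonzeros = {
    (0, 1, 0): 4.0,
    (2, 0, 1): 7.5,
    (4, 2, 1): 1.0,
}


def get(tensor, index):
    """Read one entry; indices that were never stored are implicitly 0."""
    return tensor.get(index, 0.0)


def density(tensor, shape):
    """Fraction of entries that are non-zero."""
    total = 1
    for extent in shape:
        total *= extent
    return len(tensor) / total


print(len(shape))                                           # rank = number of modes
print(get(nonzeros, (2, 0, 1)), get(nonzeros, (1, 1, 1)))   # stored vs implicit zero
print(density(nonzeros, shape))                             # 3 of 30 entries are non-zero
```

A dense array would hold all 30 values, whereas this layout holds 3, and the gap widens quickly for tensors of the size mentioned above.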
1.2 The Photonic Advantage
Many highly optimized tensor algebra libraries currently exist (e.g., Eigen, MATLAB Tensor Toolbox, and SPLATT) [20,25,29]. However, as Moore’s Law and Dennard Scaling reach their limits and the demand for tensor algebra increases, running tensor algebra on classical hardware will no longer be viable and these libraries must adapt [30]. An alternative to classical hardware involves optical computing (the use of photons to perform computations), which offers a significant speed increase and
surmounts most of the energy challenges posed in conventional computer engineering [10]. Moreover, because it does not depend on the conventional transistor, it is independent of the decline of Moore's Law. Recognizing this, some of us at Procyon Photonics have designed an optical microchip able to perform high-speed matrix-vector multiplication (MVM). The chip (named Tachyon 1) maintains a compact form and is inherently analog, indicating its potential in computational fields [15]. Performing tensor algebra on such a microchip would offer a significant speed increase while simultaneously sidestepping the decline of Moore's Law. In this paper, we describe a method that makes this possible.
1.3 The Need for Photonic-Optimized Software
Current software methodologies that are optimized for classical hardware fail to take full advantage of the power of optical computing. A specific example we aim to tackle is sparse tensor storage. Current sparse tensor storage methods, such as Compressed Sparse Fiber (CSF) format, work by storing the non-zero elements of a tensor in a tree structure. In the CSF format, every pointer is assigned to a child node, which minimizes the time it takes to index into a specific element. While this is an important optimization for classical hardware, it has little use for optical hardware. Since one of the main benefits of photonic computers is their ability to efficiently perform MVM operations, indexing into a given row in a tensor is far more important than indexing to a specific item. To make the best use of the potential of photonic computers, software must be designed specifically for optical hardware, as in the case of sparse tensor storage.
1.4
Apollo
To our knowledge, no programming language has been invented that can leverage an optical microchip’s full potential and link it to fields that can be influenced by its capabilities. For these reasons, we introduce Apollo, a computing language designed specifically for Tachyon 1. Apollo supports important tensor algebra operations that are mapped onto the corresponding units on the host computer and optical chip. The language will be extended to support operations and algorithms that are not related solely to tensor algebra but still important for computationally expensive tasks, such as deep neural network (DNN) training/inference. We begin by going through preliminary notation and definitions in Sect. 2. Next, we cover the language’s grammar and supported operations in Sect. 3. In Sect. 4, we go over the workflow, compiler front-end, and virtual machine (VM). It is here where we introduce the most important VM instruction that Sect. 6 revolves around. Next, we illustrate a new method to store large, sparse tensors in Sect. 5, which we found to surpass the conventional array storage method from a memory
Tensor Algebra on an Optoelectronic Microchip
viewpoint. In addition, we show how our method is more efficient than CSF format for our optical hardware. Finally, since Tachyon 1 is engineered to perform matrix-vector multiplication in a single instruction, we focus on decomposing complex tensor algebra expressions into sequences of matrix-vector products in Sect. 6. Efficient tensor decompositions would allow entire tensor algebra expressions to be run at an incredible speed.
2
Preliminaries
2.1
Notation
Tensors of rank n > 2 are denoted in script letters (e.g., $\mathcal{X}$). Matrices are denoted in uppercase boldface and vectors are denoted in lowercase boldface ($\mathbf{M}$ and $\mathbf{v}$ respectively). The identity matrix is denoted as $\mathbf{I}$.
2.2
Definitions
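We use multiple tensor operations in Apollo, some of which are modifications of existing definitions. In this section, we define each operation the way it is used within the language.

Definition 1 (Scalar-tensor product). Given a scalar $\lambda$ and a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_n}$, the scalar-tensor product $\lambda\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_n}$ is given by:
$$(\lambda\mathcal{X})_{i_1 i_2 \ldots i_n} = \lambda\,(x_{i_1 i_2 \ldots i_n})$$

Definition 2 (Rank-n Kronecker product). Given two tensors $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_n}$ and $\mathcal{Y} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_n}$, the rank-n Kronecker product $\mathcal{X} \otimes \mathcal{Y} \in \mathbb{R}^{I_1 J_1 \times I_2 J_2 \times \cdots \times I_n J_n}$ is given by:
$$(\mathcal{X} \otimes \mathcal{Y})_{i_1 i_2 \ldots i_n} = (x_{i_1 i_2 \ldots i_n})\,\mathcal{Y}$$
Each index $i_1 i_2 \ldots i_n$ is a corresponding index in a block tensor.

Definition 3 (Tensor inner product). Given two tensors $\mathcal{X}, \mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_n}$, the inner product $\langle \mathcal{X}, \mathcal{Y} \rangle \in \mathbb{R}$ is given by:
$$\langle \mathcal{X}, \mathcal{Y} \rangle = \sum_{i_1} \sum_{i_2} \cdots \sum_{i_n} x_{i_1 i_2 \ldots i_n}\, y_{i_1 i_2 \ldots i_n}$$

Definition 4 (Tensor dot product). Given two tensors $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_m}$ and $\mathcal{Y} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_{n-1} \times J_n}$, the tensor dot product $\mathcal{X} \cdot \mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_{m-1} \times J_1 \times J_2 \times \cdots \times J_{n-2} \times J_n}$ is given by:
$$(\mathcal{X} \cdot \mathcal{Y})_{i_1 i_2 \ldots i_m j_1 j_2 \ldots j_n} = \sum_{i_m, j_{n-1}} x_{i_1 i_2 \ldots i_m}\, y_{j_1 j_2 \ldots j_{n-1} j_n}$$
where $I_m = J_{n-1}$.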
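Definition 5 (Khatri-Rao product). Given two matrices $\mathbf{A} \in \mathbb{R}^{I \times K}$ and $\mathbf{B} \in \mathbb{R}^{J \times K}$, the Khatri-Rao product $\mathbf{A} \odot \mathbf{B} \in \mathbb{R}^{I \cdot J \times K}$ is given by:
$$\mathbf{A} \odot \mathbf{B} = \begin{bmatrix} \mathbf{a}_1 \otimes \mathbf{b}_1 & \mathbf{a}_2 \otimes \mathbf{b}_2 & \cdots & \mathbf{a}_K \otimes \mathbf{b}_K \end{bmatrix}$$
This can be thought of as a column-wise Kronecker product.

Definition 6 (Face-splitting product). Given two matrices $\mathbf{A} \in \mathbb{R}^{K \times I}$ and $\mathbf{B} \in \mathbb{R}^{K \times J}$, the face-splitting product $\mathbf{A} \bullet \mathbf{B} \in \mathbb{R}^{K \times I \cdot J}$ is given by:
$$\mathbf{A} \bullet \mathbf{B} = \begin{bmatrix} \mathbf{a}_1 \otimes \mathbf{b}_1 \\ \mathbf{a}_2 \otimes \mathbf{b}_2 \\ \vdots \\ \mathbf{a}_K \otimes \mathbf{b}_K \end{bmatrix}$$
This can be thought of as a row-wise Kronecker product.

Definition 7 (Vector cross product). Given two vectors $\mathbf{u} \in \mathbb{R}^3$ and $\mathbf{v} \in \mathbb{R}^3$, the vector cross product $\mathbf{u} \times \mathbf{v} \in \mathbb{R}^3$ is given by:
$$\mathbf{u} \times \mathbf{v} = \begin{vmatrix} \mathbf{e}_1 & \mathbf{e}_2 & \mathbf{e}_3 \\ a_1 & a_2 & a_3 \\ b_1 & b_2 & b_3 \end{vmatrix}$$
where $a_i$ and $b_i$ are the components of $\mathbf{u}$ and $\mathbf{v}$ respectively.

We discuss how to run each of these operations on our optical hardware in Sect. 6 (footnote 1).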
3
Language Details
3.1
Grammar
The language ideally would have a grammar that is intuitive, vast, and requires minimal coding on the user’s side. Since this is a prototype, however, the grammar is limited and technical. This minimizes the number of compiler tricks needed, which we found was a good avenue to take to focus on compiling tensor algebra expressions. The current grammar is described in Fig. 2 in Extended Backus-Naur Form (EBNF) notation. Notable emphasis is placed on expressions, as they are the focus for our application of optical computing. Currently, it is only possible to declare new statements, as shown in Fig. 2. However, all operations we discuss can be performed with this grammar alone, which we plan to expand in the future.
3.2
Supported Operators
The standard PEMDAS order is supported for scalars. For tensors of rank n ≥ 1, the order of operations should be defined with parentheses. We show the operators supported in this Apollo prototype in Tables 1 and 2.
1 We refer to Definition 3 in the general case to provide a complete definition, but only discuss implementation in the vector case.
Fig. 2. Apollo’s Grammar Shown in EBNF. The Base Case in the Recursive Tensor Structure is a List of Comma Separated Integers and/or Floating Point Values
Table 1. Binary Operators
Name                          Operator   Usage
Addition                      +          s1 + s2
Subtraction                   −          s1 − s2
Multiplication, dot product   *          s1 s2, s1 · T1, T1 · s1, T2 · T2
Division                      /          x ÷ y
Kronecker product             @          T1 ⊗ T2
Khatri-Rao product            &          T1 ⊙ T2
Face-splitting product        %          T1 • T2
Cross product                 #          x × y
Table 2. Unary Operators
Name       Operator   Usage
Negation   –          −x

4
Compiler Design
4.1
Workflow
The workflow we decided on is shown in Fig. 3. Note that the Apollo compiler is 2-stage.
Fig. 3. Native Apollo Code Gets Compiled into Apollo Virtual Machine (AVM) Instructions by the Compiler Front-End. AVM Generates Standard Assembly Instructions for Regular Operations and Compiles Tensor Algebra to t1926 Instructions. Respective Assemblers Target the host CPU and Tachyon 1. This Chosen Workflow Enables Tensor Algebra to be Outsourced to Tachyon 1. Note that the Scope of this Paper is Limited to AVM Instruction Generation
The standard assembler targets the host CPU, whereas the t1926 assembler targets Tachyon 1. Such a setup is used because Tachyon 1 is geared towards certain types of computations only.
4.2
Compiler Front-End
We use a hand-coded compiler front-end (lexer, parser, and code generator). This is because we have found that parser generators do not cooperate well with tensor algebra and our storage choice. We use a recursive descent parser, which works well for performance. The in-compiler tensor storage we describe in Sect. 5.2 is more easily implemented with such a parser. It is noteworthy that there are many instances in the language where operators are overloaded. For example, consider the multiplication operator, *. If A * B is called, four cases are possible: 1) A is a scalar and B is a tensor of rank n > 0, 2) A is a tensor of rank n > 0 and B is a scalar, 3) A and B are both scalars, or 4) A and B are both tensors of rank n > 0. The parser considers these cases and generates abstract syntax tree (AST) nodes of the correct type (e.g., variable nodes, scalar nodes, tensor nodes, etc.). The AST is traversed in pre-order by the code generator, sequentially producing the appropriate VM instructions. Standard procedures are followed for variable handling. In the case of more exotic AST nodes (e.g., tensor nodes) the code generator calls special functions (discussed in Sect. 6) to generate the correct code. The VM instruction set is outlined in Sect. 4.3.
4.3
Virtual Machine
Apollo’s VM is stack-based. It provides 4 memory segments (namely, the constant, global, pointer, and this segments), shown in Fig. 4 (footnote 2).
Fig. 4. The constant (abstract), global, pointer, and this Virtual Segments Respectively
Each of these segments is anchored to a specific location in RAM at compile time (footnote 3). They are fixed in their locations, except for the this segment, which we use for tensors. Index 0 in the pointer segment contains the base address of the this segment, so if the value at index 0 changes, the this segment gets anchored to a different RAM location, similar to [24]. As the language expands, we may add additional memory segments that can dynamically change location during run-time; if we take this route, we will allocate more RAM and add more values to the pointer segment.

The constant segment is used to push and pop constants to and from the stack, as in [24]. Note that despite showing solely integers, the segment supports integer and floating-point values. The global segment is used in conjunction with the symbol table to store variable values, which can be accessed throughout the lifetime of the program. Values in the global segment can also be references to tensors (footnote 4). See Sect. 5 for more information regarding tensor storage.

The memory access commands are push [segment] i and pop [segment] i. The push instruction pushes the value at index i of memory segment [segment] onto the stack. The pop instruction pops the value on top of the stack onto index i of memory segment [segment] [24]. The rest of the AVM instruction set (composed of arithmetic instructions and built-in subroutines) is given in Tables 3 and 4.

2 The RAM referred to throughout this section is a simplified virtual abstraction. Hence, we freely interact with it using numbers in the decimal system. The actual RAM is referred to when discussing compilation to target architectures, which will be done in a future paper.
3 Exact RAM indices are not included.

Table 3. AVM Arithmetic Instruction Set
Operation   Compiles to   Description
neg         Host ASM      Negates the value at the top of stack
add         Host ASM      Pops stack into b. Pops stack into a. Pushes a + b to stack
sub         Host ASM      Pops stack into b. Pops stack into a. Pushes a − b to stack
mult        Host ASM      Pops stack into b. Pops stack into a. Pushes ab to stack
div         Host ASM      Pops stack into b. Pops stack into a. Pushes a ÷ b to stack
mvmul       Host ASM      Pops stack into b. Pops stack into A. Pushes Ab to stack
Note that each arithmetic instruction can be done in a single instruction by the corresponding processor. Subroutines are handled with the instruction call [fname] [nArgs]. The first [nArgs] values are treated as arguments, so the virtual machine would pop the stack [nArgs] times if the call command is generated.
4 Apollo does not yet support user-defined subroutines, so a local segment is not required.
Table 4. AVM Subroutine Instruction Set
Name     Args       Description
malloc   int size   Finds an unused RAM segment of length size, pushes pointer pointing to the first segment index to stack
Since malloc has 1 argument, a possible code fragment for it looks like:
    push constant 3
    call malloc 1
This would 1) push 3 onto the stack, 2) pop 3 off the stack and pass it into malloc, 3) find an unused RAM segment of size 3, and 4) push a pointer to the first index of that segment onto the stack. Its behavior mimics Memory.alloc in [24].
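The instruction semantics described in this section can be illustrated with a small interpreter. The following is a minimal sketch, not the Apollo implementation: the class name AVM, the segment base addresses, the bump allocator standing in for malloc, and the choice to hold matrices and vectors directly on the stack as Python lists are all assumptions made for illustration.

```python
class AVM:
    """Toy stack machine; the RAM layout here is hypothetical."""
    GLOBAL_BASE, POINTER_BASE, HEAP_BASE = 0, 16, 32

    def __init__(self, ram_size=128):
        self.ram = [0] * ram_size
        self.stack = []
        # Assumption: a bump allocator stands in for "find an unused segment".
        self.next_free = self.HEAP_BASE

    def _address(self, segment, i):
        if segment == "global":
            return self.GLOBAL_BASE + i
        if segment == "pointer":
            return self.POINTER_BASE + i
        if segment == "this":
            # pointer[0] holds the base address of the this segment.
            return self.ram[self.POINTER_BASE] + i
        raise ValueError(f"no RAM address for segment {segment!r}")

    def push(self, segment, i):
        if segment == "constant":
            self.stack.append(i)  # abstract segment: the index is the value
        else:
            self.stack.append(self.ram[self._address(segment, i)])

    def pop(self, segment, i):
        self.ram[self._address(segment, i)] = self.stack.pop()

    def arithmetic(self, op):
        if op == "neg":
            self.stack.append(-self.stack.pop())
            return
        b = self.stack.pop()  # second operand is popped first
        a = self.stack.pop()
        if op == "add":
            result = a + b
        elif op == "sub":
            result = a - b
        elif op == "mult":
            result = a * b
        elif op == "div":
            result = a / b
        elif op == "mvmul":
            # a is a matrix (list of rows), b is a vector: computes Ab.
            result = [sum(x * y for x, y in zip(row, b)) for row in a]
        else:
            raise ValueError(f"unknown instruction {op!r}")
        self.stack.append(result)

    def call(self, fname, n_args):
        args = [self.stack.pop() for _ in range(n_args)][::-1]
        if fname != "malloc":
            raise ValueError(f"unknown subroutine {fname!r}")
        (size,) = args
        pointer = self.next_free
        self.next_free += size
        self.stack.append(pointer)


vm = AVM()
vm.push("constant", 3)
vm.call("malloc", 1)     # pushes a pointer to a free block of length 3
vm.pop("pointer", 0)     # re-anchors the this segment at that block
vm.push("constant", 2.5)
vm.pop("this", 0)        # writes 2.5 into the first cell of the block
```

In this sketch the mvmul case runs on the host, whereas Fig. 3 describes tensor algebra being outsourced to Tachyon 1; the interpreter only mirrors the stack discipline of Table 3.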
5
Sparse Tensor Storage
5.1
Current Methods
Tensor components are conventionally represented as nested arrays in standard programming languages. In C++, the components are stored as one contiguous array. To access the element at index ij, the element at index base + i·n + j in the flattened block is indexed (where base is the base address of the array and n is the number of elements in each row) [11]. In Java, each array of dimension n + 1 contains pointers to each sub-array of dimension n. If n = 0, the (n + 1)-dimensional array simply stores scalar values [1]. Since tensors are often sparse, however, these conventional methods often end up storing excess zeros, making them sub-optimal. The Facebook tensor discussed in Sect. 1.1 has only 737,934 nonzero values and is therefore more than 99.999% made up of zeroes. It is apparent that tensor storage optimizations must be considered.

Compressed Sparse Fiber (CSF) format is a better method that stores a tensor in a tree structure, where the indices and values for only nonzero components are contained, as shown in Fig. 5.

Fig. 5. CSF Representation of the Tensor in Fig. 1, as Proposed by [29]

CSF performs significantly better than conventional approaches for applications involving highly sparse tensor algebra. However, CSF requires storing pointers to each child node, likely integrated to enable fast indexing [29]. Such an optimization is incredibly important for classical computing hardware; however, since our optical hardware can do an MVM in a single instruction, it is not necessary that we are able to access indices efficiently in intermediate computations. Rather, it is important that we return an entire row of indices as fast as possible. Section 6 provides insight into
why this is the case. By shifting to a sparse tensor storage method that optimizes for row indexing rather than item indexing, we would be able to more efficiently store tensors for optical hardware. 5.2
Binary Sparse Tensor Tree Format
To save memory and return sub-tensors quickly, we store the tensor in Fig. 1 as shown in Fig. 6. The corresponding array is shown in Fig. 7. We only use this format for intermediate computations. It is slower to index into a specific value, but this is irrelevant as such indexing is not necessary for Apollo-supported intermediate computations on Tachyon 1. Again, however, we must be able to access a full row of rank-n indices easily. This is efficient with our format as we can simply return a pointer to the first index. Hence, our method is more useful than CSF format for optical hardware. In addition, it saves on memory because it eliminates the need to store a pointer to each child node.
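The structure described in the captions of Figs. 6 and 7 can be sketched as a left-child, right-sibling binary tree whose pre-order traversal yields a flat array of integer indices and floating-point values. The code below is one possible reading of those captions, not the authors' implementation: the Node class, the build and preorder helpers, and the input format (a list of index tuples with values) are hypothetical, and the sketch covers only construction and flattening.

```python
class Node:
    def __init__(self, index):
        self.index = index  # integer index along the current mode
        self.left = None    # root of the sub-tensor below, or a float leaf
        self.right = None   # next sub-tensor sharing the same parent


def build(entries):
    """Build the sibling chain for a list of (index_tuple, value) nonzeros."""
    groups = {}
    for coords, value in entries:
        groups.setdefault(coords[0], []).append((coords[1:], value))
    head = prev = None
    for index in sorted(groups):
        node = Node(index)
        rest = groups[index]
        if rest[0][0]:
            node.left = build(rest)         # more modes remain below
        else:
            node.left = float(rest[0][1])   # leaf: the stored value
        if prev is None:
            head = node
        else:
            prev.right = node
        prev = node
    return head


def preorder(node, out=None):
    """Flatten the tree: ints are indices, floats are values."""
    if out is None:
        out = []
    if isinstance(node, float):
        out.append(node)
    elif node is not None:
        out.append(node.index)
        preorder(node.left, out)
        preorder(node.right, out)
    return out


nonzeros = [((0, 0, 1), 5.0), ((0, 2, 0), 3.0), ((1, 1, 1), 7.0)]
flat = preorder(build(nonzeros))
print(flat)
```

Because each node carries only an index and the traversal order encodes the shape of the tree, no per-child pointers are written to the flat array, which matches the memory argument made above.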
6
Compiling Tensor Algebra Expressions
As stated in earlier sections, the most powerful tensor algebra operation supported by Tachyon 1 that can be done in a single instruction is matrix-vector multiplication (MVM). Therefore, it is the compiler’s job to translate more complex operations into sequences of MVMs when applicable, thereby accelerating computation of the whole expression. For clarity, note that Tachyon 1 multiplies matrices and vectors in the order Ax = b. Also note that decomposition into sub-tensors whose sizes are supported by Tachyon 1 is not covered in this paper.
Fig. 6. Our Representation of the Tensor in Fig. 1, which we Refer to as Binary Sparse Tensor Tree (BSTT) Format. Each Non-Leaf Node Contains an Index. The Left Child is Always the Root of a Sub-Tensor Belonging to the Current Tensor. The Right Child is always the Root of the Next Sub-Tensor Belonging to the Parent Tensor Shared by the Current Tensor
Fig. 7. Pre-Order Traversal of BSTT Format Results in the Following Array, which is then Stored on the Heap (we Plan to make Tensors Mutable in Future Apollo Versions). Values are Always Assumed to be Floating Point Numbers, a Safe Assumption Due to the Large Number of Non-Integer Values Encountered in the Targeted Fields [19, 29–32]. Indices are Always Integers. This Allows us to Determine the Leaf Nodes and “Reconstruct” the Tree when Needed
6.1
Scalar-Tensor Product
The scalar-tensor product as defined in Definition 1 is a commutative operation that multiplies each element in a tensor $\mathcal{X}$ by a scalar $\lambda$ (footnote 5). The product is very easy to compile; simply iterate through each vector $\mathbf{x}_{i_1 i_2 \ldots i_{n-1}}$ in the tensor and generate the mvmul instruction to multiply it by the matrix $\lambda\mathbf{I} = \begin{bmatrix} \lambda & 0 \\ 0 & \lambda \end{bmatrix}$. Note that the compiler reorients the product to generate the matrix before the vector if the user calls it in the opposite order. In other words, it ensures that running the generated code results in a product in the order $\lambda\mathbf{I}\mathbf{x}_{i_1 i_2 \ldots i_{n-1}}$.
5 $\mathcal{X}$ is assumed to be a tensor of rank n > 0, since the parser would map the scalar case to scalar multiplication.
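The decomposition above can be sketched in a few lines. This is a minimal illustration under stated assumptions: tensors are nested Python lists, mvmul is a plain host-side stand-in for the single-instruction MVM, and the helper names scaled_identity and scalar_tensor_product are hypothetical.

```python
def mvmul(A, x):
    """Stand-in for the mvmul instruction: returns the product Ax."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def scaled_identity(lam, n):
    """Build the n-by-n matrix lam * I."""
    return [[lam if r == c else 0.0 for c in range(n)] for r in range(n)]


def scalar_tensor_product(lam, X):
    """Multiply every innermost vector of X by lam * I."""
    if not isinstance(X[0], list):
        # X is a vector: issue one MVM in the order (lam I) x.
        return mvmul(scaled_identity(lam, len(X)), X)
    return [scalar_tensor_product(lam, sub) for sub in X]


print(scalar_tensor_product(2.0, [[1.0, 2.0], [3.0, 4.0]]))
```

The identity matrix is sized to the length of each innermost vector, so a rank-n tensor produces one MVM per vector regardless of which operand the user wrote first.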
6.2
Rank-n Kronecker Product
The Kronecker product is useful in signal and image processing [23]. Through the Khatri-Rao product, it is useful in neural networks (through minimization of convolution and tensor sketch operations) and natural language processing [17]. Refer to Definition 2 for the definition of the Kronecker product. For clarity, each element in the result is simply the element at $x_{i_1 i_2 \ldots i_n}$ multiplied by $\mathcal{Y}$ for two tensors $\mathcal{X}$ and $\mathcal{Y}$. The product can be represented compactly between two matrices as
$$\mathbf{A} \otimes \mathbf{B} = \begin{bmatrix} a_{11}\mathbf{B} & a_{12}\mathbf{B} & \cdots & a_{1n}\mathbf{B} \\ a_{21}\mathbf{B} & a_{22}\mathbf{B} & \cdots & a_{2n}\mathbf{B} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1}\mathbf{B} & a_{m2}\mathbf{B} & \cdots & a_{mn}\mathbf{B} \end{bmatrix}$$
Therefore, the compiler can compute the scalar-tensor product for each element in the resultant block tensor through the method outlined in Sect. 6.1.
6.3
Tensor Dot Product
Many fields, including machine learning and physics, demand the ability to compute the dot product efficiently [12,26]. To allow Tachyon 1 to meet this demand, we must also provide a way for the Apollo compiler to transform this operation into a sequence of MVMs. The tensor dot product is an operation between two rank-n tensors, A and B. Some possibly familiar tensor dot products include the rank-0, rank-1, and rank-2 dot products (scalar product, vector dot product, and matrix multiplication, respectively). The compiler considers the dot product operation over the component arrays. A few cases are possible:
1. A and B are both scalars
2. Either A or B is a scalar, but not both
3. A is a vector/matrix, whereas B is a vector
4. A is a vector/matrix, whereas B is a rank-n tensor with n > 2
5. A is a rank-n tensor where n > 2, whereas B is a vector
6. A is a rank-n tensor and B is a rank-m tensor, where n > 1, m > 1, and n ≠ m
7. A and B are both rank-n tensors
Cases 1 and 2 are irrelevant since the parser maps Case 1 to the scalar product and Case 2 to the scalar-tensor product (Definition 1; discussed in Sect. 6.1). In Case 3, A is always treated as a matrix and the mvmul command is simply generated (this accounts for Definition 3 if A is a vector). From this point on, we define a function $f_n$ that refers to Case n (e.g., $f_3$ generates an MVM instruction). Continuing, in Case 4, A is also treated as a matrix. B is decomposed into a chain of vectors and a series of references to Case 3 ($f_3(A, b_{i_1 i_2 \ldots i_{n-1}})$) are made. In Case 5, A is decomposed into a chain of matrices and a series of references to Case 3 ($f_3(A_{i_1 i_2 \ldots i_{n-2}}, B)$) are again made.
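Cases 3 to 5 can be sketched as follows. This is an illustrative reading rather than the compiler's code generator: the functions compute results directly instead of emitting instructions, tensors are nested Python lists, f4 is restricted to a rank-3 B, and the choice of which vectors of B to extract in f4 is an assumption that follows the contraction in Definition 4 (last mode of A against the second-to-last mode of B).

```python
def mvmul(A, x):
    """Stand-in for the mvmul instruction: returns the product Ax."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def rank(T):
    r = 0
    while isinstance(T, list):
        r, T = r + 1, T[0]
    return r


def f3(A, b):
    """Case 3: a vector A is treated as a one-row matrix."""
    if rank(A) == 1:
        A = [A]
    return mvmul(A, b)


def f4(A, B):
    """Case 4 for a rank-3 B of shape (J1, K, P); A has shape (I, K)."""
    if rank(A) == 1:
        A = [A]
    n_i, n_j, n_p = len(A), len(B), len(B[0][0])
    out = [[[0.0] * n_p for _ in range(n_j)] for _ in range(n_i)]
    for j in range(n_j):
        for p in range(n_p):
            # Assumed decomposition: the vector along B's second-to-last mode.
            fiber = [row[p] for row in B[j]]
            y = f3(A, fiber)
            for i in range(n_i):
                out[i][j][p] = y[i]
    return out


def f5(A, b):
    """Case 5: peel leading modes of A until matrices remain."""
    if rank(A) == 2:
        return f3(A, b)
    return [f5(sub, b) for sub in A]


T = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
print(f3([1, 2, 3], [4, 5, 6]))
print(f4([[1, 0], [0, 1]], T))
print(f5(T, [1, 1]))
```

Every numeric operation in f4 and f5 is routed through f3, so the number of MVMs issued equals the number of vectors (Case 4) or matrices (Case 5) in the decomposed operand.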
In Cases 6 and 7, we consider Definition 4. In Case 6, the tensor of lower rank is first decomposed. Case 4 is then referenced for each matrix if A was decomposed (always into a matrix chain, resulting in calls to $f_4(A_{i_1 i_2 \ldots i_{n-2}}, B)$) and Case 5 is referenced if B was decomposed (always into a vector chain, resulting in calls to $f_5(A, b_{i_1 i_2 \ldots i_{n-1}})$). Finally, in Case 7, A is decomposed and Case 4 is referenced ($f_4(A_{i_1 i_2 \ldots i_{n-2}}, B)$).
6.4
Vector Cross Product
The cross product is an operation that appears frequently in computational geometry/computer graphics. A common task is to generate a third vector orthogonal to two other vectors (or a plane formed by 3 points) [14]. The cross product can also be used to calculate the distance between two lines and calculate if they are parallel. It also appears in a multitude of physics simulations. For most applications, cross products are in $\mathbb{R}^3$ and between two vectors (footnote 6). We consider the cross product in a positively oriented orthonormal basis. The cross product of two vectors in $\mathbb{R}^3$ as defined in Definition 7 is also given by the antisymmetric matrix-vector product
$$\mathbf{a} \times \mathbf{b} = \begin{bmatrix} 0 & -a_3 & a_2 \\ a_3 & 0 & -a_1 \\ -a_2 & a_1 & 0 \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix}$$
The mvmul command can simply be generated from here.
6.5
Other Tensor Products
The Khatri-Rao product is useful in variances in statistics, multi-way models, linear matrix equations, and signal processing [7–9,22,27]. The face-splitting product is useful in convolutional layers in neural networks and digital signal processing in a digital antenna array [16,28]. The code generation method shown in Sect. 6.2 can be easily extended to support the Khatri-Rao and face-splitting products given in Definitions 5 and 6 respectively. An mvmul command can be generated for each index i on the operands $\mathbf{a}_i$ and $\mathbf{b}_i$.
6.6
Compilation of Expressions
Chaining multiple operations into expressions is supported. The code generator traverses the AST with the tensor algebra operator precedence discussed in Sect. 3.1, and each code generation command is called sequentially as outlined in Sect. 6. However, as a prototype, the Apollo compiler assumes the arguments are valid and performs no expression-related error handling.
6 Higher rank cross products can be defined using the Levi-Civita symbol $\varepsilon_{ijk}$, which we omit due to relatively few applications.
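The reductions in Sects. 6.2, 6.4, and 6.5 can be illustrated together, since each one bottoms out in matrix-vector products. The sketch below is an assumption-laden illustration, not the Apollo code generator: matrices are lists of rows, mvmul is a host-side stand-in for the single-instruction MVM, and the helper names scale, cross, kron_matrix, and khatri_rao are hypothetical.

```python
def mvmul(A, x):
    """Stand-in for the mvmul instruction: returns the product Ax."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def scale(lam, v):
    """Scalar-vector product issued as (lam I) v, as in Sect. 6.1."""
    n = len(v)
    lam_identity = [[lam if r == c else 0.0 for c in range(n)] for r in range(n)]
    return mvmul(lam_identity, v)


def cross(a, b):
    """Cross product in R^3 as one antisymmetric matrix-vector product."""
    skew = [[0.0, -a[2], a[1]],
            [a[2], 0.0, -a[0]],
            [-a[1], a[0], 0.0]]
    return mvmul(skew, b)


def kron_matrix(A, B):
    """Kronecker product of two matrices: block (i, j) equals a_ij * B."""
    out = []
    for a_row in A:
        # blocks[j][r] is row r of the block a_ij * B.
        blocks = [[scale(a, b_row) for b_row in B] for a in a_row]
        for r in range(len(B)):
            out.append([x for block in blocks for x in block[r]])
    return out


def khatri_rao(A, B):
    """Column-wise Kronecker product: column k equals a_k (x) b_k."""
    columns = []
    for k in range(len(A[0])):
        a_k = [row[k] for row in A]
        b_k = [row[k] for row in B]
        columns.append([x for a in a_k for x in scale(a, b_k)])
    return [list(row) for row in zip(*columns)]


print(cross([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))
print(kron_matrix([[1, 2]], [[1, 0], [0, 1]]))
print(khatri_rao([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The face-splitting product of Definition 6 would follow the same pattern with rows in place of columns. The cross product needs a single MVM, whereas the Kronecker-style products issue one MVM per scalar-vector pair.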
7
Discussion
There are still additions that will need to be made to the Apollo language in order to fully optimize optical computations. Most importantly, we will need to use our tensor storage algorithm only for highly sparse tensors involved in intermediate computations; we currently implement it for all tensors. We plan to also extend Apollo to generate t1926 and host instructions, integrate t1926 instructions with Tachyon 1, and develop the methodology by which Tachyon 1 would interact with the host CPU. Neural network activation functions, such as ReLU, sigmoid, and softmax, are planned to be hard-wired into the microchip; we will extend the language to support neural networks when this occurs. We also plan to add more useful tensor algebra operations based on the foundation discussed in this paper, such as the Matricized Tensor Times Khatri-Rao Product (MTTKRP). These and other extensions would help Apollo become a more robust and efficient language. Future research should explore tensor storage methods that will be able to more efficiently represent sparse tensors while still making them easy to index into. In order to optimize for speed, it will also be crucial to investigate how to best minimize required communication between Tachyon 1 and the host CPU, as converting between optical and electrical signals takes a significant amount of time. We plan to conduct this research ourselves, but at the same time encourage others to look into it as well. In the future, we plan on extending the advances made in developing the Apollo language to build APIs for high-level languages (e.g., Python, Java, C++, etc.) so that they will be able to utilize Tachyon 1. This will allow users of conventional languages to be able to harness the speed of optical computing for applications such as physics simulations and ML. 
We specifically plan on building libraries able to integrate with the TensorFlow and PyTorch APIs so that users will be able to run ML models made with these APIs on Tachyon 1. Our current design framework for Apollo leads the way for more powerful calculations to be performed faster on a new generation of hardware. With future advancements and optimizations, Apollo has the potential to impact numerous fields in engineering, computer science, and the natural sciences by allowing for significantly faster tensor algebra computations.
8
Conclusion
In this paper, we show how to perform tensor algebra computations on an optoelectronic microchip through Apollo, a domain specific language designed for this purpose. We then go over the language, compiler, and virtual machine designs. We also emphasized matrix-vector multiplication (MVM) operations in the design of Apollo in order to take advantage of optical hardware’s proficiency at high-speed MVM. Through this, we ensure that Apollo is optimized for photonic computation and that it will be a viable option for photonic computing applications, such as physics simulations and DNNs.
We also showed why existing sparse tensor storage methods like CSF are sub-optimal for optical hardware: while the CSF format specifically optimizes for item indexing, optical hardware makes far greater use of row indexing for MVM operations. To solve this problem, we introduced BSTT, which uses a binary tree structure that optimizes for row indexing and doesn’t require that a pointer be assigned to every child node. Through this, we present a key step forward in the development of photonic-optimized software that will help make photonic-powered physics simulations and DNNs more efficient. We also go over the compilation of tensor algebra expressions into MVMs, which are native to our microchip, and illustrate how complex tensor algebra expressions can be run quickly and efficiently through our methods. Finally, we discuss the impact of our research, provide suggestions for future research avenues, and outline how we plan to extend the Apollo language.
9
Future Scope and Apollo’s Next Steps
As mentioned in Sect. 7, there are still steps that will need to be taken to build upon this research and strengthen Apollo. We will tune Apollo to specific Tachyon ISAs to further optimize it for photonic hardware, though the core tenets of Apollo discussed in this paper (an emphasis on MVM operations and sparse tensor storage that optimizes for row indexing) will remain in future Apollo versions. Furthermore, we aim to make Apollo more accessible by creating APIs that will allow a wider pool of software engineers and researchers to utilize Apollo through higher level languages such as Java, C++, and Python. We hope that this will allow software designers without specific domain knowledge of photonics to utilize the power of optical hardware in a similar way to how the TensorFlow and PyTorch libraries allow software designers who may not be familiar with calculus to still make and use machine learning applications. Though work still needs to be done to bring Apollo to the world, our current design introduces concepts that will be crucial for advancing photonic computing and continuing the pursuit of computational growth after the death of Moore’s Law.
Acknowledgments. We thank Dhruv Anurag for Apollo-related discussion and testing. We thank Jagadeepram Maddipatla for creating test cases. We thank Dr. Jonathan Osborne for mathematical discussion and advice. We thank Mr. Emil Jurj for supporting this project. We thank Shihao Cao for support and useful discussion about the project’s future. Finally, we thank our families for extended support and patience.
References
1. Arrays
2. Comsol multiphysics® software - understand, predict, and optimize
3. Engineering simulation software | ansys products
4. Multiphysics modeling
5. Blalock, D., Guttag, J.: Multiplying matrices without multiplying. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning, vol. 139. Proceedings of Machine Learning Research, pp. 992–1004. PMLR, 18–24 July 2021
6. Briola, A., Turiel, J.D., Marcaccioli, R., Aste, T.: Deep reinforcement learning for active high frequency trading. CoRR, abs/2101.07107 (2021)
7. Bro, R.: Multi-way Analysis in the Food Industry. Models. Algorithms and Applications
8. Budampati, R.S., Sidiropoulos, N.D.: Khatri-Rao space-time codes with maximum diversity gains over frequency-selective channels. In: Sensor Array and Multichannel Signal Processing Workshop Proceedings, 2002. IEEE (2003)
9. Chambers, R.L., Dorfman, A.H., Wang, S.: Limited information likelihood analysis of survey data. J. R. Stat. Soc. Ser. B Stat. Methodol. 60(2), 397–411 (1998)
10. Cole, C.: Optical and electrical programmable computing energy use comparison. Opt. Express 29(9), 13153–13170 (2021)
11. Corob-Msft. Arrays (c++)
12. Dahl, G., Leinaas, J.M., Myrheim, J., Ovrum, E.: A tensor product matrix approximation problem in quantum physics. Linear Algebra Appl. 420(2), 711–725 (2007)
13. Dunlavy, D.M., Kolda, T.G., Kegelmeyer, W.P.: 7. Multilinear Algebra for Analyzing Data with Multiple Linkages, pp. 85–114
14. Eisele, R.: 3D cross product
15. Garg, S., Lou, J., Jain, A., Nahmias, M.A.: Dynamic precision analog computing for neural networks. CoRR, abs/2102.06365 (2021)
16. Ha, D., Dai, A.M., Le, Q.V.: Hypernetworks. CoRR, abs/1609.09106 (2016)
17. Jagtap, A.D., Shin, Y., Kawaguchi, K., Em Karniadakis, G.: Deep kronecker neural networks: a general framework for neural networks with adaptive activation functions. CoRR, abs/2105.09513 (2021)
18. Keyes, D.E., et al.: Multiphysics simulations: challenges and opportunities. Int. J. High Perform. Comput. Appl. 27(1), 4–83 (2013)
19. Kjolstad, F., Kamil, S., Chou, S., Lugato, D., Amarasinghe, S.: The tensor algebra compiler. Proc. ACM Program. Lang. 1(OOPSLA), 77:1–77:29 (2017)
20. Kola, T., et al.: Tensor toolbox for matlab v. 3.0, 3 2017
21. Lehrer, J.: 1,084 days: How toy story 3 was made, June 2010
22. Lev-Ari, H.: Efficient solution of linear matrix equations with applications to multistatic
23. Van Loan, C.F.: The ubiquitous kronecker product. J. Comput. Appl. Math. 123(1), 85–100 (2000). Numerical Analysis 2000. Vol. III: Linear Algebra
24. Nisan, N., Schocken, S.: The Elements of Computing Systems: Building a Modern Computer from First Principles. The MIT Press, Cambridge (2021)
25. Peltzer, P., Lotz, J., Naumann, U.: Eigen-ad: algorithmic differentiation of the eigen library. CoRR, abs/1911.12604 (2019)
26. Rabanser, S., Shchur, O., Günnemann, S.: Introduction to tensor decompositions and their applications in machine learning (2017)
27. Sims, C.A., Stock, J.H., Watson, M.W.: Inference in linear time series models with some unit roots. Econometrica 58(1), 113 (1990)
28. Slyusar, V.: New matrix operations for dsp, 11 1999
29. Smith, S., Ravindran, N., Sidiropoulos, N.D., Karypis, G.: Splatt: efficient and parallel sparse tensor-matrix multiplication. In: 2015 IEEE International Parallel and Distributed Processing Symposium, pp. 61–70 (2015)
30. Srivastava, N.K.: Design and generation of efficient hardware accelerators for sparse and dense tensor computations (2020)
31. Tew, P.A.: An investigation of sparse tensor formats for tensor libraries. M.eng. thesis, Massachusetts Institute of Technology, Cambridge, MA, June 2016
32. Xu, H., Kostopoulou, K., Dutta, A., Li, X., Ntoulas, A., Kalnis, P.: Deepreduce: a sparse-tensor communication framework for federated deep learning. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P.S., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems, vol. 34, pp. 21150–21163. Curran Associates Inc (2021)
Evaluation of Accuracy: A Comparative Study Between Finger Pointing and Stylus Pointing Using Mid-Air Input
Zeeshan Haider Malik(B), Bassam Siddiqui, Afsheen Sabir, and Ismail Afzal
Forman Christian College (A Chartered University), Lahore, Pakistan
[email protected]
Abstract. This paper investigates the difference in accuracy between finger pointing and stylus pointing using LEAP Motion. Experiments were conducted with customized software developed specifically for testing purposes, using the LEAP Motion device with three groups of users: Computing, Non-computing, and Laymen. All users were asked to perform two types of tests, i.e., finger pointing and stylus pointing, under the same conditions. The results were then analyzed to determine whether finger pointing or stylus pointing is better as a mid-air gesture input. The significance of this research is that if either finger pointing or stylus pointing proves to be sufficiently accurate, it can be introduced in different fields.
Keywords: Midair Gesture Input · Leap Motion · Finger Pointing · Stylus Pointing · Human Computer Interaction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): SAI 2023, LNNS 711, pp. 34–51, 2023. https://doi.org/10.1007/978-3-031-37717-4_4

1 Introduction
Recent vision sensor technologies can capture 3D finger positions and movements efficiently. They suggest a new way to interact with and control computers by moving fingers and making hand gestures in the air. The positions of the fingers and the hand gestures are captured accurately by a computer vision device, and by tracking the movement patterns of the fingers the system can recognize the user’s intended input or control commands. Mid-air gestures let users point directly at an object, much like touch, with the added benefit that the action can be performed from a distance. Most known mid-air techniques use ray casting, which extends an imaginary line from a finger or an object to find the point of contact on the display. Earlier research used laser pointers to interact with objects placed at a distance; later work has focused on freehand pointing. Research on mid-air interaction has helped to address several challenges.

Past work has proposed many gestures for freehand pointing and a number of gestures for triggering a selection. The most common is the pinch gesture, in which users pinch their index finger and thumb together to trigger an action; other techniques such as Air Tap, Thumb Trigger, and Side Trigger use different gestures. Pointing is affected by the way the fingers are used. Furthermore, past research shows that mid-air pointing generally has low accuracy. Several studies have measured the accuracy of mid-air input devices such as the Microsoft Kinect sensor [1] and the Leap Motion device. Those studies revolved around finger pointing and full-body gestures, whereas this research is concerned with the accuracy of a tool (a stylus) as mid-air input instead of just finger pointing and hand gestures.
2 Related Work
A study published in July 2008 investigated the accuracy of stylus versus finger input in 2D in medical scenarios [2]. The experiments used tablet PCs with a resolution of 1024 × 768 running Microsoft Windows XP Tablet PC Edition. The first device was a Motion C5, a tablet PC with a stylus designed for professional healthcare environments such as hospitals. The second was a Motion LE1600, which can be operated with a finger or a stylus. Using the Leap Motion device together with additional attached devices in 3D modeling can produce different types of behavior [3]. When measuring the accuracy of the Leap Motion controller, the human hand is the most relevant factor, but the hand is affected by tremor, a rhythmic movement of the muscles. Therefore, to evaluate the Leap Motion controller for gesture-based user interfaces, it is necessary to establish a reference system whose accuracy is finer than the human tremor, that is, below 0.2 mm [4]. As a mid-air input device, the Leap Motion also supports selection-based input. It uses a relative indirect input method in which the position of the cursor is controlled by the input, which can be a finger or a stylus. This technique can, however, cause a parallax problem for users, i.e., an incorrect perception of where the target is actually located [5]. Research has observed that finger pointing is more accurate for well-trained computer experts, whereas elderly users feel more comfortable pointing with a mouse than with a finger. This implies that HCI designers should take elderly users into account when designing user interfaces for finger pointing [6].
3 Experiment Design
The experiment used a within-group design [7], so all participants performed the same tasks. Testing software was developed to evaluate the performance of users with finger and stylus pointing. It was built on the .NET platform using C#. The dependent variables were wrong attempts, right attempts, and the execution time to click a button; the independent variables were the size, location, and distance of a button.
3.1 Fitts’ Law
Fitts’ law describes the relation between the time needed to perform a task (for example, a pointing action), the distance between the starting point and the target, and the size of the target [8]. It was developed in 1954 by Paul Fitts and is an essential basis for testing the performance of devices used for pointing. It was later incorporated into the ISO 9241-9 standard for evaluating the usability and efficiency of such devices. Buxton and co-authors added further improvements and extended it to 2D interfaces. The Shannon formulation is:

$$T = a + b \log_2\left(\frac{D}{W} + 1\right)$$

The time T needed to move a pointing marker into a target area depends on the distance D between the start point and the target and on the size W of the target area; a and b are constants determined empirically for a given device. The term

$$ID = \log_2\left(\frac{D}{W} + 1\right)$$

is called the Index of Difficulty (ID). A task becomes more difficult when the target is farther away and smaller. This matches the common perception that the time needed for a pointing task is linearly correlated with its Index of Difficulty.
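As an illustration of the Shannon formulation above, the following Python sketch computes the Index of Difficulty and the predicted movement time for two targets. The function names, the pixel units, and the constants a = 100 ms and b = 150 ms/bit are hypothetical values chosen only for this example; the paper does not report the constants used by its testing software.

```python
import math


def index_of_difficulty(distance: float, width: float) -> float:
    """Shannon formulation of the Index of Difficulty, ID = log2(D / W + 1), in bits."""
    if distance < 0 or width <= 0:
        raise ValueError("distance must be non-negative and width must be positive")
    return math.log2(distance / width + 1)


def predicted_time(distance: float, width: float, a: float, b: float) -> float:
    """Fitts' law: T = a + b * ID, in the time unit of a and b."""
    return a + b * index_of_difficulty(distance, width)


# Hypothetical regression constants (assumptions, not values from the paper).
A_MS = 100.0          # intercept in milliseconds
B_MS_PER_BIT = 150.0  # slope in milliseconds per bit

# Same button width, two different distances (pixels assumed as the unit).
near_target = predicted_time(distance=100, width=100, a=A_MS, b=B_MS_PER_BIT)
far_target = predicted_time(distance=700, width=100, a=A_MS, b=B_MS_PER_BIT)

print(near_target)  # ID = log2(2) = 1 bit  -> 250.0
print(far_target)   # ID = log2(8) = 3 bits -> 550.0
```

With these assumed constants, moving the same button from 100 px to 700 px away raises the ID from 1 bit to 3 bits, and the predicted time grows linearly with it.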
3.2 Working of Software
The testing software first shows the login screen, where the user enters their credentials and presses the Next button (Fig. 1). On the next screen the user selects the type of pointing and then clicks Start Training, as shown in Fig. 2.
Fig. 1. Login Screen
Fig. 2. Selection Screen
3.3 Leap Motion Controller
The Leap Motion controller is a small USB peripheral designed to be placed on a physical desktop, facing upward. It can also be mounted onto a virtual reality headset. It uses two monochromatic IR cameras and a set of three infrared LEDs, and it observes a roughly hemispherical area to a distance of about one meter. The LEDs generate pattern-less IR light, and the cameras capture the reflected data at almost 200 frames per second. The data is sent over a USB cable to the desktop computer, where it is analyzed by the Leap Motion Controller software. The software uses some “complex maths” to analyze the data; the company has not disclosed how the analysis works. A study in 2013 found that the overall average accuracy of the controller is 0.7 mm. The product differs from the Kinect in its higher resolution and smaller observation area; the Kinect is better suited to tracking the whole body in a space such as a living room. The device supports high-precision drawing, navigating a website, manipulating complex 3D data visualizations, and pinch-to-zoom gestures on maps.

3.4 Experiment Design
Testing was conducted in a controlled environment on a 1.9 GHz Sony Vaio SVF14213 laptop running Windows 8. The 14-inch screen had a resolution of 1366 × 768. The laptop was placed in front of the user, and the Leap Motion device was attached to it to provide mid-air input. The laptop and the device were placed on the same level, as shown in Fig. 3. An evaluator recorded video of facial expressions, and the screen was recorded with the screen-casting software Camtasia 9. The stylus used for pointing was the Microsoft Surface Pen (Fig. 4). It measured 5.67 × 0.37 × 0.40 in (144 × 9.5 × 10.2 mm) and weighed 20 g.
Fig. 3. Leap Motion Device
Fig. 4. Leap Motion Device
3.5 Pilot Testing
The pilot study was conducted in a controlled environment to find errors in the software that had not been discovered during implementation. Two pilot users were chosen, both faculty members of the Computer Science Department. One of them is an expert in Human Computer Interaction, so his feedback was helpful in understanding and modeling the usability issues of the testing software. The setup consisted of a laptop placed on a table and a Leap Motion device, and users were asked to sit on a chair in front of it. The Leap Motion device was placed next to the laptop, parallel to the trackpad, so that it provided a natural angle for pointing. Both users performed two types of tasks, finger pointing and stylus pointing. One of the users suggested setting the testing time according to the user’s computing experience, so that a beginner, an intermediate user, or a complete layman would be given more time than an expert computing user. After analysis, a common duration of three minutes was set for every type of user. Only after this training was the user asked to perform the actual task, which lasted 40 s.

3.6 Procedure of User Testing
Two types of tasks were conducted: finger pointing and stylus pointing. A Latin square [9] was used to specify the order in which the tasks were performed, so the order varied from user to user [10].

In finger pointing, the user sat in front of the laptop and used a finger for pointing. A three-minute training session was conducted before the actual test to familiarize the user with the interface and the device. Buttons of different sizes appeared on the screen at random positions, and the user had to click each button using finger pointing. Every user was given the same time span of 40 s to complete the tasks in the actual test.

In stylus pointing, the user sat in front of the laptop and used the stylus for pointing. The three-minute training session, the randomly positioned buttons of different sizes, and the 40 s time span were the same as for finger pointing, except that the user had to click each button using the stylus.

Users were given a short break between tasks. After completing all tasks, users were asked to fill in a questionnaire containing both general and technical questions in order to collect their feedback.
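The counterbalancing described in Sect. 3.6 can be sketched as follows. This is a minimal illustration, not the authors’ procedure: the paper does not say how rows of the Latin square were assigned to users, so the cyclic construction and the assignment by participant index are assumptions, and the function names are hypothetical.

```python
def latin_square(conditions):
    """Build a cyclic Latin square: each condition appears once per row and once per column."""
    n = len(conditions)
    return [[conditions[(row + col) % n] for col in range(n)] for row in range(n)]


def order_for_participant(participant_index, conditions):
    """Assign rows of the square to participants in rotation (an assumed scheme)."""
    square = latin_square(conditions)
    return square[participant_index % len(square)]


TASKS = ["Finger Pointing", "Stylus Pointing"]

# With two tasks the square has two rows, so task order alternates between users.
for participant in range(4):
    print(participant + 1, order_for_participant(participant, TASKS))
```

With only two conditions the square reduces to the two possible orders, so half of the users start with finger pointing and half with stylus pointing, which balances learning and fatigue effects across the two tasks.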
3.7 Participants
Testing was conducted with a total of 66 users: 42 males (Computing: 22, Non-computing: 13, Layman: 7) and 24 females (Computing: 12, Non-computing: 10, Layman: 2). Most of them were university students, whereas some laymen were university workers, including sweepers, a liftman, and a technician. 53 participants belonged to the age group 18–26 years, 8 to the age group 26–40 years, two were below 18 years, and three were above 40 years.
4 User Study
4.1 Results of Questionnaire
[Figure: bar chart of responses (%) from Strongly Agree to Strongly Disagree for Layman, Computing, and Non-Computing users]
Fig. 5. Question 1: Do you think Leap Motion Device was Easy to Use?
The purpose of this question was to check the comfort level of users. Figure 5 illustrates that most users found the Leap Motion device easy to use. Even the laymen found it easy to use.
[Figure: bar chart of responses (%) from Strongly Agree to Strongly Disagree for Layman, Computing, and Non-Computing users]
Fig. 6. Question 2: Were you Feeling Confident while using Leap Motion Device?
This question gauges the confidence of users when working with the Leap Motion device. Many people do not feel confident dealing with new technology until they fully understand how to use it, and would rather stick to familiar conventions. Figure 6 shows that, although the Leap Motion device was new to them, the majority of users felt confident using it.
[Figure: bar chart of responses (%), 0–20%, by difficulty level for Layman, Computing, and Non-Computing users]
Fig. 7. Question 3: What was the Difficulty Level?
The purpose of this question was to evaluate the degree of difficulty faced by the users. The results in Fig. 7 show that in all three groups (laymen, computing users, and non-computing users) the majority faced no difficulty or little difficulty, and the rest found the task tiring.

[Figure: bar chart of responses (%) from Strongly Agree to Strongly Disagree for Layman, Computing, and Non-Computing users]
Fig. 8. Question 4: Did you feel any Tremor Issue while Doing Finger Pointing?
The purpose of this question was to check whether users experienced any tremor while finger pointing, since some people have health conditions that make their hands shake after performing physical tasks. The results in Fig. 8 show that none of the laymen had any tremor issue, but some users from the other two groups did.
[Figure: bar chart of responses (%) from Strongly Agree to Strongly Disagree for Layman, Computing, and Non-Computing users]
Fig. 9. Question 5: Did you find it Easy to Point on Targets using Finger Pointing?
The purpose of this question was to check how readily users accepted finger pointing as mid-air input. Figure 9 illustrates that the majority of users found it easy, although responses varied among computing users. The majority of laymen, i.e. 18.51%, were in favor of it. Likewise, the majority of computing and non-computing users agreed that it was easy to point using finger pointing.
[Figure: bar chart of responses (%) from Strongly Agree to Strongly Disagree for Layman, Computing, and Non-Computing users]
Fig. 10. Question 6: Were you able to hit the Small Target Areas Efficiently using Finger Pointing?
The purpose of this question was to check the efficiency of the Leap Motion device for pointing at small buttons with a finger, since users might find it more difficult to focus on small buttons than on large ones. Figure 10 illustrates that the majority of users were able to hit small buttons efficiently using finger pointing.
[Figure: bar chart of responses (%) from Strongly Agree to Strongly Disagree for Layman, Computing, and Non-Computing users]
Fig. 11. Question 7: Were you able to Hit Large Target Areas Efficiently using Finger Pointing?
The purpose of this question was to check how efficiently users hit large target areas using finger pointing, i.e. how swiftly and easily they could hit them compared with small target areas. The results in Fig. 11 show that large target areas are easier to hit than small ones: almost every user was able to hit the large buttons easily.
[Figure: bar chart of responses (%) from Strongly Agree to Strongly Disagree for Layman, Computing, and Non-Computing users]
Fig. 12. Question 8: Did you Face any Difficulty to Grip the Stylus?
The purpose of this question was to check whether the stylus is easy to grip. Many people do not like using a stylus because of its weight, some have sweaty hands that cause the stylus to slip, and some have thick fingers that make it difficult to grip small, thin objects. Nevertheless, the results in Fig. 12 show that the majority were comfortable holding the stylus, and only a few users had difficulty gripping it.
[Figure: bar chart of responses (%) from Strongly Agree to Strongly Disagree for Layman, Computing, and Non-Computing users]
Fig. 13. Question 9: Did you find it Easy to Point on the Targets using Stylus Pointing?
The purpose of this question was to collect users’ feedback on their overall experience of using a stylus for mid-air pointing compared with finger pointing. The results in Fig. 13 show that the majority of laymen, i.e. 18.51%, showed great acceptance. Likewise, the majority of computing users, i.e. 12.74%, accepted the idea, but 10.14% of users were completely against it.
[Figure: bar chart of responses (%), 0–18%, from Strongly Agree to Strongly Disagree for Layman, Computing, and Non-Computing users]
Fig. 14. Question 10: Were you able to Hit Small Target Areas Efficiently using Stylus Pointing?
The purpose of this question was to check how efficiently users hit small target areas using stylus pointing, i.e. how swiftly and easily they could hit them compared with large target areas. The results in Fig. 14 show that large target areas are easier to hit than small ones. The majority of laymen, i.e. 14.81%, found it easy to hit large target areas. Likewise, the majority of computing users, i.e. 16.66%, found it easy, but non-computing users had issues: the majority of them, i.e. 11.59%, said that it was difficult for them to hit small target areas compared with large ones.
[Figure: bar chart of responses (%) from Strongly Agree to Strongly Disagree for Layman, Computing, and Non-Computing users]
Fig. 15. Question 11: Were you able to Hit the Large Target Areas Efficiently using Stylus Pointing?
The purpose of this question was to check how efficiently users hit large target areas using stylus pointing, i.e. how swiftly and easily they could hit them compared with small target areas. The results in Fig. 15 show that large target areas are easier to hit than small ones. The majority of users from all three groups found it easy to hit large target areas.

[Figure: bar chart of responses (%), 0–16%, from Strongly Agree to Strongly Disagree for Layman, Computing, and Non-Computing users]
Fig. 16. Question 12: Would you Prefer Stylus Pointing over Finger Pointing using Mid-Air Pointing?
The purpose of this question was to check whether users prefer stylus pointing over finger pointing. Figure 16 illustrates that the majority of laymen, i.e. 11.11%, disagreed with this statement and preferred finger pointing over stylus pointing. The majority of computing users, i.e. 14.70%, also preferred finger pointing, and likewise the majority of non-computing users, i.e. 11.59%, preferred finger pointing over stylus pointing.
[Figure: bar chart of responses (%) from Strongly Agree to Strongly Disagree for Layman, Computing, and Non-Computing users]
Fig. 17. Question 13: Mid-Air Pointing can Introduce New Ways to Interact with Systems in an Effective Manner
The purpose of this question was to collect users’ feedback on the scope of mid-air pointing. Figure 17 illustrates that the majority of users from all three groups agreed that mid-air input with the Leap Motion device can be used to interact with systems effectively.

4.2 Results of Testing Software
The results of the testing software provided some interesting insights.
[Figure: per-user chart for Computing Experts; users on the horizontal axis, time from 0 to 40000 ms on the vertical axis; series Finger Pointing and Stylus Pointing]
Fig. 18. Average Execution Time to Click on a Button in Milliseconds
Execution time is the time taken by a user to click on a button, measured in milliseconds. The results in Fig. 18 show that 71.87% of the users performed finger pointing more accurately and the remaining 28.12% performed stylus pointing more accurately.
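Percentages of this kind can be derived from per-user averages as in the sketch below. It assumes that a user “performs better” with the technique that has the lower average time per click; the paper does not state its exact criterion, and the function name and the four sample values are hypothetical, not data from the study.

```python
def share_better(finger_ms, stylus_ms):
    """Percentage of users whose average time is lower with each technique.

    Assumption: a lower average time per click counts as better performance.
    """
    if len(finger_ms) != len(stylus_ms) or not finger_ms:
        raise ValueError("need one finger and one stylus value per user")
    n = len(finger_ms)
    finger_wins = sum(f < s for f, s in zip(finger_ms, stylus_ms))
    stylus_wins = sum(s < f for f, s in zip(finger_ms, stylus_ms))
    return {
        "finger": 100.0 * finger_wins / n,
        "stylus": 100.0 * stylus_wins / n,
        "tie": 100.0 * (n - finger_wins - stylus_wins) / n,
    }


# Hypothetical average times per click (ms) for four users.
finger = [800.0, 950.0, 700.0, 1200.0]
stylus = [900.0, 900.0, 850.0, 1500.0]

print(share_better(finger, stylus))
# {'finger': 75.0, 'stylus': 25.0, 'tie': 0.0}
```

Under the same assumption, a split of 23 users against 9 in a group of 32 would give 71.875% and 28.125%, which is consistent with the figures reported for the computing group.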
[Figure: per-user chart for Computing Experts; users on the horizontal axis, time from 0 to 1000 ms on the vertical axis; series Finger Pointing and Stylus Pointing]
Fig. 19. Average Fitts’ Time to Click on a Button in Milliseconds
Fitts’ time is also the time taken by a user to click on a button, but it is calculated using Fitts’ law. The results in Fig. 19 show that 59.37% of the users performed finger pointing more accurately and the remaining 40.63% performed stylus pointing more accurately.
[Figure: per-user chart for Non-Computing users, User 1 to User 23 on the horizontal axis, time from 0 to 15000 ms on the vertical axis; series Finger Pointing and Stylus Pointing]
Fig. 20. Average Execution Time to Click on a Button in Milliseconds
Execution time is the time taken by a user to click on a button, measured in milliseconds. The results in Fig. 20 show that 73.91% of the users performed finger pointing more accurately and the remaining 26.08% performed stylus pointing more accurately. 17.39% of the total users were not able to perform stylus pointing at all.
[Figure: per-user chart for Non-Computing users, User 1 to User 23 on the horizontal axis, time from 0 to 1000 ms on the vertical axis; series Finger Pointing and Stylus Pointing]
Fig. 21. Average Fitts’ Time to Click on a Button in Milliseconds
Fitts’ time is also the time taken by a user to click on a button, but it is calculated using Fitts’ law. The results in Fig. 21 show that 65.21% of the users performed finger pointing more accurately and the remaining 34.79% performed stylus pointing more accurately. 17.39% of the total users were not able to perform stylus pointing at all.
[Figure: per-user chart for Layman users, User 1 to User 9 on the horizontal axis, time from 0 to 10000 ms on the vertical axis; series Finger Pointing and Stylus Pointing]
Fig. 22. Average Execution Time to Click on a Button in Milliseconds
In the case of laymen, comparing the finger and stylus pointing results in Fig. 22 shows that 55.5% of the users performed better using stylus pointing and 44.5% performed better using finger pointing.
[Figure: per-user chart for Layman users, User 1 to User 9 on the horizontal axis, time from 0 to 1000 ms on the vertical axis; series Finger Pointing and Stylus Pointing]
Fig. 23. Average Fitts’ Time to Click on a Button in Milliseconds
Fitts’ time is also the time taken by a user to click on a button, but it is calculated using Fitts’ law. The results in Fig. 23 show that 55.5% of the users performed better using stylus pointing and 44.5% performed better using finger pointing.
5 Discussion and Analysis
The idea of mid-air pointing was highly welcomed by the participants, and they preferred finger pointing over stylus pointing. According to the statistics calculated by the testing software, stylus pointing did not prove to be better than finger pointing. Users from the computing and non-computing groups performed well with finger pointing: they made more right attempts at clicking buttons with a finger than with the stylus. Laymen, in contrast, performed well with stylus pointing. For computing and non-computing users, the measured speed of clicking a button was higher with finger pointing, whereas for laymen it was higher with stylus pointing. Users found large buttons easy to point at with both finger and stylus pointing. With small buttons, all users found finger pointing easy, but with stylus pointing only users from the computing group did. Although the majority of users performed better with finger pointing than with stylus pointing, stylus pointing can still be considered viable for them. Laymen were really excited about using the Leap Motion device, and although they performed well using touch screen pointing compared with mid-air input, they still preferred mid-air input.
The feedback received from the users shows that they were thrilled to interact with systems using mid-air input, leaving behind the old conventions of mouse, trackpad, and touch screen; almost every user agreed with this question. Some of the workers were so excited that they asked to have a Leap Motion device fitted to the systems where they currently use touch screens to operate the generators and the air-cooling system. They said that it would be easier for them to point at buttons instead of touching them, because at times their hands are not clean and there is a risk of harm to both the system and the workers.
6 Conclusion
This research focused on finding the difference in accuracy between finger and stylus pointing as mid-air input. It started from the hypothesis that stylus pointing is more accurate than finger pointing as mid-air input. Experiments were performed with three groups of people: computing, non-computing, and laymen. Every user went through a training session and a testing session, and each was given equal time for both. The purpose of having three types of users was to evaluate how comfortable users were and how efficiently they performed the tasks with both types of pointing, finger and stylus. The performance of users was evaluated on the basis of their personal feedback and the results obtained from the testing software. The feedback showed which form of pointing users considered preferable, and the results from the testing software showed how swiftly a user performed the task with each type of pointing. After analyzing the results from all groups, it was established that finger pointing is more accurate than stylus pointing and that people performed better using finger pointing. However, the results for laymen are interesting: people who had not had any interaction with computers performed better using stylus pointing than finger pointing as mid-air input.
References
1. Vikra, S., Li, L., Russell, S.: Handwriting and gestures in the air. Recogn. Fly (2013)
2. Holzinger, A., Schedlbauer, M., Urlesberger, B.: An investigation of finger versus stylus input in medical scenarios. In: ITI 2008-30th International Conference on Information Technology Interfaces, pp. 433–438 (2008)
3. Reiten, J.E.: 3D modelling using Leap Motion, Focusing on homogeneous transforms (2014)
4. Weichert, F., Bachmann, D., Rudak, B., Fisseler, D.: Analysis of the accuracy and robustness of the leap motion controller. Sensors 13(5), 6380–6393 (2013)
5. Forlines, C., Balakrishnan, R.: Evaluating tactile feedback and direct vs. indirect stylus input in pointing and crossing selection tasks. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 1563–1572 (2008)
6. Fürntratt, H.: Finger pointing accuracy on leap motion sensor. Interfaces Hum. Comput. Interact. (IHCI), 233–237 (2014)
7. Malik, Z.H.: Usability Evaluation of Ontology Engineering Tools, pp. 567–584. IEEE, USA (2017)
8. Malik, Z.H., Arfan, M.: Evaluation of accuracy: a comparative study between touch screen and midair gesture input. In: Arai, K., Bhatia, R. (eds.) Advances in Information and Communication. FICC 2019. Lecture Notes in Networks and Systems, vol. 69, pp. 448–462. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-12388-8_32
9. MacKenzie, I.: Human-Computer Interaction: An Empirical Research Perspective. Morgan Kaufmann, Elsevier Science & Technology Books, Waltham, MA, USA (2012)
10. Malik, Z.H., Munir, T., Ali, M.: Usability Evaluation of Online Flight Reservation Systems. Lecture Notes in Networks and Systems, vol. 69, pp. 448–462. Springer, USA (2019)
Exploring the Application of Gamification in the Software Development Process
Lutendo Lesley and Ernest Mnkandla
School of Computing, University of South Africa, Johannesburg, South Africa
[email protected], [email protected]
Abstract. The study aims to investigate the use of game elements in software development teams and their impact on the software development process in South African financial institutions. Gamification relates to the use of game design elements in non-game contexts; it has been applied in areas such as e-learning, consumer behavior, and software development to motivate action and improve knowledge and behavior. The software development process involves a set of actions in which humans play a major role in generating software applications. The aspects of software development that involve human activity present challenges relating to the engagement, collaboration, communication, and motivation of the developers. Gamification could be applied to augment the human aspects of the software development process, thereby reducing the challenges connected to human factors. This study applied quantitative research to survey software development team dynamics and behavior at four South African banks. Based on the findings, the gamification effective theory was applied to develop a framework for the understanding and application of game elements in software development in the banking industry in South Africa. The factors that affect the application of gamification in financial institutions have been neglected in the literature. This research therefore addresses that gap and reveals how the use of game elements in software development teams impacts the software development process in the banking industry in South Africa. The findings of this research and the proposed framework could help software development teams in the banking industry to enhance the software development process through gamification.
Keywords: Software Development Process · Gamification · Software Development Teams
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): SAI 2023, LNNS 711, pp. 52–68, 2023. https://doi.org/10.1007/978-3-031-37717-4_5

1 Introduction
The application of gamification in a software development process appears to reduce the rate of project failure and other software challenges for many organisations globally [26]. Therefore, understanding how the use of game elements in software development teams impacts the software development process has become an essential agenda. Several researchers have highlighted the outcomes of using game elements in software development activities. For example, [33] and [7] have noted improved engagement, motivation, and performance; and [34] have noted improved productivity and motivation of a positive behavioural outcome.
Software engineering is a discipline that includes the structured design, production, and maintenance of a software product. Over recent years, software engineering has evolved positively [13]. The last decade has seen the development of new technologies and the improvement of software and hardware, with the focus on improving software processes, including agile development, and on product quality [14]. However, a software development process faces many challenges. For example, the most common reason for the failure of software development projects is rooted in the project management process and the alignment of IT with organisational cultures [18]. As a result, software development teams fail to meet expectations in terms of functionality, quality, cost, and delivery schedule [18]. The failure of software development projects affects the morale of the development team because project delays often result in developers having to endure long hours of unpaid overtime [3]. This, in turn, affects their personal lives and leads to a loss of motivation, engagement, and performance, and ultimately results in high costs and staff turnover.

The application of gamification in software engineering appears to have potential. Gamification uses game design philosophy, game design elements, and game mechanics in non-game settings to encourage specific human behaviour by enhancing the player’s motivation and engagement in an activity. It uses the specific attributes that make real games entertaining and inviting to improve the player’s experience in a non-game environment, for example, a workplace or school [8]. When applied to software development, gamification can provide several benefits.
It encourages project teams and developers to learn about new software or technologies; raises the performance levels of project teams and developers; and influences the standard and quality of work if applied to motivate best practices [11].
The contributions of this paper are:
• To assess the perceptions of software development team members on engagement, motivation, and performance in a software development process.
• To propose and present how game elements can be used in a software development process.
• To link Software Development Teams and gamification elements.
The paper is structured as follows: A brief background to the relevant concepts in Software Development and Gamification is presented in Sect. 2. The literature review and the proposed gamification framework for enhancing the software development process are presented in Sect. 3. The research methodology adopted for the study is presented in Sect. 4 and the results are presented in Sect. 5. The study is discussed in Sect. 6, and lastly, the conclusions and recommendations are presented in Sect. 7.
2 Background The term “gamification” can be defined as “a use of game design elements in non-game contexts” to increase participation and encourage a particular behaviour [8]. Although the term “gamification” became more prevalent around 2008, its use only became
L. Lesley and E. Mnkandla
widespread in 2010 with increased interest in gamification [4]. In recent years, gamification has gained great attention in academia and commercial applications [4]. An example of gamification is StackOverflow (stackoverflow.com), a developer question-and-answer website whereby users receive points for executing various actions via Twitter and Facebook [1]. Owing to its effectiveness, StackOverflow has since contributed to the widespread use of gamification in other domains. An examination of theories on human-computer interaction (HCI) found that gamification was originally used to increase retention and user activity in online marketing, education, and mobile application software [8, 42]. Gamification has been found to work effectively in these areas, albeit with challenges [15, 27]. The application of gamification in software development has attracted a lot of attention from organisations and researchers, especially due to its potential to increase engagement and motivation in software development teams. Games provide advantages that could be useful in a software development process. In the past, games were used mainly for entertainment purposes [9]. However, according to [22], they have been found to be useful in solving real-world problems, especially in environments where it is difficult to improve people’s engagement and motivation and to influence their behaviour, resulting in improved productivity and quality outcomes that add value. This is especially true in a world where people dedicate a significant amount of time to video games [11, 29]. This study focused on the application of gamification in software development teams and its effectiveness in tackling software project challenges in the software development process. As a developing field, gamification is expected to contribute to addressing human factors (i.e., engagement, human involvement, motivation, collaboration), as well as software development process challenges [23].
Furthermore, a few software development process tools have begun integrating game elements to profit from gamification standards [23]. Some examples of commonly known commercial tools with gamification mechanisms are JIRA, Hero, ScrumKnowsy, and MasterBranch, which are adopted in software development teams [23]. Although practitioners and researchers have found that game elements can be applied to software development teams, how gamification should be applied is not apparent [23]. For this reason, further research on the application of gamification in a software development process is required.
3 Literature Review
3.1 Gamification Effective Theory
This research aimed to establish whether game elements enhance a software development process. The paper proposes a framework to understand the effectiveness of gamification. The gamification effective theory (GET) model is one of the approaches developed to evaluate the impact of game elements within organisations. The model defines the effectiveness of gamification as the extent to which a gamified system is used and contributes to the specific goals of the system and its users [1]. The effectiveness is displayed in three dimensions of formative constructs and is influenced by antecedents, which are factors that cause effectiveness, as shown in Fig. 1.
Fig. 1. Gamification Effectiveness Theory (Amir & Ralph 2014)
A software development process is considered a challenging task, conducted by individuals in a project team, and is not easily mastered. Social variables significantly affect how users interact in a project team [11] and how individuals conduct themselves when performing a task. In addition, the rate of project failure is very high. Limited theories have been used to explore the various dimensions of gamification, and many of its theoretical foundations have not yet been well defined. [8] link the Self-Determination Theory (SDT) to the theoretical foundation for gamification as a whole [31]. SDT is a macro theory of human motivation that is used to understand an individual’s behaviour. It helps to explain how and why human behaviour is initiated and regulated by discussing the environmental and social conditions that could affect engagement in activities [7]. However, SDT proposes a different approach to motivation: it distinguishes between two types of motivation, intrinsic and extrinsic, and their disadvantages. Gamification research leans more toward intrinsic motivation than extrinsic motivation because intrinsic motivation comes from the user’s own motive, whereas extrinsic motivation is imposed by rewards. Along the same lines, according to [5], Flow theory emphasises the internal state of full participation in an activity, also known as the flow experience, and explains why people perform certain activities. The concept of flow is also called an internal experience [5]. This means that people do something for its own sake. People experience flow when the activity matches their skills [5]. This theory was not suitable for this study because it focuses more on internal experience than on a project team’s goal, unlike GET, which was found fitting for this research.
Based on the focus of the study on game elements within a development team, the GET developed by [1] was used to structure the study’s objective. As shown in Fig. 1 above, the GET consists of the following key drivers of effectiveness: extrinsic motivation, intrinsic motivation, game mechanics, and immersive dynamics [1].
Although few empirical studies have used GET to explain gamification in software development teams, GET is becoming a well-established theory for predicting motivation, engagement, and performance in project teams [12]. GET is the most relevant theory for this research because the main aim of this study is to establish how the use of game elements in software development teams impacts the software development process. Unlike the other theories reviewed in this section, GET comprehensively covers all the game elements.
3.2 The Concept of Gamification
According to [4], the past decade has witnessed the transformation brought about by games, which have reshaped communication with the assistance of social media to encourage competition and cooperation. Serious games are utilised for game-based social skills training that assists individuals to improve their social responsibility by creating exciting and engaging settings [37]. Apart from this, [39] highlighted emerging trends that are increasing the popularity of gamification among practitioners and researchers, who have redefined the concept of game elements in non-gaming contexts. Consequently, gamification has become an emerging topic for enhancing software development. Gamification has been demonstrated to align individuals’ motivation with software development activities and to assist in pointing out various information technology-related issues [38].
3.3 Game Elements
After researching a hundred gamification implementation projects, [40] discovered that the most used game elements are points, badges, and leader boards, which are often referred to as PBLs. PBLs are so common that they are often equated with gamification itself [40]. Be that as it may, PBLs are not the only elements that form gamification [40]. Other game elements are characterised by dynamics, mechanics, and components.
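To make the points, badges, and leader boards (PBL) triad concrete, the following is a minimal sketch of how a project team might track the three elements together. The action names, point values, badge names, and thresholds are all hypothetical and chosen only for illustration; they are not taken from [40] or from the tools discussed in this paper.

```python
from collections import defaultdict

# Hypothetical point values and badge thresholds; none of these come from the paper.
POINTS_PER_ACTION = {"code_review": 5, "bug_fixed": 10, "story_completed": 20}
BADGE_THRESHOLDS = {"Contributor": 20, "Champion": 50}


class Leaderboard:
    """Tracks points, badges, and ranking for members of a project team."""

    def __init__(self):
        self.points = defaultdict(int)

    def record(self, member, action):
        """Award the points defined for an action to a team member."""
        self.points[member] += POINTS_PER_ACTION[action]

    def badges(self, member):
        """Return the badges whose point threshold the member has reached."""
        return [name for name, needed in BADGE_THRESHOLDS.items()
                if self.points[member] >= needed]

    def ranking(self):
        """Return (member, points) pairs, highest score first."""
        return sorted(self.points.items(), key=lambda item: item[1], reverse=True)


board = Leaderboard()
board.record("analyst_a", "story_completed")
board.record("analyst_a", "bug_fixed")
board.record("developer_b", "code_review")
print(board.ranking())
print(board.badges("analyst_a"))
```

In this sketch the points are the raw mechanic, badges are derived from point thresholds, and the leader board is simply a sorted view of the same data, which is one reason the three elements are so often adopted together.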
3.4 Connecting Gamification to Motivation and Engagement of Project Teams
Engagement and motivation are often distinguished in occurrence [17]. Intrinsic motivation and prior attitudes about software development increase participation and task engagement [38]. Participation is known to work in the opposite direction, altering previous negative attitudes [17, 38]. As far as [6] are concerned, high task engagement and strong motivation enable successful project delivery. Engagement as an evident positive behaviour is motivated by prior attitude [30]. Gamification is usually applied to increase employee motivation. [30] have stated that when individuals are motivated, they are energised and behave in a particular way when performing a task. A few key concepts are pertinent when considering gamification. As mentioned above, gamification involves using game elements in a non-game context to improve the motivation and engagement of users. The game elements to be chosen should depend on the project team’s goal regarding the kind of behaviour they desire to motivate. [20] point out three parts of motivation that a player experiences, namely: ‘the cognitive area’, ‘the emotional area’, and ‘the social area’. Therefore, gamification in
the software development context should focus on these three areas, which are discussed in more detail as follows:
The Cognitive Area. This area of motivation requires a player to learn and understand how things work [2]. The game makes use of cycles which teach the player the rules of behaviour. The mechanisms used for this are storytelling, hierarchical and structural tasks, and visual representations. This involves presenting facts to a project team with reasons why something is relevant in a software development context. A developer’s code lends itself to game elements. However, writing (or applying) and understanding code is a challenge [4].
The Emotional Area. This area of motivation refers to the theory of rewarding wanted behaviour and penalising unwanted behaviour [2]. When an emotional experience is created, it aims to involve users in the failure or success of completing a task correctly or incorrectly. The game elements involved are badges, player penalties, trophies, levels, and reward systems [2]. The challenge faced here is balancing the difficulty. An individual must be happy to complete a challenge; however, the challenge must not be so difficult that users lose motivation or are unable to try again. Preferably, the difficulty of the task and the number of rewards the players receive should be adjusted to a player’s skill level [20]. A characteristic of the game concept is that it is considered to possess a low risk of failure. Misunderstandings are part of daily life; however, understanding how a game works and how to perform better is important. Business stakeholders or end-users expect results in a software development context. In these instances, failure is not accepted as a part of learning; however, gamification plays a role [20].
For example, developers develop code in a production test environment and ensure the business requirements are met before business stakeholders or users test the results. Traditionally, the developers can ensure functionality has been completed and view the final requirement in a live platform. Interactive updates can be made should the need arise at any given time; however, adding the element of trial and error is a characteristic of games, and [10] further indicate that it makes for better and more engaged project teams.
The Social Area. This area of motivation relates to communicating and comparing one’s progress with others [2]. When playing, a player assumes specific roles in this area. The roles allow players to behave differently from their usual day-to-day behaviour [20]. This can be seen as part of video games. In a non-game context, individuals take on roles that apply to the situation, for example, academic leader and caretaker [2]. Furthermore, [2] explains that when people perform tasks motivated by external factors, they get more involved because they see it as part of their identity. The game elements are avatars, leader boards, customisation, and communication features [8, 10].
As attested by [28], an essential distinction in gamification is the difference between intrinsic and extrinsic motivation. Intrinsic motivation refers to doing something that one enjoys or finds fascinating. In contrast, extrinsic motivation refers to doing something with an expected external outcome unrelated to the task, such as points, rewards, or promotions [28]. Extrinsic motivators are only effective if a person is present. For example, when an individual receives a promotion, some no longer put in extra hours. Once a raise has been earned, employee standards adjust to a new normal, and their motivation decreases [28].
On the other hand, intrinsic motivators are believed to be more challenging. However, in most scenarios, such as job satisfaction rather than a higher salary, a person is motivated in their job when rewarded for performing well [28]. When project teams intend to gamify a development process, examining the type of motivation to be applied, and how it will affect the users, is essential. The three areas (cognitive, emotional, and social) mentioned above are the trigger and basis for a player’s motivation. However, [28] recognise the challenge of separating the areas because of their close relationship and interaction, since game mechanics usually cover more than one area simultaneously. For instance, the awards a player accumulates contribute to a new set of skills, increasing the complexity and difficulty of the games. Therefore, the cognitive and emotional areas are affected in the process. Similarly, the social area is continuously linked to the cognitive area. For example, reward systems impact the player’s social status when a task is required to be accomplished through a player’s interaction. Based on the discussion above, the study hypothesises that:
H01: There is no relationship between engagement, motivation, and performance in a software development process.
H1: There is a relationship between engagement, motivation, and performance in a software development process.
Sub-Hypotheses
H02: There is no relationship between engagement and performance in a software development process.
H2: There is a relationship between engagement and performance in a software development process.
H03: There is no relationship between motivation and performance in a software development process.
H3: There is a relationship between motivation and performance in a software development process.
4 Methodology
4.1 Research Design
A quantitative approach was adopted to carry out the research. A questionnaire was developed using a survey method and distributed to project teams in financial institutions. Surveys allow the researcher to collect large amounts of data and ensure that the data is quantifiable and that respondent bias is eliminated [32]. Hence, the study adopted the deductive approach as the most suitable method for this research [24].
4.2 Population and Sampling
A total of 95 respondents completed the questionnaire. According to [4], the sample size is based on the heterogeneity of a population. A stratified sampling technique was used to collect data from various roles in different project teams in South African financial institutions.
4.3 The Instrument Model
The instrument was tested through descriptive statistics and a gameplay scale [23]. The 5-point Likert gameplay scale consisted of questions with five choices ranging from “strongly disagree” to “strongly agree” [36]. In this study, the statements were designed in the form of a questionnaire survey. The statistics are displayed per construct for each instrument in Table 1.

Table 1. Statistics per Instrument

| Construct | Item | N | Minimum | Maximum | Mean | Std. Deviation |
|---|---|---|---|---|---|---|
| Engagement | E1: When game elements are applied in the software development process/project, I feel full energy | 95 | 1 | 5 | 3.73 | .950 |
| Engagement | E2: Time flies on the project while making use of game elements | 95 | 1 | 5 | 3.58 | .894 |
| Engagement | E3: I am enthusiastic about working on projects | 95 | 1 | 5 | 4.37 | .851 |
| Engagement | E4: I feel optimistic when I am working within the project team | 95 | 1 | 5 | 4.20 | .906 |
| Engagement | E5: I engage better in a team when game elements are applied | 95 | 1 | 5 | 3.45 | .796 |
| Motivation | M1: While applying game elements, I feel satisfied in achieving my work goals | 95 | 1 | 5 | 3.73 | .791 |
| Motivation | M2: While applying game elements, I enjoy communicating my ideas to my team | 95 | 1 | 5 | 3.80 | .807 |
| Motivation | M3: Applying game elements will help me improve and deliver better | 95 | 1 | 5 | 3.60 | .868 |
| Motivation | M4: Applying game elements will assist me to communicate and contribute more efficiently and effectively to the project team | 95 | 1 | 5 | 3.60 | .843 |
| Motivation | M5: While applying game elements, I feel motivated to take on new tasks | 95 | 1 | 5 | 3.69 | .876 |
| Performance | P1: When game elements are applied in the software development process, my performance has improved | 95 | 1 | 5 | 3.56 | .834 |
| Performance | P2: When game elements are applied in the software development process, my communication has improved | 95 | 1 | 5 | 3.58 | .894 |
| Performance | P3: I am enthusiastic about working on projects | 95 | 1 | 5 | 4.20 | .807 |
| Performance | P4: I feel optimistic when I am working within the project team | 95 | 1 | 5 | 4.16 | .854 |
| Performance | P5: While applying game elements, I feel satisfied in achieving my work goals | 95 | 1 | 5 | 3.61 | .854 |
| | Valid N (listwise) | 95 | | | | |
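Table 1 reports statistics per item only. As a rough reading aid, the sketch below averages the reported item means within each construct. This aggregation is an assumption made here for illustration, not a step described in the paper; it is meaningful only because every item has the same N (95), so the mean of the item means equals the mean of the per-respondent construct averages.

```python
from statistics import mean

# Item means copied from Table 1 (5-point Likert scale, N = 95 for every item).
ITEM_MEANS = {
    "Engagement": [3.73, 3.58, 4.37, 4.20, 3.45],
    "Motivation": [3.73, 3.80, 3.60, 3.60, 3.69],
    "Performance": [3.56, 3.58, 4.20, 4.16, 3.61],
}

# Average the item means within each construct (illustrative aggregation only).
construct_means = {
    construct: round(mean(values), 3)
    for construct, values in ITEM_MEANS.items()
}

for construct, value in construct_means.items():
    print(f"{construct}: {value}")
```

All three construct-level averages fall between 3.6 and 3.9, that is, between the “neutral” and “agree” points of the scale.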
4.4 Data Analysis
To test the six hypotheses developed, the Statistical Package for the Social Sciences (SPSS) version 22.0 was used for data analysis. The SPSS system is suitable for this study due to its intuitive user interface. In addition, SPSS enables an analysis of large datasets.
5 Results
5.1 Demographic Profile
The results of the study are presented and interpreted in this section. Table 2 displays the profile of respondents according to gender, age, financial institution, job role, and working experience. The sample consisted of 95 respondents, of whom 54 (56.8%) were male and 41 (43.2%) were female. Of the 95 respondents surveyed, 1 (1%) was in the age group below 25 years, 19 (20%) were in the 26–30 years age group, and 57 (60%) were between the ages of 31 and 45 years. Those above 45 years comprised 4 (4%) of the study sample, while those in the other category comprised 14 (15%). The largest group of respondents, 32 (33.7%), were staff members of financial institution 3, and the smallest, 13 (13.7%), were affiliated with financial institution 2. A majority, 58 (61%), reported their job role as business analyst, and only 7 (7%) were system analysts. Lastly, in terms of working experience in IT, most respondents, 39 (41%), had more than 10 years of working experience, and none (0%) had less than 1 year.
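Table 2. Demographic Profile

| Characteristic | Category | Frequency | Percentage (%), n = 95 |
|---|---|---|---|
| Gender | Male | 54 | 56.8% |
| Gender | Female | 41 | 43.2% |
| Age | Below 25 Years | 1 | 1% |
| Age | 26 – 30 Years | 19 | 20% |
| Age | 31 – 45 Years | 57 | 60% |
| Age | Above 45 Years | 4 | 4% |
| Age | Other | 14 | 15% |
| Financial Institution | Financial Institution 1 | 22 | 23.2% |
| Financial Institution | Financial Institution 2 | 13 | 13.7% |
| Financial Institution | Financial Institution 3 | 32 | 33.7% |
| Financial Institution | Financial Institution 4 | 14 | 14.7% |
| Financial Institution | Other | 14 | 14.7% |
| Job Role | Developers | 9 | 9% |
| Job Role | Business Analyst | 58 | 61% |
| Job Role | Test Analyst | 10 | 11% |
| Job Role | System Analyst | 7 | 7% |
| Job Role | Project Manager | 11 | 12% |
| Working Experience in IT | Less than 1 Year | 0 | 0% |
| Working Experience in IT | 1 – 5 Years | 19 | 20% |
| Working Experience in IT | 6 – 10 Years | 37 | 39% |
| Working Experience in IT | More than 10 Years | 39 | 41% |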
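The percentages in Table 2 are each category frequency divided by the sample size of 95. The short sketch below recomputes them for two characteristics; the helper name `percentages` is introduced here for illustration only. Small differences from the table (for example 9.5% versus the reported 9% for developers) reflect rounding in the published figures.

```python
def percentages(frequencies):
    """Convert a mapping of category -> frequency into percentages of the total."""
    total = sum(frequencies.values())
    return {category: round(100 * count / total, 1)
            for category, count in frequencies.items()}


# Frequencies copied from Table 2.
GENDER = {"Male": 54, "Female": 41}
JOB_ROLE = {
    "Developers": 9,
    "Business Analyst": 58,
    "Test Analyst": 10,
    "System Analyst": 7,
    "Project Manager": 11,
}

gender_pct = percentages(GENDER)
job_role_pct = percentages(JOB_ROLE)
print(gender_pct)
print(job_role_pct)
```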
5.2 Summary of Exploratory and Reliability Analysis Results
Exploratory factor analysis (EFA) was used to determine construct validity. Bartlett’s test of sphericity and the Kaiser-Meyer-Olkin (KMO) measure are essential before performing exploratory factor analysis, as they confirm whether it is appropriate to proceed [16]. Bartlett’s test of sphericity is performed to assess the adequacy of the correlations between the variables [25]. The KMO measure assesses whether the sample is adequate for factor analysis [25]. The KMO value was .903, above the recommended value of 0.6 [35]. Bartlett’s test of sphericity was statistically significant (p < 0.05). As presented in Table 3, using Kaiser’s criterion, three factors were extracted, explaining a total variance of 76.084%, above the threshold value of 60% [41]. Factor one was labelled engagement and explained 57.455% of the total variance. Factor two was labelled motivation and explained 13.405% of the total variance. Factor three, performance, brought the cumulative variance explained to 76.084%. Cronbach’s alpha was used to measure the internal consistency of the variables. The coefficients were 0.828 for engagement, 0.921 for motivation, and 0.871 for performance. Each is above the recommended threshold of 0.6 for exploratory analysis and is therefore deemed acceptable.

Table 3. Summary of exploratory factor & reliability analysis results

| Construct | Item | Rotated factor loading |
|---|---|---|
| Engagement | E1: When game elements are applied in the software development process/project, I feel full energy | .780 |
| Engagement | E2: Time flies on the project while making use of game elements | .764 |
| Engagement | E3: I am enthusiastic about working on projects | .852 |
| Engagement | E4: I feel optimistic when I am working within the project team | .851 |
| Motivation | M1: While applying game elements, I feel satisfied in achieving my work goals | .595 |
| Motivation | M2: While applying game elements, I enjoy communicating my ideas to my team | .663 |
| Motivation | M3: Applying game elements will help me improve and deliver better | .877 |
| Motivation | M4: Applying game elements will assist me to communicate and contribute more efficiently and effectively to the project team | .857 |
| Motivation | M5: While applying game elements, I feel motivated to take on new tasks | .808 |
| Performance | P1: When game elements are applied in the software development process, my performance has improved | .681 |
| Performance | P2: When game elements are applied in the software development process, my communication has improved | .634 |
| Performance | P3: I am enthusiastic about working on projects | .821 |
| Performance | P4: I feel optimistic when I am working within the project team | .847 |
| Performance | P5: While applying game elements, I feel satisfied in achieving my work goals | .777 |

| Statistic | Engagement | Motivation | Performance |
|---|---|---|---|
| Eigenvalue | 8.618 | 2.011 | .784 |
| % of variance | 57.455 | 13.405 | 76.084 (cumulative) |
| Cronbach’s Alpha | .828 | .921 | .871 |

Kaiser-Meyer-Olkin Measure of Sampling Adequacy: .903
Bartlett’s test of sphericity: p < 0.05

Table 4. Regression Analysis

Dependent variable: Performance

| Independent variable | Beta | t | p-level |
|---|---|---|---|
| Engagement | .468 | 6.092 | .000 |
| Motivation | .445 | 6.242 | .000 |

| Model statistic | Value |
|---|---|
| R Square | .743 |
| Adjusted R Square | .737 |
| F statistic | 132.959 |
| p-level | .000 |
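The variance percentages in Table 3 can be related to the eigenvalues directly: in a factor analysis of standardised items, each factor’s share of the total variance is its eigenvalue divided by the number of items. The sketch below assumes all 15 questionnaire items (E1–E5, M1–M5, P1–P5) entered the analysis; that item count is an assumption, chosen because it reproduces the reported percentages to within rounding.

```python
# Eigenvalues copied from Table 3; the item count of 15 is an assumption.
N_ITEMS = 15
EIGENVALUES = [8.618, 2.011, 0.784]

# Share of total variance per factor, in percent.
pct_variance = [100 * ev / N_ITEMS for ev in EIGENVALUES]

# Running total of variance explained as factors are added.
cumulative = []
running_total = 0.0
for pct in pct_variance:
    running_total += pct
    cumulative.append(running_total)

# Kaiser's criterion retains factors whose eigenvalue exceeds 1.
retained_by_kaiser = [ev > 1 for ev in EIGENVALUES]

for ev, pct, cum in zip(EIGENVALUES, pct_variance, cumulative):
    print(f"eigenvalue={ev:.3f}  variance={pct:.2f}%  cumulative={cum:.2f}%")
print(retained_by_kaiser)
```

Under this assumption the third factor accounts for roughly 5.2% of the variance on its own, and the 76.084 reported in Table 3 corresponds to the cumulative total. The third eigenvalue (.784) is below 1, so a strict eigenvalue-greater-than-one rule would retain two factors; the three-factor solution reported above may rest on additional considerations not described in the text.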
Regression analysis was used to test the relationships, that is, to predict the score on one variable from the scores on the others [19]. The predicted variable in this study was Performance, and the predictor variables were Motivation and Engagement. The coefficient of determination (R²) for Performance was found to be 0.743, which indicates that the independent variables of engagement and motivation explained 74.3% of the variance in performance. The results are shown in Table 4.
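The model statistics in Table 4 are linked by standard formulas, so the adjusted R² and the F statistic can be checked from R², the sample size (n = 95), and the number of predictors (k = 2). The sketch below uses the textbook definitions; the function names are introduced here for illustration and are not SPSS output.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R-squared for a model with k predictors fitted on n cases."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def f_statistic(r2, n, k):
    """Overall F statistic of the regression, with (k, n - k - 1) degrees of freedom."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


# Values taken from the paper: R Square from Table 4, n from Sect. 4.2,
# k = 2 predictors (Engagement and Motivation).
R2, N, K = 0.743, 95, 2

adj_r2 = adjusted_r_squared(R2, N, K)
f_value = f_statistic(R2, N, K)
print(f"Adjusted R Square: {adj_r2:.3f}")
print(f"F statistic: {f_value:.1f}")
```

Both values agree with Table 4 (.737 and 132.959) up to the rounding of the reported R², which suggests the table is internally consistent.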
6 Discussion
Theoretically, the study’s findings contribute to the knowledge in technology, gamification technology, and software development, especially in financial institutions. As a quantitative study, it addressed the perceptions of project team members in financial institutions on gamification in a software development process. The study has shown that the use of GET is not applicable to every circumstance and context. Although numerous studies reviewed in the literature have used the Self-Determination Theory (SDT) to understand the application of gamification, they yielded different outcomes. For example, some studies (e.g., [8]) used SDT constructs; however, not all constructs are relevant to all studies. The results indicated that GET is applicable in the gamification environment to aid the understanding of gamification attributes of software development that affect adoption. Not all the variables were congruent with the respondents’ needs, nor did they all yield significant results; this leaves room for understanding adoption better in context. For example, the constructs of motivation and engagement were highlighted as determinants of the impact of gamification on software development. The study found that, although motivation and engagement were high, they do not necessarily translate into usage. To academia, this research would present a source of academic reference for future studies. To policymakers and the regulatory environment, it serves as a platform for considering the development of policies that favour gamification and bring underperforming or struggling project teams into the realm of exploring gamification for their institutions. This research indicates that the motivation and engagement constructs, as independent variables, together explain about 74% of the variance in performance as a dependent variable.
7 Conclusion
Gamification has changed how project teams communicate, engage, and do business. Therefore, game elements will likely be used by project teams and industries worldwide in their software development process. This research study examined several factors related to the use and application of gamification amongst project teams in financial institutions in South Africa. Considering the factors that influenced the increased use of gamification in a software development process, it was essential to understand the issues related to gamification and its application, mainly in a financial institution context. The literature reviewed showed that although the uptake of game elements in project teams is moderate in financial institutions, the use of game elements is
mostly low. The constructs determined for this study were primarily based on the Gamification Effective Theory (GET). This study explains the factors that impacted the use of game elements amongst project teams in financial institutions. Therefore, it contributes to knowledge in technology, gamification technology, and software development. Furthermore, the study emphasises the software development context of gamification. It examines the factors that impact gamification applications that are important for improving the software development process. Numerous existing studies are aimed at understanding the application of gamification; however, they are limited in the South African context, especially for financial institutions. Most studies tend to focus on online marketing [21]. To this end, this study sought to address this research gap by investigating the application of gamification in software development teams in South Africa’s financial institution context. Furthermore, considering that no studies have been reported on the factors that impact the application of gamification in a software development process, the findings of this study could pave the way for future research in a software development context. Additionally, project teams in financial institutions have high project delays and failures, further contributing to decreased motivation, engagement, and performance. Since gamification is now very popular in the industry, it was important to establish whether project teams use gamification and to determine the impact on the software development process.
7.1 Limitation of the Study
The study’s main limitation is that it is limited to commercial banks in South Africa. Other financial institutions such as insurance companies were excluded from the study. Future research would need to be carried out on insurance companies since they form part of the financial services sector in South Africa.
Furthermore, the study was conducted within a specific developing country (i.e., South Africa); therefore, the study cannot be generalised to all developing countries. This is the study’s second limitation.
References
1. Vil’Anilam, J.V.: India. In: Watson, T. (ed.) Asian Perspectives on the Development of Public Relations: Other Voices. NPDPR, pp. 34–47. Palgrave Macmillan UK, London (2014). https://doi.org/10.1057/9781137398154_4
2. Buisman, A.L.D., van Eekelen, M.C.J.D.: Gamification in educational software development. In: Proceedings of the Computer Science Education Research Conference, pp. 9–20 (2014)
3. Butler, S., Ahmed, D.T.: Gamification to engage and motivate students to achieve computer science learning goals. In: 2016 International Conference on Computational Science and Computational Intelligence (CSCI), pp. 237–240. Las Vegas, USA. IEEE (2016)
4. Cohen, A.M.: The gamification of education. Futurist 45(5), 16–17 (2011)
5. Csikszentmihalyi, M., LeFevre, J.: Optimal experience in work and leisure. J. Pers. Soc. Psychol. 56(5), 815–822 (1989)
6. Davis, M.H., McPartland, J.M.: High school reform and student engagement. In: Christenson, S., Reschly, A., Wylie, C. (eds.) Handbook of Research on Student Engagement, pp. 515–539. Springer, Boston, MA. US (2012). https://doi.org/10.1007/978-1-4614-2018-7_25
7. Dawud, Y., Nokolic, S.: Title Impact of Gamification on Social Network Platforms [seminar]. Course BUSN39 Business Administration: Degree Project in Global Marketing. 2020 4 June. Lund University (2020)
8. Deterding, S., Dixon, D., Khaled, R., Nacke, L.E.: Gamification: towards a definition. In: CHI 2011 Gamification Workshop Proceedings, vol. 12, p. 15 (2011)
9. Dhawale, S.A., Dubey, K.: Quality improvisation in game design techniques through game development life cycle (GDLC) model. In: National Conference on Computational Neuroscience, March, 1–8 (2011)
10. Domínguez, A., Saenz-De-Navarrete, J., De-Marcos, L., Fernández-Sanz, L.C., Martínez-Herráiz, J.J.: Gamifying learning experiences: practical implications and outcomes. Comput. Educ. 63, 380–392 (2013)
11. Dubois, D.J., Tamburrelli, G.: Understanding gamification mechanisms for software development. In: Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, pp. 659–62 (2013)
12. Fulton, J.N., Howard, W.: Running Head: Theory of Gamification-Motivation [dissertation]. William Howard Taft University, Colorado (2019)
13. García, F., Pedreira, O., Piattini, M., Cerdeira-pena, A., Penabad, M.: A framework for gamification in software engineering. J. Syst. Softw. 132, 21–40 (2017)
14. Gordieiev, O., Kharchenko, V., Fominykh, N., Sklyar, V.: Evolution of software quality models in context of the standard ISO 25010. In: Zamojski, W., Mazurkiewicz, J., Sugier, J., Walkowiak, T., Kacprzyk, J. (eds.) Proceedings of the Ninth International Conference on Dependability and Complex Systems DepCoS-RELCOMEX. June 30 – July 4, 2014, Brunów, Poland. AISC, vol. 286, pp. 223–232. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-07013-1_21
15. Hamari, J., Sarsa, H., Koivisto, J.: Does gamification work? A literature review of empirical studies on gamification. In: 2014, 47th Hawaii International Conference on System Sciences, 3025–3034 (2014)
16. Hinton, P., McMurray, I., Brownlow, C.: SPSS Explained, 1st edn. Routledge, London (2004)
17. Ismail, J., Musa, A., Shawalludin, S., Harniza, H.E., Ahmad, K., Yatim, S.N.M.: Committee Page Voice of Academia Academic Series of Universiti Teknologi MARA Kedah (2020)
18. Kaur, R., Sengupta, J.: Software process models and analysis on failure of software development projects. Int. J. Sci. Eng. Res. 2, 2–3 (2011)
19. Lane, D.: The problem of too many statistical tests: subgroup analyses in a study comparing the effectiveness of online and live lectures. Numeracy 6(1) (2013)
20. Lee, J.J., Hammer, J.: Gamification in education: what, how, why bother? Acad. Exchange Q. 15(2), 146 (2011)
21. Lucassen, G., Jansen, S.: Gamification in consumer marketing - future or fallacy? Procedia. Soc. Behav. Sci. 148, 194–202 (2014)
22. Mcgonigal, J.: Reality Is Broken: Why Games Make Us Better and How They Can Change the World. The Penguin Press, New York (2011)
23. Olgun, S., Yilmaz, M., Clarke, P., Connor, R.V.O.: A systematic investigation into the use of game elements in the context of software business landscapes: a systematic literature review. In: Software Process Improvement and Capability Determination: 17th International Conference (SPICE), pp. 384–398. Palma de Mallorca, Spain (2017)
24. Ormerod, R.: The history and ideas of pragmatism. J. Oper. Res. Soc. 57(8), 892–909 (2006)
25. Pallant, J.: SPSS Survival Manual, 7th edn., pp. 1–378. Routledge, London (2020)
26. Platonova, V., Berzisa, S.: Gamification in software development projects. Inf. Technol. Manag. Sci. 20, 58–63 (2017)
27. Pedreira, O., García, F., Brisaboa, N., Piattini, M.: Gamification in software engineering – a systematic mapping. Inf. Softw. Technol. 57, 157–168 (2015)
68
L. Lesley and E. Mnkandla
28. Perryer, C., Celestine, N.A., Scott-Ladd, B., Leighton, C.: Enhancing workplace motivation through gamification: transferrable lessons from pedagogy. Int. J. Manage. Educ. 14(3), 327– 335 (2016) 29. Procaccino, J.D., Verner, J.M., Shelfer, K.M., Gefen, D.: What do software practitioners really think about project success : an exploratory study. J. Syst. Softw. 78, 194–203 (2005) 30. Ryan, R.M., Deci, E.L.: Self-determination theory and the facilitation of intrinsic motivation, social development, and well-being. Am. Psychol. 55(1), 68–78 (2000) 31. Ryan, R.M., Rigby, C.S., Przybylski, A.: The motivational pull of video games: a selfdetermination theory approach. Motiv. Emot. 30, 347–363 (2006) 32. Sekaran, U., Bougie, R.: Research Methods for Business: a Skill-Building Approach, 6th edn. New York, Wiley (2013) 33. Sereno, M.: Effects of gamification in the banking industry: A comparison analysis between gamified and non-gamified training [thesis]. Southern Institute of Technology (2021) 34. Shavab, O.A.K., Yulifar, L., Supriatna, N., Mulyana, A.: Gamification in history learning: a literature review. In: 6th International Conference on Education & Social Sciences (ICESS). Atlantis Press (2021) 35. Tabachnick, B.G., Fidell, L.S.: Using Multivariate Statistics, 6th edn. Pearson Education, Boston (2013) 36. Toprac, P.: Motivation by design: using digital-game based learning techniques to create an interesting problem-based learning environment. In: Handbook of Research on Improving Learning and Motivation through Educational Games : Multidisciplinary Approaches, vol. 1, pp. 283–309 (2011) 37. Üsfekes, Ç.: Towards an Auction-Based Reward Mechanism for Effective Bug Resolution [thesis]. Çankaya University, Turkey (2019) 38. Üsfekes, Ç., Yilmaz, M., Tuzun, E., Clarke, P.M., O’Connor, R.: Examining reward mechanisms for effective usage of application lifecycle management tools. In: Stolfa, J., Stolfa, S., O’Connor, R., Messnarz, R. eds. 
Communications in Computer and Information Science, vol. 748, pp. 259–268, Springer Verlag (2017). https://doi.org/10.1007/978-3-319-64218-5_21 39. Wang, W., Huo, S.: One belt one road—new game, established rules. Open J. Polit. Sci. 9(03), 582–597 (2019) 40. Werbach, K., Hunter, D.: For the Win: How Game Thinking Can Revolutionize Your Business. Wharton Digital Press, University of Pennsylvania (2012) 41. Wood, L.C., Reiners, T.: Gamification. In: Encyclopedia of Information Science and Technology. 3rd edn, pp. 3039–3047. IGI Global (2015) 42. Zichermann, G., Cunningham, C.: Gamification by Design : Implementing Game Mechanics in Web and Mobile Apps, 1st edn., pp. 1–182. O’Reilly Media, Inc., Sebastopol, California (2011)
FCA-SAPO: A New Comprehensive Fog Computing Adoption Model for Saudi Arabian Public Organisations

Mohammed Alyami1,2(B), Natalia Beloff1, and Martin White1

1 Department of Informatics, University of Sussex, Brighton, UK
[email protected]
2 Department of Information Technology and Security, Jazan University, Jazan, Saudi Arabia
Abstract. Fog Computing has emerged as an increasingly relevant paradigm in developing countries because of its comprehensive features. Nevertheless, very little research has been conducted on the reasons for, potential benefits of, and challenges associated with adopting Fog Computing in Saudi Arabia. This study therefore explores the factors, advantages, and challenges associated with Fog Computing adoption in Saudi Arabian public organisations by developing a novel comprehensive adoption model (FCA-SAPO). The model extends the Technology-Organisation-Environment (TOE) framework with a financial context, so that innovation adoption variables are classified into four dimensions: technological, organisational, environmental, and financial. This position paper also outlines the methodology to be used to assess the proposed model. The study aims to give decision-makers in Saudi Arabian public organisations a comprehensive understanding of the factors, benefits, and issues related to Fog Computing so that they can make an informed decision should they consider applying it. In the next few years, the results of this study are expected to assist both researchers and governments in identifying the factors to consider when adopting Fog Computing and the challenges it may present.

Keywords: Fog Computing · TOE framework · Saudi Arabia
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): SAI 2023, LNNS 711, pp. 69–85, 2023. https://doi.org/10.1007/978-3-031-37717-4_6

1 Introduction

Advances in Information and Communication Technology (ICT) allow government organisations to simplify how they interact with citizens, other institutions, and businesses, and to streamline their own function and structure [1]. Using emerging technologies effectively allows public organisations to integrate back-office systems in order to provide fully customised electronic services to individuals, businesses, and government organisations. Fog Computing is a new platform and technology that can facilitate the provision of user-centred and cost-effective services. As an alternative to centralised data centres that provide computing, storage, and distribution services, Fog Computing provides all these services locally, in a distributed architecture closer to the end-user devices where data is generated [2]. Although both Fog Computing and Edge Computing shift computation from the core of the network towards its edge, Fog Computing provides computing, networking, storage, control, and acceleration anywhere from the Cloud down to things, whereas Edge Computing generally consists only of computing at the network’s edge [3].

Fog Computing is similar to Cloud Computing in that users can access computing resources such as data, storage, and application services [4]. It offers additional features, however, such as heterogeneity, low latency, mobility, location awareness, real-time interaction, and a large number of geographically distributed nodes [5]. Because Fog Computing platforms are relatively new, their adoption must be carefully planned and understood before implementation. Fog Computing has proven to be one of the paradigm-shifting platforms for computing services and resources, and developing countries stand to benefit when they adopt it [6].

Little of the Fog Computing literature has examined the factors, opportunities, and issues that arise in developing countries such as Saudi Arabia, although our own research group has examined Fog Computing in the context of Saudi Arabian SMEs and Government Systems [7, 8]. This paper, by contrast, focuses on the adoption of Fog Computing in public organisations and examines its factors, challenges, and benefits. Any new technology must be thoroughly analysed for benefits and hindrances before it is adopted [6].
When adopting new technologies, evaluating the costs and benefits of a cutting-edge technology is more critical than simply deciding whether to adopt it [9]. Saudi Arabian organisations may consider both the positive and the negative characteristics of a Fog Computing platform when deciding whether to adopt it, since adopting new technology may involve risks. Developing a model that articulates the relative advantages and disadvantages of Fog Computing in Saudi Arabia is thus valuable, and the choice of whether to provide E-Government services in Saudi Arabia using the Fog Computing paradigm can be informed by the model proposed in this study. We also expect the study to reveal new factors that influence the adoption of Fog Computing in the Saudi Arabian public sector. In view of this, this paper proposes a new model for Fog Computing Adoption in Saudi Arabian Public Organisations (FCA-SAPO).

The remainder of this paper is structured as follows: Sect. 2 reviews related research, Sect. 3 presents the theoretical frameworks for technology acceptance, Sect. 4 describes the research model and hypotheses, Sect. 5 outlines the research methodology, and Sect. 6 concludes the paper.
2 Related Work

In Edge Computing, data is processed at the network’s edge via small data centres, which improves data management and storage [10]. Since the advent of the Internet of Things (IoT), there has been an increased reliance on scenarios that generate large amounts of data from various sources, and a Fog Computing model is designed to analyse and control these real-time data effectively. Fog Computing is a new approach that extends from the Cloud to the network’s edge. In order to embrace it, organisations seek an understanding of the factors that influence the adoption of emerging technologies. By exploring these factors, policymakers and decision-makers will better understand the potential advantages and challenges associated with the adoption process and can plan accordingly.

A new infrastructure is needed that can harness the benefits of the IoT and respond to instantaneous events, since current Cloud models cannot address the specifics of IoT. Jiang et al. [11], discussing the significant challenges and solutions in Fog Computing systems that adopt Cloud Computing orchestration frameworks, identified four main reasons that impede heavy reliance on the back-end Cloud. First, transferring the massive amount of data generated by end users’ devices is difficult because of bandwidth limitations and the cost of data transmission. Second, data privacy and security are a concern: some users prefer to avoid transmitting and sharing their data over long distances, as this may put data confidentiality at risk. Third, the high latency between end users’ devices and the Cloud results in significant performance degradation and network congestion. Lastly, responding rapidly to contextual changes in localised applications can be complex for remote Clouds. Fog Computing, however, can offer a platform where applications running in a local network do the data analysis and processing, rather than a centralised Cloud.
By properly orchestrating and controlling the computing and storage resources placed at the network’s edge, Fog Computing can deal with the ever-increasing amount of data and number of connected devices. Also, meeting the technological constraints of IoT applications allows platform designers to decide whether to serve an endpoint from the Fog, from the Cloud, or from a combination of the two. Cloud Computing and Fog Computing have overlapping characteristics; however, real-time interaction, mobility, and low latency are better served by Fog Computing.

In the context of Saudi public organisations, almost all studies have found that security and trust, compliance with regulations, and data confidentiality are the significant determinants hindering adoption. For example, the study in [12] examined the most significant factors affecting Saudi organisations’ intention to migrate to the Cloud and concluded that security, trust, and privacy issues were the main concerns for most participants.

In conclusion, more literature is needed that examines the factors associated with Fog Computing. Currently proposed models and frameworks have focused mainly on the importance of Cloud Computing and the factors influencing the adoption of that paradigm. Although Fog Computing is described as an extension of Cloud Computing, there is an apparent shortage of literature providing an in-depth understanding of the positive and negative features associated with it.
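The placement decision mentioned above (serving an endpoint from the Fog, from the Cloud, or from both) can be illustrated with a minimal sketch. The `Workload` fields, the `place` function, and the 120 ms Cloud round-trip figure are hypothetical and chosen only for illustration; they are not part of the paper or of any Fog platform's API.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of an IoT workload (not from the paper)."""
    name: str
    max_latency_ms: float          # response time the application can tolerate
    data_is_sensitive: bool        # True if raw data should stay on the local network
    needs_long_term_storage: bool  # True if results must be archived centrally


# Assumed round-trip time to a remote Cloud; an illustrative value only.
ASSUMED_CLOUD_LATENCY_MS = 120.0


def place(workload: Workload, cloud_latency_ms: float = ASSUMED_CLOUD_LATENCY_MS) -> str:
    """Return "fog", "cloud", or "fog+cloud" for a workload."""
    # Local processing is preferred when the data is sensitive or when the
    # Cloud round trip would exceed the latency the application tolerates.
    needs_fog = (
        workload.data_is_sensitive
        or workload.max_latency_ms < cloud_latency_ms
    )
    if needs_fog and workload.needs_long_term_storage:
        return "fog+cloud"  # process locally, archive in the Cloud
    if needs_fog:
        return "fog"
    return "cloud"
```

Under these assumptions, a latency-critical workload stays in the Fog, a latency-tolerant archival workload goes to the Cloud, and a sensitive workload that also needs central archiving uses both. A real deployment would weigh further constraints, such as bandwidth cost and node capacity.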
3 Theoretical Frameworks of Technology Acceptance

Different theories are employed to explain the adoption of a particular technology, and consulting them helps to identify best practices and the most effective adoption strategy [13, 14]. Once a technology is adopted, data must be collected to evaluate its impact; within a technology acceptance model, theories help to define the essential information to collect. The FCA-SAPO model proposed here is based on the Technology-Organisation-Environment (TOE) framework [15].

3.1 Technology-Organisation-Environment (TOE)

In 1990, Tornatzky et al. [15] introduced the TOE framework in The Processes of Technological Innovation. The TOE explains how technological innovation is adopted and implemented within organisations and which factors influence that adoption [15]. Because of its flexibility, it can be applied to various types of research, and the factors it captures affect most IT implementations [16].

The TOE framework was chosen for this study for several reasons. To begin with, research suggests that various factors influence the adoption of cutting-edge technologies, such as Fog Computing, by public organisations.
These include technical, organisational, environmental, and financial factors. With the help of the TOE framework, decision-makers can better understand why public organisations are willing to adopt Fog Computing. Second, previous studies of Fog Computing adoption in public organisations have examined only a limited number of the TOE contexts. This study uses the TOE framework to analyse all relevant factors, issues, and benefits associated with Fog Computing adoption in the public sector, and adds a financial context in order to provide a comprehensive assessment of technology adoption factors. Third, the TOE framework helps organisations evaluate the current impact of technology adoption, especially concerning innovation and new technologies [17], and organisations have recognised its three components as critical contexts in determining that impact. Fourth, the TOE framework is flexible, practical, and suitable for analysing the performance of cutting-edge technologies. It is therefore an ideal framework for analysing the critical contexts associated with technological adoption: technology, organisation, and environment.

3.2 Other Theoretical Frameworks of Technology Acceptance

The adoption of IT has been extensively studied, and several theories and models have been proposed to explain it. These include the Unified Theory of Acceptance and Use of Technology (UTAUT) [18], the Theory of Reasoned Action (TRA) [19], the Technology Acceptance Model (TAM) [20], and the Diffusion of Innovation Theory (DOI) [21]. Moreover, several studies by Zhang et al. [22] have proposed models specifically for adopting Fog Computing. These studies have shown that various factors, including organisational readiness and perception of benefits, play a significant role in the adoption of Fog Computing technologies.

The UTAUT is a commonly used model of technology adoption. It indicates that users are more likely to adopt cutting-edge technologies if they are confident that the technology’s performance will meet their expectations and believe that it will simplify their work [23]. The model, however, is limited in several ways that make it insufficient for this study. First, it does not include essential constructs and factors such as perceived awareness and quality of service. Second, it neglects privacy, security, and policy and regulation, which are crucial to the reliability of a new technology. These factors significantly affect technology adoption, and their absence renders the UTAUT model less helpful in this context.
The Theory of Reasoned Action (TRA) explains users’ attitudes and makes it possible to predict their behaviour from those attitudes as they interact with cutting-edge technology [24]. Previous studies have identified limitations of the TRA framework in predicting the adoption of cutting-edge technologies. In particular, the framework consists of only two concepts: attitude toward behaviour and subjective norms. It may therefore overlook factors that affect users’ final decisions about adopting a technology. The framework may also lead to confusion between attitudes and norms, as attitudes can be redefined as norms and vice versa. It is thus difficult to use the TRA to determine the factors influencing technology adoption in public organisations.

The Technology Acceptance Model (TAM) describes how users come to accept and use a technology and ultimately adopt it [25]. However, the TAM does not consider financial factors, which can constitute a significant part of an organisation’s decision-making process when evaluating the adoption of cutting-edge technology. The TAM is designed primarily to explain the acceptance of technology by individuals rather than by organisations such as libraries, which have different policies and regulations [26]. Because of these limitations, the TAM alone is not sufficient to predict the acceptance and adoption of information and communication technology (ICT) within an organisation, and multiple additional factors must be considered when applying it.
The Diffusion of Innovation Theory (DOI) examines new technologies during the adoption process in order to determine how they spread among intended users. The DOI states that cutting-edge technologies will be adopted when they are appropriate, applicable, and meet the needs of their users. However, the DOI framework fails to acknowledge the significant role that trust plays in ensuring that cutting-edge technology is accepted by organisations [27]. It also does not consider the impact of context and domain differences on technology adoption. A further limitation is that the framework measures only compatibility and complexity from the organisation’s perspective and does not consider other variables, which limits its ability to meet the objectives of this research.
4 Research Model and Hypotheses

Users, Organisations, and Technology are often studied to ensure a new technology’s successful implementation [28]. As a case study of Fog Computing adoption in Saudi Arabia’s public sector, this section presents and describes the research model and hypothesis. In this study, four theoretical contexts with associated factors have been defined as influencing factors in the adoption of Fog Computing in public organisations in Saudi Arabia. The following contexts are considered:

1. Technological Context
2. Organisational Context
3. Environmental Context
4. Financial Context
The TOE framework demonstrates how innovation can affect organisations, both potentially and actually. Because it is highly flexible, it has been applied to various types of research, and the factors it captures significantly affect the implementation of information technology [16]. Based on our literature review, we conclude that the existing TOE framework can be amended to include new factors that have a more critical impact on adoption than those it already includes [7, 8]. This adapted TOE framework will be used to identify the critical factors affecting Fog Computing adoption in Saudi Arabian public organisations. The existing TOE framework has been utilised in numerous studies [29, 30], which gives us confidence that an adaptation will serve us well in this research.

To adapt the TOE framework, our investigation identified a number of new factors that could be considered whenever a public organisation adopts a new technology such as Fog Computing. These factors are assigned to the technical, organisational, environmental, and financial contexts listed above, as detailed in Fig. 1, so that decision-makers can better understand why public organisations are willing to adopt Fog Computing. Previous studies of Fog Computing adoption in public organisations have examined only two or three TOE contexts. To build a conceptual research model, it is therefore necessary to incorporate into the framework the factors that most influence the adoption of new technologies. The TOE framework also helps organisations evaluate the current impact of technology adoption, specifically its impact on innovation [17], and its three components have been accepted by organisations as critical contexts in the adoption of technology. In short, the TOE framework is flexible, practical, and well suited to analysing the adoption of new technologies, because new factors can be incorporated into each context.

Using the TOE framework, this study examines factors, selected through background research, that may influence the uptake of Cloud-based technologies within organisations [31, 32]. These factors are analysed, integrated, and categorised in the conceptual research model (FCA-SAPO), which aims to enhance the process of Fog Computing adoption in the Saudi Arabian governmental sector. As illustrated in Fig. 1, the proposed model integrates these factors.
Fig. 1. The Conceptual Research Model of Fog Computing Adoption in Saudi Arabian Public Organisations (FCA-SAPO)
4.1 Technical Context

Complexity hinders the widespread use of any novel technology [31]. A high level of complexity leads to inefficient use of the technology and a lack of relevant data. It also raises adoption costs, because learning how to use the interface of an innovation takes time. For new technologies to be widely accepted, they must be easy to use or have accessible interfaces. The likelihood that people adopt cutting-edge technologies is moderated by various factors, including employees’ awareness and knowledge of such technologies. Although it focuses on users’ reactions to technology, this study contends that demographic characteristics such as an organisation’s size and location may mitigate the impact of innovative features, economic variables, and technological aspects on the adoption of Fog technology [33].

Quality of Service (QS). Even though the location of a Fog technology provider does not affect the quality of service provided or the effectiveness of the technology itself, it may influence the technology’s dissemination and users’ future choices [34]. Experienced employees, a fast Internet connection, and a reliable power supply can all have a profound impact on technology decisions. This leads to the following hypothesis:

H1: A high level of Service Quality will increase a public organisation’s Intention to Adopt Fog Computing.

Security (SE). The safety and confidentiality of user information are two concerns that must be considered when an organisation chooses to use cutting-edge technology. In this study, the term “security” refers to safeguarding media, data centres, and services, with data privacy and security incorporated into the design of the system within the organisation. Various industries have faced problems related to online security, including e-commerce and, most notably, internet banking [34]. The implementation of Fog technology has been slowed by the same security concerns that have restricted the widespread acceptance of other forms of technology.
Accordingly, the following hypothesis is proposed:

H2: A high level of Security will increase a public organisation’s Intention to Adopt Fog Computing.

Privacy (PR). Data privacy is the most important consideration when an organisation implements cutting-edge technology. In this study, the privacy construct refers to the privacy and security of data held by an organisation [35]. Many users have long expressed concern about the prevalence of security holes in sectors such as e-commerce and, most significantly, internet banking. Therefore, the following hypothesis is presented:

H3: A high level of Privacy will increase a public organisation’s Intention to Adopt Fog Computing.

Compatibility (CM). According to Russ [36], compatibility is high if Fog technology aligns with users’ expectations, needs, and past experiences. In prior literature, compatibility has been used to measure users’ willingness to adopt and utilise information systems, and it has been shown to play a significant role in how easy a system is perceived to be to work with. Users who have used similar technology in the past therefore expect to be able to adopt the latest innovations in this area quickly and easily. Compatibility significantly affects the perceived ease of use of Fog technology, indicating that users are more likely to adopt it if they find it compatible with their values, beliefs, lifestyles, and requirements. As a result, the following hypothesis is proposed:

H4: A high level of Compatibility will increase a public organisation’s Intention to Adopt Fog Computing.

Complexity (CO). Complexity is a measure of how much effort is involved in understanding and implementing an innovation. In this study, the focus is on how users’ expectations of difficulty affect their performance. Users have been found less likely to accept and utilise Fog technology when they perceive it to be complicated. Ullah et al. [37] found that complexity is inversely related to ease of use, and studies by Lee [38], among others, indicated that users are less inclined to use a complicated new system. Furthermore, according to Rzepka et al. [39], users’ perceptions of a system’s complexity and usability decreased with increasing technological implicitness. Implicitness is thus related both to the complexity of the system and to the amount of work the user must perform, while effort expectancy is an indicator of the ease of use of the system. The level of technical implicitness inherent in a piece of technology may therefore influence the perceived difficulty of using it. This leads to the following hypothesis:

H5: A reduced level of Complexity will increase a public organisation’s Intention to Adopt Fog Computing.

Awareness (AW). Several factors influence a person’s willingness to embrace and accept cutting-edge technology, including their level of awareness and knowledge of such technologies.
Understanding the task’s context is essential to determining what data should be handled and how it should be processed. Consequently, developing new computer models for collecting, modelling, inferring, and aggregating contextual information presents severe challenges. Despite the growing interest in Fog Computing and context awareness over the past few years, a comprehensive review of Fog-based context-aware systems has not yet been conducted. To evaluate Fog-based context-aware systems, this study provides a paradigm that considers two dimensions: the type of application being reviewed and the environment in which it is used [40]. Studies are then reviewed and categorised according to these dimensions, and current obstacles and potential future study areas are discussed. As a result, the following hypothesis is proposed:

H6: A high level of Awareness will increase a public organisation’s Intention to Adopt Fog Computing.
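The six technical-context hypotheses (H1 to H6) each predict a direction of association between a factor and the Intention to Adopt. As a minimal sketch, they can be encoded as data together with a sign check. The `Hypothesis` class and the `is_supported` helper are hypothetical names introduced here for illustration. The check looks only at the direction of an observed effect, so it stands in for neither the statistical testing that a real evaluation would need nor the paper's own methodology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    label: str
    factor: str
    # +1: a higher level of the factor is expected to increase intention to adopt
    # -1: a lower level of the factor is expected to increase intention to adopt
    expected_sign: int


# Technical-context hypotheses as stated in Sect. 4.1.
TECHNICAL_HYPOTHESES = [
    Hypothesis("H1", "Quality of Service (QS)", +1),
    Hypothesis("H2", "Security (SE)", +1),
    Hypothesis("H3", "Privacy (PR)", +1),
    Hypothesis("H4", "Compatibility (CM)", +1),
    Hypothesis("H5", "Complexity (CO)", -1),
    Hypothesis("H6", "Awareness (AW)", +1),
]


def is_supported(hypothesis: Hypothesis, observed_effect: float) -> bool:
    """Return True if the observed effect has the predicted direction.

    `observed_effect` is assumed to be a signed effect estimate, such as a
    correlation or regression coefficient between the factor and intention
    to adopt. Statistical significance is deliberately not considered.
    """
    return observed_effect * hypothesis.expected_sign > 0
```

H5 is the only hypothesis with a negative expected sign, because it predicts that reduced Complexity increases the Intention to Adopt.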
4.2 Organisational Context
New technologies are adopted and implemented in organisations based on a variety of organisational characteristics, including size, structure, strategy, and culture [41]. According to a meta-analysis of IT-related characteristics in companies, organisational support fared poorly in the adoption process. However, experts speculate that other factors, such as price, may act as moderators. When considering the size of an organisation, two factors are critical. The first is the economic aspect, such as the cost of Fog technology; larger companies will have an easier time affording this technology. The second is the concept of scale. According to several studies, such as that conducted by AlBar et al. [40], larger firms are more likely to embrace innovation as a means of survival and growth. Smaller companies, by contrast, have found that their adaptability and creativity allow them to evaluate new technologies more quickly. The size, structure, strategy, and culture of an organisation have a significant impact on the adoption and implementation of new technologies in developing countries such as Saudi Arabia. According to a recent meta-analysis by AlBar et al. [40], organisational support is one of the most critical determinants of IT adoption. Experts also consider that cost may act as a moderator, given the magnitude of other developments. Additional technical aspects of Fog technology, such as data security and privacy, are anticipated to have a direct impact on its installation and use.
Senior Management Support (SM). Within an organisation, “intra-organizational acceptance” refers to the extent to which a new practice or idea is accepted and supported by the staff.
Organisational determination plays a significant role in the speed and success with which cutting-edge technologies are adopted by organisations [42]. In determining the pace at which new technologies are introduced, organisational factors such as originality, structure, geography, and size play a significant role. Consequently, IT professionals make decisions concerning Fog technology installation based on organisational characteristics, such as size, location, and workers’ familiarity with Fog technology. This leads to the following hypothesis: H7: A high level of Senior Management Support will increase a public organisation’s Intention to Adopt Fog Computing. Technology Readiness (TR). In the early processing phases, IT-savvy businesses are more attractive to customers and better equipped to collect data both internally and externally [34]. Additionally, these businesses use more engaging content that provides practical applications for the technology. The adoption and utilisation of technology within a company have been influenced significantly by the intention and behaviour of employees. As a result of using modern technology, they have also experienced a change in the quality of their work [40]. As far as technology readiness is concerned, it refers to the understanding level of network technologies and corporation systems that are capable of supporting Fog Computing [40]. Other academics have observed that the definition of this phrase depends
on the amount of work required to operate a particular system. The term “perceived usefulness” captures the users’ assessment of a system’s capacity to improve efficiency. As a result, it is up to the IT experts of the company to decide whether cutting-edge technology should be implemented. Adoption will also be influenced by the decision maker’s behavioural intention. As a result, the following hypothesis can be drawn: H8: A high level of Technology Readiness will increase a public organisation’s Intention to Adopt Fog Computing.
4.3 Environmental Context
Several factors can influence the diffusion of Fog technologies, including infrastructure and other environmental factors. This phenomenon can also be modelled using a mediator of innovation parameters [34]. A literature review indicates that infrastructural factors mediate the efficacy of Fog technology. New approaches will be regulated by Fog technology providers based on an organisation’s position in relation to other comparable organisations and on its geographical location. Geography matters: organisations in metropolitan areas and other hubs are more likely to have management teams with the range of skills and experience needed to assess the potential consequences of implementing new technology. Monitoring the use of new technology in similar establishments may reveal the impact of adopting it in the same industry [43]. There is no correlation between the location of the provider of Fog technology and the quality of the services provided or the effectiveness of Fog technology. However, location may affect the dissemination of the technology and the choices that users make in the future. A reliable power source, efficient Internet access, and access to competent workers are examples of resources that may moderate the decision to adopt cutting-edge equipment.
Competitive Pressures (CP).
Fog Computing will be increasingly adopted as a result of competitive pressure. External factors such as competitive pressures and pressure from trading partners influence public sector organisations to adopt Fog-based enterprise resource planning (ERP) systems [34]. An external factor may have an impact on the choice made by the company. In order to remain competitive, a company must adopt cutting-edge technologies. Several empirical studies have demonstrated the importance of external pressure from competitors as a motivator for adoption. For example, companies are under intense pressure from their competitors to increase their productivity. In the context of small enterprises, competitive pressure was an influencing factor in adoption. Many companies have turned to outsourcing their IT systems in order to increase efficiency, which is supported by the literature on outsourcing [43]. Thus, a company that invests wisely in emerging technologies may be able to grow its market share by
lowering its prices and appealing to a broader audience. The following hypothesis is derived from this: H9: A high level of Competitive Pressures will increase a public organisation’s Intention to Adopt Fog Computing.
Compliance with Regulations (CR). Fewer restrictions on the use of Fog Computing can facilitate its implementation by the government. In general, the costs associated with complying with regulations are negligible when compared to the total costs incurred by an organisation regulated by the government [43]. In addition to reducing productivity, overly restrictive regulatory environments may lead to companies relocating their investments and operations to areas with fewer restrictions. Many government and industry requirements fall under the category of “compliance,” including the Sarbanes-Oxley Act and the EU Data Protection Act [44]. Regardless of whether the organisation has established internal controls, implementing public-Cloud infrastructure, Cloud-based applications, or anything in between will require some degree of control to be relinquished to the Cloud provider. This leads to the following hypothesis: H10: A reduced level of Compliance with Regulations will increase a public organisation’s Intention to Adopt Fog Computing.
4.4 Financial Context
Fog Computing firms and Edge Computing can have a significant impact on financial services. Increasingly intelligent and robust devices are being used on the periphery of the Internet. The data collected by these devices can be analysed in order to learn more about human behaviour as well as the ecosystem in which humans live [45]. For example, the accuracy of weather forecasts can be compared with the shopping habits of consumers. In contrast, the Cloud represents the other end of the spectrum, where a seemingly unlimited amount of computing and storage power is available.
The Internet of Things (IoT) is formed by the convergence of various gadgets, network infrastructure, and Cloud Computing services. In the financial services (FS) sector, companies are increasingly looking to software industry leaders in the areas of the Internet of things (IoT) and analytics for inspiration [45]. Cost (CT). Several factors have been cited as contributing to the widespread adoption of Fog technology, including its low cost and powerful protection algorithms [44]. Electronic commerce, including online banking, has experienced security problems over the years. Security concerns have slowed the implementation of Fog Computing for the same reasons that have slowed the acceptance of other internet-based technologies. Thus, the main characteristics that differentiate Fog technology from other technologies have been
identified as its cost-effectiveness and advanced protective algorithms, which may affect the adoption rate. Based on this, the following hypothesis can be formulated: H11: A reduced Cost level will increase a public organisation’s Intention to Adopt Fog Computing. Maintenance (MA). Fog Computing is expected to gain popularity in government organisations due to its low maintenance costs. As a result of reduced maintenance costs, public organisations are encouraged to embrace Fog Computing [44]. In addition to being less costly to maintain, low-maintenance buildings can significantly contribute to local economies. Moreover, it helps to maintain property values while reducing maintenance costs to a minimum and ensures that commercial and public areas are always well maintained [45]. As a result, it is essential to reduce the property’s overall impact on the environment in order to maintain its market value. Thus, it leads to the following hypothesis: H12: A reduced level of Maintenance will increase a public organisation’s Intention to Adopt Fog Computing.
5 Research Methodology
A mixed methods approach will be employed in this study, comprising a qualitative (semi-structured interview) component and a quantitative (survey) component. Using this approach, the study will analyse numerical and qualitative data to achieve the research objectives [46–48]. Fog Computing will certainly assist government organisations in reforming their services and in making informed decisions about the adoption of e-government services using Fog Computing. A two-stage research process will be followed. The first stage involves distributing the quantitative instrument (survey questionnaires) among Saudi Arabia’s governmental ministries. To make the distribution more comprehensive, the questionnaires will be distributed both manually and electronically. Respondents will be drawn from the top management and IT staff of each organisation, targeting those who consider themselves experts in the field of Fog Computing. A total of about 400 target samples will be collected from Saudi public organisations for the purpose of validating the proposed model and hypotheses. The target samples will reflect the Saudi public organisations’ awareness and acceptance of the Fog Computing concept. Statistics provided by the Saudi Ministry of Statistics will be used to determine the number of IT staff. To analyse the data collected from the target sample, structural equation modelling (SEM), a multivariate analysis technique, will be applied using the AMOS software. A descriptive statistical analysis will also be used to illustrate the data collected regarding the adoption of Fog Computing in the Saudi Arabian public sector. The proposed conceptual model (FCA-SAPO) will therefore be analysed using SEM procedures and descriptive statistics.
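The text does not state how the target of about 400 responses was chosen. As a point of comparison only, the sketch below applies Cochran's formula for estimating a proportion. The inputs (95% confidence, maximum variability p = 0.5, 5% margin of error) are assumptions and are not taken from the study, and the population figure used for the finite-population correction is purely hypothetical.

```python
import math


def cochran_sample_size(z: float = 1.96, p: float = 0.5, e: float = 0.05) -> int:
    """Minimum sample size for estimating a proportion in a large population.

    z: z-score for the confidence level (1.96 for 95%, an assumed choice)
    p: expected proportion (0.5 is the most conservative value)
    e: tolerated margin of error
    """
    return math.ceil((z ** 2) * p * (1 - p) / (e ** 2))


def finite_population_correction(n0: int, population: int) -> int:
    """Shrink n0 when the surveyed population is small and known."""
    return math.ceil(n0 / (1 + (n0 - 1) / population))


n0 = cochran_sample_size()
print(n0)  # 385

# Hypothetical population of 10,000 IT staff (not a figure from the study).
print(finite_population_correction(n0, 10_000))  # 371
```

Under these assumed inputs the formula gives 385, so a target of about 400 responses would sit slightly above that threshold.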
The second stage of this research involves the qualitative method (semi-structured interviews). In each ministry, the head of the IT department will be interviewed. Interviews will be conducted only with the heads of the IT departments (working at the headquarters of each ministry) because of their experience, their constant interaction with upper management, and their involvement in IT and organisational decisions.
6 Conclusion
Fog Computing is a field of research that has attracted considerable attention and, given the challenges of its adoption, should be explored in further depth [49–51]. This study explores the factors, benefits, and issues that may arise when applying Fog Computing in public organisations within Saudi Arabia. A conceptual model (FCA-SAPO) has been developed on the basis of the TOE framework to identify the critical factors that may affect the adoption of Fog Computing in the governmental sectors in Saudi Arabia. This comprehensive model will give decision-makers in e-government and IT top managers a better understanding of the critical factors that influence the adoption of Fog Computing within Saudi Arabia’s public organisations, so that they can make more informed decisions.
References
1. Cordella, A., Tempini, N.: E-government and organizational change: reappraising the role of ICT and bureaucracy in public service delivery. Gov. Inf. Q. 32(3), 279–286 (2015). https://doi.org/10.1016/j.giq.2015.03.005
2. Jalali, F., Khodadustan, S., Gray, C., Hinton, K., Suits, F.: Greening IoT with fog: a survey. In: Proceedings - 2017 IEEE 1st International Conference on Edge Computing, EDGE 2017, pp. 25–31 (2017). https://doi.org/10.1109/IEEE.EDGE.2017.13
3. Yousefpour, A., et al.: All one needs to know about fog computing and related edge computing paradigms: a complete survey. J. Syst. Archit. 98, 289–330 (2019). https://doi.org/10.1016/j.sysarc.2019.02.009
4. Stojmenovic, I., Wen, S.: The fog computing paradigm: scenarios and security issues. In: 2014 Federated Conference on Computer Science and Information Systems, FedCSIS 2014, pp. 1–8 (2014). https://doi.org/10.15439/2014F503
5. Osanaiye, O., Chen, S., Yan, Z., Lu, R., Choo, K.K.R., Dlodlo, M.: From cloud to fog computing: a review and a conceptual live VM migration framework. IEEE Access 5, 8284–8300 (2017). https://doi.org/10.1109/ACCESS.2017.2692960
6. Mahmood, Z., Ramachandran, M.: Fog computing: concepts, principles and related paradigms. In: Mahmood, Z. (ed.) Fog Computing, pp. 3–21. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-94890-4_1
7. Al Mudawi, N., Beloff, N., White, M.: Developing a framework of critical factors affecting the adoption of cloud computing in government systems (ACCE-GOV). In: Arai, K. (ed.) Intelligent Computing. LNNS, vol. 283, pp. 520–538. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-80119-9_32
8. Alqahtani, M., Beloff, N., White, M.: A new adoption of cloud computing model for Saudi Arabian SMEs (ACCM-SME). In: Arai, K. (eds.) Intelligent Systems and Applications. IntelliSys 2022. Lecture Notes in Networks and Systems, vol. 542, pp. 192–210. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-16072-1_15
9. Hall, B.H.: Adoption of New Technology (2004). https://www.researchgate.net/publication/23742215
10. Yousefpour, A., Ishigaki, G., Gour, R., Jue, J.P.: On Reducing IoT Service Delay via Fog Offloading (2018). https://doi.org/10.1109/JIOT.2017.2788802
11. Jiang, Y., Huang, Z., Tsang, D.H.K.: Challenges and solutions in fog computing orchestration. IEEE Netw. 32(3), 122–129 (2018). https://doi.org/10.1109/MNET.2017.1700271
12. Alkhater, N., Wills, G., Walters, R.: Factors influencing an organisation’s intention to adopt cloud computing in Saudi Arabia. In: Proceedings of the International Conference on Cloud Computing Technology and Science, CloudCom, vol. 5, no. 2, pp. 1040–1044 (2015). https://doi.org/10.1109/CloudCom.2014.95
13. El-Sofany, H.F., Al-Tourki, T., Al-Howimel, H., Al-Sadoon, A.: E-Government in Saudi Arabia: barriers, challenges and its role of development. Int. J. Comput. Appl. 48(5), 16–22 (2012). https://doi.org/10.5120/7344-0119
14. Chhonker, M.S., Verma, D., Kar, A.K.: Review of technology adoption frameworks in mobile commerce. Procedia Comput. Sci. 122, 888–895 (2017). https://doi.org/10.1016/j.procs.2017.11.451
15. Tornatzky, M., Fleischer, L.: The Processes of Technological Innovation. Lexington, Mass (1990)
16. Oliveira, T., Martins, R.O., Martins, M.F.: Literature review of information technology adoption models at firm level. Electron. J. Inf. Syst. Eval. 14, 110 (2011)
17. Dillon, T., Wu, C., Chang, E.: Cloud computing: issues and challenges. In: Proceedings International Conference on Advanced Information Networking and Applications, AINA, vol. 20, no. 23, pp. 27–33 (2010). https://doi.org/10.1109/AINA.2010.187
18. Venkatesh, V., Morris, M.G., Davis, G.B., Davis, F.D.: User acceptance of information technology: toward a unified view. MIS Q. 27(3), 425–478 (2003). https://doi.org/10.2307/30036540
19. Ajzen, I., Fishbein, M.: Belief, Attitude, Intention, and Behaviour: An Introduction to Theory and Research. Addison-Wesley, Reading, MA (1975)
20. Davis, F.D., Bagozzi, R.P., Warshaw, P.R.: User acceptance of computer technology: a comparison of two theoretical models. Sour. Manage. Sci. 35(8), 982–1003 (1989). https://doi.org/10.1287/mnsc.35.8.982
21. Rogers, E.: Attributes of Innovations and their Rate of Adoption. Library of Congress Cataloging-in-Publication Data (1995)
22. Hu, P., Dhelim, S., Ning, H., Qiu, T.: Survey on fog computing: architecture, key technologies, applications and open issues. J. Netw. Comput. Appl. 98, 27–42 (2017). https://doi.org/10.1016/j.jnca.2017.09.002
23. Dwivedi, Y.K., Rana, N.P., Tamilmani, K., Raman, R.: A meta-analysis based modified unified theory of acceptance and use of technology (meta-UTAUT): a review of emerging literature. Curr. Opin. Psychol. 36, pp. 13–18 (2020). https://doi.org/10.1016/j.copsyc.2020.03.008
24. Nickerson, C.: Theory of Reasoned Action, Simply Psychology (2022). https://www.simplypsychology.org/theory-of-reasoned-action.html. Accessed 03 Aug 2022
25. Kamal, S.A., Shafiq, M., Kakria, P.: Investigating acceptance of telemedicine services through an extended technology acceptance model (TAM). Technol. Soc. 60, 101212 (2020). https://doi.org/10.1016/j.techsoc.2019.101212
26. Ajibade, P.: Technology acceptance model limitations and criticisms: exploring the practical applications and use in technology-related studies, mixed-method, and qualitative researches. In: Procedia Computer Science (2018). http://digitalcommons.unl.edu/libphilprac/1941
27. MacVaugh, J., Schiavone, F.: Limits to the diffusion of innovation. Eur. J. Innov. Manag. 13(2), 197–221 (2010). https://doi.org/10.1108/14601061011040258
28. S-Mohamadali, N.A.K., Garibaldi, J.M.: Understanding and addressing the ‘fit’ between user, technology and organization in evaluating user acceptance of healthcare technology. In: Proceedings of the International Conference on Health Informatics, pp. 119–124 (2012). https://doi.org/10.5220/0003696901190124
29. Borgman, H.P., Bahli, B., Heier, H., Schewski, F.: Cloudrise: exploring cloud computing adoption and governance with the TOE framework. In: 2013 46th Hawaii International Conference on System Sciences, pp. 4425–4435 (2013). https://doi.org/10.1109/HICSS.2013.132
30. Tashkandi, A.A., Al-Jabri, I.: Cloud computing adoption by higher education institutions in Saudi Arabia: analysis based on TOE. In: 2015 International Conference on Cloud Computing, ICCC 2015 (2015). https://doi.org/10.1109/CLOUDCOMP.2015.7149634
31. Gangwar, H., Ramaswamy, R.: Understanding determinants of cloud computing adoption using an integrated TAM-TOE model. J. Enterp. Inf. Manag. 28(1), 107–130 (2015). https://doi.org/10.1108/JEIM-08-2013-0065
32. Awa, H.O., Ojiabo, O.U., Orokor, L.E.: Integrated technology-organization-environment (TO-E) taxonomies for technology adoption. J. Enterp. Inf. Manag. 30(6), 893–921 (2017). https://doi.org/10.1108/JEIM-03-2016-0079
33. Priyadarshinee, P.: Applications of fog computing and Internet of Things in Indian smart cities. Int. J. Soc. Ecol. Sustain. Dev. 13(1), 1–17 (2022). https://doi.org/10.4018/IJSESD.302647
34. Aljawarneh, N.M., Sokiyna, M., Obeidat, A.M., Alomari, K.A.K., Alradaideh, A.T., Alomari, Z.S.: The role of CRM fog computing on innovation and customer service quality: an empirical study. Mark. Manage. Innov. 2, 286–297 (2020). https://doi.org/10.21272/mmi.2020.2-21
35. Ullah, S., Xuefeng, Z.: Cloud computing: a prologue. Adv. Res. Comput. Commun. Eng. 1(1), 1–4 (2013). https://doi.org/10.48550/arXiv.1304.2981
36. Russ, W.: The Relationship Between Technology Adoption Determinants and the Intention to Use Software-Defined Networking, Walden University (2021). https://scholarworks.waldenu.edu/dissertations
37. Ullah, N., Al-Rahmi, W.M., Alzahrani, A.I., Alfarraj, O., Alblehai, F.M.: Blockchain technology adoption in smart learning environments. Sustainability 13(4), 2021 (2021). https://doi.org/10.3390/su13041801
38. Lee, J.W.: Diffusion of innovations. In: Encyclopedia of Sport Management, Edward Elgar Publishing, pp. 137–138 (2021). https://doi.org/10.4337/9781800883284.diffusion.of.innovations
39. Rzepka, C., Berger, B.: User interaction with AI-enabled systems: a systematic review of IS research. In: International Conference on Information Systems, vol. 39 (2018). https://www.researchgate.net/publication/329269262
40. AlBar, A.M., Hoque, M.: Factors affecting the adoption of information and communication technology in small and medium enterprises: a perspective from rural Saudi Arabia. Inf. Technol. Dev. 25(4), 715–738 (2019). https://doi.org/10.1080/02681102.2017.1390437
41. Kyratsis, Y., Ahmad, R., Holmes, A.: Technology adoption and implementation in organisations: comparative case studies of 12 English NHS Trusts. BMJ Open 2(2), e000872 (2012). https://doi.org/10.1136/bmjopen-2012-000872
42. Cascio, W.F., Montealegre, R.: How technology is changing work and organizations. Annu. Rev. Organ. Psych. Organ. Behav. 3(1), 349–375 (2016). https://doi.org/10.1146/annurev-orgpsych-041015-062352
43. Kumar, S., Tiwari, P., Zymbler, M.: Internet of Things is a revolutionary approach for future technology enhancement: a review. J. Big Data 6(1), 1–21 (2019). https://doi.org/10.1186/s40537-019-0268-2
44. Narwane, V.S., Raut, R.D., Gardas, B.B., Kavre, M.S., Narkhede, B.E.: Factors affecting the adoption of cloud of things: the case study of Indian small and medium enterprises. J. Syst. Inf. Technol. 21(4), 397–418 (2019). https://doi.org/10.1108/JSIT-10-2018-0137
45. Al Hadwer, A., Tavana, M., Gillis, D., Rezania, D.: A systematic review of organizational factors impacting cloud-based technology adoption using technology-organization-environment framework. Internet Things 15, 100407 (2021). https://doi.org/10.1016/j.iot.2021.100407
46. Ishtiaq, M.: Book review. In: Creswell, J.W., Research Design: Qualitative, Quantitative and Mixed Methods Approaches (4th ed.). Thousand Oaks, CA: Sage, English Language Teaching, vol. 12, no. 5, p. 40 (2019). https://doi.org/10.5539/elt.v12n5p40
47. Creswell, J.W.: Research Design (International Student Edition): Qualitative, Quantitative, and Mixed Methods Approaches (4th ed). SAGE Publications (2013)
48. Bonham, L.A.: Candy, Philip C self-direction for lifelong learning. San Francisco: Jossey-Bass, 567 pages. $45.00. Adult Educ. Q. 42(3), 192–193 (1992). https://doi.org/10.1177/074171369204200307
49. Naha, R.K., et al.: Fog computing: survey of trends, architectures, requirements, and research directions. IEEE Access 6, 47980–48009 (2018). https://doi.org/10.1109/ACCESS.2018.2866491
50. Aazam, M., Zeadally, S., Harras, K.A.: Deploying fog computing in industrial Internet of Things and industry 4.0. IEEE Trans. Industr. Inform. 14(10), 4674–4682 (2018). https://doi.org/10.1109/TII.2018.2855198
51. Bouzarkouna, I., Sahnoun, M., Sghaier, N., Baudry, D., Gout, C.: Challenges facing the industrial implementation of fog computing. In: Proceedings - 2018 IEEE 6th International Conference on Future Internet of Things and Cloud, FiCloud 2018, pp. 341–348 (2018). https://doi.org/10.1109/FiCloud.2018.00056
A Review of Computational Load-Balancing for Mobile Edge Computing
Michael Wilson1,2(B), Henry Nunoo-Mensah1,3, and Kwame Osei Boateng1,3
1 Department of Computer Engineering, Kwame Nkrumah University of Science and Technology, Kumasi, Ghana
[email protected], {hnunoo-mensah,boat.soe}@knust.edu.gh
2 Institute for Scientific and Technological Information, Council for Scientific and Industrial Research (CSIR), Accra, Ghana
3 Connected Devices Lab, Kwame Nkrumah University of Science and Technology, Kumasi, Ghana
Abstract. Living and working smarter is becoming the new trend, and this is primarily made possible by the growth in the number of connected Internet of Things (IoT) devices globally. The foremost challenges of IoT devices include energy constraints and their inability to meet the latency demands of user applications. These challenges are partly connected to the devices’ over-dependence on cloud services for data processing. In recent years, edge computing variants have been considered as a way to shift IoT applications away from their dependence on the traditional cloud for data processing. However, the compute resources of individual edge devices are woefully inadequate, hence the need for efficient offloading strategies and resource allocation algorithms that will distribute and balance computational tasks among several collaborating edge servers/devices. Different algorithms, including heuristic algorithms, machine learning techniques, bio-inspired and genetic algorithms, and game theory-based approaches, have been studied by researchers towards optimizing computational load balancing in edge computing solutions. This paper systematically reviews the adaptation of Mobile Edge Computing for latency improvement in IoT applications, with a specific focus on state-of-the-art computational load balancing and offloading strategies. The paper further outlines future research directions and open issues that researchers can explore to find solutions to the latency gap in using Mobile Edge Computing for IoT data processing.
Keywords: Computational Offloading · Optimization Algorithms · Load Balancing · Mobile Edge Computing · IoT Data Processing
1 Introduction
The concept and term Internet of Things (IoT) were first proposed in 1998 by Kevin Ashton during a talk for the Auto-ID Centre at MIT [1]. The formal introduction of IoT, however, was made by the International Telecommunications Union in 2005 through its ITU Internet Report [2, 3].
H. Nunoo-Mensah and K. Osei Boateng: These authors contributed equally to this work.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): SAI 2023, LNNS 711, pp. 86–110, 2023. https://doi.org/10.1007/978-3-031-37717-4_7
In its simplest form, an IoT application connects sensors and devices having “ON” and “OFF” switches to the Internet and gathers data from these sensors to control the switches of the connected actuators intelligently. In the early days of IoT, cloud computing established itself as the most widely used approach to offloading computing tasks in IoT applications. A typical cloud architecture for IoT data processing gathers raw data from various edge devices, transmits it through high-bandwidth data networks to remote servers for processing, storage, knowledge extraction and decision making, and finally sends the results to the same or other edge nodes for actuation. A report from Statista, however, shows that the number of connected IoT devices worldwide had reached 22 billion as of 2018 and projects that it will reach 50 billion by 2030 [4]. Furthermore, Cisco projects that by 2030 the current cloud and fog architecture will require about a fifth of globally generated power to transmit, process and store the vast volumes of IoT data [5]. This projected growth in the number of connected devices has been of significant concern to many researchers across the globe, who have flagged the use of cloud computing for IoT data processing as unsustainable [1, 6, 7]. Also, many IoT applications, such as self-driving cars, intelligent surveillance systems and Digital Twins (DT), impose response-time constraints that are not met even by modern data networks and cloud solutions [8–10]. Figure 1 shows the recent transitions from IoT’s dependence on traditional cloud computing architecture through the era of fog computing [11, 12] and, most recently, to the use of mobile edge computing for IoT data processing.
Fig. 1. The Transition from Cloud to Edge Computing
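To put the device-count projection cited above in perspective, the short calculation below derives the constant yearly growth rate implied by moving from 22 billion devices in 2018 to 50 billion in 2030. The assumption of a constant compound rate is ours, made only for illustration; the cited report may model growth differently.

```python
def implied_annual_growth(start_count: float, end_count: float, years: int) -> float:
    """Constant yearly growth rate that takes start_count to end_count in `years` years."""
    return (end_count / start_count) ** (1 / years) - 1


# Figures quoted in the text: 22 billion devices (2018), 50 billion projected (2030).
rate = implied_annual_growth(22e9, 50e9, 2030 - 2018)
print(f"{rate:.1%}")  # roughly 7% per year
```

A compound rate of about 7% per year more than doubles the device population over the twelve-year window, which is the scale of growth behind the sustainability concern.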
Various works in edge computing [13, 14], mobile edge computing [15, 16] and edge-cloud computing [14] solutions argue that we are at a point where edge devices have adequate processing capabilities. These researchers contend that, owing to the capabilities of such edge and end devices, they can in turn form part of a global collaborative data processing architecture for a wide range of applications, such as augmented reality, virtual reality, and the Internet of Things [17–19]. Mobile Edge Computing (MEC) is a promising solution that can (i) minimise latency and improve response time in IoT applications; (ii) reduce overall global network traffic; (iii) reduce the load on end-user devices; (iv) reduce computational cost; and (v) minimise the power consumption of IoT applications.
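Whether offloading actually reduces latency depends on the trade-off between transmission time and faster remote execution. The sketch below illustrates that trade-off under a simplified latency model that is widely used in the offloading literature but is an assumption here, not a model taken from this paper: local time is the task's CPU cycles divided by the device clock rate, and offloaded time is the upload time plus execution time on the edge server, with the download of results ignored. All names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    data_bits: float   # input data that must be uploaded before remote execution
    cpu_cycles: float  # CPU cycles needed to complete the task


def local_latency(task: Task, f_local_hz: float) -> float:
    """Seconds to run the task on the end device itself."""
    return task.cpu_cycles / f_local_hz


def offload_latency(task: Task, uplink_bps: float, f_edge_hz: float) -> float:
    """Seconds to upload the input and run the task on an edge server.

    Simplification: the time to return the result is ignored.
    """
    return task.data_bits / uplink_bps + task.cpu_cycles / f_edge_hz


def should_offload(task: Task, f_local_hz: float, uplink_bps: float, f_edge_hz: float) -> bool:
    """Offload only if it is strictly faster than local execution."""
    return offload_latency(task, uplink_bps, f_edge_hz) < local_latency(task, f_local_hz)


# Hypothetical task: 2 Mbit of input, 1e9 CPU cycles.
task = Task(data_bits=2e6, cpu_cycles=1e9)

# 1 GHz device, 10 GHz edge server, fast (10 Mbit/s) versus slow (1 Mbit/s) uplink.
fast_link = should_offload(task, f_local_hz=1e9, uplink_bps=10e6, f_edge_hz=10e9)
slow_link = should_offload(task, f_local_hz=1e9, uplink_bps=1e6, f_edge_hz=10e9)
print(fast_link, slow_link)  # True False
```

With the fast uplink the offloaded task takes about 0.3 s against 1 s locally, whereas the slow uplink pushes the offloaded time to about 2.1 s, so the same task is better executed locally.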
Managing the computational load effectively, however, remains a key challenge in MEC. To overcome this challenge, efficient offloading strategies and resource allocation algorithms must be devised to distribute and balance computational tasks among several collaborating edge servers/devices. These algorithms should ensure that individual node resources are not overburdened while other nodes remain relatively idle. These strategies should also ensure that local node activities are not adversely affected by global tasks during collaborative computing. The task offloading problem has been formulated and addressed using different techniques, including statistical models, machine learning, game theory, genetic algorithms, and hybrids of multi-criteria decision-making strategies [20, 21]. However, these existing approaches to resource allocation and optimization of computational load balancing in mobile edge computing-based IoT setups still leave many issues unaddressed and open for research [21, 22]. Key among the unaddressed load-balancing issues are low performance, high latency, availability, scalability and implementation complexity [23, 24]. Solutions using statistical models are not flexible and are very complex to design compared with other load-balancing techniques [25, 26]. Both statistical-model and machine-learning-based solutions also require more resources (e.g., CPU and memory) for their implementation and rely heavily on data to make accurate offloading decisions [26, 27]. Interested readers may find an earlier review specific to machine learning approaches used before 2019 in the survey [28]. Genetic algorithms, on the other hand, rely on certain assumptions about the behavior of system components, which may result in inaccurate load-balancing decisions if these assumptions do not hold [29].
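The balancing requirement stated above, that no node is overburdened while others sit idle, can be illustrated with a simple greedy heuristic. The sketch below is not one of the reviewed algorithms: it places tasks in decreasing order of size on whichever node would finish them earliest, given heterogeneous node speeds. Task names, node names and all numbers are hypothetical.

```python
def assign_least_loaded(task_cycles: dict, node_speed: dict):
    """Greedy load balancing across heterogeneous edge nodes.

    task_cycles: task id -> work required (e.g. giga-cycles)
    node_speed:  node id -> processing rate (e.g. giga-cycles per second)
    Returns (placement, busy), where placement maps each task to a node and
    busy maps each node to the total time it needs for its assigned tasks.
    """
    busy = {node: 0.0 for node in node_speed}
    placement = {}
    # Largest tasks first, so small tasks can later fill the remaining gaps.
    for task_id, cycles in sorted(task_cycles.items(), key=lambda kv: kv[1], reverse=True):
        # Choose the node whose finish time after taking this task is smallest.
        best = min(busy, key=lambda node: busy[node] + cycles / node_speed[node])
        busy[best] += cycles / node_speed[best]
        placement[task_id] = best
    return placement, busy


tasks = {"t1": 8, "t2": 6, "t3": 4, "t4": 2}   # hypothetical workloads
nodes = {"edge-a": 2.0, "edge-b": 1.0}         # edge-a is twice as fast as edge-b

placement, busy = assign_least_loaded(tasks, nodes)
print(placement)  # {'t1': 'edge-a', 't2': 'edge-b', 't3': 'edge-a', 't4': 'edge-a'}
print(busy)       # {'edge-a': 7.0, 'edge-b': 6.0}
```

In this example the faster node takes three of the four tasks, yet both nodes finish within one second of each other, which is the kind of outcome a load balancer aims for. A greedy rule of this sort needs global knowledge of every node's load, which is one reason decentralised formulations are also studied.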
For solutions modelled around game theory, the highly dynamic and distributed nature of IoT systems makes it very difficult to model the task-offloading problem as a complete-information game. Some existing attempts have treated load balancing as a non-cooperative normal-form game in which nodes are purely decentralised players whose task-offloading decisions are treated as a single move made with no information about the moves taken by other nodes (players). Achieving Nash Equilibrium will, however, require cooperation and repeated iterations of the decision-making process, which adversely impacts latency [22].
This paper reviews and discusses task offloading problem formulations, load-balancing and resource allocation techniques, evaluation metrics, significant contributions, and limitations of recent literature that uses variants of mobile edge computing for IoT data processing. In the selection of articles for this review, a search was conducted across several databases (predominantly IEEE Xplore, Google Scholar and ResearchGate) using variants of the three key phrases below:
• Computational load balancing techniques for Mobile Edge Computing based IoT
• Resource allocation for data processing in Mobile Edge Computing based IoT systems
• Mobile Edge Computing for IoT Computational Task Offloading
The resulting articles were further filtered by year of publication and keyword count as outlined in the algorithm below. Abstracts of the filtered articles were read, and articles relevant to the purpose of this review were selected.
A Review of Computational Load-Balancing for Mobile Edge Computing
To the best of our knowledge, no survey paper has reviewed and discussed state-of-the-art approaches to computational load balancing in mobile edge computing from 2019 to 2022, which motivates this paper. The primary goal of this paper is to review and discuss practical and state-of-the-art techniques for computational load balancing and resource allocation in the use of edge computing variants for IoT data processing. This paper also identifies essential task offloading problem formulation methods and proposed algorithms that have been used to address the load balancing and resource allocation problem in edge computing, and discusses open issues and opportunities for future research in this direction.
The remainder of the paper is organised as follows. Section 2 discusses a broad categorisation of approaches used in recent works to address the task offloading problem. Section 3 discusses specific state-of-the-art methods for load balancing and task offloading in mobile edge computing. The problem formulation, the primary solution used, metrics, benefits and limitations of each approach are identified and summarised. Section 4 discusses open issues and opportunities for future research, and the paper is concluded in Sect. 5.
2 Classification of Load Balancing and Offloading Strategies
Offloading strategies have been classified by [30] as either fine-grained or coarse-grained. Offloading strategies in which tasks can be divided into multiple sub-tasks are said to be fine-grained, while those that do not support task division are classified as coarse-grained. This paper further categorises the strategies for carrying out load balancing and offloading into five groups based on the techniques adopted. These groups are heuristic-based, bio-inspired, machine learning-based, combinatorial optimisation, and game-theoretical techniques. Figure 2 shows the main categorisation of existing load balancing strategies.
Fig. 2. Categories of Load Balancing Techniques
2.1 Bio/Genetic Inspired Techniques
Nature’s success in optimally solving problems with high complexity, extreme diversity and dynamism has been the backbone for a class of meta-heuristic optimisation techniques classified as bio-inspired. Bio-inspired algorithms have been studied extensively in past decades for optimisation in computing [31]. Genetic algorithms based on natural selection or biological evolution are considered bio-inspired and have also been used extensively for both constrained and unconstrained optimisation in many fields, including computational resource management [32]. The stochastic nature of bio-inspired and genetic techniques presents an advantage over deterministic approaches as problem size and complexity increase [31]. Bio-inspired and genetic techniques are broadly classified as evolutionary (e.g., Genetic Programming [33], Differential Evolution [34]), swarm-based [35] (e.g., Particle Swarm Optimization [36], Ant Colony Optimization [37]) or ecological (e.g., Invasive Weed Colony Optimization [36]). In this paper, load balancing optimisation strategies categorised as bio/genetic inspired are those that are meta-heuristic in nature and optimise computational offloading either by mimicking the intelligent interaction of living organisms with their environment when solving real-world problems, or by mimicking the biological evolution of organisms in creating new breeds that adapt better to the problem.
2.2 Heuristic-Based Techniques
Optimisation strategies classified as heuristic in this paper share the feature of not necessarily finding the optimal solution to the optimisation problem but rather focusing on finding a good enough offloading solution within a reasonable time frame. Their outcomes are not provable or always guaranteed and may not be easily reproduced, but are usually good enough, with high accuracy scores. Heuristic approaches to problem-solving offer a valuable shortcut to decision making that avoids an exhaustive analysis of all possibilities [38, 39]. Attempts have been made to heuristically solve some of the most challenging NP-complete combinatorial optimisation problems, such as the travelling salesman problem [40]. However, heuristic-based algorithms are prone to error and biased judgements. Their reliance on existing heuristics also carries the inherent disadvantage of inhibiting the discovery of better alternative solutions.
2.3 AI/Machine Learning-Driven Techniques
Systems designed to learn from data in order to discover patterns for decision making with little or no human intervention have been an active research field in recent years. These systems are considered to be performing machine learning, leading to the development of artificially intelligent agents or models. All machine learning algorithms that perform some form of classification or regression can be considered optimisation algorithms. A review of how machine learning has been used for solving discrete and/or continuous optimisation problems is presented in [41]. Put simply, the key underlying feature of offloading strategies that this paper classifies as AI/machine learning-based is that they have an underlying programmed algorithm trained with pre-collected data and capable of independently predicting, or supporting the prediction of, optimised offloading decisions. The increasing dynamism and complexity of modern optimisation problems is making machine learning and artificial intelligence techniques necessary.
2.4 Combinatorial Optimization Techniques
Combinatorial optimisation is a subfield of mathematical optimisation that consists of finding an optimal object from a finite set of objects. The set of feasible solutions is discrete or can be reduced to a discrete set. Many combinatorial optimisation problems are NP-hard, meaning that no provably efficient (polynomial-time) algorithm is known for them [25]. Computational task offloading and resource allocation can be viewed as NP-hard given their dynamism and complexity. A commonality of the algorithms classified as combinatorial in this paper is the use of mathematical methods for finding a computational task offloading decision within a finite set of available options. The feasibility of doing so is strongly related to how the problem is formulated, which has led to a direction of research in combinatorial problem formulation in many optimisation environments [25].
2.5 Game Theory Based Techniques
These strategies model an optimisation problem as a competitive environment of interrelated activities (restricted by some constraints) of actors competing for the same or limited options. Each actor tries to find a pure strategy that maximises its chances of winning given the activities of other actors. Thus, contrary to the earlier discussion of bio-inspired approaches, where the agents cooperate in finding the optimised offloading strategy, the game theory based approaches discussed in this paper have the individual agents working against each other, bounded by a set of defined rules, in a quest for an offloading decision that maximises the efficiency of their local tasks. As such, they apply control theory in finding an algorithm for updating their individual strategies until they reach a stable solution termed Nash Equilibrium [42], where no participant can gain by a unilateral change of strategy if the strategies of the others remain unchanged. Real-life optimisations with game theory are seen in many situations, including missile defence systems, energy regulation, automated auctions, computational load balancing, and many more [43, 44].
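The iterative strategy update that ends in a Nash Equilibrium can be illustrated with a minimal sketch. It assumes a toy model that is not taken from any of the reviewed works: every device picks exactly one server, and its delay equals the number of devices sharing that server divided by a hypothetical server speed. The function names are illustrative only.

```python
def device_delay(choice, speeds, device, server):
    """Delay `device` would see on `server`, given everyone else's current choice."""
    others = sum(1 for d, s in enumerate(choice) if s == server and d != device)
    return (others + 1) / speeds[server]


def best_response_dynamics(num_devices, speeds, max_rounds=100):
    """Let devices take turns switching to their individually best server."""
    choice = [0] * num_devices  # assumption: every device starts on server 0
    for _ in range(max_rounds):
        changed = False
        for device in range(num_devices):
            current = device_delay(choice, speeds, device, choice[device])
            best = min(range(len(speeds)),
                       key=lambda s: device_delay(choice, speeds, device, s))
            if device_delay(choice, speeds, device, best) < current:
                choice[device] = best  # unilateral improvement
                changed = True
        if not changed:  # nobody wants to move: a Nash Equilibrium
            break
    return choice


def is_nash(choice, speeds):
    """Check that no device can lower its own delay by moving alone."""
    return all(
        device_delay(choice, speeds, d, choice[d])
        <= device_delay(choice, speeds, d, s)
        for d in range(len(choice))
        for s in range(len(speeds))
    )


assignment = best_response_dynamics(6, [1.0, 2.0])
print(assignment, is_nash(assignment, [1.0, 2.0]))
```

In this toy setting each round requires every device to observe the current choices of the others, which hints at why repeated rounds of such updates add latency in a real deployment.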
3 Review of State-of-the-Art Works on Computational Task Offloading and Load Balancing
3.1 Bio/Genetic Inspired Techniques
Xu et al. [20] proposed the use of the non-dominated sorting genetic algorithm III (NSGA-III) to solve the multi-objective optimisation problem of balancing computational task offloading in mobile edge computing. Computational offloading strategies were modelled as genes, with each chromosome representing a solution to the task allocation problem. Each chromosome in a set of possible strategies is evaluated against three objective functions (time consumption, energy consumption, and load balancing). Non-dominated sorting was used to filter qualified strategies. The optimal strategy was finally selected as the chromosome with the highest utility value, computed as a simple weighted sum of the three objective functions. In an experiment, the approach led to an increase in the utilisation of idle nodes. However, the weights assigned to the objective functions in the computation of the utility values were predetermined, making the solution impractical for use in dynamic situations.
Niu et al. [44] proposed a workload balance initialisation algorithm that minimised network delay and a resource allocation algorithm modelled around a modified particle swarm algorithm to minimise service delay. Task allocation among multiple edge servers is defined in this work as a semi-definite programming problem. The convex optimisation method was combined with the Gaussian random method to solve the defined problem. The solution modified particle swarm optimisation to avoid falling quickly into local optima while also converging faster. The complexity of the solution could not be expressed due to uncertainties in the heuristic nature of the algorithm used.
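The filter-then-select step described for Xu et al. [20] can be sketched as follows. This is only an illustration of non-dominated filtering followed by a weighted-sum choice, not NSGA-III itself (there is no population evolution or reference-point niching). The candidate tuples, the weights and the function names are hypothetical, and all three objectives are assumed to be costs to minimise, so the highest utility corresponds to the lowest weighted cost.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates):
    """Keep only strategies that no other strategy dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


def select_strategy(candidates, weights):
    """Pick the non-dominated strategy with the lowest weighted-sum cost."""
    front = pareto_front(candidates)
    return min(front, key=lambda c: sum(w * x for w, x in zip(weights, c)))


# Hypothetical strategies: (time, energy, load imbalance), all to be minimised.
strategies = [(4.0, 2.0, 0.3), (3.0, 3.0, 0.2), (5.0, 3.5, 0.4), (3.5, 2.5, 0.25)]
time_heavy = select_strategy(strategies, (0.5, 0.3, 0.2))
energy_heavy = select_strategy(strategies, (0.2, 0.7, 0.1))
print(time_heavy, energy_heavy)
```

The two calls return different strategies for the same candidate set, which illustrates the limitation noted above: with fixed weights the selected strategy cannot adapt when priorities shift at run time.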
Chen and Li [45] formulated an offloading problem to minimise system energy consumption as a mixed-integer nonlinear programming problem. Selection, crossover and mutation principles borrowed from the genetic algorithm were combined with a grey wolf optimiser to solve the formulated problem. The proposed solution combines the global search capability of a genetic algorithm with the quick convergence of a grey wolf optimiser.
Tang et al. [27] modelled the very likely situation where pending computational tasks outweigh the total available resources as a multiple knapsack problem. An offload indicator expressed each device’s offloading preference (local execution or execution on an edge server) based on a weighted sum of delay and energy consumption. K-Means clustering classified devices into three clusters during offload decisions based on their indicator values. Rather than traversing all K-values of all devices, a genetic algorithm is used to find a joint optimal offload decision.
3.2 Heuristic-Based Techniques
Ning et al. [26] formulated the competition for resources in single-user and multi-user computation offloading as a mixed-integer linear programming problem. A heuristic mobile edge computing resource allocation algorithm uses the branch and bound method to decide whether each end-user task should be performed locally, offloaded to a mobile edge computing server, or offloaded to a mobile cloud computing platform for processing in a multiuser hierarchy as shown in Fig. 3.
Fig. 3. Multiuser Hierarchy Computation Offloading Model. [26]
The offloading decisions are continuously updated until all conflicts are resolved. The optimisation target in this work was based only on data transmission and data processing time. The approach is shown to lower execution delays by up to about 30% when the number of IoT-based user equipment is no more than twice the number of MEC servers. Its computational complexity was shown to be lower than that of a brute-force search. However, vertical offloading is used, which subjects the solution to significant deterioration in performance as the number of user devices increases.
In Fan’s work [46], an Application awaRE workload Allocation (AREA) scheme for edge computing-based IoT is proposed, which uses a heuristic algorithm that computes response time as a combination of computational delay and network delay. AREA starts by allocating each App to the closest cloudlet and then iteratively selects a suitable App with the highest response time and reassigns it to an alternative cloudlet that reduces its response time, until no App can find a better cloudlet. AREA was shown to yield better latency values than the density-based clustering (DBC) strategy used in [47] and the latency-based (LB) strategy in [48]. However, the iterative nature of the technique increases delays if it is used in mobile edge computing platforms that support device-to-device (D2D) offloading with high numbers of user computing devices.
Kim et al. [49] proposed that tasks of an over-burdened edge server be offloaded downward to local IoT devices governed by that edge server. Their solution sought an optimal schedule for task execution in each IoT device that maximised the number of tasks that met their respective deadlines. In addition, the collaborative software scheduler depicted in Fig. 4 was designed with a communication module and two heuristic algorithms, Hierarchical Weight Allocation (HWA) and Completion time-based Task Assignment (CTA), running a task executor in each device and a task distributor in the edge server, respectively.
Fig. 4. Software Architecture for the Collaborative Task Scheduling [49].
The edge server determined which tasks to offload, while the IoT devices determined the execution schedule of offloaded tasks. However, relying on end devices for such decisions is undesirable for time-critical solutions.
Chen et al. [50] identified the energy expended in IoT task offloading as a function of transmission power and offloading duration (which indirectly hinges on bandwidth availability). Balancing the energy consumed in offloading while guaranteeing satisfactory latency was formulated as a stochastic unified optimisation problem. Lyapunov optimisation techniques were used to derive a dynamic online decision that trades off transmission energy consumption against task delay. The solution efficiently reduced transmission energy consumption, but this was achieved at the cost of latency, which is a key performance indicator in IoT applications.
Yue et al. [51] broadly formulated the task offloading problem as a delay-constrained long-term stochastic optimisation problem. The problem was formulated without prior statistical knowledge and then broken down into two convex optimisation problems, which handle task delay guarantees, and one maximum bipartite matching problem for offloading decision making.
Distefano [52] proposed the use of Web Ontology Language (OWL) reasoning for identifying suitable edge nodes based on task requirements, ultimately mapping decomposed tasks to selected nodes. In the task selection process, task requirements are represented with values and cardinality constraints, while device descriptions are represented with specific property values. A match was established for devices whose descriptions fell within the range of requirement values and constraints. The task allocation, presented as an optimisation problem, was then addressed with a mapper driven by the logic of minimising the number of involved nodes in order to reserve nodes that can substitute disconnecting nodes.
However, the use of OWL for device discovery and selection was seen as a potential performance bottleneck as the amount of input data grows.
3.3 AI/Machine Learning-Driven Techniques
Qian et al. [53] attempted to dynamically allocate resources in an edge computing-based, workflow-aided IoT setup, as illustrated in Fig. 5. The resource allocation and task offloading problem was modelled as a Markov Decision Process (MDP) and solved using a deep Q-network (DQN) algorithm [54] to adaptively allocate a number of available edge virtual machines to tasks, targeting an overall reduced latency in a network of nodes.
Fig. 5. The Structure of DRL-Based VM Allocation Algorithm. Modeled as an MDP with parameters $s_t$, $a_t$, $R_t$, $s_{t+1}$, $a_{t+1}$, where $s_t$ and $a_t$ are the State and Action in the Current Time Slot, $R_t$ is the Rewarding Function, and $s_{t+1}$ and $a_{t+1}$ are the State and Action in the Next Time Slot. [53]
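The role of the MDP tuple in the caption can be illustrated with a single tabular Q-learning update. This is a deliberately simplified stand-in, not the DQN of [53]: a DQN replaces the lookup table with a neural network, while the temporal-difference target has the same form. The state labels, the reward value, and the learning rate `alpha` and discount `gamma` are hypothetical.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Apply one temporal-difference update to the table entry q[(state, action)]."""
    best_next = max(q[(next_state, a)] for a in actions)  # greedy value of s_{t+1}
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical example: actions are the number of VMs allocated (0 or 1), and the
# reward is the negative latency observed after acting in state "s0".
q_table = defaultdict(float)
q_table[("s1", 1)] = 2.0
updated = q_update(q_table, "s0", 0, reward=-1.0, next_state="s1", actions=[0, 1])
print(updated)
```

Even this table-based form needs one entry per state-action pair, so the memory and compute demands grow quickly with the state space.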
The proposed algorithm was run on cluster heads with enough resources. However, running DQN on an average lightweight mobile edge computing device such as a smartphone is impractical. Consequently, the approach tends to fail in networks that have no edge server with enough resources to run the proposed solution.
Wang et al. [55] optimised computational task offloading by formulating a multi-objective latency optimisation problem. The alternating direction method of multipliers (ADMM) algorithm was used to minimise three latency metrics: offloading, transmission, and computing. A multiclass classification algorithm based on the random forest model is used to predict the resource status (i.e., busy or normal) of edge nodes for adaptive resource allocation and computation offloading in a campus network. The proposed solution is centred on a specific Access Point (AP)-based computing architecture and hence is not applicable to other scenarios.
Lei et al. [56] formulated the computational offloading problem under the stochastic arrival of tasks in edge systems as an infinite-horizon average-reward continuous-time Markov decision process (CTMDP) model. They proposed using reinforcement learning to learn the policy (a function that specifies the decision-maker’s action in each state) of the MDP, aiming to minimise the weighted sum of the average delay and power consumption over all the IoT devices. The CTMDP model depends on post-decision states to obtain a model-free solution without requiring knowledge of the underlying stochastic process. However, the decision-making process was centralised in the base station, which became a single point of failure.
Ale et al. [57] proposed an end-to-end deep reinforcement learning approach to task offloading aimed at increasing the throughput of tasks that meet their respective deadlines while at the same time reducing the energy consumed. The offloading problem addressed is formulated as a Markov Decision Process (MDP) to maximise expected long-term rewards.
Latency reduction is formulated as a non-convex end-to-end latency minimisation problem by Do-duy et al. [58]. A digital twin that replicates the real-time state of the involved MEC servers is proposed for use by an iterative algorithm to optimise latency. However, the solution is disadvantaged by the replication of server states in real time, which is resource-intensive when scaled and introduces extra latency overhead on the optimisation solution.
In [23], Talaat et al. implemented a load balancing and optimization strategy (LBOS) to monitor and continuously collect information about edge servers on a Fog architecture. The LBOS accepts incoming data processing requests and distributes them to available servers using a resource allocation method based on reinforcement learning and a genetic algorithm. The LBOS is implemented as software that iteratively computes the response time of each fog server connected to a master server in a fog region in order to select the best server to handle a task. This solution, however, does not scale beyond the fog server level to make use of compute resources that are not directly connected to the master and hence leaves a portion of the compute resources within the network architecture unused. The iterative computation also increases the time complexity of the solution as the number of servers increases.
Prior to the work by Cicconetti et al. [59], the assignment of client tasks to edge nodes was built mainly around centralised resource orchestration or distributed overlay architectures. An uncoordinated opportunistic task assignment approach is explored in which each client node with a task to execute explores an assigned pool of edge nodes, termed executors, within a proposed serverless IoT environment [59].
For the selection of an executor, the explorable pool of executors is iteratively probed with the execution of a lambda function, and the corresponding response times are used to select the most suitable executor for task offloading. The uncoordinated approach is shown to have lower device and network complexity and to make more effective use of network resources than the centralised and distributed alternatives at the time. However, the method is not an efficient way of balancing load on available compute resources, since some clients might not be using their assigned pool of executors while others are in dire need of resources. Furthermore, the work does not touch on any algorithm or strategy for allocating executor pools to clients.
In [60], the offloading decision problem is modelled as a MAPE-K loop over available edge colonies, each having a limited set of compute resources. A Deep Neural Network is used to predict the expected delay and energy consumption when a task is executed locally or remotely. The prediction is then used for offloading decision-making in a mobile edge computing environment. A disadvantage of this solution is the resource-intensive setup required for constant monitoring of available edge colonies and the use of a busy-waiting loop for listening to client requests as well as server responses.
3.4 Combinatorial Optimization Techniques
Zhang et al. [61] formulated the problem of maximising overall system utility in a multi-user, multi-server Vehicular Edge Computing setup as a mixed-integer nonlinear programming problem. The problem was constrained by a permissible latency metric and a measured processing delay as performance metrics. An approximation algorithm using a combination of the standard convex method [62] and the rounding method [63] was proposed to solve the server selection problem, while optimisation of the offloading ratio was addressed with the Hungarian algorithm. Thus, a joint algorithm for selection decision, computation resource allocation and offloading was presented. The proposed solution did not account for the dynamic nature of nodes entering and exiting a MEC setup when used for IoT.
Nezami et al. [64] introduced a decentralised agent-based cooperative service placement algorithm for IoT service placement in a distributed edge-to-cloud infrastructure, aimed at optimising workload balance while minimising service execution cost. The load-balancing problem is formulated using two objective functions that allow the nodes to collaborate and exchange information with other nearby nodes to choose service placement plans that achieve load balancing. Each agent takes part in a two-step offloading decision process. In the first step, each agent generates a set of possible plans, and in the second step, the agent selects one of them. This procedure is repeated for all new requests that enter the network. A global view of the proposed solution is delineated in Fig. 6 below.
Fig. 6. Global System View of Service Placement Plan [64]
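The second step of the two-step process described for [64], in which an agent selects one of its candidate plans, can be sketched under stated assumptions. The scoring rule below (variance of the resulting node loads plus the plan's execution cost) is a hypothetical stand-in for the two objective functions of [64], whose exact form is not given here. The node names, the loads and the `balance_weight` parameter are likewise illustrative.

```python
from statistics import pvariance


def select_plan(node_loads, plans, balance_weight=1.0):
    """Pick the plan with the lowest (weighted load variance + execution cost).

    node_loads: dict mapping node name to its current load.
    plans: list of (extra_load, cost) pairs, where extra_load maps node name
           to the load the plan would add to that node.
    """
    def score(plan):
        extra_load, cost = plan
        new_loads = [load + extra_load.get(node, 0.0)
                     for node, load in node_loads.items()]
        return balance_weight * pvariance(new_loads) + cost

    return min(plans, key=score)


# Hypothetical neighbourhood state shared among nearby agents.
loads = {"a": 0.8, "b": 0.2, "c": 0.5}
candidate_plans = [({"a": 0.3}, 0.01), ({"b": 0.3}, 0.02)]
balanced_choice = select_plan(loads, candidate_plans)
cheapest_choice = select_plan(loads, candidate_plans, balance_weight=0.0)
print(balanced_choice, cheapest_choice)
```

Setting `balance_weight` to zero makes the agent choose purely on cost, which shows how the two objectives pull the selection in different directions.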
While many works assume that latency optimisation begins with strategies for offloading tasks, Feng et al. [65] focused on optimising the partitioning of tasks into sub-tasks before offloading decisions are made. Specifically, task partitioning for independent sub-tasks and for sequentially dependent sub-tasks is formulated as two different mixed-integer linear programming problems, aiming to reduce the average latency of user equipment. A dual decomposition algorithm obtains a latency reduction solution for user association based on a latency function.
Zhang et al. [66] proposed the use of a central software-defined network (SDN) controller to manage the assignment of end-user device tasks to the best available small base stations based on information (computation tasks, computational resources and the communication link) shared by the base stations with the central controller. The controller formulates the task allocation problem as a single optimisation task with a linear objective function targeting decreased system cost (energy and time). The branch and bound method was used to find the optimal solution to the formulated problem.
Compared to most solutions, the proposed solution has low complexity but does not address issues related to the high dynamism of complex IoT setups.
3.5 Game Theory Based Techniques
Zhang et al. [61] used game theory in an SDN-based load-balancing task at the edge server level to reduce processing delay in a centralised vehicular edge computing network. Three game-theoretic models are explored for load balancing decisions, in which vehicles compete for compute slots and recursively update their decisions based on the decisions of other vehicles until Nash equilibrium is achieved. The algorithm is a good way of decentralising the offloading decision. However, it leaves the resources of some servers underutilised while others are overloaded, since all nodes rush for the best service.
Huang et al. [30] designed a nonlinear and non-convex delay optimisation offloading problem based on non-cooperative game theory. The problem was formulated through quantitative analysis of both local and offloading delays, and was targeted at minimising the offload delay of each device in a multi-device environment with multiple edge servers. To solve the problem, an offloading game is designed in which end-user devices compete for limited resources. In the formulated offloading game, device tasks are broken into sub-tasks and run in parallel on multiple servers to keep the largest sub-task delay as small as possible. Each device partakes in a synchronised iteration of making a request to update its offloading decision (which is accepted or rejected) until no change request is recorded over a specified number of epochs. The cost of attaining Nash Equilibrium was shown to grow with the number of devices. It is also shown that as the number of devices increases, the average offloading percentage decreases, with the rate of decrease slowing as the number of devices increases further, as shown in Fig. 7. The solution is, however, not effective if task dependencies do not allow for partitioning into sub-tasks.
Fig. 7. Observed Decrease in Successful Tasks Offloaded as the Number of Devices Increases [30]
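The idea of splitting a task so that the largest sub-task delay stays as small as possible can be shown with a minimal sketch. It assumes an arbitrarily divisible task, ignores transmission delay and competition from other devices, and models each server only by a hypothetical processing speed; none of these simplifications come from [30]. Under these assumptions the largest delay is minimised when every server finishes at the same time, which means each share is proportional to server speed.

```python
def split_task(total_cycles, server_speeds):
    """Split a divisible task so that all servers finish simultaneously.

    Returns (shares, delay): the workload given to each server and the common
    finishing time, which is also the largest sub-task delay.
    """
    total_speed = sum(server_speeds)
    shares = [total_cycles * speed / total_speed for speed in server_speeds]
    return shares, total_cycles / total_speed


# Hypothetical workload of 120 units on three servers with speeds 1, 2 and 3.
shares, delay = split_task(120.0, [1.0, 2.0, 3.0])
print(shares, delay)
```

Any other split would leave at least one server finishing later than this common time, so the proportional split is the best achievable in this idealised model.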
Mixed-integer nonlinear An approximation • Permissible Latency programming problem algorithm using a • Processing Delay combination of the standard convex method and rounding method is used for server selection, and a Hungarian algorithm is used for offloading
Mixed integer linear programming problem
[26]
Uses the branch and • Transmission time bound algorithm for • Processing time offloading decision and an iterative update function for optimisation
• Latency • Energy consumption • Load balancing
[25]
Non-dominant sorting genetic algorithm III
Multi-object optimization problem
Metric
[20]
Mechanism
Formulation
Ref
Limitations
Transmission time Processing time Reduces execution delay and achieves a better computational complexity than the brute-force searching algorithm
Unlike Brute Force Schemes that greedily choose servers with maximum resources, the proposed approach achieves a fairer load allocation
Genetic Inspired
Category
Vertical offloading is used, leading to significant deterioration in performance as the user devices increase
No
No
No
D2D
(continued)
Heuristic algorithm based
The proposed solution Combinatorial does not factor in the Optimisation dynamic nature of nodes Entering and exiting a MEC setup
The approach leads to an Weights of objective increase in utilization of functions in the ideal nodes computation of the utility values are predetermined, making the solution impractical in a dynamic situation such as in IoT applications
Benefits
Table 1. Summary of Reviewed Works
100 M. Wilson et al.
Formulation
Multiple knapsack problem
Nonlinear and nonconvex delay optimization offloading problem based on noncooperative game theory
Semi-definite programming problem
Ref
[27]
[30]
[44]
Metric
• Offloading Delay
• Energy consumption • Latency
Resource allocation with • Network Delay a particle swarm • Service Delay algorithm modified using convex optimisation combined with Gaussian random method
Non-cooperative game theory
Combines K-Means Clustering with Genetic algorithm
Mechanism
The modification made overcomes particle swarm optimisation’s disadvantages of fastly falling into local optimum and converges faster
The solution adopted allows arbitrary task partitioning allowing for parallel processing and scalable D2D task offloading
Handles the unique situation where computational demand overflows available resources
Benefits
Table 1. (continued) Limitations
Category
Game theory
Machine Learning, Genetic Algorithm
The complexity of the Bioinspired solution is uncertain and thus not suitable for critical applications
The solution is not effective if task dependencies do not allow for partitioning. Also, achieving Nash Equilibrium is not guaranteed by this approach
Does not factor in cloud offloading in the overloaded situation. It is likely to achieve better latency with cloud offloading in such a situation
D2D
(continued)
No
No
No
A Review of Computational Load-Balancing for Mobile Edge Computing 101
NP-Hard Non deterministic Polynomial Time problem
Multi-objective latency optimisation problem
[46]
[49]
• Energy Consumption
Metric
Downward offloading of task by a scheduler based on Hierarchical Weight Allocation and Completion time-based task assignment
• Task Throughput
Iterative selection of best • Response Time fit server based on • Computational delay network delay degradation and response time
Mixed-integer nonlinear Combines grey wolf programming problem optimiser with a genetic algorithm to solve the formulated problem leading to an energy-efficient offloading decision
[45]
Mechanism
Formulation
Ref
Optimises the use of available resources and compute power of enduser devices
Yield a better latency value compared to the density-based clustering [31] and Latency based strategy [32]
Advantage of combining global search capability in a genetic algorithm with quick convergence in a grey wolf optimiser
Benefits
Table 1. (continued)
Genetic/ Bioinspired
Category
Undesirable for time-critical solutions
Heuristic based
The iterative nature of the technique will increase delays if used in mobile edge computing that supports D2D offloading with high numbers of user computing devices
Heuristic based
An end-user device can only offload its task to one and only one MEC server. This is impractical for D2D offloading
Limitations
(continued)
Yes
No
No
D2D
102 M. Wilson et al.
Formulation
Stochastic unified optimization problem
Maximum bipartite matching problem
Web Ontology Language reasoning problem
Markov decision process
Ref
[50]
[51]
[52]
[53]
Metric
• Task processing rate • Task transmission rate
• Transmission Power • Offloading Duration
• Latency
Dynamic resource allocation decision using deep Q-network (DQN) algorithm
• Latency
Use a mapper that minimizes the number of involved nodes to reserve nodes that can substitute disconnecting nodes
A Multi-Robot Assignment approach
Trade-off between the transmission energy consumption and task delay using the Lyapunov optimization techniques
Mechanism
Introduces autonomous dynamic offloading decision support
The approach improves reliability by making room for failed nodes or nodes that may exit a dynamic IoT setup
Has relatively low computational cost
Efficient in reducing transmission energy consumption
Benefits
Table 1. (continued)
Limitations
Category
Combinatorial
D2D
No
Yes
No
No
(continued)
Success is subject to the availability of an edge server on the network with enough resources to run the DQN algorithm
Machine Learning based
The use of OWL for device discovery and selection will hit a performance bottleneck as the amount of input data grows
The use of the Multi-Robot Assignment algorithm alleviated computation complexity but led to relatively large communication cost
Heuristic based
The reduction in energy consumption is achieved at the cost of latency
Heuristic algorithm based
Formulation
Multi-objective latency optimisation problem
Infinite-horizon average-reward continuous Time Markov decision process (CTMDP)
Markov Decision Process (MDP) to maximize expected long-term rewards
Non-convex end-to-end latency minimization problem
Ref
[55]
[56]
[57]
[58]
• Offloading Latency. • Transmission Latency • Computing Latency
Metric
Iterative algorithm using digital twin for processing rate estimation
An end-to-end Deep Reinforcement Learning with an enhanced training approach to prevent action values from oscillating or diverging
• Transmit Power • Latency • Quality of service
• Energy Cost • Task Throughput
Implements the policy of the MDP problem with a reinforcement learning technique
• Average Delay • Power consumption
Prediction of node resource status using random forest classification and optimisation of metrics using the alternating direction method of multipliers
Mechanism
Tasks are offloaded wholly without provision for task partitioning. This limits the ability of end devices to participate in a MEC environment
Decision making is centralised on a Base Station which becomes a single point of failure
The proposed solution is centered on a specific Access Point (AP)-based computing architecture and might not be scalable to other scenarios
Limitations
The use of digital twin improves the accuracy of estimated processing rate used in optimisation decision
A latency overhead is added by the digital twin used
The solution is end-to-end, requiring no further mathematical optimization after the deep reinforcement agent’s offloading decision
Improves the efficiency of consumed power while reducing overall system delay
Achieves node status prediction
Benefits
Table 1. (continued)
Combinatorial
Machine Learning-Based
Machine Learning-Based
Machine Learning-Based
Category
(continued)
No
No
No
No
D2D
Formulation
Optimal mixed strategy problem
Linear programming problems with constraints
Mixed-integer Linear Programming Problems
A single optimization task with a linear objective function
Ref
[61]
[64]
[65]
[66]
Metric
Uses branch and bound method to find the optimal solution to the formulated objective function
Partitioning of tasks into subtasks before offloading decisions are made
Uses objective functions that allow node collaboration in service placement selection
• Energy consumption • Latency
• Task Partitioning • Latency
• Deadline Violation • Service Deployment Cost • un-hosted services cost
• Latency
Nash equilibrium is used for optimized compute slot competition in a vehicular network
Mechanism
Limitations
Category
Combinatorial
Game theory
Combinatorial
The task partitioning does not incorporate the resource availability at the processing nodes
Combinatorial Optimisation
The formulation of the problem as a linear programming problem assumes all constraints are linear and can be quantified
The solution used a selfish strategy which leaves resources of some servers underutilized while others are overloaded since all nodes rush for the best of services
Does not address issues related to the high dynamism in a complex IoT setup
The solution has low complexity compared to many proposed solutions
Introduces a layer of latency optimization prior to task offloading
Solves the resource placement problem with a decentralised approach
Three algorithms are explored and are shown to be good for decentralising offloading decisions
Benefits
Table 1. (continued)
D2D
No
Yes
No
No
Table 1 summarizes the load-balancing techniques, evaluation metrics, significant contributions, and limitations of recent research works that use variants of mobile edge computing for IoT data processing. At a glance, it can be deduced that fewer than 20% of the solutions actively tap the resources of end-user devices for collaborative computing as a solution to the problem in question.
4 Discussion and Future Research
This work has presented a compact, global view of state-of-the-art trends and novel strategies for computational task offloading in mobile edge computing for IoT. From a global perspective, the problem of computational task offloading in the IoT environment has been established as an NP-hard problem. Hence, it may be best to explore nondeterministic algorithms for optimal solutions. This realization seems to have triggered a strong interest in heuristic optimization techniques, including bio-inspired and genetic algorithms, game theory and some heuristic-based machine learning approaches, for optimizing computational offloading in edge-based IoT setups. Results so far show that these approaches can achieve good results. However, many of the reviewed works address different silos (latency, energy, privacy, delay, etc.) of the entire problem. According to projections from credible institutions, the number of connected devices will keep to its exponential growth trajectory for at least the next few decades. Because edge computing variants are being developed at a fast pace to play a role in IoT data processing, very advanced computational offloading methods will be required. The limited compute resources of edge devices remain a disadvantage of edge computing. One recommendation for overcoming this drawback is to explore ways of advancing collaborative computing at the mobile edge level. This will require efficient methods of segmenting computational tasks into smaller sub-tasks that can be offloaded to remote edge devices with limited resources, with consideration of the security, privacy, latency and energy overheads that may be introduced. So far, very few works have considered including end-user devices such as smartphones, smartwatches and CCTV cameras in the pool of computational resources for IoT data processing.
This is certainly an avenue to explore, since the computational capacity of these devices is rising while their cost falls and their energy efficiency improves. However, the mobility and dynamic use of these end-user devices increase the complexity of such D2D offloading solutions. Resource discovery and resource state monitoring also play an important role in computational task offloading in dynamic setups such as those that exist in IoT deployments. This will require planning a more decentralized network architecture for IoT deployments, working hand in hand with offloading strategies that optimize resource usage within the decentralized network. Backward compatibility should be considered to allow for seamless integration into the existing vertical cloud offloading architecture, since the cloud remains a very relevant computational powerhouse for heavy tasks. The absence of D2D connectivity at the end-user level contributes to the limitation of local computational tasks to data from sensors connected directly to the end-user device.
In many of the reviewed cases, minor decisions that an end-user node with enough computing power could have taken are still offloaded when the computation depends on data from other remote nodes. A move towards D2D computational offloading may require a new paradigm of algorithms or light-weight hybrids of heuristic algorithms, since most of the conventional heuristic and machine learning algorithms reviewed tend to be compute-intensive, with overheads that will reduce efficiency if executed at the end-device level.
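As an illustration of what a light-weight heuristic for D2D offloading could look like, the following minimal sketch assigns sub-tasks to devices greedily, largest sub-task first, always choosing the device that would finish it earliest. It is not taken from any of the reviewed works; the function name `greedy_offload`, the cost units and the device speeds are assumptions made only for this example, and communication cost, mobility and energy are ignored.

```python
def greedy_offload(subtask_costs, device_speeds):
    """Assign sub-tasks to devices with a largest-first greedy rule.

    subtask_costs: work units per sub-task (hypothetical units).
    device_speeds: work units each device processes per second (hypothetical).
    Returns (assignment, makespan), where assignment maps sub-task index
    to device index and makespan is the latest device finish time.
    """
    finish = [0.0] * len(device_speeds)  # current finish time of each device
    assignment = {}
    # Visit sub-tasks from most to least expensive.
    order = sorted(range(len(subtask_costs)), key=lambda i: -subtask_costs[i])
    for idx in order:
        # Pick the device on which this sub-task would complete earliest.
        best = min(
            range(len(device_speeds)),
            key=lambda d: finish[d] + subtask_costs[idx] / device_speeds[d],
        )
        finish[best] += subtask_costs[idx] / device_speeds[best]
        assignment[idx] = best
    return assignment, max(finish)


if __name__ == "__main__":
    plan, makespan = greedy_offload([4, 3, 2, 1], [1.0, 1.0])
    print(plan, makespan)
```

The rule costs only a sort plus one pass over the devices per sub-task, which is the kind of overhead an end device could plausibly afford, at the price of no optimality guarantee.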
5 Conclusion
This paper systematically reviews the adaptation of Mobile Edge Computing for latency improvement in IoT applications, with a specific focus on computational load balancing and offloading strategies. The most recent state-of-the-art trends, benefits and limitations of different algorithms, classified under five broad categories, have been summarized with references for readers and researchers who want to get into this area of study. Future research opportunities are discussed, and well-justified recommendations are made to jump-start and guide future research in this direction.
References 1. Madakam, S., Lake, V., Lake, V., Lake, V., et al.: Internet of things (IoT): a literature review. J. Comput. Commun. 3(05), 164 (2015) ´ 2. Penã-Lopez, I., et al.: ITU internet report 2005: the Internet of Things (2005) 3. Kumar, S., Tiwari, P., Zymbler, M.: Internet of things is a revolutionary approach for future technology enhancement: a review. J. Big data 6(1), 1–21 (2019) 4. Huyghue, B.D.: Cybersecurity, internet of things, and risk management for businesses. PhD thesis, Utica College (2021) 5. Kott, A., Linkov, I. (eds.): Cyber Resilience of Systems and Networks. RSD, Springer, Cham (2019). https://doi.org/10.1007/978-3-319-77492-3 6. Index, C.G.C., Index, C.: Forecast and methodology, 2016–2021; white paper; cisco systems. Inc.: San Jose, CA, USA (2017) 7. Hribar, J., DaSilva, L.: Utilising correlated information to improve the sustainability of internet of things devices. In: 2019 IEEE 5th World Forum on Internet of Things (WF-IoT), pp. 805– 808. IEEE (2019) 8. Arjun, N., Ashwin, S., Polachan, K., Prabhakar, T., Singh, C.: An end to end tactile cyber physical system design. In: 2018 4th International Workshop on Emerging Ideas and Trends in the Engineering of CyberPhysical Systems (EITEC), pp. 9–16. IEEE (2018) 9. Parvez, I., Rahmati, A., Guvenc, I., Sarwat, A.I., Dai, H.: A survey on low latency towards 5g: Ran, core network and caching solutions. IEEE Commun. Surv. Tutorials 20(4), 3098–3130 (2018) 10. Zanella, A., Bui, N., Castellani, A., Vangelista, L., Zorzi, M.: Internet of things for smart cities. IEEE Internet Things J. 1(1), 22–32 (2014) 11. Mebrek, A., Merghem-Boulahia, L., Esseghir, M.: Efficient green solution for a balanced energy consumption and delay in the IoT-fog-cloud computing. In: 2017 IEEE 16th International Symposium on Network Computing and Applications (NCA), pp. 1–4. IEEE (2017)
12. Rabayà, A., Schleicher, E., Graffi, K.: Fog computing with p2p: Enhancing fog computing bandwidth for IoT scenarios. In: 2019 International Conference on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData), pp. 82–89 . IEEE (2019) 13. Hou, L., Zheng, K., Liu, Z., Xu, X., Wu, T.: Design and prototype implementation of a blockchain-enabled lora system with edge computing. IEEE Internet Things J. 8(4), 2419– 2430 (2020) 14. Pace, P., Aloi, G., Gravina, R., Caliciuri, G., Fortino, G., Liotta, A.: An edge-based architecture to support efficient applications for healthcare industry 4.0. IEEE Trans. Ind. Inf. 15(1), 481–489 (2018) 15. Zhang, T., Fang, X., Liu, Y., Nallanathan, A.: Content-centric mobile edge caching. IEEE. Access 8, 11722–11731 (2019) 16. Lee, Y., Kim, W., Moon, K., Lim, K.: A mobile edge computing device to support data collecting and processing from IoT. In: 2019 International Conference on Electronics, Information, and Communication (ICEIC), pp. 1–3. IEEE (2019) 17. Samie, F., Bauer, L., Henkel, J.: From cloud down to things: an overview of machine learning in internet of things. IEEE Internet Things J. 6(3), 4921–4934 (2019) 18. Mahmud, M.A., Bates, K., Wood, T., Abdelgawad, A., Yelamarthi, K.: A complete internet of things (IoT) platform for structural health monitoring (shm). In: 2018 IEEE 4th World Forum on Internet of Things (WF-IoT), pp. 275–279. IEEE (2018) 19. Abbas, N., Zhang, Y., Taherkordi, A., Skeie, T.: Mobile edge computing: a survey. IEEE Internet Things J. 5(1), 450–465 (2017) 20. Xu, X., Zhang, X., Gao, H., Xue, Y., Qi, L., Dou, W.: Become: blockchainenabled computation offloading for iot in mobile edge computing. IEEE Trans. Industr. Inf. 16(6), 4187–4195 (2019) 21. Pydi, H., Iyer, G.N.: Analytical review and study on load balancing in edge computing platform. 
In: 2020 Fourth International Conference on Computing Methodologies and Communication (ICCMC), pp. 180–187. IEEE (2020) 22. El-Sayed, H., et al.: Edge of things: The big picture on the integration of edge, IoT and the cloud in a distributed computing environment. IEEE Access 6, 1706–1717 (2017) 23. Talaat, F.M., Saraya, M.S., Saleh, A.I., Ali, H.A., Ali, S.H.: A load balancing and optimization strategy (LBOS) using reinforcement learning in fog computing environment. J. Ambient. Intell. Humaniz. Comput. 11(11), 4951–4966 (2020). https://doi.org/10.1007/s12652-02001768-8 24. Zhang, N., Guo, S., Dong, Y., Liu, D.: Joint task offloading and data caching in mobile edge computing networks, Comput. Netw. 182, 104476 2020. https://doi.org/10.1016/j.comnet. 2020.107446 25. Hoffman, K.L.: Combinatorial optimization: current successes and directions for the future. J. Comput. Appl. Math. 124(1–2), 341–360 (2000) 26. Ning, Z., Dong, P., Kong, X., Xia, F.: A cooperative partial computation offloading scheme for mobile edge computing enabled internet of things. IEEE Internet Things J. 6(3), 4804–4814 (2018) 27. Tang, H., Wu, H., Zhao, Y., Li, R.: Joint computation offloading and resource allocation under task-overflowed situations in mobile edge computing. IEEE Trans. Netw. Service Manag. 19, 1539–1553 (2021) 28. Shakarami, A., Ghobaei-Arani, M., Shahidinejad, A.: A survey on the computation offloading approaches in mobile edge computing: a machine learning-based perspective. Comput. Netw. 182, 107496 (2020). https://doi.org/10.1016/j.comnet.2020.107496
29. Maia, A.M., Ghamri-Doudane, Y., Vieira, D., de Castro, M.F.: An improved multi-objective genetic algorithm with heuristic initialization for service placement and load distribution in edge computing. Comput. Netw. 194, 108146 (2021). https://doi.org/10.1016/j.comnet.2021. 108146 30. Huang, J., Wang, M., Wu, Y., Chen, Y., Shen, X.: Distributed offloading in overlapping areas of mobile edge computing for internet of things. IEEE Internet of Things J. 9, 13837–13847 (2022) 31. Tu, Q., Li, H., Wang, X., Chen, C.: Ant colony optimization for the design of small-scale irrigation systems. Water Resour. Manage 29(7), 2323–2339 (2015) 32. Zhang, J., Kang, M., Li, X., Liu, G.-y.: Bio-inspired genetic algorithms with formalized crossover operators for robotic applications. Front. Neurorobotics 11, 56 (2017) 33. Willis, M.-J., Hiden, H.G., Marenbach, P., McKay, B., Montague, G.A.: Genetic programming: an introduction and survey of applications. In: Second International Conference on Genetic Algorithms in Engineering Systems: Innovations and Applications, pp. 314–319 . IET (1997) 34. Georgioudakis, M., Plevris, V.: A comparative study of differential evolution variants in constrained structural optimization. Front. Built Environ. 6, 102 (2020) 35. Ab Wahab, M.N., Nefti-Meziani, S., Atyabi, A.: A comprehensive review of swarm optimization algorithms. PLoS ONE 10(5), 0122827 (2015) 36. Misaghi, M., Yaghoobi, M.: Improved invasive weed optimization algorithm (iwo) based on chaos theory for optimal design of pid controller. J. Comput. Des. Eng. 6(3), 284–295 (2019) 37. Li, L.-l., Wang, J.-k.: Sar image ship detection based on ant colony optimization. In: 2012 5th International Congress on Image and Signal Processing, pp. 1100–1103. IEEE (2012) 38. Dale, S.: Heuristics and biases: the science of decision-making. Bus. Inf. Rev. 32(2), 93–99 (2015) 39. Ozlü, ï.A., Baimakhanov, O., Saukhimov, A., Ceylan, O.: A heuristic˙ methods-based power distribution system optimization toolbox. 
Algorithms 15(1), 14 (2021) 40. Müller, F.M., Bonilha, I.S.: Hyper-heuristic based on aco and local search for dynamic optimization problems. Algorithms 15(1), 9 (2021) 41. Dahrouj, H., et al.: An overview of machine learning-based techniques for solving optimization problems in communications and signal processing. IEEE Access 9, 74908–74938 (2021) 42. Chen, L., Zhou, S., Xu, J.: Computation peer offloading for energyconstrained mobile edge computing in small-cell networks. IEEE/ACM Trans. Netw. 26(4), 1619–1632 (2018) 43. Pertovt, E., Javornik, T., Mohorˇciˇc, M.: Game theory application for performance optimisation in wireless networks. Elektrotehniški Vestnik 78(5), 287–292 (2011) 44. Niu, X., et al.: Workload allocation mechanism for minimum service delay in edge computingbased power internet of things. IEEE Access 7, 83771–83784 (2019) 45. Chen, X., Li, X.: An energy-efficient task offloading decision in electric power IoT based on edge computing. In: 2021 International Conference on Electronic Information Engineering and Computer Science (EIECS), pp. 597–600. IEEE (2021) 46. Fan, Q., Ansari, N.: Application aware workload allocation for edge computing-based iot. IEEE Internet Things J. 5(3), 2146–2153 (2018) 47. Jia, M., Cao, J., Liang, W.: Optimal cloudlet placement and user to cloudlet allocation in wireless metropolitan area networks. IEEE Trans. Cloud Comput. 5(4), 725–737 (2015) 48. Yang, L., Cao, J., Liang, G., Han, X.: Cost aware service placement and load dispatching in mobile cloud systems. IEEE Trans. Comput. 65(5), 1440–1452 (2015) 49. Kim, Y., Song, C., Han, H., Jung, H., Kang, S.: Collaborative task scheduling for iot-assisted edge computing. IEEE Access 8, 216593–216606 (2020)
50. Chen, Y., Zhang, N., Zhang, Y., Chen, X., Wu, W., Shen, X.: Energy efficient dynamic offloading in mobile edge computing for internet of things. IEEE Trans. Cloud Comput. 9(3), 1050–1060 (2019) 51. Yue, S., et al.: Todg: Distributed task offloading with delay guarantees for edge computing. IEEE Trans. Parallel Distrib. Syst. 33(7), 1650–1665 (2021) 52. Dautov, R., Distefano, S.: Automating iot data-intensive application allocation in clustered edge computing. IEEE Trans. Knowl. Data Eng. 33(1), 55–69 (2019) 53. Qian, Y., et al.: A workflow-aided internet of things paradigm with intelligent edge computing. IEEE Netw. 34(6), 92–99 (2020) 54. Sakir, R.K.A., Ramli, M.R., Lee, J.-M., Kim, D.-S.: Uav-assisted real-time data processing using deep q-network for industrial internet of things. In: 2020 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), pp. 208–211. IEEE (2020) 55. Wang, Z., Xue, G., Qian, S., Li, M.: Campedge: Distributed computation offloading strategy under large-scale ap-based edge computing system for IoT applications. IEEE Internet Things J. 8(8), 6733–6745 (2020) 56. Lei, L., Xu, H., Xiong, X., Zheng, K., Xiang, W., Wang, X.: Multiuser resource control with deep reinforcement learning in IoT edge computing. IEEE Internet Things J. 6(6), 10119– 10133 (2019) 57. Ale, L., Zhang, N., Fang, X., Chen, X., Wu, S., Li, L.: Delay-aware and energy-efficient computation offloading in mobile-edge computing using deep reinforcement learning. IEEE Trans. Cogn. Commun. Netw. 7(3), 881–892 (2021) 58. Do-Duy, T., Van Huynh, D., Dobre, O.A., Canberk, B., Duong, T.Q.: Digital twin-aided intelligent offloading with edge selection in mobile edge computing. IEEE Wireless Commun. Lett. 11, 806–810 (2022) 59. Cicconetti, C., Conti, M., Passarella, A.: Uncoordinated access to serverless computing in MEC systems for IoT. Comput. Netw. 172, 107184 (2020). https://doi.org/10.1016/j.comnet. 2020.107184 60. 
Shakarami, A., Shahidinejad, A., Ghobaei-Arani M.: An autonomous computation offloading strategy in mobile edge computing: a deep learning-based hybrid approach. J. Netw. Comput. Appl.178, 102974 (2021). https://doi.org/10.1016/j.jnca.2021.102974 61. Zhang, J., Guo, H., Liu, J., Zhang, Y.: Task offloading in vehicular edge computing networks: A load-balancing solution. IEEE Trans. Vehicular Technol. 69(2), 2092–2104 (201) 62. Zhang, P., Zhu, D., Luan, J.: An approximation algorithm for the generalized k-multicut problem. Discret. Appl. Math. 160(7–8), 1240–1247 (2012) 63. Kovalyov, Y.M.: A rounding technique to construct approximation algorithms for knapsack and partition-type problems (1996) 64. Nezami, Z., Zamanifar, K., Djemame, K., Pournaras, E.: Decentralized edge-to-cloud load balancing: Service placement for the internet of things. IEEE Access 9, 64983–65000 (2021) 65. Feng, M., Krunz, M., Zhang, W.: Task partitioning and user association for latency minimization in mobile edge computing networks. InIEEE INFOCOM 2021-IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pp. 1–6 (2021). IEEE 66. Zhang, W.-Z., et al.: Secure and optimized load balancing for multitier IoT and edge-cloud computing systems. IEEE Internet Things J. 8(10), 8119–8132 (2020)
Simultaneous Estimation Method of Mutually Correlated Geophysical Parameters and Its Application of Aerosol Parameter Estimation Kohei Arai1(B) and Xing Ming Lian2 1 Saga University, Saga 840-8502, Japan
[email protected] 2 NOAA/NESDIS, Former Student of Saga University, Greenbelt, MD, USA
Abstract. A method for the simultaneous estimation of aerosol size distribution and refractive index from ground-based observations of direct, diffuse and aureole radiance and polarization measurement data is proposed. The method introduces a polarization computation based on successive orders of scattering, in which the molecular and aerosol phase functions are separated into parallel and perpendicular polarization components in first-order scattering (single scattering). A comparison between the proposed method and the volume spectral analysis software (Skyrad.pack) developed by Nakajima et al. shows that the proposed method is superior to Skyrad.pack when the aerosol optical depth is small.
Keywords: Aerosol Parameter · Refractive Index · Size Distribution · Successive Orders of Scattering · Phase Function · Polarization · Volume Spectral Analysis · Skyradpack · Optical Depth · Direct/Diffuse/Aureole Radiance
1 Introduction
P. Romanov et al. proposed a method based on the linear inverse problem, using direct solar irradiance, scattered irradiance, and peripheral irradiance, for simultaneously estimating the complex refractive index and particle size distribution of aerosols [1]. This solution was inadequate because the relationship between the unknown vector (complex refractive index and particle size distribution) and the objective function (direct solar irradiance, scattered irradiance, peripheral irradiance) is not linear [2]. Tanaka et al., on the other hand, simultaneously estimated the complex refractive index and particle size distribution of aerosols from the parallel and vertical polarization components of scattered irradiance, but the accuracy was insufficient for reasons such as not considering the influence of polarization by atmospheric molecules [3]. Kohei Arai et al. carried out a sensitivity analysis of the direct solar irradiance, scattered irradiance and peripheral irradiance with respect to the complex refractive index and particle size distribution of the aerosol [4]. It was pointed out that the estimation accuracy of the imaginary part of the complex refractive index is low because the sensitivity of the objective function to the refractive index is much lower than its sensitivity to the particle size distribution. Furthermore, Kohei Arai et al. have proposed an atmospheric upper radiance estimation method that considers the polarization of upward and downward radiances.
The purpose of this paper is to improve the estimation accuracy of the complex refractive index and particle size distribution of aerosols based on the method of Kohei Arai et al. We propose a method for accurate simultaneous estimation of the complex refractive index and particle size distribution of aerosols by observing polarized irradiance in addition to direct solar irradiance, scattered irradiance, and peripheral irradiance. The proposed method is based on the Successive Orders of Scattering (SOS) method [5]. It separates the scattering phase functions of atmospheric molecules and aerosols into parallel and vertical components, calculates the parallel and vertical polarization (linear polarization) of single scattering, and estimates the complex refractive index and particle size distribution of the aerosol simultaneously by also calculating the direct irradiance, scattered irradiance and peripheral irradiance. Since depolarization occurs in multiple scattering, its effect on the estimation accuracy of each parameter was analyzed. In addition, the degree of linear polarization was calculated, simulation data were generated, and a sensitivity analysis of the unknown vector and objective function was performed. The simulation data were used for a comparison with the volume spectrum analysis code Skyrad.pack [6] developed by Nakajima et al. As a result, the superior estimation accuracy of the proposed method was confirmed when the optical depth of the aerosol in the atmosphere is small (an aerosol optical thickness of 0.12 at a wavelength of 0.87 μm) and the aerosol particles are homogeneous spheres.
The next section describes the theoretical background together with related research work. A sensitivity analysis is then discussed, followed by the proposed method. After that, the experimental method and some results are described, followed by conclusions, remarks and discussion.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): SAI 2023, LNNS 711, pp. 111–128, 2023. https://doi.org/10.1007/978-3-031-37717-4_8
2 Theoretical Background
2.1 Successive Approximation Method Including Polarized Irradiance
The successive approximation method represents the atmosphere as multi-layered parallel plates, calculates each order of scattering in each layer independently, and sums the contributions to obtain the total radiance. The n-th order scattered radiance ($I_n$) and radiation source function ($J_n$) are given by Eqs. (1) and (2), as described in Reference [4].

$$I_n(\tau, \pm\mu, \varphi) = \int_{\tau} J_n(\tau', \pm\mu, \varphi)\,\exp\left[\mp\frac{\tau' - \tau}{\mu}\right]\frac{d\tau'}{\mu} \qquad (1)$$

$$J_n(\tau, \mu, \varphi) = \frac{\omega}{4\pi}\int_{0}^{2\pi}\int_{-1}^{1} P(\mu, \varphi, \mu', \varphi')\,I_{n-1}(\tau, \mu', \varphi')\,d\mu'\,d\varphi' \qquad (2)$$
where $\tau$ is the optical thickness of the atmosphere, $\mu$ and $\mu'$ are the cosines of the emission and incident zenith angles ($\mu = \cos\theta$, $\theta$: emission zenith angle), and $\varphi$ and $\varphi'$ are the emission and incident azimuth angles. $\omega$ is the single scattering albedo and $P$ is the scattering phase function. For an atmosphere in which air molecules and aerosols are mixed, $P$ can be expressed by Eq. (3).

$$P = \frac{P_{mol}\,\tau_{mol}}{\tau} + \frac{P_{aer}\,\tau_{aer}\,\omega_{aer}}{\tau} \qquad (3)$$

where $\tau_{mol}$ and $\tau_{aer}$ are the optical depths of air molecules and aerosols, $\omega_{aer}$ is the single scattering albedo of aerosols ($\omega_{aer} = \omega$), and $P_{mol}$ and $P_{aer}$ are the phase functions of molecules and aerosols. The first-order radiation source function can be expressed by Eq. (4).

$$J_1(\tau, \mu, \varphi) = \frac{\omega}{4\pi}\,\pi F_0\,P(\mu, \varphi, \mu', \varphi')\,\exp(-\tau/\mu_0) \qquad (4)$$

where $\pi F_0$ and $\mu_0$ are the solar incident spectral intensity and the cosine of the solar zenith angle, respectively. By substituting Eq. (4) into Eqs. (1) and (2), the scattered radiance up to the n-th order can be calculated iteratively. The total scattered radiance is the sum over all orders. From Eqs. (1), (2) and (4), the scattered radiance of single scattering can be expressed by Eq. (5).

$$I_1(\tau, \pm\mu, \varphi) = \int_{\tau} J_1(\tau', \pm\mu, \varphi)\,\exp\left[\mp\frac{\tau' - \tau}{\mu}\right]\frac{d\tau'}{\mu} \qquad (5)$$
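The single scattering relations above can be sketched numerically. The following is only an illustration under stated assumptions, not the authors' implementation: it takes the downward direction ($-\mu$), assumes the integral in Eq. (5) runs from the top of the atmosphere ($\tau' = 0$) down to $\tau$, and treats the phase function value as constant with depth. All function names and the sample numbers are hypothetical.

```python
import math


def mixed_phase_function(p_mol, p_aer, tau_mol, tau_aer, omega_aer, tau):
    """Eq. (3): molecule/aerosol mixture weighted by optical depth."""
    return (p_mol * tau_mol + p_aer * tau_aer * omega_aer) / tau


def first_order_source(tau, mu0, omega, pi_f0, phase):
    """Eq. (4): first-order source function for one scattering geometry."""
    return omega / (4.0 * math.pi) * pi_f0 * phase * math.exp(-tau / mu0)


def single_scatter_down(tau, mu, mu0, omega, pi_f0, phase, steps=2000):
    """Eq. (5) for the downward direction, by midpoint-rule integration.

    Assumption: the integral runs over tau' from 0 to tau, and each
    contribution is attenuated by exp(-(tau - tau') / mu).
    """
    h = tau / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h  # midpoint of the k-th sub-layer
        source = first_order_source(t, mu0, omega, pi_f0, phase)
        total += source * math.exp(-(tau - t) / mu) * h / mu
    return total


def single_scatter_down_closed(tau, mu, mu0, omega, pi_f0, phase):
    """Closed form of the same integral (valid for mu != mu0)."""
    prefactor = omega / (4.0 * math.pi) * pi_f0 * phase
    return prefactor * mu0 / (mu0 - mu) * (
        math.exp(-tau / mu0) - math.exp(-tau / mu)
    )


if __name__ == "__main__":
    args = (0.3, 0.5, 0.8, 0.9, 1.0, 0.1)  # tau, mu, mu0, omega, piF0, P
    print(single_scatter_down(*args), single_scatter_down_closed(*args))
```

Comparing the numerical integral with the closed form is a convenient sanity check before moving on to higher scattering orders, where no closed form is available.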
When linearly polarized irradiance is considered, the parallel and vertical polarization components of single scattering depend on the parallel and vertical components of the phase function, because the solar incident irradiance $\pi F_0$ is not polarized. In addition, the phase function of Eq. (2) assumes that atmospheric scattering occurs only by aerosols and air molecules and omits elliptical polarization, so the phase function can be separated into parallel and vertical components [7]. Hereinafter, only single scattering is considered for polarization. This single scattering approximation is a feature of this paper. The scattering phase function $P$ can then be expressed by Eq. (6).

$$P_l = \frac{P_{l,mol}\,\tau_{mol}}{\tau} + \frac{P_{l,aer}\,\tau_{aer}\,\omega_{aer}}{\tau}, \quad l = 1, 2 \qquad (6)$$

where $P_{l,mol}$ and $P_{l,aer}$ represent the parallel ($l = 2$) and vertical ($l = 1$) components for molecules and aerosols, respectively. These can be expressed as Eqs. (7) and (8).

$$P_{1,mol}(\theta) = \frac{3(1 + \Delta)}{8\pi(2 + \Delta)}, \quad P_{2,mol}(\theta) = \frac{3(1 - \Delta)\cos^2\theta}{8\pi(2 + \Delta)} \qquad (7)$$

$$P_{1,aer}(\theta) = \frac{1}{k^2}\int_{r_1}^{r_2} i_1(\theta, kr, m)\,n(r)\,dr, \quad P_{2,aer}(\theta) = \frac{1}{k^2}\int_{r_1}^{r_2} i_2(\theta, kr, m)\,n(r)\,dr \qquad (8)$$
where $\Delta$ is the depolarization factor; for a standard mixed atmosphere, $\Delta = 0.035$ [8]. $r_1$ and $r_2$ are the minimum and maximum particle sizes of the aerosol particles; in this paper, $r_1 = 0.02\,\mu$m and $r_2 = 10\,\mu$m are assumed. $\theta$ is the scattering angle, with $\cos\theta = \pm\mu\mu' + \sqrt{1 - \mu^2}\sqrt{1 - \mu'^2}\cos(\varphi' - \varphi)$, and $i_1$ and $i_2$ are the vertical and parallel components of the phase function. $m = m_r - i\,m_i$ is the complex refractive index of the aerosol and $n(r)$ is the number density of the particles. Substituting Eq. (6) into Eqs. (4) and (5), the parallel and vertical components of first-order scattering (single scattering) are given by Eqs. (9) and (10).

$$I_{l,1}(\tau, \pm\mu, \varphi) = \int_{\tau} J_{l,1}(\tau', \pm\mu, \varphi)\,\exp\left[\mp\frac{\tau' - \tau}{\mu}\right]\frac{d\tau'}{\mu} \qquad (9)$$

$$J_{l,1}(\tau, \mu, \varphi) = \frac{\omega}{4\pi}\,\pi F_0\,P_l(\mu, \varphi, \mu', \varphi')\,\exp\left(-\frac{\tau}{\mu_0}\right) \qquad (10)$$

where $l = 1, 2$. Furthermore, the degree of linear polarization is obtained by Eq. (11).

$$\rho = \frac{I_{1,1} - I_{2,1}}{I_{1,1} + I_{2,1}} \qquad (11)$$
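A small numerical sketch may help make the polarization relations concrete. It evaluates the molecular components of Eq. (7) as they read in the text, the mixture of Eq. (6), and the degree of linear polarization of Eq. (11). Because $I_{l,1}$ in Eqs. (9) and (10) is proportional to $P_l$ for a fixed geometry, the sketch evaluates $\rho$ directly from the phase function components; that shortcut, the function names, and the neglect of gaseous absorption in the total optical depth are assumptions made for this example only.

```python
import math

DELTA = 0.035  # depolarization factor quoted for a standard mixed atmosphere


def rayleigh_components(theta, delta=DELTA):
    """Eq. (7): vertical (l = 1) and parallel (l = 2) molecular components."""
    norm = 8.0 * math.pi * (2.0 + delta)
    p_vertical = 3.0 * (1.0 + delta) / norm
    p_parallel = 3.0 * (1.0 - delta) * math.cos(theta) ** 2 / norm
    return p_vertical, p_parallel


def mix_component(p_mol, p_aer, tau_mol, tau_aer, omega_aer):
    """Eq. (6) for one polarization component l.

    Assumption: the total optical depth is tau_mol + tau_aer, i.e.
    gaseous absorption is neglected in this sketch.
    """
    tau = tau_mol + tau_aer
    return (p_mol * tau_mol + p_aer * tau_aer * omega_aer) / tau


def degree_of_linear_polarization(i_vertical, i_parallel):
    """Eq. (11): (I_1 - I_2) / (I_1 + I_2)."""
    return (i_vertical - i_parallel) / (i_vertical + i_parallel)


def molecular_polarization(theta, delta=DELTA):
    """Degree of linear polarization of a purely molecular atmosphere."""
    p1, p2 = rayleigh_components(theta, delta)
    return degree_of_linear_polarization(p1, p2)


if __name__ == "__main__":
    for deg in (0, 45, 90, 135, 180):
        print(deg, molecular_polarization(math.radians(deg)))
```

With these expressions the forward-scattering value of $\rho$ for molecules alone equals $\Delta$, and the maximum occurs at a scattering angle of 90°; adding aerosol components through `mix_component` lowers or shifts that maximum depending on the aerosol phase function supplied.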
Scattering radiation brightness including multiple scattering without considering polarization can be obtained by Eqs. (1), (2), (4), (5), but since depolarization occurs due to multiple scattering, a linear polarization degree with good accuracy can be obtained. In order to make an estimation, calculation accuracy is required for a model that considers multiple scattering. Therefore, when estimating the linear polarization degree by the simple scattering approximation method, the optical thickness of the whole atmosphere was set to be small, and the optical thickness between the layers of each atmosphere of the parallel plate was set to be very small (about 0.0005). In this case, multiple scattering above the secondary scattering can be ignored [9]. 2.2 Direct Irradiance and Aureole Direct irradiance can be expressed by Eqs. (12) and (13) according to Beer-Lambert’s law. E = E0 exp(−βτ )
(12)
τaer = τ − τmol − τgas
(13)
where $E_0$ is the solar radiation intensity outside the atmosphere and β is the air mass, β = 1/cos(θ0) (θ0: solar zenith angle, θ0 ≤ 75°) [10]. $\tau_{aer}$ can be expressed by Eq. (14) according to Mie scattering theory.

$$\tau_{aer} = \int_{r_1}^{r_2} \pi r^2\,n(r)\,Q_{ext}(m, r)\,dr \tag{14}$$
where r is the particle size of the aerosol and $Q_{ext}$ is the extinction efficiency factor of Mie scattering. n(r) is the number density of the particles, with dimension $L^{-3}$. Of all the particles in an air column of unit area, the number whose size lies between r and r + dr is n(r) dr; n(r) dr is therefore the columnar number density. The aureole is generally the region of strong scattering within about 15–20° of the solar disk. In this paper, the aureole is calculated by the MS (Multiple-Single) approximation method, which assumes that the aureole scattered irradiance consists mainly of the single scattering components of the molecules and the aerosol plus the multiple scattering component of the molecules. That is, it is given by Eq. (15).

$$I_s = \beta E_0 \exp(-\beta\tau)\left(\tau_{mol}P_{mol}(\theta) + \tau_{aer}\,\omega_{aer}\,P_{aer}(\theta) + \mathrm{ms}(\theta, A)\right) \tag{15}$$
where A is the reflectance of the ground surface, and ms(θ, A) is the contribution of multiple scattering and reflection at the ground surface, which can be calculated from experimental data [11].

2.3 Improved Arai-Ryo Model: IARM

The Arai-Ryo model is a method for simultaneously estimating the complex refractive index and particle size distribution of aerosols by solving a non-linear inverse problem (simulated annealing) using direct solar irradiance, scattered irradiance and aureole. The estimates are fed back and updated repeatedly until the slow-cooling schedule of the simulated annealing converges, so that the difference between the simulated data and the observed data is minimized for the three objective functions of direct solar irradiance, scattered irradiance, and aureole. As a result, the real part and imaginary part of the complex refractive index and the Junge parameter of the aerosol are obtained with good accuracy. This method is described in detail in the references and is not repeated here. In this paper, the three objective functions of the Arai-Ryo model (direct solar irradiance, scattered irradiance, and aureole) are used as they are, and the degree of linear polarization obtained by the single scattering approximation of the previous section is added as a fourth objective function. The real part and imaginary part of the complex refractive index and the Junge parameter of the aerosol can therefore be estimated simultaneously from these four objective functions. In addition, to reduce the influence of random observation errors, observation data at several neighbouring scattering angles are used for each objective function. That is, the objective functions are given by Eq. (16).

$$\begin{bmatrix}\varepsilon_1\\ \varepsilon_2\\ \varepsilon_3\\ \varepsilon_4\end{bmatrix} = \begin{bmatrix} |\tau_a - \tau_a^*|/\tau_a^* \\ \langle\, |I_s(\theta_i) - I_s^*(\theta_i)|/I_s^*(\theta_i)\,\rangle \\ \langle\, |I_{diff}(\theta_i) - I_{diff}^*(\theta_i)|/I_{diff}^*(\theta_i)\,\rangle \\ \langle\, |\rho(\theta_k) - \rho^*(\theta_k)|/\rho^*(\theta_k)\,\rangle \end{bmatrix} \tag{16}$$

In Eq. (16), ⟨·⟩ denotes the average over several scattering angles. In this paper, this model is called IARM.
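The four objective functions of Eq. (16) can be sketched in Python as follows. This is only an illustrative sketch: the function names are hypothetical, the starred quantities are assumed to be the observed values, and absolute values are taken in the denominators as a safeguard that Eq. (16) itself does not state.

```python
from statistics import mean


def mean_relative_error(calculated, observed):
    """Average of |calc - obs| / |obs| over paired values (the angle average in Eq. 16)."""
    if len(calculated) != len(observed) or len(observed) == 0:
        raise ValueError("calculated and observed must be non-empty and equally long")
    return mean(abs(c - o) / abs(o) for c, o in zip(calculated, observed))


def objective_vector(tau_calc, tau_obs,
                     aureole_calc, aureole_obs,
                     diffuse_calc, diffuse_obs,
                     pol_calc, pol_obs):
    """Return (eps1, eps2, eps3, eps4) in the order used in Eq. (16).

    Assumption: the *_obs arguments play the role of the starred quantities.
    """
    eps1 = abs(tau_calc - tau_obs) / abs(tau_obs)          # aerosol optical thickness
    eps2 = mean_relative_error(aureole_calc, aureole_obs)  # aureole, averaged over angles
    eps3 = mean_relative_error(diffuse_calc, diffuse_obs)  # scattered (diffuse) irradiance
    eps4 = mean_relative_error(pol_calc, pol_obs)          # degree of linear polarization
    return eps1, eps2, eps3, eps4
```

A simulated annealing loop would then try to drive all four components towards zero; how they are weighted or combined is not specified in this section.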
3 Related Research Works

Regarding aerosol parameter estimation methods, the phase function of relatively large aerosol particles containing bubbles has been investigated by means of ray tracing (RT) [14]. Aerosol parameter estimation with a changing observation angle of a ground-based polarization radiometer has been conducted [15]. Aerosols in the Saga city area, Japan, have been characterized with direct and diffuse solar irradiance and aureole observations [16]. Monte Carlo (MC) simulation of polarized atmospheric irradiance for determining the refractive index of aerosols has also been conducted [17], together with ray tracing simulation for estimating the phase function of relatively large aerosol particles containing bubbles and an experimental validity check [18]. Recent stratospheric aerosols observed by lidar over Japan have been investigated [19]. A method for estimating aerosol parameters based on ground-based atmospheric polarization irradiance measurements has been proposed [20], together with sensitivity and error analyses of reflectance-based vicarious calibration using the aerosol refractive index and size distribution estimated from measured solar direct and diffuse irradiance as well as measured surface reflectance [21]. Sensitivity analysis of aerosol refractive index and size distribution estimation methods based on polarized atmospheric irradiance measurements has been conducted [22]. Lidar detection of high concentrations of ozone and aerosols transported from Northwest Asia over Saga, Japan has been attempted [23]. Aerosol data assimilation using data from Himawari-8, a next-generation geostationary meteorological satellite, has also been conducted [24]. Another sensitivity analysis of aerosol parameter estimation with measured solar direct and diffuse irradiance has been conducted [25]. A method for aerosol parameter estimation error analysis that considers the noise included in the measured solar direct and diffuse irradiance has been proposed and discussed [26]. Finally, an aerosol parameter estimation method that uses a solar direct and diffuse irradiance measuring instrument without a sun-tracking mechanism has been proposed and validated [27].
4 Sensitivity Analysis

Some assumptions and atmospheric parameters need to be set to generate simulation data for direct solar irradiance, scattered irradiance, aureole and linear polarization. First, as the wavelength becomes shorter, the optical thickness of atmospheric molecules and aerosols increases, so the effect of multiple scattering on polarization becomes large, and the polarization effect of atmospheric molecules becomes much greater than that of aerosols. It is therefore preferable for the optical thickness of atmospheric molecules to be smaller than that of aerosols. For this reason, the near-infrared wavelength of 0.87 μm was selected [12] in this study. We also assume that the plane-parallel atmosphere is homogeneous. For the vertical distribution of atmospheric particles, the mid-latitude winter model of MODTRAN 4.0 is adopted. In that case, the optical thicknesses of the molecules, aerosol, and ozone were determined to be τmol = 0.151, τaer = 0.1182, and τgas = 0.0004, respectively. Next,
assuming that the particle size distribution of the aerosol is the Junge distribution, the number density is given by Eq. (17).

$$\frac{dN}{dr} = n(r) = \begin{cases} C\,10^{\alpha+1}, & 0.02\,\mu m \le r \le 0.1\,\mu m \\ C\,r^{-(\alpha+1)}, & 0.1\,\mu m \le r \le 10\,\mu m \end{cases} \tag{17}$$

where C is a constant, α is the Junge parameter, and N is the total number density of particles in the air column of unit area [13]. The ground surface was assumed to be a Lambert surface with reflectance A = 0.1.

Figure 1 shows the calculated scattered irradiance and degree of polarization when the complex refractive index and the Junge parameter are varied, for a solar zenith angle of 65° and a solar azimuth of 0°. Panels (a), (b), and (c) of Fig. 1 show the change in scattered radiance with the observation azimuth, with the observation zenith angle equal to the solar zenith angle (65°). Panels (d), (e), and (f) of Fig. 1 are for the principal plane and show the change in the degree of linear polarization with the observation zenith angle. From (d), (e), and (f) of Fig. 1, the maximum degree of linear polarization occurs near a scattering angle of 90° (θ = 25°, ϕ = 0°), and the minimum occurs near a scattering angle of 0° (θ = 65°, ϕ = 180°). In addition, (a) of Fig. 1 shows that, in forward scattering, the scattered irradiance calculated by the successive approximation method and the aureole scattered irradiance calculated by the aureole model differ greatly. This is because the error of the successive approximation method is large where the forward scattering (aureole) of the aerosol particles changes sharply. To avoid this, the scattered irradiance in the forward scattering region was calculated using the aureole model instead of the successive approximation method.

For sensitivity analysis, the sensitivity is defined by Eq. (18).

$$S = \partial(R - R_0)/\partial(x - x_0) \tag{18}$$
where x and $x_0$ represent the unknowns, namely the real part and imaginary part of the complex refractive index and the Junge parameter of the aerosol, and R and $R_0$ represent the objective functions of direct irradiance, scattered irradiance, aureole and degree of polarization. S is defined as the gradient near the unknown $x_0$; the larger this gradient, the higher the sensitivity of R to the unknown parameter x. Assuming that the objective function is linear near $x_0$, Eq. (18) gives the sensitivity of the objective functions of scattered irradiance, aureole and linear polarization to the Junge parameter and the complex index of refraction. The calculated sensitivities are shown in Fig. 2 and Fig. 3. Figure 2 shows the sensitivity of the scattered irradiance, as a function of the observation azimuth, to the real part, the imaginary part and the Junge parameter, corresponding to (a), (b) and (c) of Fig. 1. Figure 3 shows the sensitivity of the linear polarization, as a function of the observation zenith angle, corresponding to (d), (e), and (f) of Fig. 1. As can be seen from Fig. 2 and Fig. 3, the sensitivity of the aureole to the unknowns (Junge parameter and complex refractive index) varies over a wide range ($S = 10^{1}$ to $10^{-3}$), whereas the sensitivity of linear polarization varies over a narrower range ($S = 10^{0}$ to $10^{-2}$).
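Under the local-linearity assumption stated above, Eq. (18) can be approximated numerically by a forward difference. The following is a minimal sketch; the function name, the step size and the toy objective functions are hypothetical and stand in for the radiative transfer calculations of the paper.

```python
def sensitivity(objective, x0, step=1e-3):
    """Forward-difference estimate of S = d(R - R0) / d(x - x0) near x0.

    `objective` maps an unknown x (for example the Junge parameter) to an
    objective-function value R; `step` is a hypothetical perturbation size.
    """
    if step == 0:
        raise ValueError("step must be non-zero")
    r0 = objective(x0)
    r = objective(x0 + step)
    return (r - r0) / step


# Toy stand-in for an objective function: exactly linear, so S equals the slope.
linear_objective = lambda x: 2.0 * x + 1.0
print(sensitivity(linear_objective, 3.0))
```

For an objective that is not linear near x0, the estimate depends on the chosen step, which is why the linearity assumption matters.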
Fig. 1. Comparison of the Polarization Degree and the Scattering Radiance for Different Parameters. (a), (b), (c) Represent the Dependence of Observed Scattering Radiance on the Azimuth Angle. Axis x Represents the Azimuth Angle (ϕ) of Observation, and Axis y Represents the Scattering Radiance. (d), (e), (f) Represent the Dependence of Polarization Degree on the Zenith Angle. Axis x Represents the Observation Angle (θ), and Axis y Represents the Degree of Polarization. (a), (d) are set as 2.5, 2.7, 3.0 for the Junge Parameter, 1.50 and -0.01 for the Real and Imaginary Part of the Refractive Index. (b), (e) are set as 1.30, 1.40, 1.50 for the Real Part, 3.0 and -0.01 for the Junge Parameter and the Imaginary Part. (c), (f) are set as -0.005, -0.025, -0.045 for the Imaginary Part, 3.0 and 1.50 for the Junge Parameter and the Real Part. In (a) the Aureole and Scattering Radiances are Shown Simultaneously in the Forward Scattering Angles.
In the case of backscattering, the sensitivity of linear polarization is much higher than that of scattered irradiance. In particular, near a scattering angle of 90°, the sensitivity of linear polarization shows an extremely high peak. Furthermore, comparing the bottom graphs of Fig. 2 and Fig. 3 (the sensitivities of scattered irradiance and linear polarization to the imaginary part), where both sensitivities are lowest, the sensitivity of linear polarization was also found to be higher than that of scattered irradiance. Therefore, the estimation accuracy of the unknowns can be expected to improve by constraining the degree of linear polarization near a scattering angle of 90°.

Fig. 2. The Azimuth Angle Dependence on the Sensitivity of the Scattering Radiance. The Junge Parameter and the Refractive Index are the Same as Fig. 1.

Fig. 3. The Scattering Angle Dependence on the Sensitivity of the Polarization Degree. The Junge Parameter and the Refractive Index are the Same as Fig. 1.
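The Junge size distribution of Eq. (17), which underlies the simulations in this section, can be sketched as a piecewise function. The function name is hypothetical, the constant C defaults to 1 purely for illustration, and returning zero outside the 0.02–10 μm range is an assumption based on the integration limits r1 and r2.

```python
def junge_number_density(r_um, alpha, c=1.0):
    """Number density n(r) of Eq. (17) for a particle size r given in micrometres.

    alpha is the Junge parameter and c the constant C (illustrative default).
    Sizes outside [0.02, 10] micrometres are assumed to contribute nothing.
    """
    if 0.02 <= r_um <= 0.1:
        return c * 10.0 ** (alpha + 1.0)       # flat branch for small particles
    if 0.1 < r_um <= 10.0:
        return c * r_um ** (-(alpha + 1.0))    # power-law branch
    return 0.0


# The two branches meet at r = 0.1 um, because 0.1 ** -(alpha + 1) == 10 ** (alpha + 1).
print(junge_number_density(0.1, 3.0), junge_number_density(1.0, 3.0))
```

A larger Junge parameter steepens the power-law branch, that is, it shifts the distribution towards smaller particles.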
Skyrad.pack is a code for estimating the particle size distribution and complex refractive index of aerosols using direct irradiance, scattered irradiance and aureole. This code does not consider polarization. Since the proposed method considers polarized irradiance, an improvement in estimation accuracy can be expected. To confirm this, two calculation examples are compared between IARM and Skyrad.pack (version 4.2). The wavelengths used to generate the simulation data are 0.4, 0.5, 0.675, 0.870 and 1.02 μm, which match those of Skyrad.pack. Assuming water-soluble aerosols, the values proposed by Nobuo Takeuchi and J. Lenoble et al. are used for the real and imaginary parts of the complex refractive index of the aerosol at those wavelengths; they are shown in Table 1. The optical thicknesses of air molecules, aerosols and ozone at each wavelength are calculated as described in the previous section and are also shown in Table 1. Assuming a clear atmosphere with a Junge parameter of 3.5 for the aerosols, simulation data of scattered irradiance, aureole, and linear polarization were generated for all scattering angles.

Table 1. Some Parameters for Simulating Data

Wavelength (μm)   0.4      0.5      0.675    0.87     1.02
Real part         1.53     1.53     1.53     1.52     1.52
Imaginary part    −0.005   −0.005   −0.007   −0.012   −0.017
Molecular OD      0.3501   0.1434   0.0423   0.0152   0.008
Aerosol OD        0.2286   0.1907   0.1451   0.1191   0.1087
Ozone OD          0.0      0.0085   0.0128   0.0      0.0

OD denotes Optical Depth
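As an illustration of how the optical depths of Table 1 enter the Beer-Lambert relations of Eqs. (12) and (13), the following sketch computes the direct irradiance for the 0.87 μm column and then recovers the aerosol optical thickness from it. The function names and the normalised extraterrestrial irradiance E0 = 1 are assumptions made for this example only.

```python
import math


def air_mass(solar_zenith_deg):
    """Air mass beta = 1 / cos(theta0), restricted to theta0 <= 75 degrees as in the text."""
    if not 0.0 <= solar_zenith_deg <= 75.0:
        raise ValueError("solar zenith angle must lie in [0, 75] degrees")
    return 1.0 / math.cos(math.radians(solar_zenith_deg))


def direct_irradiance(e0, tau_total, solar_zenith_deg):
    """Eq. (12): E = E0 * exp(-beta * tau)."""
    return e0 * math.exp(-air_mass(solar_zenith_deg) * tau_total)


def aerosol_optical_thickness(e, e0, solar_zenith_deg, tau_mol, tau_gas):
    """Invert Eq. (12) for tau, then apply Eq. (13): tau_aer = tau - tau_mol - tau_gas."""
    tau_total = -math.log(e / e0) / air_mass(solar_zenith_deg)
    return tau_total - tau_mol - tau_gas


# Table 1 values at 0.87 um; E0 = 1.0 is a hypothetical normalisation.
tau_mol, tau_aer, tau_gas = 0.0152, 0.1191, 0.0
e = direct_irradiance(1.0, tau_mol + tau_aer + tau_gas, 65.0)
print(e, aerosol_optical_thickness(e, 1.0, 65.0, tau_mol, tau_gas))
```

The round trip returns the aerosol optical depth that was put in, which is the sense in which direct irradiance constrains τaer once the molecular and gas contributions are known.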
The calculation results are shown for the case with no observation error and for the case where the observation error of scattered irradiance and aureole lies in the range of −3 to 3%. A part of the simulation data (wavelength 0.870 μm) and the estimation results are shown in Fig. 4 and Table 2. The Junge parameters of Skyrad.pack are approximated over the particle size range of 0.02 to 10 μm from the volume spectrum of the aerosol estimated by Skyrad.pack. As shown in Table 2, the estimation accuracy of the Junge parameter is almost the same for both models regardless of the observation error of scattered irradiance, but the real and imaginary parts of the complex refractive index are estimated more accurately by IARM than by Skyrad.pack. In particular, the real part of the complex refractive index has an estimation accuracy of about 1% even with an observation error of 3% in scattered irradiance, which is almost the same as when there is no observation error. This is because the sensitivity of the scattered irradiance to the real part is not lower than that to the imaginary part and the Junge parameter, and the sensitivity of the degree of linear polarization is the highest. In addition, when there was an observation error, the estimation accuracy of the imaginary part by IARM was lower (22.50%), but it was still better than the estimation result of Skyrad.pack. This is because the sensitivity of linear polarization
Fig. 4. The Almucantar Radiance for a Water-Soluble Aerosol Model (m = 1.52 − 0.012i) and Junge Parameter 3.5 at λ = 0.87 μm, θsolar = 65°, τaer = 0.1191. The Solid and Dotted Lines Represent the Values with No Observation Error and with Random Observation Error of −3 to 3%, Respectively

Table 2. Comparison of IARM with Skyrad.pack

                         True    Skyrad.pack            IARM
                                 No error   ±3% error   No error   ±3% error
Real part                1.52    1.5284     1.5674      1.5247     1.536
Real part error                  1.27%      3.12%       0.73%      1.05%
Imaginary part           0.012   0.0105     0.0076      0.0113     0.0093
Imaginary part error             12.5%      36.66%      5.83%      22.5%
Junge parameter          3.5     3.4256     3.384       3.4302     3.2677
Junge parameter error            2.17%      3.31%       1.99%      3.78%
Time                             2 s        2 s         26 h       33 h
is higher than the sensitivity of scattered irradiance to the imaginary part. Figure 5 shows the downward scattered radiance at the ground surface for azimuths of 3 to 150° estimated by both models, and Fig. 6 shows the estimation error of those scattered radiances. As shown in Fig. 6, the accuracy of the scattered radiance estimated by both models is almost the same when the azimuth is between 20 and 70°. This is considered to be because the accuracy of the Junge parameters estimated by both models is similar. On the other hand, when the azimuth was 70° or more, the accuracy of the scattered radiance estimated by both models gradually decreased, but the accuracy of the scattered radiance estimated by IARM was higher than that of Skyrad.pack. This is considered to be because, as the azimuth angle becomes larger, the sensitivity of the linear polarization to the Junge parameter, the real part and the imaginary part becomes higher than that of the scattered irradiance. It was also found that
the calculation time of IARM by the simulated annealing method is much longer than that of Skyrad.pack. Since multiple scattering causes depolarization, the estimation accuracy of each parameter may deteriorate. To investigate this, with an aerosol optical thickness of 0.12 (wavelength 0.87 μm) and the optical thickness of each atmospheric layer set to 0.0005, the scattered radiance of the total polarization component (the sum of the polarization components) in the principal plane was calculated both under the single scattering approximation and with multiple scattering taken into account; the results are shown in Fig. 7.
Fig. 5. Estimations of the Downward Diffuse Radiance at Surface from Skyrad.pack and IARM with Respect to the Azimuth Angles in Almucantar Observation
Fig. 6. Estimation Errors of the Downward Diffuse Radiance at Surface from Skyrad.pack and IARM with Respect to the Azimuth Angles in Almucantar Observation
From Fig. 7, it was found that the difference between the two calculated scattered radiances at scattering angles around 90° was 3% or less. In addition, from the definition of linear polarization, the degree of depolarization at scattering angles around 90° is much smaller than the change in scattered radiance caused by multiple scattering. Therefore, it was found that the error of the linear polarization degree caused by multiple scattering
Fig. 7. Estimations of the Downward Diffuse Radiance at Surface from IARM for the Single Scattering and Multiple Scattering with Respect to the Zenith Angles in Principal Plane Observation
is smaller than 3%. The effect of a 3% linear polarization error on the estimation accuracy of each parameter is examined below. If the observation error of the degree of linear polarization becomes large, the estimation accuracy will be affected. Table 3 shows the change in the estimation accuracy of the particle size distribution and complex refractive index of the aerosol when an observation error of 1% and 3% is given only to the degree of linear polarization, or to both the degree of linear polarization and the scattered irradiance. This aerosol model is a mixture of Oceanic (40%), Water-Soluble (50%) and Soot (10%). As shown in Table 3, when the observation error of 1% and 3% is given only to the degree of linear polarization, the estimation accuracy of the Junge parameter, the real part and the imaginary part is about 1%, 2% and 10%, respectively. On the other hand, the estimation accuracy with 1% and 3% observation error for both linear polarization and scattered irradiance has a smaller Junge parameter and real part ( Z1− 2 where Z is a standard normal variable with mean 0 and standard deviation 1.
2.3 Modified Augmented Large Sample (MALS) CI

The (1−α)100% modified augmented large sample CI for the population process capability index, which is based on $S_M$, is given by

$$LCL = \frac{C_{pM}}{\exp\left(Z_{1-\alpha/2}\sqrt{B + C}\right)} \quad \text{and} \quad UCL = \frac{C_{pM}}{\exp\left(-Z_{1-\alpha/2}\sqrt{A + C}\right)} \tag{3}$$

where $C_{pM} = \frac{USL - LSL}{6 S_M}$ and $S_M$ is defined in [17].
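2.4 Hummel, Banga and Hettmansperger CI (HBM)

Following [22], hereafter HBH, the (1−α)100% CI for Cp is

$$\frac{C_p}{\exp\left(Z_{\alpha/2}\sqrt{\frac{\gamma - 1}{n}}\right)} \le C_p \le \frac{C_p}{\exp\left(-Z_{\alpha/2}\sqrt{\frac{\gamma - 1}{n}}\right)} \tag{4}$$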
where γ is the estimated kurtosis from a sample.

2.5 Modified Hummel, Banga and Hettmansperger CI (MHBM)

The modified HBM confidence interval for the population process capability index is defined as follows:

$$\frac{C_p^M}{\exp\left(Z_{\alpha/2}\sqrt{\frac{\gamma_M - 1}{n}}\right)} \le C_p \le \frac{C_p^M}{\exp\left(-Z_{\alpha/2}\sqrt{\frac{\gamma_M - 1}{n}}\right)} \tag{5}$$

where $\gamma_M$ is the kurtosis estimator and $C_p^M = \frac{USL - LSL}{6 S_M}$.
2.6 CI Based on Downton's Estimator (DE)

Downton [23] proposed the following estimator as an alternative for σ of a normal population,

$$S_D = \frac{2\sqrt{\pi}}{n(n-1)}\sum_{i=1}^{n}\left[i - 0.5(n+1)\right]X_{(i)} \tag{6}$$
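Downton's estimator in Eq. (6) is a weighted sum of the order statistics. A minimal Python sketch, assuming the sample is a plain sequence of numbers and using a hypothetical function name, is:

```python
import math


def downton_sd(sample):
    """Downton's estimator S_D of sigma, Eq. (6).

    X_(i) are the order statistics (sample sorted ascending); i runs from 1 to n.
    """
    n = len(sample)
    if n < 2:
        raise ValueError("at least two observations are required")
    ordered = sorted(sample)
    weighted_sum = sum((i - 0.5 * (n + 1)) * x for i, x in enumerate(ordered, start=1))
    return 2.0 * math.sqrt(math.pi) / (n * (n - 1)) * weighted_sum


print(downton_sd([3.0, 1.0, 2.0]))
```

Because the weights i − 0.5(n + 1) sum to zero, a constant sample gives an estimate of zero, as expected of a scale estimator.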
Then, the (1−α)100% confidence interval based on $S_D$ is expressed as

$$LCL = \frac{USL - LSL}{6 S_D}\sqrt{\frac{\chi^2_{(\alpha/2,\,n-1)}}{n-1}} \quad \text{and} \quad UCL = \frac{USL - LSL}{6 S_D}\sqrt{\frac{\chi^2_{(1-\alpha/2,\,n-1)}}{n-1}} \tag{7}$$
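and $S_D$ is defined in Eq. (6).

2.7 CI Based on Bonett (2006) (B)

The following interval estimator for σ was proposed by Bonett [24]:

$$LCL = \sqrt{\exp\left\{\ln\left(c\hat{\sigma}^2\right) - Z_{\alpha/2}\,SE\right\}} \quad \text{and} \quad UCL = \sqrt{\exp\left\{\ln\left(c\hat{\sigma}^2\right) + Z_{\alpha/2}\,SE\right\}} \tag{8}$$

where $Z_{\alpha/2}$ is such that $P(Z > Z_{\alpha/2}) = \alpha/2$ and Z is a standard normal random variable, $SE = c\left[\{\gamma_4 - (n-3)/n\}/(n-1)\right]^{1/2}$, $c = n/(n - Z_{\alpha/2})$, $\gamma_4 = n\sum_{i=1}^{n}(Y_i - \mu)^4 / \left(\sum_{i=1}^{n}(Y_i - \mu)^2\right)^2$ and $\hat{\sigma}^2 = \exp\left(\gamma_4 - \frac{n-3}{n-1}\right)$. Then the (1−α)100% CI for the population process capability index is obtained as

$$LCL = \frac{USL - LSL}{6 \cdot UCL} \quad \text{and} \quad UCL = \frac{USL - LSL}{6 \cdot LCL}, \tag{9}$$

where UCL and LCL on the right-hand sides are defined in Eq. (8).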
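2.8 Modified CI Based on Bonett (MB)

The modified (1−α)100% confidence interval for the population process capability index is defined as

$$LCL = \frac{USL - LSL}{6 \cdot UCL_{Md}} \quad \text{and} \quad UCL = \frac{USL - LSL}{6 \cdot LCL_{Md}}, \tag{10}$$

where $UCL_{Md}$ and $LCL_{Md}$ are defined in Eq. (11) below,

$$LCL_{Md} = \sqrt{\exp\left\{\ln\left(c\hat{\sigma}^{*2}\right) - Z_{\alpha/2}\,SE^*\right\}}, \quad UCL_{Md} = \sqrt{\exp\left\{\ln\left(c\hat{\sigma}^{*2}\right) + Z_{\alpha/2}\,SE^*\right\}} \tag{11}$$

where $Z_{\alpha/2}$ is such that $P(Z > Z_{\alpha/2}) = \alpha/2$ and Z is a standard normal random variable, $SE^* = c\left[\{\gamma_4^* - (n-3)/n\}/(n-1)\right]^{1/2}$, $c = n/(n - Z_{\alpha/2})$, $\gamma_4^* = n\sum_{i=1}^{n}(Y_i - Md)^4 / \left(\sum_{i=1}^{n}(Y_i - Md)^2\right)^2$ and $\hat{\sigma}^{2*} = \exp\left(\gamma_4^* - \frac{n-3}{n-1}\right)$, where Md is the sample median.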
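2.9 Steve Large Sample Normal Approximations CI (S)

Steve [25] proposed the following 100(1 − α)% CI for σ:

$$LCL = \sqrt{\frac{S^2}{1 + Z_{\alpha/2}\sqrt{\frac{\gamma - 1}{n}}}} \quad \text{and} \quad UCL = \sqrt{\frac{S^2}{1 - Z_{\alpha/2}\sqrt{\frac{\gamma - 1}{n}}}} \tag{12}$$

where $\gamma = n\sum_{i=1}^{n}(x_i - \bar{x})^4 / \left(\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^2$ is the kurtosis estimator. Then a (1−α)100% CI for the population process capability index is obtained as

$$\frac{USL - LSL}{6 \cdot UCL} \quad \text{and} \quad \frac{USL - LSL}{6 \cdot LCL} \tag{13}$$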
where UCL and LCL are defined in Eq. (12).

2.10 Modified Steve Large Sample Normal Approximations CI (MS)

The modified Steve [25] (1−α)100% CI for σ is

$$LCL_M = \sqrt{\frac{S_M^2}{1 + Z_{\alpha/2}\sqrt{\frac{\gamma_M - 1}{n}}}} \quad \text{and} \quad UCL_M = \sqrt{\frac{S_M^2}{1 - Z_{\alpha/2}\sqrt{\frac{\gamma_M - 1}{n}}}} \tag{14}$$

where $\gamma_M = n\sum_{i=1}^{n}(x_i - Md)^4 / \left(\sum_{i=1}^{n}(x_i - Md)^2\right)^2$ is the modified kurtosis estimator, Md is the sample median and $S_M$ is defined in Eq. (18). Then, the modified Steve (1−α)100% CI for Cp is defined as

$$\frac{USL - LSL}{6 \cdot UCL_M} \quad \text{and} \quad \frac{USL - LSL}{6 \cdot LCL_M} \tag{15}$$
where $UCL_M$ and $LCL_M$ are defined in Eq. (14).

2.11 Proposed Non-parametric Bootstrap CI (NPB)

Following [26], we order the standard deviations of all B bootstrap samples as follows: $S^*_{(1)} \le S^*_{(2)} \le S^*_{(3)} \le \dots \le S^*_{(B)}$. The CI for the population standard deviation is

$$LCL = S^*_{[(\alpha/2)B]} \quad \text{and} \quad UCL = S^*_{[(1-(\alpha/2))B]} \tag{16}$$

Then the NPB (1−α)100% confidence interval for the population process capability index is obtained as

$$LCL = \frac{USL - LSL}{6 \cdot UCL} \quad \text{and} \quad UCL = \frac{USL - LSL}{6 \cdot LCL} \tag{17}$$

where UCL and LCL on the right-hand sides are defined in Eq. (16).
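The non-parametric bootstrap interval of Eqs. (16) and (17) can be sketched with the Python standard library as follows. The function name, the number of bootstrap samples, the fixed seed and the rounding used to turn (α/2)B into an index are assumptions for illustration; the paper does not specify them.

```python
import random
import statistics


def npb_cp_interval(sample, usl, lsl, alpha=0.05, n_boot=2000, seed=0):
    """Percentile-bootstrap CI for Cp following Eqs. (16)-(17).

    Resamples with replacement, sorts the bootstrap standard deviations,
    picks the order statistics at (alpha/2)B and (1 - alpha/2)B, and maps the
    sigma limits to Cp limits via (USL - LSL) / (6 * limit).
    """
    rng = random.Random(seed)
    n = len(sample)
    boot_sds = sorted(
        statistics.stdev(rng.choices(sample, k=n)) for _ in range(n_boot)
    )
    # Convert 1-based order-statistic positions to 0-based list indices.
    lower_index = max(round((alpha / 2) * n_boot) - 1, 0)
    upper_index = min(round((1 - alpha / 2) * n_boot) - 1, n_boot - 1)
    sd_lcl, sd_ucl = boot_sds[lower_index], boot_sds[upper_index]
    if sd_lcl <= 0:
        raise ValueError("lower bootstrap limit for sigma is not positive")
    return (usl - lsl) / (6 * sd_ucl), (usl - lsl) / (6 * sd_lcl)


# Hypothetical measurements and specification limits.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.05, 9.95, 10.15, 9.85, 10.0]
print(npb_cp_interval(data, usl=10.6, lsl=9.4))
```

Note that the upper limit for σ produces the lower limit for Cp and vice versa, since Cp is inversely proportional to σ.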
2.12 Proposed Parametric Chi-Square Bootstrap CI (PChi)

Following [26], the CI for the population σ is defined as

$$LCL = S\sqrt{(n-1)/\chi^{*2}_{\alpha/2,(n-1)}} \quad \text{and} \quad UCL = S\sqrt{(n-1)/\chi^{*2}_{1-(\alpha/2),(n-1)}} \tag{18}$$

Then the (1−α)100% CI for the population process capability index is outlined as

$$LCL = \frac{USL - LSL}{6 \cdot UCL} \quad \text{and} \quad UCL = \frac{USL - LSL}{6 \cdot LCL} \tag{19}$$
where UCL and LCL on the right-hand sides are defined in Eq. (18).

2.13 Construction of CI for Cp from the CI of σ

In this subsection, we show how the CI for Cp can be obtained from the CI of σ. Suppose the (1−α)100% confidence interval for σ is LCL ≤ σ ≤ UCL. Since $C_p = \frac{USL - LSL}{6\sigma}$ implies $\sigma = \frac{USL - LSL}{6 C_p}$, we have

$$LCL \le \frac{USL - LSL}{6 C_p} \le UCL$$

After simplification, with LCL and UCL on the right-hand sides denoting the limits for σ, the above interval becomes

$$LCL = \frac{USL - LSL}{6 \cdot UCL} \quad \text{and} \quad UCL = \frac{USL - LSL}{6 \cdot LCL} \tag{20}$$

which is the (1−α)100% CI for Cp.
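The conversion in Eq. (20) applies to any of the σ intervals above. A small sketch, with a hypothetical function name and illustrative specification limits, is:

```python
def cp_interval_from_sigma_interval(usl, lsl, sigma_lcl, sigma_ucl):
    """Map a CI for sigma, [sigma_lcl, sigma_ucl], to a CI for Cp via Eq. (20).

    Cp = (USL - LSL) / (6 * sigma) decreases in sigma, so the limits swap.
    """
    if not 0 < sigma_lcl <= sigma_ucl:
        raise ValueError("require 0 < sigma_lcl <= sigma_ucl")
    if usl <= lsl:
        raise ValueError("USL must exceed LSL")
    width = usl - lsl
    return width / (6 * sigma_ucl), width / (6 * sigma_lcl)


# Illustrative numbers: specification width 6 and sigma between 0.5 and 1.0.
print(cp_interval_from_sigma_interval(usl=13.0, lsl=7.0, sigma_lcl=0.5, sigma_ucl=1.0))
```

With these illustrative inputs the Cp interval runs from 1 to 2, which makes the swap of the limits easy to check by hand.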
3 Simulation Study

3.1 Simulation Technique

Since a theoretical comparison among the interval estimators is not feasible, a Monte Carlo simulation study, following [19], will be conducted to assess the performance of the interval estimators for Cp. Data will be generated from both symmetric (for example normal, t) and skewed (for example Gamma, Beta) distributions. Different parametric conditions and sample sizes will be used in this simulation. More on simulation studies for different kinds of confidence intervals can be found in [27–29], among others.

3.2 Results and Discussions

Based on the simulation described in Sect. 3.1, we will compare the performance of the estimators in this section. High coverage probability and shorter width will be used as the criteria for comparing the estimators. The widely used confidence coefficient 0.95 will be used for the simulation; however, one can also consider a 90% or 99% confidence interval.
4 Some Concluding Remarks

This paper reviews and proposes thirteen different confidence interval estimators for estimating the population process capability index. We will compare their performances under the same simulation conditions but different parametric conditions. Symmetric, right-skewed and left-skewed distributions will be used to generate data. Both average width and coverage probability will be considered as performance criteria. We hope some of the proposed interval estimators will perform better than both ALS and MALS [19] under some simulation conditions. The paper should also serve as a useful reference on confidence interval estimators.
References

1. Maiti, S., Saha, M.: Bayesian estimation of generalized process. J. Probab. Stat. 2012, 1–15 (2012). Article ID 819730
2. Kane, V.E.: Process capability indices. J. Qual. Technol. 18, 41–52 (1986)
3. Zhang, J.: Conditional confidence intervals of process capability indices following rejection of preliminary tests. Ph.D. thesis, The University of Texas at Arlington, USA (2010)
4. Kotz, S., Lovelace, C.R.: Process Capability Indices in Theory and Practice. Arnold, London (1988)
5. Smithson, M.: Correct confidence intervals for various regression effect sizes and parameters: the importance of non-central distributions in computing intervals. Educ. Psychol. Measur. 61, 605–632 (2001)
6. Thompson, B.: What future quantitative social science research could look like: confidence intervals for effect sizes. Educ. Researcher 31, 25–32 (2002)
7. Steiger, J.H.: Beyond the F test: effect size confidence intervals and tests of close fit in the analysis of variance and contrast analysis. Psychol. Methods 9, 164–182 (2004)
8. Chen, K.S., Pearn, W.L.: An application of non-normal process capability indices. Qual. Reliab. Eng. Int. 13, 335–360 (1997)
9. Bittanti, S., Lovera, M., Moiraghi, L.: Application of non-normal process capability indices to semiconductor quality control. IEEE Trans. Semicond. Manuf. 11, 296–303 (1998)
10. Wu, H.H., James, J.S., Phillip, A.F., Messimer, S.L.: A weighted variance capability index for general non-normal processes. Qual. Reliab. Eng. Int. 15, 397–402 (1999)
11. Ding, J.: A model of estimating process capability index from the first four moments of non-normal data. Qual. Reliab. Eng. Int. 20, 787–805 (2004)
12. Kotz, S., Johnson, N.L.: Process capability indices-a review, 1992–2000. J. Qual. Technol. 34(1), 1–19 (2002)
13. Kocherlakota, S., Kocherlakota, K.: Confidence intervals for the process capability ratio based on robust estimators. Commun. Stat.-Theory Methods 23(1), 257–276 (1994)
14. Peng, C.: Parametric lower confidence limits of quantile-based process capability indices. Qual. Technol. Quant. Manage. 7, 199–214 (2010)
15. Abu-Shawiesh, M.O.A., Srinivasan, M.R., Sindhumol, M.R., Kibria, B.M.G.: Performance of a robust confidence interval for the process capability index Cp based on the modified trimmed standard deviation under non-normal distributions. WSEAS Trans. Math. 19, 571–580 (2020)
16. Abu-Shawiesh, M.O.A., Banik, S., Golam Kibria, B.M., Akyüz, H.E.: A comparison of some modified confidence intervals based on robust scale estimators for process capability index. Prod. Eng. Res. Devel. 14(2), 217–229 (2019). https://doi.org/10.1007/s11740-019-00939-7
17. Kibria, B.M.G., Chen, W.: Comparison on some modified confidence intervals for estimating the process capability index Cp: simulation and application. Int. J. Stat. Sci. 21(2), 145–166 (2021)
18. Somkhuean, R., Wongkhao, A.: Confidence intervals for the common process capability index cp of normal distributions. J. Stat. Appl. Probabil. 11(1), 175–187 (2022)
19. Kibria, B.M.G., Banik, S.: Estimation of population process capability index with confidence. Proceedings on Engineering Sciences 5, 15–30 (2023)
20. Balamurali, S., Kalyanasundaram, M.: Bootstrap lower confidence limits for the process capability indices Cp, Cpk and Cpm. Int. J. Qual. Reliab. Manag. (2002). https://doi.org/10.1108/02656710210442875
21. Burch, B.: Estimating kurtosis and confidence intervals for the variance under nonnormality. J. Stat. Comput. Simul. 84, 2710–2720 (2014)
22. Hummel, R., Banga, S., Hettmansperger, T.: Better confidence intervals for the variance in a random sample. Technical report, Department of Statistics, The Pennsylvania State University (2005)
23. Downton, F.: Linear estimates with polynomial coefficients. Biometrika 53, 129–141 (1966)
24. Bonett, D.G.: Approximation confidence interval for standard deviation of nonnormal distributions. Comput. Stat. Data Anal. 50, 775–782 (2006)
25. Steve, A.: Mathematical Statistics. Prentice Hall College Division (1990)
26. Banik, S., Albatineh, A.N., Abu-Shawiesh, M., Kibria, B.M.G.: Estimating the population standard deviation with confidence interval: a simulation study under skewed and symmetric conditions. Biometric Biostatist. 5, 1–9 (2014)
27. Baklizi, A., Kibria, B.M.G.: One and two sample confidence intervals for estimating the mean of skewed populations: an empirical comparative study. J. Appl. Stat. 36, 601–609 (2009)
28. Albatineh, A.N., Wilcox, M.L., Zogheib, B., Kibria, B.M.G.: Confidence interval estimation for the population coefficient of variation using ranked set sampling: a simulation study. J. Appl. Stat. 41(4), 733–751 (2014)
29. Abu-Shawiesh, M.O.A., Akyuz, H.E., Migdadi, H.S.A., Kibria, B.M.G.: A comparison of some modified confidence intervals based on robust scale estimators for process capability index. Prod. Eng. Res. Devel. 14, 217–229 (2020)
30. Abu-Shawiesh, M.O.A., Banik, S., Kibria, B.M.G.: Confidence intervals based on absolute deviation for population mean of a positively skewed distribution. Int. J. Comput. Theoret. Stat. 5(01), 1–13 (2018)
Learning from Few Examples with Nonlinear Feature Maps Ivan Y. Tyukin1(B) , Oliver Sutton1 , and Alexander N. Gorban1,2 1
King’s College London, London, Strand WC2R 2LS, UK {ivan.tyukin,oliver.sutton,alexander.gorban}@kcl.ac.uk 2 University of Leicester, Leicester LE1 7RH, UK [email protected]
Abstract. In this work we consider the problem of data classification in post-classical settings where the number of training examples consists of merely a few data points. We explore the phenomenon and reveal key relationships between the dimensionality of an AI model’s feature space, the non-degeneracy of data distributions, and the model’s generalisation capabilities. The main thrust of our present analysis is on the influence of nonlinear feature transformations, which map original data into higher- and possibly infinite-dimensional spaces, on the resulting model’s generalisation capabilities. Subject to appropriate assumptions, we establish new relationships between properties of nonlinear feature transformation maps and the probabilities of learning successfully from few presentations.

Keywords: Few-Shot Learning · Kernel Learning · Learning from Low-Sample High-Dimensional Data
Notation

– $\mathbb{R}$ denotes the field of real numbers, $\mathbb{R}_{\ge 0} = \{x \in \mathbb{R} \mid x \ge 0\}$, and $\mathbb{R}^n$ stands for the $n$-dimensional linear real vector space;
– $\mathbb{N}$ denotes the set of natural numbers;
– bold symbols $\boldsymbol{x} = (x_1, \dots, x_n)$ will denote elements of $\mathbb{R}^n$;
– $(\boldsymbol{x}, \boldsymbol{y}) = \sum_k x_k y_k$ is the inner product of $\boldsymbol{x}$ and $\boldsymbol{y}$, and $\|\boldsymbol{x}\| = \sqrt{(\boldsymbol{x}, \boldsymbol{x})}$ is the standard Euclidean norm in $\mathbb{R}^n$;
– $\mathbb{B}_n$ denotes the unit ball in $\mathbb{R}^n$ centered at the origin: $\mathbb{B}_n = \{\boldsymbol{x} \in \mathbb{R}^n \mid \|\boldsymbol{x}\| \le 1\}$;
– $\mathbb{B}_n(r, \boldsymbol{y})$ stands for the ball in $\mathbb{R}^n$ of radius $r > 0$ centered at $\boldsymbol{y}$: $\mathbb{B}_n(r, \boldsymbol{y}) = \{\boldsymbol{x} \in \mathbb{R}^n \mid \|\boldsymbol{x} - \boldsymbol{y}\| \le r\}$;
– $V_n$ is the $n$-dimensional Lebesgue measure, and $V_n(\mathbb{B}_n)$ is the volume of the unit $n$-ball.

I.Y. Tyukin: The work was supported by the UKRI Turing AI Fellowship EP/V025295/2 and the UKRI Trustworthy Autonomous Systems in Verifiability node EP/V026801/2.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): SAI 2023, LNNS 711, pp. 210–225, 2023. https://doi.org/10.1007/978-3-031-37717-4_15
1 Introduction
Recent years have seen significant progress in the application of Artificial Intelligence (AI) and Machine Learning tools to a host of practically relevant tasks. Most importantly, we are witnessing major successes in the application of advanced large-scale models featuring millions of trainable parameters [7] to problems for which the volumes of available prior knowledge for training do not conform to the requirements of classical Vapnik-Chervonenkis theory [14] or other similar combinatorial bounds. A well-known example of a task in which this striking phenomenon can be observed is the MNIST digits dataset which, being reasonably small in size, can be learned remarkably well by modern large-scale deep neural networks.

This property is fascinating in its own right, especially in view of [16,17] reporting evidence that large-scale deep neural networks with identical architecture and training routines can both successfully generalise beyond training data and at the same time overfit or memorise random noise. However, what is particularly striking is that sometimes an appropriately trained model is capable of exhibiting an extreme behaviour: learning from merely a few presentations. To date, many different successful few-shot learning schemes have been reported in the literature. Matching [15] and prototypical [8] networks are examples of such learning machines. However, comprehensive theoretical justification of these schemes is yet to be seen.

Recent work [2,9] suggested a new framework offering a pathway for understanding few-shot learning. Instead of focusing on classical ideas rooted in empirical risk minimisation coupled with distribution-agnostic bounds, it explores the interplay between the geometry of feature spaces and concentration of measure phenomena [6]. This enables an escape from the apparent paradox of generalisation discovered in [16,17].
Instead of posing the question of generalisation for all possible data distributions, one can ask a related but different question: what properties of data distributions could be relevant or useful for few-shot learning? This refocusing may be necessary in view of [1], which shows that the spectrum of the data covariance matrix may hold the key to understanding benign overfitting.

In this work we adopt the theoretical framework proposed in [2,9] and generalise it beyond the original setting whereby the problem of few-shot learning is analysed in models’ native feature spaces. Here we explore how the problem of few-shot learning changes if one allows a nonlinear transformation of these features. Our motivation to study this question is two-fold. First, many existing few-shot learning tools [8,15] already assume some sort of kernel-based transformation. Second, using kernels may enable mappings from original finite- or low-dimensional feature spaces into infinite- or essentially high-dimensional spaces. The potential advantages of these transformations are illustrated in Fig. 1. As the figure suggests, mapping vectors from their original spaces into the corresponding feature spaces induced by various kernels has a significant impact on data geometry in the mapped spaces, in particular on the probability of the sample’s quasi-orthogonality and linear separability.
Fig. 1. Empirical Estimates of how Easy it is to Separate Points using Various Nonlinear Kernels. Top: Separating Points Pairwise using Kernel Orthogonality. Bottom: Separating a Single Point from a Set of 20,000 Other Points using a Linear Separating Surface in the Kernel Feature Space. In Both Cases, the Points are Sampled from a Uniform Distribution in $[-1, 1]^n$. Here, $\phi_n(\boldsymbol{x})$ Denotes the Image of the Point $\boldsymbol{x} \in \mathbb{R}^n$ under the Kernel’s Associated Feature Mapping, and $\mu = |Y|^{-1} \sum_{\boldsymbol{y} \in Y} \phi(\boldsymbol{y})$.
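The pairwise experiment summarised in the top panel can be approximated with a short simulation. The sketch below is an illustration under stated assumptions rather than the authors' experimental code: it measures quasi-orthogonality by the cosine between feature images, computed from kernel values alone as κ(x, y) / sqrt(κ(x, x) κ(y, y)), without the centring by μ mentioned in the caption. The function names, the Gaussian width `sigma`, the threshold `delta`, and the sample sizes are all hypothetical choices.

```python
import math
import random


def linear_kernel(x, y):
    """Kernel of the identity feature map: the ordinary inner product."""
    return sum(a * b for a, b in zip(x, y))


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel exp(-||x - y||^2 / (2 sigma^2)); sigma is an assumed width."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def feature_cosine(kernel, x, y):
    """Cosine of the angle between phi(x) and phi(y), using kernel values only."""
    return kernel(x, y) / math.sqrt(kernel(x, x) * kernel(y, y))


def quasi_orthogonal_fraction(kernel, n, delta, pairs=2000, seed=0):
    """Fraction of random pairs from [-1, 1]^n whose absolute feature cosine is at most delta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(pairs):
        x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        y = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        if abs(feature_cosine(kernel, x, y)) <= delta:
            hits += 1
    return hits / pairs


if __name__ == "__main__":
    # Compare low- and high-dimensional inputs for the identity feature map.
    for n in (2, 20, 200):
        print(n, quasi_orthogonal_fraction(linear_kernel, n, delta=0.2))
```

With the linear kernel one would expect the reported fraction to grow with n; passing `gaussian_kernel` instead gives a rough feel for how the choice of kernel changes the picture.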
As we show here, the latter properties may offer new perspectives and capabilities affecting probabilities of success of such schemes. These results are stated formally in Theorem 2 which is the main theoretical contribution of our work. The paper is organised as follows. In Sect. 2 we introduce some relevant notation and formulate the problem of few-shot learning, in which nonlinear feature transformations mapping input data into new feature spaces become important parameters of the problem. Section 3 presents our main results including appropriate assumptions on the data distributions enabling the few-shot learning rules analysed in this work. These few-shot learning rules are very similar to those proposed and empirically studied in [8]. In this respect, Sect. 3 presents theoretical underpinnings for such rules. Section 4 concludes the paper.
2 Preliminaries and Problem Formulation
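In what follows we consider the problem of few-shot learning in the framework of a standard classification task. In this framework, we assume the existence of two sets of labels $\mathcal{L}$ and $\mathcal{L}_{\mathrm{new}}$, $\mathcal{L} \cap \mathcal{L}_{\mathrm{new}} = \emptyset$, and two finite data sets,

$$\mathcal{X} = \{(\boldsymbol{x}, \ell) \mid \boldsymbol{x} \in \mathbb{R}^n,\ \ell \in \mathcal{L}\}, \quad |\mathcal{X}| = N,$$

and

$$\mathcal{Y} = \{(\boldsymbol{x}, \ell) \mid \boldsymbol{x} \in \mathbb{R}^n,\ \ell \in \mathcal{L}_{\mathrm{new}}\}, \quad |\mathcal{Y}| = k,$$

in which the pairs $(\boldsymbol{x}, \ell) \in \mathcal{X}$ are i.i.d. samples from some distribution $P_{\mathcal{X}}$, and the pairs $(\boldsymbol{x}, \ell) \in \mathcal{Y}$ are i.i.d. samples from some other distribution $P_{\mathcal{Y}}$. Elements $\ell \in \mathcal{L} \cup \mathcal{L}_{\mathrm{new}}$ in the definitions of $\mathcal{X}$ and $\mathcal{Y}$ are the labels associated with the data vectors $\boldsymbol{x}$. In addition to the distributions $P_{\mathcal{X}}$ and $P_{\mathcal{Y}}$, it is convenient to consider the marginal distributions $P_X$ and $P_Y$:

$$P_X(\boldsymbol{x}) = \sum_{\ell \in \mathcal{L}} P_{\mathcal{X}}(\boldsymbol{x}, \ell), \qquad P_Y(\boldsymbol{x}) = \sum_{\ell \in \mathcal{L}_{\mathrm{new}}} P_{\mathcal{Y}}(\boldsymbol{x}, \ell).$$

We assume that there is a function $F$

$$F : \mathbb{R}^n \to \mathcal{L} \tag{1}$$

assigning an element from $\mathcal{L}$ to a vector from $\mathbb{R}^n$. The function $F$ models the expertise of the system in relation to its capability to predict labels $\ell$ in the pairs $(\boldsymbol{x}, \ell)$ drawn from $\mathcal{Y}$ on the basis of the information that is contained in $\boldsymbol{x}$.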
In this respect, the set $\mathcal{X}$ represents existing knowledge about the environment. This set may be arbitrarily large or even infinite, but the learner has no access to the elements of $\mathcal{X}$. The set $\mathcal{Y}$ represents new knowledge which is available to the learner. This new knowledge, however, is assumed to be scarce in the sense that $k \ll N$, $k \ll n$.

In addition to the data vectors $\boldsymbol{x} \in \mathbb{R}^n$ we consider a parameterised family of feature maps $\phi_n$:

$$\phi_n : \mathbb{R}^n \to H \tag{2}$$

mapping elements of $\mathbb{R}^n$ into a Hilbert space $H$, which may be either finite- or infinite-dimensional. The map $\phi_n$ can represent transformations of the input data into the corresponding latent spaces in deep neural networks; it can also model other relevant data transformations emerging, e.g., through the application of kernel tricks. For every $\boldsymbol{x} \in \mathbb{R}^n$, the map $\phi_n$, in turn, induces a kernel map $\kappa_n(\boldsymbol{x}, \cdot)$:

$$\kappa_n(\boldsymbol{x}, \cdot) : \mathbb{R}^n \to \mathbb{R}, \qquad \kappa_n(\boldsymbol{x}, \cdot) = (\phi_n(\boldsymbol{x}), \phi_n(\cdot)).$$

Remark 1. Examples of functions $\phi_n$ include the identity map $\phi_n(\boldsymbol{x}) = \boldsymbol{x}$ and feature maps of polynomial, $\kappa_n(\boldsymbol{x}, \boldsymbol{y}) = ((\boldsymbol{x}, \boldsymbol{y}) + 1)^m$, $m = 1, 2, \dots$, Gaussian, $\kappa_n(\boldsymbol{x}, \boldsymbol{y}) = \exp\left(-\frac{\|\boldsymbol{x} - \boldsymbol{y}\|^2}{2\sigma^2}\right)$, $\sigma \in \mathbb{R}_{>0}$, and Laplacian, $\kappa_n(\boldsymbol{x}, \boldsymbol{y}) = \exp(-\alpha \|\boldsymbol{x} - \boldsymbol{y}\|)$, $\alpha \in \mathbb{R}_{>0}$, kernels.

The task is to learn a rule enabling the learner to discriminate between samples drawn from $P_X$ and $P_Y$ by accessing only the values of $\boldsymbol{x}_i$ and using available training data $\mathcal{Y}$, possibly some additional generic knowledge about $\mathcal{X}$, and the map $\phi_n$. More formally, the task is stated as follows (cf. [9]):

Problem 1 (Few-shot learning). Consider a classifier $F$ defined by (1), trained on a sample $\mathcal{X}$ drawn from some distribution $P_{\mathcal{X}}$. Let $\mathcal{Y}$ be a new sample that is drawn from another distribution $P_{\mathcal{Y}}$ and whose cardinality $|\mathcal{Y}| \ll n$. Let $p_e, p_n \in (0, 1]$ be given positive numbers determining the quality of learning. Find an algorithm $\mathcal{A}(\mathcal{Y})$ producing a new classification map $F_{\mathrm{new}} : \mathcal{X} \to \mathcal{L} \cup \mathcal{L}_{\mathrm{new}}$ such that

$$P\left(F_{\mathrm{new}}(\boldsymbol{x}) \in \mathcal{L}_{\mathrm{new}}\right) \ge p_n \tag{3}$$

for $\boldsymbol{x}$ drawn from $P_Y$, and

$$P\left(F_{\mathrm{new}}(\boldsymbol{x}) = F(\boldsymbol{x})\right) \ge p_e \tag{4}$$

for $\boldsymbol{x}$ drawn from the distribution $P_X$.

Remark 2. Note that the set $\mathcal{L}_{\mathrm{new}}$ in Problem 1 is not necessarily a singleton. It may, in principle, contain more than one element. This allows questions to be posed regarding learning to discriminate between more than a single class. The other point that is articulated in the statement of Problem 1 is the requirement that $|\mathcal{Y}| \ll n$, which defines the context of what “few” refers to in the definition of few-shot learning problems.
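The kernels listed in Remark 1 are straightforward to write down, and for the polynomial kernel with m = 2 and n = 2 a feature map can be given explicitly, which makes the identity κ_n(x, y) = (φ_n(x), φ_n(y)) easy to check numerically. The following is a minimal sketch; the function names and the explicit six-dimensional map `poly2_feature_map` are illustrative choices, not notation from the paper.

```python
import math


def dot(x, y):
    """Standard inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, m=2):
    """Polynomial kernel ((x, y) + 1)^m for a positive integer m."""
    return (dot(x, y) + 1.0) ** m


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel exp(-||x - y||^2 / (2 sigma^2)) with sigma > 0."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def laplacian_kernel(x, y, alpha=1.0):
    """Laplacian kernel exp(-alpha ||x - y||) with alpha > 0."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return math.exp(-alpha * dist)


def poly2_feature_map(x):
    """One explicit feature map for the degree-2 polynomial kernel on R^2."""
    x1, x2 = x
    s = math.sqrt(2.0)
    return [x1 * x1, x2 * x2, s * x1 * x2, s * x1, s * x2, 1.0]


if __name__ == "__main__":
    x, y = [0.3, -1.2], [2.0, 0.5]
    # The kernel value and the inner product of the explicit features should agree.
    print(polynomial_kernel(x, y), dot(poly2_feature_map(x), poly2_feature_map(y)))
```

For the Gaussian and Laplacian kernels the associated feature space is infinite-dimensional, so no finite list of features plays the role of `poly2_feature_map`; only the kernel values are computed.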
In the next section we describe sufficient conditions for the existence of algorithms A presenting a solution of the class of few-shot learning problems, as formulated in Problem 1.
3 Main Results
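We begin with the introduction of several useful characterisations of the maps $\phi_n$ in (2) which will enable us to formulate appropriate requirements on the distributions $P_X$ and $P_Y$. Consider

$$V_{\phi_n}(\boldsymbol{c}, r, n) = \int_{\|\phi_n(\boldsymbol{x}) - \boldsymbol{c}\| \le r} 1 \, d\boldsymbol{x}. \tag{5}$$

The symbol $n$ on the left-hand side of the above notation indicates that $\boldsymbol{x}$ are taken from $\mathbb{R}^n$.

Assumption 1. There exists a function $\alpha_{\phi_n} : H \times H \times \mathbb{N} \to \mathbb{R}_{\ge 0}$ such that for any $\boldsymbol{c}_1, \boldsymbol{c}_2 \in H$, $r_1 \le r_2 \in \mathbb{R}_{>0}$ the following holds true

$$\frac{V_{\phi_n}(\boldsymbol{c}_1, r_1, n)}{V_{\phi_n}(\boldsymbol{c}_2, r_2, n)} \le C \left(\frac{r_1}{r_2}\right)^{\alpha_{\phi_n}(\boldsymbol{c}_1, \boldsymbol{c}_2, n)} \tag{6}$$

whenever $V_{\phi_n}(\boldsymbol{c}_2, r_2, n) \ne 0$ and where the constant $C > 0$ may be dependent on $\boldsymbol{c}_1, \boldsymbol{c}_2$.

Remark 3. Note that the class of functions satisfying Assumption 1 is not empty. It holds, for example, for $\phi_n(\boldsymbol{x}) = \boldsymbol{x}$ with $C = 1$ and $\alpha_{\phi_n}(\boldsymbol{c}_1, \boldsymbol{c}_2, n) = n$. In principle for some combinations of $\boldsymbol{c}_1, \boldsymbol{c}_2$ the constant $C$ may be infinite, although $C$ is guaranteed to be finite for $\boldsymbol{c}_1 = \boldsymbol{c}_2$ by the monotonic nature of $V_{\phi_n}$ whenever $V_{\phi_n}$ is finite. For the purposes of this work we shall assume that $V_{\phi_n}$ is finite. It may also be feasible to assume some regularity of $V_{\phi_n}$ in the sense of existence of a function $f : \mathbb{R} \times \mathbb{N} \to \mathbb{R}$ limiting the growth of $V_{\phi_n}(\boldsymbol{c}, r, n)$ with respect to $\boldsymbol{c} \in H$: $V_{\phi_n}(\boldsymbol{c}, r, n) \le f(r, n)$. In what follows we will require that the constant $C$ exists and is finite for $\boldsymbol{c}_1, \boldsymbol{c}_2$ in a vicinity of some characteristic points in $H$ determining concentration properties of data distributions (namely points $\boldsymbol{c}_X$ and $\boldsymbol{c}_Y$ in Assumptions 2, 3 below). We formalise this by supposing that

$$C^*(\boldsymbol{c}, r) = \max_{\xi :\, \|\boldsymbol{c} - \xi\| \le r} C(\xi, \boldsymbol{c})$$

is finite for certain combinations of $\boldsymbol{c}$ and $r$. If the dependency of $C$ on $\boldsymbol{c}_1, \boldsymbol{c}_2$ is clear from the context then we will omit such explicit specifications in relevant expressions. For the functions $\alpha_{\phi_n}$ satisfying (6) we introduce

$$\beta_{\phi_n}(\boldsymbol{c}, r, n) = \min_{\xi :\, \|\boldsymbol{c} - \xi\| \le r} \alpha_{\phi_n}(\boldsymbol{c}, \xi, n). \tag{7}$$

We are now ready to proceed with specifying the requirements on $P_X$ and $P_Y$.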
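The identity-map case mentioned in Remark 3 can be checked directly, because the volume of a Euclidean ball is known in closed form. The sketch below assumes the standard formula π^{n/2} r^n / Γ(n/2 + 1) for the volume of a ball of radius r in R^n; the helper names `ball_volume` and `volume_ratio` are illustrative.

```python
import math


def ball_volume(n, r):
    """Volume of the n-dimensional Euclidean ball of radius r."""
    return math.pi ** (n / 2.0) * r ** n / math.gamma(n / 2.0 + 1.0)


def volume_ratio(n, r1, r2):
    """Ratio of ball volumes for the identity feature map; the centres do not matter."""
    return ball_volume(n, r1) / ball_volume(n, r2)


if __name__ == "__main__":
    # For phi_n(x) = x the ratio equals (r1 / r2)^n, i.e. (6) holds with C = 1, alpha = n.
    for n in (2, 10, 100):
        print(n, volume_ratio(n, 0.9, 1.0), 0.9 ** n)
```

The rapid decay of the ratio with n is the volume compression that the bounds in the rest of this section rely on.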
Assumption 2. For the distribution $P_X$, there is a corresponding probability density function $p_X$, positive numbers $A_X > 0$, $r_X > 0$, and $\boldsymbol{c}_X \in H$, such that $p_X$ is supported on the set $S_X = \{\boldsymbol{x} \in \mathbb{R}^n \mid \|\phi_n(\boldsymbol{x}) - \boldsymbol{c}_X\| \le r_X\}$, $V_n(S_X) > 0$, and satisfies the following growth bound:

$$p_X(\boldsymbol{x}) \le \frac{A_X}{V_{\phi_n}(\boldsymbol{c}_X, r_X, n)}.$$

Assumption 3. For the distribution $P_Y$, there is a corresponding probability density function $p_Y$, positive numbers $A_Y > 0$, $r_Y > 0$, and $\boldsymbol{c}_Y \in H$, such that $p_Y$ is supported on the set $S_Y = \{\boldsymbol{x} \in \mathbb{R}^n \mid \|\phi_n(\boldsymbol{x}) - \boldsymbol{c}_Y\| \le r_Y\}$, $V_n(S_Y) > 0$, and satisfies the following growth bound:

$$p_Y(\boldsymbol{x}) \le \frac{A_Y}{V_{\phi_n}(\boldsymbol{c}_Y, r_Y, n)}.$$

Observe that the functions $V_{\phi_n}$, $\beta_{\phi_n}$ in Assumptions 2, 3 are determined exclusively by the feature maps $\phi_n$, whereas their arguments $\boldsymbol{c}_X, r_X$ and $\boldsymbol{c}_Y, r_Y$ capture relevant properties of $P_X$, $P_Y$.

The rest of this Section is organised as follows. Our main result, Theorem 2, justifying solutions of the few-shot learning problem (Problem 1) with the help of some auxiliary functions

$$\frac{1}{k} \sum_{i=1}^{k} \kappa_n(\boldsymbol{x}_i, \boldsymbol{x}) - \theta, \quad \theta > 0,$$

where $\boldsymbol{x}_i$, $i = 1, \dots, k$ are a part of the training sample, is stated and proved in Sect. 3.3. The proof of this theorem, however, is based on two other results. The first result is the generalised lemma on the typicality of quasi-orthogonality in high dimension (cf. [3–5]) which we present in Sect. 3.1. The second result, which we call the law of high dimension, is presented in Sect. 3.2. Readers who may wish first to explore details of conditions and guarantees presented in our main theorem (Theorem 2) can skip the next two Sections and proceed to Sect. 3.3.

3.1 Quasi-Orthogonality in Hilbert Spaces
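Lemma 1 (Quasi-orthogonality). Let $\mathcal{Z} = \{\boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_k\}$ be a set of $k$ i.i.d. random vectors drawn from a distribution satisfying Assumption 3, let $\delta, \varepsilon \in (0, 1)$, and let $\phi_n$ satisfy Assumption 1. Consider the event $A_1$:

$$A_1 : \ |(\phi_n(\boldsymbol{x}_i) - \boldsymbol{c}_Y, \phi_n(\boldsymbol{x}_j) - \boldsymbol{c}_Y)| \le \delta r_Y, \quad \forall\, i \ne j \tag{8}$$

and the event $A_2$:

$$A_2 : \ \|\phi_n(\boldsymbol{x}_i) - \boldsymbol{c}_Y\| \ge (1 - \varepsilon) r_Y \quad \forall\, i. \tag{9}$$

Then

$$P(A_1) \ge 1 - k(k-1) C A_Y \left[(1 - \delta^2)^{1/2}\right]^{\beta_{\phi_n}(\boldsymbol{c}_Y, r_Y \delta, n)}, \tag{10}$$

and

$$P(A_1 \wedge A_2) \ge 1 - C A_Y k \left( [1 - \varepsilon]^{\beta_{\phi_n}(\boldsymbol{c}_Y, 0, n)} + (k-1) \left[(1 - \delta^2)^{1/2}\right]^{\beta_{\phi_n}(\boldsymbol{c}_Y, r_Y \delta, n)} \right). \tag{11}$$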
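To get a feel for when the bounds (10) and (11) become informative, they can be evaluated numerically. The sketch below is a plain transcription of the two right-hand sides, with the exponent β supplied by the caller. The defaults C = 1 and A_Y = 1 correspond to the identity feature map of Remark 3 (for which β = n) together with a uniform density on the ball, both of which are assumptions made here purely for illustration; the function names are hypothetical.

```python
def quasi_orthogonality_bound(k, delta, beta, C=1.0, A_Y=1.0):
    """Right-hand side of (10) for a given value of the exponent beta."""
    return 1.0 - k * (k - 1) * C * A_Y * (1.0 - delta ** 2) ** (beta / 2.0)


def joint_bound(k, delta, eps, beta_delta, beta_zero, C=1.0, A_Y=1.0):
    """Right-hand side of (11); beta_delta and beta_zero are the two exponents."""
    shell = (1.0 - eps) ** beta_zero
    caps = (k - 1) * (1.0 - delta ** 2) ** (beta_delta / 2.0)
    return 1.0 - C * A_Y * k * (shell + caps)


if __name__ == "__main__":
    # Identity map: beta = n. The bound is vacuous for small n and tightens as n grows.
    for n in (2, 50, 200):
        print(n, quasi_orthogonality_bound(10, 0.3, n), joint_bound(10, 0.3, 0.1, n, n))
```

A negative value simply means that the bound carries no information for that combination of k, δ and β.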
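Proof of Lemma 1. Denote $\tilde{\phi}_i = \phi_n(\boldsymbol{x}_i) - \boldsymbol{c}_Y$ and consider the event

$$E_1(\tilde{\phi}_1, \tilde{\phi}_2) : \ \left| \left( \frac{\tilde{\phi}_1}{\|\tilde{\phi}_1\|}, \tilde{\phi}_2 \right) \right| > \delta.$$

The probability that event $E_1(\tilde{\phi}_1, \tilde{\phi}_2)$ occurs is equal to

$$\int P(E_1(\tilde{\phi}_1, \tilde{\phi}_2) \mid \tilde{\phi}_1)\, p(\tilde{\phi}_1)\, d\tilde{\phi}_1.$$

The conditional probability $P(E_1(\tilde{\phi}_1, \tilde{\phi}_2) \mid \tilde{\phi}_1)$ is equal to the probability that the vector $\tilde{\phi}_2$ ends up in the union of the following sets

$$C_+(\tilde{\phi}_1, \boldsymbol{c}_Y) = \left\{ \xi \in H \ \middle|\ \left( \frac{\tilde{\phi}_1}{\|\tilde{\phi}_1\|}, \xi - \boldsymbol{c}_Y \right) > \delta \right\},$$

$$C_-(\tilde{\phi}_1, \boldsymbol{c}_Y) = \left\{ \xi \in H \ \middle|\ \left( \frac{\tilde{\phi}_1}{\|\tilde{\phi}_1\|}, \xi - \boldsymbol{c}_Y \right) < -\delta \right\}.$$

Given that $\boldsymbol{x}_1, \dots, \boldsymbol{x}_k$ are drawn independently from the same distribution, this probability can be bounded from above as

$$P(E_1(\tilde{\phi}_1, \tilde{\phi}_2) \mid \tilde{\phi}_1) = \int_{C_+(\tilde{\phi}_1, \boldsymbol{c}_Y)} p_Y(\boldsymbol{x})\, d\boldsymbol{x} + \int_{C_-(\tilde{\phi}_1, \boldsymbol{c}_Y)} p_Y(\boldsymbol{x})\, d\boldsymbol{x} \le \frac{A_Y}{V_{\phi_n}(\boldsymbol{c}_Y, r_Y, n)} \left( \int_{C_+(\tilde{\phi}_1, \boldsymbol{c}_Y)} 1\, d\boldsymbol{x} + \int_{C_-(\tilde{\phi}_1, \boldsymbol{c}_Y)} 1\, d\boldsymbol{x} \right).$$

Observe that

$$\int_{C_+(\tilde{\phi}_1, \boldsymbol{c}_Y)} 1\, d\boldsymbol{x} < V_{\phi_n}(\boldsymbol{c}_+, r_Y (1 - \delta^2)^{1/2}, n)$$

and

$$\int_{C_-(\tilde{\phi}_1, \boldsymbol{c}_Y)} 1\, d\boldsymbol{x} < V_{\phi_n}(\boldsymbol{c}_-, r_Y (1 - \delta^2)^{1/2}, n)$$

for some $\boldsymbol{c}_+, \boldsymbol{c}_- \in H$ satisfying $\|\boldsymbol{c}_+ - \boldsymbol{c}_Y\| \le r_Y \delta$, $\|\boldsymbol{c}_- - \boldsymbol{c}_Y\| \le r_Y \delta$. Therefore, according to Assumption 1 (eq. (6))

$$P(E_1(\tilde{\phi}_1, \tilde{\phi}_2) \mid \tilde{\phi}_1) \le C A_Y \left( \left[(1 - \delta^2)^{1/2}\right]^{\alpha_{\phi_n}(\boldsymbol{c}_Y, \boldsymbol{c}_+, n)} + \left[(1 - \delta^2)^{1/2}\right]^{\alpha_{\phi_n}(\boldsymbol{c}_Y, \boldsymbol{c}_-, n)} \right).$$

Taking (7) into account, the above estimate results in

$$P(E_1(\tilde{\phi}_1, \tilde{\phi}_2) \mid \tilde{\phi}_1) \le 2 C A_Y \left[(1 - \delta^2)^{1/2}\right]^{\beta_{\phi_n}(\boldsymbol{c}_Y, r_Y \delta, n)}. \tag{12}$$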
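Hence, the probability that the event $E_1(\tilde{\phi}_1, \tilde{\phi}_2)$ occurs admits the following upper bound:

$$\int P(E_1(\tilde{\phi}_1, \tilde{\phi}_2) \mid \tilde{\phi}_1)\, p(\tilde{\phi}_1)\, d\tilde{\phi}_1 \le 2 C A_Y \left[(1 - \delta^2)^{1/2}\right]^{\beta_{\phi_n}(\boldsymbol{c}_Y, r_Y \delta, n)} \int p(\tilde{\phi}_1)\, d\tilde{\phi}_1 = 2 C A_Y \left[(1 - \delta^2)^{1/2}\right]^{\beta_{\phi_n}(\boldsymbol{c}_Y, r_Y \delta, n)}.$$

Now consider events

$$E_m(\tilde{\phi}_1, \dots, \tilde{\phi}_m) : \ \left| \left( \frac{\tilde{\phi}_1}{\|\tilde{\phi}_1\|}, \tilde{\phi}_m \right) \right| > \delta \ \vee \ \cdots \ \vee \ \left| \left( \frac{\tilde{\phi}_{m-1}}{\|\tilde{\phi}_{m-1}\|}, \tilde{\phi}_m \right) \right| > \delta$$

for $m = 2, \dots, k$. According to the union bound,

$$P(E_m(\tilde{\phi}_1, \dots, \tilde{\phi}_m) \mid \tilde{\phi}_1, \dots, \tilde{\phi}_{m-1}) \le \sum_{i=1}^{m-1} P\left( \left| \left( \frac{\tilde{\phi}_i}{\|\tilde{\phi}_i\|}, \tilde{\phi}_m \right) \right| > \delta \ \middle|\ \tilde{\phi}_i \right).$$

Applying the same argument as has been used in the derivation of (12), we can conclude that the right-hand side of the above inequality does not exceed the value of

$$2(m-1) C A_Y \left[(1 - \delta^2)^{1/2}\right]^{\beta_{\phi_n}(\boldsymbol{c}_Y, r_Y \delta, n)}.$$

Hence

$$P(E_m(\tilde{\phi}_1, \dots, \tilde{\phi}_m)) \le 2(m-1) C A_Y \left[(1 - \delta^2)^{1/2}\right]^{\beta_{\phi_n}(\boldsymbol{c}_Y, r_Y \delta, n)} \tag{13}$$

for every $m = 1, \dots, k$. Now consider events

$$B_m(\tilde{\phi}_m) : \ \|\tilde{\phi}_m\| < (1 - \varepsilon) r_Y, \quad m = 1, \dots, k.$$

The probability $P(B_m(\tilde{\phi}_m) \mid \tilde{\phi}_i,\ i \ne m)$ is:

$$\begin{aligned} P(B_m(\tilde{\phi}_m) \mid \tilde{\phi}_i,\ i \ne m) &= \int_{\|\phi_n(\boldsymbol{x}) - \boldsymbol{c}_Y\| \le (1 - \varepsilon) r_Y} p_Y(\boldsymbol{x})\, d\boldsymbol{x} \\ &\le \frac{A_Y}{V_{\phi_n}(\boldsymbol{c}_Y, r_Y, n)} \int_{\|\phi_n(\boldsymbol{x}) - \boldsymbol{c}_Y\| \le (1 - \varepsilon) r_Y} 1\, d\boldsymbol{x} \\ &= A_Y \frac{V_{\phi_n}(\boldsymbol{c}_Y, (1 - \varepsilon) r_Y, n)}{V_{\phi_n}(\boldsymbol{c}_Y, r_Y, n)} \\ &\le C A_Y [(1 - \varepsilon)]^{\beta_{\phi_n}(\boldsymbol{c}_Y, 0, n)}. \end{aligned} \tag{14}$$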
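Recall that for any events $\Omega_1, \dots, \Omega_d$ the following holds true:

$$P(\Omega_1 \wedge \Omega_2 \wedge \cdots \wedge \Omega_d) \ge 1 - \sum_{i=1}^{d} P(\text{not } \Omega_i). \tag{15}$$

Therefore, using (13) and (14), one can conclude that

$$P((\text{not } E_1) \wedge \cdots \wedge (\text{not } E_k)) \ge 1 - \sum_{i=1}^{k} P(E_i) \ge 1 - k(k-1) C A_Y \left[(1 - \delta^2)^{1/2}\right]^{\beta_{\phi_n}(\boldsymbol{c}_Y, r_Y \delta, n)} \tag{16}$$

and

$$P((\text{not } B_1) \wedge \cdots \wedge (\text{not } B_k)) \ge 1 - \sum_{i=1}^{k} P(B_i) \ge 1 - k C A_Y [(1 - \varepsilon)]^{\beta_{\phi_n}(\boldsymbol{c}_Y, 0, n)}. \tag{17}$$

Finally, observe that $\|\tilde{\phi}_m\|$ is always bounded from above by $r_Y$. Therefore any $\tilde{\phi}_1, \dots, \tilde{\phi}_k$ satisfying conditions

$$\left| \left( \frac{\tilde{\phi}_1}{\|\tilde{\phi}_1\|}, \tilde{\phi}_m \right) \right| \le \delta \ \wedge \ \cdots \ \wedge \ \left| \left( \frac{\tilde{\phi}_{m-1}}{\|\tilde{\phi}_{m-1}\|}, \tilde{\phi}_m \right) \right| \le \delta$$

for $m = 1, \dots, k$ must necessarily satisfy

$$\left| \left( \tilde{\phi}_1, \tilde{\phi}_m \right) \right| \le \delta r_Y \ \wedge \ \cdots \ \wedge \ \left| \left( \tilde{\phi}_{m-1}, \tilde{\phi}_m \right) \right| \le \delta r_Y.$$

Hence, the event $[\text{not } E_1 \wedge \cdots \wedge \text{not } E_{k-1}]$ is contained in the event $A_1$ defined by (8) and

$$P(A_1) \ge P(\text{not } E_1 \wedge \cdots \wedge \text{not } E_{k-1}),$$

and

$$\begin{aligned} P(A_1 \wedge A_2) &= P(A_1 \wedge \text{not } B_1 \wedge \cdots \wedge \text{not } B_k) \\ &\ge P(\text{not } E_1 \wedge \cdots \wedge \text{not } E_{k-1} \wedge \text{not } B_1 \wedge \cdots \wedge \text{not } B_k) \\ &\ge 1 - \sum_{i=1}^{k-1} P(E_i) - \sum_{i=1}^{k} P(B_i). \end{aligned}$$

This together with (17), (16) concludes the proof.

3.2 The Law of High Dimension in Hilbert Spaces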
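Theorem 1 (The law of high dimension). Consider a set $\mathcal{Z} = \{\boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_k\}$ of $k$ i.i.d. random vectors drawn from a distribution satisfying Assumption 3, and let the function $\phi_n$ satisfy Assumption 1. Introduce the empirical mean of the sample in the feature space $H$:

$$\bar{\phi}_n = \frac{1}{k} \sum_{i=1}^{k} \phi_n(\boldsymbol{x}_i).$$

Finally, define

$$U(k, \delta) = k^{-1} \left( r_Y^2 + (k-1) \delta r_Y \right), \qquad L(k, \delta, \varepsilon) = k^{-1} \left( (1 - \varepsilon)^2 r_Y^2 - (k-1) \delta r_Y \right),$$

where $\delta, \varepsilon$ are some real numbers from $(0, 1)$. Then the following holds for any $\delta, \varepsilon \in (0, 1)$:

$$P\left( \|\bar{\phi}_n - \boldsymbol{c}_Y\|^2 \le U(k, \delta) \right) \ge 1 - C A_Y k(k-1) \left[(1 - \delta^2)^{1/2}\right]^{\beta_{\phi_n}(\boldsymbol{c}_Y, r_Y \delta, n)}. \tag{18}$$

Moreover,

$$P\left( L(k, \delta, \varepsilon) \le \|\bar{\phi}_n - \boldsymbol{c}_Y\|^2 \le U(k, \delta) \right) \ge 1 - C A_Y k [(1 - \varepsilon)]^{\beta_{\phi_n}(\boldsymbol{c}_Y, 0, n)} - C A_Y k(k-1) \left[(1 - \delta^2)^{1/2}\right]^{\beta_{\phi_n}(\boldsymbol{c}_Y, r_Y \delta, n)}. \tag{19}$$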
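The statement of Theorem 1 can be probed empirically in the simplest setting. The following sketch is a Monte Carlo illustration, not a verification of the theorem: it assumes the identity feature map, takes c_Y = 0, and draws samples uniformly from the ball of radius r_Y, and then counts how often the squared norm of the empirical mean falls between L(k, δ, ε) and U(k, δ). All function names and parameter values are hypothetical.

```python
import math
import random


def sample_uniform_ball(n, r, rng):
    """Draw one point uniformly from the ball of radius r centred at the origin of R^n."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(v * v for v in g))
    radius = r * rng.random() ** (1.0 / n)
    return [radius * v / norm for v in g]


def upper(k, delta, r):
    """U(k, delta) as defined in Theorem 1, with r playing the role of r_Y."""
    return (r * r + (k - 1) * delta * r) / k


def lower(k, delta, eps, r):
    """L(k, delta, eps) as defined in Theorem 1, with r playing the role of r_Y."""
    return ((1.0 - eps) ** 2 * r * r - (k - 1) * delta * r) / k


def mean_squared_norm(points):
    """Squared Euclidean norm of the empirical mean of the points."""
    k, n = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / k for j in range(n)]
    return sum(m * m for m in mean)


def coverage(n, k, r, delta, eps, trials=200, seed=0):
    """Fraction of trials in which L <= ||mean - c_Y||^2 <= U, assuming c_Y = 0."""
    rng = random.Random(seed)
    lo, hi = lower(k, delta, eps, r), upper(k, delta, r)
    hits = 0
    for _ in range(trials):
        sample = [sample_uniform_ball(n, r, rng) for _ in range(k)]
        if lo <= mean_squared_norm(sample) <= hi:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for n in (5, 50, 400):
        print(n, coverage(n, k=5, r=1.0, delta=0.15, eps=0.1))
```

In low dimension the interval is frequently missed, whereas for large n the squared norm of the mean concentrates near r_Y^2 / k, which is the behaviour the theorem describes.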
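Proof of Theorem 1. The proof follows from the Quasi-orthogonality Lemma (Lemma 1). Consider

$$\begin{aligned} \|\bar{\phi}_n - \boldsymbol{c}_Y\|^2 &= (\bar{\phi}_n - \boldsymbol{c}_Y, \bar{\phi}_n - \boldsymbol{c}_Y) \\ &= \left( \frac{1}{k} \sum_{i=1}^{k} \phi_n(\boldsymbol{x}_i) - \boldsymbol{c}_Y, \ \frac{1}{k} \sum_{i=1}^{k} \phi_n(\boldsymbol{x}_i) - \boldsymbol{c}_Y \right) \\ &= \frac{1}{k^2} \sum_{i=1}^{k} \|\phi_n(\boldsymbol{x}_i) - \boldsymbol{c}_Y\|^2 + \frac{1}{k^2} \sum_{i \ne j} (\phi_n(\boldsymbol{x}_i) - \boldsymbol{c}_Y, \phi_n(\boldsymbol{x}_j) - \boldsymbol{c}_Y). \end{aligned}$$

Lemma 1 (statement (10)) states that the probability that

$$\frac{1}{k^2} \sum_{i \ne j} |(\phi_n(\boldsymbol{x}_i) - \boldsymbol{c}_Y, \phi_n(\boldsymbol{x}_j) - \boldsymbol{c}_Y)| \le \frac{k-1}{k} r_Y \delta$$

holds true is at least

$$1 - k(k-1) C A_Y \left[(1 - \delta^2)^{1/2}\right]^{\beta_{\phi_n}(\boldsymbol{c}_Y, r_Y \delta, n)}.$$

Noticing that $\|\phi_n(\boldsymbol{x}_i) - \boldsymbol{c}_Y\| \le r_Y$ for all $i = 1, \dots, k$ assures that statement (18) holds. Combining the union bound, (18), and invoking statement (11) of Lemma 1, results in bound (19).

3.3 Few-Shot Learning via Averages of Nonlinear Feature Maps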
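Theorem 2 (Few-shot learning via averages). Let $F$ be a classifier defined by (1) and trained on a sample $\mathcal{X}$ drawn from some distribution $P_{\mathcal{X}}$ and whose marginal distribution $P_X$ satisfies Assumption 2 with $\boldsymbol{c}_X = 0$. Let $\mathcal{Z} = \{\boldsymbol{x}_1, \dots, \boldsymbol{x}_k\}$, $i = 1, \dots, k$ be an i.i.d. sample drawn from a distribution $P_Y$ satisfying Assumption 3, and whose corresponding class labels are from the set $\mathcal{L}_{\mathrm{new}}$. Finally, suppose that the function $\phi_n$ satisfies Assumption 1. Consider

$$D(\mathcal{Z}) = \frac{1}{k} \left( \sum_{i=1}^{k} \sum_{j=1}^{k} \kappa_n(\boldsymbol{x}_i, \boldsymbol{x}_j) \right)^{1/2}$$

and let $\delta \in (0, 1)$ be a solution of

$$\Delta = D(\mathcal{Z}) - \left( \frac{r_Y^2}{k} + \frac{k-1}{k} r_Y \delta \right)^{1/2} > 0.$$

Then the map

$$F_{\mathrm{new}}(\boldsymbol{x}) = \begin{cases} \ell_{\mathrm{new}}, & \frac{1}{k} \sum_{i=1}^{k} \kappa_n(\boldsymbol{x}_i, \boldsymbol{x}) - \theta D(\mathcal{Z}) \ge 0 \\ F(\boldsymbol{x}), & \text{otherwise} \end{cases} \tag{20}$$

with $\ell_{\mathrm{new}} \in \mathcal{L}_{\mathrm{new}}$, parameterised by $\theta \in [\max\{\Delta - r_Y, 0\}, \Delta]$, is a solution of Problem 1 with

$$p_n = \left( 1 - C^*(\boldsymbol{c}_Y, \Delta - \theta) A_Y \left[ \left( r_Y^2 - (\Delta - \theta)^2 \right)^{1/2} \right]^{\beta_{\phi_n}(\boldsymbol{c}_Y, \Delta - \theta, n)} \right) \times \left( 1 - C^*(\boldsymbol{c}_Y, r_Y \delta) A_Y k(k-1) \left[ r_Y (1 - \delta^2)^{1/2} \right]^{\beta_{\phi_n}(\boldsymbol{c}_Y, r_Y \delta, n)} \right), \tag{21}$$

$$p_e = 1 - C^*(0, \theta) A_X \left[ \left( 1 - \frac{\theta^2}{r_X^2} \right)^{1/2} \right]^{\beta_{\phi_n}(0, \theta, n)}. \tag{22}$$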
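The decision rule (20) needs only kernel evaluations: D(Z) is computed from the pairwise kernel values of the few examples, and the score is the average kernel value between x and those examples. The following is a minimal sketch of that rule. The name `make_few_shot_rule`, the Gaussian kernel with width `sigma`, the stand-in legacy classifier, and the hand-picked value of `theta` are all assumptions made for illustration. In particular, the sketch does not estimate Δ, so it does not check that θ lies in the interval required by the theorem.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel exp(-||x - y||^2 / (2 sigma^2)); sigma is an assumed width."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def make_few_shot_rule(examples, kernel, theta, legacy_classifier, new_label):
    """Build F_new as in (20) from k new examples, a kernel, and the legacy classifier F."""
    k = len(examples)
    # D(Z) = (1/k) * sqrt(sum_i sum_j kappa(x_i, x_j)).
    d_z = math.sqrt(sum(kernel(a, b) for a in examples for b in examples)) / k

    def f_new(x):
        # Average kernel value between x and the new examples, minus theta * D(Z).
        score = sum(kernel(z, x) for z in examples) / k - theta * d_z
        return new_label if score >= 0.0 else legacy_classifier(x)

    return f_new


if __name__ == "__main__":
    few = [[2.0, 2.0], [2.1, 1.9], [1.9, 2.1]]
    rule = make_few_shot_rule(few, gaussian_kernel, 0.5, lambda x: "old", "new")
    print(rule([2.0, 2.0]), rule([-3.0, -3.0]))
```

Points close to the new examples in the kernel's geometry receive the new label, while all other points are passed through to the legacy classifier unchanged, which is what allows the rule to leave existing behaviour largely intact.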
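Proof of Theorem 2. The proof of the theorem relies on the law of high dimension property captured in Theorem 1. According to this property, the probability that the parameter $\boldsymbol{c}_Y \in H$ determining concentration properties of the unknown distribution $P_Y$ is at most

$$U(k, \delta)^{1/2} = \left( \frac{r_Y^2}{k} + \frac{k-1}{k} r_Y \delta \right)^{1/2}$$

away in the space $H$ from the empirical mean

$$\bar{\phi}_n = \frac{1}{k} \sum_{i=1}^{k} \phi_n(\boldsymbol{x}_i)$$

is at least

$$1 - C^*(\boldsymbol{c}_Y, \delta r_Y) A_Y k(k-1) \left[(1 - \delta^2)^{1/2}\right]^{\beta_{\phi_n}(\boldsymbol{c}_Y, \delta r_Y, n)}. \tag{23}$$

Now, suppose that $\|\boldsymbol{c}_Y - \bar{\phi}_n\| \le U(k, \delta)^{1/2}$ holds true. Pick $0 < \theta < \Delta$ and consider two sets:

$$S_1 = \left\{ \xi \in H \ \middle|\ \left( \frac{\bar{\phi}_n}{\|\bar{\phi}_n\|}, \xi \right) - \theta = 0 \right\}$$

and

$$S_2 = \left\{ \xi \in H \ \middle|\ \left( \frac{\bar{\phi}_n}{\|\bar{\phi}_n\|}, \xi \right) - \left( \frac{\bar{\phi}_n}{\|\bar{\phi}_n\|}, \boldsymbol{c}_Y \right) = 0 \right\}.$$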
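The sets $S_1$ and $S_2$ define hyperplanes in $H$ which are parallel to each other, with the set $S_2$ containing the point $\boldsymbol{c}_Y$ (that is, $S_2$ passes through the vector $\boldsymbol{c}_Y$). We observe that

$$\min_{\xi :\, \|\xi - \bar{\phi}_n\| \le U(k, \delta)^{1/2}} \left( \frac{\bar{\phi}_n}{\|\bar{\phi}_n\|}, \xi \right) = \|\bar{\phi}_n\| - U(k, \delta)^{1/2} = \Delta,$$

since $\|\bar{\phi}_n\| = D(\mathcal{Z})$, and can therefore conclude that the set $S_1$ is at least

$$D(\mathcal{Z}) - U(k, \delta)^{1/2} - \theta = \Delta - \theta$$

away from the set $S_2$. Note that all points $\boldsymbol{x} \in \mathbb{R}^n$ for which

$$\left( \bar{\phi}_n, \phi_n(\boldsymbol{x}) \right) - \|\bar{\phi}_n\| \theta = \left( \bar{\phi}_n, \phi_n(\boldsymbol{x}) \right) - D(\mathcal{Z}) \theta = \frac{1}{k} \sum_{i=1}^{k} \kappa_n(\boldsymbol{x}_i, \boldsymbol{x}) - D(\mathcal{Z}) \theta > 0 \tag{24}$$

will be assigned label $\ell_{\mathrm{new}}$ from $\mathcal{L}_{\mathrm{new}}$ by the classifier $F_{\mathrm{new}}$. Let $\boldsymbol{u}$ be the orthogonal projection of $\boldsymbol{c}_Y$ onto the set $S_1$. Then the probability that (24) occurs for $\boldsymbol{x}$ drawn from $P_Y$ is

$$1 - \int_{C(\boldsymbol{u}, \|\boldsymbol{u} - \boldsymbol{c}_Y\|)} p_Y(\boldsymbol{x})\, d\boldsymbol{x},$$

where

$$C(\boldsymbol{u}, d) = \left\{ \boldsymbol{x} \in \mathbb{R}^n \ \middle|\ \left( \frac{\boldsymbol{u} - \boldsymbol{c}_Y}{\|\boldsymbol{u} - \boldsymbol{c}_Y\|}, \phi_n(\boldsymbol{x}) - \boldsymbol{c}_Y \right) - d > 0 \right\}.$$

Noticing that $\|\boldsymbol{u} - \boldsymbol{c}_Y\| \ge \Delta - \theta$ since it is just the separation distance between $S_1$ and $S_2$, this probability is at least

$$1 - \int_{C(\boldsymbol{u}, \Delta - \theta)} p_Y(\boldsymbol{x})\, d\boldsymbol{x}.$$

Taking Assumptions 1, 3 into account, the latter expression can be bounded from below as

$$1 - C^*(\boldsymbol{c}_Y, \Delta - \theta) A_Y \left[ \left( r_Y^2 - (\Delta - \theta)^2 \right)^{1/2} \right]^{\beta_{\phi_n}(\boldsymbol{c}_Y, \Delta - \theta, n)}.$$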
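This together with (23) assures that (21) holds. Let $\boldsymbol{x}$ be drawn from $P_X$. The probability that $F_{\mathrm{new}}(\boldsymbol{x}) \ne F(\boldsymbol{x})$ is

$$\int_{\left( \frac{\bar{\phi}_n}{\|\bar{\phi}_n\|}, \phi_n(\boldsymbol{x}) \right) - \theta > 0} p_X(\boldsymbol{x})\, d\boldsymbol{x}.$$

Introducing $\boldsymbol{v} = \frac{\theta}{\|\bar{\phi}_n\|} \bar{\phi}_n$, this probability may be estimated by

$$\begin{aligned} \int_{\|\boldsymbol{v} - \phi_n(\boldsymbol{x})\| \le (r_X^2 - \theta^2)^{1/2}} p_X(\boldsymbol{x})\, d\boldsymbol{x} &\le A_X \frac{1}{V_{\phi_n}(0, r_X, n)} \int_{\|\boldsymbol{v} - \phi_n(\boldsymbol{x})\| \le (r_X^2 - \theta^2)^{1/2}} 1\, d\boldsymbol{x} \\ &= A_X \frac{V_{\phi_n}(\boldsymbol{v}, (r_X^2 - \theta^2)^{1/2}, n)}{V_{\phi_n}(0, r_X, n)} \\ &\le C A_X \left[ \left( 1 - \frac{\theta^2}{r_X^2} \right)^{1/2} \right]^{\alpha_{\phi_n}(\boldsymbol{v}, 0, n)} \\ &\le C^*(0, \theta) A_X \left[ \left( 1 - \frac{\theta^2}{r_X^2} \right)^{1/2} \right]^{\beta_{\phi_n}(0, \theta, n)}, \end{aligned}$$

and the bound (22) follows.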
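The dependence of the bound (22) on dimension is easy to tabulate in the identity-map case. The sketch below assumes φ_n(x) = x, so that β = n and C* = 1 as in Remark 3, and takes A_X = 1 by default, which corresponds to a uniform density on the ball of radius r_X. These choices and the function name are illustrative assumptions.

```python
def p_e_identity_map(n, theta, r_X, A_X=1.0, C_star=1.0):
    """Right-hand side of (22) with beta = n, i.e. for the identity feature map."""
    if not 0.0 <= theta <= r_X:
        raise ValueError("theta must lie in [0, r_X]")
    return 1.0 - C_star * A_X * (1.0 - (theta / r_X) ** 2) ** (n / 2.0)


if __name__ == "__main__":
    # The guaranteed probability of preserving legacy decisions grows with n.
    for n in (2, 10, 100):
        print(n, p_e_identity_map(n, theta=0.5, r_X=1.0))
```

For a fixed threshold θ, the guaranteed probability of leaving the legacy classifier's decisions unchanged approaches one exponentially fast in n under these assumptions.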
Remark 4. The argument in the proof of Theorem 2 is largely based on validity of Assumption 1. This assumption is sufficient to produce estimates of the probability that two independent random samples from PY are quasi-orthogonal. At the same time, one could possibly obtain sharper or more feasible estimates if Assumption 1 in this part of the argument is replaced or supplemented by the one relating the ratios of volume integrals over spherical caps to the volume integrals (5) over relevant balls in (6).
4 Conclusion

This paper presents a very general treatment of the challenge of few-shot learning and its solutions based on sample averages. In this respect, the results generalise previous works in the area [9,10] and offer a new framework for future extensions, modifications, and analysis of relevant post-hoc AI error correction algorithms [11–13]. The main thrust of the framework is to explicitly include the influence of nonlinear feature transformations into the problem, assumptions, and solutions. This makes it possible to determine key desired properties of these nonlinear transformations, captured by Assumption 1, as well as the properties of data, specified by Assumptions 2, 3. Our main results show how these properties may influence the success of few-shot learning schemes based on sample averages. These assumptions relate the dimension of the original latent feature spaces to properties of nonlinear feature maps that are sufficient for efficient learning. Potentially, these assumptions could also serve as explicit high-level specifications for the task of shaping or learning these nonlinear transformations from data. Detailed analysis of these properties and their practical feasibility is beyond the scope of this theoretical study. As our numerical examples show (see Fig. 1), exploration of the impact of nonlinear feature maps and their corresponding kernels on quasi-orthogonality, volume compression, and separability is a non-trivial and creative intellectual challenge which will be the focus of our future work.
References

1. Bartlett, P.L., Long, P.M., Lugosi, G., Tsigler, A.: Benign overfitting in linear regression. Proc. National Acad. Sci. 117(48), 30063–30070 (2020)
2. Gorban, A.N., Grechuk, B., Mirkes, E.M., Stasenko, S.V., Tyukin, I.Y.: High-dimensional separability for one- and few-shot learning. Entropy 23(8), 1090 (2021)
3. Gorban, A.N., Tyukin, I.Y., Prokhorov, D.V., Sofeikov, K.I.: Approximation with random bases: pro et contra. Inf. Sci. 364–365, 129–145 (2016)
4. Kainen, P.C., Kůrková, V.: Quasiorthogonal dimension. In: Kosheleva, O., Shary, S.P., Xiang, G., Zapatrin, R. (eds.) Beyond Traditional Probabilistic Data Processing Techniques: Interval, Fuzzy etc. Methods and Their Applications. SCI, vol. 835, pp. 615–629. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-31041-7_35
5. Kainen, P.C., Kurkova, V.: Quasiorthogonal dimension of Euclidian spaces. Appl. Math. Lett. 6(3), 7–10 (1993)
6. Ledoux, M.: The concentration of measure phenomenon. Am. Math. Soc. Number 89 (2001)
7. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.-C.: Mobilenetv2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520 (2018)
8. Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: Advances in Neural Information Processing Systems, pp. 4077–4087 (2017)
9. Tyukin, I.Y., Gorban, A.N., Alkhudaydi, M.H., Zhou, Q.: Demystification of few-shot and one-shot learning. In: 2021 International Joint Conference on Neural Networks (IJCNN), pp. 1–7. IEEE (2021)
10. Tyukin, I.Y., Gorban, A.N., Grechuk, B., Green, S.: Kernel stochastic separation theorems and separability characterizations of kernel classifiers. In: 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–6. IEEE (2019)
11. Tyukin, I.Y., Gorban, A.N., Green, S., Prokhorov, D.: Fast construction of correcting ensembles for legacy artificial intelligence systems: algorithms and a case study. Inf. Sci. 485, 230–247 (2019)
12. Tyukin, I.Y., Gorban, A.N., McEwan, A.A., Meshkinfamfard, S., Tang, L.: Blessing of dimensionality at the edge and geometry of few-shot learning. Inf. Sci. 564, 124–143 (2021)
13. Tyukin, I.Y., Gorban, A.N., Sofeykov, K.I., Romanenko, I.: Knowledge transfer between artificial intelligence systems. Front. Neurorobotics 12, 49 (2018)
14. Vapnik, V.N.: An overview of statistical learning theory. IEEE Trans. Neural Netw. 10(5), 988–999 (1999)
15. Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., Wierstra, D.: Matching networks for one shot learning. In: Advances in Neural Information Processing Systems, pp. 3630–3638 (2016)
16. Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530 (2016)
17. Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning (still) requires rethinking generalization. Commun. ACM 64(3), 107–115 (2021)
Retention of Computing Students in a London-Based University During the Covid-19 Pandemic Using Learned Optimism as a Lens: A Statistical Analysis in R

Alexandros Chrysikos1(B), Indrajitrakuraj Ravi1, Dimitrios Stasinopoulos1, Robert Rigby1, and Stephen Catterall2

1 London Metropolitan University, 166-220 Holloway Road, London N7 8DB, London, UK
[email protected]
2 University of Wolverhampton, Wulfruna Street, Wolverhampton WV1 1LY, Wolverhampton, UK
Abstract. The aim of this research project is to investigate the low retention rate among foundation and first-year undergraduate students from the School of Computing and Digital Media in a London-based university. Specifically, the research is conducted during the Covid-19 pandemic using learned optimism as a lens. The research will aid the university in improving its retention rate, as the overall dropout rate has been increasing in the last few years. The current study employed an exploratory investigation approach, using statistical modelling analysis in R to predict behavioural patterns. The quantitative data analysis aims to support the efforts of the School of Computing and Digital Media of a London-based university to re-evaluate its retention strategies for foundation and first-year computing students. The main outcome of the analysis is that students with a foreign qualification are optimistic, while students with other or unknown qualifications are mildly pessimistic. In addition, students with a BTECH, Higher Education diploma or A level qualification are generally more pessimistic, especially if they are also of Black ethnicity or, if not of Black ethnicity, are aged under 34 and British.

Keywords: Learned Optimism · Student Retention · Computing · R Programming · Quantitative Research · Data Analysis
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): SAI 2023, LNNS 711, pp. 226–249, 2023. https://doi.org/10.1007/978-3-031-37717-4_16

1 Introduction

UK higher education providers have a good reputation around the world due to the high standards of quality provided and the extensive choice of subjects [38]. Times Higher Education publishes the World University Rankings every year, assessing an institution’s performance in four areas: Teaching, Research, Knowledge Transfer, and International Outlook. The ranking is trusted by students, teachers, industry experts and governments, as it includes more than 1500 universities across 93 countries. Four universities from the United Kingdom feature in the top 20 universities in the world, two of them based in London. The University of Oxford has been ranked number one worldwide for the last five years, and the University of Cambridge is ranked sixth. The London-based universities Imperial College and UCL are ranked 11th and 16th. In addition, these universities are ranked in the top 20 universities in the world offering computer science-based courses [39].

The student initial entry rate into higher education has been increasing every year. In 2018/19 the initial entry rate rose to 52% among 17–30-year-old students, compared to 42% in 2006/07 [14]. At the same time, the student dropout rate has also been increasing each year. Specifically, London-based universities have a higher dropout rate than the national average, and the gap between London and the national average has been widening every year as dropout rates increase. This statement is based on the research reports published by the Social Market Foundation in July 2017 [32] and the Mayor’s Skill Strategy in June 2018 [23]. Computing courses registered the highest percentage of dropouts of all streams of higher education for the 2014 to 2018 entries. Even though computer science-based courses register the highest dropout rates, the percentage of dropouts has seen a slight decline in the last few years. Based on the data published by HESA on student dropout following the year of entry for full-time undergraduate courses, 7 out of the 10 universities in the UK with the highest percentage of student dropouts are based in London. The London-based university studied was ranked 11th for the highest percentage of student dropouts for the 2018–2019 year of entry. Its percentage of dropouts has been increasing every year, and 23% of the total 2365 entrants discontinued their studies during the 2018–2019 year of entry [7].
The current research aims to identify factors causing the decrease in retention rates in foundation and first-year computing courses. Specifically, a survey was conducted among foundation and first-year students to collect relevant data for the study. The data was then analysed using R to identify the optimism levels of students across qualifications, age, gender, ethnicity, and level of study.

1.1 A Profile of the School of Computing and Digital Media

The London-based university studied is a public research university formed in 2002 through a merger of two other universities. The study to discover the optimism levels of students was conducted between 18 April and 18 May 2021. 64% of the students at the university are from a Black, Asian or Minority Ethnic (BAME) background [10]. The School of Computing and Digital Media registered a dropout rate of 22% for the year 2018–2019 in the first year of undergraduate computing courses. The survey was conducted among foundation and first-year students from the School of Computing and Digital Media during the pandemic, when students did not experience on-campus learning and all teaching was delivered online.
A. Chrysikos et al.
1.2 Aim of the Study

The current research is intended to help the university reduce its dropout rate, which has been increasing overall in the last few years. The study employed an exploratory investigation approach, using statistical modelling analysis in R to predict behavioural patterns. The quantitative data analysis aims to support the efforts of the School of Computing and Digital Media of a London-based university to re-evaluate its retention strategies for foundation and first year computing students.

1.3 Objectives

The motive of this research is to identify the optimism levels of the students in the foundation and first year undergraduate courses in the School of Computing and Digital Media using Learned Optimism as a lens. The objectives set to achieve this motive are as follows:

• Review the literature in the areas that contribute towards this research in analysing the optimism levels among the students.
• Conduct a survey to calculate the optimism levels among the foundation and first year undergraduate students from the School of Computing and Digital Media of the university studied.
• Analyse the information about admission rates and dropout rates for all first-year undergraduate courses in the UK and at a London-based university.
• Analyse the information about admission rates and dropout rates for the first-year undergraduate computing courses in the UK and at a London-based university.
• Identify and discuss the factors that lead to low retention rates.
• Perform statistical analysis in R on the data collected through the survey and identify the optimism levels among the students based on their qualifications, age, gender, ethnicity, and level of study.
• Interpret the results of the statistical analysis and put forward recommendations to improve the optimism levels among the students.
The current paper is structured as follows: Literature Review, Research Methodology, Data Collection, Data Analysis, Implications, Limitations, Recommendations for Further Research, and Conclusion.
2 Literature Review

This section discusses Learned Optimism, student retention rates for first-year undergraduate courses in the United Kingdom, the foundation and first year courses of the School of Computing and Digital Media of the university studied, and the factors behind the dropout rates for computing courses in the UK.
Retention of Computing Students in a London-Based University
2.1 Learned Optimism

Two studies in 2012 and 2013 identified several factors that contribute towards student retention. The factors are transitioning, learning and teaching, supporting students, participation and belonging, utilisation of data and information communication technology, and strategic change. More recently, the emphasis has shifted towards how student participation and a sense of belonging in higher education affect student retention, and how technology can be more effectively harnessed and applied, both in supporting students and in analysing and linking data to retention strategies [1].

The survey is aimed at discovering the student experience during or after the first or foundation year of the computing courses at the London-based university studied. It aims to identify the areas where the university can improve the student experience and engagement. The survey is adapted from Seligman’s ‘Learned Optimism’ to identify the behavioural pattern among the students and analyse its relationship with the retention rates. Seligman focused a great deal of his work on pessimism. However, his motive shifted the other way and he started focusing on the optimistic side of individuals [31]. He is considered the founding father of Positive Psychology and introduced the concept of ‘Learned Optimism’.

Learned Optimism uses a technique called the ‘ABCDE’ model. Albert Ellis, a psychologist and researcher, developed a model to understand the reaction of an individual to an adversity (Guide 3, 2021). The model, named the ‘ABC’ model, is based on:

• Adversity (A) – The situation that calls for a response.
• Belief (B) – How we interpret the event.
• Consequence (C) – The reaction that is triggered by the belief.

The ABC model is one of the most widely used models in cognitive-behavioural therapy (CBT) for treating mental health issues [30].
The ABC model was developed by Seligman into the ‘ABCDE’ model, which was introduced in his book “Authentic Happiness” in 2002. The ‘ABCDE’ model is a technique used in Learned Optimism. Seligman suggests that the ‘ABCDE’ approach can be used by identifying a pessimistic thought and then viewing it as if it had been suggested by someone who wanted to make our life miserable on purpose [36]. The ‘ABCDE’ model has two extra factors, specifically:

• Dispute (D) – The effort we expend to argue against or dispute the belief.
• Energization (E) – Realizing the change in energy after successfully self-disputing the belief.

To overcome a negative thought, a person can either distract themselves with something else, thereby avoiding the negative thought, or dispute the belief that is making them feel pessimistic. According to Seligman, there are four possible methods for disputing a belief:

1. Evidence
2. Alternatives
3. Implications
4. Usefulness

A person can use evidence to argue factually against the belief and show that the pessimistic thought serves no purpose and does not affect them. An analysis of alternatives identifies the possible causes of the belief and then acts on the causes that are relevant. Such analysis allows the person to think from a broader, more optimistic perspective. Considering the implications allows the person to continue, with a positive attitude, an activity that a negative event or belief would otherwise prevent them from doing. Usefulness asks the person to consider how useful it is to hold on to the negative belief. For example, the person can ask themselves how useful it is to think about the problem and whether thinking about it can do any good. This allows the person to test whether the belief has any meaning. A person can also use distraction rather than disputing thoughts to effectively continue an activity [35]. According to Seligman, a person can increase their happiness levels by disputing their negative beliefs.

The final part of the ‘ABCDE’ model is Energization. A successful dispute or distraction that causes the negative belief to either fade away or become invalidated creates a positive behavioural change in a person [35]. Working through the five stages of the model helps to reduce negative thoughts or overthinking and to increase optimism. According to Boykin, this technique can be suggested to students to handle challenging tasks more optimistically in their academic and everyday life. The same technique can also be used by any professional to handle challenges in their day-to-day life [9].

2.2 Student Retention in Higher Education

In 1992, the polytechnics and the central institutions in the UK were given university status through the Further and Higher Education Act 1992. These universities are referred to as the ‘Post-1992 universities’.
This led to an increase in the student population and wider student diversity. Moreover, the increasing emphasis on higher education as a free-market business has highlighted the need to ensure that students are both recruited and retained [4]. The Department for Education (DfE) in the United Kingdom has published a statistical report called the Higher Education Initial Participation (HEIP) report annually since 2014. The HEIP report contains the entry rates into higher education for students aged 17–30. Based on the latest report, published in November 2020, the student initial entry rate rose to 52% for 2018/19, compared with 42% for 2006/07. This is an increase of 10 percentage points in the initial entry rate into higher education over 12 years in the United Kingdom. However, the proportion of students being retained is falling, as shown by the increasing student dropout rates. According to the reports published by HESA, the student dropout rate has been increasing along with the increase in student admissions in the UK. The total number of full-time undergraduate entrants for the academic year of entry 2018/2019 was 426,485, compared to 403,125 in the academic year 2014/2015. Together with the increase in full-time entrants, the number of students dropping out has also increased, from 31,155 in 2014/2015 to 35,255 in 2018/2019. This can be interpreted as the student retention rate having decreased from 92.3% to 91.7% after the first year of
study. The data records students who dropped out of a full-time undergraduate course after 50 days from the commencement of the course. London has been witnessing a higher student dropout rate than the other regions of the United Kingdom. According to ‘Skills for Londoners’, a skills and adult education strategy report published by the Greater London Authority on behalf of Sadiq Khan, Mayor of London, in June 2018, the universities based in London have a higher dropout rate, at 10%, than the UK average of 8% [23]. The percentage of dropouts from London-based universities has therefore increased when compared to the outcomes of the research conducted by the Social Market Foundation, in which the dropout rate was 7.73% for London-based universities compared to the England average of 6.30% for the academic year of entry 2016/17 [33]. Universities have a benchmark that estimates the percentage of students expected to discontinue their study before the end of the course, based on the demographics of the student population. Out of 33 universities in London, 19 had a dropout rate higher than the benchmark rate in 2016/17 [26]. There have been several studies to identify the reasons for the increasing dropout rate at universities in the UK. According to a news article published by The Times, educational critics hold the universities responsible for the increase in dropouts because they relaxed admission eligibility to accommodate more students, including those who do not have the skills required for the course. Other critics suggest that students require greater assistance to adapt to the transition from school to university. The Office for Students has pointed out that the unconditional offers issued by the universities are the driving force behind this increase in dropout rates in the UK [6].
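The national retention percentages quoted earlier in this section follow directly from the HESA entrant and dropout counts. The short Python sketch below reproduces that arithmetic; the function name `retention_rate` is a hypothetical helper introduced here for illustration and is not part of the original analysis, which was carried out in R.

```python
def retention_rate(entrants: int, dropouts: int) -> float:
    """Percentage of entrants who did not drop out, to one decimal place."""
    return round(100 * (1 - dropouts / entrants), 1)


# Entrant and dropout counts as quoted in the text (HESA figures).
rate_2014_15 = retention_rate(entrants=403_125, dropouts=31_155)
rate_2018_19 = retention_rate(entrants=426_485, dropouts=35_255)

print(f"2014/15 retention: {rate_2014_15}%")
print(f"2018/19 retention: {rate_2018_19}%")
```

Expressing the change as a retention rate rather than a raw count matters here, because the number of entrants also grew over the period.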
Universities with lower student satisfaction scores in the National Student Survey have higher dropout rates on average [27]. A news article published in the Guardian states that universities feel that students starting their course during the pandemic might not be able to cope with the demands of the course. They feel that the students would be under stress, worried about paying for rent and food during the pandemic [15]. This section discussed student retention in higher education throughout the UK. The entry and dropout rates published by the official government agencies indicate that both student entry and dropout rates have increased. London-based universities have a higher dropout rate than universities in other regions. In the following section, student retention at the London-based university studied is explained.

Student Retention at the School of Computing and Digital Media. In contrast with the increasing entry rates at universities in recent years, the London-based university of the current study has seen a decline in entry rates. The university registered 3125 students on full-time undergraduate courses in the year of entry 2014/15. In 2018/19, the university registered 2365 students, which is 24% fewer than in 2014/15. However, the student dropout rate has been increasing even though the number of entrants has decreased. In 2014/15, 17.6% of the students who attended at least 50 days of their course dropped out of higher education after the first year of undergraduate study. By 2018/19 the dropout rate had risen by 5 percentage points to 22.6%. The dropout rate has exceeded the benchmark set for the university every year [7]. As seen in Table 1, the percentage of dropouts
has risen over the period as a whole, although it dipped in 2017/18. The following section explains the core reason why dropout rates in computing-based courses are the highest compared to other courses.

Table 1. Non-continuation following year of entry [7]

| Year of entry | Number of entrants | Percentage of dropouts | Benchmark (%) |
|---|---|---|---|
| 2014/15 | 3125 | 17.6 | 12.6 |
| 2015/16 | 2895 | 18 | 13.4 |
| 2016/17 | 3005 | 20.4 | 13.9 |
| 2017/18 | 2325 | 19.1 | 13.8 |
| 2018/19 | 2365 | 22.6 | 13.5 |
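To make the comparison against the benchmark concrete, the following Python sketch computes, for each year in Table 1, how far the dropout percentage sits above the benchmark, and lists the years in which the dropout percentage fell relative to the previous year. The names `TABLE_1`, `gap_over_benchmark` and `years_with_fall` are hypothetical and introduced only for this illustration; the paper itself does not describe such a calculation.

```python
# Rows copied from Table 1: (year of entry, entrants, dropout %, benchmark %).
TABLE_1 = [
    ("2014/15", 3125, 17.6, 12.6),
    ("2015/16", 2895, 18.0, 13.4),
    ("2016/17", 3005, 20.4, 13.9),
    ("2017/18", 2325, 19.1, 13.8),
    ("2018/19", 2365, 22.6, 13.5),
]


def gap_over_benchmark(rows):
    """Return {year: dropout % minus benchmark %} in percentage points."""
    return {year: round(dropout - benchmark, 1)
            for year, _entrants, dropout, benchmark in rows}


def years_with_fall(rows):
    """Return the years whose dropout % is lower than the previous year's."""
    return [rows[i][0] for i in range(1, len(rows))
            if rows[i][2] < rows[i - 1][2]]


gaps = gap_over_benchmark(TABLE_1)
falls = years_with_fall(TABLE_1)

for year, gap in gaps.items():
    print(f"{year}: {gap:+.1f} percentage points over benchmark")
print("Years with a year-on-year fall:", falls)
```

On these figures the gap is positive in every year and is widest in 2018/19.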
2.3 Student Retention in Computing Students

Full-time undergraduate courses that have computer science as the main subject have recorded the highest discontinuation rates amongst all courses in the UK. The data published by HESA is categorized into two age groups, young and mature. According to HESA, any student above the age of 21 is considered a mature student for undergraduate courses [16]. As seen in Table 2, among young students the dropout rate for courses that have computing as the main subject has reduced compared to previous years but remains the highest of all courses. Among mature students, the dropout rate for computing-related courses was highest in 2014/15, at 18.2%. Since then, the rate has come down to 16.5%, and courses in business and administrative studies now register the highest dropout rates [16].

Table 2. Student dropout rates (%) among computing students at universities in the UK [7]

| Year | Young | Mature |
|---|---|---|
| 2014/15 | 11 | 18.2 |
| 2015/16 | 10.6 | 17.5 |
| 2016/17 | 9.9 | 17.7 |
| 2017/18 | 9.8 | 17.5 |
| 2018/19 | 9.2 | 16.5 |
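The trend in Table 2 can be summarised as the change in each age group's dropout rate over the period, together with the gap between mature and young students in each year. The Python sketch below is one way to do this; `TABLE_2` and the helper names are hypothetical and introduced for illustration only.

```python
# Rows copied from Table 2: (year, young dropout %, mature dropout %).
TABLE_2 = [
    ("2014/15", 11.0, 18.2),
    ("2015/16", 10.6, 17.5),
    ("2016/17", 9.9, 17.7),
    ("2017/18", 9.8, 17.5),
    ("2018/19", 9.2, 16.5),
]


def change_over_period(rows, column):
    """Last value minus first value for one column, in percentage points."""
    return round(rows[-1][column] - rows[0][column], 1)


def mature_young_gap(rows):
    """Return {year: mature % minus young %} in percentage points."""
    return {year: round(mature - young, 1) for year, young, mature in rows}


young_change = change_over_period(TABLE_2, column=1)
mature_change = change_over_period(TABLE_2, column=2)
age_gap = mature_young_gap(TABLE_2)

print(f"Young: {young_change:+.1f} points, mature: {mature_change:+.1f} points")
print("Mature minus young, by year:", age_gap)
```

Both groups show a fall of under two percentage points, while the mature rate stays well above the young rate throughout.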
Student Retention of the School of Computing and Digital Media. The dropout rates for first-year computing students were analysed based on the results published by the school office for the years 2016/17 to 2019/20. The dropout rate has been determined
based on the number of students who were registered for the course and the number of students who withdrew from the course. In the 2016/17 academic year, as shown in Fig. 1 and Fig. 2, 52 students enrolled on the first year of the BSc (Hons) Computer Science course, with 31 in the September intake and 21 in the January intake. 12 students withdrew from the course, 6 from each intake (19% and 29% respectively), giving a combined dropout rate of 23% for the academic year.
Fig. 1. BSc (Hons) Computer Science 2016/17 Autumn Semester Results
Fig. 2. BSc (Hons) Computer Science 2016/17 Spring Semester Results
In the 2017/18 academic year, as shown in Fig. 3 and Fig. 4, 32 students enrolled on the first year of the BSc (Hons) Computer Science course, with 22 in the September intake and 12 in the January intake. From the September intake 3 students (13%) withdrew from the course and no student (0%) withdrew from the January intake, giving a 9% dropout rate for the year. In the 2018/19 academic year, as shown in Fig. 5 and Fig. 6, 46 students enrolled on the first year of the BSc (Hons) Computer Science course, with 29 in the September intake and 17 in the January intake. 8 students (28%) from the September intake and 2 students (12%) from the January intake withdrew from the course, giving a 22% dropout rate for the year. The School of Computing and Digital Media witnessed its highest dropout rate in the 2016/17 academic year at 23%, but the following year had a contrasting dropout
Fig. 3. BSc (Hons) Computer Science 2017/18 Autumn Semester Results
Fig. 4. BSc (Hons) Computer Science 2017/18 Spring Semester Results
Fig. 5. BSc (Hons) Computer Science 2018/19 Autumn Semester Results
Fig. 6. BSc (Hons) Computer Science 2018/19 Spring Semester Results
rate of 9%. However, the dropout rate for the academic year 2018 to 2019 increased to 22%. There are several factors contributing to this dropout rate that are explained in the following section.
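The per-intake and combined dropout percentages reported above are simple ratios of withdrawals to enrolments. As an illustration, the Python sketch below recomputes them from the 2016/17 and 2018/19 figures quoted in the text; the names `INTAKES`, `pct` and `summarise` are hypothetical helpers, and this is a sketch of the arithmetic rather than the procedure used by the school office.

```python
def pct(withdrawn: int, enrolled: int) -> int:
    """Withdrawals as a whole-number percentage of enrolments."""
    return round(100 * withdrawn / enrolled)


# (enrolled, withdrawn) per intake, as reported in the text.
INTAKES = {
    "2016/17": {"September": (31, 6), "January": (21, 6)},
    "2018/19": {"September": (29, 8), "January": (17, 2)},
}


def summarise(year: str):
    """Return (per-intake percentages, combined percentage) for one year."""
    intakes = INTAKES[year]
    per_intake = {name: pct(withdrawn, enrolled)
                  for name, (enrolled, withdrawn) in intakes.items()}
    total_enrolled = sum(enrolled for enrolled, _ in intakes.values())
    total_withdrawn = sum(withdrawn for _, withdrawn in intakes.values())
    return per_intake, pct(total_withdrawn, total_enrolled)


for year in INTAKES:
    per_intake, combined = summarise(year)
    print(year, per_intake, f"combined: {combined}%")
```

Note that the combined rate is computed from the pooled counts, not as the average of the two intake percentages, since the intakes differ in size.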
2.4 Reasons for Dropping Out

One of the major challenges is caused by students' prior experience and perception of the subject, based on academic experience from schools and colleges or on personal experience. This incongruity is perhaps the greatest challenge facing both computing as an academic subject and the IT industry. According to research conducted by the University of Roehampton, the percentage of all GCSE students who took GCSE Computer Science rose slightly from 12.1% to 12.4% between 2017 and 2018 [22]. This might be due to the increase in the number of GCSE Computer Science providers in the country. However, 8% of the schools that offered GCSE Computer Science in 2017 ceased to offer it in 2018. The teaching hours for computing and ICT in secondary schools in England dropped by 36% during the six-year period from 2012 to 2017. Years 10 and 11, which include GCSEs and other exams, are termed Key Stage 4 (KS4). The research found that there was a 47% decrease in the weekly teaching hours focused on computing and ICT. Students who did not take GCSE Computer Science are unlikely to receive any computing education in school beyond the age of 14. GCSE IT was removed from the curriculum in 2018, which reduced the total number of hours spent on teaching computing even further. Kemp, Berry and Wong feel that the dominance of computer science over computing and ICT would limit access to computing education for young people [22]. The national computing curriculum for England, which was revised in 2014, has faced criticism from many for being too heavily focused on coding. Charlotte Finn, vice president of programmes for Salesforce.org, suggests that the curriculum must change because it emphasizes coding over other computing skills, and that if this continues, many students risk being left behind [28].
Paul Clarke, Chief Technology Officer (CTO) at Ocado, suggests that the school curriculum should include technology concepts that are more relevant in the current market. Analysing and interpreting data to develop models, automation, artificial intelligence, and machine learning would make children more digitally literate. He feels that true digital literacy is more than teaching students to code, and that this can be achieved only by updating the curriculum [29]. Finally, for many computing students at the pre-entry and transition to higher education stage, there is a significant obstacle created by not possessing the requisite academic skills, experience and qualifications in important subjects such as mathematics and English, or by not having the resources and time while at university to acquire them quickly enough. While certainly not restricted to students who have entered higher education on a vocational route, this is an issue which is more likely to prove challenging for them. Lack of mathematical knowledge and skills, lack of familiarity with extended writing tasks, and a lack of oral presentation and research skills can prove overwhelming and can inform decisions by students to disengage and withdraw from higher education [34]. Based on the literature review, it is suggested that retention of students can be increased by improving their optimism levels. An optimistic student would be able to handle a tough situation or challenge faced during their studies in a positive manner and not become disillusioned. Students tend to engage and participate more in their studies when they feel a sense of belonging with the university. This might help develop a positive relationship between the university and the students. The increase in the number of universities post-1992, as well as in their admissions, has also been accompanied by a rise in dropout rates, especially in courses that involve computing.
This may be due to the changes in the curriculum at GCSE level, which have shifted the learning of computing towards coding and away from other areas of computing. The entry requirements for computing-based undergraduate courses have made it easier for students to enrol on them. However, students' ability to face the academic challenges of the course is affected by the quality and methods of learning at lower academic levels such as GCSE or A level. The dropout rate has been higher among students from black ethnic groups than among other ethnicities. This is one area that would require more extensive and in-depth research on improving retention rates among students from black ethnic groups. This is further addressed in Sect. 8. The university studied recorded a decreasing number of entrants into undergraduate courses between 2014 and 2018; however, the dropout rate has been increasing despite the decreasing admission rates. This research identifies the issues causing low retention rates through the optimism levels of the School of Computing and Digital Media students. The methods involved in this research and the structure of the survey are discussed in the following section.
3 Research Methodology

The data collected for this research is used to identify the optimism levels in the students and analyse them against factors such as age, gender, ethnicity, qualification, and nationality. A survey is conducted for the foundation and first year students in the School
of Computing and Digital Media. The data collected is statistically analysed using the R programming language. An exploratory data investigation is performed to summarize the characteristics of the dataset. A regression tree model is also used to build a statistical model for the data in this research. The data collection method employed in this research is a questionnaire in the form of a survey [2, 3]. The data collected through the survey was then explored to discover and summarize the characteristics of the data [8]. The data collection method is explained in detail in Sect. 4. The following sections explain the process followed in detail.

Exploratory Investigation. The exploratory investigation, or exploratory analysis, is conducted to summarise the characteristics of a dataset. It involves data visualization and graphical analysis to explain the data. The exploratory investigation can be used to explore the data and formulate hypotheses leading to new data collection and research. According to Tukey, the exploratory investigation fulfils the objectives of suggesting hypotheses for the causes of observed phenomena, assessing the assumptions on which statistical inference will be based, identifying appropriate statistical tools and methodology, and providing a basis for further data collection [5]. A few of the many graphical techniques used in exploratory investigation are the box plot, histogram, scatter plot, run chart, and frequency plot. In this research the frequency plot is used to visualize the data. A frequency distribution plot is created to identify the frequency of the variables against optimism scores. The outcomes of the frequency distribution plot are used in selecting the appropriate modelling technique for the data.

Regression Tree Model. Regression techniques consist of a single output or target variable, which is numerical, and one or more input or explanatory variables.
Regression analysis predicts the value of a continuous output variable from independent continuous and categorical input variables, based on their latent relationship in the data. Tree-based models are an exploratory method for discovering structure in a dataset. They are an alternative to linear and additive models for regression problems and to linear logistic and additive logistic models for classification problems. Tree-based models originated as an alternative to classical statistical methods such as linear regression, which are highly unstable when the variables are correlated [25]. Regression tree models are developed in a two-step procedure: recursive binary partitioning is applied to produce a tree structure, and the insignificant leaves are then pruned. Multivariate functions can be assigned to the terminal leaves for better generalisation. This allows a novel methodology of node partitioning in which a single optimisation model simultaneously identifies the breakpoint of the binary split and assigns multivariate functions to either leaf, leading to an efficient regression tree model [42]. The regression decision tree has a low misclassification rate and deviance compared to other models. The output of a decision tree is intuitive, assists in decision making and is easy to interpret. The path of the tree leads to identifying interesting subsets of modules along with their characteristics [18–21]. The quantitative approach involving survey questionnaires as the data collection method, exploratory data investigation and regression tree analysis model is used in this
research to identify student retention in foundation and first year computing students using learned optimism as a lens.
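The recursive binary partitioning step described in this section repeatedly looks for the split of one variable that makes the two resulting groups as homogeneous as possible in the target. The Python sketch below shows a single such step for one numeric variable, using the sum of squared errors (SSE) around each group's mean as the criterion. It is a minimal illustration under stated assumptions: the function names are hypothetical, the toy data is invented and is not the survey data, SSE is one common splitting criterion rather than necessarily the one used in the paper's R analysis, and pruning is not shown.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Find the threshold on one numeric feature that minimises total SSE.

    Returns (threshold, total_sse). Rows with x <= threshold go to the left
    child, the rest to the right. If no split improves on the unsplit node,
    the threshold is None.
    """
    best = (None, sse(ys))
    candidates = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Hypothetical toy data (NOT from the survey): ages and optimism scores.
ages = [18, 19, 20, 30, 32, 35]
scores = [-2, -1, -2, 3, 4, 3]

threshold, error = best_split(ages, scores)
print(f"Best split: age <= {threshold}, total SSE = {error:.3f}")
```

A full regression tree applies the same search to every explanatory variable at every node and recurses into each child until a stopping rule is met.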
4 Data Collection

Data collection is the process of gathering the data that the research requires to perform the statistical analysis. The data was collected using a questionnaire administered as a survey among the foundation and first year students of the School of Computing and Digital Media from 18 April 2021 to 18 May 2021. The questionnaire is adapted from ‘Learned Optimism’ by Seligman [35]. It contains 30 questions, each presenting a different scenario and asking the person what they would tell themselves. The recipients are asked to choose between two options. Each option is labelled with one of the letters A to L. The recipient marks their answer with an X next to the chosen option. The answers are recorded on the scoring sheet by question number and letter, and the total for each letter is calculated as the sum of the answers. The results in the test scoring sheet are then interpreted based on the Optimism Survey Interpretation Guide, as shown in Fig. 7. The Interpretation Guide helps to classify a student as an optimist, a pessimist, or average based on the total optimism score.

The Pessimism score is calculated as the sum of the points in Columns I, D, and F. The score describes the person’s attitude when a bad thing occurs. If the score is:

• 0–6, then the person is optimistic when bad things happen
• 7, then the person is average
• 8–14, then the person is pessimistic when bad things happen

The Optimism score is calculated as the sum of the points in Columns H, E, and B. The score describes the person’s attitude when a good thing occurs. If the score is:

• 10–15, then the person is optimistic when good things happen
• 8–9, then the person is average
• 0–7, then the person is pessimistic when good things happen

To obtain the total optimism score, the pessimism score is subtracted from the optimism score. If the score is:

• 4 and above, then the person is an optimist
• 2–3, then the person is average
• 1 or below, then the person is a pessimist

The primary objective of the survey is to identify the optimism levels among the students in the School of Computing and Digital Media. There are several factors that contribute to a student being optimistic, pessimistic or average. The survey collects information about the student’s age, ethnicity, gender, level of study, mode of attendance, and disability. The information about the student’s qualification and work
Fig. 7. Optimism Survey Interpretation Guide [35]
experience was mapped from the school’s enrolment database. The purpose of this stage is to investigate how the optimism score depends on the explanatory variables. The variables that are used in the analysis for this research are:

• Qualification – (Foreign qualifications, BTECH, Higher Education Diploma, A Level, GCSE, Other, N/A)
• Work Experience – (Yes or No)
• Disabled – (Yes or No)
• Ethnicity – (White, Black, Asian, Other, Not Known)
• Age – (In years)
• Gender – (Female, Male)
• British – (Yes or No)
• Mode of Attendance – (Full time or Part time)
• Level of Study – (Foundation or First year)

Of the 201 students who took part in the survey, an initial review classified them by optimism level as Optimist, Average or Pessimist, broken down by gender, nationality and course level, as shown in Table 3. All personal information, such as name, student ID and email address, was anonymised for data confidentiality. A statistical analysis was then performed using the R programming language to identify how the total optimism scores depend on the explanatory variables listed above. This is explained in detail in the following section.
Table 3. Optimism Survey Data

| | Total | Male | Female | British | Non-British | First Year Undergraduate | Foundation Year |
|---|---|---|---|---|---|---|---|
| Optimist | 38 | 31 | 7 | 10 | 28 | 15 | 23 |
| Average | 54 | 36 | 18 | 22 | 32 | 22 | 32 |
| Pessimist | 109 | 76 | 33 | 33 | 76 | 49 | 60 |
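The scoring scheme described in Sect. 4 can be sketched in a few lines of Python. The column letters and thresholds follow the text. The lower bound of 4 for an optimist is an assumption inferred from the other two bands (2–3 average, 1 or below pessimist), the example column totals are hypothetical, and the function names are introduced here for illustration only.

```python
def pessimism_score(cols):
    """Sum of columns I, D and F (attitude when bad things happen)."""
    return cols["I"] + cols["D"] + cols["F"]


def optimism_score(cols):
    """Sum of columns H, E and B (attitude when good things happen)."""
    return cols["H"] + cols["E"] + cols["B"]


def classify_bad(score):
    """Bands for the pessimism score: 0-6, 7, 8-14."""
    if score <= 6:
        return "optimistic"
    return "average" if score == 7 else "pessimistic"


def classify_good(score):
    """Bands for the optimism score: 10-15, 8-9, 0-7."""
    if score >= 10:
        return "optimistic"
    return "average" if score >= 8 else "pessimistic"


def classify_total(total):
    """Bands for the total score (optimism minus pessimism)."""
    if total >= 4:  # ASSUMPTION: lower bound inferred from the other bands
        return "Optimist"
    return "Average" if total >= 2 else "Pessimist"


# Hypothetical column totals from one scoring sheet (not real survey data).
sheet = {"I": 2, "D": 3, "F": 1, "H": 4, "E": 3, "B": 4}
total = optimism_score(sheet) - pessimism_score(sheet)
print(total, classify_total(total))
```

Applying `classify_total` to every respondent and counting the labels by gender, nationality and level of study would yield a cross-tabulation of the kind summarised in the table of survey data.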
5 Data Analysis

Data analysis is defined as the systematic process of applying statistical and logical techniques, such as cleaning, transforming, and modelling, to describe, illustrate and evaluate data. According to Shamoo and Resnik, there are various analytic procedures to develop inductive inferences from data and distinguish the phenomenon of interest from the statistical fluctuations present in the data [37]. The statistical analysis is carried out using the R programming language in RStudio. The data collected from the survey is analysed statistically and a model is developed to identify the relationship between the explanatory variables and the target variable. The data is loaded into RStudio to follow the objectives set in the research. The analysis begins with an exploratory investigation to identify the level of impact the variables have on the target, and then a model is developed for the data using a regression decision tree model.

5.1 Exploratory Investigation

In an initial exploratory investigation, the frequencies of each category of the variables were obtained, as well as the corresponding proportions. Specifically, a foreign qualification was the most common qualification (27% of students), 18% of students had work experience, 10% were disabled, 57% were of white ethnicity and 20% of black ethnicity, 71% were male, only 32% were British, 2% were part-time students, and 57% were at foundation year level while 43% were in the first year of an undergraduate degree. A frequency plot of the optimism scores was obtained (see Fig. 8). The following were observed from the exploratory investigation:

• The minimum optimism score was –8 and the maximum was +9.
• The distribution of the optimism score was approximately normal.
• The optimism score was then plotted against each of the explanatory variables. This showed that the main variable affecting the optimism score was the qualification.
• Students with either foreign qualifications or GCSEs had the highest median optimism scores.
The frequency plot obtained by plotting the explanatory variables against the optimism score reveals that the most optimistic score was 9 and the least optimistic score was –8. From the plot, the highest frequency was observed for the students with an optimism score of 2 followed by a score of 1. Based on these results, the statistical model is developed using a regression decision tree model.
Retention of Computing Students in a London-Based University
Fig. 8. Frequency Plot
5.2 Model for the Optimism Score

The optimism score was then modelled using the explanatory variables in a regression decision tree. The figure below (see Fig. 9) displays the resulting fitted tree model. The decision tree is built by successive binary splits of the data. Nodes of the tree are in blue, with the 7 nodes at the bottom level of the tree called leaves. At the top of the tree, the node contains all 201 cases (i.e., 100% of the data) and has a mean (or average) optimism score of 0.87. Note that the average optimism score of all 201 students falls under the category of pessimistic character, indicating that, on average, students are more pessimistic than optimistic based on the score interpretation guide. At the second level from the top of the tree, the students are divided into two groups (two branches of the tree leading to two nodes) based on the variable QUALIFICATION. On the left are students with QUALIFICATION = BTECH, HE_DIP or A_LEV, comprising 26% of the students
Fig. 9. Regression Decision Tree
A. Chrysikos et al.
with an average optimism score of −0.19, while on the right are the other students with QUALIFICATION = FOREIGN, GCSE, OTHER or NK (not known), comprising 74% of students with an average optimism score of 1.2. Each of these two nodes is then split further. The bottom level of the tree has 7 leaves, which partition the students into seven groups with similar values of optimism. From left to right on the bottom level, the 7 groups are as follows:
1. Students with QUALIFICATION = BTECH, HE_DIP or A_LEV and ETHNICITY = BLACK, comprising 5% of the students, with an average optimism score of −2.3.
2. Students with QUALIFICATION = BTECH, HE_DIP OR A_LEV and ETHNICITY = not BLACK, Age 0 is the learning rate.
(12)
Finding Eulerian Tours in Mazes
Equations (12), (5) and (11) show how ordinary Q-learning interacts with our new state vector.

3.5 Simplified Wall-Following Algorithm
It can be argued that known algorithms such as the “wall-follower” algorithm [6] could be used instead of memory-augmented tabular Q-learning. This method, also known as the left-hand rule, can solve any non-looped maze with a guarantee of not getting lost, and it will reach a different exit if there is one. However, it does not solve maze structures with loops. If we want to train a Q-table to behave like the “wall-follower” algorithm, we can modify the observation of the state to hold $o_t = (w(\text{west}), w(\text{east}), w(\text{north}), w(\text{south}), d_{entry})$. The $w$ inputs are sensory data that return 1 if there is a blocked cell in the specified direction and 0 otherwise, and $d_{entry}$ is the direction from which the agent entered the current cell. In this representation, the policy is a function of the observation vector $o_t$. Because the agent's observation is limited to its surrounding walls and the direction in which it entered the current cell, the only way to solve the maze is to follow the wall, which is closer to exploitation than to finding a solution by exploration. We can therefore train a Q-table of shape (2, 2, 2, 2, 4, 4) to represent the state given to the “wall-follower” algorithm. We include this kind of agent in the experiments below for comparison against the memory-augmentation method.
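As a rough illustration of the observation described above, the sketch below builds the four wall bits plus the entry direction and allocates a table with the stated shape (2, 2, 2, 2, 4, 4). The integer direction codes, the reading of the last axis as the four actions, the grid format and the helper name `observe` are assumptions made for illustration only; the paper does not specify them.

```python
from itertools import product

# Assumed integer codes for the four directions / actions (hypothetical).
WEST, EAST, NORTH, SOUTH = range(4)
MOVES = {WEST: (0, -1), EAST: (0, 1), NORTH: (-1, 0), SOUTH: (1, 0)}


def observe(maze, pos, d_entry):
    """Return o_t = (w(west), w(east), w(north), w(south), d_entry).

    `maze` is a list of strings in which '#' marks a blocked cell;
    anything outside the grid is treated as blocked (an assumption).
    """
    y, x = pos
    walls = []
    for direction in (WEST, EAST, NORTH, SOUTH):
        dy, dx = MOVES[direction]
        ny, nx = y + dy, x + dx
        inside = 0 <= ny < len(maze) and 0 <= nx < len(maze[0])
        walls.append(0 if inside and maze[ny][nx] != '#' else 1)
    return (*walls, d_entry)


# One Q-value per (observation, action): 2*2*2*2*4*4 = 256 entries.
q_table = {
    (*obs, action): 0.0
    for obs in product(range(2), range(2), range(2), range(2), range(4))
    for action in range(4)
}

corridor = ["###", "...", "###"]          # three open cells in a row
obs = observe(corridor, (1, 1), EAST)     # agent in the middle cell
```

Because the key contains no position, the same table entries are reused in every cell with the same local wall pattern, which is why such an agent can be trained on one maze and evaluated on others.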
4 Experiment Setup
Five distinct algorithms are compared on five different mazes; in each case, the augmented memory is added to the algorithm, and the results are compared. These algorithms are as follows:

– Conventional Q-learning algorithm.
– Proximal policy optimisation (PPO) algorithm.
– A2C, a synchronous, deterministic variant of Asynchronous Advantage Actor Critic (A3C).
– Deep Q Network (DQN), which builds on Fitted Q-Iteration (FQI) and uses a replay buffer, a target network and gradient clipping.
– Conventional Q-learning trained to behave like a “wall-following” algorithm.

The PPO, A2C and DQN implementations were taken from [16], and their specific hyperparameters were set to the default values. The external memory adds further detail to the neural network inputs of PPO, A2C and DQN.
M. Pisheh Var et al.
For the conventional Q-learning algorithm without memory augmentation, two inputs are taken from the agent's current position (y, x). With memory augmentation, the input is expanded to six values so that it additionally holds the vector $c_{(y_t, x_t)}$ shown in (7). For the reinforcement learning algorithms PPO, A2C and DQN, the critic and actor networks, where relevant for each algorithm, consisted of a multi-layer fully-connected neural network. Without memory augmentation, the observation holds the position of the agent, $o_t = (y_t, x_t)$, which is pre-processed into a one-hot encoded form, so the input to each algorithm is a one-hot encoded vector whose length is the maze height plus the maze width. With memory augmentation, the state is expanded to hold the vector $c_{(y_t, x_t)}$ shown in (7), which contains four values in the range of 0 and 1 in addition to the agent's current position; after one-hot pre-processing, the input therefore holds the sum of the maze's height and width plus eight extra values for the memory-augmented part of the observation. Each algorithm's network has two hidden layers of 64 nodes. The PPO and A2C algorithms used the tanh activation function, and the DQN algorithm used the ReLU activation function. The agent's x and y coordinates are one-hot encoded, and each element of C shown in (4) is also one-hot encoded; hence there are height + width + 4 × 2 inputs to the neural network. The last layer's activation function for all reinforcement learning algorithms was the identity function. The learning rate was set to 0.01 and the discount factor to γ = 0.9. The epsilon-greedy policy [18] was used to apply randomness in choosing the actions, with a 0.1 chance of a random action.
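The input size described above (height + width, plus 4 × 2 when the memory is used) can be checked with a small encoding sketch. It assumes that each of the four memory values is binary (0 or 1) so that it one-hot encodes into two slots; the function names are hypothetical and are not taken from the paper or from [16].

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def encode_observation(y, x, height, width, memory=None):
    """One-hot encode the position and, optionally, four binary memory values."""
    features = one_hot(y, height) + one_hot(x, width)
    if memory is not None:
        for bit in memory:            # assumed to be 0 or 1
            features += one_hot(bit, 2)
    return features


plain = encode_observation(1, 2, height=3, width=5)
augmented = encode_observation(1, 2, height=3, width=5, memory=(0, 1, 0, 1))
```

For a 3 × 5 grid, the plain input has 3 + 5 = 8 entries and the memory-augmented input has 8 + 4 × 2 = 16 entries.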
The epsilon-greedy policy allows exploration in training; it was removed in the validation phase of our experiments. Each algorithm was given 500 time steps to find the exit in one episode. At the start of each episode, the exit is moved to its next possible cell. Five different maze environments were created; they are shown in Fig. 2. The exit is indicated with a yellow circle labelled “EXIT”. The starting cell is drawn as a red square, and the cross-hatched space shows the maze walls, which the agent cannot move into. Each maze environment contains two or four possible exit cells (exits), depending on the maze's structure, and one exit is chosen for each episode. Hence, when the agent explores the maze, there is exactly one exit it is looking for. At the start of each episode, the exit is rotated through the set of possible exits for that maze, so each time the agent starts a new episode, the exit will have moved from the last time it explored. For each episode, the agent has 500 time steps to find the exit before the time runs out. After experimenting with all the possible exits, the total number of accumulated time steps to reach each exit is recorded.

The environments shown in Fig. 2 share a common feature: the agent must choose one of several diverging paths, only one of which leads to the exit. The agent starts on the red square cell and has to output actions to move into the allowed cells. In case of a collision with a wall, the agent remains in the same cell, and one step counts towards the total steps. The experiment was run for 10 different trials; in each trial, the agent was trained for 100,000 episodes, and the weights were updated after each action taken by the agent during each episode. The agent must devise a movement strategy to find and visit all possible exits of the environment. For instance, in the small 3-cell maze shown in Fig. 2a, the optimal number of steps to reach the exit cell on the right side of the maze is 1. After reaching the exit cell, the agent's position is reset. Therefore, the total time steps to reach the exit on the left side of the maze through the exit we already visited is 3. The optimal number of steps to reach both exits is thus 4.

The five test mazes shown in Fig. 2 are named and described in Table 1. First, the table introduces each maze environment, followed by the number of open cells and optimal steps to reach all the exits planned for the maze.

Table 1. List of Mazes Followed by the Number of Cells Capable of being Traversed

Maze Name                         Total Traversable Cells
Small Corridor Maze (Fig. 2a)     3
Long Corridor Maze (Fig. 2b)      18
T-shaped Maze (Fig. 2c)           9
Cross Maze (Fig. 2d)              9
Complex Looped Maze (Fig. 2e)     188
Off-policy and on-policy are the standard methods used in Q-learning; off-policy was chosen for the Q-learning method because it changes Q-values independently from its previous policies [21]. Since the “wall-following” Q-learning algorithm does not need to be trained on each maze, we train it only on the cross maze shown in Fig. 2d and validate it on the rest of the mazes defined in Table 1.
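A minimal sketch of the off-policy tabular update with epsilon-greedy action selection is given below, using the learning rate (0.01), discount factor (0.9) and exploration chance (0.1) quoted in the setup. The dictionary-based table, the function names and the reward values used in the example are assumptions for illustration; the reward scheme is not stated in this section.

```python
import random

ALPHA, GAMMA, EPSILON = 0.01, 0.9, 0.1   # values quoted in the setup
N_ACTIONS = 4                            # assumed: west, east, north, south


def choose_action(q, state, rng, epsilon=EPSILON):
    """Epsilon-greedy: random action with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.randrange(N_ACTIONS)
    values = [q.get((state, a), 0.0) for a in range(N_ACTIONS)]
    return values.index(max(values))


def q_update(q, state, action, reward, next_state, done):
    """Off-policy target: bootstrap from the best next action,
    regardless of which action the behaviour policy takes next."""
    best_next = 0.0 if done else max(
        q.get((next_state, a), 0.0) for a in range(N_ACTIONS)
    )
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + ALPHA * (reward + GAMMA * best_next - old)
    return q[(state, action)]


q = {}
first = q_update(q, (1, 1), 0, 1.0, (1, 2), done=True)    # hypothetical reward
q[((1, 2), 3)] = 2.0
second = q_update(q, (1, 1), 0, 0.0, (1, 2), done=False)
greedy = choose_action({((0, 0), 2): 1.0}, (0, 0), random.Random(0), epsilon=0.0)
```

Setting `epsilon=0.0` corresponds to the validation phase, in which exploration is switched off and the greedy action is always taken.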
Fig. 2. Visual Representation of the Mazes Purposed for the Experiment
5 Results
Learning algorithms such as tabular Q-learning, DQN, A2C, and PPO were tested on each maze listed in Table 2; an agent performs well when it minimises the accumulated steps. The results are compared against the accumulated steps of the A∗ search algorithm to reach all the exits in the maze. Figure 3 shows the path that the tabular Q-learning method with memory took to reach the exit. White cells indicate unvisited cells, red cells correspond to the agent's path, and cross-hatched areas indicate blocked areas that the agent cannot enter. Each maze's exits rotate during the 100,000 training iterations; the policies are frozen after one episode. A fresh, dedicated evaluation episode then starts with the policies frozen and no epsilon-greedy exploration.

Table 2. Performance of each Algorithm at 100,000 Episodes with a Frozen Policy and No Epsilon-Greedy Exploration. The Errors Indicate the Standard Error over 10 Trials

                                                              Average total steps reaching all exits
Maze Name             A∗ Accumulated Steps  Algorithm method    With external memory  Without external memory
Small Corridor        4                     Tabular Q-learning  4.0 ± 0.0             501.0 ± 0.0
                                            PPO                 77.22 ± 34.61         600.8 ± 124.47
                                            A2C                 361.33 ± 72.94        800.4 ± 81.48
                                            DQN                 445.88 ± 55.11        47.2 ± 10.83
Long Corridor maze    30                    Tabular Q-learning  30.4 ± 0.4            902.6 ± 64.93
                                            PPO                 215.0 ± 67.33         1000.0 ± 0.0
                                            A2C                 786.44 ± 84.44        1000.0 ± 0.0
                                            DQN                 1000.0 ± 0.0          948.4 ± 35.07
T-shaped maze         16                    Tabular Q-learning  16.0 ± 0.0            703.6 ± 80.67
                                            PPO                 47.66 ± 8.51          1000.0 ± 0.0
                                            A2C                 230.22 ± 105.02       1000.0 ± 0.0
                                            DQN                 1000.0 ± 0.0          738.4 ± 93.82
Cross maze            32                    Tabular Q-learning  32 ± 0.0              1701.2 ± 81.32
                                            PPO                 652.22 ± 162.91       1352.6 ± 149.4
                                            A2C                 1452.77 ± 49.97       1452.2 ± 156.6
                                            DQN                 1557.44 ± 55.32       887.0 ± 75.32
Complex Looped maze   232                   Tabular Q-learning  948.0 ± 52.0          1951.7 ± 48.3
                                            PPO                 2000.0 ± 0.0          2000.0 ± 0.0
                                            A2C                 1950.55 ± 49.44       2000.0 ± 0.0
                                            DQN                 2000.0 ± 0.0          1667.9 ± 77.10
The “wall-follower” results for each maze environment are shown in Table 3.
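The "mean ± standard error over 10 trials" entries in the tables can be computed as sketched below. The ten trial totals are hypothetical and were made up only to show the arithmetic; they are not the authors' data. The use of the sample standard deviation (n − 1 in the denominator) is likewise an assumption, since the paper does not say which estimator was used.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_standard_error(samples):
    """Return (mean, standard error), with SE = sample std / sqrt(n)."""
    return mean(samples), stdev(samples) / sqrt(len(samples))


# Hypothetical accumulated-step totals from 10 trials (illustrative only).
trial_totals = [30, 30, 30, 30, 30, 30, 30, 30, 30, 34]
avg, sem = mean_and_standard_error(trial_totals)
summary = f"{avg:.1f} ± {sem:.1f}"
```

A standard error of 0.0, as reported for several entries, simply means that all ten trials produced the same total.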
Fig. 3. Paths Made by the Memory Tabular Q-Learning on the Specified Mazes
6 Discussion
In Table 2, we can observe the average total steps each algorithm takes to reach all possible exits in each maze defined in Table 1. The small corridor maze consists of 3 open cells, and the agent starts in the middle one. To achieve the optimal accumulated steps to reach both exits, the agent has to devise a movement strategy to reach one exit, return to the centre, and move to the exit at the other end of the corridor. The optimal accumulated number of steps to reach both exits of the small corridor maze is 4. Tabular Q-learning without memory accumulated 501 steps, which indicates that the agent did not find the second exit and ran out of time. Q-learning with memory achieved better performance than PPO, A2C and DQN, even though the same memory architecture was used to help those algorithms. The memory architecture we designed significantly helped tabular Q-learning when training for 100,000 iterations. It can be seen in Table 2 that the tabular Q-learning with memory achieved 931.661 ± 0.678 average total steps to reach all exits in the complex looped maze, which shows that a maze as big as the complex looped maze requires more time to optimise.

Comparatively, PPO and A2C with our augmented memory performed better than those without external memory. However, DQN's performance suffered from our augmented memory method. The state representation $s_t = (y_t, x_t, c_{(y_t, x_t)})$ added unparalleled access to different state representations due to the $C(y_t, x_t, d_t)$ update method shown in (5). The performance of PPO, A2C and DQN did not reach the optimal accumulated steps. It can be assumed that PPO, A2C and DQN needed further adjustments, especially in their network structure, because these algorithms have proven to be sensitive to hyper-parameters [14].

Table 3. Performance of “wall-follower” Algorithm Learned by Q-Learning at 100,000 Episodes. The Errors Indicate the Standard Error over 10 Trials

Maze Name             A∗ Accumulated Steps  Algorithm method           Average total of steps to reach all exits
Small Corridor        4                     Wall-following Q-learning  4 ± 0.0
Long Corridor maze    30                    Wall-following Q-learning  221 ± 0.0
T-shaped maze         16                    Wall-following Q-learning  42 ± 0.0
Cross maze            32                    Wall-following Q-learning  32 ± 0.0
Complex Looped maze   232                   Wall-following Q-learning  1731 ± 0.0

Table 3 shows that the tabular Q-learning agent learned to perform like a “wall-follower” algorithm and solved perfect mazes. However, the algorithm struggled with the complex looped maze, as expected. Comparatively, our Q-learning with an external memory performed better and reached all exits. Moreover, if a maze has two potential exits, reaching the potentially closer exit first is more efficient. Our augmented-memory tabular Q-learning follows this rule, whereas the “wall-follower” behaviour does not prioritise reaching the potentially closer exit first. The state representation given to the “wall-following” algorithm reveals the solution to the Q-table, and it will solve any given perfect maze with no loops. In contrast, the state representation given to tabular Q-learning with memory reveals only the agent's location and the visit history of its neighbouring cells.

For RL algorithms such as DQN, A2C and PPO, we also attempted to one-hot encode the observation state into a large flattened maze representation. Unfortunately, this did not improve the results compared to the state vector that includes the position of the agent and the memory we designed.
7 Conclusion
In deep learning, recurrent neural networks are essential when each piece of information, through time, is needed to solve a problem. For example, in our implemented environment, where mazes require the agent to turn back after reaching a dead-end, the agent must have this information looped back to its algorithm to be able to perform this movement strategy. Recurrent neural networks can be computationally expensive, and it is difficult to train them; other difficulties, such as gradient vanishing, can be faced while using recurrent neural networks. In addition, it may become tedious to adjust all the activation functions in processing long sequences.
In this paper, we presented memory augmentation as a straightforward way to add memory to work with maze-solving algorithms so that it rivals recurrent nodes without the difficulties that come with training them. Moreover, we showed that this simple state representation significantly benefited the tabular Q-learning algorithm, where it could perform better than the “wall-follower” generic maze solver. It will be helpful to change deep reinforcement learning algorithms such as PPO, DQN and A2C network structure in the future and try out different update methods. It will be helpful to integrate the solution we implemented in this paper to solve other discrete space environments to understand the limitations and advantages of this memory-augmented method.
References

1. Euler tour technique, December 2020
2. Alexopoulos, C.: A note on state-space decomposition methods for analyzing stochastic flow networks. IEEE Trans. Reliab. 44(2), 354–357 (1995)
3. Cayton, L.: Fast nearest neighbor retrieval for Bregman divergences. In: Proceedings of the 25th International Conference on Machine Learning, pp. 112–119 (2008)
4. Chen, C., Ying, V., Laird, D.: Deep q-learning with recurrent neural networks. Stanford Cs229 Course Report 4, 3 (2016)
5. Dearden, R., Friedman, N., Russell, S.: Bayesian q-learning. In: Proceedings of the Fifteenth National/Tenth Conference on Artificial Intelligence/Innovative Applications of Artificial Intelligence, AAAI 1998/IAAI 1998, USA, 1998, pp. 761–768. American Association for Artificial Intelligence (1998)
6. Del Rosario, J.R.B., et al.: Modelling and characterization of a maze-solving mobile robot using wall follower algorithm. In: Applied Mechanics and Materials, vol. 446, pp. 1245–1249. Trans Tech Publ (2014)
7. Ilin, R., Kozma, R., Werbos, P.J.: Efficient learning in cellular simultaneous recurrent neural networks-the case of maze navigation problem. In: 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pp. 324–329. IEEE (2007)
8. Jaderberg, M., et al.: Reinforcement learning with unsupervised auxiliary tasks. CoRR, abs/1611.05397 (2016)
9. Lin, J.-H., Vitter, J.S.: A theory for memory-based learning. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 103–115 (1992)
10. Meng, L., Gorbet, R., Kulić, D.: Memory-based deep reinforcement learning for POMDPS. In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5619–5626. IEEE (2021)
11. Meng, L., Gorbet, R., Kulić, D.: Partial observability during DRL for robot control. arXiv preprint arXiv:2209.04999 (2022)
12. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
13. Nielsen, F.: Hierarchical clustering. In: Introduction to HPC with MPI for Data Science. UTCS, pp. 195–211. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-21903-5_8
14. Olsson, M., Malm, S., Witt, K.: Evaluating the effects of hyperparameter optimization in Vizdoom (2022)
15. Osmanković, D., Konjicija, S.: Implementation of q-learning algorithm for solving maze problem. In: 2011 Proceedings of the 34th International Convention MIPRO, pp. 1619–1622. IEEE (2011)
16. Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., Dormann, N.: Stable-baselines3: reliable reinforcement learning implementations. J. Mach. Learn. Res. 22(268), 1–8 (2021)
17. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323(6088), 533–536 (1986)
18. Russell, S.J., Norvig, P.: Artificial Intelligence: A Modern Approach. Pearson Education Limited, Malaysia (2016)
19. Teh, H.Y., Kempa-Liehr, A.W., Wang, K.I.-K.: Sensor data quality: a systematic review. J. Big Data 7(1), 1–49 (2020)
20. Tijsma, A.D., Drugan, M.M., Wiering, M.A.: Comparing exploration strategies for q-learning in random stochastic mazes. In: 2016 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1–8 (2016)
21. Watkins, C.J.C.H., Dayan, P.: Q-learning. Mach. Learn. 8(3–4), 279–292 (1992)
22. Wierstra, D., Foerster, A., Peters, J., Schmidhuber, J.: Solving deep memory POMDPs with recurrent policy gradients. In: de Sá, J.M., Alexandre, L.A., Duch, W., Mandic, D. (eds.) ICANN 2007. LNCS, vol. 4668, pp. 697–706. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74690-4_71
23. Wu, Z., Wang, X., Gonzalez, J.E., Goldstein, T., Davis, L.S.: Ace: adapting to changing environments for semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2121–2130 (2019)
Deep Learning Based Shadow Removal: Target to Current Methodology Flaws Shi-Jinn Horng1,2(B) and Cheng-En Zhuang1 1 Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei 10607, Taiwan R.O.C. [email protected] 2 Department of Computer Science and Information Engineering, Asia University, Taichung 41354, Taiwan R.O.C.
Abstract. Shadow removal, which converts images with shadows into images without shadows, is currently an immature technology. When removing shadows, various factors need to be considered, including illumination and environment. Traditional image processing methods are complex and cumbersome, and their results are not ideal. Recently, with the rapid development of deep learning, various papers have proposed different methods for this task, and the results have improved greatly, but some problems remain. At present, deep-learning-based shadow removal suffers from three problems. First, there is color inconsistency: it is difficult to restore the correct color in the shadow regions, which show a significant color difference from the non-shadow regions after removal. Second, the shadow boundaries remain clearly visible in the image. Last, it is difficult to collect datasets in this field, resulting in a lack of training sets and making the model unable to adapt to various scenarios. We propose a solution to each of these three problems. The final model effectively improves both the root mean square error (RMSE) and the structural similarity index (SSIM) of the shadow and non-shadow regions.

Keywords: Shadow Removal · Deep Learning · Illumination Model · Computer Vision
1 Introduction

In real life, a shadow is created when light is blocked by an object. Shadows may impair vision and affect daily life: the content of a book is obscured, photographs are covered by shadow, or a road is covered by shadow and the driver overlooks obstacles on it. In recent years, with the rapid development of deep learning, applications of computer vision have become more and more popular, and the objects used in face recognition, semantic segmentation, object detection and other technologies may be hidden by shadow, resulting in poor results. The purpose of shadow removal is to convert a shadow image into a shadow-free image. This is a difficult and challenging task. In shadow removal, illumination variation (the illumination intensity differs at different times of day, so that not only the shadow regions but also the pixels in the non-shadow regions differ), scene limitations (for some scenes, such as under a tree or next to a building, a shadow-free image cannot be obtained) and other environmental factors all affect the result. In recent years, deep learning has been applied to the problems of shadow removal, such as inconsistent colors and obvious boundaries. Because datasets are difficult to collect, the training set is enlarged with self-produced data to improve the model. In this paper, we build on the SP + M Net model [1]: the network is modified so that shadow regions are restored to a more similar color, and a shadow-boundary restoration network is added to calculate the shadow boundaries more accurately and reduce the boundary blur caused by noise at the shadow boundaries. Finally, we use the training method proposed by SynShadow [2] to address the shortage of training data.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): SAI 2023, LNNS 711, pp. 340–351, 2023. https://doi.org/10.1007/978-3-031-37717-4_23
2 Related Work

Shadow Removal: Among traditional methods, Xiao et al. [3] calculated the color difference between shadow and non-shadow regions and converted the color of the shadow regions to that of the non-shadow regions. Guo et al. [4] established an illumination model from the influence of direct and ambient light on objects, detected the shadow regions from the color and texture changes in the picture, and relit the shadow regions. Recently, deep learning has developed rapidly [13]. Qu et al. [5] is the first paper that applies deep learning in the field of shadow removal: a neural network computes a shadow-free image from the shadow image, and the poor performance on the penumbra was addressed and effectively improved. Since its publication, the Generative Adversarial Network (GAN) [6] has attracted much attention in the field of deep learning. In both image processing and NLP, this concept has been used to achieve SOTA results, and various novel applications have been developed. Mask-ShadowGAN [7] improves CycleGAN [8] and applies it to shadow removal to overcome the limitations of shadow removal datasets. Datasets are a very important factor in deep learning applications. Shadow removal datasets require both shadow and shadow-free images, but in certain scenes, such as under trees or next to large buildings, shadow-free images are difficult to obtain, so GAN [6] is used to generate shadow images, increasing the amount of data available to train the model. SP + M Net [1] combined deep learning with an illumination model [10] to estimate the difference between the shadow regions of the shadow image and the corresponding regions in the shadow-free image, and then relit the shadow regions so that they match the same regions in the shadow-free image.
Shadow Synthesis: In the field of shadow removal, obtaining a large-scale, diverse and high-quality dataset is difficult. In real life, a shadow image needs a shadow-free image as the ground truth. To obtain a shadow image and a shadow-free image of a scene where the occluder cannot be added or removed by hand, one must wait for the sunlight to move, which causes the shadow image and the ground truth to differ. The illumination intensity then differs not only in the shadow regions but also in the non-shadow regions, and this affects the training data. Generally, the model is expected to remove shadows only in the shadow regions while preserving the other pixels of the shadow image.
S.-J. Horng and C.-E. Zhuang
In addition, it is difficult to obtain the ground truth in some scenes, such as under a tree or next to a building. Because of these problems, acquiring datasets also takes a long time, and many papers have proposed methods for self-produced datasets. Mask-ShadowGAN [7] collected the USR dataset, which contains unpaired data, and used the shadow-free images in USR with the shadow masks of ISTD [9] to generate shadow images. DHAN [11] also used the shadow-free images of USR [7] and the shadow masks of ISTD [9] to train a network that generates shadow masks, and then combined the generated shadow masks with shadow-free images to obtain shadow images.
3 Method

SP + M Net [1] uses a physical illumination model and image decomposition to remove shadows. The idea is to relight the shadow in the shadow image. The illumination model is established through a physical formula [10], and the parameters estimated by the model are used to linearly transform the image so that the shadow image is relit. At this point, the shadow regions are restored to their original brightness, while the non-shadow regions are overexposed. The relit image is combined with the shadow matte estimated by the M-Net model to obtain the relit shadow regions, which are then overlaid on the shadow image; the combined result is a shadow-free image. We follow this theory and process and improve it to address the shortcomings in color consistency and the obvious boundaries. Our revised framework and architecture are shown in Fig. 1. The reasons for and details of the modifications are introduced in Sect. 3.2 and Sect. 3.3, respectively.

3.1 Model and Image Decomposition

In this section, we introduce the illumination model. The goal is to find parameters that convert the shadowed pixels in the shadow image to the corresponding unshadowed pixels in the shadow-free image through a linear transformation. We define a shadow-free image and a shadow image. The shadow-free image is

$$I_x^{\text{shadow-free}}(\lambda) = L_x^d(\lambda) R_x(\lambda) + L_x^a(\lambda) R_x(\lambda) \tag{1}$$

where $I_x^{\text{shadow-free}}(\lambda)$ represents the light reflected at $x$ at wavelength $\lambda$, $L$ is the illumination and $R$ is the reflectance; the light received at $x$ consists not only of direct light $L^d$ but also of ambient light $L^a$. When region $x$ is occluded by an object, a shadow is generated, which is defined as follows:

$$I_x^{\text{shadow}}(\lambda) = a_x(\lambda) L_x^a(\lambda) R_x(\lambda) \tag{2}$$

The direct light cannot reach $x$, so only the ambient light remains. The occluder may also block part of the ambient light, so an attenuation factor $a_x(\lambda)$ is introduced to represent the intensity of the ambient light at $x$. From Eqs. (1) and (2), the shadow-free image can be written as

$$I_x^{\text{shadow-free}}(\lambda) = L_x^d(\lambda) R_x(\lambda) + a_x(\lambda)^{-1} I_x^{\text{shadow}}(\lambda) \tag{3}$$

Through this linear relationship, the inverse attenuation factor $a_x(\lambda)^{-1}$ is set as $\omega$ and the direct-light term $L_x^d(\lambda) R_x(\lambda)$ is set as $b$. When shadowed pixels are converted to unshadowed pixels, each of the RGB channels needs to be converted, which is defined as

$$I_x^{\text{shadow-free}}(k) = b_k + \omega_k I_x^{\text{shadow}}(k) \tag{4}$$

where $I_x(k)$ is the value of RGB channel $k$ in the image, $b_k$ is the direct light in channel $k$, and $\omega_k$ is the attenuation-related factor in channel $k$. With Eq. (4) and the parameters $(\omega, b)$ calculated by SP-NeSt, we relight the shadow image as

$$I_i^{\text{relit}} = \omega \cdot I_i^{\text{shadow}} + b \tag{5}$$

where $i$ indexes each pixel in the image; that is, all pixels in the image are relit.
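The relighting step of Eq. (5) is a per-channel linear transform applied to every pixel. The sketch below assumes an image stored as nested lists of RGB tuples with values in [0, 1]; the function name and the example parameter values are hypothetical, and in the paper ω and b are predicted by the network rather than chosen by hand.

```python
def relight(shadow_image, omega, b):
    """Apply Eq. (5) to every pixel: relit = omega * shadow + b, per RGB channel."""
    return [
        [
            tuple(w * value + offset for value, w, offset in zip(pixel, omega, b))
            for pixel in row
        ]
        for row in shadow_image
    ]


# 1 x 2 image: a shadowed pixel and a lit pixel (illustrative values).
shadow_image = [[(0.2, 0.1, 0.05), (0.6, 0.6, 0.6)]]
relit = relight(shadow_image, omega=(2.0, 3.0, 4.0), b=(0.1, 0.1, 0.1))
```

No clipping is applied here, so the lit pixel ends up above 1.0 in every channel; this is the overexposure of non-shadow regions that the text mentions.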
Fig. 1. Shadow Removal Framework based on SP + M Net [1]
After introducing the illumination model, it is necessary to calculate the shadow regions. SP + M Net [1] proposed the concept of shadow matte. Because shadow mask is a binary image, it only contains the values of 0 and 1. After the shadow image is lit, it will only replace the part with the mask value of 1, which is equivalent to only lit the umbra regions. In most cases, the shadow is not only the umbra regions, in the case of no direct light, there is a penumbra. If we directly use the value in mask to lit, the penumbra regions cannot be lit, resulting in the shortcomings of obvious shadow boundaries. Shadow matte is equivalent to a soft mask, which can improve shortcomings of mask. Using the illumination model and shadow matte, shadow-free image can be obtained by the following formula: I shadow−free = I shadow · (1 − α) + I relit · α
(6)
where I relit is relit image, and α is shadow matte estimated from model. We want to maintain shadow regions in relit image and non-shadow regions in shadow image. 3.2 Color Inconsistency We hope to restore the shadow regions to unshadowed pixels in the shadow-free image. If the shadow regions cannot be processed correctly, the restored shadow regions may be
S.-J. Horng and C.-E. Zhuang
significantly different from the unshadowed pixels in the shadow-free image. As shown in Fig. 2, the color of the umbrella is clearly different from that of the surrounding regions after the shadow is removed.
Fig. 2. Color Inconsistency
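To make the roles of Eqs. (5) and (6) concrete, the sketch below composites a one-dimensional grayscale scanline that crosses a shadow boundary. All numbers are hypothetical and chosen so that the arithmetic is easy to follow. It illustrates two effects under these assumptions: a binary mask leaves the penumbra pixel dark, and a biased offset b leaves the restored shadow region with a color that differs from its surroundings.

```python
from typing import List


def relight_1d(shadow: List[float], omega: float, b: float) -> List[float]:
    """Eq. (5) on a grayscale scanline: I_relit = omega * I_shadow + b."""
    return [omega * v + b for v in shadow]


def composite(shadow: List[float], relit: List[float], alpha: List[float]) -> List[float]:
    """Eq. (6): I_free = I_shadow * (1 - alpha) + I_relit * alpha, element-wise."""
    return [s * (1.0 - a) + r * a for s, r, a in zip(shadow, relit, alpha)]


# Hypothetical scanline: lit ground (0.8), penumbra (0.6), umbra (0.4).
shadow = [0.8, 0.8, 0.6, 0.4, 0.4]
binary_mask = [0.0, 0.0, 0.0, 1.0, 1.0]  # marks the umbra only
soft_matte = [0.0, 0.0, 0.5, 1.0, 1.0]   # partial weight in the penumbra

# Assumed "correct" parameters for this toy example.
relit = relight_1d(shadow, omega=1.0, b=0.4)
hard = composite(shadow, relit, binary_mask)  # penumbra pixel stays at 0.6
soft = composite(shadow, relit, soft_matte)   # every pixel reaches about 0.8

# A biased offset gives a restored region that no longer matches the lit ground.
biased = composite(shadow, relight_1d(shadow, omega=1.0, b=0.3), soft_matte)

print("binary mask:", hard)
print("soft matte: ", soft)
print("biased b:   ", biased)
```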
To address color inconsistency, we observe that the ResNeXt [12] backbone of SP-Net [1] does not estimate the parameters (ω, b) used to relight the image accurately enough, so the shadow regions are not converted to the correct color. As shown in Fig. 3, we change the backbone from ResNeXt [12] to ResNeSt [15], which adds an attention mechanism to ResNeXt [12]. Because the model input is the shadow image together with the shadow mask obtained from a shadow detection model, attention can strengthen the features of the shadow regions covered by the shadow mask, make the estimated parameters (ω, b) more accurate, and thereby address the problem of color inconsistency.
Fig. 3. SP-NeSt
3.3 Obvious Boundaries
After addressing the color inconsistency problem, we turn to the obvious boundaries problem. As shown in Fig. 4, the boundary of the umbrella remains clearly visible after the shadow is removed. In M-Net [1], the shadow matte is calculated with a U-Net [14] backbone.
Fig. 4. Obvious Boundaries
U-Net [14] uses down-sampling to obtain features at different scales. However, boundaries and other shallow features may be lost in the down-sampled image, which leads to noise at the boundaries when the original size is restored and results in the obvious boundaries problem. We therefore add the Residual Refine Module (RRM) proposed in BASNet [18]. The modified architecture is shown in Fig. 5; the RRM architecture is similar to U-Net [14]. BASNet [18] states that the first half of U-Net [14] focuses on deep features: after the down-sampling and pooling layers, shallow features such as boundaries will gradually
Fig. 5. M-Net
be lost, and the final output has blurred boundaries or noise. The original M-Net [1] used only U-Net [14] to calculate the shadow matte, which caused this problem. We use the simple, shallow RRM network to convolve the U-Net [14] output and refine its result, extracting finer features, recovering the information lost during U-Net [14] down-sampling, reducing the noise at the boundaries, and effectively dealing with obvious shadow boundaries.
3.4 A Lack of Training Sets
In the related research in Sect. 2, we mentioned the shortage of datasets in the field of shadow removal [16, 21]. With too few datasets, a model cannot be trained on a wide variety of scenes and occlusions, and a model that cannot learn from varied shadow images removes shadows less well. SynShadow [2] proposed synthesizing shadow images from a shadow-free image and a shadow matte and then training the model on the resulting triplets. This differs from Mask-ShadowGAN [7] and DHAN [11], which synthesize shadow images from shadow masks. Inoue et al. [2] used Blender to generate shadow mattes and darkened the shadow-free image to produce a darkened image; they then combined the shadow-free image, the darkened image, and the shadow matte to obtain a variety of shadow images. They computed statistics over the dataset, set a reasonable sampling range for each parameter, and sampled the parameters randomly, aiming to generate realistic shadows. Because the sampled parameters differ, the strength of the illumination effect also differs, so shadows of different depths are obtained, which increases variety.
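The synthesis idea described above can be sketched as follows. This is a simplified illustration rather than the SynShadow pipeline itself: the per-channel affine darkening, the blending formula (Eq. (6) applied in the opposite direction), the sampling range, and all names are assumptions of this example.

```python
import random
from typing import List, Sequence, Tuple

Pixel = Tuple[float, float, float]


def darken(pixel: Pixel, scale: Sequence[float], offset: Sequence[float]) -> Pixel:
    """Assumed darkening model: a per-channel affine attenuation of a lit pixel."""
    return tuple(max(0.0, scale[k] * pixel[k] + offset[k]) for k in range(3))


def synthesize_shadow(shadow_free: List[Pixel], matte: List[float],
                      scale: Sequence[float], offset: Sequence[float]) -> List[Pixel]:
    """Blend the shadow-free and darkened pixels with the matte (1 = full shadow)."""
    out = []
    for px, a in zip(shadow_free, matte):
        dark = darken(px, scale, offset)
        out.append(tuple(px[k] * (1.0 - a) + dark[k] * a for k in range(3)))
    return out


# Hypothetical sampling range for the darkening strength; a fixed seed keeps it repeatable.
rng = random.Random(0)
scale = tuple(rng.uniform(0.3, 0.7) for _ in range(3))
offset = (0.0, 0.0, 0.0)

shadow_free = [(0.8, 0.7, 0.6)] * 3
matte = [0.0, 0.5, 1.0]  # no shadow, penumbra, umbra
shadow = synthesize_shadow(shadow_free, matte, scale, offset)
print(shadow)
```

Drawing a new `scale` for each synthesized image is what yields shadows of different depths in this sketch.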
4 Experiment
4.1 Dataset
We used USR, ISTD +, and the shadow mattes produced by Inoue et al. [2] for training. First, we pre-trained on USR with the shadow mattes. Then, we fine-tuned on the ISTD + dataset.
1) USR Dataset (Unpaired Shadow Removal Dataset)
The USR dataset was collected for Mask-ShadowGAN [7]. Its shadow images and shadow-free images are not paired. The training set contains 1956 shadow images and 1770 shadow-free images, and the testing set contains 489 shadow images. Although it has no ground truth, it contains many different scenes, which makes its shadow-free images suitable for synthesizing shadow images.
2) ISTD + Dataset (Image Shadow Triplets Dataset)
ISTD + is a revised version of ISTD [9]. Le et al. [1] pointed out that, because the shadow and shadow-free images were captured at different times, the illumination intensity changed between them, resulting in slight pixel differences between the shadow image and the shadow-free image in the non-shadow regions of ISTD [9]. To deal with this problem, Le et al. [1] used the least squares method to calculate the relationship between each shadow image and its shadow-free image in the non-shadow regions, and transformed the shadow-free image according to this relationship to obtain a corrected ground truth. This reduces the RMSE between the shadow and shadow-free images in the non-shadow regions, so that training is not affected by this difference.
4.2 Evaluation Metrics
1) RMSE (Root Mean Square Error)
In the field of shadow removal, the most commonly used metric for comparison is RMSE. Although the literature reports this metric as RMSE, the evaluation code in fact calculates the Mean Absolute Error (MAE). This paper therefore continues to call the metric RMSE, but the calculated value is MAE:

$$\text{MAE} = \frac{\sum_{i=1}^{n} |y_i - x_i|}{n} \quad (7)$$

where n is the number of images, $y_i$ is the shadow-free image, and $x_i$ is the output.
2) SSIM (Structural Similarity Index)
SSIM [19] is a metric for comparing the similarity of two images. Unlike RMSE, it is designed around human visual perception. It is often used to evaluate super-resolution (SR). It combines three components: luminance, contrast, and structure. The formulas are:

$$\text{SSIM}(x, y) = l(x, y)^{\alpha} \cdot c(x, y)^{\beta} \cdot s(x, y)^{\gamma} \quad (8)$$

$$l(x, y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1} \quad (9)$$

$$c(x, y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2} \quad (10)$$

$$s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3} \quad (11)$$

where x is the output and y is the shadow-free image. SSIM is obtained by multiplying the luminance l(x, y), the contrast c(x, y), and the structure s(x, y); the exponents α, β, γ adjust the weights of the three terms and are usually set to 1. The terms l(x, y), c(x, y), and s(x, y) are calculated by Eqs. (9)–(11), where μ and σ denote the mean and standard deviation respectively, $\sigma_{xy}$ denotes the covariance of x and y, and the constants ($C_1$, $C_2$, $C_3$) keep the formulas stable and prevent the denominators from being 0. In addition to RMSE, we therefore also report the SSIM between the output and the ground truth as an important metric.
4.3 Results
We compare the results of RMSE and SSIM [19] on the ISTD + [1] testing set. In shadow removal papers, the comparison metrics are calculated for the shadow regions, the non-shadow regions, and all regions of the images. The goal is to remove the shadow while also preserving the integrity of the non-shadow regions.
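The metrics above can be sketched as follows, under simplifying assumptions. Images are flattened to lists of grayscale values in [0, 1]; `region_errors` reports both MAE (Eq. (7), applied per pixel) and the true RMSE for the shadow, non-shadow, and all-pixel regions so that the two can be compared; `ssim_global` evaluates Eqs. (8)–(11) once over the whole image with α = β = γ = 1. Practical SSIM implementations average the index over local windows, and the constants used here follow the conventional choice K1 = 0.01, K2 = 0.03, C3 = C2/2, which is an assumption of this sketch rather than something stated in the text.

```python
import math
from typing import Dict, Sequence, Tuple


def region_errors(output: Sequence[float], target: Sequence[float],
                  mask: Sequence[int]) -> Dict[str, Tuple[float, float]]:
    """Return (MAE, RMSE) over shadow (mask == 1), non-shadow (mask == 0) and all pixels.

    Each region is assumed to contain at least one pixel.
    """
    regions = {
        "shadow": [i for i, m in enumerate(mask) if m == 1],
        "non_shadow": [i for i, m in enumerate(mask) if m == 0],
        "all": list(range(len(mask))),
    }
    result = {}
    for name, idx in regions.items():
        diffs = [abs(target[i] - output[i]) for i in idx]
        mae = sum(diffs) / len(diffs)
        rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
        result[name] = (mae, rmse)
    return result


def ssim_global(x: Sequence[float], y: Sequence[float], data_range: float = 1.0) -> float:
    """Single-window SSIM from Eqs. (8)-(11) with alpha = beta = gamma = 1."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((v - mu_x) ** 2 for v in x) / n
    var_y = sum((v - mu_y) ** 2 for v in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)
    c1 = (0.01 * data_range) ** 2  # conventional constants, assumed here
    c2 = (0.03 * data_range) ** 2
    c3 = c2 / 2.0
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast = (2 * sd_x * sd_y + c2) / (var_x + var_y + c2)
    structure = (cov + c3) / (sd_x * sd_y + c3)
    return luminance * contrast * structure


# Hypothetical 4-pixel example: pixels 1 and 3 lie inside the shadow mask.
output = [0.5, 0.7, 0.9, 0.2]
target = [0.5, 0.8, 0.9, 0.4]
mask = [0, 1, 0, 1]
errors = region_errors(output, target, mask)
print(errors)
print(ssim_global(output, target))
```

Because RMSE weights large errors more heavily, it is never smaller than MAE on the same pixels, which is worth keeping in mind when values reported under the RMSE name are in fact MAE.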
As shown in Table 1, our method obtains good results compared with the other methods. In the shadow regions, the RMSE decreases from 7.9 for SP + M Net [1] to 6.8, which is slightly higher than the 6.5 of Auto-Exposure [17]. In the non-shadow regions, our result matches that of SP + M Net [1] and is the best among the compared methods. Over all regions, we also obtain the best result. These results show that, building on SP + M Net [1], we can improve the RMSE of the shadow regions while maintaining the RMSE of the non-shadow regions.

Table 1. RMSE Comparison on ISTD +

Methods               Shadow   Non-Shadow   All
Input Image           40.2     2.6          8.5
Guo et al. [4]        22.0     3.1          6.1
ST-CGAN [9]           13.4     7.7          8.7
DeshadowNet [5]       15.9     6.0          7.6
MaskShadowGAN [7]     12.4     4.0          5.3
DC-ShadowNet [20]     10.3     3.5          4.65
Auto-Exposure [17]    6.5      3.8          4.2
SP + M Net [1]        7.9      3.1          3.9
Ours                  6.8      3.1          3.7
We also evaluate SSIM on ISTD + [1], as shown in Table 2. For our method, the SSIM of each region is higher than 0.95, which means that a high degree of similarity is maintained without damaging the image. We observed that Auto-Exposure [17] achieves good RMSE and SSIM in the shadow regions and recovers the shadow regions completely and accurately, but its SSIM in the non-shadow regions is lower. We found that this method changes the brightness of the output image, which lowers its performance in the non-shadow regions.

Table 2. SSIM Comparison on ISTD +

Methods               Shadow   Non-Shadow   All
Input Image           0.926    0.984        0.894
DC-ShadowNet [20]     0.976    0.967        0.931
Auto-Exposure [17]    0.978    0.892        0.861
SP + M Net [1]        0.984    0.979        0.931
Ours                  0.988    0.974        0.957
Fig. 6 compares our results with those of other recent methods. From left to right, the images are the input, DC-ShadowNet [20], SP + M Net [1], Auto-Exposure
[17], our method, and the ground truth. In the third image of the second column, DC-ShadowNet [20] brightens the sky while removing the shadow. Compared with the results of the other methods, ours also appear slightly improved. The color inconsistency and obvious boundaries problems mentioned above are illustrated in Fig. 7. In the first row, the color of the umbrella is correctly converted to the same gray as the ground rather than the yellow of the pavement markings. In the second row, the umbrella boundaries are much less visible than in the other results.
Fig. 6. Comparison of Shadow Removal Results on ISTD + Dataset
Fig. 7. Color Inconsistency and Obvious Boundaries
5 Conclusion
We address current problems in the field of shadow removal. First, we modify the network to handle color inconsistency so that colors are estimated correctly. Second, we use RRM to reduce the loss of boundary information and thus the obvious-edge problem. The limited availability of datasets in the field of shadow removal remains an intractable problem. Although many methods use data augmentation or synthesis to increase the amount of data, the environmental factors that must be considered in reality are complex, and current training results are still far from practical use. In the future, we expect to use different training strategies or collect more complex datasets to train the model, so that the model can obtain better results.
Acknowledgments. This research was supported in part by the Ministry of Science and Technology under contract numbers 111-2218-E-011 -011-MBK and 111-2221-E-011 -134 -, and also by the “Center for Cyber-physical System Innovation” from The Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan.
References
1. Le, H., Samaras, D.: Shadow removal via shadow image decomposition. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), pp. 8577–8586 (2019). https://doi.org/10.1109/ICCV.2019.00867
2. Inoue, N., Yamasaki, T.: Learning from synthetic shadows for shadow detection and removal. IEEE Trans. Circuits Syst. Video Technol. 31(11), 4187–4197 (2021). https://doi.org/10.1109/TCSVT.2020.3047977
3. Xiao, C., She, R., Xiao, D., Ma, K.-L.: Fast Shadow removal using adaptive multi-scale illumination transfer: shadow removal by multi-scale illumination transfer. Comput. Graph. Forum 32(8), 207–218 (2013). https://doi.org/10.1111/cgf.12198
4. Guo, R., Dai, Q., Hoiem, D.: Paired regions for shadow detection and removal. IEEE Trans. Pattern Anal. Mach. Intell. 35(12), 2956–2967 (2013)
5. Qu, L., Tian, J., He, S., Tang, Y., Lau, R.W.: DeshadowNet: a multi-context embedding deep network for shadow removal. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, pp. 2308–2316 (2017). https://doi.org/10.1109/CVPR.2017.248
6. Goodfellow, I.J., et al.: Generative Adversarial Networks (2014). Accessed on 04 June 2022. http://arxiv.org/abs/1406.2661
7. Hu, X., Jiang, Y., Fu, C.W., Heng, P.-A.: Mask-ShadowGAN: learning to remove shadows from unpaired data. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 2472–2481 (2019). https://doi.org/10.1109/ICCV.2019.00256
8. Zhu, J.-Y., Park, T., Isola, P., Efros, A.A.: Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks (2020). Accessed on 04 June 2022. http://arxiv.org/abs/1703.10593
9. Wang, J., Li, X., Hui, L., Yang, J.: Stacked conditional generative adversarial networks for jointly learning shadow detection and shadow removal (2017). Accessed on 06 June 2022. http://arxiv.org/abs/1712.02478
10. Shor, Y., Lischinski, D.: The shadow meets the mask: pyramid-based shadow removal. Comput. Graph. Forum 27(2), 577–586 (2008). https://doi.org/10.1111/j.1467-8659.2008.01155.x
11. Cun, X., Pun, C.-M., Shi, C.: Towards ghost-free shadow removal via dual hierarchical aggregation network and shadow matting GAN (2019). Accessed on 04 June 2022. http://arxiv.org/abs/1911.08718
12. Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks (2017). Accessed on 04 June 2022. http://arxiv.org/abs/1611.05431
13. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition (2015). Accessed on 04 June 2022. http://arxiv.org/abs/1512.03385
14. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation (2015). Accessed on 05 June 2022. http://arxiv.org/abs/1505.04597
15. Zhang, H., et al.: ResNeSt: split-attention networks (2020). Accessed on 05 June 2022. http://arxiv.org/abs/2004.08955
16. Hu, X., Fu, C.-W., Zhu, L., Qin, J., Heng, P.-A.: Direction-aware spatial context features for shadow detection and removal. IEEE Trans. Pattern Anal. Mach. Intell. 42(11), 2795–2808 (2020). https://doi.org/10.1109/TPAMI.2019.2919616
17. Fu, L., et al.: Auto-exposure fusion for single-image shadow removal (2021). Accessed on 06 June 2022. http://arxiv.org/abs/2103.01255
18. Qin, X., Zhang, Z., Huang, C., Gao, C., Dehghan, M., Jagersand, M.: BASNet: boundary-aware salient object detection. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, pp. 7471–7481 (2019). https://doi.org/10.1109/CVPR.2019.00766
19. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. on Image Process. 13(4), 600–612 (2004). https://doi.org/10.1109/TIP.2003.819861
20. Jin, Y., Sharma, A., Tan, R.T.: DC-ShadowNet: single-image hard and soft shadow removal using unsupervised domain-classifier guided network. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, pp. 5007–5016 (2021). https://doi.org/10.1109/ICCV48922.2021.00498
21. Zhu, L., et al.: Bidirectional feature pyramid network with recurrent attention residual modules for shadow detection. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11210, pp. 122–137. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01231-1_8
Machine Learning Techniques for Accurately Detecting the DNS Tunneling
Mouhammd Alkasassbeh1(B) and Mohammad Almseidin2
1 Princess Sumaya University for Technology, Amman, Jordan [email protected]
2 Department of Computer Science, Tafila Technical University, Amman, Jordan [email protected]
Abstract. DNS is a core protocol that many networks need in order to function properly. Security policies typically permit DNS traffic, and attackers therefore use DNS traffic for malicious purposes. This research focuses on detecting DNS traffic that is used as a hidden tunnel, using machine learning algorithms. Any type of IP traffic can be sent over the tunnel by encapsulating it in uncommon DNS fields such as NULL and TXT records. The study utilized machine learning algorithms to analyze dataset records and predict DNS tunneling traffic. The performance of the machine learning classifiers varied in terms of recall, precision, and F-measure metrics. Among the classifiers examined, RandomForest, J48, and multi-layer perceptron (MLP) were found to be the most appropriate for classifying DNS tunneling. RandomForest, in particular, produced highly accurate results. These findings highlight the potential of machine learning algorithms in detecting and preventing malicious network activity, and provide valuable insights for future research in this area.
Keywords: DNS Tunneling · Machine Learning and Multilayer Perceptron (MLP) · RandomForest · J48
1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): SAI 2023, LNNS 711, pp. 352–364, 2023. https://doi.org/10.1007/978-3-031-37717-4_24
The Domain Name System (DNS) is undoubtedly a core networking protocol without which accessing the Internet would be rather infeasible. Not only does DNS relieve us from memorizing ever-changing IPv4 addresses, but it will also be a life-saver when we all move to the daunting, much longer IPv6 addresses. Name servers, also known as DNS servers, maintain gigantic databases that map domain names to their respective IP addresses. Thus, the DNS protocol is heavily utilized on the Internet; whenever an application needs to access an online resource, the DNS protocol retrieves its corresponding IP address through a hierarchy of name servers, so that communication can be realized. From a network security perspective, the DNS protocol is an inherently exploitable protocol [27–29]. DNS servers do not validate the identities of
whomever asks for their wealth of name-address mappings. Moreover, due to the nature of the DNS protocol, DNS servers blindly relay information from internal domains to the outside world. This rendered this protocol a popular candidate for attackers who wish to exfiltrate information from organizations, or transfer malicious software [1,29]. Another intrinsic DNS vulnerability is its unauthoritative caches; once a DNS cache is poisoned with a bad mapping, unwitting users would be directed to harmful destinations [17,22]. The aforementioned vulnerabilities stem from the DNS protocol fundamental design goals, namely simplicity and responsiveness. Intuitively, such qualities cannot be sacrificed. Thus, network Intrusion Detection Systems (IDS) become instrumental for identifying and reporting DNS attacks in a timely manner. Unlike signature-based IDSs that can only identify a set of pre-defined DNS attack patterns, behavioral IDSs are a more promising choice. This is because behavioral IDSs have the potential to understand normal DNS traffic, and thus, stand a better chance at identifying abnormalities whether seen or unseen before. Several approaches to design and implement behavioral IDSs exist among which is utilizing machine-learning techniques. The survey presented in [6] discussed several recent examples of such systems. Machine learning based IDSs are promising adaptive network security tools, especially with the absence of commercially successful behavioral IDSs. In this work, we introduce a behavioral IDS for identifying DNS tunneling attacks using machine learning. This IDS utilizes a wide set of features of both normal and tunneled DNS traffic, in order to identify unseen entities of tunneled DNS traffic. Section 2 of this paper sheds light on the DNS protocol and highlights the characteristics that render it vulnerable for tunneling attacks. Moreover, it discusses a set of recently proposed IDSs for detecting DNS tunneling. 
The proposed IDS is presented in Sect. 3, while Sect. 4 describes the experimental setup and results. Finally, Sect. 5 concludes this paper and highlights future directions.
2 Background and Related Work
In this section, the DNS protocol vulnerabilities are highlighted alongside the mechanism of DNS tunneling, as one example of covert channel attacks. Moreover, state-of-the-art studies on detecting DNS tunneling using machine learning are presented.
2.1 DNS Tunneling
In a typical DNS query, the IP address of a given domain name needs to be retrieved; a process referred to as a lookup. To do so, a hierarchy of DNS (authoritative) servers forward the designated query among themselves until the required IP address is found and returned to the requester. The domain name owner gets to define the respective authoritative servers, and this opens the door for abusing the DNS protocol. While typical internal nodes in a network do not usually communicate directly with the Internet to perform lookups, their network’s DNS
M. Alkasassbeh and M. Almseidin
server(s) do. This implies that the DNS query would eventually reach one of the attacker’s authoritative DNS servers. Thus, if the attacker happens to have access to an internal node, they would be capable of exploiting the network’s DNS infrastructure to their advantage [15]. Figure 1 illustrates a typical DNS tunnel together with the messages exchanged. The numbers designate the order in which these messages flow. A DNS tunnel is established when a fraudulent domain name is registered and is pointed to by a malicious authoritative server. Next, we describe what each number in Fig. 1 designates.
(1) An apparent DNS query, i.e. a typical DNS query made to the fraudulent domain from an internal client. This client was previously hijacked by the attacking party. Although it looks legitimate, the DNS query actually either polls the fraudulent server for the next command to execute, or embeds encoded and/or leaked information in the DNS query record.
(2) The DNS query, which looks harmless, passes through the DNS server of the network, which forwards the query because it does not know the IP address of the designated domain name.
(3) The DNS query also passes through the firewall of the network, which does not treat the query as malicious because it looks normal.
(4) The hierarchical DNS servers in the Internet forward the query among themselves as per the DNS protocol.
(5) The DNS query is finally forwarded to its target, the fraudulent authoritative server.
(6) The fraudulent server receives the query and processes it according to the contents conveyed in the query itself alongside the other parts of the DNS query record.
(7) The malicious server replies with an apparent DNS response record, which may encode the next command to be executed by the hijacked node.
(8–10) The response travels back to the hijacked node, passing through the Internet DNS servers, the network’s firewall, and the network’s DNS server.
2.2 Recent Related Studies
Two main approaches to analyzing DNS traffic are reported in the literature: payload analysis and traffic analysis [15]. Payload analysis considers individual DNS request (or query)/response pairs in isolation, whereas traffic analysis studies DNS traffic over time using statistical summaries that reflect the original traffic. The study depicted in [20] performed payload analysis and examined Gaussian Bayesian, Support Vector Machines (SVM), Random Forests, and several variants of Decision Trees to detect DNS tunneling. Results showed that the Random Forest classifier outperformed the solution proposed in [21] in terms of precision, recall, accuracy, F1 score, training time and testing time. A multi-class SVM (MCSVM) was utilized in [11] in order to detect the type of protocol tunneled in DNS traffic. The MCSVM outperformed a Naive Bayesian classifier in terms of precision, recall, and f-measure. The protocols tunneled in DNS were HTTP and HTTPS, FTP, and POP3, whereas the features utilized included lengths of DNS query/response messages, the entropy values of the IP packets, and query names. Feed-forward Neural Networks (FNN) were used together with a dataset collected using different DNS tunneling tools in the study depicted in [14].
Fig. 1. Messages Exchanged over a DNS Tunnel
The major features used comprised the entropy values of the DNS resource records, and the lengths of the resource record name and data. The experiments measured the classification accuracy, precision and recall of the normal DNS class, in addition to the ROC curves for the data of each DNS tunneling tool. The study in [23] performed traffic analysis on a large-scale real-life dataset. A multi-resolution time-windowing approach was used to enable the detection of both tunneling and low-throughput data exfiltration over DNS. The features used included character entropy, unique query ratio, query volume, and query average length. The proposed solution comprised an Isolation Forest (IF) classifier which outperformed two state-of-the-art solutions at identifying all types of tunneled DNS traffic in a test dataset. In the study presented in [2], payload analysis was performed at the enterprise edge, using a dataset that comprised a week-long normal DNS traffic collected from top ranking domains. The features considered were character count, entropy, and length of discrete labels in the query name. Isolation forests were also used to identify tunneled DNS traffic, while reducing false-alarm rates. The results depicted the classification accuracy of the classifier together with the false-alarm rate, in addition to the average time it took to analyze and classify a given DNS query. Reducing false-alarm rates was also the goal in the study depicted in [3], where traffic analysis was utilized to generate a 12-feature statistical dataset that considered the packet lengths of each query and its response. k-means clustering, followed by a simple decision tree to reduce false alarms, was utilized. Accuracy, AUC, and Youden index were measured to evaluate the classifier performance. The features used in the study presented in [18] comprised packet length, the entropy of the query name, and the query name length.
In order to identify tunneled HTTPS, FTP, and POP3 traffic, four classifiers were benchmarked:
decision trees, support vector machines, k-nearest neighbours, and neural networks. The experiments measured the accuracy, precision, and recall of each classifier with respect to each type of tunneled protocol. The entropy of the DNS messages was used in the study depicted in [19]. Instead of examining packet payloads, this study only clustered the entropy values of packet metadata in order to identify tunneled HTTP and FTP traffic. To evaluate the performance of the proposed solution, accuracy, precision, recall, and mis-classification rates were measured. Numerous recent studies on DNS tunneling exist in the literature. In addition to the studies summarized here, readers are also referred to the review depicted in [24] for further details on machine-learning solutions to address DNS tunneling. Moreover, in our previous work on intrusion detection systems (IDS) using artificial intelligence (AI) and machine learning, we developed and implemented various machine learning algorithms to analyze network traffic and identify potential security threats. We experimented with different approaches such as decision trees, random forests, and support vector machines to classify malicious and benign traffic. We also implemented feature engineering techniques to extract relevant features from the raw data and improve the accuracy of the model. Additionally, we designed and implemented an AI-based IDS that used deep learning techniques to identify patterns in network traffic and detect anomalies that could indicate a security threat. Overall, our work on IDS using AI and machine learning resulted in significant improvements in the accuracy and effectiveness of the IDS, as well as a reduction in false positives and false negatives. Some of our previous work in the field is reported in [5,7–10,25].
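Several of the studies summarized above rely on the entropy and length of the query name. The following sketch shows how such payload features could be computed; it is illustrative only and does not reproduce the feature extraction of any cited study. The domain names are made up, and the base32 encoding is merely one assumed way in which a tunneling tool might pack data into a label.

```python
import base64
import math
from collections import Counter
from typing import Dict


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits per character."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


def query_features(qname: str) -> Dict[str, float]:
    """Length, label statistics and entropy of a DNS query name."""
    labels = [label for label in qname.rstrip(".").split(".") if label]
    return {
        "length": len(qname),
        "num_labels": len(labels),
        "max_label_length": max((len(label) for label in labels), default=0),
        "entropy": shannon_entropy("".join(labels)),
    }


# Hypothetical names: an ordinary lookup and a name carrying an encoded payload.
normal_name = "www.example.com"
payload = base64.b32encode(b"some secret bytes to leak").decode().rstrip("=").lower()
tunnel_name = payload + ".t.example.com"

print(query_features(normal_name))
print(query_features(tunnel_name))
```

Names that carry encoded data tend to be longer and to use characters more uniformly than human-chosen names, which is why length and entropy are common inputs to the classifiers discussed above.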
3 Proposed Model
The proposed model applies to our dataset the three machine learning classifiers most commonly reported in the literature, in order to measure and compare their performance. The experiments were conducted using the Waikato Environment for Knowledge Analysis (WEKA) 3.8. More details are given in the sections below.
3.1 The Dataset
For this study, we used the DNS dataset from [4] as the basis for our machine learning classification. The dataset consists of 2,505 records that were collected to detect DNS tunneling. These records were obtained from PCAP files captured from network devices in a real test-bed network. By using this dataset, we were able to evaluate the effectiveness of various machine learning algorithms in identifying DNS tunneling and their potential for detecting similar forms of malicious network activity in the future. The raw files were fed into NetFlowMeter (http://netflowmeter.ca/)
to obtain 52 features. However, NetFlowMeter is a generic tool that converts PCAP files into CSV files, and in our case many of the resulting features contain only zeros and null values. After analyzing the dataset in depth, we found that only 12 features matter in our case; Table 2 lists them with a brief description of each variable. There are only two classes: normal and abnormal (DNS tunneling) traffic. Table 1 shows the number of records for each traffic state.

Table 1. Data Traffic State Records

No   Traffic state   Records
1    abnormal        59
2    Normal          2443

Table 2. Variable Description

Label   Variable Name                 Variable description
1       SrcPort                       Source Port
2       DstPort                       Destination Port
3       Protocol                      Protocol
4       Flow Duration                 Duration of the flow in microseconds
5       Total Fwd Packet              Total packets in the forward direction
6       Total Bwd packets             Total packets in the backward direction
7       Total Length of Fwd Packet    Total size of packet in forward direction
8       Total Length of Bwd Packet    Total size of packet in backward direction
9       Flow Packets/s                Number of flow bytes per second
10      Fwd Header Length             Total bytes used for headers in the forward direction
11      Bwd Header Length             Total bytes used for headers in the backward direction
12      class                         Normal traffic or DNS tunneling traffic

3.2 Machine Learning Classifiers
According to [26], machine learning classifiers are algorithms that have the ability to understand a dataset and make predictions. These classifiers use the records within the dataset to train and build a classification model that can accurately classify new objects based on the strength of the model itself. By leveraging the power of these classifiers, we were able to identify and classify instances of DNS tunneling in our dataset with a high degree of accuracy, paving the way for improved network security in the future. As mentioned before, three classifiers are used; they are described in the following. The J48 classification algorithm is based on the C4.5 decision tree algorithm. It employs a training dataset consisting of labeled samples, each of which is described by its feature values. Figure 2 depicts how the algorithm constructs a
decision tree by analyzing the training data, where each node represents a feature that partitions the sample set into subsets based on the information gain. Decision trees are recognized for their capacity to depict and analyze relationships and interactions between features in a plain and comprehensible manner. However, one of the limitations of decision trees is that they must be reconstructed whenever new samples are collected [12].
Fig. 2. Decision Tree
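The information-gain criterion mentioned above can be illustrated with a small sketch. It is not WEKA's J48 code: the rows, the thresholds, and the helper names are made up for the example, and C4.5 additionally normalizes the gain into a gain ratio, which is omitted here for brevity.

```python
import math
from collections import Counter
from typing import List, Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows: List[Sequence[float]], labels: Sequence[str],
                     feature_index: int, threshold: float) -> float:
    """Entropy reduction obtained by splitting on rows[i][feature_index] <= threshold."""
    left = [y for x, y in zip(rows, labels) if x[feature_index] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature_index] > threshold]
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder


# Hypothetical flows: (flow duration, total length of forward packets).
rows = [(10, 60), (12, 64), (11, 300), (13, 280)]
labels = ["normal", "normal", "tunnel", "tunnel"]

gain_length = information_gain(rows, labels, feature_index=1, threshold=100)
gain_duration = information_gain(rows, labels, feature_index=0, threshold=11)
print(gain_length, gain_duration)
```

In this toy data the packet-length split separates the two classes perfectly while the duration split separates nothing, so a tree builder would place the packet-length test at the node.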
Random forest is a classification technique based on the decision tree algorithm that is appropriate for analyzing large datasets due to its capacity to manage multiple variables. Figure 3 depicts how, during the training phase, the algorithm constructs a collection of decision trees. Each tree operates on a set of attributes chosen at random. Combining the results of each tree through majority voting completes the classification procedure. Because the trees are trained on multiple subsets of the training dataset, random forest is less prone to the overfitting that can occur with individual decision trees. Nevertheless, the random selection used in constructing the forest hampers reproducibility [13].
The Multilayer Perceptron (MLP) is one of the most widely used types of artificial neural network. Like all neural networks, the MLP consists of interconnected components. It is composed of three different layers: an input layer, a hidden layer, and an output layer, each with a distinct function, as shown in Fig. 4. The input layer receives signals, the output layer generates decisions based on the input, and the hidden layer serves as the computational engine of the MLP. The MLP is commonly used for supervised learning problems, where it is trained on a set of input-output pairs and learns the correlations and dependencies among them [16].
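Two mechanisms described above, majority voting in a random forest and the forward pass of an MLP with one hidden layer, can be sketched briefly. Training is omitted entirely; the per-tree predictions and the weights are placeholders, and the sigmoid activation is an assumption of this example rather than a detail given in the text.

```python
import math
from collections import Counter
from typing import List, Sequence


def majority_vote(tree_predictions: Sequence[str]) -> str:
    """Random-forest style aggregation: return the most frequent class label."""
    return Counter(tree_predictions).most_common(1)[0][0]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def mlp_forward(x: Sequence[float],
                w_hidden: List[Sequence[float]], b_hidden: Sequence[float],
                w_out: Sequence[float], b_out: float) -> float:
    """Input layer -> one hidden layer -> single output unit, with sigmoid activations."""
    hidden = [sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
              for weights, bias in zip(w_hidden, b_hidden)]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


# Hypothetical predictions of three trees for one flow.
vote = majority_vote(["normal", "tunnel", "tunnel"])

# Placeholder weights: two inputs, two hidden units, one output score in (0, 1).
score = mlp_forward([0.5, 1.0],
                    w_hidden=[[0.2, -0.4], [0.7, 0.1]], b_hidden=[0.0, 0.1],
                    w_out=[1.5, -1.0], b_out=0.05)
print(vote, score)
```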
Fig. 3. Random Forest
3.3 Evaluation Metrics
In this paper, we have utilized the most commonly used measurement metrics, namely Precision, Recall, and F-Measure. Precision is defined as the ratio of the number of correctly predicted positive samples to the total number of predicted positive samples, while Recall is the ratio of the number of correctly predicted positive samples to the total number of actual positive samples. The F-Measure is a composite metric that takes into account both Precision and Recall. With TP, FP, and FN denoting the numbers of true positives, false positives, and false negatives, the metrics are given by the following formulas:
\[ \text{Precision} = \frac{TP}{TP + FP} \tag{1} \]
\[ \text{Recall} = \frac{TP}{TP + FN} \tag{2} \]
\[ \text{F-measure} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3} \]
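As a worked illustration of Eqs. (1) to (3), the short sketch below computes the three metrics from confusion-matrix counts. The counts are hypothetical and are not results from this paper; the zero-division guards are an implementation choice.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure

# Hypothetical counts, chosen only to illustrate the arithmetic.
p, r, f = precision_recall_f(tp=90, fp=10, fn=30)
```

With these counts, precision is 90/100 = 0.9 and recall is 90/120 = 0.75; the F-measure lies between the two.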
Fig. 4. Multilayer Perceptron (MLP)
4 Experimental Results
The evaluation of the results and performance was carried out using the Weka machine learning workbench, version 3.8, running on a 64-bit Windows 10 system with an Intel® Core™ i7 processor and 8 GB of RAM. In the experiments, we employed the 10-fold cross-validation technique to test the models, as it helps minimize the estimation variance. With this technique, the dataset was divided into ten subsets; in each of ten repetitions, one subset was held out for testing and the model was trained on the remaining nine, so that each subset served as the test set exactly once. Table 3 shows the performance of the three selected algorithms (J48, RF, and MLP), and Figures 5, 6 and 7 illustrate the performance of the classifiers in terms of F-Measure, precision, and recall, respectively.
Table 3. Classifiers Results
| No | Classifier | Correctly Classified Instances | Precision | Recall | F-Measure |
|----|------------|-------------------------------|-----------|--------|-----------|
| 1 | J48 | 98.04% | 97.7% | 98.0% | 97.6% |
| 2 | Random Forest (RF) | 98.20% | 97.9% | 98.2% | 97.9% |
| 3 | Multilayer Perceptron (MLP) | 98.00% | 97.9% | 98.0% | 97.3% |
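The 10-fold procedure described above can be sketched as an index-splitting routine. This is a simplified illustration rather than Weka's implementation: it shuffles with a fixed seed and does not stratify the folds by class, which a tool such as Weka may do. The function name `k_fold_indices` and the sample count are hypothetical.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized subsets
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Hypothetical dataset of 25 records.
splits = list(k_fold_indices(n_samples=25, k=10))
```

A model would be trained on each `train` list and scored on the matching `test` list, and the ten scores would then be averaged.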
From Fig. 5, it was found that all classifiers achieved high F-Measure values. However, the Random Forest classifier achieved the best performance in identifying DNS tunneling, including in terms of recall and precision, as shown in Figs. 6 and 7.
Fig. 5. F-Measure Results
Fig. 6. Precision Results
Fig. 7. Recall Results
5 Conclusion
This paper explored the use of machine learning algorithms to analyze dataset records and predict DNS tunneling traffic. We observed differences in the performance of the machine learning classifiers in terms of recall, precision, and F-measure. Our findings suggest that the Random Forest, J48, and multilayer perceptron (MLP) classifiers were the most appropriate for classifying DNS tunneling. Among these, Random Forest produced the most accurate results, correctly classifying 98.20% of instances. These findings highlight the potential of machine learning algorithms in detecting and preventing malicious network activity, and provide valuable insights for future research in this area.
References
1. Ahmed, J., Gharakheili, H.H., Raza, Q., Russell, C., Sivaraman, V.: Monitoring enterprise DNS queries for detecting data exfiltration from internal hosts. IEEE Trans. Network Serv. Manag. 27, 265–279 (2019)
2. Ahmed, J., Gharakheili, H.H., Raza, Q., Russell, C., Sivaraman, V.: Real-time detection of DNS exfiltration and tunneling from enterprise networks. In: 2019 IFIP/IEEE Symposium on Integrated Network and Service Management (IM), pp. 649–653, April 2019
3. Aiello, M., Mongelli, M., Muselli, M., Verda, D.: Unsupervised learning and rule extraction for domain name server tunneling detection. Internet Technol. Lett. 2(2), e85 (2019)
4. Al-kasassbeh, M., Khairallah, T.: Winning tactics with DNS tunnelling. Netw. Secur. 2019(12), 12–19 (2019)
5. Alkasassbeh, M.: An empirical evaluation for the intrusion detection features based on machine learning and feature selection methods. arXiv preprint arXiv:1712.09623 (2017)
6. Alkasassbeh, M., Baddar, S.A.-H.: Intrusion detection systems: a state-of-the-art taxonomy and survey. Arabian J. Sci. Eng. 1–44 (2022)
7. Almseidin, M., Al-Sawwa, J., Alkasassbeh, M.: Generating a benchmark cyber multi-step attacks dataset for intrusion detection. J. Intell. Fuzzy Syst. (Preprint), 1–15
8. Almseidin, M., Al-Sawwa, J., Alkasassbeh, M.: Anomaly-based intrusion detection system using fuzzy logic. In: 2021 International Conference on Information Technology (ICIT), pp. 290–295. IEEE (2021)
9. Almseidin, M., Alzubi, M., Alkasassbeh, M., Kovacs, S.: Applying intrusion detection algorithms on the KDD-99 dataset. Prod. Syst. Inf. Eng. 8, 51–67 (2019)
10. Almseidin, M., Alzubi, M., Kovacs, S., Alkasassbeh, M.: Evaluation of machine learning algorithms for intrusion detection system. In: 2017 IEEE 15th International Symposium on Intelligent Systems and Informatics (SISY), pp. 000277–000282. IEEE (2017)
11. Almusawi, A., Amintoosi, H.: DNS tunneling detection method based on multilabel support vector machine. Secur. Commun. Networks 2018, 6137098:1–6137098:9 (2018)
12. Bhargava, N., Sharma, G., Bhargava, R., Mathuria, M.: Decision tree analysis on J48 algorithm for data mining. Proc. Int. J. Adv. Res. Comput. Sci. Software Eng. 3(6), 1114–1119 (2013)
13. Biau, G., Scornet, E.: A random forest guided tour. TEST 25(2), 197–227 (2016)
14. Bubnov, Y.: DNS tunneling detection using feedforward neural network. Eur. J. Eng. Res. Sci. 3(11), 16–19 (2018)
15. Farnham, G., Atlasis, A.: Detecting DNS tunneling. InfoSec Reading Room (2013)
16. Haykin, S.: Neural Networks and Learning Machines, 3/E. Pearson Education India (2010)
17. Hmood, H.S., Li, Z., Abdulwahid, H.K., Zhang, Y.: Adaptive caching approach to prevent DNS cache poisoning attack. Comput. J. 58(4), 973–985 (2015)
18. Homem, I., Papapetrou, P.: Harnessing predictive models for assisting network forensic investigations of DNS tunnels (2017)
19. Homem, I., Papapetrou, P., Dosis, S.: Entropy-based prediction of network protocols in the forensic analysis of DNS tunnels (2017)
20. Lin, H., Liu, G., Yan, Z.: Detection of application-layer tunnels with rules and machine learning. In: Wang, G., Feng, J., Bhuiyan, M.Z.A., Lu, R. (eds.) SpaCCS 2019. LNCS, vol. 11611, pp. 441–455. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-24907-6_33
21. Liu, J., Li, S., Zhang, Y., Xiao, J., Chang, P., Peng, C.: Detecting DNS tunnel through binary-classification based on behavior features. In: 2017 IEEE Trustcom/BigDataSE/ICESS, pp. 339–346, August 2017
22. Dissanayake, I.M.M.: DNS cache poisoning: a review on its technique and countermeasures. In: 2018 National Information Technology Conference (NITC), October, pp. 1–6 (2018)
23. Nadler, A., Aminov, A., Shabtai, A.: Detection of malicious and low throughput data exfiltration over the DNS protocol. Comput. Secur. 80, 36–53 (2019)
24. Nuojua, V., David, G., Hämäläinen, T.: DNS tunneling detection techniques – classification, and theoretical comparison in case of a real APT campaign. In: Galinina, O., Andreev, S., Balandin, S., Koucheryavy, Y. (eds.) NEW2AN/ruSMART/NsCC-2017. LNCS, vol. 10531, pp. 280–291. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-67380-6_26
25. Obeidat, I., Hamadneh, N., Alkasassbeh, M., Almseidin, M., AlZubi, M.: Intensive pre-processing of KDD cup 99 for network intrusion classification using machine learning techniques (2019)
26. Sonawane, J.S., Patil, D.R.: Prediction of heart disease using multilayer perceptron neural network. In: International Conference on Information Communication and Embedded Systems (ICICES2014), pp. 1–6. IEEE (2014)
27. Torabi, S., Boukhtouta, A., Assi, C., Debbabi, M.: Detecting internet abuse by analyzing passive DNS traffic: a survey of implemented systems. IEEE Commun. Surv. Tutor. 20(4), 3389–3415 (2018)
28. Wright, C.V., Mache, J., Weiss, R.: Hands-on exercises about DNS attacks: details, setup and lessons learned. J. Comput. Sci. Coll. 32(1), 117–125 (2016)
29. Zhao, G., Xu, K., Xu, L., Wu, B.: Detecting APT malware infections based on malicious DNS and traffic analysis. IEEE Access 3, 1132–1142 (2015)
Gradient Descent-Based Optimization Algorithms for Batch-Normalized Convolutional Neural Networks
A Comparative Performance Analysis using FEMTO, NASA, CWRU and MFPT Bearing Datasets
Charles Usigbe and Xiao Perry
London South Bank University, 103 Borough Road, SE1 0AA London, UK
[email protected], [email protected]
Abstract. In this paper, the author carries out a comparative evaluation of nine of the most widely used first-order stochastic gradient-based optimization strategies in a straightforward architectural configuration for a Batch-Normalized Convolutional Neural Network (BN-ConvNet, also written BN-CNN). The algorithms investigated are Stochastic Gradient Descent (SGD), Stochastic Gradient Descent with Momentum (Sgdm), Stochastic Gradient Descent with Momentum and Nesterov (Sgdm + n), Root Mean Square Propagation (RMSProp), Adaptive Moment Estimation (Adam), Adaptive Gradient (AdaGrad), Adaptive Delta (AdaDelta), Adaptive moment estimation Extension based on infinity norm (Adamax), and Nesterov-accelerated Adaptive Moment Estimation (Nadam). The author trained the BN-CNN model and evaluated the performance of the optimization algorithms using four randomly selected experimental bearing datasets, three of which were from Kaggle's public dataset repository while one was from the Vibration Institute's Condition-Based Maintenance Fault Database. These datasets were used to compare the optimizers in terms of convergence speed, accuracy, and loss. The overall experimental results indicated that Nadam attained the best performance across all four datasets, while AdaDelta fared the worst.
Keywords: Convolutional Neural Network · Gradient-Based Optimization Algorithm · Comparative Analysis · Optimizers · BN-CNN
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): SAI 2023, LNNS 711, pp. 365–397, 2023. https://doi.org/10.1007/978-3-031-37717-4_25
1 Introduction
Convolutional Neural Networks, often known as ConvNets, are a state-of-the-art deep learning technique and architecture that, over the last several years, has produced outstanding results in areas such as image recognition, face recognition on social media, object detection in self-driving cars, segmentation of medical imaging scans such as MRI images, and the diagnosis and detection of bearing faults. A deep learning researcher's ultimate objective is to design a model that produces better and faster outcomes by optimizing hyperparameters to minimize the loss function. This goal varies depending on the domain problem being researched and the datasets being used. Accurate classification or prediction relies heavily on the optimal tuning of the weights used in deep neural networks. Optimal tuning of the weights means finding the weights that yield the lowest loss when gradient descent is performed using backpropagation [3]. In recent years, many optimization algorithms have been proposed, with the trend favoring automated adaptive estimates that need minimal hyperparameter adjustment. This is done to avoid getting stuck in local minima, which, together with other difficulties of gradient descent such as the curse of dimensionality, can make it impossible to find the global minimum [3]. Because non-convex problems are the most prevalent case for neural networks, selecting an optimization method that finds the global optimum in such networks is often difficult, owing to the large number of parameters that must be estimated in a high-dimensional search space. If the optimization strategy is not chosen correctly, the network can get stuck in a local minimum during training, preventing it from making any progress. Because of its consistency and strong performance in a wide range of uses, the Adam optimizer is the clear favorite of several neural network researchers [3]; therefore, to gain a deeper understanding of optimizers' behaviors, it is necessary to conduct a study that compares their efficiency across a variety of models and datasets. This paper's contribution is an experimental evaluation of the efficiency of nine popular first-order stochastic gradient descent optimization techniques applied to a ConvNet model on four distinct bearing fault classification datasets.
Using loss, convergence rate, and accuracy as metrics, we examine how successfully, quickly, and stably each optimizer located the best minimum during training. The remainder of the paper is structured as follows: Sect. 2 presents the background of essential concepts and relevant research, followed by a brief discussion of the optimization approaches investigated. Sect. 3 presents the experimental findings and a discussion of them, while Sect. 4 contains the conclusion and recommendations for future research.
2 Backgrounds and Literature Review
2.1 Overview of Neural Networks Optimization
Research into the creation of novel optimization algorithms and improvements to existing ones has been an essential component of the evolution of deep learning systems over the past several years. To a large extent, training a neural network may be seen as an optimization problem with the goal of reaching a global optimum via a strong training trajectory and rapid convergence with the help of gradient descent techniques [7]. The optimization problem in data fitting is to find settings for the model parameters that minimize prediction error while achieving a good fit with the data. The authors of [4] studied six distinct stochastic gradient descent-based optimization techniques, namely SGD, SGD with momentum, Adam, Adagrad, AdaDelta, and RMSProp, and compared them based on convergence time, number of fluctuations, and parameter update rate, as well as with respect to
iteration count and test function values. Their experiments showed that AdaDelta converged faster than the other optimizers. In [3], the authors evaluated four distinct gradient descent-based variants, namely stochastic gradient descent, gradient descent, stochastic average descent, and semi-stochastic gradient descent, for their versatility in performing logistic and softmax regression on synthetic numerical data and the MNIST handwritten digit dataset, respectively. In both experiments, stochastic gradient descent performed significantly better than the standard gradient descent algorithm. The two hybrid models, however, were able to achieve improved accuracy in a practical length of time. The vast majority of studies have compared GD and SGD, or at most two of the other optimizers, using a narrow set of metrics and models [6]. Other work has offered strategies and tips for enhancing optimization algorithms and resolving typical issues with these optimizers. In this paper, we conduct a practical experiment to determine the impact of these nine widely used optimizers, using four traditional bearing fault classification datasets on a batch-normalized Convolutional Neural Network.
2.2 Categories of Gradient Descent Algorithms
In machine learning and artificial intelligence, iterative optimization through gradient descent reduces the cost function in order to produce models that are capable of making correct predictions, and it accomplishes this by updating the weight parameters [8]. The cost function (or loss function) quantifies the gap between the model's predicted and observed results.
All of a neural network's weights are initialized to random numbers close to, but not equal to, zero. The gradient, $\partial c/\partial w$, is the partial derivative of the cost with respect to a weight, and the weights are updated using $w = w - \alpha\,\partial c/\partial w$, where $\alpha$ is the learning rate that controls the size of each gradient descent step [7]. In general, there are three distinct types of gradient descent, which are as follows:
• Batch Gradient Descent (BGD): Before updating the weights at each step, the batch gradient method computes the gradient of the cost function over the complete dataset. Convergence is slow because every update requires the complete dataset [10].
• Stochastic Gradient Descent (SGD): An approach that updates the weights for each row of the training dataset. Since the samples are visited in random order and the weights are modified for each sample, the cost function update is noisy and jumps around, as seen in Fig. 1.
• Mini-batch Gradient Descent: A kind of stochastic gradient descent that uses several samples instead of a single training example for each update. It is popular because it is quicker and more stable near the point of convergence. If the dataset permits it, the batch size can be varied [10].
2.3 Types of Optimizers
• Momentum: Momentum accelerates gradient descent (GD) on surfaces that slope more steeply in one direction than the other. It also reduces the oscillation, as can be seen in Fig. 1(b), in comparison with SGD without Momentum
(depicted in Fig. 1(a)). The gradients from the current step and prior steps are used to adjust the weights; this expedites our convergence process.
Fig. 1. (a) SGD without Momentum (b) SGD with Momentum (c) Nesterov Momentum Update (source: http://cs231n.github.io/neural-networks-3/)
\[ v_t = \gamma v_{t-1} + \eta \nabla J(\theta; x, y) \tag{1} \]
\[ \theta_{t+1} = \theta_t - v_t \tag{2} \]
Here $v_{t-1}$ is the momentum value from the previous step, $\gamma$ is a friction coefficient taken to be 0.9, $\nabla J(\theta; x, y)$ is the gradient of the objective function evaluated at the current parameters, $\eta$ is the learning rate, and $\theta$ are the parameter values [10].
• Nesterov Accelerated Gradient (NAG): In this case, we determine the gradient with respect to the approximate next position of the parameters rather than the present one. The weights are adjusted using the gradient at this looked-ahead position, as shown in the equations below and in Fig. 1(c).
\[ v_t = \gamma v_{t-1} + \eta \nabla J(\theta - \gamma v_{t-1}) \tag{3} \]
\[ \theta_{t+1} = \theta_t - v_t \tag{4} \]
\[ \theta - \gamma v_{t-1} \quad \text{(the looked-ahead position at which the gradient is evaluated)} \tag{5} \]
• Adagrad (Adaptive Gradient Algorithm): Adagrad is an adaptive learning rate approach that adapts the learning rate to each parameter. For infrequent parameters it performs larger updates, whereas for frequent parameters it performs smaller updates. Adagrad excels in situations where the data are sparse, such as in large-scale neural networks. With Adagrad, adjusting the learning rate is no longer a manual process.
\[ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \varepsilon}} \cdot g_t \tag{6} \]
$G_t$ is the sum of the squares of the past gradients with respect to all parameters $\theta$, while $\varepsilon$ is a small smoothing constant that prevents division by zero.
• Adadelta: Adadelta is an extension of Adagrad that seeks to reduce Adagrad's aggressive, steadily decreasing learning rate. To do this, it restricts the window of accumulated past gradients to a fixed width $w$. The running average at time $t$ therefore depends only on the previous average and the current gradient [10].
\[ \theta_{t+1} = \theta_t + \Delta\theta_t \tag{7} \]
\[ \Delta\theta_t = -\frac{RMS[\Delta\theta]_{t-1}}{RMS[g]_t} \cdot g_t \tag{8} \]
• RMSProp: The acronym RMSProp stands for "Root Mean Square Propagation." By calculating a moving average of the squared gradient, RMSProp attempts to solve the problem of Adagrad's drastically falling learning rates. The gradient is normalized by the magnitude of the most recent gradients. The learning rate in RMSProp is adjusted automatically, and a different learning rate is used for each parameter.
\[ E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \varepsilon}} \cdot g_t \tag{9} \]
where $E[g^2]_t$ is the weighted average of squared gradients, $g_t$ is the current gradient, and $\gamma$ is a decay term with values between 0 and 1.
• Adam (Adaptive Moment Estimation): Adam is another approach that estimates the first and second moments of the gradients and uses those estimates to determine an individual adaptive learning rate for each parameter. In addition, it mitigates the diminishing learning rates of Adagrad. It is possible to think of Adam as a hybrid of two algorithms: Adagrad, which is effective on sparse gradients, and RMSProp, which is effective in online and non-stationary environments [10].
The hyper-parameters $\beta_1, \beta_2 \in [0, 1)$ set the exponential decay rates of these moving averages, as shown below.
\[ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \tag{10} \]
\[ v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \tag{11} \]
$m_t$ and $v_t$ are estimates of the first and second moments of the gradient, respectively. The resulting parameter update is:
\[ \theta_{t+1} = \theta_t - \frac{\eta\, m_t}{\sqrt{v_t} + \varepsilon} \tag{12} \]
• Nadam (Nesterov-Accelerated Adaptive Moment Estimation): Nadam is a hybrid of NAG and Adam. Nadam is used for gradients that are noisy or have a significant degree of curvature. By combining the Nesterov look-ahead with exponentially decaying moving averages of past and current gradients, the learning process may be accelerated.
• Adamax: Adamax is an optimization technique that builds on the original Adaptive Moment Estimation (Adam) algorithm by using the infinity norm [5]. In other words, it is a variant of the gradient descent optimization method. It is designed to learn time-variant processes, such as voice data with dynamically changing noise conditions, because it can adapt the learning rate according to the properties of the data. Another advantage of this method is that it is faster than traditional methods.
Although many well-known optimization methods are already included in deep learning packages, researchers are still working to improve generality using new optimizers, as in [1, 7].
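The update rules in Eqs. (1) to (12) can be compared side by side with a short sketch. It applies momentum, Nesterov momentum, Adagrad, RMSProp, and Adam to a hypothetical two-parameter quadratic objective, which is not one of the paper's models or datasets. The learning rates and step counts are illustrative assumptions. The Adam function additionally applies the bias correction of the original algorithm, which does not appear in Eq. (12) as printed.

```python
import math

# Hypothetical toy objective: J(theta) = 0.5 * (theta_0^2 + 10 * theta_1^2).
def loss(theta):
    return 0.5 * (theta[0] ** 2 + 10.0 * theta[1] ** 2)

def grad(theta):
    return [theta[0], 10.0 * theta[1]]

def momentum(theta, steps, eta=0.01, gamma=0.9, nesterov=False):
    """Momentum, Eqs. (1)-(2); with nesterov=True, the look-ahead form of Eqs. (3)-(4)."""
    v = [0.0] * len(theta)
    for _ in range(steps):
        point = [p - gamma * vi for p, vi in zip(theta, v)] if nesterov else theta
        g = grad(point)
        v = [gamma * vi + eta * gi for vi, gi in zip(v, g)]
        theta = [p - vi for p, vi in zip(theta, v)]
    return theta

def adagrad(theta, steps, eta=0.5, eps=1e-8):
    """Adagrad, Eq. (6): step scaled by the accumulated squared gradients."""
    G = [0.0] * len(theta)
    for _ in range(steps):
        g = grad(theta)
        G = [Gi + gi * gi for Gi, gi in zip(G, g)]
        theta = [p - eta / math.sqrt(Gi + eps) * gi for p, Gi, gi in zip(theta, G, g)]
    return theta

def rmsprop(theta, steps, eta=0.01, gamma=0.9, eps=1e-8):
    """RMSProp, Eq. (9): a decaying average replaces Adagrad's growing sum."""
    avg = [0.0] * len(theta)
    for _ in range(steps):
        g = grad(theta)
        avg = [gamma * a + (1.0 - gamma) * gi * gi for a, gi in zip(avg, g)]
        theta = [p - eta / math.sqrt(a + eps) * gi for p, a, gi in zip(theta, avg, g)]
    return theta

def adam(theta, steps, eta=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam, Eqs. (10)-(12), plus bias correction (an addition not shown in the text)."""
    m = [0.0] * len(theta)
    v = [0.0] * len(theta)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
        m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
        v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
        theta = [p - eta * mh / (math.sqrt(vh) + eps) for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta

start = [1.0, 1.0]
results = {
    "momentum": momentum(start, 300),
    "nesterov": momentum(start, 300, nesterov=True),
    "adagrad": adagrad(start, 300),
    "rmsprop": rmsprop(start, 300),
    "adam": adam(start, 300),
}
final_losses = {name: loss(theta) for name, theta in results.items()}
```

Comparing `final_losses` across optimizers on such a toy problem says little about behavior on a real ConvNet; the sketch is only meant to show how the update rules differ in code.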
3 Experimental Setup and Test Results
In this experiment, we tested a Batch-Normalized Convolutional Neural Network (BN-CNN) model in experiments on several bearing classification datasets. On the BN-CNN model that was developed, the author verified the performance of nine (9) well-known stochastic gradient-based optimization algorithms that are used in the field of deep learning to optimize models trained using convolutional neural networks. The optimizers investigated are SGD (plain, with momentum, and with Nesterov momentum), RMSProp, Adam, Adamax, Adagrad, Adadelta, and Nadam. The datasets used were the "FEMTO bearing dataset," the "NASA bearing dataset," the "CWRU" dataset, and the "MFPT bearing dataset." With the exception of the MFPT dataset, which was obtained from the Vibration Institute's Condition-Based Maintenance Fault Database, all of these datasets were accessible through Kaggle's public dataset platform. The author used the BN-ConvNet model to apply all of the optimization algorithms to each individual dataset. For the sake of consistency, the same settings were maintained for all of the hyperparameters throughout all of our investigations.
3.1 Convolutional Neural-Network Model
When it comes to applications in fields like speech recognition, natural language processing, and image processing, the CNN is by far the most used deep learning model. A CNN is a special type of neural network that is typically employed for processing data with a grid-like topology, such as time series data gathered at regular intervals in a 1D grid format, and 2D image data comprised of pixel grids [2]. In its convolutional layers, a CNN applies the mathematical convolution operation with a weighting function $w(a)$, where $a$ is the age (time offset) of a measurement in the time series. Following previous research such as [3, 9], when a weighted average is applied at each time instant, the convolution operation may be described in general as follows:
\[ s(t) = \int x(a)\, w(t - a)\, da = (x * w)(t) \tag{13} \]
The input is $x$, the filter or kernel is $w$, and the output $s$ is the feature map at time $t$. When $x$ and $w$ are defined only for integer values of $t$, the discrete convolution may be written as:
\[ s(t) = (x * w)(t) = \sum_{a=-\infty}^{\infty} x(a)\, w(t - a) \tag{14} \]
When the input image and the kernel both have two dimensions, the cross-correlation form of the convolution is:
\[ S(i, j) = (I * K)(i, j) = \sum_{m} \sum_{n} I(i + m, j + n)\, K(m, n) \tag{15} \]
where $K(m, n)$ are the filter coefficients, $(i, j)$ are the pixel coordinates in the input image $I$, and $S(i, j)$ is the resulting feature map. The ConvNet architecture used in this investigation had 3 × 3 filters across all convolution layers, 2 × 2 max pooling with a stride of 2, and finally a fully connected layer. The model was developed with the Keras deep learning package. During training, dropout and data augmentation were used as regularization methods to improve the accuracy of the results. A Rectified Linear Unit (ReLU) nonlinear activation function is applied after each convolutional layer. Table 1 provides a brief overview of the ConvNet setup used for this research project, and Fig. 2 shows the 1-D CNN architecture.
3.2 Datasets
For the purpose of this study, we selected four (4) well-known bearing fault classification datasets to test our model using a variety of optimization strategies. Figure 3(a) depicts the vibration signal acquisition process, while Fig. 3(b) represents the vibration-based bearing fault detection process flow using multiple techniques, together with their respective performances.
• Dataset 1 (FEMTO dataset): The FEMTO-ST Institute in Besançon, France, released the FEMTO dataset for public use. This dataset contains the results of the actual bearing accelerated life tests that were
Fig. 2. 1-D CNN Architecture (Source: Article: https://www.researchgate.net/publication)
Fig. 3. (a) Measurement of Vibrations (b) Bearing Defect Detection using Vibration Data: An Overarching Process Flow Depicting the use of Several Strategies and their Relative Performances (Source: [2])
conducted utilising an experimental platform known as PRONOSTIA. The primary goal of PRONOSTIA is to provide real data on the accelerated degradation of bearings under operating conditions that are either constant or variable and are controlled online. Bearing degradation tests can be carried out on the experimental platform, which is depicted in Fig. 4(a), in only a few hours. As a result, it is feasible to complete a large number of tests within a week. The PRONOSTIA test bed is made up of three primary
components: a rotating component, a degradation generation component, and a measuring component. The operating conditions are characterized by two sensors, one that measures the rotational speed and another that measures the force. Temperature and vibration signals are gathered on the PRONOSTIA platform to enable online bearing health monitoring. These data are gathered using vertical and lateral accelerometers. In addition, the vibration signals were sampled at 2.6 kHz and recorded every ten seconds. After that, we compare the vibration signals from the degraded bearings with the vibration signal from a bearing that has not degraded or is functioning normally. Finally, the sensor monitoring data may be processed further to extract meaningful features and perform ongoing assessments of the bearing's health. A total of 17 runs with varying degrees of failure were gathered across three separate operating conditions. To create the FEMTO dataset, an accelerometer was used to record a continuous stream of vibrations from the PRONOSTIA server. This data is then fed to the first convolutional layer of the network to produce a feature map. To create the rectified feature map, a ReLU function is applied to the convolved map. For feature detection, several convolution and ReLU layers are applied to the input signal. To isolate certain components of a signal, pooling layers are employed in conjunction with a variety of filters. The final classified output is generated by first flattening the pooled feature map and then feeding it to the fully connected layer. A summary of the ConvNet configuration is shown in Table 1.
• Dataset 2 (CWRU dataset): The bearing fault dataset can be accessed at Case Western Reserve University (https://case.edu/).
The following nine statistical features are calculated in order to perform fault identification: maximum, minimum, mean, standard deviation, RMS, skewness, kurtosis, crest factor, and form factor. This dataset of statistical features is used as input to the ConvNet. Each feature is computed for time segments of 2048 points (0.04 s at the 48 kHz accelerometer sampling frequency). This data is then fed to the first convolutional layer of the network to produce a feature map. To create the rectified feature map, a ReLU function is applied to the convolved map. For feature detection, several convolution and ReLU layers are applied to the input signal. To isolate certain components of a signal, pooling layers are employed in conjunction with a variety of filters. The final classified output is obtained by flattening the pooled feature map and feeding it to a fully connected layer. A summary of the ConvNet configuration is shown in Table 1, and the experimental setup is shown in Fig. 4(b).
• Dataset 3 (NASA dataset): There are three distinct bearing datasets included in the data packet (IMS-Rexnord Bearing Data.zip). Each dataset represents one run-to-failure test. The files that make up a dataset each contain one second of vibration signals gathered at regular intervals. Each file has a sampling rate of 20 kHz and a total of 20,480 data points. An NI DAQ Card 6062E was used to record the vibration readings. This data is then fed to the first ConvNet layer of the Convolutional neural network to
Fig. 4. (a) Experimental Setup on PRONOSTIA Platform (b) CWRU Bearing Test Bed for Collecting Vibration Signals. (c) Experimental Platform of a NASA IMS Bearing Run-to-Failure Dataset. (Source: [2])
produce a feature map. To create the rectified feature map, a ReLU function is applied to the convolved map. For feature detection, several convolution and ReLU layers are applied to the input signal. To isolate certain components of a signal, pooling layers are employed in conjunction with a variety of filters. The final classified output is generated by first flattening the pooled feature map and then feeding it to the fully connected layer. A summary of the ConvNet configuration is shown in Table 1, and the experimental platform is shown in Fig. 4(c).
• Dataset 4 (MFPT dataset): Another dataset that may be used for REB (rolling element bearing) defect identification and diagnosis is the MFPT dataset provided by the Society for Machinery Failure Prevention Technology. The primary objective of the Condition Based Maintenance (CBM) fault database is to offer a wide range of bearing and gear-specific datasets containing examples of both healthy and damaged components. To encourage wider use of bearing analysis, a bearing fault dataset has been made available to researchers. The set includes data from a bearing test rig (nominal bearing data, an outer race fault at different loads, and an inner race fault at varying loads) as well as three actual failures. The three real-world examples are an intermediate shaft bearing from a wind turbine, an oil pump shaft bearing from a wind turbine, and a planet bearing defect. The MFPT data were recorded on a NICE bearing. The seeding processes that produced the faults are not documented. With a load of 1201 N on the bearing, three readings are provided for the baseline condition and three for an outer race fault. In addition, seven measurements each for the outer and inner race defects are provided over the 0–1334 N bearing load range.
The bearing dataset from the Condition-Based Maintenance Defect Database includes nominal bearing data, data from a bearing test rig with a defect in the outer race, and data from a bearing test rig with a fault in the inner race under different loads. This vibration-based dataset is fed to the first ConvNet layer of the convolutional neural network to produce a feature map. To create the rectified feature map, a ReLU function is applied to the convolved map. For feature detection, several convolution and ReLU layers are applied to the input signal. To isolate certain components of a signal, different pooling layers are employed in conjunction with a wide variety of filters. The final classified output is generated by first flattening the pooled feature map and then feeding it to the fully connected layer. A summary of the ConvNet configuration is shown in Table 1.
3.3 Overall Results Discussion
• Dataset 1 – FEMTO Bearing dataset: Table 2 and Figs. 5–13 present the results for the FEMTO Bearing dataset. Adam performs better than all other optimizers, with an accuracy of 99.60%, closely trailed by Nadam with an accuracy of 99.48%, while Adadelta has the poorest performance with an accuracy of 10.57%. In terms of convergence time, Sgdm and Sgdm + n both have the fastest convergence time of 2 min and 7 s, followed by Sgd (2 min and 15 s) and then Adam and Rmsprop with
C. Usigbe and X. Perry
Table 1. Summary of BN-ConvNet Configurations
Dataset1-FEMTO Bearings
Dataset2-NASA Bearings
Dataset3-CWRU Bearings
Input time-series data Input time-series data Input statistical – – feature data-9 × 1 50 × 50 × 1 143 × 143 × 1
Dataset4-MFPT Bearings Input time-series data – 50 × 50 × 1
Conv3 × 3 × 8; stride Conv3 × 3 × 8; stride Conv3 × 3 × 8; stride Conv3 × 3 × 8; stride =1 =1 =1 =1 ReLU (nonlinearity function) Pooling layer: MaxPooling 2 × 2; stride = 2
Pooling layer: MaxPooling 2 × 2; stride = 2
Pooling layer: MaxPooling 2 × 2; stride = 2
Pooling layer: MaxPooling 2 × 2; stride = 2
Conv3 × 3 × 8; stride Conv3 × 3 × 8; stride Conv3 × 3 × 8; stride Conv3 × 3 × 8; stride =1 =1 =1 =1 ReLU (nonlinearity function) Pooling layer: MaxPooling 2 × 2; stride = 2
Pooling layer: MaxPooling 2 × 2; stride = 2
Pooling layer: MaxPooling 2 × 2; stride = 2
Pooling layer: MaxPooling 2 × 2; stride = 2
Conv3 × 3 × 32; stride = 1
Conv3 × 3 × 32; stride = 1
Conv3 × 3 × 32; stride = 1
Conv3 × 3 × 32; stride = 1
ReLU (nonlinearity function) Pooling layer: MaxPooling 2 × 2; stride = 2
Pooling layer: MaxPooling 2 × 2; stride = 2
Pooling layer: MaxPooling 2 × 2; stride = 2
Pooling layer: MaxPooling 2 × 2; stride = 2
Conv3 × 3 × 16; stride = 1
Conv3 × 3 × 16; stride = 1
Conv3 × 3 × 16; stride = 1
Conv3 × 3 × 16; stride = 1
Pooling layer: MaxPooling 2 × 2; stride = 2
Pooling layer: MaxPooling 2 × 2; stride = 2
FC-layer-144
FC layer-76
ReLU (nonlinearity function) Pooling layer: MaxPooling 2 × 2; stride = 2
Pooling layer: MaxPooling 2 × 2; stride = 2
Flattening (Input Layer of Neural Network) FC layer-76
FC layer-76
ReLU (nonlinearity function) Dropout = 0.3 (addressing overfitting issue) Categorical Cross entropy
Categorical Cross entropy
Categorical Cross entropy
Categorical Cross entropy
Softmax Layer (σ)
Softmax Layer (σ)
Softmax Layer (σ)
Softmax Layer (σ)
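The layer stack listed in Table 1 can be traced numerically to see how the spatial size of the feature map shrinks before flattening. The following is only a sketch under stated assumptions: the table does not give the convolution padding, so 'same' padding is assumed here; only the four Conv/MaxPooling stages with 8, 8, 32 and 16 filters are modelled; and the helper names (conv_same, max_pool, trace_shapes, flattened_length) are hypothetical, not taken from the paper.

```python
def conv_same(size: int) -> int:
    """3x3 convolution, stride 1, with 'same' padding (an assumption): size unchanged."""
    return size


def max_pool(size: int, window: int = 2, stride: int = 2) -> int:
    """2x2 max pooling, stride 2, no padding: floor((size - window) / stride) + 1."""
    return (size - window) // stride + 1


def trace_shapes(side: int, filters=(8, 8, 32, 16)):
    """Return the (height, width, channels) shape after each Conv + MaxPooling stage."""
    shapes = [(side, side, 1)]
    for n_filters in filters:
        side = max_pool(conv_same(side))
        shapes.append((side, side, n_filters))
    return shapes


def flattened_length(shape) -> int:
    """Number of values handed to the fully connected layer after flattening."""
    height, width, channels = shape
    return height * width * channels


if __name__ == "__main__":
    for side in (50, 143):
        shapes = trace_shapes(side)
        print(side, shapes, flattened_length(shapes[-1]))
```

Under these assumptions a 50 × 50 × 1 input ends at 3 × 3 × 16, i.e. 144 flattened values, which coincides with the FC-layer-144 entry in Table 1; with a different padding choice the numbers would differ.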
Table 2. Results for FEMTO Dataset
Optimizer   Convergence time (hrs:min:sec)   Accuracy   Loss
Nadam       0:02:35                          0.9948     0.0339
Sgdm        0:02:07                          0.1910     1.2383
Adam        0:02:19                          0.9960     0.0220
Adagrad     0:02:27                          0.1880     2.1739
Adamax      0:02:40                          0.7565     0.6520
Rmsprop     0:02:19                          0.9688     0.1068
Sgd         0:02:15                          0.4099     1.5038
Sgdm+n      0:02:07                          0.5234     1.3231
Adadelta    0:02:22                          0.1057     2.3943
Table 3. Results for CWRU Dataset
Optimizer   Convergence time (hrs:min:sec)   Accuracy   Loss
Nadam       0:00:10                          0.9639     0.1216
Sgdm        0:00:11                          0.9065     0.2700
Adam        0:00:13                          0.9619     0.1030
Adagrad     0:00:11                          0.7219     1.0147
Adamax      0:00:12                          0.9477     0.1647
Rmsprop     0:00:11                          0.9548     0.1273
Sgd         0:00:11                          0.9065     0.2700
Sgdm+n      0:00:11                          0.9174     0.2747
Adadelta    0:00:12                          0.2845     2.2303
Table 4. Results for NASA Dataset
Optimizer   Convergence time (hrs:min:sec)   Accuracy   Loss
Nadam       0:11:12                          1.0000     2.3e-07
Sgdm        0:11:27                          0.9733     0.4067
Adam        0:11:13                          0.9800     7.5e-07
Adagrad     0:58:54                          0.9700     0.0364
Adamax      0:11:25                          0.9900     7.3e-06
Rmsprop     0:12:53                          1.0000     5.4e-09
Sgd         0:11:32                          0.9800     0.0022
Sgdm+n      0:11:22                          0.9976     0.1739
Adadelta    0:11:35                          0.2888     1.3524
Table 5. Results for MFPT Dataset
Optimizer   Convergence time (hrs:min:sec)   Accuracy   Loss
Nadam       0:02:01                          1.0000     4.9e-05
Sgdm        0:00:11                          0.9065     0.2700
Adam        0:02:57                          1.0000     3.2e-05
Adagrad     0:02:48                          0.5455     0.9964
Adamax      0:02:51                          1.0000     0.0004
Rmsprop     0:02:51                          1.0000     2.1e-06
Sgd         0:02:40                          0.5738     0.8303
Sgdm+n      0:02:46                          0.8498     0.4124
Adadelta    0:03:03                          0.3424     1.3757
Fig. 5. Adadelta (FEMTO Dataset)
Fig. 6. Adagrad (FEMTO Dataset)
Fig. 7. Adam (FEMTO Dataset)
Fig. 8. Adamax (FEMTO Dataset)
Fig. 9. Nadam (FEMTO Dataset)
Fig. 10. Rmsprop (FEMTO Dataset)
Fig. 11. Sgd (FEMTO Dataset)
Fig. 12. Sgdm (FEMTO Dataset)
Fig. 13. Sgdm + n (FEMTO Dataset)
a convergence time of 2 min and 19 s. Adamax has the slowest convergence time of 2 min and 40 s.
• Dataset 2 – CWRU Bearing dataset: Table 3 and Figs. 14–22 present the results for the CWRU Bearing dataset, which shows Nadam performing
Fig. 14. Adadelta (CWRU Dataset)
Fig. 15. Adagrad (CWRU Dataset)
better than all other optimizers, with an accuracy of 96.39%, closely trailed by Adam with an accuracy of 96.19%, while Adadelta has the poorest performance with an accuracy of 28.45%. In terms of convergence time, Nadam has the fastest
Fig. 16. Adam (CWRU Dataset)
Fig. 17. Adamax (CWRU Dataset)
Fig. 18. Nadam (CWRU Dataset)
Fig. 19. Rmsprop (CWRU Dataset)
Fig. 20. Sgd (CWRU Dataset)
Fig. 21. Sgdm (CWRU Dataset)
convergence time of 10 s, followed by Sgdm, Sgd, Adagrad, Sgdm + n and Rmsprop with a convergence time of 11 s. Adam has the slowest convergence time of 13 s.
• Dataset 3 – NASA Bearing dataset: Table 4 and Figs. 23–31 present the results for the NASA Bearing dataset, which shows both Nadam and
Fig. 22. Sgdm + n (CWRU Dataset)
Fig. 23. Adadelta (NASA Dataset)
Rmsprop performing better than all other optimizers, with an accuracy of 100.00%, closely trailed by Sgdm + n with an accuracy of 99.76%, while Adadelta has the poorest performance with an accuracy of 28.88%. In terms of convergence time, Nadam has the fastest convergence time of 11 min and 12 s, followed by Adam with
Fig. 24. Adagrad (NASA Dataset)
Fig. 25. Adam (NASA Dataset)
Fig. 26. Adamax (NASA Dataset)
Fig. 27. Nadam (NASA Dataset)
Fig. 28. Rmsprop (NASA Dataset)
Fig. 29. Sgd (NASA Dataset)
Fig. 30. Sgdm (NASA Dataset)
Fig. 31. Sgdm + n (NASA Dataset)
a convergence time of 11 min and 13 s. Adagrad has the slowest convergence time of 58 min and 54 s (Table 4).
• Dataset 4 – MFPT Bearing dataset: Table 5 and Figs. 32–40 present the results for the MFPT Bearing dataset, which shows Nadam, Adam,
Fig. 32. Adadelta (MFPT Dataset)
Fig. 33. Adagrad (MFPT Dataset)
Adamax, and Rmsprop performing better than all other optimizers, with an accuracy of 100.00%, followed by Sgdm with an accuracy of 90.65%, while Adadelta has the poorest performance with an accuracy of 34.24%. In terms of convergence
Fig. 34. Adam (MFPT Dataset)
Fig. 35. Adamax (MFPT Dataset)
time, Sgdm has the fastest convergence time of 11 s, followed by Nadam with a convergence time of 2 min and 1 s. Adadelta has the slowest convergence time of 3 min and 3 s (Table 5).
Fig. 36. Nadam (MFPT Dataset)
Fig. 37. Rmsprop (MFPT Dataset)
Fig. 38. Sgd (MFPT Dataset)
Fig. 39. Sgdm (MFPT Dataset)
Fig. 40. Sgdm + n (MFPT Dataset)
4 Conclusions and Recommendation for Future Research
A comparison of the performance of nine optimization techniques was carried out on four bearing fault classification datasets using a Batch-Normalized Convolutional Neural Network model (BN-CNN). The results indicate that the performance of each optimizer differed depending on the dataset. This provides further evidence that data type and size both affect the performance of the optimizers in terms of accuracy, convergence time and loss. Based on the tests carried out, Nadam displayed superior and robust performance across all four datasets in comparison to the other optimization strategies. This may be because Nadam combines the advantages of the Nesterov Accelerated Gradient (NAG) method and the Adaptive Moment Estimation (Adam) algorithm, both of which appear to be appropriate for the four datasets and the model investigated in this work. All of our tests were carried out using a batch-normalized Convolutional Neural Network model and four bearing fault classification datasets. We propose that future studies use more than four datasets from diverse subject areas, and compare the impact of these optimizers across a range of ConvNet configurations and deep learning models to achieve better generalization.
References
1. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (ICLR) (2015)
2. Neupane, D., Seok, J.: Bearing fault detection and diagnosis using case western reserve university dataset with deep learning approaches: a review. IEEE Access 8, 93155–93178 (2020)
3. Dogo, E.M., Afolabi, O.J., Nwulu, N.I., Twala, B., Aigbavboa, C.O.: A comparative analysis of gradient descent-based optimization algorithms on convolutional neural networks. Department of Electrical and Electronics Engineering Science, University of Johannesburg (2019)
4. Hinton, G., Srivastava, N., Swersky, K.: Rmsprop: divide the gradient by a running average of its recent magnitude (2012)
5. Sutskever, I., Martens, J., Dahl, G., Hinton, G.: On the importance of initialization and momentum in deep learning, pp. 1139–1147. PMLR, Atlanta (2013)
6. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)
7. Lv, K., Jiang, S., Li, J.: Learning gradient descent: better generalization and longer horizons. arxiv.org/abs/1703.03633 (2017)
8. Zeiler, M.D.: ADADELTA: an adaptive learning rate method (2012)
9. Hallen, R.: A study of gradient-based algorithms (2017). Accessed http://lup.lub.lu.se/student-papers/record/8904399, 29
10. Khandelwal, R.: Overview of different optimizers for neural networks (2020). Accessed https://medium.datadriveninvestor.com/overview-of-different-optimizers-for-neural-networks-e0ed119440c3
Landslide Prediction Using Multi-Layer Perceptron Model
Geetanjali Mahamunkar(B), Arvind Kiwelekar, and Laxman Netak
Dr. Babasaheb Ambedkar Technological University, Lonere, India
[email protected]
Abstract. Landslides are one of the most dangerous natural hazards, which not only lead to deaths and economic losses but also affect the ecosystem. This study proposes a Multi-Layer Perceptron (MLP) model to identify landslides using data collected from Muzaffarabad, Pakistan. The dataset includes information from 1213 different sites in Muzaffarabad and 12 relevant parameters for predicting a landslide. We first trained various Machine Learning models and compared their results with the MLP. The applied models are then assessed using a variety of performance indicators. The results show that the developed model achieves the highest accuracy among the Machine Learning models used in this paper. Our proposed approach can be used to anticipate landslides efficiently in other sensitive areas with similar environmental circumstances.
Keywords: Multi-Layer Perceptron (MLP) · Deep Learning · Geospatial Data Analysis · Machine Learning
1 Introduction
Landslides are a broad category of processes resulting from the outward or downhill movement of materials such as soil, rock, vegetation, etc. [1]. Along with endangering property and human lives in the mountainous region, deforestation frequently damages transportation and communication systems. Severe rainfall and earthquakes of very high magnitude are the main triggers of landslide events. We have used the dataset (see footnote 1) of landslides of the Muzaffarabad district of Pakistan to build a model for landslide prediction. Various quantitative and qualitative methods are used for the investigation of landslide susceptibility [2]. To anticipate the occurrence of such a natural hazard and to lessen its effects, image processing techniques are used along with satellite remote sensing [3]. Aspect, curvature, earthquake, elevation, flow, lithology, precipitation, profile, plan, and slope are the common factors contributing to landslides considered in the current study. Normalised Difference Water Index (NDWI) and Normalised Difference Vegetation Index (NDVI) are also used to predict landslides.
Footnote 1: https://www.kaggle.com/datasets/adizafar/landslide-prediction-formuzaffarabadpakistan.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): SAI 2023, LNNS 711, pp. 398–407, 2023. https://doi.org/10.1007/978-3-031-37717-4_26
Fig. 1. Diagrammatic Representation of Multi-Layer Perceptron
In this paper, the susceptibility to landslides is predicted using five machine learning models: Logistic Regression, Linear Discriminant Analysis, Gaussian Naive Bayes, Support Vector Machine, and Multi-Layer Perceptron. The Muzaffarabad district dataset was used to train these models for landslide susceptibility. We compared the results of each model based on various statistical parameters to determine which had the highest accuracy. The remainder of the paper is organised as follows. Section 2, “Literature Survey”, discusses the theoretical background of the domain area. Section 3, “Methodology”, explains the methods used in this study. Section 4, “Results and Conclusion”, presents the results obtained from the applied models and draws conclusions.
2 Literature Survey
Identification of landslide-prone locations via landslide susceptibility mapping is crucial for preventing future landslides, casualties, and infrastructure damage. It has been observed that landslides occur mainly after heavy rainfalls or earthquakes. Certain anthropogenic activities, such as deforestation, mining,
G. Mahamunkar et al.
Fig. 2. Working Methodology
quarrying, etc., also result in landslides. Taking into account the topographic, geologic, geomorphologic, and pedologic data currently accessible, 15 governing or determining factors have been developed in [4]. We have considered 12 factors, namely, aspect, curvature, earthquake, elevation, flow, lithology, precipitation, profile, plan, and slope, along with NDVI and NDWI, for classifying the area type into landslide or non-landslide in this paper. The probabilistic analysis of the available data is done using machine learning methods such as Logistic Regression, Linear Discriminant Analysis, Gaussian Naive Bayes, Support Vector Machine, and Multi-Layer Perceptron.
2.1 Landslide Conditioning Factors
The landslide conditioning factors used in this paper include aspect, curvature, earthquake, elevation, flow, lithology, precipitation, profile, plan, and slope. Aspect is a parameter used in creating landslide susceptibility maps. An aspect map displays a terrain's direction and slope simultaneously. As a result, it plays a significant role in the analysis and creation of maps showing landslide risk. The movement of material down a slope in the form of fluid is known as a flow in landslides. Different flows include mud, debris, and
rock flows (rock avalanches). The flow's acceleration and deceleration are influenced by the profile curvature, which also affects erosion and deposition. The convergence and divergence of the flow are affected by the plan curvature. Considering plan and profile curvature together helps to understand the flow across a surface. Hence curvature is also a factor to be considered in determining the occurrence of landslides. Because elevation affects environmental factors such as vegetation, temperature, precipitation, and humidity, it is considered among the factors affecting landslides [5]. Lithology is the study of the chemical, mineralogical, and physical characteristics of rocks. It plays a vital role in the occurrence of landslides [6]; hence it has been considered for predicting landslide susceptibility. Precipitation can trigger landslides by sweeping debris from the surface downstream and by percolating into bedrock and soil, weakening slopes. The Normalized Difference Vegetation Index (NDVI) indicates vegetation density and can be used as a criterion for classifying land cover. The capacity of NDVI to detect changes in vegetation density is essential for detecting areas at higher risk of landslides caused by human activities such as deforestation [7]. The Normalized Difference Water Index (NDWI) is a satellite-derived index computed from the Near-Infrared (NIR) and Short-Wave Infrared (SWIR) channels. Rainfall data may be related to the water index [8], and rainfall is one of the triggering variables that contribute to landslides. As rainfall increases, the NDWI value also tends to increase, raising the likelihood of landslides. Hence we have considered NDWI for landslide susceptibility calculation in this paper.
Landslide Dataset. The dataset used in this paper includes data from 1213 different sites from the Muzaffarabad district of Pakistan and 12 relevant parameters [9].
Landslide Susceptibility Zone mapping is generally done using the Fuzzy Gamma Operator model [10], which rates each parameter as a deciding factor in predicting a landslide on five susceptibility classes: very high (5), high (4), moderate (3), low (2), and very low (1). The higher the value, the greater the probability that the parameter contributes to a landslide. The 13th column in the dataset has the value 0 for a non-landslide area and 1 for a landslide area.
2.2 Machine Learning for Landslide Susceptibility Prediction
Deep learning is a family of artificial intelligence (AI) and machine learning techniques that models how people acquire specific types of information. Deep learning emulates the working of the human brain to solve statistical and predictive modelling problems. One of the main applications of deep learning is geospatial data analysis [11]. Data which has a location attached to it is called geospatial data. Geospatial data analysis involves the analysis of such data acquired through remote sensing and other emerging technologies such as drones and unmanned aerial vehicles. Applications of geospatial data analysis using deep learning include mapping
Fig. 3. Algorithm Comparison based on Mean
different land covers and identifying changes in their coverage over a specific period [12], land use and land cover classification [13] using analysis of satellite images, and classification activities related to geocoordinate-based data [14]. Machine learning models such as Random Forest and Support Vector Machines have been used earlier to assess and map landslide susceptibility [15]. This paper uses machine learning models such as Logistic Regression, Linear Discriminant Analysis, Gaussian Naive Bayes, and Support Vector Machine along with the Multi-Layer Perceptron model for predicting the occurrence of landslides in the Muzaffarabad district of Pakistan based on the factors described in Sect. 2.1. We have compared the performances of these algorithms in terms of various statistical parameters.
Logistic Regression. Logistic Regression is a statistical analysis method that uses previous observations from a data set to predict a binary outcome, such as yes or no. A logistic regression model predicts a dependent variable by examining its relationship with one or more independent variables. In this paper, the 12 landslide conditioning factors described in Sect. 2.1 act as the independent variables, and the dependent variable is whether a site is a landslide or a non-landslide.
Linear Discriminant Analysis. For supervised classification problems, a common dimensionality reduction technique is Linear Discriminant Analysis.
Fig. 4. Algorithm Comparison based on Standard Deviation
It models the differences between groups by separating them into two or more classes. The features in a higher-dimensional space are projected onto a lower-dimensional space using this technique. For instance, suppose two classes must be separated effectively. Classes may have many features, and classifying them using only one feature may leave some overlap, so further features are added to improve the separation. Since this study uses 12 features to classify a location as landslide or non-landslide, we have used the Linear Discriminant Analysis technique.
Gaussian Naive Bayes. Naive Bayes is a simple yet effective algorithm. To choose the output class, the Naive Bayes approach assumes that the predictors contribute equally and independently of one another. When the predictor values are continuous and are expected to follow a Gaussian distribution, the Gaussian Naive Bayes classifier is used. Hence we have used it in our study, considering the possible values of the independent variables.
Support Vector Machine. Support Vector Machine (SVM) comprises a set of supervised learning methods for classification, regression and outlier detection. The Support Vector Classification (SVC) model used in this paper is the LinearSVC class from the scikit-learn package (sklearn), which is a faster implementation of Support Vector Classification for the case of a linear kernel and can perform binary and multi-class classification on a dataset.
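The independence and Gaussian assumptions behind the Gaussian Naive Bayes classifier described above can be made concrete with a small from-scratch sketch. This is not the scikit-learn implementation used in the paper; the function names (fit_gaussian_nb, predict) and the two-feature toy data are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate a log-prior and per-feature mean/variance for every class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # A small constant keeps the variance strictly positive.
        variances = [sum((value - mu) ** 2 for value in col) / n + 1e-9
                     for col, mu in zip(zip(*rows), means)]
        model[label] = (math.log(n / len(X)), means, variances)
    return model


def predict(model, row):
    """Pick the class with the highest log-posterior under feature independence."""
    def log_posterior(params):
        log_prior, means, variances = params
        # Independence assumption: per-feature Gaussian log-densities simply add up.
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
            for x, mu, var in zip(row, means, variances))
    return max(model, key=lambda label: log_posterior(model[label]))


# Toy data: label 0 = non-landslide, 1 = landslide (values are made up).
X_toy = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [4.0, 5.0], [4.2, 4.8], [3.8, 5.2]]
y_toy = [0, 0, 0, 1, 1, 1]
toy_model = fit_gaussian_nb(X_toy, y_toy)
```

Calling predict(toy_model, [1.1, 2.1]) selects class 0 and predict(toy_model, [4.1, 4.9]) selects class 1, since each query lies close to the mean of the corresponding class.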
Table 1. Experimental Results
Algorithm                      Train Accuracy   Test Accuracy   Precision   Recall   AUC      F1-score
MLP Classifier                 0.8420           0.8049          0.7626      0.8628   0.8071   0.8097
Linear Discriminant Analysis   0.7229           0.7912          0.7647      0.8171   0.7922   0.7901
Logistic Regression            0.7323           0.7885          0.7634      0.8114   0.7893   0.7867
Support Vector Machine         0.7276           0.7885          0.7634      0.8114   0.7893   0.7867
Gaussian Naive Bayes           0.7288           0.7637          0.7149      0.8457   0.7668   0.7749
Multi-layer Perceptron. A Multi-Layer Perceptron (MLP) is a fully connected class of feedforward artificial neural network (ANN). An MLP consists of at least three layers of nonlinearly activating nodes: an input layer, one or more hidden layers, and an output layer, as shown in Fig. 1. Since MLPs are fully connected, each node in one layer connects to every node in the subsequent layer with a specific weight. The perceptron learns by adjusting connection weights based on the amount of error in the output compared to the desired result after each piece of data is processed. This is an instance of supervised learning and is carried out through backpropagation, a generalisation of the least mean squares algorithm of the linear perceptron. The loss function measures the difference between the output and the target output, and the weights are adjusted along the gradient of this loss, propagated backwards through the activation functions, to minimize the difference.
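The forward pass and the backpropagation update described above can be sketched for a network with a single hidden layer. The sketch below is illustrative only: the layer sizes, the sigmoid activations, the cross-entropy loss, the learning rate and all weight values are assumptions made for the example, not the configuration used in this paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, b1, W2, b2):
    """Input layer -> one hidden layer -> single sigmoid output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    output = sigmoid(sum(w * h for w, h in zip(W2, hidden)) + b2)
    return hidden, output


def cross_entropy(output, target):
    """Binary cross-entropy loss for one sample."""
    return -(target * math.log(output) + (1 - target) * math.log(1 - output))


def train_step(x, target, W1, b1, W2, b2, lr=0.5):
    """One backpropagation step: push the output error back through the layers."""
    hidden, output = forward(x, W1, b1, W2, b2)
    delta_out = output - target  # dLoss/dz for sigmoid output + cross-entropy
    # Chain rule through the hidden sigmoid: h * (1 - h) is its derivative.
    delta_hidden = [delta_out * w * h * (1 - h) for w, h in zip(W2, hidden)]
    new_W2 = [w - lr * delta_out * h for w, h in zip(W2, hidden)]
    new_b2 = b2 - lr * delta_out
    new_W1 = [[w - lr * d * xi for w, xi in zip(row, x)]
              for row, d in zip(W1, delta_hidden)]
    new_b1 = [b - lr * d for b, d in zip(b1, delta_hidden)]
    return new_W1, new_b1, new_W2, new_b2


# Hypothetical sample with three features and target class 1.
x_demo, target_demo = [0.5, -1.0, 2.0], 1
W1, b1 = [[0.1, -0.2, 0.3], [0.4, 0.1, -0.1]], [0.0, 0.0]
W2, b2 = [0.2, -0.3], 0.0

initial_loss = cross_entropy(forward(x_demo, W1, b1, W2, b2)[1], target_demo)
for _ in range(20):
    W1, b1, W2, b2 = train_step(x_demo, target_demo, W1, b1, W2, b2)
final_output = forward(x_demo, W1, b1, W2, b2)[1]
final_loss = cross_entropy(final_output, target_demo)
```

After the 20 updates the loss on this single sample is lower than at the start, which is the behaviour the weight-adjustment rule is meant to produce. A practical model would train on mini-batches over the whole dataset.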
3 Methodology
We have implemented the algorithms described in Sect. 2.2 in Google Colab using basic Python packages and sklearn modules. Initially, the dataset described in Sect. 2.1 is loaded, and the data are scaled using standardization. When feature values lie on similar scales, a machine learning algorithm is more likely to train well and quickly; when feature values are highly dissimilar, training takes longer and tends to yield lower accuracy. Scaling therefore brings data points that are far apart closer together, reducing the gaps between them. The data is split into train and test sets in a 70:30 ratio. The machine learning models are then applied, and their performance is evaluated using statistical measures such as precision, recall, Area Under Curve (AUC) and F1-score. Lastly, a ROC curve is plotted from the resulting values. The receiver operating characteristic (ROC) curve is a graph that displays how well a classification model performs across all classification thresholds. AUC stands for “Area under the ROC Curve” and measures the degree of separability between classes. The mean and standard
Fig. 5. Algorithm Comparison based on ROC
deviation for each algorithm are evaluated and compared. The input features are the 12 Landslide Conditioning Factors, based on which the label is predicted. The accuracy of each algorithm is calculated and compared. The methodology used in this paper is shown in Fig. 2.
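The two preprocessing steps of this methodology, standardization and the 70:30 split, can be sketched without any external library. This is an illustration rather than the sklearn calls used in the paper; the function names, the fixed random seed and the rounding of the test-set size are assumptions.

```python
import random
import statistics


def standardize(column):
    """Rescale one feature column to zero mean and unit standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(value - mean) / std if std else 0.0 for value in column]


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the row indices and hold out a fraction of rows as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


if __name__ == "__main__":
    print(standardize([1.0, 2.0, 3.0, 4.0, 5.0]))
    train, test = train_test_split(list(range(1213)))
    print(len(train), len(test))
```

With 1213 sites and this rounding rule, the split yields 849 training rows and 364 test rows. A common refinement, not described in the text, is to compute the mean and standard deviation on the training rows only and reuse them for the test rows.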
4 Results and Conclusion
Table 1 summarizes the experimental results in terms of accuracy, precision, recall, AUC and F1-score for the machine learning models implemented for landslide prediction. Accuracy is the percentage of correct predictions on the test data. Precision is the number of true positives divided by the number of true positives plus the number of false positives, whereas recall is the number of true positives divided by the number of true positives plus the number of false negatives. The F1-score combines precision and recall into a single metric by taking their harmonic mean. Figure 3 and Fig. 4 show the comparison of the mean and standard deviation, respectively, of the algorithms implemented for this paper. The ROC curve, which measures a classifier's ability to differentiate between classes, is summarised by the area under the curve (AUC). The greater a classifier's AUC score, the better it can distinguish between the positive and negative classes. In general, the area under a curve y = f(x) between x = a and x = b is obtained by integrating f(x) between the limits a and b. The area under a ROC curve is used to compare the utility of tests: a larger area denotes a more valuable test. Figure 5 compares the ROC values of the algorithms used in this paper.
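The metric definitions above translate directly into a few lines of code. The sketch below uses hypothetical labels and scores; the AUC is computed through its rank interpretation (the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one), which is equivalent to the area under the ROC curve. The function names are illustrative and not taken from the paper.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, false negatives and true negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)


def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = landslide), hard predictions and classifier scores.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
```

As a consistency check on Table 1, f1_from(0.7626, 0.8628) gives approximately 0.8096, which agrees with the reported MLP F1-score of 0.8097 up to rounding of the inputs.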
From the results obtained in this paper, the Multi-Layer Perceptron model gives the highest test accuracy, i.e. 80.49%, for the dataset used compared to the other Machine Learning models used in this paper. The test accuracies obtained for the Linear Discriminant Analysis, Logistic Regression, Support Vector Machine, and Gaussian Naive Bayes models are 79.12%, 78.85%, 78.85% and 76.37%, respectively. Table 1 shows that although the training accuracies of Logistic Regression (LR) and Support Vector Machine (SVM) differ, their test accuracy, recall, precision, AUC and F1-score are the same. This is attributed to the fact that, when only binary features are used with a binary outcome, an LR model without regularisation and an SVM with a linear kernel are identical. Further, the MLP model can be used for landslide risk assessment in areas with similar environmental conditions, so that disaster management strategies can be planned accordingly. We can also conclude that deep learning models are suitable for geospatial data analysis in terms of accuracy and efficiency, as they do not require as much data engineering as Machine Learning models do. For image datasets obtained through remote sensing techniques, Convolutional Neural Networks have been used [16]. Other deep learning models, such as Recurrent Neural Networks, have also been used for geospatial data analysis tasks such as Land Use changes [17], Traffic Prediction [18], etc. It is observed that the Multi-Layer Perceptron model is well suited for smaller datasets; hence we have used it in this paper. Thus deep learning is a better alternative for carrying out geospatial data analysis.
References
1. Mohan, A., Singh, A., Kumar, B., Dwivedi, R.: Review on remote sensing methods for landslide detection using machine and deep learning. Trans. Emerg. Telecommun. Technol. 32, e3998 (2021). https://onlinelibrary.wiley.com/doi/abs/10.1002/ett.3998
2. Rabby, Y., Li, Y.: Landslide susceptibility mapping using integrated methods: a case study in the Chittagong hilly areas, Bangladesh. Geosciences 10 (2020). https://www.mdpi.com/2076-3263/10/12/483
3. Ghorbanzadeh, O., Shahabi, H., Crivellari, A., Homayouni, S., Blaschke, T., Ghamisi, P.: Landslide detection using deep learning and object-based image analysis. Landslides 19, 929–939 (2022)
4. Costanzo, D., Rotigliano, E., Irigaray, C., Jiménez-Perálvarez, J., Chacón, J.: Factors selection in landslide susceptibility modelling on large scale following the GIS matrix method: application to the river Beiro basin (Spain). Nat. Hazards Earth Syst. Sci. 12, 327–340 (2012). https://nhess.copernicus.org/articles/12/327/2012/
5. Dou, J., et al.: Assessment of advanced random forest and decision tree algorithms for modeling rainfall-induced landslide susceptibility in the Izu-Oshima Volcanic Island, Japan. Sci. Total Environ. 662, 332–346 (2019). https://www.sciencedirect.com/science/article/pii/S0048969719303055
6. Henriques, C., Zêzere, J., Marques, F.: The role of the lithological setting on the landslide pattern and distribution. Eng. Geol. 189, 17–31 (2015)
7. Dahigamuwa, T., Yu, Q., Gunaratne, M.: Feasibility study of land cover classification based on normalized difference vegetation index for landslide risk assessment. Geosciences 6, 45 (2016)
8. Ya'acob, N., Rashid, Z., Tajudin, N., Kassim, M.: Landslide possibilities using remote sensing and geographical information system (GIS). IOP Conf. Ser. Earth Environ. Sci. 540, 012084 (2020)
9. Sarwar, M., Fatima, S., Aman, M., Shoaib, A., Aslam, B., Zafar, A.: Landslide susceptibility mapping for Muzaffarabad region using machine learning algorithms (2022)
10. Sema, H., Guru, B., Veerappan, R.: Fuzzy gamma operator model for preparing landslide susceptibility zonation mapping in parts of Kohima Town, Nagaland, India. Modeling Earth Syst. Environ. 3, 499–514 (2017)
11. Kiwelekar, A.W., Mahamunkar, G.S., Netak, L.D., Nikam, V.B.: Deep learning techniques for geospatial data analysis. In: Tsihrintzis, G.A., Jain, L.C. (eds.) Machine Learning Paradigms. LAIS, vol. 18, pp. 63–81. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-49724-8_3
12. Mahamunkar, G.S., Kiwelekar, A.W., Netak, L.D.: Mapping and change detection of mangroves using remote sensing and google earth engine: a case study. In: Tuba, M., Akashe, S., Joshi, A. (eds.) ICT Systems and Sustainability. LNNS, vol. 321, pp. 187–195. Springer, Singapore (2022). https://doi.org/10.1007/978-981-16-5987-4_20
13. Mahamunkar, G.S., Netak, L.D.: Comparison of various deep CNN models for land use and land cover classification. In: Kim, J.-H., Singh, M., Khan, J., Tiwary, U.S., Sur, M., Singh, D. (eds.) IHCI 2021. LNCS, vol. 13184, pp. 499–510. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-98404-5_46
14. Mahamunkar, G., Kiwelekar, A., Netak, L.: Deep learning model for black spot classification. Int. J. Perform. Eng. 18, 222 (2022)
15. Tsangaratos, P., Ilia, I.: Chapter 24 - Applying Machine Learning Algorithms in Landslide Susceptibility Assessments. Handbook of Neural Computation, pp. 433–457 (2017). https://www.sciencedirect.com/science/article/pii/B9780128113189000247
16. Jmour, N., Zayen, S., Abdelkrim, A.: Convolutional neural networks for image classification. In: 2018 International Conference on Advanced Systems and Electric Technologies (IC ASET), pp. 397–402 (2018)
17. Cao, C., Dragićević, S., Li, S.: Short-term forecasting of land use change using recurrent neural network models. Sustainability 11, 5376 (2019)
18. Azad, A., Wang, X.: Land use change ontology and traffic prediction through recurrent neural networks: a case study in Calgary, Canada. ISPRS Int. J. Geo-Inform. 10 (2021). https://www.mdpi.com/2220-9964/10/6/358
Amenable Sparse Network Investigator
Saeed Damadi (✉), Erfan Nouri, and Hamed Pirsiavash
Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, Baltimore, MD 21250, USA
{sdamadi1,erfan1,hpirsiavash}@umbc.edu
Abstract. We present the “Amenable Sparse Network Investigator” (ASNI) algorithm, which utilizes a novel pruning strategy based on a sigmoid function that induces sparsity globally over the course of a single round of training. The ASNI algorithm fulfills both pruning tasks, whereas current state-of-the-art strategies can fulfill only one of them. The ASNI algorithm has two subalgorithms: 1) ASNI-I and 2) ASNI-II. ASNI-I learns an accurate sparse off-the-shelf network in a single round of training. ASNI-II learns a sparse network and an initialization that is quantized, compressed, and from which the sparse network is trainable. The learned initialization is quantized since only two numbers are learned for the initialization of the nonzero parameters in each layer. Thus, for a network with L layers, the number of quantization levels for the initialization of the entire network is 2L. Also, the learned initialization is compressed because it is a set consisting of 2L numbers. The special sparse network that can be trained from such a quantized and compressed initialization is called amenable. For example, in order to initialize more than 25 million parameters of an amenable ResNet-50, only 2 × 54 numbers are needed. To the best of our knowledge, there is no other algorithm that can learn a quantized and compressed initialization from which the network is still trainable and that is able to solve both pruning tasks. Our numerical experiments show that there is a quantized and compressed initialization from which the learned sparse network can be trained to reach an accuracy on a par with the dense version. This is one step toward learning an ideal network that is sparse and quantized in very few levels of quantization. We experimentally show that these 2L levels of quantization are concentration points of the parameters in each layer of the sparse network learned by ASNI-I. In other words, we show experimentally that for each layer of a deep neural network (DNN) there are two distinct normal-like distributions whose means can be used for the initialization of an amenable network. To corroborate the above, we have performed a series of experiments utilizing networks such as ResNets, VGG-style, small convolutional, and fully connected ones on the ImageNet, CIFAR-10, and MNIST datasets.
Keywords: Pruning · Initialization · Nonconvex Sparse Optimization
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): SAI 2023, LNNS 711, pp. 408–427, 2023. https://doi.org/10.1007/978-3-031-37717-4_27
Fig. 1. Pruning Tasks
1 Introduction
As indicated in Fig. 1, pruning of a DNN is done to fulfill two tasks: (1) obtaining an accurate, ready-to-use sparse network whose test accuracy is on a par with the dense version of the network, and (2) finding trainable sparse structures that can be trained in isolation and reach the test accuracy of the dense network. The goal of the first task is to provide an accurate off-the-shelf network that can be used on small devices, such as smartphones. Although algorithms that solve the first task are very mature by now, solving the second pruning task is an ongoing research topic. Solving the latter task is important because over-parameterization [4] has been an obstacle to the interpretability of the parameters in the network. Therefore, finding a sparse network that can fit the data from scratch would help to interpret the parameters of a DNN. [7] was the first work to propose an algorithm that solves the second task of pruning, i.e., the lottery ticket algorithm. Since then, the search for a sparse trainable network has attracted a lot of attention. However, the lottery ticket algorithm (LTA) requires multiple rounds of training, and it fails to find the ticket for large networks, e.g., ResNet-50. Foresight pruning, or pruning before training [26,41,43], tries to address the first issue. However, none of the foresight pruning methods performs as well as methods that obtain sparse structures using gradual pruning [11]. The stabilized LTA [9] shows that the second issue of the LTA can be solved. By using an initialization whose values differ from the original initialization, [9] shows that a large sparse structure obtained from the stabilized LTA approximately reaches the test accuracy of the dense network. The stabilized LTA uses the values of the learned parameters at the k-th iteration of the first training round. This change is equivalent to using another sparse initialization for training a sparse network. Also, it shows that there might be some other initialization from which a sparse network is trainable. Inspired by that, we learn another initialization that is quantized, compressed, and from which the sparse network is trainable. To this end, we present the “Amenable Sparse Network Investigator” (ASNI), which performs as well as the state-of-the-art algorithms in solving the first task of pruning and solves the second task of pruning by learning an amenable sparse network that is trainable from a learned quantized and compressed initialization. To the best of our knowledge, there is no other algorithm that can solve both tasks together. Also, the ASNI algorithm is the first proposed algorithm that learns a quantized and compressed initialization from which a sparse network is trainable. In summary, the following are our contributions:
– We present the ASNI algorithm (Algorithm 3) that solves both pruning tasks via two subalgorithms. In only one single round of training, ASNI-I
utilizes a novel, simple pruning strategy based on a sigmoid function that induces sparsity globally across the network and learns an accurate sparse network. Figure 3 shows how one can select the parameters of the sigmoid function properly to avoid harmful consequences of pruning during the training process.
– ASNI-II solves the second task of pruning. It takes the output of ASNI-I and learns an amenable sparse network with its corresponding quantized and compressed initialization. We experimentally show that the learned initialization is a set of averages; for each layer, the average is taken separately over the positive and the negative parameters learned by ASNI-I, i.e., lines 14 and 16 in Algorithm 3. Figure 2 shows the parameter distribution of the output of ASNI-I for ResNet-50, where each layer has two distinct normal-like distributions whose means are used as the learned initialization. This pattern repeats for all networks that we study.
– Finally, as given in Table 2, we show that the amenable sparse network equipped with the learned quantized and compressed initialization approximately achieves the test accuracy of the dense network.
We compare ASNI-I against its counterparts [6,22,46], where the last two are the state-of-the-art methods. We show numerically that ASNI-I solves the first task of pruning with higher accuracy. Also, we show that the test accuracy of a sparse amenable network learned by ASNI-I and initialized by the quantized and compressed initialization learned by ASNI-II is higher than that of its counterparts. This holds both for methods that solve the second task of pruning in one round [8,10] and for those that use foresight pruning, i.e., [26,43]. Hence, the ASNI algorithm is capable of solving both tasks of pruning.
Fig. 2. Parameter Distribution of ResNet-50 Trained on ImageNet with ASNI-I. ResNet-50 has 4 Main Stages where Stages have 3, 4, 6, and 3 Bottlenecks, Respectively. The First Bottleneck of each Stage has 3 Convolution Layers and a Skip Connection. This Picture Shows the Parameter Distribution of the First Bottleneck in the Fourth Stage at an Overall Sparsity Percentage of s = 80%. Two Short Bars Indicate c+ and c− of each Layer. These Bars are the Averages of the Positive and Negative Parameters and are used for the Initialization of the Learned Amenable Sparse Network. They are Found by ASNI-II in Lines 14 and 16 of Alg. 3. For All Stages of the Network, See the Appendix
Fig. 3. This Figure Shows Different Global Sparsity Percentages Obtained from the Sigmoid Function. The Sparsity Percentage is Applied Across the Network during E = 90 Training Epochs. The Value of γ Determines the Transition Slope from a Low Sparsity Percentage to a High One. For Large γ, e.g., 100, the Slope is Small and almost Constant during All Epochs. In this Case, the Initial Sparsity Percentage is High, which Means a Lot of Pruning takes Place at the Beginning. For Small γ, e.g., 1, the Transition Slope is very Sharp, which Means that a Lot of Pruning Happens within a Few Epochs of Training. The Best One, with β = 0.5 and γ = E/10 = 9, Allows the Network to Learn with Full Capacity in the Early Epochs, i.e., up to e = 10. Then, the Network Sparsity Starts Increasing with a very Small Slope to make Sure that Learning of the Sparse Structure is done Properly, i.e., e = 10–30. After Learning the Structure of the Network, Pruning Increases with a Constant Slope, i.e., e = 30–60. Then, the Pruning Rate Decreases very Rapidly to Allow the Network to Heal from Pruning, i.e., e = 60–80. Finally, to Fine-Tune the Surviving Parameters, Additional Pruning is Almost Zero from e = 80–90
2 Related Work
In this section we go over related work that tries to solve the first and second tasks of pruning, as indicated in Fig. 1. We focus on the second task since it is newer than the first one.
2.1 First Task of Pruning
Work on the first task of pruning started more than 30 years ago [16,17,25]. For deep neural networks, [15] initially showed that it is possible to achieve off-the-shelf sparse networks simply by magnitude-wise pruning. Since then, many works have built on this concept [5,13,14,27–33,42,44]. The current approach for solving this task is to learn an accurate sparse network in one single round of training. [46] was the first work that tried to learn an accurate sparse network in this way. The current state-of-the-art methods [6,22] solve this task via one single round of training. The former starts from a dense network, whereas the latter starts from a sparse network that already has the desired sparsity and finds an accurate sparse network.
2.2 Second Task of Pruning
Solving the second task of pruning means finding a sparse trainable network. To solve it, one needs to find a mask. This mask may be found in multiple rounds of training or without training. We will go over both cases.
Multiple Rounds of Training. To solve the second task, [7] proposes a practical way of finding a sparse trainable network. The so-called lottery ticket algorithm learns a mask in multiple rounds of training and then shows that the associated sparse network is trainable from a specific sparse initialization. This initialization is determined using the learned mask and the original initialization. However, the LTA is not able to find tickets for large networks like ResNet-50. The problem is that the sparse network cannot be trained from the sparse initialization. Hence, the stabilized LTA [9] uses the parameters of the k-th step of the first training round as the initialization to approximately reach the test accuracy of the dense network. Rewinding to an intermediate step k in order to obtain a sparse trainable network has also been observed by [45]. The necessity of rewinding to the k-th step utilized by [9,45] corroborates the observation in [1] that “in the early stages of training, important connectivity patterns among parameters of layers are discovered”. Also, it has been observed that the learned connectivity patterns stay relatively fixed in the later steps of training. The LTA has an uncomplicated strategy to force more sparsity into the network. It applies a specific sparsity percentage at the end of each round of training. Although it is very easy to implement, it may not be an optimal strategy for reducing the number of parameters. The pruning strategy of the LTA can be improved: [39] uses a continuous strategy to remove parameters at each round of training. However, this is done at the expense of doubling the number of parameters.
Foresight Pruning: Zero Round of Training. [7,9,39] all search for a sparse structure utilizing multiple rounds of training, which is very costly. Ideally, one would find a sparse structure even before training. This is called foresight pruning or pruning before training. To this end, [26] (SNIP) was the first work that posited the idea of foresight pruning, which determines a mask before training. In this approach, a connection sensitivity score is calculated before training and parameters are removed based on the score vector. Others have tried to find different scoring vectors. For example, [43] introduces GraSP, which utilizes the Hessian-gradient product to find a score vector. As opposed to the previously proposed methods, [41] finds a sparse structure using a different scoring in n rounds of pruning.
Multiple- and Zero-Round Are Extreme. It is true that finding the sparse structure before training is the ideal approach and may have a negligible computational cost to find a mask, but the performance of the methods that apply pruning before training is not competitive with the performance of networks whose
initialization is obtained using pruning far later in training [10,11]. To avoid both the computation of multi-round training and the loss of accuracy in foresight pruning, [10] uses a magnitude-based pruning approach that learns a sparse structure in a single round. Similarly, [8] uses one round plus some k steps to find the new set of initialization. In line with these last two methods, our approach learns a sparse structure in one single round of training.
3 Finding a Sparse Trainable Network
In this section we first define the exact optimization problem that solves the first task of pruning. Then, we elaborate on how changing initialization results in solving the second task of pruning.
3.1 Problem Explanation
Finding a sparse network whose accuracy is on a par with a dense network amounts to solving a bi-level, constrained, stochastic, nonconvex, and nonsmooth sparse optimization problem as follows:
$$\hat{\theta}^{*} = \arg\min_{\hat{\theta}} \|\hat{\theta}\|_0 \quad \text{s.t.} \quad \hat{\theta} = \arg\min_{\theta} \mathbb{E}_{x \sim \mathcal{D}}\, f\big(y(x), h(x; \theta)\big) \tag{1}$$
where $h(x; \theta)$ is the vector-valued neural network function whose input is a random vector $x \sim \mathcal{D}$ labeled by the random vector $y(x)$, and its parameters are denoted by $\theta \in \mathbb{R}^d$. The scalar-valued function $f$ is the cost or loss function, also known as the criterion, and $\|\theta\|_0$ is the number of nonzero elements of $\theta$, or the $\ell_0$ norm¹. In Problem (1), having two optimizations simultaneously makes it bi-level; $\mathcal{D}$ is the source of stochasticity and is unknown; the composition of $f$ and $h$ makes the objective function nonconvex; and since the $\ell_0$ norm is not differentiable, one deals with a nonsmooth problem.
¹ $\ell_0$ is not mathematically a norm because for any norm $\|\cdot\|$ and $\alpha \in \mathbb{R}$, $\|\alpha\theta\| = |\alpha|\,\|\theta\|$, while $\|\alpha\theta\|_0 = |\alpha|\,\|\theta\|_0$ if and only if $|\alpha| = 1$.
If one can solve Problem (1), energy consumption is reduced, hardware requirements are relaxed, and inference becomes faster. Unfortunately, even a deterministic and convex sparse optimization problem, e.g., a least-squares problem, is combinatorial and NP-hard [3,34]. Therefore, the best approach is to convert the current bi-level optimization problem into another optimization problem that is neither bi-level, stochastic, nor nonsmooth. Solving the new optimization problem will find an approximate solution to the original Problem (1), i.e., $\hat{\theta}^{*}$. However, the unanswered question is how to convert Problem (1) into a solvable approximate problem. By trial and error, one can find an upper bound for the sparsity of the parameter vector, i.e., $\|\hat{\theta}^{*}\|_0 \le \hat{s}$. Given the sparsity level $\hat{s}$, we get the following optimization problem:
$$(\tilde{\theta}^{*} \odot \tilde{m}^{*}) = \arg\min_{\theta,\, m} \mathbb{E}_{x \sim \mathcal{D}}\, f\big(y(x), h(x; \theta \odot m)\big) \quad \text{s.t.} \quad \|m\|_0 \le \hat{s} \tag{2}$$
where $m \in \{0, 1\}^d$ is a binary mask and $\odot$ denotes the Hadamard product operator. Unlike Problem (1), where a sparsity level $s^{*}$ is found automatically, here the sparsity level $\hat{s}$ is given. Because of this, the solutions to Problems (1) and (2), namely $\hat{\theta}^{*}$ and $\tilde{\theta}^{*} \odot \tilde{m}^{*}$, may not be the same. Having an accurate estimate $\hat{s} \approx s^{*}$ relaxes Problem (1) from a bi-level optimization problem to a single-level problem. However, all the other issues mentioned above remain in Problem (2). From the perspective of optimization, the first question is whether this problem is feasible. Empirically, [7] addressed this problem for the first time, conjectured that such a solution exists, and named the conjecture the lottery ticket hypothesis. Mathematically speaking, the lottery ticket hypothesis conjectures that Problem (2) for $\hat{s} \approx s^{*}$ is feasible and has a solution. Once we know that a solution exists, the most straightforward approach to find an approximate solution to Problem (2) is to first find a mask $\hat{m}$ that is a good estimate of $\tilde{m}^{*}$. To estimate $\hat{m}$, there are two extreme approaches. One approach finds $\hat{m}$ using multiple rounds of training [7,9,39], and the other finds it before training [26,41,43]. As opposed to the latter, the former reaches the test accuracy of the dense network [10,11], but it is computationally expensive. Once an accurate mask, i.e., $\hat{m} \approx \tilde{m}^{*}$, is at hand, the nonsmoothness and the constraint of Problem (2) can be relaxed. Also, by assuming a large sample size and using a stochastic approximation of the expected value in the objective function [12], one can use the following unconstrained optimization problem as a relaxation of Problem (2):
$$(\hat{\theta}^{*} \odot \hat{m}) = \arg\min_{\theta} R(X; \theta \odot \hat{m}) \tag{3}$$
where $R(X; \theta \odot \hat{m})$ is the approximation of the expected value,
$$R(X; \theta \odot \hat{m}) = M^{-1} \sum_{i=1}^{M} f\big(y(x^{(i)}), h(x^{(i)}; \theta \odot \hat{m})\big),$$
$X$ is the data matrix $[x^{(1)}, \ldots, x^{(M)}]$, $x^{(i)}$ for $i = 1, \ldots, M$ is a realization of $x \sim \mathcal{D}$, and $y(x^{(i)})$ is the realized target associated with $x^{(i)}$.
3.2 Initialization of the Stabilized LTA vs. LTA
Given a mask $\hat{m}$, every initialization from which Problem (3) can be solved iteratively is an acceptable initialization. The LTA in Algorithm 1 proposes an initialization that is based on the original initialization. On the other hand, the stabilized LTA in Algorithm 2 uses the learned parameters of the k-th step of training as the initialization. Our algorithm will learn another acceptable initialization using ASNI-II.

Algorithm 1. The LTA [7]
Require: $\theta_0$, $\hat{m} = \mathbf{1}$, $T$, $p\%$, $r$
1: for 1 to $r$ do
2:   Optimize $R(X; \theta_0 \odot \hat{m})$ for $T$ steps
3:   Prune the nonzero entries of $|\theta_T|$ below their $p$-th percentile and update $\hat{m}$
4:   Rewind the nonzero values of $\theta_T$ to $\theta_0$ and update $\theta_0$
5: end for

Algorithm 2. The stabilized LTA [9]
Require: $\theta_0$, $\hat{m} = \mathbf{1}$, $T$, $p\%$, $k$, $r$
1: for $j = 1$ to $r$ do
2:   Optimize $R(X; \theta_0 \odot \hat{m})$ for $T$ steps
3:   if $j = 1$ then save $\tilde{\theta}_0 \leftarrow \theta_k$
4:   end if
5:   Prune the nonzero entries of $|\theta_T|$ below their $p$-th percentile and update $\hat{m}$
6:   Rewind the nonzero values of $\theta_T$ to $\tilde{\theta}_0$ and create $\theta_0$
7: end for

To elaborate on the importance of the initialization, notice that the LTA proposes Algorithm 1, which solves Problem (3) for r rounds, where $\hat{m}$ is assumed to be given in each round. It starts from a dense random initialization and a mask whose elements are all one, i.e., $\hat{m} = \mathbf{1}$. Then, it trains the network for T steps. At the end of training, p% of the nonzero parameters are zeroed out and the mask is updated. Finally, the nonzero parameters are rewound to their associated entries in the original initialization to update the initialization for the next round of training. The issue with Algorithm 1 is that it fails to find the winning ticket for large networks such as ResNet-50. This issue is solved by the stabilized LTA given in Algorithm 2. As one can see in Algorithm 1, the LTA rewinds the learned parameters to the original initialization, while the stabilized LTA rewinds the learned parameters to the k-th step of the first round of training. This small change stabilizes the LTA and makes it possible to find winning tickets for large networks. As we will see later, ASNI-II rewinds the nonzero parameters to the average of the parameters learned by ASNI-I. This rewind is done so that the sign of each element of the learned initialization agrees with the sign of the corresponding parameter learned by ASNI-I.
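The three rewinding rules compared in this subsection can be contrasted on a toy example. The following is a minimal sketch for a single layer stored as a flat Python list; the function names and all numbers are hypothetical and chosen only for illustration, and the ASNI-II rule is written from the description above (sign-wise averages of the learned parameters), not taken from the authors' implementation.

```python
def lta_rewind(theta_0, mask):
    """LTA (Algorithm 1, step 4): survivors return to the original initialization."""
    return [w0 * m for w0, m in zip(theta_0, mask)]


def stabilized_lta_rewind(theta_k, mask):
    """Stabilized LTA (Algorithm 2, step 6): survivors return to their values at step k."""
    return [wk * m for wk, m in zip(theta_k, mask)]


def asni2_rewind(theta_star):
    """ASNI-II-style rewind for one layer: every nonzero learned parameter is
    replaced by the mean of the learned parameters that share its sign."""
    pos = [w for w in theta_star if w > 0]
    neg = [w for w in theta_star if w < 0]
    c_pos = sum(pos) / len(pos) if pos else 0.0
    c_neg = sum(neg) / len(neg) if neg else 0.0
    init = [c_pos if w > 0 else c_neg if w < 0 else 0.0 for w in theta_star]
    return init, (c_pos, c_neg)


# Toy single-layer example (all numbers are made up for illustration).
theta_0 = [0.5, -1.0, 0.25, 2.0, -0.5, 1.5]      # original initialization
theta_k = [0.75, -0.5, 0.5, 1.0, -0.25, 1.0]     # parameters at an early step k
theta_star = [0.25, 0.0, -0.5, 0.75, 0.0, -1.5]  # sparse parameters after training
mask = [1 if w != 0 else 0 for w in theta_star]  # learned binary mask

print("LTA:           ", lta_rewind(theta_0, mask))
print("stabilized LTA:", stabilized_lta_rewind(theta_k, mask))
print("ASNI-II:       ", asni2_rewind(theta_star))
```

In this sketch the first two rules need a stored copy of a full parameter vector ($\theta_0$ or $\theta_k$), whereas the third needs only the mask and two numbers per layer, which is what makes the learned initialization quantized and compressed.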
4 Method
In this section we introduce the ASNI algorithm and explain how one can set its parameters.
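Algorithm 3. The ASNI
Require: Initial parameter vector $\theta_0$, training data $X_{tr}$, epochs $E$, optimizer's parameters, e.g., initial learning rate $\eta = \eta_0$, cosine scale $\delta$, mini-batch size $b$, initial mask vector $\hat{m} = \mathbf{1}$, sigmoid's parameters $\alpha$, $\beta$, $\gamma$.
1: ASNI-I:
2: for $e = 1$ to $E$ do
3:   for $k = 1$ to $T$ do
4:     $\theta_k \leftarrow \theta_{k-1} - \eta\, \nabla R(X_{tr,b}; \theta_{k-1}) \odot \hat{m}$
5:   end for
6:   $p = \alpha\, \mathrm{sigmoid}((e - \beta E)/\gamma)$
7:   $\tau_g$ = $p$-th percentile of $\{|\theta|\}$
8:   $\hat{m} = \mathbb{1}_{|s| \ge \tau_g}(|s|)$
9:   $\theta_0 \leftarrow \theta_T \odot \hat{m}$
10: end for
11: $\hat{\theta}^{*} \leftarrow \theta_0$
12: ASNI-II:
13: for $l = 1$ to $L$ do
14:   $\bar{c}_{+}^{[l]} = \mathrm{mean}(\hat{\theta}^{*[l]} > 0)$
15:   $\theta_{0+}^{[l]} = \bar{c}_{+}^{[l]}\, \mathbb{1}_{\hat{\theta}^{*[l]} > 0}(\hat{\theta}^{*[l]})$
16:   $\bar{c}_{-}^{[l]} = \mathrm{mean}(\hat{\theta}^{*[l]} < 0)$
17:   $\theta_{0-}^{[l]} = \bar{c}_{-}^{[l]}\, \mathbb{1}_{\hat{\theta}^{*[l]} < 0}(\hat{\theta}^{*[l]})$
18: end for

This global pruning threshold is obtained after each epoch of training. The global threshold is calculated by gathering magnitudes of all parameters except the bias and batch normalization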
parameters. We do not include the bias and batch normalization parameters since the number of these parameters is negligible. The global threshold, i.e., $\tau_g$, is found such that the magnitudes of p% of all gathered parameters are less than $\tau_g$, and (100 − p)% are above the global threshold. Next, in line 8², ASNI-I zeros out the entries of the mask vector for those parameters whose magnitudes are less than $\tau_g$. This is where the mask vector $\hat{m}$ gets updated. Once ASNI-I updates the mask, training restarts for another epoch. As the last epoch finishes, the learned vector of parameters is the learned sparse vector, i.e., $\hat{\theta}^{*}$. This sparse vector together with its mask is an approximate solution to Problem (2). When $\hat{\theta}^{*}$ is found, by following lines 13–18, the centroids $(\bar{c}_{+}^{[l]}, \bar{c}_{-}^{[l]})$ are calculated to learn the quantized and compressed initialization. Also, the mask $\hat{m}$ learned by ASNI-I identifies the sparse amenable network. Note that the initialization for the batch normalization weights is one and the initialization for all biases is zero.
4.2 ASNI Parameters
The ASNI algorithm follows a simple and intuitive strategy for determining the global sparsity percentage for each epoch. In line 6, ASNI-I determines the sparsity percentage by utilizing a sigmoid function as $p = \alpha\,\mathrm{sigmoid}((e - \beta E)/\gamma)$ for $e = 1, \ldots, E$, where E is the total number of epochs, α controls the final sparsity, β governs how early pruning starts and how late it stops, and γ controls how fast pruning is done. Although the sigmoid function has three parameters, once two of them are fixed, the last one is determined by the desired final sparsity. Thus, we only need to search for two parameters. We explain how to choose these parameters in the order of their importance, starting with the most important one, β.
How to Choose β? To apply the ASNI-I algorithm, we set β to 0.5 for all experiments. The value of β shifts the position of the inflection point of the sigmoid curve to the left or right. In Fig. 3, the inflection point is at the 45th epoch. Up to that point, the slope of the sparsity percentage curve increases, while after the 45th epoch the slope decreases; the sparsity percentage itself keeps growing throughout. Setting β to 0.5 creates a curve that is symmetric about the inflection point of the sigmoid function. Generally, every sparsity curve like the ones in Fig. 3 has three phases: 1) small pruning associated with learning the structure, 2) pruning and learning, 3) small pruning for healing from pruning. Any value other than 0.5 will be biased towards either phase 1 or phase 3.
How to Choose γ? The value of γ determines the transition slope from a low sparsity percentage to a high one. The lower γ, the steeper the transition from a mild pruning strategy to an aggressive one. For a small γ, e.g., 1, the slope in Fig. 3 is very sharp and so many parameters would be pruned in a very small
² $\mathbb{1}_A(x) = 1$ if $x \in A$, and $\mathbb{1}_A(x) = 0$ if $x \notin A$.
Table 1. Datasets and Networks Combination Together with their Hyper Parameters
Comb  Dataset   Network    Params      E    B    LR/WD      Iter.
1     MNIST     FC         266,610     50   60   1.2e-3     1000
2     MNIST     Conv2      3,317,450   20   60   2e-4       1000
3     MNIST     Conv4      1,933,258   25   60   3e-4       1000
4     MNIST     Conv6      1,802,698   30   60   3e-4       1000
5     CIFAR-10  Conv2      4,301,642   20   60   2e-4       1000
6     CIFAR-10  Conv4      2,425,930   25   60   3e-4       1000
7     CIFAR-10  Conv6      2,262,602   30   60   3e-4       1000
8     CIFAR-10  VGG-11     9,231,114   160  128  0.05/5e-4  391
9     CIFAR-10  VGG-13     9,416,010   160  128  0.05/5e-4  391
10    CIFAR-10  VGG-16     14,728,266  160  128  0.05/5e-4  391
11    CIFAR-10  ResNet-18  11,181,642  160  128  0.08/5e-4  391
12    ImageNet  ResNet-50  25,557,032  90   820  0.35/1e-4  6252
number of epochs. On the other hand, for a large γ, e.g., 100, the slope in Fig. 3 is small and the curve becomes a line starting from high sparsity percentages. This means that a lot of pruning takes place at the beginning. According to our experiments, γ = E/10 accompanied by β = 0.5 provides the best result for all networks we experimented with.
How to Choose α? As explained above, given two parameters, the third one is determined. Therefore, once β and γ are fixed, the value of α is determined from the desired sparsity percentage.
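To make the role of the three sigmoid parameters concrete, the following is a minimal sketch of the schedule $p = \alpha\,\mathrm{sigmoid}((e - \beta E)/\gamma)$ together with one global thresholding step in the spirit of lines 7 and 8 of Algorithm 3. It uses only the Python standard library; the function names, the 80% target, and the toy parameter values are hypothetical, and the percentile is implemented as a simple rank cut, which is one possible reading of "p-th percentile" and not necessarily the authors' choice.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sparsity_percent(e, E, alpha, beta=0.5, gamma=None):
    """Global sparsity percentage p after epoch e (1-based); gamma defaults to E/10."""
    gamma = E / 10.0 if gamma is None else gamma
    return alpha * sigmoid((e - beta * E) / gamma)


def alpha_for_target(target, E, beta=0.5, gamma=None):
    """Pick alpha so that the schedule reaches `target` percent at the last epoch e = E."""
    gamma = E / 10.0 if gamma is None else gamma
    return target / sigmoid((E - beta * E) / gamma)


def global_mask(layers, p):
    """Keep (mask = 1) the parameters whose magnitude is at least the global
    threshold tau_g; roughly p percent of all gathered magnitudes fall below it."""
    mags = sorted(abs(w) for layer in layers for w in layer)
    n_prune = int(len(mags) * p / 100.0)
    if n_prune >= len(mags):
        return [[0] * len(layer) for layer in layers]
    tau_g = mags[n_prune]  # smallest magnitude that survives
    return [[1 if abs(w) >= tau_g else 0 for w in layer] for layer in layers]


# Schedule for E = 90 epochs with beta = 0.5 and gamma = E/10 (hypothetical 80% target).
E = 90
alpha = alpha_for_target(80.0, E)
for e in (1, 10, 30, 45, 60, 80, 90):
    print(f"epoch {e:2d}: p = {sparsity_percent(e, E, alpha):6.2f}%")

# One global pruning step on two toy layers at p = 40%.
layers = [[0.5, -0.1, 0.05], [-2.0, 0.3, -0.01, 1.0, 0.2, -0.7, 0.02]]
print(global_mask(layers, 40.0))
```

Because already pruned parameters stay at zero, they remain among the smallest magnitudes in later epochs, so applying the growing percentage p to all gathered magnitudes keeps the sparsity cumulative in this sketch.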
5 Experiments Setup
This section explains the setup of our experiments for training. Table 1 summarizes experiments (dataset and network combinations) and their hyper parameters including number of trainable parameters, number of epochs for training (E), mini-batch size (B), the initial learning rate together with weight decay (LR/WD), and number of iterations to complete an epoch (Iter.). Also, we will elaborate on the learning rate policy and the optimizer choice in the following subsections.
Fig. 4. Parameter Distribution Considering All Parameters in ResNet-50 Network Trained on ImageNet-1K. The Orange Plot Shows Distribution of the Dense Network Initialization. The Red Plot is the Distribution of the Learned Parameters when the Network is Dense. The Blue One is the Distribution of Parameters Learned by ASNI-I at s ≈ 80%. The Green one is the Distribution of Parameters for the Learned Sparse Amenable Network (s ≈ 80%) Initialized by Centroids Obtained from ASNI-I
5.1 Datasets
Our experiments involve image classification on well-known datasets including MNIST [23], CIFAR-10 [21], and ImageNet-1K [38]. For ImageNet-1K, our training pipeline uses standard data augmentation, including random flips and crops.
5.2 Networks
The network architectures that we utilize include a 3-layer fully-connected network (LeNet-300-100) [24], named FC, and convolutional neural networks (CNNs) named Conv-2, Conv-4, and Conv-6 (small CNNs with 2/4/6 convolutional layers, the same as in [7]). We also use ResNet-18/50 [19]. Additionally, we utilize VGG-style networks [40], named VGG-11/13/16, with batch normalization and an average pooling after the convolution layers followed by a fully connected layer. As a result of the average pooling, the parameter count of the VGG-style networks decreases, which mitigates the parameter inefficiency of the original VGG networks.
5.3 Optimizer
We use stochastic gradient descent [37] with momentum [35] (SGD+M) or Adam [20] as our optimizers. The Adam optimizer is used for experiments involving small networks such as FC and Conv2/4/6. The SGD+M optimizer is used for the VGG-style and ResNet networks. We set the momentum coefficient to 0.9 for all experiments in which SGD+M is used.
5.4 Learning Rate Policy
Three learning rate policies are used for different experiments: 1) constant, 2) cosine, 3) cosine with warm-up.
Constant Learning Rate. For those experiments in which we use the Adam optimizer, the learning rate is constant throughout the training.
Cosine Policy Learning Rate. For cases where SGD+M is used, the learning rate follows a cosine policy. This cosine policy reduces the learning rate following a cosine function that has three parameters, i.e., $\eta = \eta_0 \cos\big(\pi e / ((1 + \delta)E)\big)$, where $\eta_0$ is the initial learning rate, E is the total number of epochs, and a nonzero δ controls the final learning rate. Parameter δ is set to 0.05 for the VGG-style and ResNet-18 networks, and it is set to 0.04 for training ResNet-50. Only for training ResNet-50 do we use 10 epochs to warm up the learning rate. In the warm-up case, the learning rate increases linearly in 10 steps so that it reaches its highest value after the 10th epoch. Then, it starts decaying following the above cosine function.
5.5 Parameters Initialization
Network parameters are initialized according to the Kaiming Normal distribution [18].
5.6 Deep Network Framework
For all experiments we use PyTorch [36] and its native automatic mixed precision (AMP) library to boost the speed of training. Networks are trained using NVIDIA TITAN X (Pascal) 12 GB GPUs.
6 Results
In this section we go over different aspects of our experiments, such as test accuracy and the network- and layer-wise parameter distributions of the sparse network learned by ASNI-I, and we compare ASNI-I and the initialization learned by ASNI-II with their counterparts.
6.1 The Overall Network Parameters Distribution
The ASNI-I algorithm uses a global threshold that considers the magnitude of every parameter across the network. Therefore, it makes sense to look at the distribution of all parameters at once and not in a layer-wise fashion. Figure 4 shows the network-wide distribution for ResNet-50 trained on ImageNet-1K. As Fig. 4 shows, the initial parameters (orange distribution) have the largest variance. On the other hand, the distribution of the learned parameters of the dense network (red distribution) has the smallest variance among all four distributions. The distribution of parameters learned by ASNI-I (blue one) consists of two normal-like distributions where small values (in the absolute value sense) have been discarded. Discarding small values is what ASNI-I enforces, but observing two normal-like distributions
Table 2. This Table Shows Hyper Parameters for Pruning, Final Sparsity, and Top-1 Test Accuracy of Four Variants. Accuracy Percentages are the Average of 5 Experiments. (T1-D): Dense Network with Zero Sparsity, (T1-A-I): Sparse Network Learned by ASNI-I, (T1-A-II): the Sparse Amenable Network Initialized by the Quantized and Compressed Initialization Learned by ASNI-II, (T1-S): the Sparse Amenable Network Initialized by the Original Initialization
Comb  α      γ   s%     Nonzeros   T1-D%  T1-A-I%  T1-A-II%  T1-S%
1     98     5   96.87  8,335      96.88  96.72    96.93     96.75
2     99.2   2   98.18  86,363     98.09  98.12    98.14     98.03
3     98.5   2   97.94  39,828     98.33  98.46    98.52     98.27
4     98.5   3   97.15  51,420     98.25  98.54    98.53     98.36
5     98.5   2   96.71  141,364    76.10  74.61    76.4      75.26
6     95     2   94.47  134,185    84.34  84       83.76     83.24
7     94     3   92.72  164,624    86.69  86.1     85.72     85.55
8     97     16  96.14  356,230    92.40  91.51    91.18     89.94
9     98     16  97.14  269,221    94.23  92.98    92.29     91.67
10    99     16  98.15  272,626    93.90  93.45    92.13     91.94
11    97     16  96.14  451,216    90.11  88.93    88.52     87.95
12    81.04  10  80.12  5,080,737  76.08  75.23    75.17     70.91
12    91.05  10  90.08  2,535,257  76.08  74.28    73.97     68.41
is surprising. The more surprising phenomenon is the parameter distribution of the amenable sparse network initialized by ASNI-II (green one). This distribution covers the distribution of parameters learned by ASNI-I and fills the gap between them. This means that the learned sparse amenable structure initialized by centroids is also able to learn the learned parameters by ASNI-I because they have the same parameter distribution. Compare to the layer distribution of parameters learned ASNI-I in Fig. 2, one can observe that these normal-like distributions also exist for each layer. That motivated us to use the averages of positive and negative values to initialize the amenable sparse network. 6.2
Test Accuracy
Table 2 shows the hyperparameters for pruning and the test accuracy of four different variants. These four variants are: 1) the dense network, 2) the sparse network learned by ASNI-I, 3) the amenable sparse network initialized by the initialization from ASNI-II, and 4) the amenable sparse network initialized by the original initialization. There are two observations in Table 2. First, the accuracy of the dense network is not always the highest, and sometimes the network learned by ASNI-I performs better even though it has far fewer parameters than the dense one. Second, networks initialized by the original initialization cannot reach the test accuracy of the dense network. They also fall short of the accuracy of the sparse amenable network initialized by the initialization learned by ASNI-II.
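As a reading aid for Table 2, the sparsity and nonzero columns are tied together by s = 1 − nonzeros/total. The short sketch below is an illustration only and not part of the ASNI code; the function name `implied_dense_parameters` is an assumption made for this example. It inverts that relation for the two ResNet-50 rows to estimate the dense parameter count they imply.

```python
def implied_dense_parameters(nonzeros: int, sparsity_percent: float) -> float:
    """Estimate the dense parameter count from a nonzero count and a sparsity percentage."""
    kept_fraction = 1.0 - sparsity_percent / 100.0
    if not 0.0 < kept_fraction <= 1.0:
        raise ValueError("sparsity must lie in [0, 100)")
    return nonzeros / kept_fraction


# The two ResNet-50 rows of Table 2: (final sparsity s%, nonzeros).
resnet50_rows = [(80.12, 5_080_737), (90.08, 2_535_257)]

for sparsity, nonzeros in resnet50_rows:
    total = implied_dense_parameters(nonzeros, sparsity)
    print(f"s = {sparsity:5.2f}%  nonzeros = {nonzeros:>9,}  implied dense size ~ {total / 1e6:.2f}M")
```

Both rows point to roughly the same dense size, which gives a sense of how many parameters the sparse variants do without while staying close to the dense accuracy.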
Fig. 5. Layer-Wise Sparsity after Training a ResNet-50 Network. The Network has 4 Main Stages where Stages have 3, 4, 6, and 3 Bottlenecks Respectively. The Stages are Color-Coded with Blue (2–11), Red (12–24), Green (25–43), and Pink (44–53). Black Bars are the Sparsity Percentages of the First Convolution Layer (not in Stages) and the Fully Connected Layer at the End. The First Convolution Layer in each Stage, Located in the First Bottleneck of that Stage, is Hatched with Lines. Skip Connections of each Stage are Hatched with Stars. The Least Sparsity Occurs at the Last Layer (54), which is a Fully Connected one. Also, the First Convolution Layers (2, 12, 25, 44) in the First Bottleneck of each Stage are the Ones with the Smallest Sparsity Compared to Other Convolutional Layers in Stages
6.3
Layer-Wise Sparsity Distribution
ASNI-I prunes the parameters of the network to reach a predefined sparsity percentage for the entire network. Once this sparsity is reached, the question is which layers have been pruned more and which layers are denser than others. As ASNI-I does not enforce any constraint on the sparsity of individual layers, each layer can be pruned differently. According to Fig. 5, layers are pruned non-uniformly. Among all layers, the last layer (black) is the densest, with the least pruning. The second densest is the very first convolutional layer. Another interesting observation is that, within each stage, the first convolutional layer in the first bottleneck is the layer with the least pruning. To see this, note that the stages are color-coded with blue (2–11), red (12–24), green (25–43), and pink (44–53) in Fig. 5.
6.4
ASNI Performance Vs Its Counterparts
The ASNI algorithm solves both tasks of pruning in Fig. 1 simultaneously. The sub-algorithm ASNI-I solves the first task and finds an off-the-shelf accurate sparse network in one round for ResNet-50 trained on ImageNet-1K. We compare ASNI-I with [6,22,46] in Table 3. These methods also try to solve the first task in one round of pruning. For the second task of pruning, in Table 4, we compare the accuracy of the foresight pruning methods in [26,43] and of methods that utilize magnitude pruning, like [8,10], for solving the first task in one single round.

Table 3. First Task of Pruning at s ≈ 80% (comb. 12)

Method                                   Nonzeros   Top-1
Gradual Magnitude Pruning, [46]          5,120,000  74.68
Soft Threshold Reparameterization, [22]  5,120,000  74.87
Rigging the Lottery, [6]                 5,120,000  74.55
ASNI-I                                   5,080,737  75.23
Table 4. Second Task of Pruning at s ≈ 80% (comb. 12)

Method                                          Nonzeros   Top-1
Iterative magnitude pruning, [8]                5,120,000  73.33
Magnitude pruning after training, [10]          5,120,000  75.12
Single-shot network pruning, [26]               5,120,000  69.29
Gradient Signal Preservation, [43]              5,120,000  71.56
Amenable Sparse network initialized by ASNI-II  5,080,737  75.17
Fig. 6. Appendix: ResNet50-ImageNet-1K: Parameter Distribution of each Layer
Fig. 7. Appendix: ResNet50-ImageNet-1K: Parameter Distribution of each Layer
7
Conclusion and Discussion
We proposed the ASNI algorithm, which solves the two tasks of pruning simultaneously using two sub-algorithms. To solve the first task, ASNI-I learns an accurate sparse network in one round. This learned sparse network is amenable because it reaches the test accuracy of the dense network starting from a quantized and compressed initialization. The ASNI algorithm owes its success to a simple pruning strategy that uses a sigmoid function to manage the sparsity budget throughout the training process. By choosing the parameters of the sigmoid function properly, ASNI symmetrically reduces pruning at the beginning and end of the pruning process. As future work, we will work on the mini-batch Iterative Hard Thresholding algorithm [2] for results that are backed by stochastic optimization theory.
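The two ingredients named above, a sigmoid-shaped sparsity budget and a single global magnitude threshold, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the sigmoid parameterisation (`steepness`, a midpoint at half of training) and the function names are hypothetical, and the toy layers are random arrays that are pruned from the same starting weights at every epoch instead of being trained.

```python
import numpy as np


def sigmoid_sparsity(epoch, total_epochs, final_sparsity, steepness=10.0):
    """Target sparsity at `epoch`, following a sigmoid centred on the middle of training.

    Hypothetical parameterisation: the paper's exact formula is not reproduced here.
    """
    midpoint = total_epochs / 2.0
    z = steepness * (epoch - midpoint) / total_epochs
    return final_sparsity / (1.0 + np.exp(-z))


def global_magnitude_prune(layers, sparsity):
    """Zero the smallest-magnitude entries across all layers using one global threshold."""
    magnitudes = np.concatenate([np.abs(w).ravel() for w in layers.values()])
    k = int(sparsity * magnitudes.size)  # number of parameters to remove
    if k == 0:
        return {name: w.copy() for name, w in layers.items()}
    threshold = np.partition(magnitudes, k - 1)[k - 1]  # k-th smallest magnitude
    return {name: np.where(np.abs(w) > threshold, w, 0.0) for name, w in layers.items()}


# Toy "network": three layers with different weight scales (made-up shapes and scales).
rng = np.random.default_rng(0)
layers = {
    "conv_first": rng.normal(0.0, 0.20, size=(64, 27)),
    "conv_middle": rng.normal(0.0, 0.05, size=(256, 256)),
    "fc_last": rng.normal(0.0, 0.10, size=(10, 256)),
}

total_epochs = 20
for epoch in (0, 5, 10, 15, 20):
    target = sigmoid_sparsity(epoch, total_epochs, final_sparsity=0.8)
    pruned = global_magnitude_prune(layers, target)
    per_layer = {name: round(float(np.mean(w == 0.0)), 3) for name, w in pruned.items()}
    print(f"epoch {epoch:2d}  target sparsity {target:.3f}  layer-wise sparsity {per_layer}")
```

Because the budget follows a sigmoid, the increase in sparsity per epoch is small at the start and at the end and largest in the middle. Because the threshold is global, layers whose weights are smaller in magnitude end up sparser than the others, which is one way non-uniform layer-wise sparsity can arise.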
References
1. Achille, A., Rovere, M., Soatto, S.: Critical learning periods in deep networks. In: International Conference on Learning Representations (2018)
2. Damadi, S., Shen, J.: Convergence of the mini-batch SIHT algorithm. arXiv preprint arXiv:2209.14536 (2022)
3. Davis, G., Mallat, S.: Adaptive nonlinear approximations. PhD thesis, New York University, Graduate School of Arts and Science (1994)
4. Denil, M., Shakibi, B., Dinh, L., Ranzato, M., De Freitas, N.: Predicting parameters in deep learning. In: Advances in Neural Information Processing Systems, pp. 2148–2156 (2013)
5. Dettmers, T., Zettlemoyer, L.: Sparse networks from scratch: faster training without losing performance. arXiv preprint arXiv:1907.04840 (2019)
6. Evci, U., Gale, T., Menick, J., Castro, P.S., Elsen, E.: Rigging the lottery: making all tickets winners. In: International Conference on Machine Learning, pp. 2943–2952. PMLR (2020)
7. Frankle, J., Carbin, M.: The lottery ticket hypothesis: finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635 (2018)
8. Frankle, J., Dziugaite, G.K., Roy, D., Carbin, M.: Linear mode connectivity and the lottery ticket hypothesis. In: International Conference on Machine Learning, pp. 3259–3269. PMLR (2020)
9. Frankle, J., Dziugaite, G.K., Roy, D., Carbin, M.: Stabilizing the lottery ticket hypothesis. arXiv preprint arXiv:1903.01611 (2019)
10. Frankle, J., Dziugaite, G.K., Roy, D., Carbin, M.: Pruning neural networks at initialization: why are we missing the mark? arXiv preprint arXiv:2009.08576 (2020)
11. Gale, T., Elsen, E., Hooker, S.: The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574 (2019)
12. Ghadimi, S., Lan, G.: Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4), 2341–2368 (2013)
13. Guo, Y., Yao, A., Chen, Y.: Dynamic network surgery for efficient DNNs. arXiv preprint arXiv:1608.04493 (2016)
14. Han, S., Mao, H., Dally, W.J.: Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149 (2015)
15. Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient neural network. In: Advances in Neural Information Processing Systems, pp. 1135–1143 (2015)
16. Hanson, S.J., Pratt, L.Y.: Comparing biases for minimal network construction with back-propagation. In: Advances in Neural Information Processing Systems, pp. 177–185 (1989)
17. Hassibi, B., Stork, D.G.: Second order derivatives for network pruning: optimal brain surgeon. In: Advances in Neural Information Processing Systems, pp. 164–171 (1993)
18. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034 (2015)
19. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
20. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
21. Krizhevsky, A., Nair, V., Hinton, G.: Cifar-10 and cifar-100 datasets. https://www.cs.toronto.edu/~kriz/cifar.html 6(1), 1 (2009)
22. Kusupati, A., et al.: Soft threshold weight reparameterization for learnable sparsity. arXiv preprint arXiv:2002.03231 (2020)
23. LeCun, Y.: The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/ (1998)
24. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
25. LeCun, Y., Denker, J.S., Solla, S.A.: Optimal brain damage. In: Advances in Neural Information Processing Systems, pp. 598–605 (1990)
26. Lee, N., Ajanthan, T., Torr, P.H.S.: SNIP: single-shot network pruning based on connection sensitivity. arXiv preprint arXiv:1810.02340 (2018)
27. Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P.: Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710 (2016)
28. Liu, Z., Sun, M., Zhou, T., Huang, G., Darrell, T.: Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270 (2018)
29. Louizos, C., Welling, M., Kingma, D.P.: Learning sparse neural networks through L0 regularization. arXiv preprint arXiv:1712.01312 (2017)
30. Luo, J.-H., Wu, J., Lin, W.: ThiNet: a filter level pruning method for deep neural network compression. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5058–5066 (2017)
31. Mocanu, D.C., Mocanu, E., Stone, P., Nguyen, P.H., Gibescu, M., Liotta, A.: Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nat. Commun. 9(1), 1–12 (2018)
32. Molchanov, P., Tyree, S., Karras, T., Aila, T., Kautz, J.: Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440 (2016)
33. Narang, S., Elsen, E., Diamos, G., Sengupta, S.: Exploring sparsity in recurrent neural networks. arXiv preprint arXiv:1704.05119 (2017)
34. Natarajan, B.K.: Sparse approximate solutions to linear systems. SIAM J. Comput. 24(2), 227–234 (1995)
35. Nesterov, Y.E.: A method for solving the convex programming problem with convergence rate O(1/k^2). In: Dokl. Akad. Nauk SSSR, vol. 269, pp. 543–547 (1983)
36. Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703 (2019)
37. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat., 400–407 (1951)
38. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015)
39. Savarese, P., Silva, H., Maire, M.: Winning the lottery with continuous sparsification. arXiv preprint arXiv:1912.04427 (2019)
40. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
41. Tanaka, H., Kunin, D., Yamins, D.L.K., Ganguli, S.: Pruning neural networks without any data by iteratively conserving synaptic flow. arXiv preprint arXiv:2006.05467 (2020)
42. Tartaglione, E., Lepsøy, S., Fiandrotti, A., Francini, G.: Learning sparse neural networks via sensitivity-driven regularization. arXiv preprint arXiv:1810.11764 (2018)
43. Wang, C., Zhang, G., Grosse, R.: Picking winning tickets before training by preserving gradient flow. arXiv preprint arXiv:2002.07376 (2020)
44. Wen, W., Wu, C., Wang, Y., Chen, Y., Li, H.: Learning structured sparsity in deep neural networks. arXiv preprint arXiv:1608.03665 (2016)
45. You, H., et al.: Drawing early-bird tickets: towards more efficient training of deep networks. arXiv preprint arXiv:1909.11957 (2019)
46. Zhu, M., Gupta, S.: To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878 (2017)
Comparison of Adversarial and Non-Adversarial LSTM Music Generative Models
Moseli Mots’oehli1(B), Anna Sergeevna Bosman2, and Johan Pieter De Villiers3
1 Department of Information and Computer Science, University of Hawai’i at Manoa, Honolulu 96822, USA [email protected]
2 Department of Computer Science, University of Pretoria, Pretoria 0002, South Africa [email protected]
3 Electrical and Computer Engineering, University of Pretoria, Pretoria 0002, South Africa [email protected]
Abstract. Algorithmic music composition is a way of composing musical pieces with minimal to no human intervention. While recurrent neural networks are traditionally applied to many sequence-to-sequence prediction tasks, including successful implementations of music composition, their standard supervised learning approach based on input-to-output mapping leads to a lack of note variety. These models can therefore be seen as potentially unsuitable for tasks such as music generation. Generative adversarial networks learn the generative distribution of data and lead to varied samples. This work implements and compares adversarial and non-adversarial training of recurrent neural network music composers on MIDI data. The resulting music samples are evaluated by human listeners, and their preferences are recorded. The evaluation indicates that adversarial training produces more aesthetically pleasing music.
Keywords: Music Generation · MIDI · Generative Adversarial Networks · Long-Short Term Memory Neural Networks
1
Introduction
Like most art forms, music composition has long been a skill specific to human beings. Music composition has an intuitive side that is needed to determine which pitches create harmony together, which chords can be played after a certain note, and which note progressions violate intrinsic music theory. With the recent successes of neural networks in predictive modeling of natural behaviour and in generative modeling, there have been good applications of modeling note progression probabilities for music generation. The two dominant approaches to neural music generation are adversarial training [14,27,41,50] and sequence-to-sequence recurrent networks [10,45,46], each with its merits. Although waveform representations have been shown to be a viable way to generate audio not necessarily specific to music [31], symbolic representations are favoured in the literature for the task of music generation [10,11,26,29,50]. Owing to the lack of outright comparisons between adversarial and non-adversarial training for music generation, the aim of this study is to compare music samples generated by two generative models, one trained in an adversarial setting and the other in a non-adversarial setting, using musical instrument digital interface (MIDI) data. This work strives to demonstrate two points: (1) that generative adversarial networks (GANs) with long short-term memory (LSTM) cells can be used to generate polyphonic music that is realistic, creative, and pleasing to listen to, and (2) that generative adversarial models with LSTM cells perform better than an identical non-adversarial LSTM-based generator. An LSTM-based neural network is trained in an adversarial setting to generate music in MIDI format and compared to an LSTM encoder-decoder network [7], which is not trained in an adversarial setting. Although adversarial training is much more complex than the encoder-decoder configuration for sequence-to-sequence models, a GAN’s ability to model note progression by sampling a latent space leads to a more diverse generator. A Wasserstein generative adversarial network (WGAN) [4] is implemented instead of the maximum likelihood estimation (MLE) based GAN to ensure stable adversarial training.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): SAI 2023, LNNS 711, pp. 428–458, 2023. https://doi.org/10.1007/978-3-031-37717-4_28
Although MidiNet [50], a GAN-based convolutional neural network (CNN) music generator, has been shown to produce better results than MelodyRNN [45], which uses models built on recurrent neural network (RNN) cells, the two networks implement two different generator types, CNN and RNN, respectively. Thus that study provides no evidence to support the hypothesis that GAN training produces superior results in the music generation domain. The musical data used in this work is in MIDI format. The simplicity of pre-processing MIDI data compared to pre-processing raw audio made MIDI a more suitable choice of music representation for the training data. To ensure music quality is not negatively affected by data representation, a common 2D state-matrix representation [38] of note progression is adopted for both training configurations. Although multi-track notes are captured, learning multi-track probability distributions is not essential for comparing the networks’ ability to model note progression, so a MIDI type 0 file is assumed instead, as explained in Sect. 3 on decoding the resulting music and playback. The background of the architectures used in this study is also explored, beginning with fully connected feed-forward neural networks, activation functions, convolutions, recurrent networks, LSTM cells, and adversarial training. Standard music evaluation methods are used to come to a conclusion. The manuscript is organized as follows. In Sect. 2, we detail the literature on artificial intelligence-based music generation, followed by a description of the dataset used in Sect. 3. In Sects. 4 and 5, we briefly discuss the music note encoding, decoding, and finally the models the data is trained on. As is standard in machine learning, Sects. 6 and 7 detail the evaluation metrics and the experiments performed in assessing the performance of the trained models. We present results and discussions in Sect. 8, before concluding and providing future research directions in Sect. 9.
2
Related Work
The goal of algorithmic music generation is to develop systems that automate the composition process while still achieving results comparable to human-generated music. Although music as an art form has existed for millennia, the earliest publications on algorithmic composition date only from 1960 [52]. It was only towards the mid-1970s that significant interest and research were put into algorithmic music generation. There are different approaches to algorithmic music generation, such as mathematical models (stochastic processes) [49], grammar-based methods [26], learning algorithms [10,11,14,45], and evolutionary methods [2]. However, the most promising results in recent years have come from learning algorithms, in particular deep neural networks. The focus of this study is on music composition using artificial neural network (ANN) learning algorithms [31,50]. A number of inventions in the deep learning domain underpin the majority of the work on neural music generation. The LSTM network [19] is well suited to learning sequential data such as audio, and can recall notes generated a number of time steps back because it addresses the vanishing gradient problem that other RNNs suffer from. GANs [17] are especially useful for generating realistic data while reducing over-fitting, and have been found to produce creative art [16,23,51]. The majority of music-generating neural networks to date are trained on either jazz or classical music, and either only the piano track is used or all other tracks are played back on the piano. SeqGAN [51] is a hybrid of deep learning and reinforcement learning (RL) that uses an RL generator agent to guide the generative learning. By using RL with the discriminator providing the reward function, SeqGAN [51] is able to outperform standard MLE-based GANs in music generation.
More recent work [22] shows again that SeqGAN outperforms other methods on a music generation benchmarking task based on the drum loops and FreeLoops datasets [3]. Like the authors of [22], we implement a sequence-based GAN and consider subjective evaluation of the musical pieces produced. The study [11] proposed BachProp, an LSTM-based network for learning note progression independent of note representation. It uses a three-layered LSTM architecture to model notes, their timing, and their duration by conditioning the two other attributes on the current note at each time step. Although BachProp uses a normalized MIDI representation of all training songs, the authors neglect to indicate how the network is representation invariant, as it assumes MIDI input data. As in BachProp, [29] introduced the continuous-recurrent GAN (CRNNGAN) for the same task, and adopted a similar network structure with three stacked LSTM layers in the generator network to enable learning of high-complexity notes, chords, and melody, with an additional MIDI feature (note intensity) over BachProp. Unlike BachProp [11], CRNNGAN is trained in an adversarial setting. Due to its three-layered generator and continuous representation, CRNNGAN is limited to producing only up to three different tones per time step, hence producing music that is not rich in polyphony. The authors of [10] show that a gated recurrent unit neural network can be used to achieve results at least comparable to those of more sophisticated gated networks such as LSTM, keeping all training parameters equal. They train both models on piano-track MIDI data and evaluate generated samples for polyphony. However, comparison over multiple datasets produced inconclusive results. MelodyRNN [45] is a collection of RNN models (LookbackRNN and AttentionRNN) for polyphonic music generation trained on MIDI data. The LookbackRNN implements a look-back mechanism to help the network recall very long dependencies in generation, and the AttentionRNN implements an attention mechanism [13] for increased note repetition to improve rhythm. WaveNet [31], introduced by the Google DeepMind team, is a completely probabilistic network that uses the same architecture as PixelCNN [32] to generate raw audio waveforms from a dataset of mp3 files with multiple tagged genres. WaveNet was developed mainly for the task of text-to-speech synthesis and uses multiple layers of time-dilated convolutional networks instead of more traditional and better-suited sequence models such as RNNs. The DeepMind team does this to avoid the long training time required for RNNs. However, WaveNet’s generator is not trained in an adversarial setting, and no quantitative results on music generation are reported by the authors. Residual CNNs are successfully implemented in the GAN training framework in [6], where the task is to generate between-track transitions similar to those made by a disc jockey (DJ).
The authors of [6] also define a custom subjective evaluation methodology and conclude that the musical pieces produced are competitive with those of human DJs. MidiNet [50] extends the WaveNet approach to MIDI files and trains CNNs in an adversarial setting. Both networks learn note progression by sequentially conditioning future notes on the distribution of previously generated notes using dilated convolutions. Although MidiNet outperforms standard RNNs, and produces music that is considered more varied and pleasing to listen to than both the LookbackRNN and AttentionRNN, it is unclear how it would compare to LSTM and gated recurrent unit (GRU) based networks, which are superior to standard RNNs on very long sequence tasks. This work implements an encoder-decoder LSTM network in the same fashion as BachProp, and an LSTM-based GAN as in [11,29,51]. The authors of [15] also use the encoder-decoder architecture for evaluation of end-to-end polyphonic optical music recognition models. The majority of the work mentioned above uses MIDI training data, and only a few train from raw waveforms. The most common way of representing the MIDI features to be learned is a 2D matrix [14] of binary entries, where the pitch is on one axis and the other axis represents time in ticks. In some cases, the real continuous values from the MIDI messages are the entries of the 2D matrix [29,46]. Some work has been done on learning multi-track note progressions, although results show a lack of synchrony and cross-track dependency learning. The research [9] implements a sequence-to-sequence (Seq2Seq) model of four stacked LSTM layers to model multi-track note progression, with each LSTM cell generating its own track’s output. Other notable neural network approaches for music generation include variational autoencoders [1,34], restricted Boltzmann machines [44], and deep belief neural networks [5] to learn note progression probabilities that boost rhythmic scores. They show how the resulting model can be used as a prior distribution for training an RNN for polyphonic music. In [39], the authors introduce “Music Frameworks”, a hierarchical music representation structure that allows for the treatment of melody, rhythm, and chord progression as separate but conditional tasks. Unlike all the work listed thus far, Music Frameworks implements two transformer-based networks [42], which have been shown to outperform LSTMs in the natural language processing (NLP) realm; however, its musical representation falls short when it comes to capturing polyphony and harmonic progression. As in most of the listed literature, Music Frameworks relies on subjective evaluation methods. We find that a gap exists in comparing music generation methods based on GAN training versus non-GAN training. It is also unclear in the literature how the same dataset of MIDI data would be treated in both settings. This work addresses these two points and shows that GANs deserve more attention in generative tasks of an artistic nature.
3
Database
Table 1. A Sample MIDI File with Note Action Messages Presented in Tabular form. The “Note” Column Represents Pitch. Note Type Communicates which Notes are on at each Point in Time, and on which of the 16 Channels the Message is Transmitted

Channel  Note  Time  Type      Velocity
0        39    0     note on   80
0        58    0     note on   80
0        46    0     note on   80
0        61    0     note on   80
0        70    0     note on   80
0        70    240   note off  64
9        39    0     note on   80
9        39    480   note off  64
1        46    0     note on   80
Musical instrument digital interface (MIDI) is a set of standards that outline how to connect digital music instruments so that they can communicate using a standardized messaging layout. These messages are termed MIDI messages and encode instructions that can be decoded to produce sound by any digital instrument that conforms to the MIDI standard. There are several MIDI message types: “note on” messages, “note off” messages, meta messages, control change messages, program change messages, and tempo change messages. To assemble a fully functional MIDI file, all these messages are required, but for the purpose of modeling note progression and evaluating the model’s ability to generate audio, “note on” and “note off” messages on their own are sufficient, with other messages being set to default values. Each note event message carries several attributes. Table 1 shows how note action messages are stored in a MIDI file. Since MIDI files are compact relative to the raw audio they represent, there are numerous training dataset sources available online to choose from for the task of neural music generation. The Lakh MIDI dataset [33] is one of the most commonly used for this purpose. The dataset is a collection of 176,581 unique MIDI files of different genres and composers. Of the seven versions of the dataset, the “clean MIDI subset” version containing 17,257 songs with filenames indicating song titles and artists was used. This dataset is 224 MB compressed and 770 MB uncompressed. Due to memory constraints, only 289 distinct polyphonic songs from 10 composers were used for training, comprising only 10 MB of disk space. Table 2 gives a statistical summary of the training data used for the different composers.

Table 2. Training Dataset Description by Composer and Number of Songs. The Number of Notes Represent Both Note on and Off Messages to be Learned

Composer       Number of Songs  Number of Notes
Beethoven      29               158978
Billy Joel     117              947296
Borodin        7                24260
Diana Ross     3                39022
Elton John     31               330044
Elvis          5                29776
Frank Sinatra  45               268180
Liszt          16               53366
Mendelssohn    15               79516
Mozart         21               81470
Total          289              2011908

4
Methods
The note progression state-matrix representation algorithm used in this work is adopted from [38]. It takes a standard MIDI file as input and transforms the “note on” and “note off” messages into a matrix of binary entries. In this representation, only information pertaining to note messages is kept, that is, whether a note is on or off, and the timing and duration of the note. The algorithm consists of two separate processes, one for encoding a MIDI file into the note state-matrix representation, and the other for decoding an existing note state-matrix into a valid MIDI file that can be transcribed by any MIDI-enabled device. The encoding and decoding processes are discussed in the following subsections.
4.1
Encoding
Starting with a MIDI file with textual and numerical data, the aim is to extract and transform into binary all note information to be used as training data for music generation. For each file, all messages relating to the same pitch are first lined up in ascending order of their delta times, with each present pitch having a list that starts at tick time 0. Adding up all the delta times on the pitch that plays last in the song produces the duration of the song. From the assembly of messages per pitch present in a song, three important attributes, namely pitch, time and note message type, are extracted and used to create the state-matrix representation of the song.
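The first encoding step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the algorithm of [38]: delta times are assumed to be measured in ticks from the previous message in the track (the usual Standard MIDI File convention), and the `NoteMessage` type and `group_by_pitch` helper are hypothetical names. The first six messages mirror the channel-0 rows of Table 1; the final note-off is made up so that the example has a second closed note.

```python
from collections import defaultdict
from typing import NamedTuple


class NoteMessage(NamedTuple):
    kind: str   # "note_on" or "note_off"
    pitch: int  # MIDI pitch, 0-127
    delta: int  # ticks since the previous message in the track


def group_by_pitch(messages):
    """Return ({pitch: [(absolute_tick, kind), ...]}, song_length_in_ticks)."""
    per_pitch = defaultdict(list)
    now = 0
    for msg in messages:
        now += msg.delta  # a running sum turns delta times into absolute ticks
        per_pitch[msg.pitch].append((now, msg.kind))
    return dict(per_pitch), now


track = [
    NoteMessage("note_on", 39, 0),
    NoteMessage("note_on", 58, 0),
    NoteMessage("note_on", 46, 0),
    NoteMessage("note_on", 61, 0),
    NoteMessage("note_on", 70, 0),
    NoteMessage("note_off", 70, 240),
    NoteMessage("note_off", 39, 240),  # hypothetical message, not taken from Table 1
]

per_pitch, song_length = group_by_pitch(track)
for pitch in sorted(per_pitch):
    print(pitch, per_pitch[pitch])
print("song length in ticks:", song_length)
```

Each pitch ends up with its own list of events ordered in time, and the tick of the last message gives the length of the song, which fixes the number of rows of the state matrix.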
Fig. 1. Visualization of Sample MIDI Files in the Note State-Matrix Representation.
Extracting the pitch and note message type are fairly straightforward operations since they are explicitly contained in each of the ordered messages. The M ×2N matrix is constructed so that the M rows represent tick times. Of the 2N columns representing the pitch information for N possible pitch values, the first N columns uniquely identify each pitch value, and the next N indicates whether the pitch was played in the previous time step or not. N represents the number of possible pitches which is 128 by default. The entries into the matrix are binary to indicate the state of the note at each tick time on all pitches. Figure 1 shows the binary heat-maps of the state-matrix representations of notes in two songs. This representation allows for easy expression of multiple chords, hence allowing the models to capture as much information on polyphonic note progressions as possible. One complication with the state-matrix representation is that it creates an imbalance in the prediction space. This is because each time-step
Adversarial Music Generation
435
in the matrix contains a very small proportion of note-on signals as compared to note-off signals for the 128 available pitches. Table 3 shows the proportion of prediction instances representing positive note activations and negative activations for the training dataset used in this study.
Table 3. Number of Prediction Instances/Note Messages in the Training Dataset Showing the Data Class Imbalance Inherent in the State-Matrix Representation Used in this Work
Note Message Type    # Messages    Percentage
Note-off             5924916       96.63%
Note-on              199884        3.37%
Total                6124800       100%
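The effect of the imbalance in Table 3 can be made concrete with a small calculation. The sketch below is illustrative only: it takes the message counts from Table 3, assumes a hypothetical baseline that predicts note-off everywhere, and computes balanced accuracy in its usual form as the mean of the per-class recalls. The function names are chosen for this example.

```python
def accuracy(tp, tn, fp, fn):
    """Plain accuracy: the share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def balanced_accuracy(tp, tn, fp, fn):
    """Mean of the per-class recalls for the note-on and note-off classes."""
    recall_on = tp / (tp + fn)     # share of note-on instances recovered
    recall_off = tn / (tn + fp)    # share of note-off instances recovered
    return (recall_on + recall_off) / 2


# Message counts taken from Table 3.
NOTE_OFF, NOTE_ON = 5924916, 199884

# Hypothetical baseline that predicts note-off for every instance:
# every note-off is a true negative, every note-on is a false negative.
baseline = dict(tp=0, tn=NOTE_OFF, fp=0, fn=NOTE_ON)
baseline_acc = accuracy(**baseline)
baseline_bacc = balanced_accuracy(**baseline)
```

The baseline scores well above 0.9 on plain accuracy while its balanced accuracy is exactly 0.5, the value expected from a model that has learned nothing about the minority class.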
Owing to the imbalance shown in Table 3, a model that simply predicts note-off messages for all 128 pitch values at each time step will achieve an accuracy score of approximately 96%. Balanced accuracy (BAcc), which averages the prediction accuracy obtained on each class so that both classes carry equal weight, is not prone to the same problem as prediction accuracy [8,18,43], and so it is used for both training and testing in this work. To create the training dataset, all training and test MIDI files are passed through this process to create a collection of note progression state matrices, which results in a three-dimensional matrix. This is ideal since the models explained in Sect. 5 expect a three-dimensional input.
4.2 Decoding
The decoding of a note state-matrix is the reverse of the encoding process, and as such is highly dependent on the encoding. This process accepts as input a binary state-matrix of the dimension defined in the encoding phase and transforms the matrix into a valid MIDI file. Without this process, it would be hard to evaluate the quality of audio samples generated by both the encoder-decoder and WGAN models discussed in Sect. 5. To be able to generate a valid transcribable MIDI song, each observation in the state-matrix is written as a note message in the MIDI file comprising the following attributes as a minimum requirement: channel, pitch, time, and velocity. The note message type is also a necessary attribute for valid MIDI transcription. For play-back purposes, a meta-tempo message is also sent to the file before all other note messages. However, the models compared in this work only model pitch progression states, and so the velocity, channel, and playback tempo are kept constant at 70, 1 (Acoustic Grand Piano), and 120 respectively to ensure the audio is loud enough and of standard pace. The delta time of each message is calculated based on the number of ticks between the current message and the previous message of the same pitch, scaled by the constant file tempo. In the case that an observation in the state-matrix representation contains information about more than one pitch,
multiple MIDI messages are generated from this one observation with the same delta time. The time attribute $d_{t,s}$ for pitch $s$ is extracted only after all note progression states in the matrix are determined, using formula 1 below:
$$d_{t,s} = \frac{v_s^{t} - v_s^{t-1}}{\tau} \qquad (1)$$
where $v_s^{t}$ is the tick time of the current message containing the instruction for state $s$, $v_s^{t-1}$ is the tick time of the previous message containing information on the same state, and $\tau = 120$ represents the fixed standardizing tempo. Decoding pitch is more complex than all the other required attributes. In the note state-matrix representation of binary entries, columns 0 to N − 1 represent pitch activations, and columns N to 2N − 1 represent pitch retention of each observation. A value of 1 in columns 0 to N − 1 is recorded as a “note on” message, and a value of 0 denotes a “note off” message for the pitch encoded by the column. In columns N to 2N − 1, a value of 1 instructs the MIDI transcriber to activate the corresponding pitch in columns 0 to N − 1, while a value of 0 releases the pitch. Activations of a pitch at constant velocity over consecutive time steps until release are heard by a human listener as one elongated pitch press. Once the state-matrix attributes have been extracted, they are written to a MIDI file. The decoding process assumes a MIDI file of type 1, where all messages are written to one track.
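The decoding step can be illustrated with a short sketch. It is a simplification under stated assumptions rather than the procedure used in this work: only the activation columns are read, a message is emitted whenever the state of a pitch changes between consecutive ticks, and the per-pitch delta time follows Eq. 1 with τ = 120. The function name and the `(message_type, pitch, delta_time)` tuple layout are hypothetical, and the constant velocity, channel, and tempo are left out for brevity.

```python
TAU = 120  # fixed standardizing tempo from Eq. 1


def decode_state_matrix(matrix, n_pitches=128, tau=TAU):
    """Turn a binary state-matrix into (message_type, pitch, delta_time) tuples.

    A message is emitted whenever the on/off state of a pitch changes between
    consecutive ticks; delta_time is measured per pitch as in Eq. 1.
    """
    messages = []
    last_tick = [0] * n_pitches      # tick of the previous message on each pitch
    previous = [0] * n_pitches       # every pitch starts released at tick 0
    for tick, row in enumerate(matrix):
        current = row[:n_pitches]    # only the activation columns are read here
        for pitch in range(n_pitches):
            if current[pitch] != previous[pitch]:
                kind = "note_on" if current[pitch] == 1 else "note_off"
                delta = (tick - last_tick[pitch]) / tau
                messages.append((kind, pitch, delta))
                last_tick[pitch] = tick
        previous = current
    return messages


# Hypothetical toy matrix with N = 2 pitches (retention columns set to 0).
toy = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
toy_messages = decode_state_matrix(toy, n_pitches=2)
```

When one row changes the state of several pitches, the loop emits one message per pitch for the same tick, which mirrors the case of multiple MIDI messages generated from a single observation.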
5 Models
The encoder-decoder and the adversarial generative models are presented in Sects. 5.1 and 5.2, respectively. Their architectures are explored together with some of the key structural training parameters such as activation functions, number of stacked LSTM layers, and optimizers used. Since the training configurations are different, the last part in each configuration explains how the preprocessed note state-matrix data is presented to the network for training.
5.1 Encoder-Decoder LSTM
The encoder-decoder configuration [7] is well suited to the modelling of sequential tasks such as text translation, textual question answering, text summarization and music generation. In this configuration, two networks joined end to end and sharing a single objective function are trained so that the first network (encoder) is fed all the training sequences and encodes what it has learned in a low-dimensional vector. The decoder then learns the mapping from this encoded “thought vector” to the desired sequence of outputs. The decoder does this by optimizing a loss function, and is trained using backpropagation through time (BPTT) [47]. Since the two networks are connected, gradients from the decoder are propagated all the way back to the encoder, so the encoder also improves its encoding process.
It has been shown that stacking LSTM cells results in improved performance [12,29], and so three stacked LSTM cells in both the encoder and decoder networks were implemented. The reason for stacking LSTM layers is to capture multiple levels of abstraction that could be inherent in the temporal data. For music generation, the sequence of note information encodes not just note progressions per track, but also chord progressions and phrases that contribute to the rhythm and melody of the input and output. Capturing both the forward and backward flow of input during encoder training has been shown to significantly improve the cell state’s memory retention and subsequently the quality of the sequences generated by the decoder [12,37]. For this reason, one of the encoder layers reads the input sequence backwards, another reads it forwards, and the third takes the result of the aggregation function $a(h_1, h_2)$ over the other two layers’ hidden states as input to produce the final hidden state $V_T = \theta_3(a(h_1, h_2))$, where $\theta_i(\cdot)$ comprises all LSTM equations. Figure 2 shows the bidirectional stacked encoder network. This bidirectional input approach was successfully implemented for music generation in [27,29].
Fig. 2. The Encoder Bidirectional LSTM Network. Inputs St+i Represent Note Pitch Information per Time-Series Observation (Tick) Pulled from the 2D Note Progression State-Matrix
The first two LSTM layers can be considered to be on the same hierarchical level, hence making up just a single bidirectional layer. The third LSTM is stacked over the bidirectional layer. The aggregation function can be a summation, multiplication, or concatenation operation over the hidden states of the forward and backward reading LSTM cells. Each LSTM cell internally consists of standard LSTM operations. The decoder comprises three stacked unidirectional LSTM cells with 256 neurons each, and sigmoid output activation. Unlike the encoder, the decoder is used in two modes, namely training and inference. These two modes differ in the manner in which information flows from input to output until the termination of decoding. During training, the decoder’s hidden and cell states are initialized using the encoder’s final states.
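The three aggregation choices named above are simple element-wise or joining operations on the two hidden-state vectors. The sketch below shows them on plain Python lists; the function name `aggregate` and the mode strings are hypothetical, and which option was used in the experiments is not stated in this section. In Keras, the `Bidirectional` wrapper exposes a comparable choice through its `merge_mode` argument.

```python
def aggregate(h_forward, h_backward, mode="concat"):
    """Combine forward and backward hidden states into one vector."""
    if len(h_forward) != len(h_backward):
        raise ValueError("hidden states must have the same length")
    if mode == "sum":
        return [f + b for f, b in zip(h_forward, h_backward)]
    if mode == "mul":
        return [f * b for f, b in zip(h_forward, h_backward)]
    if mode == "concat":
        return list(h_forward) + list(h_backward)
    raise ValueError(f"unknown aggregation mode: {mode}")


# Toy hidden states of size 2 for illustration only.
h_fwd, h_bwd = [1.0, 2.0], [3.0, 4.0]
summed = aggregate(h_fwd, h_bwd, "sum")
multiplied = aggregate(h_fwd, h_bwd, "mul")
concatenated = aggregate(h_fwd, h_bwd, "concat")
```

Summation and multiplication keep the hidden size unchanged, whereas concatenation doubles the width of the input seen by the third, stacked LSTM layer.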
Since an LSTM cell expects three inputs, a zero vector $st_0 = 0$ is presented as a primer to the decoder, and the first predicted note progression state $\hat{st}_1$ is generated by the network. During training, the real musical note progression states $st_t, st_{t+1}, \ldots, st_{t+n-1}$ are presented as input per time step to predict the orderly sequence of note progression states $\hat{st}_{t+1}, \hat{st}_{t+2}, \ldots, \hat{st}_{t+n}$.
Fig. 3. The Decoder Unidirectional LSTM Network. Inputs St+i Represent Note Pitch Information per Time-Series Observation (Tick) Pulled from the 2D Note Progression State-Matrix.
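The difference between the decoder's two modes can be sketched with a toy recurrence. Everything in the block is a stand-in: `toy_step` is a hypothetical placeholder for one decoder step and is not an LSTM, and feeding the thresholded prediction back as the next input is an assumption of this sketch. It only shows where the input of each time step comes from, namely the ground-truth state during training and the model's own previous prediction during inference.

```python
def toy_step(x, hidden):
    """Hypothetical stand-in for one decoder step (not a real LSTM)."""
    new_hidden = [0.75 * h + 0.25 * v for h, v in zip(hidden, x)]
    probs = new_hidden[:]            # pretend the hidden state is the output
    return probs, new_hidden


def run_teacher_forced(targets, hidden, step=toy_step):
    """Training mode: a zero primer, then the real previous state each step."""
    inputs = [[0.0] * len(targets[0])] + targets[:-1]
    outputs = []
    for x in inputs:
        probs, hidden = step(x, hidden)
        outputs.append(probs)
    return outputs


def run_inference(n_steps, hidden, step=toy_step, threshold=0.5):
    """Inference mode: the thresholded prediction becomes the next input."""
    x = [0.0] * len(hidden)          # zero primer
    outputs = []
    for _ in range(n_steps):
        probs, hidden = step(x, hidden)
        x = [1.0 if p > threshold else 0.0 for p in probs]
        outputs.append(x)
    return outputs


train_out = run_teacher_forced([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0])
infer_out = run_inference(3, [1.0, 0.0])
```

In training mode an early mistake cannot propagate because the next input is always the true state, while in inference mode each output depends on the previous prediction.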
As in all other optimization-based networks, the predictions $\hat{st}_{t+i}$ are compared to the actual note states $st_{t+i}$ to calculate the loss. The RMSprop [30] optimizer is used to minimize the overall encoder-decoder network cross-entropy objective. RMSprop is the recommended choice for training RNNs according to Keras as it speeds up training, and has been successfully used in training RNNs on temporal data [24,28,35]. Post-training, a short sequence of notes is presented to the encoder, which passes on a low-dimensional vector representation of its final hidden state to the decoder. Inference proceeds with a primer input: the first note state prediction is passed as input into the decoder in the next time-step, and the process continues recursively until the desired number of time-steps of output is reached, as depicted in Fig. 3. In all the LSTM cells, dropout(0.3) is used for regularization. Since each song is represented by a T × 2N matrix, where T represents the number of ticks/time-steps and N is the number of allowable pitches to model, all LSTM cells have layers containing 2N hidden units. The algorithm used for extracting note information from the MIDI file and creating the note state-matrix representation was discussed in Sect. 4.1. Encoder-decoder networks are sequence-to-sequence models, and so accept a sequence of priming notes in order to be able to produce an output sequence. In music generation, as
in any other sequence-to-sequence task, the more data presented to the encoder network, the better the quality of information available to the decoder network. However, longer sequences can have negative effects on learning due to vanishing gradients. For encoder-decoder music generation, all songs are limited to N ticks; m of these, where m < N, are presented to the encoder and n, with m < n < N, to the decoder during training. Note that m + n = N. All training songs are split in this manner and presented to the network in batches of eight. During inference, the network expects a primer sequence of m notes and generates n note progression state probabilities, and a threshold value of 0.5 is used to turn notes on. The threshold is set to 0.5 because the values being predicted are probabilities of a note being played. Once a sequence of notes has been generated, it is presented to the MIDI note state-matrix decoding method in Sect. 4.2 to produce a MIDI file that can play on any MIDI-enabled device. In the following section, the WGAN implementation is discussed.
5.2 LSTM WGAN
Wasserstein distance or earth mover (EM) distance is a measure of the amount of work required to move one probability distribution to another. While a traditional GAN seeks a density distribution $P_\theta$ that maximizes the likelihood of samples from the distribution $P_r$ to be modeled, which amounts to minimizing the Kullback-Leibler (KL) divergence, WGAN minimizes a reasonable approximation to the EM distance [4]. The EM distance is given by the equation:
$$W(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \mathbb{E}_{(x,y) \sim \gamma}\big[\lVert x - y \rVert\big] \qquad (2)$$
where $P_r$ and $P_\theta$ represent the unknown target distribution and the learned parametrized density distribution respectively, and $x, y \in \mathbb{R}^d$. This objective has properties that ensure convergence in situations where other distance measures fail to converge. An example is the case in traditional GANs where the support of $P_\theta$ is drawn from a low-dimensional latent space. If the overlap between this support and that of the real data generating distribution $P_r$ is negligible, most distance measures are invalid or deem the distance infinite. As a result of this, WGAN is much more stable compared to GAN, and needs less architectural hyper-parameter tuning of the generator and discriminator. As such, WGAN is an improvement over standard GAN training. The WGAN implementation in this work consists of an LSTM generator and discriminator. Since the goal is to compare adversarial training to encoder-decoder configured training, both the generator and discriminator network architectures are identical to the decoder and encoder respectively in terms of the number of LSTM layers, number of neurons in each layer, and activation functions except for the output activation.
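A one-dimensional toy case shows why the EM distance stays informative when two distributions do not overlap. For two equally sized samples on the real line, the EM distance between their empirical distributions equals the mean absolute difference between the sorted values. The sketch below relies on that one-dimensional shortcut only; it is not how the distance is estimated inside a WGAN, where the critic network approximates it, and the function name is chosen for this example.

```python
def em_distance_1d(xs, ys):
    """EM (Wasserstein-1) distance between two equally sized 1-D samples."""
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    pairs = zip(sorted(xs), sorted(ys))   # optimal transport pairs in 1-D
    return sum(abs(a - b) for a, b in pairs) / len(xs)


# A "real" sample and shifted copies standing in for generated samples.
real = [0.0, 1.0, 2.0]
distances = {}
for shift in (4.0, 2.0, 0.0):
    fake = [x + shift for x in real]
    distances[shift] = em_distance_1d(real, fake)
```

The distance shrinks in step with the shift even while the two samples share no values, which is the kind of smooth training signal that a divergence reporting an infinite value for disjoint supports cannot give.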
Other training hyper-parameters such as learning rate and stopping criteria are left to vary per training configuration and are discussed in Sect. 7. Below is a detailed description of the WGAN implementation used in this study. The generator G(z) in WGAN is similar to any other generative model that generates melodies from random noise. The specific implementation in this study has three stacked unidirectional LSTM layers with 256 neurons in each layer, an input layer, a fully connected layer before the output, and an output layer. The input layer accepts a matrix of latent variables of size 256 for each generation time-step from the standard normal distribution. The first LSTM layer takes this latent matrix as input per time-step and passes the information on to the higher layers, and the state output per time-step is passed on to the next time-step to generate a sequence of outputs. The outputs should transform into realistic note state-matrix probabilities over the course of training. The output of the generator network is forced to have the same dimensions as those of the state-matrix representation of the real data. In traditional GAN training, the discriminator D(x) is an adversary to G(z) as they are trained in competition. The D(G(z)) in WGAN acts more in partnership with G(z) in that D(G(z)) is trained to optimality much faster than G(z), and so D(G(z)) is able to provide important loss information to G(z) very early in training to speed up G(z)’s convergence. In this manner, the discriminator is comparable to the encoder-decoder network in that it provides the necessary information for the generator’s optimal training. The D(G(z)) implementation has the same number of LSTM layers, including one bidirectional layer, as the encoder. The network expects input with dimensions consistent with the note state-matrix representation of the MIDI files and produces a scalar output to ensure the distance between real and fake sample outputs is as large as possible.
This is unlike traditional GANs that produce a probability that the song is a sample from the real data generating distribution. For this purpose, a linear output activation function is used for D(G(z)). The scalar output helps in calculating the EM distance, which is used in D(G(z))’s and G(z)’s loss functions. The discriminator takes both real note samples and random noise as inputs together with their labels: 1 for real and -1 for fake samples. Random noise is passed through a partially trained generator at that point in time for each epoch, and the resulting samples represent generated music. To ensure D(G(z)) is always more informed than G(z) and able to guide G(z)’s loss to optimality, D(G(z)) is trained for five epochs for every single epoch of G(z). Figure 4 shows the structural setup of the WGAN configuration with D(G(z)) and G(z) as explained above, as applied to image generation.
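The pieces of the training scheme described above can be sketched without a deep learning framework. The block below is a hedged illustration and not the code used in this study: it assumes the loss is the mean of label × critic score (one common convention when real and fake samples are labelled 1 and -1), it applies the constant 0.01 as weight clipping in the manner of the original WGAN algorithm, and it lays out the five-to-one update schedule. All function names are hypothetical.

```python
def wasserstein_loss(labels, scores):
    """Mean of label * critic score, with labels 1 (real) and -1 (fake)."""
    return sum(l * s for l, s in zip(labels, scores)) / len(scores)


def clip_weights(weights, c=0.01):
    """Clamp every critic weight to the interval [-c, c]."""
    return [max(-c, min(c, w)) for w in weights]


def training_schedule(generator_epochs, n_critic=5):
    """List the update order: n_critic critic updates, then one generator update."""
    schedule = []
    for epoch in range(generator_epochs):
        schedule.extend(("D", epoch) for _ in range(n_critic))
        schedule.append(("G", epoch))
    return schedule


# Toy critic scores for two real and two generated samples.
toy_loss = wasserstein_loss([1, 1, -1, -1], [2.0, 4.0, 1.0, 3.0])
clipped = clip_weights([0.5, -0.2, 0.005])
schedule = training_schedule(generator_epochs=2)
```

Because the critic output is an unbounded scalar rather than a probability, the loss is a difference of mean scores and needs a linear output activation, in line with the description above.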
6 Evaluation Methods
The training and validation errors of the models were recorded and are reported in Sect. 8; however, they do not represent a complete measure of the performance of a generative algorithm. In generative models, the goal is to be able to learn the generative distribution underlying the samples used during training, while at the
Fig. 4. WGAN Architecture for Image Generation. Source: [21]
same time not overfitting on these few examples. This means that it is possible for a model with average training and validation accuracy to produce much more realistic and creative musical samples than one with a very high training accuracy, as music is an art form and is very subjective. The majority of existing literature in neural music generation relies on subjective evaluation methods, with human listener surveys being the most common approach [11,25,29,31]. For this study, a survey was conducted where listener impression scores from 10 individual volunteers were collected. The scores give a subjective opinion of the listeners’ impression of the samples generated by both the encoder-decoder and the WGAN generators. The 10 human evaluators of the samples were chosen at random from a group of friends, and each was contacted through WhatsApp messaging and asked to participate in the study. This method of contact was especially suitable, as it allowed for quick and easy access to individuals who would likely agree to participate in the study. The fact that the social messaging application has functionality for sharing audio was also one influential factor in the choice of survey design. All the volunteers were aged between 20 and 40 with an uneven split of gender (6 male, 4 female). They were notified that they would receive eight distinct pairs of audio samples, not longer than 2 min in length, that are both generated by a learning algorithm. They were then asked to rate each of the samples by giving it a score in the range [0, 5] (where 0 is completely random noise and 5 is a good song) to express how much they enjoyed listening to the song and how rhythmic and melodic the music samples are. A numerical score was collected for each sample in the comparison pair. This enabled calculation of the percentage of times samples from one model are preferred over those from the competing model.
This, together with the median, represents a good measure of the centrality of the scores and ensures the analysis is not influenced by outliers. Each participant received a group of eight randomized pairings of encoder-decoder and WGAN samples to compare and score, resulting in a total of 80 ratings. Microsoft Excel was used to randomize the samples and to assign them to the volunteers. The volunteers were told that both samples were algorithmically generated to ensure they would not score the samples by comparing them to what they define as really good music generated by expert human artists. The volunteers were not informed as to which samples were generated by which model. Some of the volunteers, although not asked,
were able to provide a textual explanation for their preference in the samples, which provided for a better comparison of the two generative models. Once the scores were collected, the mean opinion score (MOS) [20] test was conducted to get to the conclusion that answers the research questions. For each WGAN sample $S_i^{w}$ and encoder-decoder sample $S_i^{ed}$, the MOS $q(S_i)$ over all 10 volunteer ratings $r_k(S_i)$ is defined as:
$$q(S_i) = \frac{1}{10} \sum_{k=1}^{10} r_k(S_i) \qquad (3)$$
where $r_k(S_i)$ is the rating given to sample $S_i$ by volunteer number $k$ of 10 volunteers. The mean generator quality $Q(S_g)$ of a group of samples from the same generator $g$ is defined as the average MOS given by:
$$Q(S_g) = \frac{1}{8} \sum_{i=1}^{8} q(S_i^{g}) \qquad (4)$$
where $S_i^{g}$ for $i = 1, 2, 3, \ldots, 8$ represents samples from generator $g$. Since there is subjectivity in the measure of quality, it is important to quantify how much variation there is in the recorded sample and model qualities using the standard deviation. This also expresses how much confidence is placed in the estimate of quality used, where a high standard deviation represents low confidence in the accuracy of the estimate, and a low deviation represents high confidence. The two standard deviation estimates for MOS and mean generator quality are expressed below respectively:
$$\sigma_{q(S_i)} = \sqrt{\frac{\sum_{k=1}^{10} \left(r_k(S_i) - q(S_i)\right)^2}{10 - 1}} \qquad (5)$$
$$\sigma_{Q(S_g)} = \sqrt{\frac{\sum_{i=1}^{8} \left(q(S_i^{g}) - Q(S_g)\right)^2}{10 - 1}} \qquad (6)$$
with $q(S_i)$ and $Q(S_g)$ given by Eqs. 3 and 4, respectively. Although one generator may have a higher MOS than the other, tests have to be performed to ensure the MOS estimate of quality is not negatively influenced by outliers, and that it indeed has a different and higher median opinion score. The Wilcoxon signed-rank test [48] was used to perform the test of equal medians in ranked pair data, discussed in the next section.
6.1 Wilcoxon Signed-Rank Test
The Wilcoxon signed-rank test is a statistical hypothesis test for ranked and paired data that assumes no predefined population distribution over which the
data is sampled from. The Wilcoxon signed-rank test is used for comparing related paired samples under the hypothesis that the median difference between the samples is zero. The test makes the following assumptions about the data:
– The data observations are paired samples from the same population.
– Each pair is chosen randomly and independently.
– The observations are measured on an ordinal, not necessarily interval, scale.
The assumptions above are met by the opinion score data collection process to a suitable extent in that:
1. The ranked samples come from the same population of ranking volunteers.
2. The pairs were chosen randomly, though independence in this case is subjective as it is important to ensure all samples from both models were evaluated by the same number of volunteers. This was achieved by random selection without replacement.
3. Although the scores indicate by how much one sample is better than the other, hence violating the ordinality assumption, a transformation is applied to the scores to ensure only the ordinal aspect of the scores is used. In this transformation, the score pairs are compared, and a new indicator feature is constructed that assumes the following values: positive (+) if the second sample has a higher score, zero if the scores are tied, and negative (−) otherwise.
With this new sign-transformed data, all the test assumptions are satisfied. The null and alternative hypotheses are given by:
$H_0$: The median score difference between the paired samples is zero.
$H_1$: The median score difference is not zero.
Let $ed$ and $wg$ represent the encoder-decoder and WGAN models, respectively, and sgn represent the sign function. With data observations $\operatorname{sgn}(r_{ed,i} - r_{wg,i})$, all tied pairs are excluded from the initial sample of size $N$ to give a reduced test sample of size $N_r$.
The original sample pairs are ordered in ascending order of the absolute differences $|r_{ed,i} - r_{wg,i}|$ of the original captured scores, with the smallest difference ranking first as 1 and all ties receiving the average rank of the positions they span. With these new pair ranks $R_i$, the Wilcoxon signed-rank test statistic is calculated as follows:
$$W = \sum_{i=1}^{N_r} \operatorname{sgn}(r_{ed,i} - r_{wg,i}) \cdot R_i \qquad (7)$$
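The ranking procedure behind Eq. 7 can be written out directly. The sketch below is an illustration on made-up scores, not the survey data: tied pairs are dropped, the remaining differences are ranked by absolute value with tied magnitudes sharing the average of the positions they span, and the signed ranks are summed. The function name and the example scores are hypothetical.

```python
def signed_rank_statistic(scores_ed, scores_wg):
    """Return (W, N_r) for paired scores, following Eq. 7."""
    diffs = [a - b for a, b in zip(scores_ed, scores_wg)]
    nonzero = [d for d in diffs if d != 0]       # tied pairs are excluded
    ordered = sorted(nonzero, key=abs)           # rank by absolute difference

    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of equal absolute differences.
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = ((i + 1) + (j + 1)) / 2   # mean of 1-based positions
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    w = sum((1 if d > 0 else -1) * r for d, r in zip(ordered, ranks))
    return w, len(ordered)


# Made-up paired scores for five comparison pairs.
toy_ed = [3, 4, 2, 5, 1]
toy_wg = [4, 4, 4, 2, 3]
toy_w, toy_n_r = signed_rank_statistic(toy_ed, toy_wg)
```

In the toy data the differences are -1, 0, -2, 3 and -2, so one pair is dropped and the two differences of magnitude 2 share the rank 2.5.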
Note that using the reduced sample size $N_r$ is equivalent to using the original sample size $N$, since all tied pairs result in a zero sign, hence not contributing to $W$. For $N_r \geq 10$, $W$ is asymptotically normally distributed, thus the z-score can be calculated as follows:
$$z = \frac{W - 0.5}{\sigma_W} \qquad (8)$$
with:
$$\sigma_W = \sqrt{\frac{N_r (N_r + 1)(2N_r + 1)}{6}} \qquad (9)$$
The null hypothesis $H_0$: equal medians, is rejected in favour of $H_1$: unequal medians, if $z > z_{critical}$. Rejecting the null hypothesis would mean the two models have generated sample scores with statistically different medians, and so there is more confidence that the contribution of outliers in the mean model and opinion scores is trivial. Based on the results of the MOS, mean generator quality, and the Wilcoxon signed-rank test, the generator with the higher mean generator quality is considered to produce music that is more aesthetically pleasing to listen to, given the null hypothesis of equal medians is rejected. The volatility estimates given by Eq. 6 are used as a measure of confidence in making the conclusion based on the MOS estimates above. The textual comments collected on some samples are analyzed to get more insight into how people perceived the music. However, because these comments were not collected systematically, no computational sentiment analysis is performed on the data, but a human opinion sentiment analysis of the text is provided.
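The normal approximation of Eqs. 8 and 9 reduces to a few lines of arithmetic. The sketch below is illustrative: the values of W and N_r are made up, the critical value 1.96 (two-sided test at the 5% level) is an assumption since the significance level is not stated in this section, and comparing the absolute value of z with the critical value is likewise a choice of this sketch for a two-sided alternative.

```python
import math


def wilcoxon_z(w, n_r):
    """Return (z, sigma_w) following Eqs. 8 and 9."""
    sigma_w = math.sqrt(n_r * (n_r + 1) * (2 * n_r + 1) / 6)
    z = (w - 0.5) / sigma_w
    return z, sigma_w


def reject_equal_medians(w, n_r, z_critical=1.96):
    """Two-sided decision; z_critical = 1.96 is an assumed 5% level."""
    z, _ = wilcoxon_z(w, n_r)
    return abs(z) > z_critical


# Made-up statistics for a reduced sample of N_r = 10 pairs.
z_small, sigma = wilcoxon_z(30, 10)
decision_small = reject_equal_medians(30, 10)
decision_large = reject_equal_medians(45, 10)
```

For N_r = 10 the standard deviation is the square root of 385, so a statistic of 30 stays below the assumed threshold while a statistic of 45 exceeds it.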
7 Experiments
The implementation configurations of the MIDI state-matrix representation are briefly discussed, followed by a discussion on the training parameters for both generative architectures. Finally, we focus on the generation of music samples used in evaluating the encoder-decoder and WGAN models. All computation relating to the data and models was performed on a personal computer with a 7th-generation Intel Core i7 processor, an Nvidia GeForce GPU with 2 GB of memory, and 16 GB of RAM.
7.1 MIDI Representation
Of the 128 possible pitch values, existing implementations use only 88 pitch values between 21 and 109 since all other pitches outside this range are inaudible to the human ear. This in effect reduces the state-matrix dimensionality and overall model complexity. In this work, all 128 pitch values are used to avoid the added pre- and post-processing involved in using a reduced note representation. For the models trained below, changing from 88 pitch values to 128 pitch values increased the number of model parameters by only 16% and 22% for the WGAN and the encoder-decoder models, respectively.
7.2 Model Hyper-Parameters
Below is the final list of model training hyper-parameters used in the study. A brief explanation of how they affected learning then follows per model.
Table 4. Final Training Hyper-Parameters for the Encoder-Decoder LSTM
Hyper-parameter          Final Value
Learning Rate            0.001
Optimizer                RMSprop
Dropout Probability      0.3
Training Epochs          300
Batch Size               32
Hidden Layer Size        256
Gradient Clipping Max    2.0
Train test Split         80:20
Encoder-Decoder. The training hyper-parameters for the encoder-decoder LSTM neural network are shown in Table 4. The learning rate was set low to ensure a smoother descent of the loss function: higher learning rates converged more quickly, but the quality of the music generated was poor, hinting at convergence to a local minimum. Dropout [40] as a regularization technique added more training stability and reduced over-fitting. Other parameters such as gradient clipping and batch size were determined using five-fold cross-validation with all other parameters fixed. Table 5 shows mean BAcc scores over the five-fold cross-validation (CV) for different learning rates and batch sizes. The model’s training mini-batch size and learning rate were taken from the 5-fold CV, and they are the values that achieved the highest BAcc score.
Table 5. Mean 5-Fold CV Balanced Accuracy Scores for Different Batch Sizes and Learning Rates
Batch Size Learning Rate 5 0.0005 0.001 0.005
8
10
71.2% 56.6% 51.1% 74.1% 69.6% 60.3% 73.6% 73.2% 72.8%
WGAN. Table 6 shows the final training hyper-parameters for both the generator and discriminator LSTM networks of the WGAN. Training the WGAN has more moving parts than the encoder-decoder model. The initial training parameters were adopted from [36], and then a grid search was used to fine-tune the hyper-parameters over 200 epochs. Figure 5 shows loss curves for both G(z) and D(z) for different points in the hyper-parameter grid search space. The combination of parameters that led to stable training was chosen as the final training parameters shown in Table 6.
Table 6. Training Hyper-Parameters for the WGAN Model
Hyper-parameter             Final Value
G(z) Learning Rate          0.00005
G(z) Optimizer              RMSprop
G(z) Epochs                 4000
G(z) Batch size             32
Z Latent Distribution       Standard Normal
D(x) Learning Rate          0.00004
D(x) : G(z) Epoch Ratio     5:1
Gradient Clipping           0.01
Fig. 5. G(z) and D(x) Training Losses for 12 Grid Search Points, where the Number of Critic Training Epochs (n_critic) and Batch Size are Varied, for n_critic = 2, 5, 10 and 20
At the start of training the generator takes Gaussian noise together with the discriminator’s critique of the generated samples. During this initial training, the discriminator’s layers are fixed and not trained to ensure the generator’s learning has started by the time the discriminator is trained to optimality. For each epoch of the generator, the discriminator is trained for five epochs, and it is then used to critique the generator at its current competence level. Since RMSprop is used, the learning rates for both the generator and discriminator decay with
the number of epochs. This is to force the generator to value the learned signal more than temporary noise caused by sudden inconsistent gradient direction changes. Training of the WGAN model was faster than that of the encoder-decoder model, and the WGAN had fewer combined generator and discriminator trainable weights. This is because there is no transfer of LSTM cell and hidden states between the generator and discriminator, unlike between the encoder and decoder networks. The WGAN networks are connected only by a scalar loss for information transfer. The generator and discriminator are not trained to optimize accuracy, since the aim is not to regenerate any song from the training set, and so the losses are rather similar to the MSE loss. The early stopping criterion for training was met when the EM distance between the current generator’s learned distribution and that of the real data stopped reducing for any 10 consecutive epochs.
7.3 Music Generation
For the encoder-decoder model, music is generated one sample at a time at the end of training by priming the decoder network with a short series of notes and predicting the note progression probabilities for the entire song. The generated samples are on average 1.3 min long. It was observed that generating melodies that are much longer than 2 minutes resulted in repetitions of the same sequence of notes and sometimes silent spots or complete silence towards the end of the song. For WGAN, samples are generated during training. At the end of each generator training epoch, a sequence of ten priming latent vectors is passed to the generator, and it then generates ten audio samples based on its proficiency at that point in time. This process is repeated until training is complete, resulting in a number of music samples. If training progresses as expected, the samples reflect an increase in compositional skill from epoch one to the last epoch. Only randomly selected samples from the last generator training were collected and written to a MIDI file for evaluation.
8 Results
A model’s predictive accuracy is a quantitative measure of the number of prediction instances the model estimated correctly divided by the total number of prediction instances as a percentage. In the case of musical notes, accuracy is measured as the mean number of notes correctly classified as on or off per time step. The accuracy is measured during both training and validation of the model to ensure it is not over-fitting and is generalizing well. This is measured on the validation set. Table 7 shows the loss and accuracy scores for the encoder-decoder model together with the WGAN’s EM distance loss, which similarly to MSE loss, has only a lower bound of zero and no upper bound.
Table 7. Top-1 Training and Test Accuracies for the Seq2Seq Model, as well as Loss Figures for the WGAN
Metric                 WGAN LSTM    Encoder-Decoder LSTM
Training Accuracy      -            96.45%
Test Accuracy          -            96.3%
Train Entropy Loss     -            0.107
Test Entropy Loss      -            0.081
Generator Loss         1.006        -
Discriminator Loss     0.9995       -
Although the prediction accuracy reported in Table 7 seems favourably high, it is a poor measure of the generative ability of the encoder-decoder model on the training data. This is because there are more negative than positive prediction instances in the dataset, as described in Sect. 3 and shown in Table 3. To address this problem, balanced accuracy (BAcc), which weights the score across the prediction classes, was used. Since the trained models depend on the random initialization of weights, a 5-fold CV was performed to obtain the average performance of the models. Table 8 shows the CV BAcc and loss, which are a more realistic measure of the model’s prediction ability.
Table 8. Training and Five-Fold CV Balanced Top-1 Accuracy Scores for the Encoder-Decoder LSTM Neural Network
CV Iteration    Training BAcc    Validation BAcc    Training Loss    Validation Loss
1               74.16%           73.84%             0.64%            0.78%
2               74.27%           74.66%             0.65%            0.30%
3               74.75%           74.78%             0.29%            0.16%
4               74.34%           74.24%             0.61%            0.67%
5               74.61%           74.09%             0.31%            0.46%
μ               74.42%           74.32%             0.5%             0.48%
σ               0.24%            0.39%              0.18%            0.25%
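The gap between the accuracy in Table 7 and the BAcc in Table 8 can be illustrated with a small sketch. It assumes BAcc is the mean of the per-class recalls, which is the usual definition; the toy labels below are hypothetical and are not taken from the paper's dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of note states (1 = on, 0 = off) predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so that each class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(y_pred[i] == cls for i in idx)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical piano-roll slice: 18 "off" notes and only 2 "on" notes.
y_true = [0] * 18 + [1] * 2
# A degenerate model that always predicts "off".
y_pred = [0] * 20

print(accuracy(y_true, y_pred))           # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

A model that never plays a note still reaches 90% plain accuracy on this slice, while its balanced accuracy drops to chance level, which is why the balanced score is the more informative one for imbalanced note data.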
Results from Tables 7, 8, and 9 are not sufficient to arrive at a conclusion in comparing WGAN to the encoder-decoder model for the purpose of music generation, since they follow very different training methods with completely different objective functions. Also, music quality is more subjective as an art form than it is objective, and so there is a lack of good objective measures to base a conclusion on [11,29,35]. Accuracy reported for the encoder-decoder model represents the mean fraction of correct pitch predictions made by the decoder in the training song per prediction class and is depicted in the training curves shown in Fig. 7. The EM distance for WGAN is the final distance of the generator’s learned sampling distribution from that of the training data distribution.
Table 9. Training and Five-Fold CV EM Loss for the WGAN Model
CV Iteration    G(z) Loss    D(z) Loss
1               1.000587     0.99953
2               1.000384     0.99962
3               1.000457     0.99959
4               1.000535     0.99956
5               1.000561     0.99952
μ               1.0005       0.9995
σ               0.00008      0.00017
Fig. 6. Five-Fold CV BAcc Curves for the Encoder-Decoder LSTM
The close training and CV mean BAcc curves in Fig. 6 also show that the encoder-decoder LSTM is not over-fitting. An over-fitting model would regenerate the training music data samples when primed with a close enough sequence of notes and would lack diversity in the generated samples. Although the encoder-decoder does not seem to be over-fitting based on the CV accuracy, volunteers in the listening survey do hint at a lack of diversity and creativity in the encoder-decoder’s generated samples, as discussed in Sect. 8.3. Figure 8 contains training EM distance loss curves for the WGAN networks. Note that the generator is much more unstable than the discriminator. The generator produces an estimate of the true distribution, and the discriminator’s loss estimates how far off the generator is through the EM distance. As stated earlier, due to the subjective nature of music, a subjective evaluation method was used for the generated samples, and the results are presented in Sect. 8.1.
Fig. 7. Mean Training and CV BAcc Curves
Fig. 8. Loss Curves for WGAN with the Best Grid Search Parameters: n_critic = 5 and Batch Size = 32
8.1 Mean Opinion Scores
Tables 10 and 11 present the results of the listener impression survey for the two generative models.
Table 10. Listener Impression Scores for WGAN Generated Samples S1 to S8. q(si) Represents the MOS for each Sample
Volunteer    S1      S2      S3      S4      S5      S6      S7      S8
1            2.5     4       2       3       2.5     4       4       3
2            4       3.6     4       3       3       2       3.5     2.5
3            3       3       3       4       3       3.5     4       3
4            4       4       2       3       3.5     4       3       2
5            3.5     4       4       4       3       2.5     3       4
6            3       1       2       4       2       3       3.5     2.5
7            2.8     4       5       3.5     1       5       3       3
8            2.5     4       3.5     3       3       4       4       3
9            1.5     4       2.5     3       4       4       3       4.5
10           2       4       2.5     4       3       3       3.5     2.5
q(si)        2.88    3.56    3.05    3.45    2.88    3.5     3.45    3
σq(si)       0.77    0.91    0.99    0.47    0.78    0.84    0.42    0.71
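As a worked example of how the q(si) and σq(si) rows relate to the individual ratings, the sketch below recomputes both for samples S1 and S2 of Table 10. It assumes the reported spread is the population standard deviation (dividing by the number of volunteers); that assumption is not stated in the text, but it reproduces the tabulated values for these two samples.

```python
from statistics import mean, pstdev

# Ratings of volunteers 1 to 10 for WGAN samples S1 and S2 (Table 10).
ratings = {
    "S1": [2.5, 4, 3, 4, 3.5, 3, 2.8, 2.5, 1.5, 2],
    "S2": [4, 3.6, 3, 4, 4, 1, 4, 4, 4, 4],
}


def mos_summary(scores):
    """Return (mean opinion score, population standard deviation), rounded."""
    return round(mean(scores), 2), round(pstdev(scores), 2)


for name, scores in ratings.items():
    print(name, mos_summary(scores))
# S1 (2.88, 0.77)
# S2 (3.56, 0.91)
```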
The results in Tables 10 and 11 show that volunteers generally scored WGAN samples higher than the encoder-decoder samples. This is reflected in the observation that only one of four WGAN samples received a MOS below three, yet all encoder-decoder samples got a MOS of three or lower. All samples from both models had considerably low score standard deviations (all below one). Also, for both models, 95% of the sample ratings lie within two standard deviations of the mean. This is a good indication that, although the rating system is subjective and high variance is expected, people’s opinions about the samples do not differ so much that it would seem they each received a song of a different genre or held entirely different interpretations of what good music sounds like. The overall results for both models across all samples are presented in Table 12 and are used to deduce which of the two models is considered the more skilled composer of music based on its generated samples. The results in Table 12 show that the WGAN samples have a higher mean opinion score than those of the encoder-decoder model based on volunteer listener ratings. However, since the arithmetic mean can be greatly influenced by outliers, it is also important to consider the median opinion scores, which are not influenced by outliers and are a measure of the centrality of the data observations. The Wilcoxon signed-rank test is used to assess whether the opinion scores collected for the WGAN and encoder-decoder models come from populations with equal medians. The Wilcoxon test results are discussed in the following section.
Table 11. Listener Impression Scores for the Encoder-Decoder LSTM Generated Samples S9 to S16
Volunteer    S9      S10     S11     S12     S13     S14     S15     S16
1            3       3.3     1       2       2       3       2.5     3.5
2            2       3       3.5     2       3       3       4       2
3            2       4       2.5     2       2       3.5     3       2
4            4.5     3       3       5       2       3       2.5     3
5            2       2       3       2       1       3       2       2
6            4       2       3       3       3.5     2.5     4       3.5
7            2.5     3       4       2       3       4       2       4
8            3.5     3       3       3.5     2       3.5     3       2
9            3       2       3.5     2       3       3       3       3
10           3.5     3       3.5     3       3       4       2       4
q(si)        3       2.83    3       2.65    2.45    3.25    2.8     2.9
σq(si)       0.84    0.62    0.77    0.95    0.72    0.46    0.71    0.8
Table 12. Mean and Median Generator Quality Scores. Although the Sample Median Scores from the Two Models are Equal, this does not Imply they are Drawn from Populations with Equal Median Scores. It is the Wilcoxon Signed-Rank Test that Gives a Conclusive Answer on Equality of the Population Medians
Statistic          WGAN LSTM    Encoder-Decoder LSTM
Qqs                3.21         2.86
σQqs               0.811        0.736
Qqs + 2 × σQqs     4.832        4.332
Qqs − 2 × σQqs     1.588        1.388
Median Score       3            3
8.2 Wilcoxon Signed-Rank Test
The Wilcoxon signed-rank test for equal population medians, as explained in Sect. 6.1, was performed. Using Eqs. 7, 8 and 9, the test statistic W, the signed-rank standard deviation σW, and the z-score z were calculated and are presented in Table 13. The results are used to decide whether or not to reject the null hypothesis of equal population medians. It is standard to perform hypothesis tests at a certain level of confidence, usually 95% (written α = 0.95 here). The critical value z0.95 is the inverse standard normal value below which 95% of all data falls, since the standardized signed-rank test statistic z follows a standard normal distribution. Results in Table 13 show that |z| > z0.95, and according to the Wilcoxon signed-rank test, the null hypothesis of equal medians is rejected with 95% confidence in favour of the alternative hypothesis. The alternative hypothesis states that the two generative models produced samples with significantly different opinion score medians, and thus their population distributions are not centered around the same score. This is also supported by the histogram of opinion scores for both models in Fig. 9.
Table 13. Wilcoxon Signed-Rank Test Results
W        σW        |Z|       Zα=0.95
-1132    393.90    2.8751    1.6449
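The quantities in Table 13 can be sketched as follows. This is a minimal illustration that assumes Eqs. 7 to 9 take the usual normal-approximation form of the signed-rank test (W as the sum of signed ranks, σW = sqrt(n_r(n_r + 1)(2n_r + 1)/6) without tie correction); the function names and the toy ratings are hypothetical. The value n_r = 77 is likewise an assumption, chosen because it reproduces the reported σW.

```python
from math import sqrt
from statistics import NormalDist


def signed_rank_statistic(x, y):
    """Return (W, n_r): the sum of signed ranks and the number of non-zero pairs."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w = sum(r if d > 0 else -r for r, d in zip(ranks, diffs))
    return w, len(diffs)


def sigma_w(n_r):
    """Standard deviation of W under the null hypothesis (no tie correction)."""
    return sqrt(n_r * (n_r + 1) * (2 * n_r + 1) / 6)


# Toy paired ratings (hypothetical, not the survey data).
w, n_r = signed_rank_statistic([3, 4, 2, 5], [1, 4, 5, 1])
print(w, n_r)

# Critical value and the figures of Table 13, assuming n_r = 77 non-zero pairs.
z_crit = NormalDist().inv_cdf(0.95)
z = abs(-1132) / sigma_w(77)
print(round(sigma_w(77), 2), round(z_crit, 4), z > z_crit)
```

Under these assumptions the sketch gives σW ≈ 393.90 and a critical value of 1.6449, matching Table 13, and a z-score of about 2.87, close to the reported 2.8751; either way the null hypothesis is rejected.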
Figure 9 shows that WGAN sample scores are skewed more to the right than those of the encoder-decoder network, which form a more symmetric distribution. The evidently skewed WGAN sample score distribution and a higher median opinion score suggest that the higher WGAN MOS is not falsely influenced by outliers, and this too supports the finding that WGAN samples were more pleasing to listen to than those generated by the encoder-decoder model.
8.3 Listener Comments
Although volunteers were not asked to provide any textual comments, some of them did provide written feedback together with the rating scores. The majority of the comments provided justification for the volunteer’s rating of the samples and their preference for one sample over the other. The identity of the generating model was not known to the volunteers during the survey; the samples were simply named samples 1 to 8 for the WGAN and 9 to 16 for the encoder-decoder model. In total, 26 of the ratings were accompanied by a comment. Some of the most insightful comments are provided in Appendix A, with the generating model’s name revealed instead of the sample number. The comments point to a general preference for WGAN music samples over those of the encoder-decoder model. WGAN sample comments can be summarized using the following keywords: melodic, rhythmic, nice, and creative. The encoder-decoder samples can be described as slow, vague, and rhythmic.
Fig. 9. Opinion Score Distribution for WGAN and Encoder-Decoder LSTM Generated Music Samples
9 Conclusion
This work described and compared two training configurations of generative models for music generation using MIDI data. After a detailed discussion of neural network architectures normally used for time series data and their topological components, state-of-the-art music composition networks were discussed. A short background on the MIDI file format and its representation in this work was provided. The work then detailed the WGAN and encoder-decoder models with LSTM components that were implemented and compared for the composition task, followed by the experimental setup and results. The main contributions of the study were to show, first, that an adversarial configuration is a viable approach to training neural networks for music composition and, second, that it produces music samples of superior quality compared to non-adversarial training. To evaluate the generative skill of the two compared training configurations, a survey was conducted in which 10 volunteers were asked to listen to four randomly selected samples from each generative model and rate each sample on a scale from 1 to 5 based on how pleasing it was to listen to. 70% of the rating instances resulted in a preference for WGAN over the encoder-decoder samples, while 30% pointed to a preference for encoder-decoder generated samples over WGAN samples. The ratings were then used to conduct a signed-rank test to determine whether the two models’ sample ratings were from populations with different medians, and this test concluded they were. Assuming the sample ratings are representative enough of their populations, this implies that a survey with more volunteers would also find the adversarial neural network to be a better composer than the encoder-decoder neural network.
Based on the results of the survey and comments from volunteers on the generated samples, it can be concluded that adversarial training is a viable method for training generative models for music generation, and that it does produce music samples that are more diverse and more pleasing to listen to. Further experiments can be conducted on the WGAN implementation above by including more varied training music genres and allowing the generator to generate longer sequences. Since the generator is an LSTM network, methods normally used to improve performance and memory retention in sequence-to-sequence models on natural language processing (NLP) tasks, such as adding attention or pointer layers, can be explored. It would also be interesting to perform a thorough investigation of pure transformer networks in music generation, as well as of how training methods such as diffusion can be applied to music. Another avenue of further research to be explored is the development and consideration of formal and objective evaluation methods for artistic tasks such as music generation and artistic painting. Being able to accurately and appropriately evaluate models for such subjective tasks will greatly improve the quality of generated samples.
A Appendix: Listener Comments
– “I couldn’t get a feel of where the encoder-decoder song is going, the WGAN sample has a nice classical feel to it.”
– “They both sound musical but the sound quality is bad.”
– “I like the pace of the WGAN sample.”
– “The encoder-decoder sample had too many silent spaces, the quiet spots emphasize the loud spots, which sounded like rambling.”
– “The WGAN sample is a little chaotic but generally creates a good atmosphere, the encoder-decoder song has a good rhythm but no actual melody.”
– “I’d say the encoder-decoder song is better, has more rhythm, the WGAN sample sounds too fast and just noise to me.”
– “The WGAN Sample sounds like me when I am under dissertation stress, terrible.”
– “The WGAN sample sounds more creative, the encoder-decoder sample wins on rhythm although it takes so long to get there.”
– “What’s important to me is they all sound musical.”
– “I don’t listen to this genre so my view may be way off.”
– “Most of the songs sound similar to me.”
– “Wow, can’t believe the WGAN sample was generated by a machine, though it’s funny at the end LOL.”
– “Can’t you get them to generate longer songs? perhaps with words? I like the encoder-decoder sample.”
References
1. Yamshchikov, I.P., Tikhonov, A.: Music generation with variational recurrent autoencoder supported by history. SN Appl. Sci. 2(12), 1–7 (2020). https://doi.org/10.1007/s42452-020-03715-w
2. Alfonseca, M., Cebrian, M., Puente, O.: A simple genetic algorithm for music generation by means of algorithmic information theory. In: Proceedings of the Institute of Electrical and Electronics Engineers Congress, Evolutionary Computation, pp. 25–28 (2007)
3. António, R., et al.: The freesound loop dataset and annotation tool. arXiv (2020)
4. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein GAN. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 214–223 (2017)
5. Bretan, M., Weinberg, G., Heck, L.: A unit selection methodology for music generation using deep neural networks. In: Proceedings of the International Conference on Computational Creativity (2017)
6. Chen, B., Hsu, E., Liao, W., Ramírez, M., Mitsufuji, Y., Yang, Y.: Automatic DJ transitions with differentiable audio effects and generative adversarial networks. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Virtual and Singapore, 23–27 May 2022, pp. 466–470. IEEE (2022)
7. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1724–1734 (2014)
8. Chongsheng, Z., et al.: An empirical study on the joint impact of feature selection and data resampling on imbalance classification. Appl. Intell. 53, 5449–5461 (2022)
9. Chu, H., Urtasun, R., Fidler, S.: Song from pi: a musically plausible network for pop music generation. In: Proceedings of the 5th International Conference on Learning Representations (2017)
10. Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. In: Proceedings of the 27th International Conference on Neural Information Processing Systems, Deep Learning and Representation Learning (2014)
11. Colombo, F., Gerstner, W.: Bachprop: learning to compose music in multiple styles. ArXiv, volume abs/1802.05162 (2018)
12. Cui, Z., Ke, R., Wang, Y.: Deep stacked bidirectional and unidirectional LSTM recurrent neural network for network-wide traffic speed prediction. In: 6th International Workshop on Urban Computing (2017)
13. Denil, M., Bazzani, L., Larochelle, H., de Freitas, N.: Learning where to attend with deep architectures for image tracking. IEEE Neural Comput. 24, 2151–2184 (2012)
14. Dong, H., Hsiao, W., Yang, L., Yang, Y.: MuseGAN: multi-track sequential generative adversarial networks for symbolic music generation and accompaniment. In: Proceedings of the 30th International Conference on Innovative Applications of Artificial Intelligence, pp. 34–41 (2018)
15. Edirisooriya, S., Dong, H.W., McAuley, J., Berg-Kirkpatrick, T.: An empirical evaluation of end-to-end polyphonic optical music recognition. In: International Society for Music Information Retrieval (2021)
16. Elgammal, A., Liu, B., Elhoseiny, M., Mazzone, M.: CAN: creative adversarial networks generating “art” by learning styles and deviating from style norms. In: Proceedings of the International Conference on Computational Creativity (2017)
17. Goodfellow, I., et al.: Generative adversarial nets. In: Proceedings of the 27th International Conference on Neural Information Processing Systems, pp. 2672–2680 (2014)
18. Henning, B., Soon, O., Enno, S., Joachim, B.M.: The balanced accuracy and its posterior distribution. In: 20th International Conference on Pattern Recognition, pp. 3121–3124 (2010)
19. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9, 1735–1780 (1997)
20. Huang, A., Wu, R.: Deep learning for music. ArXiv, volume abs/1606.04930 (2016)
21. Hui, J.: GAN-wasserstein GAN and WGAN-GP. Medium (2018)
22. Hung, T., Chen, B., Yeh, Y., Yang, Y.: A benchmarking initiative for audio-domain music generation using the freesound loop dataset. In: Proceedings of the 22nd International Society for Music Information Retrieval Conference, ISMIR 2021, Online, November 7–12, 2021, pp. 310–317 (2021)
23. Juefei-Xu, F., Boddeti, V.N., Savvides, M.: Gang of GANs: generative adversarial networks with maximum margin ranking. ArXiv, volume abs/1704.04865 (2017)
24. Kim, J.: Using keras and theano for deep learning driven jazz generation (2017). https://deepjazz.io
25. Lackner, K.: Composing a melody with long-short term memory (LSTM) recurrent neural networks. Bachelor’s thesis, Technische Universität München, Munich, Germany (2016)
26. Lerdahl, F., Jackendoff, R.: On the Algorithmic Description of the Process of Composing Music. MIT Press (1983)
27. Liu, I., Randall, R.: Predicting missing music components with bidirectional long short-term memory neural networks. In: Proceedings of the 17th International Society on Music Information Retrieval Conference (2016)
28. Lopyrev, K.: Generating news headlines with recurrent neural networks. ArXiv, volume abs/1512.01712 (2015)
29. Mogren, O.: C-RNN-GAN: continuous recurrent neural networks with adversarial training. In: Proceedings of the 30th International Conference on Neural Information Processing Systems, Constructive Machine Learning Workshop (2016)
30. Mukkamala, M.C., Hein, M.: Variants of RMSProp and adagrad with logarithmic regret bounds. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 2545–2553 (2017)
31. Oord, A., et al.: WaveNet: a generative model for raw audio. In: Proceedings of the 9th International Speech Communication Association, Speech Synthesis Workshop, pp. 125–125 (2016)
32. Oord, A., Kalchbrenner, N., Kavukcuoglu, K.: Pixel recurrent neural networks. In: Proceedings of the 33rd International Conference on Machine Learning, vol. 48, pp. 1747–1756 (2016)
33. Raffel, C.: Learning-based methods for comparing sequences, with applications to audio-to-midi alignment and matching (2018). https://colinraffel.com/projects/lmd/
34. Roberts, A., Engel, J., Eck, D.: Hierarchical variational autoencoders for music. NIPS Workshop on Machine Learning for Creativity and Design, vol. 3 (2017)
35. Roberts, A., Engel, J., Raffel, C., Hawthorne, C., Eck, D.: A hierarchical latent vector model for learning long-term structure in music. In: Proceedings of the 35th International Conference on Machine Learning, Machine Learning Research, vol. 80, pp. 4364–4373 (2018)
36. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. In: Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 2234–2242 (2016)
37. Schuster, M., Paliwal, K.: Bidirectional recurrent neural networks. IEEE Signal Process. 45, 2673–2681 (1997)
38. Shiebler, D.: Music RNN RBM. github.com/dshieble/Music_RNN_RBM (2017)
39. Shuqi, D., Zeyu, J., Celso, G., Roger, D.: Controllable deep melody generation via hierarchical music structure representation. arXiv preprint arXiv:2109.00663 (2021)
40. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
41. Sutskever, I., Vinyals, O., Le, Q.: Sequence to sequence learning with neural networks. Adv. Neural Inf. Process. Syst. 4 (2014)
42. Vaswani, A., et al.: Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6000–6010 (2017)
43. García, V., Mollineda, R.A., Sánchez, J.S.: Index of balanced accuracy: a performance measure for skewed class distributions. In: Araujo, H., Mendonça, A.M., Pinho, A.J., Torres, M.I. (eds.) IbPRIA 2009. LNCS, vol. 5524, pp. 441–448. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-02172-5_57
44. Vincent, P., Lewandowski, N., Bengio, Y.: Modeling temporal dependencies in high-dimensional sequences: application to polyphonic music generation and transcription. In: Proceedings of the 27th International Conference on Machine Learning (2012)
45. Waite, E.: Generating long-term structure in songs and stories. Lookback-RNNattention-RNN (2016)
46. Weel, J.: RoboMozart: generating music using LSTM networks trained per-tick on a midi collection with short music segments as input (2017)
47. Werbos, P.: Backpropagation through time: what it does and how to do it. Proc. IEEE 78, 1550–1560 (1990)
48. Wilcoxon, F.: Individual comparisons by ranking methods. Biometrics Bull. 1, 80–83 (1945)
49. Xenakis, I.: Formalized Music: Thought and Mathematics in Music. Pendragon Press (1992)
50. Yang, L., Chuo, S., Yang, Y.: MidiNet: a convolutional generative adversarial network for symbolic-domain music generation. In: Proceedings of the 18th International Society on Music Information Retrieval Conference, pp. 1–12 (2017)
51. Yu, L., Zhang, W., Wang, J., Yu, Y.: SeqGAN: sequence generative adversarial nets with policy gradient. In: Proceedings of the 31st AAAI Conference on Artificial Intelligence, vol. 31 (2017)
52. Zaripov, R.: On the algorithmic description of the process of composing music. USSR Acad. Sci. 132, 1283–1286 (1960)
Deep Reinforcement Learning for Heat Pump Control
Tobias Rohrer1(B), Lilli Frison2, Lukas Kaupenjohann2, Katrin Scharf2, and Elke Hergenröther1(B)
1 University of Applied Sciences Darmstadt, Darmstadt, Germany
2 Fraunhofer Institute for Solar Energy Systems, Freiburg im Breisgau, Germany
Abstract. Heating in private households is a major contributor to the emissions generated today. Heat pumps are a promising alternative for heat generation and are a key technology for achieving the goals of the German energy transition and for becoming less dependent on fossil fuels. Today, the majority of heat pumps in the field are controlled by a simple heating curve, which is a naive mapping of the current outdoor temperature to a control action. A more advanced control approach is model predictive control (MPC), which has been applied to heat pump control in multiple research works. However, MPC is heavily dependent on the building model, which has several disadvantages. Motivated by this and by recent breakthroughs in the field, this work applies deep reinforcement learning (DRL) to heat pump control in a simulated environment. Through a comparison to MPC, it could be shown that it is possible to apply DRL in a model-free manner to achieve MPC-like performance. This work extends other works which have already applied DRL to building heating operation by performing an in-depth analysis of the learned control strategies and by giving a detailed comparison of the two state-of-the-art control methods.
Keywords: Deep Reinforcement Learning · Heat Pump · Optimal Control · Model Predictive Control · Demand Response
1 Introduction
Heating in private households accounted for 26% of total energy consumed in Germany in 2020, making it a major contributor to the emissions generated today [2]. Heat pumps are a sustainable alternative to traditional heating systems which rely on fossil fuels. Heat pumps exploit heat from natural energy sources such as ambient air or groundwater and bring it to a higher temperature level that can be used to supply heat to buildings. Therefore, heat pumps can be used for emission-free room heating if they are driven by electricity generated by renewable sources such as solar or wind. This makes heat pumps one of the key technologies for achieving the goals of the ongoing energy transition [24]. While the penetration of heat pumps in Germany is steadily increasing [3], most heat pumps in the field are controlled by a simple heating curve [20], which is a static mapping from the current outdoor temperature to a control action. While this approach is easy to implement and maintain, it has potential for improvement, as factors like a weather forecast or a time-varying electricity price do not have any influence on the heating strategy [20]. As a result, model predictive control (MPC) has been applied to heat pump control in the past years. The basic idea of MPC is to make use of a simplified model to predict the effect of control actions. As promising as this sounds, it also introduces a strong dependency on the model which is used. Therefore, the performance of MPC for heat pump control highly depends on the accuracy of the model of the building to be heated [4,20,23]. Additionally, the computational effort required by MPC during execution is relatively high compared to other control methods, which results in additional hardware requirements on the controller level [5,20,23]. This means the model must be relatively simple in order to achieve reasonable run times, but at the same time it must be accurate enough to make it useful for MPC [17,23,26]. As a result, two models must often be created in order to apply MPC to heat pump control: one simplified model which can be used by MPC to plan, and one more accurate model used to close the control loop during validation [1,23]. The process of model creation is complex and considered the most time-consuming task when applying MPC [23]. Overall, the work on MPC for heat pump control has increased in recent years. However, so far it has not gained acceptance in the field of heat pump control in practice [5,23]. Motivated by the shortcomings of MPC and its complex process of model creation, this work applies deep reinforcement learning to heat pump control in a model-free manner. This means that optimal control policies are approximated from experiences collected by interaction with a simulated building, without providing information about the simulation or its underlying model. We use MPC, which utilizes the simulation’s model, as a baseline for optimality. This work contributes to the field of research by: (1) formulating heat pump control as a continuous control problem, (2) showing through our experiments in a simulated environment that optimal heat pump control strategies can be approximated through a model-free approach by applying deep reinforcement learning, (3) showing that these strategies exploit the thermal storage of the building to shift heating loads to periods where the heat pump can be operated most efficiently and (4) providing a comparison of the two state-of-the-art control methods MPC and DRL.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): SAI 2023, LNNS 711, pp. 459–471, 2023. https://doi.org/10.1007/978-3-031-37717-4_29
2 Related Work
In the last few years, the number of works applying deep reinforcement learning to heat pump control has increased. The works in [14,16,21] investigated heat pump control in a demand response setting, where control is supposed to happen with respect to a time-varying electricity price signal. They utilized fitted Q-iteration and thus discretized their control actions. With this approach, these works all reported a significant decrease in operating cost [14,16,21].
Heidari et al. [7] applied deep Q-learning [9] to control a simulated heat pump for hot drinking water usage. While there are also requirements for comfort in this problem statement, their focus can be considered different, as their goal was to learn control strategies that are hygiene-aware. They use a binary action space to control the heat pump. By including occupants’ warm water usage behaviour, Heidari et al. enabled their agent to learn control strategies that consider occupants’ behaviour. They reported savings in energy usage of 24.5% compared to a rule-based controller which they used as a baseline. In [6], Ghane et al. applied deep reinforcement learning using PPO [22] to control a simulated heat pump. Their problem statement differs, as their goal is to find optimal control strategies for a central heat pump which provides heat to a heating network consisting of multiple houses. Similar to the work at hand, their goal is to minimize electricity usage while meeting the comfort requirements of the occupants. They compared their results against a heating-curve-based controller and reported a 16.03% reduction in energy demand compared to this baseline. They used both discrete and continuous actions to validate their solution. In [10], Nagy et al. applied deep Q-learning [9] to a simulated air-source heat pump. By defining 6 discrete control actions, they managed to learn control strategies that stay in a comfort temperature range while reducing running cost based on an electricity price. They used a dual price signal with two different fixed prices during the day and night. A baseline comparison using rule-based and MPC-based control strategies was conducted. They reported savings of 5–10% compared to a rule-based controller. As in the work at hand, MPC was used as the upper performance limit, as it used the simulation as a model for planning and thus had full knowledge of the target environment. The work from Nagy et al. [10] can be considered the most similar to the work at hand. We extend other works in the field by implementing heat pump control as a continuous control problem. Additionally, we could show through a detailed analysis of the learned control strategies that sophisticated strategies which exploit the thermal storage of the building could be learned.
3 Problem Formulation
The main task at hand is to approximate optimal control strategies for an air-source heat pump in simulation in a model-free manner. Thereby, a control strategy must minimize electrical energy usage and comfort deviations at the same time. Comfort deviations occur whenever the indoor room temperature lies outside the comfort temperature range, which is defined between 21 °C and 25 °C. Finding efficient control strategies presents a challenging task, as (1) not only the heat pump operation but also the outside temperature has an impact on the indoor temperature, (2) selected control actions have an effect only with a delay, since the heat transfer does not take place instantaneously, (3) minimization of comfort deviations and electrical usage are conflicting objectives and (4) the outdoor temperature affects the efficiency of the heat pump, as it uses the outdoor air as heat source.¹
3.1 Simulation Framework
The core functionality of the simulation framework used in this work is illustrated in Fig. 1. It mimics a simplified building with a single room, heated by a floor heating system that is supplied with heat generated by an air-source heat pump. The framework simulates the effect of the outdoor temperature $T_{out}$ and the heat pump supply temperature $T_{sup}$ on the building indoor temperature $T_{in}$. The supply temperature is the temperature of the water heated by the heat pump as it enters the floor heating system. Additionally, the simulation calculates the return temperature $T_{ret}$, which is the temperature of the water coming back from the floor heating system to be heated again by the heat pump. The heat transfer between outdoor and indoor temperature is parameterized by the coefficient for transmission and ventilation $H_{ve,tr}$. Similarly, the heat transfer coefficient for radiation and convection $H_{rad,con}$ is used to model the heat transfer between the indoor temperature and the floor heating system. One simulation step represents 900 s of real time. At its heart, the simulation framework utilizes a 2R2C lumped capacitance model, which can be expressed by the following differential equations:

$$\frac{dT_{in}}{dt} = \frac{1}{C_{bldg}} \cdot \left( H_{rad,con} \cdot (T_{ret} - T_{in}) - H_{ve,tr} \cdot (T_{in} - T_{out}) \right) \tag{1}$$

$$\frac{dT_{ret}}{dt} = \frac{1}{C_{water}} \cdot \left( \dot{Q}_{hp} - H_{rad,con} \cdot (T_{ret} - T_{in}) \right) \tag{2}$$
Here, the term $C_{bldg}$ defines the thermal capacity of the building envelope and $C_{water}$ defines the thermal capacity of the water in the underfloor heating system. $\dot{Q}_{hp}$ defines the thermal power generated by the heat pump during a time step and is the variable which needs to be controlled by the control strategy. The framework can be used to simulate a wide variety of buildings by setting different parameters such as the transmission losses and heat capacities. The parameters and buildings used in this work to evaluate the proposed solution will be explained in more detail in Sect. 5.
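To make the dynamics of the 2R2C model more tangible, the following minimal sketch integrates Eqs. (1) and (2) with an explicit Euler scheme. The integration scheme, the number of substeps, the function name `step_2r2c` and all numeric parameter values are assumptions made for illustration only; the text does not state how the framework solves the equations or which absolute capacities it uses.

```python
# Explicit-Euler sketch of the 2R2C model in Eqs. (1) and (2).
# Assumed units: temperatures in deg C, q_hp in W, H in W/K, C in J/K.

def step_2r2c(t_in, t_ret, t_out, q_hp, *, h_rad_con, h_ve_tr,
              c_bldg, c_water, dt=900.0, substeps=90):
    """Advance indoor and return temperature by one control step of dt seconds."""
    h = dt / substeps  # internal integration step (assumption: 10 s)
    for _ in range(substeps):
        # Eq. (1): building gains heat from the floor and loses it to the outside
        d_t_in = (h_rad_con * (t_ret - t_in) - h_ve_tr * (t_in - t_out)) / c_bldg
        # Eq. (2): water gains heat from the heat pump and loses it to the room
        d_t_ret = (q_hp - h_rad_con * (t_ret - t_in)) / c_water
        t_in += h * d_t_in
        t_ret += h * d_t_ret
    return t_in, t_ret


# Hypothetical building parameters, not taken from the paper.
PARAMS = dict(h_rad_con=1000.0, h_ve_tr=300.0, c_bldg=2.0e7, c_water=1.0e6)

if __name__ == "__main__":
    t_in, t_ret = 21.0, 25.0
    for _ in range(4):  # four control steps, i.e. one simulated hour
        t_in, t_ret = step_2r2c(t_in, t_ret, t_out=0.0, q_hp=6000.0, **PARAMS)
    print(f"T_in={t_in:.2f} C, T_ret={t_ret:.2f} C")
```

The sketch also illustrates the delay mentioned above: the thermal power first raises the return temperature, and the indoor temperature follows only through the coupling term $H_{rad,con} \cdot (T_{ret} - T_{in})$.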
4 Deep Reinforcement Learning for Heat Pump Control
We applied deep reinforcement learning to approximate optimal heat pump control strategies. Deep reinforcement learning targets the learning of strategies through interactions between an agent and an environment. The agent's goal is to choose actions which maximize the sum of rewards over time. The reward serves as feedback on the quality of a certain action in a given state [25]. This interaction between agent and environment is illustrated in Fig. 1 and will be explained in more detail in Sects. 4.1 and 4.2.

¹ The lower the difference between the heat source and the target temperature, the higher the efficiency of a heat pump.

Fig. 1. Overview of the Simulation Framework and the Interaction between Agent and Environment

4.1 Heat Pump Control as Markov Decision Process
The Markov decision process (MDP) serves as a mathematical formalism of sequential decision-making problems. It is used to formally define environments in reinforcement learning [25]. The following describes the components that define the MDP used in this work:

Action Space. The action chosen by the agent represents the thermal power $\dot{Q}_{hp}$ generated by the heat pump during the upcoming time step. The action space is continuous in the range $\dot{Q}_{hp} \in [0, 12\,\mathrm{kW}]$. The action space is internally rescaled, so the actions taken by the agent lie in the symmetric interval $[-1, 1]$.

State Space. A state is defined as $\{T_{in}^{t}, T_{ret}^{t}, T_{out}^{t}, T_{out}^{t+1}, T_{out}^{t+2}, \dots, T_{out}^{t+n}\}$, where $T_{in}^{t}$ defines the indoor temperature, $T_{ret}^{t}$ the return temperature, and $T_{out}^{t}$ the outdoor air temperature of the current time step $t$. Additionally, the state contains a perfect forecast of the outdoor temperature for the next $n$ time steps, $\{T_{out}^{t+1}, T_{out}^{t+2}, \dots, T_{out}^{t+n}\}$. As shown later, the number of forecast steps included depends on the building at hand. It is important to note that the features contained in the state differ by orders of magnitude. Therefore, the states were standardized by applying moving average standardization as implemented in [18].
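The action and state definitions above can be sketched as follows. The linear form of the rescaling, the clipping of out-of-range actions and the helper names `rescale_action` and `build_state` are assumptions for illustration; the text only states that actions are rescaled to $[-1, 1]$ and that the state contains the current temperatures plus an $n$-step outdoor temperature forecast.

```python
Q_HP_MAX_W = 12_000.0  # upper bound of the action space (12 kW)


def rescale_action(a):
    """Map a policy output in [-1, 1] to thermal power in [0, 12 kW].

    Assumption: linear mapping with clipping of out-of-range values.
    """
    a = max(-1.0, min(1.0, a))
    return 0.5 * (a + 1.0) * Q_HP_MAX_W


def build_state(t_in, t_ret, t_out_series, t, n):
    """Return [T_in^t, T_ret^t, T_out^t, T_out^{t+1}, ..., T_out^{t+n}]."""
    if t + n >= len(t_out_series):
        raise ValueError("forecast horizon exceeds the available weather data")
    return [t_in, t_ret] + list(t_out_series[t:t + n + 1])


if __name__ == "__main__":
    outdoor = [0.0, 1.0, 2.0, 3.0, 4.0]  # made-up outdoor temperatures
    print(rescale_action(0.0))                          # half of the maximum power
    print(build_state(21.0, 28.0, outdoor, t=1, n=2))   # 3 + n entries
```

A state built this way has $3 + n$ entries, which is why longer forecasts directly increase the dimensionality of the observation space.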
Reward Function. The reward encodes the goal of the research task at hand. Therefore, it balances the minimization of electricity usage and comfort deviations. The reward is calculated at every time step according to the following formula:

$$r_t = -1 \cdot \left( \beta \cdot \text{electricity used}_t + \text{comfort deviation}_t \right) \tag{3}$$
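As a minimal sketch, Eq. (3) could be implemented as shown below. How the comfort deviation is measured is an assumption here: it is taken as the distance in °C between the indoor temperature and the nearest bound of the comfort range of 21 °C to 25 °C, and zero inside the range. The function names and the value passed for $\beta$ are hypothetical.

```python
COMFORT_LOW_C, COMFORT_HIGH_C = 21.0, 25.0  # comfort range from the task definition


def comfort_deviation(t_in):
    """Distance of the indoor temperature from the comfort range (0 inside).

    Assumption: deviation is the absolute distance to the nearest bound.
    """
    if t_in < COMFORT_LOW_C:
        return COMFORT_LOW_C - t_in
    if t_in > COMFORT_HIGH_C:
        return t_in - COMFORT_HIGH_C
    return 0.0


def reward(electricity_used, t_in, beta):
    """Eq. (3): negative weighted sum of electricity usage and comfort deviation."""
    return -1.0 * (beta * electricity_used + comfort_deviation(t_in))


if __name__ == "__main__":
    # Inside the comfort range only the electricity term contributes.
    print(reward(electricity_used=400.0, t_in=22.0, beta=0.001))
    # Below the comfort range the deviation of 1 degree C is penalized as well.
    print(reward(electricity_used=0.0, t_in=20.0, beta=0.001))
```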
The trade-off parameter $\beta$ is used to balance the conflicting objectives of minimizing comfort deviations and electrical usage.

4.2 Deep Reinforcement Learning Agent
The agent is represented by a policy $\pi$, which we implemented as a fully connected neural network with two hidden layers containing 64 neurons each. In the following, we refer to this network as the policy network. During training, the policy network learns to approximate optimal heat pump control actions based on a state. Out of the many deep reinforcement learning algorithms that exist, we chose Proximal Policy Optimization (PPO) [22] to train the agent. We chose PPO as it (1) supports continuous action spaces, (2) has recently been applied to problems which can be considered more complex [12,13], and (3) was reported as the default reinforcement learning algorithm at OpenAI, one of the pioneers in the field of deep reinforcement learning [11]. We utilized the deep reinforcement learning framework Stable Baselines3 [18] to implement and train the agent.
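To give an impression of the size of such a policy network, the following sketch builds a fully connected network with two hidden layers of 64 neurons in plain Python and counts its parameters. It is not the Stable Baselines3 implementation: the tanh activation, the random initialization, the single output interpreted as a deterministic action, and the helper names are all assumptions. The input sizes 11 and 51 correspond to a state with $3 + n$ entries for forecast lengths of $n = 8$ and $n = 48$.

```python
import math
import random


def init_mlp(sizes, seed=0):
    """Create (weights, biases) per layer for the given layer sizes."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def forward(layers, x):
    """Forward pass; tanh on hidden layers (assumption), linear output layer."""
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            x = [math.tanh(v) for v in x]
    return x


def policy_action(layers, state):
    """Deterministic action in [-1, 1], obtained by clipping the network output."""
    return max(-1.0, min(1.0, forward(layers, state)[0]))


def n_params(layers):
    """Total number of weights and biases."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


if __name__ == "__main__":
    for n_forecast in (8, 48):
        net = init_mlp([3 + n_forecast, 64, 64, 1])
        print(n_forecast, n_params(net), policy_action(net, [0.0] * (3 + n_forecast)))
```

Selecting an action is a single forward pass through a few thousand parameters, which is consistent with the short execution times reported for the agent.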
5 Experiment Setup

We used the simulation framework described in Sect. 3.1 to imitate two different buildings: an old building, which can be considered inefficient in terms of its thermal properties, and an efficient building, which can be considered energy efficient. See Table 1 for a more detailed description of the buildings.

Table 1. Summary of Simulated Buildings

                                            Old     Efficient
Floor Size in m²                            136     393
Heat Capacity $C_{bldg}$ in Wh/m²/K         45      65.9
Transmission Losses $H_{ve,tr}$ in W/K      396     281.7
Energy-Efficiency Class (a)                 F       A
Year of Construction                        1984    2020

(a) According to the German building energy act, https://www.bmwsb.bund.de/Webs/BMWSB/DE/themen/bauen/energieeffizientes-bauen-sanieren/gebaeudeenergiegesetz/gebaeudeenergiegesetz-artikel.html

The simulated heat pump corresponds to a Dimplex LA 6TU air-source heat pump with the maximum heating power set to 12 kW. We used weather data from the Photovoltaic Geographical Information System of the European Commission³ to feed the simulation with outdoor air temperature data. We used the weather data from 2010 to 2015 for training and left the data from 2016 untouched for obtaining results. Weather data from April to September were excluded in both cases, as heating in those months is usually not necessary. In the following, we refer to the data from 2010 to 2015 as training data and to the data from 2016 as testing data.

We trained one agent per building independently, which led to two independent agents. The training was conducted over 350 episodes, each containing 2880 interactions between agent and environment. Therefore, each of the two agents was trained over approximately 1,000,000 time steps. One training episode corresponds to approximately one month, as one simulation step represents 900 s in real time. One month of weather data was chosen at random from the training data at the beginning of each episode. We repeated the training for both agents five times using different preset random seeds, as training may run significantly differently for different seeds [8]. After training, we chose the agent which performed best out of the five runs and discarded the others.

We set up MPC to serve as a baseline for optimality by using the simulation framework itself to plan and execute the control actions. This would mostly not be applicable in reality, as usually a simplified model is used to plan, and a more complex model or the reality is used to close the control loop by executing the planned control actions [1,23]. However, in our scenario, providing MPC with the same model for planning and execution gives us an optimal baseline for comparison.
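The training budget described above can be checked with a few lines of arithmetic. The sketch below also shows one possible way of drawing a random training month; the sampling helper and its name are hypothetical, since the text does not specify how a month is selected or how months of different lengths are mapped to the fixed episode length of 2880 steps.

```python
import random

STEP_SECONDS = 900          # one simulation step
STEPS_PER_EPISODE = 2880    # interactions per training episode
EPISODES = 350              # training episodes per agent

episode_days = STEPS_PER_EPISODE * STEP_SECONDS / 86_400  # length of one episode in days
steps_per_agent = EPISODES * STEPS_PER_EPISODE            # total training steps per agent

HEATING_MONTHS = (1, 2, 3, 10, 11, 12)  # April to September are excluded
TRAIN_YEARS = range(2010, 2016)         # 2010 to 2015 inclusive


def sample_training_month(rng):
    """Pick a (year, month) pair uniformly from the training data (assumption)."""
    return rng.choice(TRAIN_YEARS), rng.choice(HEATING_MONTHS)


if __name__ == "__main__":
    print(episode_days)      # 30.0 days, i.e. approximately one month
    print(steps_per_agent)   # 1008000 steps, i.e. approximately 1,000,000
    print(sample_training_month(random.Random(0)))
```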
6 Results

To obtain the results, the agents were executed using weather profiles of the six months from 2016, which were not used during training and therefore represent our testing data.

6.1 Qualitative Results
Figure 2 shows the control strategies learned by the agents during training. It can be observed that the learned control strategies differ between the two buildings.

Old Building. As shown in the top part of Fig. 2, the trained agent of the old building regulates the indoor temperature almost constantly at 21 °C. Intuitively, this strategy is reasonable, as 21 °C marks the lower comfort bound. Heating further would not increase comfort by our definition, but would lead to an increase in electricity usage. Experiments have shown that the agent does not benefit from weather forecasts that reach far into the future. A prediction length of 8 time steps, which corresponds to two hours, performed best. Longer forecasts worsened performance due to an increase in the dimensionality of the observation space. Thus, we interpret the learned strategy as relatively shortsighted, which makes sense, as storing heat in the old building comes with high losses. The efficiency of this strategy is supported by the fact that MPC, our baseline for optimality, takes an almost identical strategy (see top part of Fig. 2).

³ https://re.jrc.ec.europa.eu/.
Fig. 2. Comparison of Strategies Taken by the Different Control Methods when Heating the Different Buildings. In Both Cases, the Same Weather Profile from March from the Testing Data was Used. Plots of Control During other Months of the Testing Dataset can be Found in [19].
Efficient Building. In contrast to the old building, the agent learned a strategy where the majority of heating takes place while the outside temperature is relatively high (see bottom part of Fig. 2). This makes sense, as the efficiency of the heat pump increases with the temperature of the low-temperature source. The agent learned to exploit the higher heat capacity and lower transmission losses of the efficient building (shown in Table 1) by storing heat and thus shifting the heating load to periods where the heat pump can be operated more efficiently. For this strategy, the agent learned to utilize the weather forecast. Experiments showed that a temperature forecast of 48 time steps, which corresponds to 12 h, performed best. When comparing the learned strategy to MPC, we can see that both strategic approaches are quite similar.

6.2 Quantitative Results
Table 2 quantifies the results of the strategies discussed above. It can be seen that DRL consumes only marginally (less than 1%) more energy for both buildings compared to MPC, which, as explained above, can be considered the baseline for optimality in our setup. The execution times indicate the time required by MPC and the DRL agent to perform heat pump control during the six test episodes. This includes 17,280 simulation steps and the same number of control decisions. Here it can be seen that DRL is many times faster during execution⁴.

Table 2. Quantitative Baseline Comparison

                                    DRL       MPC
Old
  Electricity Mean in Wh            405.15    403.23
  Comfort Deviation Mean in °C      0         0.02
  Comfort Deviation Max in °C       0.18      0.15
Efficient
  Electricity Mean in Wh            137.92    137.65
  Comfort Deviation Mean in °C      0.02      0
  Comfort Deviation Max in °C       0.21      0.20
Execution Time in Seconds           38        1679

Note: Values were rounded to two decimal places. The mean values were calculated over the whole testing data and are to be considered per time step, which simulates 900 s in real time. The Max denotes the maximum comfort deviation which occurred at any time during the control period of the whole testing data.
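The statements about energy overhead and execution speed can be reproduced from the rounded values in Table 2. The following sketch is only a worked check of that arithmetic; the helper name is hypothetical, and since the table values are rounded to two decimals, the results are approximate.

```python
def relative_overhead_percent(drl, mpc):
    """Additional consumption of DRL relative to MPC in percent."""
    return 100.0 * (drl - mpc) / mpc


overhead_old = relative_overhead_percent(405.15, 403.23)        # old building
overhead_efficient = relative_overhead_percent(137.92, 137.65)  # efficient building

test_steps = 6 * 2880            # six test episodes of 2880 steps each
speedup = 1679 / 38              # MPC execution time divided by DRL execution time
mpc_seconds_per_step = 1679 / test_steps
drl_seconds_per_step = 38 / test_steps

if __name__ == "__main__":
    print(f"overhead old: {overhead_old:.2f} %")
    print(f"overhead efficient: {overhead_efficient:.2f} %")
    print(f"test steps: {test_steps}, speedup: {speedup:.1f}x")
    print(f"per decision: MPC {mpc_seconds_per_step:.4f} s, DRL {drl_seconds_per_step:.4f} s")
```

Both overheads stay below 1%, and the ratio of the execution times is roughly 44.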
7 Extension to a Demand Response Scenario

The purpose of this experiment was to show whether the proposed method can be extended to a demand response scenario, where control should happen with respect to a time-varying electricity price signal.

7.1 Setup Demand Response Scenario
To the best of our knowledge, there is no publicly available data about time-varying electricity prices for residential customers in Europe. Therefore, as in [15], we used day-ahead electricity prices from the European Power Exchange (EPEX) spot market. These prices are used to balance supply and demand between electricity producers and distributors. It must be noted that the EPEX spot prices used differ from residential customer electricity prices, as they do not include taxes and costs for grid usage. The price data used in this experiment originate from the same time period as the weather data and were provided by the Fraunhofer Institute for Solar Energy Systems. Like the weather data, the price data were split into training and testing data. In order to perform the demand response experiment, we changed the MDP definition as follows:

⁴ During execution, the DRL agent approximates the optimal control action by forward propagating the neural network through fast matrix operations. MPC has to solve an optimization problem at every time step, which comes at a cost in compute.

Reward. The new objective of the agent is to minimize the operational cost of the heat pump while still maintaining comfort. The operational cost results from the amount of energy consumed and the electricity price at that time. This new objective was encoded in the reward as follows:

$$r_t = -1 \cdot \left( \beta \cdot \text{electricity used}_t \cdot \text{price}_t + \text{comfort deviation}_t \right) \tag{4}$$
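A minimal sketch of Eq. (4) illustrates why this reward encourages load shifting: for the same amount of electricity, a time step with a lower price yields a higher reward. The function name and all numeric values are hypothetical and chosen only for illustration.

```python
def demand_response_reward(electricity_used, price, comfort_deviation, beta):
    """Eq. (4): electricity usage is weighted by the current price."""
    return -1.0 * (beta * electricity_used * price + comfort_deviation)


if __name__ == "__main__":
    # Same consumption and comfort, different (made-up) prices.
    cheap = demand_response_reward(400.0, price=2.0, comfort_deviation=0.0, beta=0.001)
    expensive = demand_response_reward(400.0, price=6.0, comfort_deviation=0.0, beta=0.001)
    print(cheap, expensive)  # heating during the cheaper period is penalized less
```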
Note that the new reward definition is analogous to the one in (3), but additionally includes the price. As before, a trade-off parameter $\beta$ is used to balance comfort and costs.

State. We included a perfect forecast of the electricity price in the state to enable the agent to decide depending on the price. This led to the following state definition: $\{T_{in}^{t}, T_{ret}^{t}, T_{out}^{t}, \text{price}_{t}, T_{out}^{t+1}, \text{price}_{t+1}, T_{out}^{t+2}, \text{price}_{t+2}, \dots\}$. A forecast length of 32 time steps, which corresponds to 8 h, for both the price and the outside temperature performed best. We trained the agent analogously to the procedure described in Sect. 5. The experiment was conducted with the efficient building only, as the old building did not provide any heat storage capabilities to shift heating from high to low price periods.

7.2 Results Demand Response Scenario

Table 3. Results of the Demand Response Scenario

                                  DRL     MPC     DRL Baseline
Cost Mean in Cent                 0.32    0.31    0.47
Comfort Deviation Mean in °C      0.2     0.22    0.02
Comfort Deviation Max in °C       1.08    0.85    0.21
Execution Time in Seconds         38      1677    38

Notes: Values were rounded to two decimal places and the statistics are calculated over the whole testing data. The description of Table 2 provides more details on how the mean and Max were calculated.
Fig. 3. Heat Pump Control in a Demand Response Scenario by MPC and the Deep Reinforcement Learning Agent for the Efficient Building. For Both Control Methods, the Same Weather Profile from February from the Testing Data was Used. Plots of Control during other Months of the Testing Data can be Found in [19]
Figure 3 demonstrates the functionality of the trained agent for the efficient building in the demand response scenario. It can be seen that the agent learned to shift the heating to low price periods. Table 3 quantifies the results of this experiment. The cost and comfort deviation mean values denote the mean per time step and are calculated using all months contained in the testing data. Based on the costs, it can be seen that the proposed solution is almost as effective in the demand response context as MPC, which can be considered the optimal solution. Additionally, the DRL method from Sect. 5, which controls independently of the price signal, is listed in the table and serves as an additional baseline. Although the DRL method described in this section causes more comfort deviations, operational costs could be reduced by almost a third compared to the DRL baseline. It must be noted that the quantification of the cost saving potential depends on the price signal used and must therefore be interpreted with caution.
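The cost figures quoted above follow from the mean cost values in Table 3 (0.32 Cent for the price-aware DRL agent, 0.31 Cent for MPC and 0.47 Cent for the DRL baseline). The sketch below is a worked check only; the helper name is hypothetical, and because the table values are rounded to two decimals, the percentages are approximate.

```python
def relative_saving_percent(baseline, new):
    """Cost reduction of `new` relative to `baseline` in percent."""
    return 100.0 * (baseline - new) / baseline


saving_vs_baseline = relative_saving_percent(0.47, 0.32)  # price-aware DRL vs. DRL baseline
gap_to_mpc = 100.0 * (0.32 - 0.31) / 0.31                 # extra cost of DRL relative to MPC

if __name__ == "__main__":
    print(f"saving vs. DRL baseline: {saving_vs_baseline:.1f} %")
    print(f"gap to MPC: {gap_to_mpc:.1f} %")
```

A saving of roughly 32% matches the statement that costs were reduced by almost a third, while the remaining gap to MPC is only a few percent.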
8 Conclusion and Future Work

Motivated by the shortcomings of the traditional heating curve and MPC, this work presents a model-free approach to heat pump control by applying deep reinforcement learning. The results indicate that it is possible to learn optimal heat pump control policies just by learning from interactions with the simulated building in a model-free manner. However, our work and the related works listed in Sect. 2 were applied in simulation and not in the real world. Therefore, in future work we will focus on the transition of the presented concepts from simulation to reality.
References

1. Afram, A., Janabi-Sharifi, F.: Theory and applications of HVAC control systems – a review of model predictive control (MPC). Build. Environ. 72, 343–355 (2014)
2. AG Energiebilanzen e.V.: Anwendungsbilanzen zur Energiebilanz Deutschland. https://ag-energiebilanzen.de/wp-content/uploads/2020/10/ageb 20v v1.pdf. Accessed 2 Apr 2022
3. Bundesverband Wärmepumpe e.V. https://www.waermepumpe.de/presse/zahlendaten/. Accessed 21 Apr 2022
4. Cígler, J., Gyalistras, D., Široký, J., Tiet, V.N., Ferkl, L.: Beyond theory: the challenge of implementing model predictive control in buildings. In: Proceedings of 11th Rehva World Congress, Clima, vol. 250 (2013)
5. Fischer, D., Madani, H.: On heat pumps in smart grids: a review. Renew. Sustain. Energy Rev. 70, 342–357 (2017)
6. Ghane, S., et al.: Supply temperature control of a heating network with reinforcement learning. In: 2021 IEEE International Smart Cities Conference (ISC2), pp. 1–7 (2021)
7. Heidari, A., Marechal, F., Khovalyg, D.: An adaptive control framework based on reinforcement learning to balance energy, comfort and hygiene in heat pump water heating systems. J. Phys. Conf. Ser. 2042(1), 012006 (2021)
8. Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., Meger, D.: Deep reinforcement learning that matters. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)
9. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
10. Nagy, A., Kazmi, H., Cheaib, F., Driesen, J.: Deep reinforcement learning for optimal control of space heating. arXiv preprint arXiv:1805.03777 (2018)
11. OpenAI: Proximal policy optimization. https://openai.com/blog/openai-baselines-ppo/. Accessed 10 Sep 2022
12. OpenAI, et al.: Learning dexterous in-hand manipulation. CoRR (2018)
13. OpenAI, et al.: Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680 (2019)
14. Patyn, C., Ruelens, F., Deconinck, G.: Comparing neural architectures for demand response through model-free reinforcement learning for heat pump control. In: 2018 IEEE International Energy Conference (ENERGYCON), pp. 1–6 (2018)
15. Paul, S.: Learning to control heat pumps using supervised and reinforcement learning methods based on neural networks. Master's thesis, Albert-Ludwigs-University Freiburg (2019)
16. Peirelinck, T., Ruelens, F., Deconinck, G.: Using reinforcement learning for optimizing heat pump control in a building model in modelica. In: 2018 IEEE International Energy Conference (ENERGYCON), pp. 1–6 (2018)
17. Prívara, S., Cigler, J., Váňa, Z., Oldewurtel, F., Sagerschnig, C., Žáčeková, E.: Building modeling as a crucial part for building predictive control. Energy Build. 56, 8–22 (2013)
18. Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., Dormann, N.: Stable-Baselines3: reliable reinforcement learning implementations. J. Mach. Learn. Res. 22(268), 1–8 (2021)
19. Rohrer, T.: Deep reinforcement learning for heat pump control. Master's thesis, Darmstadt University of Applied Sciences (2022)
20. Rolando, D., Hatef, M.: Smart control strategies for heat pump systems. KTH Royal Institute of Technology (2018). https://varmtochkallt.se/wp-content/uploads/Projekt/EffsysExpand/P18 Project Report final reviewed.pdf. Accessed 14 Mar 2022
21. Ruelens, F., Claessens, B., Vandael, S., De Schutter, B., Babuska, R., Belmans, R.: Residential demand response applications using batch reinforcement learning. arXiv preprint arXiv:1504.02125 (2015)
22. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
23. Serale, G., Fiorentini, M., Capozzoli, A., Bernardini, D., Bemporad, A.: Model predictive control (MPC) for enhancing building and HVAC system energy efficiency: problem formulation, applications and opportunities. Energies 11(3), 631 (2018)
24. Sterchele, P., et al.: Wege zu einem klimaneutralen Energiesystem. Die deutsche Energiewende im Kontext gesellschaftlicher Verhaltensweisen. Fraunhofer ISE (2021). Accessed 23 May 2022
25. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, 2nd edn. MIT Press (2018)
26. Wei, T., Wang, Y., Zhu, Q.: Deep reinforcement learning for building HVAC control. In: Proceedings of the 54th Annual Design Automation Conference 2017, pp. 1–6 (2017)
Efficient Training of Foosball Agents Using Multi-agent Competition

Adriatik Gashi, Elke Hergenröther, and Gunter Grieser

Darmstadt University of Applied Sciences, Schöfferstraße 3, 64295 Darmstadt, Germany
{adriatik.gashi,elke.hergenroether,gunter.grieser}@h-da.de, [email protected]
Abstract. In this work, an efficient training concept for training striker and goalkeeper Foosball agents using Deep Reinforcement Learning is designed and implemented. Based on a literature review, individual components such as the observation and action space are defined, and suitable configurations for efficient learning are determined by experiments. Furthermore, different aspects of modeling multi-agent training are discussed on the basis of further literature research, and a concept for use in the Foosball domain is derived from this. Using only two consumer laptops, it could be shown that the use of multiple agents is also suitable in resource-constrained domains. Despite an imbalance in the difficulty of the tasks of striker and goalkeeper agents, both agents were able to learn individual strategies while taking the respective opponent into account.
Keywords: Reinforcement Learning · Multi-agent Training · Foosball Agents

1 Introduction
In recent years, deep reinforcement learning (DRL) has been successfully used in various areas such as board games and real-time strategy games [6,30]. These successes were achieved using multiple agents training with each other. By modeling competitions, agents could be trained with each other to develop strategies against competitors. At the same time, the aforementioned works have a high computational cost in common. For the match against the Go world champion alone, AlphaGo used a total of 1920 CPUs and 280 GPUs [30]. For OpenAI Five, the agents were trained over ten months with a training volume of approximately two million processed images every two seconds [6].

This raises the question of the extent to which training agents with each other is possible with limited computational resources. For this purpose, the semi-automated Bosch Rexroth Foosball table from Fig. 1 is investigated as a problem domain within the scope of this work. A simulation of the Foosball table is used to train DRL agents to operate the striker and goalkeeper rods using multi-agent competition. Two consumer laptops are used for the training of the agents. Therefore, a training concept is developed that considers the limited computational resources available.

Fig. 1. Semi-Automated Bosch Rexroth Foosball Table [9]

To accomplish this, we first formalize the environment as a single-agent Markov Decision Process (MDP) from the perspective of the striker agent in Sect. 4, selecting a suitable learning algorithm and evaluating different configurations by experiment. Afterwards, we extend the MDP to a Markov Game that includes the goalkeeper agent in Sect. 5, adjusting the training process accordingly. Our results show that striker and goalkeeper agents were able to develop offensive as well as defensive strategies in the multi-agent setting. While the task design showed an imbalance in the difficulty for the rods, both agents were able to develop behaviour to successfully score goals and control the ball with high precision. This suggests that using multi-agent competition to develop complex behaviour with relatively small computational resources is possible.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): SAI 2023, LNNS 711, pp. 472–492, 2023. https://doi.org/10.1007/978-3-031-37717-4_30
2 Related Work

First, related work with respect to the Foosball domain is presented, followed by works that used simulation environments to train agents. In [34], an automated Foosball table was developed whose control was implemented by means of a decision tree. In addition, a simulation implemented by the authors allows different strategies to be tested using two agents. Further Foosball examples are [37], where a Foosball rod was taught different action sequences by imitating human actions, and [13], where a method for tracking the ball using a camera was developed. In [9], DRL agents were used to teach a striker rod to score goals. The focus of that work was the "Sim-to-Real" domain, where the Foosball example was chosen to represent a complex manufacturing-like process, which was optimized using DRL. Using a simulation of the Foosball table, the authors were able to train agents that successfully learned demanding Foosball control strategies using sparse reward signals. However, in contrast to this work, only one agent was trained and deployed in the respective environment. The authors used the same Foosball table as considered in this work. Furthermore, in [27] a Foosball goalkeeper was taught to defend the goal using DRL.
The authors used Deep Q-Learning [18] to train a goalkeeper agent on a variety of direct shots at the goal. However, the authors used a discrete action space that represents only lateral movements of the rod, disregarding rod rotations altogether. In [20], a robot hand was taught to solve a Rubik's Cube. For this purpose, the authors trained the agent in a simulation and then used the resulting model to control a real robot hand. This showed that agents trained in simulation can achieve comparable results in reality. This was also shown in [21], where visual object recognition was used by a robot hand to position objects as shown in a pictorial representation. In [30–32], agents learned grandmaster-level behaviours for board games like Chess, Shogi and Go using simulations. In [6], a complex training system was used to beat world champions in an online real-time strategy game. However, these works necessitated the use of large training systems with numerous computing units, and the training process for the agents took several weeks to several months.
3 Foosball Domain

The starting point of this work is the Foosball domain. Here, this work refers to the semi-automated Foosball table developed by Bosch Rexroth shown in Fig. 1. In order to be able to train agents for this Foosball table, a simulation with the CAD model of the Foosball table was recreated in Unity. With this simulation, a goalkeeper agent was already successfully taught to defend shots [27]. We extend the work from [27] to train striker and goalkeeper agents in this simulation. However, only two consumer laptops with the specifications listed in Table 1 are available for this purpose. This has to be taken into account when designing the training procedure, and therefore the efficiency of the training is of great importance when determining the individual training components. At the same time, the simulation is only an abstraction of the Foosball table, which is why the simulation-to-reality gap must also be taken into account.

Table 1. Hardware Specification of the Available Computational Resources

System model    ThinkpadP1Gen2                    Lenovo81NX
Processor       Intel Core i7-9750H @ 2.60 GHz    Intel Core i7-9750H @ 2.60 GHz
RAM             32.0 GB                           16.0 GB
GPU             NVIDIA Quadro T1000               NVIDIA GeForce GTX 1650 with Max-Q Design
4 Single Agent DRL

In order to use DRL agents to control the striker and goalkeeper rods, the task to be learned is formalized as an MDP, a suitable way to formally define a DRL problem [35, p. 10, 12]. Additionally, a learning algorithm must be determined.
For this purpose, this section first defines potential dimensions of observation and action space, as well as a reward function from the perspective of the striker rod. Then, a DRL algorithm for training the agents is selected, followed by a methodological determination of the observation and action space to be used for multi-agent training.

4.1 Definition of MDP
Observation Space. Observations are the state representations of an MDP. Both the frequency and the representation of an observation must be defined. The observation frequency should be as high as possible, so that precise conclusions can be drawn about the selected actions, yet realistic with respect to the real problem domain, so that the simulation-to-reality gap remains small. We chose 60 observations per second, as this frequency is commonly used in established DRL benchmark environments [5] and is within realistic limits with respect to the observation frequency of the human eye [8,23]. Feature vectors were chosen to represent observations, as this significantly reduces dimensionality compared to using images and eliminates the complexity of image processing. Of particular importance is which observation channels are made available to the agent for the selection of actions, as an MDP requires the Markov property to be satisfied. Therefore, information about both the rod and the ball must be provided. Tables 2 and 3 present the observation channels relevant to rod and ball, respectively, and the value ranges used to encode this information to satisfy the Markov property. Since no boundary values are known for the ball velocity, its range of values was limited to [−10, 10] in a zero-centered manner, in order to avoid larger values that could negatively influence training stability.

However, this minimal configuration may not represent the optimal combination of observation channels. One piece of information that is currently hidden from the agent is the number and position of players on a rod. With this information, the agent would not have to implicitly learn the number of players and their respective positions, but could accurately track them via the observation channels. Since the players are fixed to the rod and the distances between them thus remain the same even when they move, all that is needed here is a scalar value of the position of the players.

Table 2. Observation Channels Regarding the Foosball Rod with Respective Encoding

Observation channel     Encoding
Lateral rod position    [−1, 1]
Angular rod position    [−1, 1]
Lateral rod velocity    [−1, 1]
Angular rod velocity    [−1, 1]
Table 3. Observation Channels Regarding the Foosball Ball with Respective Encoding

Observation channel     Encoding
x-coordinate ball       [−1, 1]
z-coordinate ball       [−1, 1]
x-velocity ball         [−10, 10]
z-velocity ball         [−10, 10]
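The encodings of the rod and ball channels can be sketched as a small feature-vector builder. The physical limits used for normalization, the dictionary keys and the function names below are hypothetical placeholders, since the text only specifies the target ranges of the channels and not the underlying physical bounds of the table.

```python
# Hypothetical physical limits; the real table's values are not given in the text.
ROD_LIMITS = {
    "lateral_pos": (-0.1, 0.1),      # m
    "angular_pos": (-180.0, 180.0),  # degrees
    "lateral_vel": (-1.0, 1.0),      # m/s
    "angular_vel": (-720.0, 720.0),  # degrees/s
}
FIELD_LIMITS = {"x": (-0.6, 0.6), "z": (-0.35, 0.35)}  # m
BALL_VELOCITY_CLIP = 10.0  # ball velocities are limited to [-10, 10]


def scale_to_unit(value, low, high):
    """Linearly map value from [low, high] to [-1, 1], clipping at the bounds."""
    value = max(low, min(high, value))
    return 2.0 * (value - low) / (high - low) - 1.0


def build_observation(rod, ball):
    """Return the 8-dimensional feature vector: 4 rod channels, 4 ball channels."""
    obs = [scale_to_unit(rod[key], *ROD_LIMITS[key]) for key in ROD_LIMITS]
    obs += [scale_to_unit(ball[axis], *FIELD_LIMITS[axis]) for axis in ("x", "z")]
    obs += [max(-BALL_VELOCITY_CLIP, min(BALL_VELOCITY_CLIP, ball[key]))
            for key in ("vx", "vz")]
    return obs


if __name__ == "__main__":
    rod = {"lateral_pos": 0.05, "angular_pos": 90.0, "lateral_vel": 0.0, "angular_vel": 360.0}
    ball = {"x": 0.3, "z": -0.1, "vx": 25.0, "vz": -2.0}  # vx exceeds the clip range
    print(build_observation(rod, ball))
```

Compared to image observations, such a vector has only eight entries, which is the dimensionality reduction motivating the use of feature vectors.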
Furthermore, several works have investigated the influence of binary contact information on locomotion problems, with the result that it can enable small performance gains [26]. From this, two potential observation channels can be derived: binary contact information, which indicates a contact between player and ball, and binary range information, which indicates when the ball is in range of the striker rod. In the context of this work, the focus is on efficient training of agents, which has to be considered when determining the observation channels to be used. For this purpose, the method described in [15] was used to determine an optimal observation space. The experimental setup and the results of this method are explained in more detail in Sect. 4.3, since an action space and a reward function must first be defined.

Action Space. The action space of a Foosball rod can be divided into two dimensions: the angular and the lateral movement of the rod. The encoding of both dimensions, which can be continuous or discrete, has to be determined. In [14], it was shown that discretizing action spaces could significantly increase training efficiency. However, [33] showed that while discretization can be advantageous in complex tasks, a continuous action space can be essential for achieving a high return in simpler tasks. To determine the encoding for the action space, an experiment was conducted to compare discrete and continuous value ranges. This experiment is explained in Sect. 4.4, because a reward function has to be determined first.

Reward Structure. The goal of the Foosball striker is to score goals. At the same time, the ball must be prevented from getting behind the striker. Thus, a positive reward is given for a goal and a negative reward is given if the ball gets behind the striker.
If we assign a value of 1 for a positive reward and −1 for a negative reward, we can assume that scoring goals is just as important to the agent as defending the ball. However, since there are wide walls next to the goal, the probability of the ball bouncing back under a random policy is much higher than the probability of randomly scoring a goal. Since this can quickly create an incentive to refrain from shooting altogether, a negative reward of −0.1 is chosen instead of −1. With this ratio of 1 to 10, it can be assumed that the incentive to shoot is maintained despite initial bounces. At the same time, it can be assumed that optimizing a policy leads to the prevention of wall bounces. However, to make the training more efficient, potential-based reward shaping (PBRS) is used [19]. This makes the reward signal denser without changing the optimal policy. For this, the ball position on the field is used to define a potential function $\psi^{\mathrm{striker}}$:
$$\psi^{\mathrm{striker}}(S_t) = \frac{\psi_x^{\mathrm{striker}}(X_{S_t}) + \psi_z^{\mathrm{striker}}(Z_{S_t})}{2} \tag{1}$$
where $\psi_x^{\mathrm{striker}}$ and $\psi_z^{\mathrm{striker}}$ describe the potential of the ball with respect to its x- and z-coordinate, respectively. Given the encoding [−1, 1] of the x-coordinate of the ball, where the striker rod corresponds to the value 1 and the goal to the value −1, $\psi_x^{\mathrm{striker}}$ can be defined as the following linear function with the value range [0, 1]:

$$\psi_x^{\mathrm{striker}}(X_{S_t}) = -0.5\,X_{S_t} + 0.5 \tag{2}$$

where $X_{S_t}$ denotes the x-coordinate of the ball at state $S_t$ and time step t. For the z-coordinate, however, the width of the goal corresponds to the range [−0.26, 0.26], which should map to the highest potential value of 1. The potential of the ball's z-coordinate decreases for values below or above the goal width. To represent this, the following non-linear potential function $\psi_z^{\mathrm{striker}}$ is used, which clips the potential values at 1 and has the value range [0, 1]:

$$\psi_z^{\mathrm{striker}}(Z_{S_t}) = \begin{cases} \min(1.36\,Z_{S_t} + 1.36,\ 1) & \text{if } Z_{S_t} < 0 \\ \min(-1.36\,Z_{S_t} + 1.36,\ 1) & \text{if } Z_{S_t} \ge 0 \end{cases} \tag{3}$$

where $Z_{S_t}$ denotes the z-coordinate of the ball at state $S_t$ and time step t. This results in the following shaping-reward function $F(S_t, S_{t+1})$ based on [19]:

$$F(S_t, S_{t+1}) = \gamma\,\psi^{\mathrm{striker}}(S_{t+1}) - \psi^{\mathrm{striker}}(S_t) \tag{4}$$
where γ is the discount factor.

[…] 1 − δ [11]. In a classical random walk, agents are only allowed to jump to directly adjacent grid points at each time step. The analysis for this case is therefore more complex. Agents that start close to each other are likely to collide repeatedly in subsequent rounds. On average, compared to the connected random walk, agents have more collisions with fewer agents. We call a random walk fast mixing if the location of an agent only weakly correlates with its previous location. An excellent example of such a random walk is the connected random walk, where there is no correlation at all between the locations at different time steps. On the other hand, we call a random walk slow mixing if the location of an agent correlates strongly between different rounds. The classical random walk is an example of a slow mixing random walk. The encounter-rate-based density estimation using a classical random walk is nearly as accurate as with the fully connected random walk, as shown by Musco et al. For the classical random walk and a relative error ε, after

$$O\!\left(\frac{\log(1/\delta)\,\bigl(\log\log(1/\delta) + \log(1/(d\varepsilon))\bigr)^{2}}{d\,\varepsilon^{2}}\right)$$

rounds, the encounter rate $\tilde{d}$ approximates the true density d with probability $P\bigl(|d - \tilde{d}| \le \varepsilon d\bigr) > 1 - \delta$. This differs from the fully connected estimate only by a multiplicative factor $\bigl(\log\log(1/\delta) + \log(1/(d\varepsilon))\bigr)^{2}$.
Density Approximation Using Quantum-Inspired Random Walks
We extend this work by considering different types of random walks that allow each agent to move to grid points other than the directly adjacent ones in each time step. We consider different underlying distributions for random walks, including a uniform distribution and a distribution inspired by a quantum random walk. Note that Musco et al. only allow the number of rounds to be smaller than the total grid size. We use simulations with more rounds than the total grid size to show that the encounter-rate-based density estimation converges faster for random walks with a wider spread in the probability distribution. In Sect. 2 we give a formal problem description, including an algorithm to run the encounter-based density estimation for different types of random walks. In Sect. 3 we show our simulation results for different types of random walks and in Sect. 4 we give our conclusions. In Appendix A we formalize the concept of a quantum random walk in two dimensions and show how we obtain the quantum-inspired random walk distribution that we use.
2 Problem Formulation
We consider the same situation as described by Musco et al., that is, a square two-dimensional grid consisting of A nodes. Each node is described by its coordinates (x, y) with 0 ≤ x, y < √A. To avoid complicating boundary effects, we connect the opposite edges of the grid with each other, which results in a torus. Initially, we place N + 1 agents uniformly at random on this two-dimensional torus. The position of each agent is described by the (x, y) coordinates of its node on the grid. In this model, multiple agents can occupy the same node and hence have the same (x, y) coordinates, as illustrated in Fig. 1. Here Count(i, j) = k means that there are k + 1 agents located at the node with coordinates (x, y) = (i, j). From the perspective of a single agent, the density d of other agents on the grid is given by the total number of other agents divided by the number of nodes these agents can occupy, d = N/A. The agent now tries to estimate this density by performing a random walk and keeping track of the number of collisions with other agents. Intuitively, the rate at which collisions occur is related to the density: on average, an agent that often collides with other agents is likely to be in a high-density environment. We model the behaviour of the agents by letting them take steps in successive rounds. In each round, all agents take a random step according to some probability distribution P_step. Each agent then counts the number of other agents at its grid point in that round. After t rounds, each agent estimates the density d by d̃ = c_t / t, with c_t the total number of collisions observed in the t rounds. Note that the agents are modeled to be memoryless except for the total number of collisions. The full routine for each agent is captured in Algorithm 1.
R. S. Wezeman et al.
Fig. 1. Schematic Overview of the Grid with Different Agents Located in it, from [11]
Algorithm 1. Random-Walk-Based Density Estimation
  c := 0
  for i = 1, ..., t do
    step := generateRandomStep()        ▷ Sample a step from P_step
    position := position + step
    c := c + count(position)            ▷ Update collision count
  end for
  return d̃ = c / t
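Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' simulation code: the function name `estimate_density`, the simultaneous update of all agents, and the restriction to equally likely steps are assumptions made for the example.

```python
import random
from collections import Counter

def estimate_density(side, n_agents, steps, rounds, rng):
    """Run Algorithm 1 for all agents at once on a side x side torus.

    steps: list of equally likely (dx, dy) displacements (an assumption
    of this sketch; a weighted P_step would need weighted sampling).
    Returns one density estimate c / rounds per agent.
    """
    pos = [(rng.randrange(side), rng.randrange(side)) for _ in range(n_agents)]
    collisions = [0] * n_agents
    for _ in range(rounds):
        # Every agent takes one random step; opposite edges are glued (torus).
        new_pos = []
        for (x, y) in pos:
            dx, dy = rng.choice(steps)
            new_pos.append(((x + dx) % side, (y + dy) % side))
        pos = new_pos
        # count(position): number of *other* agents on the same node.
        occupancy = Counter(pos)
        for i, p in enumerate(pos):
            collisions[i] += occupancy[p] - 1
    return [c / rounds for c in collisions]

# Example: 26 agents on a 10 x 10 torus where every node is reachable in one step.
side = 10
all_steps = [(dx, dy) for dx in range(side) for dy in range(side)]
estimates = estimate_density(side, 26, all_steps, 2000, random.Random(0))
mean_estimate = sum(estimates) / len(estimates)
print(round(mean_estimate, 3))  # expected to be close to d = 25/100
```

Passing a different `steps` list changes the type of random walk without touching the estimation loop.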
Algorithm 1 introduces two functions. The first, generateRandomStep(), samples a random step from a probability distribution P_step. The second, count(position), counts all other agents currently located at the specified position and is used to update the collision count. We consider three different types of random walks as the basis for the probability distribution P_step: a uniform, a classical and a quantum-inspired random walk. Let a step that an agent can take from node (x1, y1) to node (x2, y2) be described by its x- and y-displacement. For each random walk, we let U denote the set of all possible steps an agent can take. We define the three considered random walks as follows:

1. Uniform random walk: All possible steps in U are equally likely, i.e., $P^{U}_{\mathrm{step}}(X) = 1/|U|$ for all steps X in U. We let U be the set of all steps that have a combined x- and y-displacement less than or equal to a given integer M. The corresponding random walk is denoted by Uniform M. The possible steps for a uniform random walk with M = 1 are given by U_1 = {(1, 0), (−1, 0), (0, 1), (0, −1), (0, 0)}. The case when M is sufficiently large, such that U contains the steps to all possible nodes, is sometimes referred to as the connected random walk, as in that case agents effectively traverse a fully connected graph. In the following figures we refer to this case as Connected.
2. Classical random walk: Here, too, U is the set of all steps that have a combined x- and y-displacement less than or equal to a given integer M, but the probability is not uniform. Instead it is given by $P^{M}_{\mathrm{step}}(X) = C_M(X)\,(1/5)^{M}$, where $C_M(X)$ is the number of different routes from the starting point to X in M steps from U_1. The corresponding random walk is denoted by Classical M and is the result of M consecutive single-step classical random walks.
3. Quantum(-inspired) random walk: Similar to the classical random walk, but the probability distribution is obtained from a two-dimensional quantum random walk with a maximum displacement M. We denote the resulting random walk by Quantum M. Due to the inherent interference property, quantum random walks tend to have a wider spread, with the expected distance from the start after M random steps being linear in M, or Ω(M) for short. This differs from the classical random walk with maximum displacement M discussed above, for which the spread scales as the square root of M, or O(√M). We take a weighted sum of four different quantum random walks such that the final probability distribution becomes symmetric. The obtained distribution has a high probability of staying close to the current position while also having a relatively large probability of making a large step, see Fig. 2d. Appendix A gives background information on quantum random walks together with the precise construction that we used to create the quantum-inspired probability distribution.

In Fig. 2, the probability distributions are plotted for the different random walks. We remark that the probability distribution of the quantum-inspired random walk appears to create a chess board pattern. This corresponds to what one would expect if in each step the agent has to move and is not allowed to remain at the current position. For an M-step quantum random walk with M even (odd), the probability to end up an odd (even) number of places away is zero. As a consequence, agents that are initialized an odd number of steps away from each other can never collide when using our quantum-inspired random walks. This effectively halves the grid size.
We deal with this issue by separating the agents into two groups. Each group is placed randomly on the grid, but in such a manner that agents within the same group are always an even number of steps apart while agents from different groups are always an odd number of steps apart. This problem does not occur for the uniform and classical random walks because we have included the possibility for an agent to stand still in our definition of these walks, (0, 0) ∈ U.

Fig. 2. Probability Distributions for Different Types of Random Walks and Different Values of Maximum Displacement M per Step (1 or 10)

Encounter-based density estimation using different types of random walks can never outperform the connected random walk in terms of how fast the estimate converges to the true value d. In the connected random walk, on average exactly d collisions occur in each round, independent of the previous rounds. This argument does not hold for the other random walks: the number of collisions in each round is correlated with the positions of all agents before taking a step. Depending on whether or not agents can reach each other within one time step, the average number of collisions for that time step will be higher or lower than d. As a consequence, the average number of collisions d can only be reached over multiple rounds, resulting in more variance and thus slower convergence of the density estimation. With this knowledge, we can compare different types of random walks with the connected random walk as a baseline. Our approach simulates agents performing Algorithm 1 for different types of random walks on different grid sizes. We compare the error of the density estimation of agents with respect to the number of steps taken by the agents.
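The Uniform M step set and the Classical M step distribution defined in this section can be sketched as follows. The helper names `uniform_steps` and `classical_distribution` are hypothetical, and exact fractions are used only to make the factor (1/5)^M visible; this is an illustration under those assumptions, not the authors' implementation.

```python
from collections import defaultdict
from fractions import Fraction

def uniform_steps(M):
    """All (dx, dy) with |dx| + |dy| <= M; Uniform M draws them equally likely."""
    return [(dx, dy) for dx in range(-M, M + 1)
            for dy in range(-M, M + 1) if abs(dx) + abs(dy) <= M]

def classical_distribution(M):
    """P(X) = C_M(X) * (1/5)^M, built by chaining M single steps from U_1."""
    u1 = uniform_steps(1)
    dist = {(0, 0): Fraction(1)}
    for _ in range(M):
        nxt = defaultdict(Fraction)
        for (x, y), p in dist.items():
            for dx, dy in u1:
                # Each of the five U_1 steps is taken with probability 1/5.
                nxt[(x + dx, y + dy)] += p / 5
        dist = dict(nxt)
    return dist

u1 = uniform_steps(1)           # the five steps of U_1
p2 = classical_distribution(2)  # Classical 2 distribution
print(p2[(0, 0)], p2[(1, 0)], p2[(2, 0)])
```

Both walks share the same support for a given M; only the weights differ, which is what the comparison in Fig. 2 visualizes.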
3 Results
As described in the previous section, we ran multiple simulations of Algorithm 1 for different grid sizes and different types of random walks. For each grid size and type of random walk, we simulate multiple independent instances. In each instance, all N + 1 agents independently produce a density estimate. For our simulations, we considered 1,000 instances for each combination of grid size and type of random walk. Furthermore, we fixed the number of agents to N + 1 = 26 in all simulations. We compare the density estimation of different random walks on the first 100,000 random walk rounds. In Fig. 3 the encounter-based density estimation is shown for each individual agent on a 10 × 10 grid for the first 20,000 steps. In this example, the true density of agents on the grid is d = N/A = 25/100. As a performance metric we define the average absolute relative error after t rounds as the sum of the absolute errors averaged over the agents and the instances, relative to the true density d:

$$\mathrm{Err}(t) = \frac{1}{26}\,\frac{1}{1000}\sum_{j=1}^{1000}\sum_{i=1}^{26}\frac{\bigl|\tilde{d}_{i,j}(t) - d\bigr|}{d},$$

where $\tilde{d}_{i,j}(t)$ is the encounter-based density estimation for agent i in instance j after t rounds. Alternative metrics are also possible, for example by averaging the density estimation error of each individual agent instead of averaging over all agents in one instance. The disadvantage of that metric, however, is that it fluctuates more and thus takes more instances before randomness is suppressed, and hence requires more simulation time. The global convergence behaviour in which we are interested, namely how different random walks with different step sizes on different grid sizes compare to each other, is not affected by the chosen metric. In Fig. 4 the average absolute relative error is shown for the uniform, classical and quantum-inspired random walk with different step sizes on an 80 × 80 grid. For each type of random walk, we observe that the encounter-based density estimation converges faster for larger step size. In Fig. 5 the connected and uniform random walk are compared for various grid sizes and numbers of steps. We see that the 5-step uniform random walk on a 40 × 40 grid, the blue line, performs almost as well as the connected random walk shown in grey. On a larger 80 × 80 grid, the 5-step uniform random walk does not perform nearly as well as before. Increasing the number of simultaneous steps taken by the random walk from 5 to 10 appears to result in performance similar to that of a connected random walk. This is as expected: by increasing the grid and the number of steps by the same factor, the same percentage of the total grid can still be covered in a single random step. As a consequence, a collision is possible with the same percentage of other agents. The probability of this is however smaller, as there are more locations. Similarly, the density is also smaller. The same pattern can also be observed for the two other types of random walks. The type of random walk affects the speed at which the encounter-based density estimation converges to d, as shown in Fig. 6.
Fig. 3. Encounter-Based Density Estimation for each of the 26 Agents on a 10 × 10 Grid, each Performing a Classical Random Walk. The True Density from the Perspective of each Agent is d = 0.25

Fig. 4. Average Absolute Relative Error for Uniform, Classical and Quantum-Inspired Random Walks for Different Step Sizes on an 80 × 80 Grid

Fig. 5. Average Absolute Relative Error for the Uniform Random Walk and the Connected Random Walk, for Different Grid Sizes and Number of Steps

Fig. 6. Average Absolute Relative Error for Different Types of Random Walks for Different Grid Sizes

The uniform random walk converges slightly faster to the true density than the quantum-inspired random walk, followed by the classical random walk. A possible explanation for this is that distributions with a wider spread allow for more collisions between agents that are initially farther apart. These distributions also have a smaller chance of a quick re-collision once a collision between two agents has occurred.
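The average absolute relative error Err(t) used throughout this section can be computed as in the sketch below. The nested-list layout `estimates[j][i]` (instance j, agent i) and the function name are assumptions chosen for the example.

```python
def average_absolute_relative_error(estimates, d):
    """Mean of |d_tilde - d| / d over all agents in all instances.

    estimates[j][i] is the density estimate of agent i in instance j
    after a fixed number of rounds t; d is the true density.
    """
    total = 0.0
    count = 0
    for instance in estimates:
        for d_tilde in instance:
            total += abs(d_tilde - d) / d
            count += 1
    return total / count

# Two instances with two agents each, true density 0.25.
example = [[0.25, 0.30], [0.20, 0.25]]
err = average_absolute_relative_error(example, 0.25)
print(err)
```

Evaluating this function for increasing t on stored estimates yields one convergence curve per random walk type, as plotted in Figs. 4 to 6.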
4 Conclusions
We extended the work by Musco et al. by considering different types of random walks and allowing for more steps to be taken. Their contribution is based on theoretical results showing that the encounter-based density estimation gives a good estimate of the true density. In this work we took a simulation-based approach and showed for different types of random walks that the encounter-based density estimation converges faster for random walks with larger step size. Furthermore, we showed that the type of random walk affects the rate at which the encounter-based density estimation converges. The distributions that are more likely to result in a wider spread, those of the uniform and quantum-inspired random walks, converge faster to the true density of agents on the grid than standard single-step random walks. A possible explanation is that agents located farther away are more likely to collide, while the number of re-collisions after a collision is reduced. An interesting follow-up question is the effect of non-uniform initial positions of the agents. Similarly, it is interesting to consider the effect of agents traversing a non-regular grid. Another possible direction is to consider the combined effect of different agents performing different types of (random) walks simultaneously. As further work, it would be interesting to increase the simulation size and apply our work to a more realistic application, for example, simulating energy management in IoT surveillance systems [5]. Another direction in which our work could be applied is the field of reinforcement learning, for example, to improve agent-based recommendation systems [10].
Appendix A Quantum Random Walk
A.1 Quantum Walk in One Dimension
A one-dimensional classical random walk is described by an agent that is allowed to move with integer steps on a one-dimensional line. The position of the agent after a certain number of rounds can be described by an n ∈ Z, with respect to its starting position. Each round the agent moves with equal probability to the left on the number line, n − 1, or to the right, n + 1. This definition differs slightly from the one in Sect. 2, as there agents are also allowed to stand still. In the quantum analogue, the position of a one-dimensional quantum agent is described by the quantum state |n⟩ with n ∈ Z. The most generic step that an agent in the state |n⟩ can take is given by

$$|n\rangle \to a\,|n-1\rangle + b\,|n\rangle + c\,|n+1\rangle, \tag{1}$$

where the parameters a, b and c are complex-valued. To guarantee the preservation of total probability, quantum mechanics only allows for unitary operations acting on states. It can be shown that no unitary transformation of this form exists in which more than one of the parameters a, b and c is non-zero. This means that only trivial movements are allowed: always moving to the left, always staying at the same position or always moving to the right. This problem of only having trivial movement is overcome by using an extra register. This register is often referred to in the literature as the coin state.
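The claim that at most one of a, b and c can be non-zero can be checked with a short argument. The sketch below assumes that the same coefficients are used at every position n (translation invariance, which the text implies but does not state) and writes U for the step operator, a symbol introduced here for illustration. Unitarity requires the images of distinct basis states to remain orthonormal:

```latex
\begin{align*}
U|n\rangle &= a\,|n-1\rangle + b\,|n\rangle + c\,|n+1\rangle,\\
\langle U n \,|\, U (n+2)\rangle &= \bar{c}\,a = 0,\\
\langle U n \,|\, U (n+1)\rangle &= \bar{b}\,a + \bar{c}\,b = 0,\\
\langle U n \,|\, U n\rangle &= |a|^2 + |b|^2 + |c|^2 = 1.
\end{align*}
```

The first condition forces a = 0 or c = 0. If c = 0, the second condition reduces to $\bar{b}a = 0$, so a = 0 or b = 0. If a = 0, it reduces to $\bar{c}b = 0$, so b = 0 or c = 0. In every case at most one coefficient is non-zero, and by the normalization condition it has modulus one.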
Define the state space as all states |n⟩ ⊗ |c⟩ in Z × span{|0⟩, |1⟩}. A quantum random walk is then defined by two successive unitary operations:

1. First apply what is known as the coin-flip operator C:

$$C\bigl(|n\rangle \otimes |0\rangle\bigr) = a\,|n\rangle \otimes |0\rangle + b\,|n\rangle \otimes |1\rangle$$
$$C\bigl(|n\rangle \otimes |1\rangle\bigr) = c\,|n\rangle \otimes |0\rangle + d\,|n\rangle \otimes |1\rangle$$

2. Followed by the shift operator S:

$$S\bigl(|n\rangle \otimes |0\rangle\bigr) = |n-1\rangle \otimes |0\rangle$$
$$S\bigl(|n\rangle \otimes |1\rangle\bigr) = |n+1\rangle \otimes |1\rangle$$

A single step of the quantum walk is given by an application of both operators. There is still freedom in choosing the exact coin operator C. The only restriction is that the corresponding matrix is unitary. A common choice for the coin operator is the Hadamard coin, given by

$$C = I_2 \otimes H = I_2 \otimes \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}.$$

In Fig. 7 the probability distribution of the first three steps for a classical random walk is compared to the probability distribution of a quantum random walk. The quantum walk starts from the initial state |0⟩ ⊗ |1⟩ and uses the Hadamard coin as coin operator. The classical random walk starts at the n = 0 position. After the first two steps of the random walk, the probability that the agent is found at a certain position is the same for both types. The random walks start to fundamentally differ from each other from the third step onward. The classical random walk keeps spreading symmetrically around the origin, splitting its probability equally to the left and the right. The quantum random walk appears to have a bias to the left and does not split its probability equally in the two directions. This effect is explained by quantum interference, where the amplitudes of specific states cancel. These differences become even more apparent when more random steps are taken. Both the coin operator and the initial superposition affect the direction of such a potential drift in the probability distribution. In Fig. 8 the probability distribution after a quantum random walk consisting of 100 steps is shown for the Hadamard coin with different initial superpositions.
These distributions show different behaviour compared to the binomial distribution expected classically. The bias in the probability distribution can be corrected by considering a different (symmetric) coin or a different initial position. However, two differences between the classical and quantum random walk remain, independent of the initial position or the coin used.

– After t steps, the position n at which the agent is most likely found is given by |n| ≈ ½√2 t for the quantum random walk and n ≈ 0 for the classical random walk.
– After t steps, the expected distance of the agent from the initial position is Ω(t) for the quantum random walk and O(√t) for the classical random walk [2].

Due to these properties, the quantum random walk spreads quadratically faster than its classical counterpart. Quantum random walks are used in numerous quantum algorithms to obtain quadratic or even exponential speedups, see for example [6,7,14].
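The one-dimensional Hadamard walk described above can be simulated directly from the coin and shift rules. The sketch below is illustrative only: the function name and the dictionary-based state representation are assumptions, and the direction of any drift it produces depends on the shift convention and the initial coin state used here.

```python
import math
from collections import defaultdict

def hadamard_walk(steps, coin0=0j, coin1=1 + 0j):
    """Position probabilities after `steps` steps of the 1D Hadamard walk.

    The walker starts at n = 0 with coin amplitudes (coin0, coin1).
    Coin |0> shifts to n - 1 and coin |1> shifts to n + 1.
    """
    h = 1 / math.sqrt(2)
    state = {(0, 0): complex(coin0), (0, 1): complex(coin1)}
    for _ in range(steps):
        nxt = defaultdict(complex)
        for (n, c), amp in state.items():
            # Hadamard coin: |0> -> (|0> + |1>)/sqrt(2), |1> -> (|0> - |1>)/sqrt(2)
            to0 = h * amp
            to1 = h * amp if c == 0 else -h * amp
            # Shift: coin 0 moves one place left, coin 1 one place right.
            nxt[(n - 1, 0)] += to0
            nxt[(n + 1, 1)] += to1
        state = nxt
    probs = defaultdict(float)
    for (n, _), amp in state.items():
        probs[n] += abs(amp) ** 2
    return dict(probs)

p2 = hadamard_walk(2)    # still matches the classical walk
p3 = hadamard_walk(3)    # interference makes the distribution asymmetric
p100 = hadamard_walk(100)
rms100 = math.sqrt(sum(n * n * p for n, p in p100.items()))
print(sorted(p3.items()))
print(rms100, math.sqrt(100))  # quantum spread versus the classical sqrt(t) scale
```

Comparing `rms100` with √t gives a quick numerical impression of the linear versus square-root spread stated in the two properties above.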
Fig. 7. Probability Distribution for First Steps of a Classical and a Quantum Random Walk
A.2 Quantum Walk in Two Dimensions
In this section we generalize the 1-dimensional quantum random walk to two dimensions based on the work of [3].
Fig. 8. Two Different Quantum Random Walks for the same Hadamard Coin Operator but Starting from a Different Initial Superposition. The Initial Superposition ½√2 |0⟩ ⊗ (|0⟩ + |1⟩) (Left) Versus the Initial Superposition ½√2 |0⟩ ⊗ (|0⟩ + i|1⟩) (Right).
An agent at the position (x, y) on the two-dimensional grid can be represented by a state (|x⟩ ⊗ |y⟩) ⊗ (|c1⟩ ⊗ |c2⟩) in Z² × span{|0⟩, |1⟩}², where the |ci⟩ are the coin states. We define the two-dimensional quantum random walk to consist of the following two unitary operations:

1. First apply what is known as the coin-flip operator C = C1 ⊗ C2:

$$C\bigl((|x\rangle \otimes |y\rangle) \otimes (|c_1\rangle \otimes |c_2\rangle)\bigr) = (|x\rangle \otimes |y\rangle) \otimes (C_1|c_1\rangle \otimes C_2|c_2\rangle)$$

2. Followed by the shift operator S:

$$\begin{aligned}
S\bigl((|x\rangle \otimes |y\rangle) \otimes (|0\rangle \otimes |0\rangle)\bigr) &= (|x+1\rangle \otimes |y\rangle) \otimes (|0\rangle \otimes |0\rangle)\\
S\bigl((|x\rangle \otimes |y\rangle) \otimes (|0\rangle \otimes |1\rangle)\bigr) &= (|x-1\rangle \otimes |y\rangle) \otimes (|0\rangle \otimes |1\rangle)\\
S\bigl((|x\rangle \otimes |y\rangle) \otimes (|1\rangle \otimes |0\rangle)\bigr) &= (|x\rangle \otimes |y+1\rangle) \otimes (|1\rangle \otimes |0\rangle)\\
S\bigl((|x\rangle \otimes |y\rangle) \otimes (|1\rangle \otimes |1\rangle)\bigr) &= (|x\rangle \otimes |y-1\rangle) \otimes (|1\rangle \otimes |1\rangle)
\end{aligned}$$

A single step of the two-dimensional quantum walk is given by the application of C followed by S. Note that there is now even more freedom in the choice of the coin operator C, which can be any tensor product of two 2-dimensional unitary transformation matrices, resulting in many different families of two-dimensional quantum random walks. In [3] the limiting sets and other properties are studied for three different types of coin operators. As in the one-dimensional case, the two-dimensional quantum random walk is likely to have a bias towards certain directions, depending on the coin operator and the initial coin state. It is, however, non-trivial to set the initial position to obtain a symmetric distribution for a given coin operator. As our goal is to make comparisons with symmetric random walks performed on a symmetric grid, we desire a symmetric walk. To obtain this, we averaged four different quantum random walks. We call the resulting distribution the quantum-inspired random walk distribution, which is constructed as follows:
1. Choose a coin-flip operator. We chose

$$C = (I_2 \otimes I_2) \otimes \frac{1}{2}\begin{pmatrix} 1 & -1 & -1 & -1 \\ -1 & 1 & -1 & -1 \\ -1 & -1 & 1 & -1 \\ -1 & -1 & -1 & 1 \end{pmatrix}.$$

2. Perform the M-step two-dimensional quantum random walk for the four initial states (|0⟩ ⊗ |0⟩) ⊗ (|i⟩ ⊗ |j⟩) with i, j ∈ {0, 1}.
3. Average the obtained probability distributions to obtain a symmetric distribution, see Fig. 9.
Fig. 9. The Four 10-Step Quantum Random Walks that are Combined with Equal Weights to Obtain the Quantum-Inspired Random Walk
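The three construction steps above can be sketched as follows. This is a hedged illustration on an unbounded grid: the names `SHIFTS`, `COIN`, `quantum_walk_2d` and `quantum_inspired_distribution` are hypothetical, the coin index is taken as k = 2·c1 + c2, and real amplitudes suffice because the coin matrix and the initial states are real.

```python
from collections import defaultdict

# Coin index k = 2*c1 + c2 and the displacement applied by the shift operator.
SHIFTS = [(1, 0), (-1, 0), (0, 1), (0, -1)]
# Coin matrix from step 1: entries 1/2 on the diagonal and -1/2 elsewhere.
COIN = [[0.5 if r == c else -0.5 for c in range(4)] for r in range(4)]

def quantum_walk_2d(steps, start_coin):
    """Position probabilities of the 2D walk started at (0, 0) in one coin state."""
    state = {(0, 0, start_coin): 1.0}
    for _ in range(steps):
        nxt = defaultdict(float)
        for (x, y, k), amp in state.items():
            for k2 in range(4):
                # Coin flip into coin state k2, then shift according to k2.
                dx, dy = SHIFTS[k2]
                nxt[(x + dx, y + dy, k2)] += COIN[k2][k] * amp
        state = nxt
    probs = defaultdict(float)
    for (x, y, _), amp in state.items():
        probs[(x, y)] += amp * amp
    return probs

def quantum_inspired_distribution(steps):
    """Equal-weight average over the four initial coin states."""
    avg = defaultdict(float)
    for k in range(4):
        for pos, p in quantum_walk_2d(steps, k).items():
            avg[pos] += p / 4
    return dict(avg)

dist = quantum_inspired_distribution(10)
print(len(dist), sum(dist.values()))
```

Because every step moves the walker, all positions in `dist` share the parity of the number of steps, which is the chess board pattern noted in Sect. 2.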
References
1. Adams, E.S.: Boundary disputes in the territorial ant Azteca trigona: effects of asymmetries in colony size. Anim. Behav. 39(2), 321–328 (1990)
2. Ambainis, A.: Quantum walks and their algorithmic applications. Int. J. Quantum Inf., 1 (2004)
3. Baryshnikov, Y., Brady, W., Bressler, A., Pemantle, R.: Two-dimensional quantum random walk (2010)
4. Bonabeau, E.: Agent-based modeling: methods and techniques for simulating human systems. Proc. National Acad. Sci. 99, 7280–7287 (2002)
5. Campanile, L., Gribaudo, M., Iacono, M., Mastroianni, M.: Hybrid simulation of energy management in IoT edge computing surveillance systems. In: Ballarini, P., Castel, H., Dimitriou, I., Iacono, M., Phung-Duc, T., Walraevens, J. (eds.) EPEW/ASMTA 2021. LNCS, vol. 13104, pp. 345–359. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-91825-5_21
6. Chakraborty, S., Luh, K., Roland, J.: On analog quantum algorithms for the mixing of Markov chains. ArXiv abs/1904.11895 (2019)
7. Childs, A., Cleve, R., Deotto, E., Farhi, E., Gutmann, S., Spielman, D.: Exponential algorithmic speedup by quantum walk. In: Conference Proceedings of the Annual ACM Symposium on Theory of Computing (2003)
8. Gordon, D.M.: Interaction patterns and task allocation in ant colonies, pp. 51–67. Birkhäuser Basel, Basel (1999)
9. Gordon, D.M., Paul, R.E., Thorpe, K.: What is the function of encounter patterns in ant colonies? Anim. Behav. 45(6), 1083–1100 (1993)
10. Mahadik, K., Wu, Q., Li, S., Sabne, A.: Fast distributed bandits for online recommendation systems. In: ICS '20. Association for Computing Machinery (2020)
11. Musco, C., Su, H.-H., Lynch, N.A.: Ant-inspired density estimation via random walks. CoRR abs/1603.02981 (2016)
12. Pratt, S.C.: Quorum sensing by encounter rates in the ant Temnothorax albipennis. Behav. Ecol. 16(2), 488–496 (2005)
13. Schafer, R.J., Holmes, S., Gordon, D.M.: Forager activation and food availability in harvester ants. Anim. Behav. 71(4), 815–822 (2006)
14. Szegedy, M.: Quantum speed-up of Markov chain based algorithms. In: 45th Annual IEEE Symposium on Foundations of Computer Science, pp. 32–41 (2004)
Self-organizing and Load-Balancing via Quantum Intelligence Game for Peer-to-Peer Collaborative Learning Agents and Flexible Organizational Structures
Ying Zhao1(B), Gabe Mata2, and Charles Zhou3
1 Naval Postgraduate School, Monterey, CA 93943, USA [email protected]
2 USMC, Albany, GA 31704, USA
3 Quantum Intelligence, Inc., Monterey, CA 93943, USA
Abstract. Military operations, particularly in the littorals, have been contested and challenging. It is imperative to develop tools and methods to help tactical units execute distributed operations. Distributed, collaborative and networked agents have been associated with peer-to-peer models. Military operation applications require each tactical unit to prioritize requests and recommend content and services so as to balance the load of a whole peer-to-peer network without total knowledge of, or communication with, the whole network, thereby effectively reducing the operation signatures of individual agents and avoiding detection by adversaries. This objective cannot be achieved using traditional collaborative filters or distributed bandit algorithms. We show innovative collaborative learning agents (CLAs) applied to distributed operations. In a distributed operation, each unit or node is represented as a single CLA. A unit can be a knowledge supplier (e.g., a capability or service) or a knowledge consumer (e.g., a demand or request from its peer units or environment). When a unit receives a new request, it searches its peer network for the best match to fulfil the request; meanwhile, the whole network constantly self-organizes and balances the workload of the nodes, lowers the signatures, and avoids detection. By employing CLAs, we first map the need to lower operation signatures to a load-balancing problem, then apply lexical link analysis (LLA) and the principles of quantum entanglement and superposition in a framework called the LLA quantum intelligence game (LLAQIG). LLAQIG optimizes the value of an agent itself and at the same time helps a peer network achieve the Nash equilibrium and attain the optimal total social welfare. We show a use case and data set for distributed transportation units handling transportation movement requests. We demonstrate that the peer groups newly formed using LLAQIG have a smaller load mean range between peer groups, a lower load standard deviation within groups, and a higher number of unique equipment utilized than other methods and random configurations. These metrics indicate that the new peer groups are less likely to be detected, with less movement of units within peer groups, therefore maintaining lower operation signatures. Military tactical units can potentially leverage the results for flexible command and control (C2), organizational structures, and modernization. A peer-to-peer military distributed operation equipped with CLAs allows units with traditional warfare capabilities of sensors, platforms, networks, weapons, and emerging technologies to optimize their value or load in an autonomous and self-organizing fashion with a smaller footprint.

Keywords: Collaborative Learning Agents · Distributed Computing · Peer-to-Peer Network · Self-Organizing · Load-Balancing · Lexical Link Analysis · LLA · Association Patterns · Data Mining · Lower Signatures · Small Footprints · Quantum Machine Learning · LLA Quantum Intelligence Game · LLAQIG

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): SAI 2023, LNNS 711, pp. 532–551, 2023. https://doi.org/10.1007/978-3-031-37717-4_33
1 Introduction
Military applications adopted distributed computing and peer-to-peer architectures early on and made them part of modern warfare strategy [1,2]. Resources have gradually migrated to distributed and mobile networks. However, contested environments are degraded environments caused by enemy actions, for example, laser and direct weapons, electronic warfare threats, and cyber attacks, or by failed systems or battle damage. In such environments, operational effectiveness is reduced by environmental limitations, e.g., unavailable, jammed, or unreliable network connectivity and bandwidth. Peer-to-peer models are more suitable for environments where resources have to assemble and reassemble on the fly without any fixed infrastructure. For example, to enhance total force readiness and project combat power across the wide range of operations and spectrum of conflict at any time, the U.S. Navy and Marine Corps (USMC) need a hybrid fleet, a mix of platforms, and plug-and-play capabilities to deliver effects across all domains [2,3]. Distributed Maritime Operations (DMO) for the Navy and Expeditionary Advanced Base Operations (EABO) for the USMC are essential concepts to maintain C2 superiority [4]. Another key capability for distributed operations is to maintain lower signatures and avoid detection. For example, the DMO and EABO concepts consider not only offensive capabilities for winning battles, but also the abilities to counter-detect, confuse the enemy, and maintain lower signatures and footprints in contested environments. In computer science, load-balancing means distributing a set of tasks over a set of resources or units so that the overall process is more efficient. In order to maintain a lower signature and smaller footprint and avoid being targeted and detected, it is important for a distributed operation to alternate the usage of the units and thus avoid using some nodes too often.
Peer-to-peer systems can self-organize and often use algorithms and techniques inspired by naturally occurring biological phenomena. These bio-inspired solutions can be adaptive and resilient to component failure [5]. As we describe in this paper, peer-to-peer systems can also self-organize to achieve load-balancing. The major contributions of this paper are summarized as follows:
Y. Zhao et al.
1. Achieving a distributed infrastructure for flexibility using collaborative learning agents (CLAs), which consist of a distributed, networked, peer-to-peer agent architecture and analytics. We show distributed operations empowered by CLAs that allow flexible combinations of capabilities.
2. Using a USMC transportation scenario and data set as an example, we show how to distribute the tactical units’ operations so as to maximize the benefits and lower the signatures that lead to detection and targeting by adversaries. We first map the need to lower operation signatures to a load-balancing problem in a peer-to-peer network, and then apply CLAs and lexical link analysis (LLA) to solve the problem. We also apply the principles of quantum entanglement and superposition from quantum computing and quantum games in the LLA quantum intelligence game (LLAQIG) framework. The LLAQIG reaches the Nash equilibrium at which each individual agent or unit maximizes its value and, at the same time, achieves the total social welfare for the whole peer-to-peer system in an unsupervised and self-organizing fashion.
The rest of this paper is organized as follows. Section 2 reviews related literature and the state-of-the-art distributed peer-to-peer algorithms. Section 3 describes the main methods used in our system. Section 4 describes the use case and data sets. Section 5 presents the use case results. Section 6 discusses the results and their relation to other methods, while Sect. 7 concludes the paper.
2
Literature Review
In a traditional distributed model, federated networking and machine learning (ML) enable multiple agents to build a global model without sharing data [6]. A federated operation needs some centralized management of resources and data, and peers usually cannot exchange data directly even under the same federation (e.g., Google’s or Apple’s mobile phone data models are federated). Another type of distributed model is the peer-to-peer approach, where a consensus model [8] or a gossip model [7] builds a global model without central servers. For example, a consensus approach builds an ML model that minimizes the prediction error over the union of all data sets. In a gossip model, each agent has a personalized objective, and the goal is to collaboratively improve that objective by leveraging information from the peers. It interweaves learning and propagation by optimizing a trade-off between the smoothness of the model parameters over the network of peers and the models’ prediction accuracy on the local data sets [7]. A consensus model needs the coordination of all the agents to agree on a common business value, as in PageRank, opinion formation, smart power grids, control of UAVs, load balancing, and blockchain [8]. The research is also related to applications that use classical collaborative filtering and content-based filtering methods to recommend items to users (e.g., recommending movies to users). These recommendation engines learn static recommendation models from training data using supervised machine learning (ML). The problem for supervised ML approaches is that the models
Self-organizing and Load-Balancing
are trained offline, while real systems require dynamically changing models to address user-item preferences and content in real time. So-called “bandit learners” and reinforcement learning address this: a bandit or agent recommends an action and uses the reward or feedback from the environment as guidance for learning, e.g., for adjusting the internal models’ parameters. Collaborative filtering bandit algorithms learn the preferences of similar users for similar items. Content-based filtering uses similarities in products, services, or content features, as well as information accumulated about the users, to make recommendations. Clusters of users and items are based on knowledge graphs of users and items. These methods are not suitable for highly dynamic recommendations when users, items, and content are very fluid [9]. If the number of possible recommendable actions is large, e.g., recommending an item to a user from a large item collection, distributed, peer-to-peer, and parallel computation is required for speedy, dynamic recommendations. These algorithms are also referred to as contextual bandit algorithms: content can change rapidly, and the algorithms learn and update in real time the latent mappings or preferences between users and items, using their interactions as the reward signals, e.g., when a user clicks on an item, a recommendation agent or “bandit” receives a reward. After such learning, the agent tends to select actions that provide higher rewards, thus reinforcing better user-item mappings. The latent mappings, which can be clusters of users or items, typically need to look at all the data at once, i.e., all peers in a peer-to-peer network need to share data among distributed workers, which makes them intrinsically difficult to scale up in parallel. A novel distributed bandit-based algorithm lazily creates clusters in a distributed manner and dramatically reduces the network data sharing requirement, achieving high scalability [10].
When all the peers are solving the same linear bandit problem, i.e., finding a global set of linear coefficients for a chosen action as a reward function, discovering the clusters of peers allows the optimal asymptotic performance to be achieved when the linear coefficients vary across the clusters [11].
3
Methods
The differences between our system and the reviewed systems can be summarized as follows. Firstly, our system, as described in this section, contains self-organizing and unsupervised ML algorithms that do not explicitly use the feedback of the environment to guide a learning process. The information sharing among the peers in a network is based on fusing the statistics of the content’s similarity and dissimilarity, where content includes information about both information consumers (e.g., users) and capability providers (e.g., items). Therefore, the system scales up automatically, because correlated content is embedded in the patterns of a single knowledge graph of all the types of objects. Secondly, our system updates clusters of content (e.g., topics and themes, not clusters of peers) periodically and in an online fashion. Thirdly, each agent in our system recursively
computes clusters of content in its own network; the computation is therefore naturally parallel and context-dependent, and is not restricted to the linear bandit optimization [11]. Finally, our application requires each agent to prioritize and recommend content so as to balance the load of the whole peer-to-peer network without total knowledge of, or communication with, the whole network, which effectively reduces the operation signatures of an individual agent and helps it avoid detection by adversaries. This objective cannot be achieved using the traditional distributed bandit algorithms.
3.1
Collaborative Learning Agent (CLA)
A CLA is the basic structure of our system. A single agent represents a single system capable of ingesting data, indexing and cataloging information, performing knowledge and pattern discovery, learning from data, and separating patterns from anomalies. A single CLA can represent an operational unit that possesses certain capabilities. Multiple CLAs work collaboratively in a peer-to-peer network, and each agent has a peer list. In more detail, a CLA first indexes the data (structured and unstructured data sources) locally using unsupervised ML and data mining algorithms, discovers knowledge patterns, and then fuses the local models with the models of its peers. The models and indexes are therefore available to the whole peer network. The collaboration of a network of CLAs is achieved through the peer list defined within each agent, through which each agent passes shared information only to its peers. The CLA network and collaboration mechanism are fault-tolerant and self-organizing. CLAs have been used in organizational collaboration applications [12] and in building swarm intelligence systems for health monitoring of systems of systems such as naval ships and the internet of things (IoT) [13]. Figure 1 shows a schematic network of CLAs used in distributed operations such as DMO or EABO. Each unit or node is represented as a single CLA. A unit can be a capability supplier, consumer, or broker. The patterns of content, data, and links in each unit are indexed and learned from historical data. When using CLAs, each unit first builds a content and peer network to index, data-mine, and fuse data locally and from its peer network, and then classifies behavior into patterned, emerging, and anomalous themes. When a unit receives a new request for capabilities, it searches its peer network and finds the best match to fulfil the request, while balancing the workload to lower the signatures and avoid detection.
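The request-matching behaviour described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the `Peer` record, the `route_request` function, the unit and equipment names, and the load counts are all hypothetical. The one rule taken from the text is that a request goes to a peer that offers the requested capability, preferring the peer with the lowest current load.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    """Hypothetical record for one unit in an agent's peer list."""
    unit_id: str
    capabilities: set = field(default_factory=set)
    load: int = 0  # number of requests handled so far


def route_request(peers, capability):
    """Return the least-loaded peer offering `capability`, or None."""
    candidates = [p for p in peers if capability in p.capabilities]
    if not candidates:
        return None
    chosen = min(candidates, key=lambda p: p.load)
    chosen.load += 1  # the chosen peer absorbs the new request
    return chosen


# Toy peer list (made-up names and loads).
peers = [
    Peer("unit_a", {"equip_1"}, load=40),
    Peer("unit_b", {"equip_1", "equip_2"}, load=5),
    Peer("unit_c", {"equip_2"}, load=0),
]
first = route_request(peers, "equip_1")
```

Here `unit_b` is chosen over `unit_a` because both offer `equip_1` but `unit_b` carries less load. A fuller sketch would rank candidates by the Fuse value V(t, j, c) rather than by raw load.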
The sorting and ranking mechanism of the units in a peer network is the key to lowering the signatures. CLAs are related to peer-to-peer systems, yet have the following major differences:
– CLAs’ overlay structures are unstructured: agents join the network through the peer lists of existing agents. A CLA’s peer list can self-organize by close proximity or by the deliberation of the LLAQIG algorithm.
– CLAs’ index and search are performed for each peer network in each agent without a dedicated or dominant agent. Therefore, CLAs’ learning is not
federated learning. Each agent keeps a list of knowledge bases; however, the list is not global like distributed hash tables (DHTs) [14]. Each agent’s knowledge base is only relevant to its own peer network. Each agent performs unsupervised learning to discover patterns and anomalies.
– CLAs self-organize to achieve workload balance, according to the principle of quantum entanglement and superposition of the value interaction between patterned, emerging, and anomalous themes.
A CLA j includes an analytic engine with two algorithms, an AI/ML (Mine) algorithm and a fusion (Fuse) algorithm, which can be customized externally:
– A Mine algorithm integrates the local knowledge base b(t, j) and the global knowledge base B(t − 1, j), which is global only to its peer network, into a new knowledge base B(t, j).
– A Fuse algorithm assesses the total value of Agent j by separating the total knowledge base into the categories of patterns and anomalies, and predicts a total value V(t, j, c) for each piece c of the content in Agent j.
The whole process is illustrated in Algorithm 1, where p(j) represents the peer list of Agent j. The total value V(t, j, c) is used in the global sorting and
Fig. 1. Distributed Operations and CLAs: Each Unit or Node is Represented as a Single CLA. A Unit can be a Capability Supplier, Consumer, or Broker. The Patterns of Content, Data, Links in each Unit are Indexed and Learned from Historical Data. When using CLAs, each Unit First Builds a Content and Peer Network to Index, DataMine, and Fuse Data Locally and from its Peer Network, then to identify Behavior Patterns and Groups of Units and Capabilities. When a Unit Receives a New Request of Capabilities, it Searches its Peer Network and Finds the Best Match to Fulfil the Request, Meanwhile Balances the Work Load to Lower the Signatures and Avoids Detection.
ranking of relevant content c in Agent j when it publishes the content to its peer network.
Algorithm 1. Mine and Fuse in a Single CLA j
while t ≥ 0 do
  for each i in Agent j’s peer list p(j) do
    B(t, j) ⇐ Mine(B(t − 1, i), b(t, j))
  end for
  for each content c(j) in Agent j do
    V(t, j, c(j)) ⇐ Fuse(B(t, j), c(j))
  end for
end while
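One time step of the Mine-then-Fuse loop in Algorithm 1 might look like the sketch below. It is a sketch under assumptions rather than the authors' code: `cla_step`, `toy_mine`, and `toy_fuse` are hypothetical names, knowledge bases are reduced to plain Python sets, and the choice to accumulate the result across peers (instead of overwriting it on each pass) is one possible reading of the loop.

```python
def cla_step(local_kb, peer_kbs, contents, mine, fuse):
    """Run one time step of Mine followed by Fuse for a single agent.

    local_kb  : b(t, j), the agent's new local knowledge
    peer_kbs  : {peer_id: B(t-1, i)} for every peer i in p(j)
    contents  : content items c held by the agent
    mine, fuse: externally supplied callables (both are customizable)
    """
    merged = local_kb
    for peer_kb in peer_kbs.values():
        # Assumption: fold each peer's previous knowledge base in turn.
        merged = mine(peer_kb, merged)
    values = {c: fuse(merged, c) for c in contents}
    return merged, values


def toy_mine(peer_kb, kb):
    """Toy Mine: union of two sets of word features."""
    return kb | peer_kb


def toy_fuse(kb, content):
    """Toy Fuse: count the content's features known to the knowledge base."""
    return len(kb & content)


doc_1 = frozenset({"a", "c"})
doc_2 = frozenset({"z"})
kb, values = cla_step(
    local_kb={"a", "b"},
    peer_kbs={"p1": {"b", "c"}, "p2": {"d"}},
    contents=[doc_1, doc_2],
    mine=toy_mine,
    fuse=toy_fuse,
)
```

Passing `mine` and `fuse` as arguments mirrors the statement that both algorithms can be customized externally. In the paper's setting LLA plays both roles.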
Algorithm 1 can run continuously over time and in parallel for each Agent j. To perform a Fuse algorithm, the association list B(t, j) is computed recursively by a Mine algorithm, with the computation distributed among multiple agents, as shown in Fig. 2 and Algorithm 2.
3.2
Mine Algorithm
A CLA outputs a knowledge base B(t, j) that contains two components, as shown in Fig. 2. The first component is an association list, which contains pairwise correlations or associations between two word features. Word features are universal vocabularies and the basic elements used by all the agents to form concepts in their knowledge bases. The second component is a context/concept list, which contains context and concept pairs. Contexts can also be concepts such as timestamps, geo-locations, or universal file locators that are common to all the agents. In the CLA’s Mine step, if Agent j’s local model b(t, j) shares word feature pairs with the knowledge bases passed to it by its peers, i.e., B(t − 1, p(j)), a
Fig. 2. A CLA’s Mine and Fuse Algorithm Relations
Mine algorithm simply modifies and updates the local association list to reflect the local data. Otherwise, when the two agents do not share existing word feature lists but potentially share some contexts, so-called context learning is performed for each agent. The Mine algorithm is shown in Algorithm 2.
Algorithm 2. Mine Algorithm Recursion for CLA j
for each i in Agent j’s peer list p(j) do
  for each k in the local new knowledge base do
    if b(t, j, k) in B(t − 1, j) then
      B(t, j, k) ⇐ B(t − 1, i, k) ∪ b(t, j, k)
    else
      for a shared context ct of Agent j and Agent i do
        B(t, j, ct) ⇐ B(t − 1, i, ct) ∪ b(t, j, ct)
      end for
    end if
  end for
end for
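The merge in Algorithm 2 can be illustrated for a single peer as follows. This is a loose sketch under assumptions: knowledge bases are modelled as plain dictionaries from a key (a word-feature pair or a context label) to a set of associated items, the membership test is made against the peer's knowledge base as the surrounding prose describes, and `mine_merge` and the sample keys are hypothetical.

```python
def mine_merge(local, peer, shared_contexts):
    """Merge one peer's previous knowledge base into the local one.

    local, peer     : dicts mapping a key to a set of associated items
    shared_contexts : context labels known to both agents
    """
    merged = {key: set(items) for key, items in local.items()}
    for key, items in local.items():
        if key in peer:
            # Shared word-feature pair: union the two association sets.
            merged[key] = peer[key] | items
        else:
            # No shared pair: fall back to learning through contexts
            # that both agents have in common.
            for ctx in shared_contexts:
                merged[ctx] = peer.get(ctx, set()) | merged.get(ctx, set())
    return merged


local_kb = {("fuel", "truck"): {"r1"}, ("oil", "truck"): {"r2"}}
peer_kb = {("fuel", "truck"): {"r9"}, "grid_7": {"convoy"}}
merged_kb = mine_merge(local_kb, peer_kb, shared_contexts=["grid_7"])
```

In the paper the association list holds statistical strengths rather than raw sets, so the union here only stands in for whatever update rule the chosen Mine algorithm applies.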
Both Mine and Fuse algorithms can be customized externally. LLA is used as both a Mine and a Fuse algorithm. In LLA, a complex system is expressed as a list of attributes or features, with specific vocabularies or lexical terms describing its characteristics. LLA is a data-driven text analysis. For example, word pairs or bi-grams as lexical terms can be extracted and learned from a document repository. LLA automatically discovers word features and clusters of word features, and displays them as word feature networks. LLA is related to, but significantly different from, so-called bag-of-words (BOW) methods such as Latent Dirichlet Allocation (LDA) [15]. The use of bi-grams allows LLA to be extended to numerical, categorical, or time series data. For example, for structured data such as attributes from databases, LLA discretizes and then categorizes attributes and their values into word-like features. LLA computes the counterfactual proportion difference in Eq. (1) as the strength of association between two items for Agent j:

$$cf_{lk}(j) = \left[ p(w_l \mid w_k) - p(w_l \mid \text{not } w_k) \right] \times PN \tag{1}$$

where $p(w_l \mid w_k)$ is the probability that the word feature $w_l$ occurs in the same context in which the word feature $w_k$ occurs. It also depends on Agent j and timestamp t, which are omitted in Eq. (1) for brevity. PN is the pooled sample size computed in Eq. (2):

$$PN = p(w_l \mid w_k)\, p(w_l \mid \text{not } w_k) \left( \frac{1}{N_{w_l \mid w_k}} + \frac{1}{N_{w_l \mid \text{not } w_k}} \right) \tag{2}$$

$cf_{lk}$ is a z-score [16]. If $cf_{lk} > 1.96$, then the link between word features $w_l$ and $w_k$ is statistically significant and causal (p-value < 0.05). The output
of the Mine algorithm is the statistically significant association matrix for all word feature pairs $(w_l, w_k)$, $l = 1, \ldots, W$; $k = 1, \ldots, W$, where W is the total number of unique word features in Agent j’s peer network, as shown in Eq. (3):

$$B(t, j) = \begin{bmatrix} cf_{11}(t, j) & \cdots & cf_{1W}(t, j) \\ \vdots & cf_{lk}(t, j) & \vdots \\ cf_{W1}(t, j) & \cdots & cf_{WW}(t, j) \end{bmatrix} \tag{3}$$

All the entries in Eq. (3) satisfy $cf_{lk}(t, j) > 1.96$.
3.3
Fuse Algorithm for Ranking Content and Agents
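The Fuse step described in this section consumes the association matrix B(t, j) of Eq. (3), whose entries are described as z-scores thresholded at 1.96. As a rough illustration of how one such entry might be obtained, the sketch below uses the textbook pooled two-proportion z-statistic. This is a stand-in chosen for illustration and may differ in detail from Eqs. (1) and (2); the function name and the counts are hypothetical.

```python
import math


def two_proportion_z(hits_with, n_with, hits_without, n_without):
    """Pooled two-proportion z-statistic (a stand-in for cf_lk).

    hits_with    : contexts containing both w_l and w_k
    n_with       : contexts containing w_k
    hits_without : contexts containing w_l but not w_k
    n_without    : contexts not containing w_k
    """
    p_with = hits_with / n_with
    p_without = hits_without / n_without
    pooled = (hits_with + hits_without) / (n_with + n_without)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_with + 1 / n_without))
    if std_err == 0:
        return 0.0  # degenerate case: the feature occurs always or never
    return (p_with - p_without) / std_err


# Made-up counts: w_l appears in 30 of 50 contexts with w_k,
# and in 20 of 100 contexts without it.
z = two_proportion_z(30, 50, 20, 100)
is_significant = z > 1.96
```

With these toy counts the two proportions are 0.6 and 0.2, and the statistic clears the 1.96 threshold, so the pair would be kept as a link.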
A Fuse algorithm for Agent j is used to sort and rank the content, i.e., to compute the value V(t, j, c) for each content c based on its knowledge base B(t, j) at time t, and then to aggregate the scores to rank Agent j. A Fuse algorithm needs to look at the whole knowledge base B(t, j), which includes an association list of word feature pairs and a context/concept list from Algorithm 2. In traditional network theory, the importance of a node in a network can be ranked using established hyperlinks, citation networks, social networks, or other collective intelligence marked by humans. However, few or no hyperlinks are available for private or proprietary data. Furthermore, what counts as high-value information differs between applications. Current methods mainly score patterned information, which is useful for certain types of applications, e.g., marketing. Anomalous information is important for other types of applications, e.g., intelligence analysis, novelty search, and load-balancing. When a peer acting as a content provider in a peer-to-peer network crowd-sources from its peers as the content consumers, the interaction is modeled as a strategic cooperation game of two players. The content provider’s search for its own best value, a Nash equilibrium, may not achieve full Pareto efficiency, i.e., the so-called optimal social welfare for the whole peer network; this is referred to as the Prisoner’s dilemma. It is necessary to apply quantum computing and game theory properties to escape the Prisoner’s dilemma, i.e., to reach both the Nash equilibrium and optimal social welfare [17].

Fuse Algorithm Using Eigenvector. The algorithm examines the knowledge base association matrix B(t, j) as a whole to rank the nodes (i.e., the word features of Agent j representing its capability and content) according to the global structure when the peers start to self-organize. Assume the initial load of the nodes is as shown in Eq. (4):

$$u^0 = \begin{bmatrix} u^0_1 \\ u^0_2 \\ \vdots \\ u^0_i \\ \vdots \\ u^0_N \end{bmatrix} \tag{4}$$
And u in Eq. (5) represents the load of the nodes before they reach the equilibrium:

$$u = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_N \end{bmatrix} \tag{5}$$

N is the number of units. The ranking is adjusted iteratively, which can be viewed as self-organizing over time before a new local content b(t + 1, j) becomes available. A fixed point is achieved in Eq. (6):

$$B u^* = u^* \tag{6}$$

where $B = \frac{1}{\lambda_{\max}} B(t, j)$, and B(t, j) is the association matrix computed from Algorithm 2 and Eq. (3). $\lambda_{\max}$ is the maximum eigenvalue of B(t, j), so the maximum eigenvalue of B is 1. B(t, j) is a primitive matrix in the sense of the Perron-Frobenius theorem [18]: if a matrix is non-negative (i.e., all its elements are non-negative real numbers) and its m-th power is positive (i.e., all its elements are positive) for some natural number m, with the same m working for all pairs of indices, then the eigenvalue of B(t, j) with the maximum magnitude, i.e., its spectral radius, is positive: $\lambda_{\max} > 0$. The self-organized ranking $u^*$ is the eigenvector corresponding to the maximum absolute eigenvalue of B [19,20], i.e., 1. The ranking change $u^0$ propagates through the peer network, and every node can reach every other node when t continues long enough, i.e., when the network is irreducible. The final new load for all the nodes when the network reaches the equilibrium can be expressed as in Eq. (7):

$$u_{fixed} = \sum_{i=0}^{N} u^0_i \begin{bmatrix} {u^*_1}^2 \\ {u^*_2}^2 \\ \vdots \\ {u^*_i}^2 \\ \vdots \\ {u^*_N}^2 \end{bmatrix} \tag{7}$$
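The fixed point of Eq. (6) is the dominant eigenvector of the association matrix, which can be approximated by power iteration. The sketch below is a generic illustration and not the authors' code: `power_iteration` is a hypothetical helper, the 3 × 3 matrix contains toy numbers rather than z-scores from the data set, and the iterate is normalized to sum to 1 purely for convenience.

```python
def power_iteration(matrix, iterations=200, tol=1e-12):
    """Approximate the dominant eigenvalue and eigenvector of a
    non-negative primitive square matrix (given as a list of rows)."""
    n = len(matrix)
    u = [1.0 / n] * n  # start from a uniform load
    eigenvalue = 0.0
    for _ in range(iterations):
        v = [sum(matrix[r][c] * u[c] for c in range(n)) for r in range(n)]
        norm = sum(abs(x) for x in v)
        if norm == 0:
            raise ValueError("matrix maps the iterate to zero")
        v = [x / norm for x in v]
        converged = max(abs(a - b) for a, b in zip(u, v)) < tol
        # Because u sums to 1, the sum of B u estimates lambda_max.
        u, eigenvalue = v, norm
        if converged:
            break
    return eigenvalue, u


# Toy symmetric association matrix; node 1 is the best connected.
B = [[1.0, 2.0, 0.0],
     [2.0, 1.0, 1.0],
     [0.0, 1.0, 1.0]]
lam_max, u_star = power_iteration(B)
ranking = sorted(range(len(u_star)), key=lambda i: u_star[i], reverse=True)
```

For this matrix the dominant eigenvalue is 1 + √5, and the best-connected node comes first in `ranking`, which is exactly the popularity bias that eigenvector-style ranking introduces.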
Equation (7) is related to the current network node ranking mechanism, in which popular and highly connected nodes are dominant and ranked high. Nodes that are ranked higher carry a higher workload and are therefore more likely to be detected.

Fuse Algorithm Using LLA. To avoid detection and balance the load, LLA can be used as a Fuse algorithm for ranking the content. The word features in LLA are clustered into groups or themes using community detection algorithms [19]:
– Patterned (P) themes: A patterned theme is more likely to be shared across multiple diversified domains; such themes are already part of the public consensus and awareness and can be authoritative.
– Emerging (E) themes: These themes tend to become patterned over time.
– Anomalous (A) themes: These themes may not seem to belong to the data domain as compared to others. They can be interesting, unique, or innovative to specific entities, may be high-value, and may need further investigation.
Let P(t, j, c) and A(t, j, c) be the values of a content c computed from the patterned (P) themes and from the anomalous (A) themes of LLA, respectively, for Agent j at time t. The total value V(t, j, c) for c is a function of P(t, j, c) and A(t, j, c), as shown in Eq. (8):

$$V(t, j, c) = f(P(t, j, c), A(t, j, c)) \tag{8}$$

We apply the principle of quantum entanglement and superposition from quantum computing and quantum games to learn and select the function f in Eq. (8). The framework allows each agent to reach the Nash equilibrium that maximizes its value while achieving the optimal total social welfare for its peer-to-peer system as a whole, in an unsupervised fashion. This contribution is different from current federated learning and distributed learning.

Game-Theoretic Properties of LLA. In traditional game theory, a player’s search for its own best reward is modeled as reaching a Nash equilibrium, which, however, may not achieve full Pareto efficiency or optimal social welfare for all other players; this is referred to as the Prisoner’s dilemma. When comparing one Nash equilibrium to another, the one that has a higher social welfare is the Pareto superior solution. A self-player agent or unit is more successful in terms of the crowd-sourcing response to a piece of information, e.g., published content, if its strategy results in a higher social welfare. A high-value content has to achieve a Nash equilibrium, i.e., maximize its own reward, while being Pareto efficient or superior. It is necessary to apply quantum computing and quantum game properties to meet the two requirements simultaneously. We show a connection between LLA, quantum computing, and quantum game theory through which both requirements are satisfied.
A correlation is computed between a content provider x’s content and its opponent y’s content, i.e., the overlapping word features that x and y possess. The opponent of the content provider is the population of content consumers or the crowd-sourcing audience. Assume the total sets of word features for the peer network of Agent j (or x), categorized into patterned and anomalous themes, are $P_y$ and $A_y$, respectively. The numbers of word features from x that belong to $P_y$ and $A_y$ are $p_x$ and $a_x$, respectively. When a self-player acts as a content provider seeking value from a cooperative opponent, the self-player also has to be Pareto efficient or superior in order to obtain the right response from the opponent, i.e., the content consumers or crowd-sourcing audience. Being Pareto efficient means that the system cannot make at least one player better off without making any other player worse off, i.e., it achieves the so-called optimal social welfare.
LLA Quantum Intelligence Game (LLAQIG). In contrast to a traditional quantum game, we apply geometric and probabilistic quantum superposition and entanglement to the pattern and anomaly themes of LLA to determine the quality, value, or impact of a content. The resulting quantum machine learning system is a quantum intelligence game, or LLAQIG. A content provider (self-player) and its crowd-sourcing audience (opponent) always have a degree of entanglement, since their knowledge bases are correlated and overlap, which is what constitutes the content propagation and delivery. The degree of correlation is $p_x$, corresponding to the entanglement $\gamma$ in the traditional quantum game. Formulated in terms of quantum game theory, this is different from current federated learning, distributed learning, or quantum machine learning. For two pure strategies C and D as in a quantum game, a quantum superposition is shown in Eq. (9):

$$|X\rangle = c_0 |C\rangle + c_1 |D\rangle \tag{9}$$

where the coefficients $c_0$, $c_1$ are complex numbers, with

$$c_0 |C\rangle = e^{i\theta}, \qquad c_1 |D\rangle = b e^{i\phi} \tag{10}$$

The magnitude of the superposition as a total reward, i.e., the total social welfare $\rho$, for the LLAQIG is

$$\rho = \langle X | X \rangle = 1 + b^2 + 2b \cos(\theta - \phi) \tag{11}$$
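The closed form in Eq. (11) can be checked numerically by adding the two complex components of Eq. (10) and taking the squared modulus. This is a small sanity check rather than part of the authors' method; the function names and the sample values of b, θ, and φ are arbitrary.

```python
import cmath
import math


def social_welfare(b, theta, phi):
    """Squared modulus of e^{i*theta} + b*e^{i*phi}, following Eq. (10)."""
    amplitude = cmath.exp(1j * theta) + b * cmath.exp(1j * phi)
    return abs(amplitude) ** 2


def closed_form(b, theta, phi):
    """Right-hand side of Eq. (11)."""
    return 1 + b ** 2 + 2 * b * math.cos(theta - phi)


rho_numeric = social_welfare(0.5, 0.3, 1.1)
rho_formula = closed_form(0.5, 0.3, 1.1)
rho_aligned = social_welfare(0.5, 0.7, 0.7)  # theta equals phi
```

The two routes agree to floating-point precision. When θ = φ the cosine term is largest, so ρ peaks at (1 + b)^2 and the modulus of the superposition equals 1 + b.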
Fig. 3. Superpositioned Patterned (P) and Anomalous (A) Themes. The Highest-Value of a Self-Player is Achieved Both at the Nash Equilibrium and Maximal Total Social Welfare. Each CLA is a Self-Player x, its Peer Network is the Opponent y.
Among the four points $(x_1, y_1)$, $(x_2, y_2)$, $(x_3, y_3)$, and $(x_4, y_4)$ on the superpositioned circles in Fig. 3, $(x_1, y_1)$ is the unique Nash equilibrium for all
given cos θ: neither player can unilaterally improve its reward, because the two players are correlated and entangled. The Nash equilibrium $(x_1, y_1)$ also reaches the maximal total social welfare 1 + b, with θ = φ. This is the gain from the quantum effect. In summary, for a new content c from a content provider, and assuming that $P_y(t, j)$ and $A_y(t, j)$ are the numbers of patterned and anomalous word features for Agent j and its peer network as content consumers y, the value of the content c from LLAQIG is computed in Algorithm 3.
Algorithm 3. LLAQIG Algorithm
while t ≥ 0 do
  [P_y(t, j), A_y(t, j)] ⇐ LLA[B(t, j)]
  for each new content c of Agent j do
    V(t, j, c) ⇐ [1 + a_x(t, j, c)/L] · p_x(t, j, c)/L
  end for
end while
$p_x(t, j, c)$ and $a_x(t, j, c)$ are the numbers of patterned and anomalous word features for the content c, respectively; c depends on j, which is omitted for brevity. L is a normalization factor, e.g., the length of a sentence or message of a content. $p_x(t, j, c)/L$ represents the degree of entanglement of content c with other content in Agent j’s peer network. Intuitively, ranking and sorting the nodes using V(t, j, c) gives extra weight to content (representing a capability or service) with anomalous features, so that such content is more likely to be matched in a search when a new demand request arrives. This balances the load of the naturally highly connected nodes, i.e., those with patterned features.
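A minimal sketch of the scoring step is given below, assuming the update in Algorithm 3 takes the product form V = (1 + a_x/L) · (p_x/L). That form is an assumption made for illustration, as are the function name `llaqig_value` and the feature counts.

```python
def llaqig_value(p_x, a_x, length):
    """Score one piece of content from its feature counts.

    p_x    : number of patterned word features in the content
    a_x    : number of anomalous word features in the content
    length : normalization factor L, e.g. the message length
    """
    if length <= 0:
        raise ValueError("length must be positive")
    entanglement = p_x / length       # overlap with the peer network
    anomaly_boost = 1 + a_x / length  # extra weight for anomalous features
    return anomaly_boost * entanglement


# Two made-up messages of length 10 with the same patterned overlap.
plain = llaqig_value(p_x=4, a_x=0, length=10)
unusual = llaqig_value(p_x=4, a_x=3, length=10)
```

Under this reading, content with the same patterned overlap scores higher when it also carries anomalous features, which is what shifts requests away from the most connected nodes.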
4
Use Case and Data Set
A USMC transportation capacity planning service is currently a centralized client-server system used by the USMC to provide near-term transportation planning, management, and execution capabilities. It enables asset visibility and in-transit visibility, which contribute important information to the logistical picture. The data set contains 37,449 transportation movement requests over about six months for 287 units and 15 equipment types. The capabilities of a unit are associated with the equipment it operates. The data attributes are listed as follows:
– Unit ID
– Equipment TAMCN
– Cargo weight
– Passengers
– Fuel cargo
– Water cargo
– Fuel used
– Oil used
– Miles traveled
– Hours operated
In Fig. 5, the units sorted by the real load, i.e., the number of requests, show that the top 26% (75 units) carry 80% of the total workload. These units appear more important than others and may be detected and targeted by adversaries. We use this case to show how to implement the distributed operation concept using LLAQIG to distribute centralized units and keep a small footprint.
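The "share of units that covers 80% of the load" statistic used here can be computed as sketched below. The function name, its parameters, and the sample loads are hypothetical; the optional `scores` argument reflects the idea of sorting units by some score and then accumulating their real load.

```python
def share_of_units_covering(loads, scores=None, target=0.8):
    """Fraction of units needed to reach `target` of the total load.

    Units are taken in descending order of `scores` (defaulting to the
    loads themselves) and their real load is accumulated.
    """
    if scores is None:
        scores = loads
    total = sum(loads)
    if total <= 0:
        raise ValueError("total load must be positive")
    order = sorted(range(len(loads)), key=lambda i: scores[i], reverse=True)
    running = 0.0
    for count, idx in enumerate(order, start=1):
        running += loads[idx]
        if running >= target * total:
            return count / len(loads)
    return 1.0


# Made-up loads for four units.
loads = [60, 25, 10, 5]
by_load = share_of_units_covering(loads)
by_other_score = share_of_units_covering(loads, scores=[1, 2, 3, 4])
```

A small share means a few units dominate the workload. A share near the target itself means the load is spread evenly with respect to that ordering, which is the lower-signature case.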
5
Results
We apply the LLA algorithm as both the Mine and Fuse algorithms to the data set. The goal is to generate new peer groups for a peer-to-peer network such that each new peer group covers as many of the capabilities and patterns in the centralized network as possible. The LLAQIG quantum effect is used to find the intrinsic value of a unit, i.e., its ability to change its balance and carry a requested load. When the units are all connected in a single group, as in the centralized model of the original data set, the distribution of the load converges to an equilibrium, content propagates, and the units are sorted and thus load-balanced following the Fuse algorithm chosen by a CLA, e.g., the eigenvector algorithm in Sect. 3.3 or the LLAQIG algorithm in Algorithm 3. Figure 4 shows an example, 1(A), of one of the LLA groups of units computed using LLA as both the Mine and Fuse algorithms for the original centralized data set. Each group consists of units linked to similar capabilities and characteristics. For example, a unit’s link to an equipment such as equipment tamcn d0009 is a capability, and a link to a word feature such as cargo weight mt 45643.2 (i.e., cargo weight more than 45643.2) is a characteristic. We compare the units sorted by various scoring methods, including real load, degree, eigenvector, LLA groups, and LLAQIG groups. The cumulative real loads of the sorted units are the curves shown in Fig. 5. A degree score is the degree centrality measure, i.e., the number of nodes connected to a node, computed from the LLA network with links based on Eq. (1). The eigenvector scores are computed based on Eq. (7). The LLAQIG scores are computed based on Eq. (11). The correlations of the degree scores and of the eigenvector scores with the real load are 0.60 and 0.52, respectively.
6
Discussions
The LLA groups can be used for load-balancing in two scenarios:
Using the LLA Peer Groups Directly. In this scenario, units in each group are peers of each other, sharing data and local indexes. Units in each group can search and match each other’s content (e.g., capabilities and characteristics) for a new service request (e.g., a specific equipment request). For example, in Fig. 4, a service request for the capability equipment tamcn d0013 enters via Unit 3210. The unit searches the shared content in its peer network and finds that Unit 5058 and Unit 5060 are both linked to the requested capability. Unit 3210 therefore routes the capability request to whichever of Unit 5058 and Unit 5060 has the lower load. As a consequence, the load of Unit 3210 is shifted to Unit 5058 or Unit 5060. LLA discovers 10 groups. After the 10 peer-to-peer networks are established, the load in each group is balanced among the units within the group. 76% (219 out of 287) of the units would cover 80% of the total load, corresponding to the curve in Fig. 5 in which the units are sorted by LLA. Note that in randomly formed new peer groups, 80% of the units would cover 80% of the total load. Achieving this would give the lowest signature and the best counter-detection posture. However, we analyze below why this is not a good option.

Using the LLAQIG Algorithm. The LLAQIG algorithm can compute the intrinsic value of a node using Algorithm 3. The algorithm shows how a unit can always optimize its value or load in an autonomous and self-organizing fashion. Sorting by the LLAQIG scores
Fig. 4. LLA Group 1(A)
Fig. 5. Unit Cumulative Percentage of Real Load based on Different Unit Scores
alone for all the units, 64% (184 out of 287) of the units would cover 80% of the total load, corresponding to the curve in Fig. 5 in which the units are sorted by LLAQIG. The LLAQIG quantum effect and mechanism implemented in each CLA enable the agent to change its load to the potential future value by re-linking, thus re-organizing the peer-to-peer network. Let $V_c(t) = V(t, j, c)$ be the value (score) computed using the LLAQIG for all the units, with $V_c(t) \in \mathbb{R}^N$, where N is the total number of units. For comparison, we also show new peer groups formed using degree scores, eigenvector scores, and random scores to rank the units. A total of 10 groups are formed for each method. Units ranked closely are in the same group. Load-balancing happens when new peer groups form and the current load shifts to the peers within the groups. Table 1 shows the range of the current load mean between groups, the range of the coefficient of variation (CV) of the current load between groups, the number (#) of unique equipment (within the peer groups), and the percentage (%) of units that cover 80% of the total load. Figure 6 shows the current load mean for each group. Figure 7 shows the current load CV for each group. The CV for a new peer group is the ratio of the standard deviation to the mean of the current load. Among all the methods for forming new peer groups, LLAQIG and LLA both have smaller current load mean ranges than the random method and the other methods, indicating that the new peer groups behave evenly. However, LLAQIG covers more unique equipment per group than LLA. Comparing the LLAQIG and random methods, LLAQIG has a smaller and lower CV range between groups, indicating smoother and less movement or effort, and thus less cost for the goal
Y. Zhao et al.
Fig. 6. Load Mean for Each New Peer Group via Different Methods
Fig. 7. Load CV for Each New Peer Group via Different Methods
of load-balancing within the groups. The eigenvector and degree methods, as described before, are correlated with the current load and are therefore not good scores for forming new peer groups, because both have larger current load mean ranges between groups. The meaning of the metrics is summarized as follows:
– Peer groups with larger loads attract detection.
– Peer groups with higher CVs have a higher level of dispersion around the mean and need more load-balancing towards the mean. Lower CVs make peer groups more stable.
– A higher number of unique equipment shows that more capabilities are utilized.
– A higher percentage of units covering 80% of the total load makes peer groups look more random and therefore show fewer signatures.
New peer groups can self-organize autonomously as long as the LLAQIG algorithm, including both the Mine and Fuse components, runs continuously and in real time in each peer CLA. Note that there are tactical reasons to base every resource individually; this is, however, impractical. Some hybrid might be more
practical. The CLAs can be federated, where some CLAs may unavoidably have a higher load and footprint than others.

Table 1. New Peer Groups and Characteristics

Methods for    Load Mean Range        Load CV Range          # of Unique   % of Units Cover
Peer Groups    Between Peer Groups    Between Peer Groups    Equipment     80% Total Load
Random         [86 179]               [1.06 1.83]            13            80%
Eigenvector    [7 362]                [0.57 1.33]            11            41%
Degree         [8 334]                [0.56 1.52]            8             43%
LLA            [80 263]               [0 2.09]               8             76%
LLAQIG         [missing]              [0.71 1.73]            11            63%

7 Conclusions
To win in hybrid, complex, and contested environments, the future DMO and EABO require flexible, resilient, and sustainable configurations and combinations of capabilities and tactics across multiple domains and organizations, including air, land, sea, space, and cyberspace. We show that peer-to-peer, self-organizing CLAs can meet this requirement. We show that the CLA architecture is flexible and can justify the shift from highly centralized operations to highly distributed, highly federated, and highly peer-to-peer ones that are fault-tolerant and maintain a small footprint. We demonstrate the LLAQIG based on the data and behavior patterns learned using CLAs, LLA, and the quantum intelligence game effect. The peer-to-peer and self-organizing DMO and EABO equipped with CLAs not only include the traditional warfare capabilities of sensors, platforms, networks, and weapons, but also extend to other tactics that evolve with new technologies. The resulting newly formed peer groups using LLAQIG have a smaller load mean range between groups and a lower current load standard deviation within groups than the random configuration and the other methods, and therefore maintain lower operation signatures.

Acknowledgment. The authors would like to thank the Naval Postgraduate School (NPS)’s Naval Research Program (NRP) for supporting the research. The Office of Naval Research (ONR) and the SBIR contract N00014-07-M-0071 supported part of the research on Collaborative Learning Agents at Quantum Intelligence, Inc. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.
Crowd-Sourcing High-Value Information via Quantum Intelligence Game
Charles C. Zhou1(B) and Ying Zhao2
1 Quantum Intelligence, Inc., Cupertino, CA 95014, USA
[email protected]
2 Naval Postgraduate School, Monterey, CA 93943, USA
http://www.quantumii.com
Abstract. In traditional network theory, an important node contains high-value information. Current methods of ranking high-value nodes require established hyperlinks, citation networks, social networks, or other forms of crowd-sourced collective intelligence marked by humans. However, few or no hyperlinks are available for private, proprietary, or unstructured data. Furthermore, what counts as high-value information can differ among applications. We apply the principles of quantum entanglement and superposition in a framework of lexical link analysis (LLA) quantum intelligence game (LLAQIG) to determine the value of a piece of information. LLAQIG optimizes the value of a content provider and at the same time helps its audience attain the optimal total social welfare. We show a use case of discovering the impact of daily business news on stock prices for more than 7000 publicly traded companies over 249 trading days (i.e., a whole year). The LLAQIG model achieves a cumulative return of 60% while the major indexes are all down at least 10% from July 5th, 2021 to June 29th, 2022. Our LLAQIG is an unsupervised algorithm based on the quantum intelligence game theory. We do not use the ground truth for training. The method can be applied to other value discovery applications using crowd-sourcing.
Keywords: Lexical Link Analysis · LLA · Quantum Machine Learning · Quantum Superposition · Quantum Entanglement · Quantum Intelligence Game · LLAQIG · Business News
1 Introduction
In traditional network theory, an important node contains high-value information. Current methods of ranking high-value nodes require established hyperlinks, citation networks, social networks, or other forms of crowd-sourced collective intelligence marked by humans. However, few or no hyperlinks are available for private or proprietary data. Furthermore, what counts as high-value information can differ among applications. Current methods mainly score popular and patterned information, which is mostly useful for marketing applications. Anomalous information is important for intelligence analysis. When an information provider seeks crowd-sourcing from an audience’s response to a piece of information, the interaction can be viewed as a strategic cooperation game of two players. The information provider’s search for a Nash equilibrium may not achieve full Pareto efficiency, the so-called optimal social welfare; this situation is referred to as the Prisoner’s dilemma. The contribution of this paper is to show that it is possible to apply quantum computing and game theory properties to meet both the Nash equilibrium and the optimal social welfare requirements for a real-world business intelligence application using a financial data set. The rest of this paper is organized as follows. Section 2 reviews related literature and the state-of-the-art related algorithms. Section 3 describes the main methods used in our system. Section 4 describes the use case and data sets. Section 5 presents the use case results. Section 6 discusses the results and relations with other methods, while Sect. 7 concludes the paper.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): SAI 2023, LNNS 711, pp. 552–564, 2023. https://doi.org/10.1007/978-3-031-37717-4_34
2 Literature Review
The research is related to recommendation algorithms for crowd-sourcing that use the classical collaborative filtering and content-based filtering methods to recommend items to users (e.g., recommend movies to users) based on crowd-sourced interaction data. These recommendation algorithms learn static recommendation models from training data using supervised machine learning (ML). The problem with supervised ML approaches is that the models are trained offline, while real systems require dynamically changing models to address user-item preferences and content in real time. In the so-called “bandit learners” and in reinforcement learning, a bandit or agent recommends an action and uses the reward or feedback from the environment as the guidance for learning, e.g., for adjusting the internal model’s parameters. The collaborative filtering bandit algorithms learn the preferences of similar users for similar items. Content-based filtering uses similarities in products, services, or content features, as well as information accumulated about the users, to make recommendations. Clusters of users and items are based on graphs of users and items. They are not suitable for highly dynamic recommendations when users, items, and content are not predictable [1]. These algorithms are also referred to as contextual bandit algorithms, where content can change rapidly and the algorithms learn and update in real time the latent mappings or preferences between users and items, using their interactions as the reward signals. A novel distributed bandit-based algorithm lazily creates clusters in a distributed manner and dramatically reduces the network data sharing requirement, achieving high scalability [2]. When all the peers are solving the same linear bandit problem, i.e., finding a global set of linear coefficients for a chosen action as a reward function, discovering the clusters of peers allows achieving the optimal asymptotic performance when the linear coefficients vary across the clusters [3].
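The reward-driven learning loop described above can be illustrated with a very small sketch. It uses plain epsilon-greedy selection, which is a much simpler technique than the collaborative or contextual bandits of [1–3] and is swapped in here only for illustration; the function name, the click rates, and the parameters are all hypothetical.

```python
import random


def epsilon_greedy(true_rates, steps=2000, epsilon=0.1, seed=0):
    """Toy bandit: recommend one item per step and learn from click feedback.

    true_rates are hypothetical click probabilities that the learner never
    sees directly; it only observes a 0/1 reward after each recommendation.
    """
    rng = random.Random(seed)
    counts = [0] * len(true_rates)
    values = [0.0] * len(true_rates)  # running mean reward per item
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(true_rates))  # explore a random item
        else:
            arm = max(range(len(true_rates)), key=values.__getitem__)  # exploit
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the mean reward of the chosen item.
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts, values


counts, values = epsilon_greedy([0.1, 0.5, 0.8])
```

The model parameters (here just the running means) are adjusted online after every interaction, in contrast to a model trained once offline.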
The consensus model [5] or gossip model [4] builds a global model without dominant agents. For example, a consensus approach builds an ML model that minimizes the prediction error over the union of all data sets from all the crowd-sourcing agents [5]. In a gossip model, each agent has a personalized objective [4]. The goal is to collaboratively improve the personalized objective by leveraging information from other agents. It interweaves learning and propagation by optimizing a trade-off between the smoothness of the model parameters over a network of agents and the models’ prediction accuracy on the local data sets [4]. A consensus model needs the coordination of all the agents to agree on a common business value, as in PageRank, opinion formation, smart power grids, control of UAVs, load balancing, and blockchain.

Another related area of research is traditional quantum computing and quantum game theory. A classical computer operates on bits, each 1 or 0. A quantum computer operates on qubits. Thus, while each classical bit can only be in one of two states, a quantum computer with n qubits can be in an arbitrary superposition of $2^n$ states simultaneously, allowing exponentially more possible combinations of states than a regular computer. Quantum mechanics concepts have informed not only quantum computing methods and physical quantum computers, but also information processing in computer science and biology. For instance, the concepts of quantum superposition and entanglement were applied to genetic and evolutionary algorithms [6]. Though quantum computing details are not the focus of this paper, it is noted that quantum computing, entanglement, and superposition properties are essential for the traditional quantum game and the quantum intelligence game. In the literature, quantum computing principles were explored jointly with game theory to escape the Prisoner’s dilemma. Several frameworks for quantum games have been proposed [7–9,14].
The following review is based on the framework detailed in references [10,14], in which the quantum Prisoner’s Dilemma is discussed. In a quantum game, each player has a qubit and can manipulate it independently. The quantum formulation proceeds by assigning the possible outcomes of the classical pure strategies C and D to two basis vectors, as in Eq. (1):

$$|C\rangle = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \qquad |D\rangle = \begin{pmatrix} 0 \\ 1 \end{pmatrix} \tag{1}$$

The state of the game is described by a vector in the tensor product space spanned by the game basis $|CC\rangle$, $|CD\rangle$, $|DC\rangle$, and $|DD\rangle$. At the beginning of the game, the qubits $|C\rangle \otimes |C\rangle$ go through an entangling gate $\hat{J} = \exp\left(i\frac{\gamma}{2}\hat{D} \otimes \hat{D}\right)$, where $\gamma$ is the measure of entanglement. $\hat{U}_A$ and $\hat{U}_B$ are the quantum strategy moves available to the players, each a two-parameter unitary $2 \times 2$ matrix of the form in Eq. (2), with $0 \le \theta \le \pi$ and $0 \le \phi \le \frac{\pi}{2}$:

$$\hat{U}(\theta, \phi) = \begin{pmatrix} e^{i\phi}\cos\frac{\theta}{2} & \sin\frac{\theta}{2} \\ -\sin\frac{\theta}{2} & e^{-i\phi}\cos\frac{\theta}{2} \end{pmatrix} \tag{2}$$

After the actions of both players and the gate, the final state is $|\psi_f\rangle = \hat{J}^{\dagger}(\hat{U}_A \otimes \hat{U}_B)\hat{J}|CC\rangle$, which is a superposition. Measurement makes the final state collapse to one of the classical basis states, and the reward is calculated according to the corresponding entries in the payoff matrix. The row player’s expected reward is given by

$$R_A = r P_{CC} + s P_{CD} + t P_{DC} + p P_{DD} \tag{3}$$

Two special cases are shown as follows:

1. For a separable game where $\gamma = 0$, there exists a pair of quantum strategies $(\hat{D}, \hat{D})$, with $\hat{D}$ as in Eq. (4),

$$\hat{D} = \hat{U}(\pi, 0) = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix} \tag{4}$$

which is a Nash equilibrium and yields reward $(p, p)$. The quantum game behaves as a classical one.

2. For a maximally entangled quantum game where $\gamma = \frac{\pi}{2}$, a novel Nash equilibrium $(\hat{Q}, \hat{Q})$ exists, with $\hat{Q}$ as in Eq. (5),

$$\hat{Q} = \hat{U}\left(0, \frac{\pi}{2}\right) = \begin{pmatrix} i & 0 \\ 0 & -i \end{pmatrix} \tag{5}$$

which yields reward $(r, r)$ and has the property of being Pareto optimal. Therefore the Prisoner’s dilemma existing in the classical game is removed.

The reward for $(\hat{Q}, \hat{Q})$ is equal to that for $(\hat{C}, \hat{C})$, so one can adopt $\hat{Q}$ as the cooperator’s strategy instead of $\hat{C}$, so that the dilemmas in classical game theory can be resolved in quantum game theory. For the Prisoner’s dilemma game, the modified matrix in the non-maximally entangled quantum game is given in Table 1.

Table 1. Quantum Game Payoff Matrix for the Prisoner’s Dilemma Game

$$\begin{array}{c|cc} & \hat{Q} & \hat{D} \\ \hline \hat{Q} & r & t\sin^2\gamma + s\cos^2\gamma \\ \hat{D} & t\cos^2\gamma + s\sin^2\gamma & p \end{array}$$

The Prisoner’s dilemma is resolved if $r > t\cos^2\gamma + s\sin^2\gamma$ and $p < t\sin^2\gamma + s\cos^2\gamma$. Under these conditions, a unique Nash equilibrium $(\hat{Q}, \hat{Q})$ exists, which is Pareto optimal. The two inequalities give the condition for escaping the Prisoner’s dilemma:

$$\sin^2\gamma > \max\left(\frac{t-r}{t-s}, \frac{p-s}{t-s}\right) \tag{6}$$
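The quantum game reviewed above is small enough to check numerically. The following is a minimal sketch, assuming NumPy and the standard illustrative Prisoner's dilemma payoffs r = 3, s = 0, t = 5, p = 1 (these numbers are not from the paper); the function names are hypothetical. It builds the entangling gate, applies two strategies, and evaluates the row player's expected reward of Eq. (3), which can be compared with the entries of Table 1 and the condition of Eq. (6).

```python
import numpy as np

# Strategies D-hat (Eq. 4) and Q-hat (Eq. 5).
D_HAT = np.array([[0, 1], [-1, 0]], dtype=complex)
Q_HAT = np.array([[1j, 0], [0, -1j]], dtype=complex)


def entangler(gamma):
    """J = exp(i*gamma/2 * D(x)D); closed form because (D(x)D)^2 is the identity."""
    dd = np.kron(D_HAT, D_HAT)
    return np.cos(gamma / 2) * np.eye(4) + 1j * np.sin(gamma / 2) * dd


def row_reward(u_a, u_b, gamma, r, s, t, p):
    """Expected reward of the row player, Eq. (3)."""
    j = entangler(gamma)
    cc = np.array([1, 0, 0, 0], dtype=complex)       # initial state |CC>
    psi = j.conj().T @ np.kron(u_a, u_b) @ j @ cc    # final state |psi_f>
    p_cc, p_cd, p_dc, p_dd = np.abs(psi) ** 2        # basis order CC, CD, DC, DD
    return r * p_cc + s * p_cd + t * p_dc + p * p_dd


def escapes_dilemma(gamma, r, s, t, p):
    """Condition of Eq. (6) on the entanglement gamma."""
    return np.sin(gamma) ** 2 > max((t - r) / (t - s), (p - s) / (t - s))


# Illustrative payoffs (an assumption, not taken from the paper).
R, S, T, P = 3, 0, 5, 1
gamma = 0.7
reward_qd = row_reward(Q_HAT, D_HAT, gamma, R, S, T, P)
reward_dq = row_reward(D_HAT, Q_HAT, gamma, R, S, T, P)
```

For these payoffs the threshold in Eq. (6) is sin²γ > 0.4, so a weakly entangled game does not escape the dilemma while a maximally entangled one does.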
The classical chicken game is an anti-cooperation game in which two players each choose either to “Dare” to cross the street or to “Chicken out” and not cross the street. It is mutually beneficial for the players to play different strategies. By applying quantum computing properties to the classical chicken game, one can also achieve both the Nash equilibrium and Pareto optimality, as in the Prisoner’s dilemma game. In the same way, one obtains the modified matrix of the classical chicken game in Table 2.

Table 2. Quantum Game Payoff Matrix for the Classical Chicken Game

$$\begin{array}{c|cc} & \hat{Q} & \hat{D} \\ \hline \hat{Q} & r & t\sin^2\gamma + p\cos^2\gamma \\ \hat{D} & t\cos^2\gamma + p\sin^2\gamma & s \end{array}$$

The condition for escaping the dilemma of the chicken game (CG) is

$$\sin^2\gamma > \frac{t-r}{t-p} \tag{7}$$

3 Methods
The differences between our system and the reviewed systems can be summarized as follows. Firstly, our system as described in this section contains self-organizing and unsupervised ML algorithms, which do not explicitly use feedback from the environment to guide a learning process. The information sharing among the nodes in a crowd-sourcing network is based on fusing the statistics of content similarity and dissimilarity. Therefore, the system automatically scales up, because correlated content is embedded in the patterns of a single knowledge graph of all the types of objects. Secondly, our system updates clusters of content (e.g., topics and themes, not clusters of agents) periodically and in an online and real-time fashion. Thirdly, our application requires each agent to realize its value through its own innovation and uniqueness, while also reaching a global consensus without total knowledge of, or communication with, the other agents in the network. Lastly, in contrast to a traditional quantum game [14] reviewed in Sect. 2, we apply geometric and probabilistic quantum superposition and entanglement to the pattern and anomaly themes to determine the quality, value, or impact of a content.

3.1 Lexical Link Analysis (LLA)
Lexical link analysis (LLA) is an unsupervised learning algorithm. In LLA, a complex system is expressed as a list of attributes or features, with specific vocabularies or lexicon terms describing its characteristics. LLA is a data-driven text analysis. For example, word pairs or bi-grams can be extracted as lexical terms and learned from a document repository. LLA discovers word feature pairs, networks, and themes, and groups them into patterned, emerging, and anomalous ones. LLA automatically discovers word features and clusters of word features, and displays them as word feature networks. LLA is related to but significantly different from so-called bag-of-words (BOW) methods such as Latent Dirichlet Allocation (LDA) [11]. The use of bi-grams allows LLA to be extended to numerical, categorical, or time series data. For example, for structured data such as attributes from databases, LLA first discretizes and then categorizes attributes and their values into word-like features.
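The two preprocessing ideas in this paragraph, extracting bi-grams from text and turning a numeric attribute into a word-like feature, can be sketched as follows. This is only an illustration under simple assumptions (whitespace tokenization, fixed bin edges); the function names, the example sentences, and the bin labels are hypothetical and are not the LLA implementation.

```python
from collections import Counter


def bigrams(text):
    """Return adjacent word pairs (bi-grams) from lower-cased, whitespace-split text."""
    words = text.lower().split()
    return list(zip(words, words[1:]))


def discretize(name, value, edges):
    """Map a numeric attribute to a word-like feature such as 'price_bin1'.

    edges are ascending bin boundaries; the bin index is the number of
    boundaries that the value has reached.
    """
    bin_index = sum(value >= edge for edge in edges)
    return f"{name}_bin{bin_index}"


docs = [
    "stock price rises after earnings",
    "stock price falls after earnings miss",
]
# Count how often each word pair occurs across the small repository.
pair_counts = Counter(pair for doc in docs for pair in bigrams(doc))
feature = discretize("price", 7.5, [5, 10])
```

Once numeric attributes are mapped to tokens like this, they can be paired and counted in the same way as words.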
LLA computes the counterfactual proportion difference in Eq. (8) as the strength of the association between two items for Agent j (i.e., a content provider):

$$cf_{lk}(j) = \left[\mathrm{prob}(w_l \mid w_k) - \mathrm{prob}(w_l \mid \text{not } w_k)\right] \times PN \tag{8}$$

where $\mathrm{prob}(w_l \mid w_k)$ is the probability that the word feature $w_l$ occurs in the same context in which the word feature $w_k$ occurs. $w_l$ and $w_k$ also depend on Agent j (i.e., a content provider) and timestamp t, which are omitted in Eq. (8) for brevity. $PN$ is the pooled sample size computed in Eq. (9):

$$PN = \mathrm{prob}(w_l \mid w_k)\,\mathrm{prob}(w_l \mid \text{not } w_k)\left(\frac{1}{N_{w_l \mid w_k}} + \frac{1}{N_{w_l \mid \text{not } w_k}}\right) \tag{9}$$

where $N_{w_l \mid w_k}$ or $N_{w_l \mid \text{not } w_k}$ is the number of times the word feature $w_l$ occurs in the contexts where the word feature $w_k$ occurs or does not occur, respectively. $cf_{lk}$ is a z-score [12]. If $cf_{lk} > 1.96$, then the link between word features $w_l$ and $w_k$ is considered statistically significant and causal (p-value < 0.05). The word features in LLA are clustered into groups or themes using community detection algorithms [13]:
– Patterned (P) themes: A patterned theme is more likely to be shared across multiple diversified domains, which are already in the public consensus and awareness and can be authoritative.
– Emerging (E) themes: These themes tend to become patterned over time.
– Anomalous (A) themes: These themes may not seem to belong to the data domain as compared to others. They can be interesting, unique, and innovative to specific entities, may be high-value, and need further investigation.
Let the values of a content c computed from patterned (P) themes and from anomalous (A) themes at time t be $P(t, c)$ and $A(t, c)$, respectively. The total value $V(t, c)$ for c is a function of $P(t, c)$ and $A(t, c)$, as shown in Eq. (10):

$$V(t, c) = f(P(t, c), A(t, c)) \tag{10}$$

3.2 LLA Quantum Intelligence Game (LLAQIG)
In traditional game theory, a player’s search for its own best reward is modeled as reaching a Nash equilibrium, which, however, may not achieve full Pareto efficiency or optimal social welfare for all other players; this is referred to as the Prisoner’s dilemma. When comparing one Nash equilibrium to another, the one that has a higher social welfare is a Pareto superior solution. A self-player agent is more successful, and thus has a higher value in terms of the crowd-sourced response to a piece of information (e.g., published content), if its strategy results in a higher social welfare. High-value content has to achieve the Nash equilibrium, i.e., maximize its own reward, and at the same time be Pareto efficient or superior. It is necessary to apply quantum computing and quantum game properties to meet the two requirements simultaneously. We show a connection between LLA, quantum computing, and quantum game theory that fulfills both requirements.
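The tension between the two requirements can be made concrete with the classical Prisoner's dilemma. The sketch below enumerates pure-strategy Nash equilibria and checks Pareto efficiency for a 2 × 2 game; the payoff values r = 3, s = 0, t = 5, p = 1 are the usual illustrative numbers (an assumption, not taken from the paper), and all function names are hypothetical.

```python
from itertools import product

STRATEGIES = ("C", "D")


def prisoners_dilemma(r, s, t, p):
    """Classical payoffs (row, column) for each pair of pure strategies."""
    return {
        ("C", "C"): (r, r), ("C", "D"): (s, t),
        ("D", "C"): (t, s), ("D", "D"): (p, p),
    }


def nash_equilibria(game):
    """Profiles where neither player gains by deviating unilaterally."""
    found = []
    for a, b in product(STRATEGIES, repeat=2):
        row_ok = all(game[(a, b)][0] >= game[(x, b)][0] for x in STRATEGIES)
        col_ok = all(game[(a, b)][1] >= game[(a, y)][1] for y in STRATEGIES)
        if row_ok and col_ok:
            found.append((a, b))
    return found


def pareto_efficient(game, profile):
    """True if no other outcome is at least as good for both and different."""
    u = game[profile]
    return not any(v[0] >= u[0] and v[1] >= u[1] and v != u for v in game.values())


def social_welfare(game, profile):
    """Total reward of both players for a profile."""
    return sum(game[profile])


game = prisoners_dilemma(3, 0, 5, 1)
equilibria = nash_equilibria(game)
```

In the classical game the only equilibrium is mutual defection, which is not Pareto efficient and has lower social welfare than mutual cooperation; this is the gap that the quantum formulation is meant to close.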
Firstly, we use LLA to compute a correlation between a content provider’s content x and its opponent’s content y, i.e., the overlapping word features that x and y possess. The opponent of the content provider is the population of content consumers, or the crowd-sourcing audience of the content provider. Assume that the total set of word features of both is categorized into patterned and anomalous themes, denoted by $P_y$ and $A_y$, respectively. We omit the emerging ones for simplicity. The numbers of word features from the self-player’s (i.e., content provider’s) content x that belong to $P_y$ and $A_y$ are $p_x$ and $a_x$, respectively. $p_x$ and $a_x$ can be considered as correlations between the self-player’s content x and the opponent’s content. When a self-player acts as a content provider seeking value from a cooperative opponent, the self-player also has to be Pareto efficient or superior in order to obtain the right response from the opponent, i.e., the content consumers or crowd-sourcing audience. Being Pareto efficient means that the system cannot make at least one player better off without making any other player worse off, i.e., it achieves the so-called optimal social welfare. In contrast to a traditional quantum game [14] reviewed in Sect. 2, we apply geometric and probabilistic quantum superposition and entanglement to the pattern and anomaly themes of LLA to determine the quality, value, or impact of a content. The resulting quantum machine learning system is a quantum intelligence game, or LLAQIG. A content provider (self-player) and its crowd-sourcing audience (opponent) always have a degree of entanglement, since their knowledge bases are correlated and overlap, which constitutes the content propagation and delivery. The degree of correlation is $p_x / L$ (L is a normalization factor, e.g., the length of the content x), corresponding to the entanglement $\gamma$ in a traditional quantum game [14] reviewed in Sect. 2.
For the two pure strategies C and D, as in quantum computing, a quantum superposition is shown in Eq. (11):

$$|X\rangle = c_0|C\rangle + c_1|D\rangle \tag{11}$$

where the coefficients $c_0$, $c_1$ are complex numbers:

$$c_0|C\rangle = e^{i\theta}, \qquad c_1|D\rangle = b e^{i\phi} \tag{12}$$

The magnitude of the superposition as a total reward, i.e., the total social welfare $\rho$, for the LLAQIG is

$$\rho^2 = \langle X|X\rangle = 1 + b^2 + 2b\cos(\theta - \phi) \tag{13}$$

Among the four points $(x_1, y_1)$, $(x_2, y_2)$, $(x_3, y_3)$, and $(x_4, y_4)$ in the superpositioned circles in Fig. 1, $(x_1, y_1)$ is the unique Nash equilibrium: for any given $\cos\theta$, neither player can unilaterally improve its reward, because the two players are correlated and entangled. The Nash equilibrium $(x_1, y_1)$ also reaches the maximal total social welfare $1 + b$, attained when $\theta = \phi$. This is the gain from the quantum effect.
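Eq. (13) and the claim that the total social welfare peaks at 1 + b when θ = φ can be checked with a few lines of arithmetic. The sketch below treats the two terms of Eq. (12) as complex amplitudes that are added, which is our reading of the equations above; the function names and sample values are hypothetical.

```python
import cmath
import math


def welfare_squared(b, theta, phi):
    """|e^{i*theta} + b*e^{i*phi}|^2, the squared magnitude of the superposition."""
    x = cmath.exp(1j * theta) + b * cmath.exp(1j * phi)
    return abs(x) ** 2


def welfare_squared_closed_form(b, theta, phi):
    """Right-hand side of Eq. (13)."""
    return 1 + b ** 2 + 2 * b * math.cos(theta - phi)


b = 0.5
# Scan phi on a coarse grid and record where the welfare is largest for theta = 0.3.
grid = [k * math.pi / 50 for k in range(101)]
best_phi = max(grid, key=lambda phi: welfare_squared(b, 0.3, phi))
```

When the two phases are aligned the cross term 2b cos(θ − φ) equals 2b, so ρ² = (1 + b)² and ρ = 1 + b; any phase mismatch lowers the total.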
Fig. 1. Superpositioned Patterned (P) and Anomalous (A) Themes: The highest-value of a self-player is achieved both at the Nash equilibrium and maximal total social welfare. Each content provider is a self-player with a content x, its audience is the opponent with content y.
In summary, for a new content c from a content provider, assume that $P_y(t)$ and $A_y(t)$ are the patterned and anomalous word features of a network of content consumers y. The value of the content c from LLAQIG is computed in Algorithm 1.
Algorithm 1. LLAQIG Algorithm
  while $t \ge 0$ do
    $[P_y(t), A_y(t)] \Leftarrow \mathrm{LLA}[B(t)]$
    for each new content c of Agent j do
      $V(t, j, c) \Leftarrow \left[1 + \frac{a_x(t, j, c)}{L}\right]\frac{p_x(t, j, c)}{L}$
    end for
  end while
$p_x(t, j, c)$ and $a_x(t, j, c)$ are the numbers of patterned and anomalous word features for the content c, respectively. c depends on Agent j (e.g., a content provider), omitted for brevity in Algorithm 1. L is a normalization factor, e.g., the length of a sentence or message of a content. $\frac{p_x(t, j, c)}{L}$ represents the degree of entanglement of content c with the content of its opponent. Ranking and sorting content using $V(t, j, c)$ gives the value of a content based on its influence and impact on its audience.
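The scoring step of Algorithm 1 can be sketched in a few lines. The formula below, V = [1 + a_x / L] · (p_x / L), is our reading of the update in Algorithm 1; the function names and the sample contents are hypothetical. The source does not state the value of L for any example, so the L = 33 used in the check is purely an assumption: it happens to give a value near the 0.407 that the use case section reports for a news item with 12 patterned and 4 anomalous features.

```python
def llaqig_value(p_x, a_x, length):
    """V = [1 + a_x / L] * (p_x / L), as read from Algorithm 1."""
    if length <= 0:
        raise ValueError("length must be positive")
    return (1 + a_x / length) * (p_x / length)


def rank_contents(contents):
    """Sort (content_id, p_x, a_x, length) tuples by descending LLAQIG value."""
    return sorted(
        contents,
        key=lambda item: llaqig_value(item[1], item[2], item[3]),
        reverse=True,
    )


# Hypothetical contents: (id, patterned count, anomalous count, length L).
contents = [("a", 12, 4, 33), ("b", 2, 0, 20), ("c", 10, 10, 25)]
ranking = [item[0] for item in rank_contents(contents)]
plausibility_check = llaqig_value(12, 4, 33)  # L = 33 is an assumption
```

The patterned share p_x / L acts as the base score, and the anomalous share a_x / L scales it up, so content that is both well connected to the audience and partly novel ranks highest.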
4 Use Case and Data Set
Publicly traded companies’ stock prices are impacted either positively (higher) or negatively (lower) by the business news they publish every day. Each company is a content provider and self-player in the LLAQIG model, and its news audience is the opponent. Daily business news content, collected from July 5th, 2021 to June 29th, 2022 over 249 trading days for more than 7000 publicly traded companies, was used to validate the LLAQIG model. The number of companies involved each day varies with the number of companies having news that day. For each day, the total news for all the companies was analyzed using the LLAQIG model; the model was then used to score each stock with a starting price >$5 and compared with various other methods for predicting the next-day return $R_{t+1}$ of the news day. The ground truth of $R_{t+1}$ is extracted from the actual returns on the day after the news publication day. Keywords from business news differ among industries and change dynamically over time. Traditional supervised ML and predictive algorithms are difficult to apply successfully. Nevertheless, for comparison, we first apply our LLAQIG to the whole day’s news for many companies and discover patterned, emerging, and anomalous themes. Figure 2 and Fig. 3 show an example of a patterned theme and an anomalous theme, respectively. A news content, e.g., “GM yahoo finance autos reporter pras subramanian joins the live show to discuss chevy’s nft auction where the winner will receive a physical corvette,” scores 12 for patterned, 4 for anomalous, and 0.407 for LLAQIG according to Algorithm 1. We then compare the LLAQIG results with the predictive algorithms of linear regression, random forest, support vector machines, k-nearest neighbor, stochastic gradient descent, decision trees, and neural networks from the scikit-learn Python package, using input variables that include the numbers of patterned or anomalous word features for each news content from LLAQIG, news sentiment, and the stock’s previous day’s prices, volumes, and returns with respect to the news publication day. As shown in Table 3, only LLAQIG, patterned, support vector machines, and $R_{t-1}$ have a positive average return over the total 249 days. In Fig. 4, the cumulative percentage returns are computed from the ground truth $R_{t+1}$ of the next day. The scores used in Table 3 are computed using various methods for each stock for each news day. The stock that is scored the highest by each method is selected as the top one stock to be observed for the next-day return $R_{t+1}$ for that method. These scores include pattern and anomaly scores from LLA, unknown scores (i.e., how many unknown word features are in a content), the previous day return $R_{t-1}$, the news day return $R_{t0}$, and the sentiment score for a news content.
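The evaluation procedure described above (score every stock on a news day, select the top-scored one, and observe its next-day return) can be sketched as follows. The tickers, scores, and returns are made up for illustration, and compounding the daily returns into a cumulative curve that starts at 1 is our assumption about how a curve like the one in Fig. 4 could be built; the paper does not spell out the exact computation.

```python
def top_stock(scores):
    """Ticker with the highest score on one news day."""
    return max(scores, key=scores.get)


def backtest(daily_scores, next_day_returns):
    """Select the top-scored stock each day; return the average and cumulative returns.

    daily_scores and next_day_returns are parallel lists of dicts keyed by ticker.
    The cumulative curve compounds (1 + R_{t+1}) day by day (an assumption).
    """
    picked = []
    cumulative = 1.0
    curve = []
    for scores, returns in zip(daily_scores, next_day_returns):
        daily_return = returns[top_stock(scores)]
        picked.append(daily_return)
        cumulative *= 1.0 + daily_return
        curve.append(cumulative)
    average = sum(picked) / len(picked)
    return average, curve


# Hypothetical two-day example with two tickers.
daily_scores = [{"AAA": 0.4, "BBB": 0.1}, {"AAA": 0.2, "BBB": 0.3}]
next_day_returns = [{"AAA": 0.10, "BBB": -0.02}, {"AAA": 0.05, "BBB": -0.10}]
average, curve = backtest(daily_scores, next_day_returns)
```

The example also shows why the average daily return and the cumulative curve can rank methods differently: a gain of 10% followed by a loss of 10% averages to zero but leaves the compounded value below 1.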
Table 3. Average Daily Return for Selected Top One Stock of 249 Trading Days using Different Scoring Methods

Methods                        Average Daily Return (July 5th, 2021 to June 29th, 2022)
Linear Regression              −0.002
KNN                            −0.003
Random Forest                  −0.001
Stochastic Gradient Descent    −0.005
Decision Trees                 −0.003
Support Vector Machines        0.002
Neural Networks                −0.003
News Day Return (Rt0)          −0.001
Previous Day Return (Rt-1)     0.003
News Sentiment                 −0.000
Patterned                      0.001
Anomaly                        −0.000
Unknown                        −0.002
LLAQIG                         0.002
Russell Index                  −0.001
Fig. 2. Example of a Patterned Theme from LLA
Fig. 3. Example of an Anomalous Theme from LLA
Fig. 4. Cumulative Percentage of Daily Return (Y-Axis) and Trading Days from July 5th, 2021 to June 29th, 2022 (X-Axis)
5 Results
As shown in Fig. 4, the LLAQIG gives the best cumulative percentage of daily return, i.e., up 60% for the period (July 5th, 2021 to June 29th, 2022). Although some scoring methods, e.g., the support vector machines and previous day return ($R_{t-1}$) methods, have average daily returns as high as or higher than that of LLAQIG in Table 3, the cumulative percentages of daily return of both methods are not as good as that of the LLAQIG model in Fig. 4. Furthermore, the LLAQIG model keeps the cumulative daily return above 1 in Fig. 4. This indicates that the LLAQIG model can effectively limit risk and loss in a turbulent market environment while the other methods cannot. For example, the support vector machines model has days below 1 in Fig. 4. The indexes all have negative returns for the period, e.g., the Russell index is about 25% down.
6 Discussion
The problem with supervised ML approaches is that the models are trained offline, while real systems require dynamically changing prediction models in real time. History, especially financial history, does not necessarily repeat itself in the future; financial data are non-stationary and very noisy, and therefore especially difficult to predict. Models can overfit the training data, i.e., models learned from historical data may not work for future data. Our LLAQIG is an unsupervised ML algorithm based on the quantum intelligence game theory. We do not use the ground truth for training; only the unstructured business news was used in training to discover the three types of themes. The ground truth was used only to construct the cumulative percentage returns and to compare against supervised ML and other scoring methods.
7 Conclusions
We demonstrated a quantum machine learning system, LLAQIG, that discovers high-value information from crowd-sourcing, using a business intelligence application of a financial data set.
1. Our LLAQIG is unsupervised. LLA can be set up as a game-theoretic framework in which a self-player, acting as an information provider, plays against an opponent, the audience responding to the information.
2. We show that quantum entanglement and superposition effects may be at work in real life when an information provider and its audience interact.
3. We validate our LLAQIG using practical, open-source financial data to find the best intrinsic values that reach the Nash equilibrium and maximize social welfare simultaneously.
Disclaimers The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied of the U.S. Government.
References
1. Li, S., Karatzoglou, A., Gentile, C.: Collaborative filtering bandits. The 39th SIGIR (SIGIR 2016). arXiv:1502.03473 (2016)
2. Mahadik, K., Wu, Q., Li, S., Sabne, A.: Fast distributed bandits for online recommendation systems. In: ICS ’20: Proceedings of the 34th ACM International Conference on Supercomputing, June 2020, Article No.: 4, pp. 1–13 (2020). https://doi.org/10.1145/3392717.3392748
3. Korda, N., Szorenyi, B., Li, S.: Distributed clustering of linear bandits in peer to peer networks. In: The 33rd ICML 2016 (2016). https://arxiv.org/abs/1604.07706
4. Vanhaesebrouck, P., Bellet, A., Tommasi, M.: Decentralized collaborative learning of personalized models over networks (2017)
5. Savazzi, S., Nicoli, M., Rampa, V.: Federated learning with cooperating devices: a consensus approach for massive IoT networks. IEEE Internet of Things Journal 7(5), 4641–4654 (2020). arXiv:1912.13163. https://doi.org/10.1109/JIOT.2020.2964162
6. Narayanan, A., Moore, M.: Quantum-inspired genetic algorithms. In: Proceedings IEEE International Conference Evolutionary Computation, pp. 61–66 (1996)
7. Khan, F.S., Phoenix, S.J.D.: Nash equilibrium in quantum superpositions. Proc. SPIE 8057 (2011). https://doi.org/10.1117/12.882921
8. Eisert, J., Wilkins, M., Lewenstein, M.: Phys. Rev. Lett. 83, p. 3077 (1999)
9. Marinatto, T., Weber, A.: Phys. Lett. 272, 291 (2000)
10. Li, A., Yong, X.: Entanglement guarantees emergence of cooperation in quantum prisoner’s dilemma games on networks. Nature Sci. Rep. 4, p. 6286 (2014). https://doi.org/10.1038/srep06286
11. Blei, D., Ng, A., Jordan, M.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003). https://jmlr.csail.mit.edu/papers/volume3/blei03a/blei03a.pdf
12. Penn State University (PSU): Online Statistics: normal approximation method formulas (2021). https://online.stat.psu.edu/stat200/lesson/9/9.1/9.1.2/9.1.2.1
13. Newman, M.E.J.: Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E 74, 036104 (2006)
14. Sun, Z.W.: The rule for evolution of cooperation in quantum games. Acta Physica Polonica A 116(2) (2009)
Teaming Humans with Virtual Assistants to Detect and Mitigate Vulnerabilities
Fitzroy D. Nembhard and Marco M. Carvalho
L3Harris Institute for Assured Information, Florida Institute of Technology, Melbourne, FL 32901, USA
{fnembhard,mcarvalho}@fit.edu
https://www.floridatech.edu
Abstract. The use of virtual assistants has grown significantly in recent years. This growth can be attributed to the prevalence of mobile devices and advances in the Internet of Things (IoT). While virtual assistants are widely used for interpersonal and social purposes such as ordering items from restaurants, creating reminders, and communicating with peers, their use is limited in cybersecurity and other computational sciences. In this research, we develop a framework that teams humans with virtual assistants on mobile devices in an effort to encourage the use of vulnerability detectors to mitigate errors in software and their underlying networks and systems. Creating effective cyber defenses involves teaming humans with machines in a way that enables secure orchestration, real-time communication, and unity of action. We demonstrate that a seamless coordination between human and AI can help minimize the number of errors in software systems, which will ultimately reduce data breaches and other cyber-related challenges plaguing our world.
Keywords: Software Security · Cybersecurity · NLP · Virtual Assistants · Devbots

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): SAI 2023, LNNS 711, pp. 565–576, 2023. https://doi.org/10.1007/978-3-031-37717-4_35

1 Introduction
Humans play a significant role in the security of computer systems, networks, and software. Consequently, some actions taken by humans could result in data breaches. Studies show that the average cost of a data breach that results from human error is in the millions of dollars [6]. Some of these errors occur at the code level (i.e., during software development). While it is difficult to model human behavior and resolve human errors, we believe that proper training and the use of efficient security analytics tools can minimize the number of errors caused by humans. However, it is not an easy undertaking to get humans to use certain security tools. Some reasons for this include high false-positive rates of existing tools, the exorbitant amount of time required to investigate inactionable alerts, and usability issues [10,13,16]. Further, the sheer number of tools available on the market makes it difficult for developers to select an appropriate tool to scan their projects for security violations and flaws.
To address some of these challenges, researchers have proposed techniques to refine alerts or to fuse results from multiple tools [4,12,15]. Others have proposed systems that attempt to help programmers fix errors that are found in their projects [9,14,22]. These auto-repair systems have not yet become widespread due to their ability to fix only a small subset of errors. There is also a need for agent-based systems that can work with programmers to produce more secure code instead of simply performing code scans and displaying bug reports. Nembhard et al. were the first to apply recommender systems to the secure coding problem [16]. Their system, named VulIntel (Vulnerability Intellisensor), monitors program code as the programmer codes, checks for unsafe practices, and then makes security recommendations by suggesting a set of candidates from which programmers can choose to fix vulnerabilities in their code [16].
While existing solutions have improved the security analytics landscape, there is still room for improvement. For example, assistive techniques built using Artificial Intelligence (AI), robust to human errors, and unsusceptible to manipulations can be very beneficial in this domain. It is to this end that we propose a solution that teams humans with an AI and NLP-based solution to encourage use of code analysis tools.
The proposed solution utilizes virtual assistants (VAs) on mobile devices to work with humans as they develop software with the goal of inculcating security practices. A virtual assistant is an application that understands voice commands and works as a software agent to complete tasks on behalf of a user. Interpersonal virtual assistants are already being used seamlessly in entertainment, dining, and the management of IoT devices in homes across the globe.
These systems learn over time and get to know users’ habits and preferences. Backed by AI, VAs understand natural language, recognize faces, identify objects, and are able to communicate with other smart devices and software. Adopting virtual assistants in the field of vulnerability detection will allow developers to communicate with virtual agents that will scan code for errors, summarize the results, and help to fix errors while the developers multi-task and remain productive. For example, it would be of interest for a developer getting ready for a code review to be able to ask a virtual assistant to scan his code for vulnerabilities and email him a report while he prepares for work. Since VAs are becoming more ubiquitous, we propose that a seamless integration with existing security analytics tools will meet developers where they are instead of introducing them to new systems with which they are not familiar. This integration has the potential to increase use of security analytics tools, reduce the number of errors introduced during development, and ultimately minimize the severity and number of data breaches that result from code-level vulnerabilities. In fact, virtual assistants are no longer seen as mere assistants, and their way of interacting with humans brings them closer to users as friendly companions [3].
Therefore, we posit that the close relationship between users and smart devices on an interpersonal level can be further extended to include security analytics. The rest of the paper is organized as follows: In Sect. 2, we discuss related work in the area of hybrid security analytics and virtual agents. We propose our human-machine approach to code analysis in Sect. 3. In Sect. 4, we evaluate our proposed approach using a case study, discuss the results of the case study in Sect. 5, and present our conclusion in Sect. 6.
2 Related Work
In this section, we discuss related work in the area of hybrid code analysis and virtual agents. In [20], the authors conducted an empirical study regarding a software-development bot (DevBot), named Security Knowledge Framework (SKF) chatbot, designed to answer queries about software security. Their goal was to assess the needs and expectations of software developers regarding such a system. They explored factors that may hinder the elaboration of more sophisticated conversational security DevBots and identified features for improving the efficiency of state-of-the-art solutions. Urli et al. [21] presented the Repairnator bot, an autonomous agent that constantly monitors test failures, reproduces bugs, and runs program repair tools against each reproduced bug. If a patch is found, the program repair bot reports it to the developers. The authors also presented a blueprint design of a program repair bot for continuous integration (CI) build failures and provided actionable recommendations to help authors of future program repair bots [21]. In 2020, Adamopoulou and Moussiades presented an overview of chatbot technology. They summarized the history of chatbot technologies from Eliza and Alice to SmarterChild, which have all focused on allowing users to communicate with computers using text [3]. MSRBot is a conversational DevBot that supports tasks related to the mining of software repositories [2]. For example, users may ask text-based questions regarding the names of developers who modified a repository or a summary of commits made on a particular date. In [5], the authors discussed the use of a voice user interface (VUI) to control laboratory devices and report device data. Their results produced benchmarks of established infrastructure and demonstrated a high potential for future applications of a VUI within laboratories [5].
As can be seen in the body of literature regarding chatbots and systems that attempt to repair code, no system currently exists that can hold verbal conversations with humans, communicate with code analysis tools and audibly report findings and take actions on behalf of the user. To the best of our knowledge, our work is among the first to implement a system that teams humans with voice-based virtual assistants in an effort to mitigate vulnerabilities in program code.
3 Proposed Approach
In this section, we describe our proposed approach for teaming humans with a virtual agent on smart devices to conversationally scan and fix defects in coding projects. The architecture of the proposed framework is captured in Fig. 1.
Fig. 1. Proposed Framework for Teaming Humans with Virtual Agents in Code Analytics
The system consists of three main components: a virtual assistant, a cloud-based framework backed by a webhook API, and a code analysis environment (highlighted with a dashed border in Fig. 1). We chose Google Assistant as the virtual assistant for this research because of its seamless integration with App Engine and Dialogflow, which together have become popular in the design and evaluation of virtual agents. Dialogflow is a natural language understanding (NLU) platform that allows users to design and integrate a conversational user interface into a mobile app, web application, device, bot, interactive voice response system, etc. [1]. NLU is used to extract context and meanings from natural language user inputs and respond appropriately according to user intention [3]. The code analysis environment consists of a local web application, an integrated development environment (IDE) plugin, and security analysis tools. The process of teaming humans with the virtual assistant works as follows: first, a user verbally launches a Google Assistant mobile app on a smart device such as a smart phone using a set of common phrases associated with code analysis. Backed by NLP, this app is designed to understand phrases that are associated with security analytics. To facilitate smooth communication between
the user and the agent, training phrases were entered into Dialogflow using its intent management system. By using NLU, the app identifies user intent and extracts domain-specific entities. An intent represents a mapping between a user’s speech and the action(s) that should be taken by the virtual assistant. Figure 2 shows the current intents incorporated into our proposed system. As can be seen in the figure, the intents are categorized into 6 main sections: Entry Point, Vulnerability Detection and Mitigation, Clone Detection, Cancel, Exit, and a Fallback Intent. The arrows in the diagram connect related contexts or contexts that are reachable from other contexts. We now summarize the purpose of each intent.
1. The Entry Point intent introduces the user to the application and provides a summary of domain-specific questions that the user may ask the virtual assistant. While this intent is triggered automatically when the app starts up, a user may also use it to get support by asking a question such as: how do I use the system?
2. The goal of the Vulnerability Detection and Mitigation intent is to work with a human agent to conversationally scan a coding project for vulnerabilities and complete follow-up actions such as emailing the user a vulnerability report, auto-fixing errors found in the project, or completing other tasks based on the capabilities of the underlying security analytics tool.
3. The purpose of the Clone Detection intent is to analyze a coding project and present code that is similar in structure. This intent can also display a parallel view of duplicated code. Code clones pose a challenge in program code because they could potentially lead to unexpected behavior or vulnerabilities [11].
4. The Cancel intent is used to exit running tasks.
5. The Exit intent is used to exit the application.
6. The goal of the Fallback intent is to ask the user to clarify an utterance or to end the program gracefully if a phrase is misunderstood after several requests for clarification.
After invocation, the Google Assistant app attempts to determine the intent of the user by communicating with the Google Conversation API. After contextualizing the user’s intent, the Google Assistant app then communicates with a web application running in the background on the user’s development environment. This communication is driven by port forwarding so that the actions app can be seamlessly integrated with the tools on the user’s computer such as IDEs and web browsers. As shown in Fig. 2, our system can be used to scan code that the user is currently developing in an IDE or code identified by tabs in a browser such as a Git repository. Two novel approaches are used to localize the user’s request to scan a project. To scan code referenced in a browser, special scripts are used to capture information regarding tabs that are currently open. In the current implementation of our system, AppleScript is used to report the URLs of all open browser tabs. This information is also added to a message queue (see Fig. 1) where the user can further specify the repository they would like to scan.
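The intent-to-action mapping described above can be sketched as a small dispatch table. This is an illustrative sketch only, not the authors' implementation (which is written in Java and driven by Dialogflow); the intent keys, reply strings, and the limit of three clarification requests are assumptions.

```python
from dataclasses import dataclass

MAX_CLARIFICATIONS = 3  # assumed limit; the text only says "several requests"

@dataclass
class Session:
    """Tracks consecutive utterances that matched no known intent."""
    misses: int = 0

# Hypothetical intent keys mapped to hypothetical replies.
HANDLERS = {
    "entry_point": lambda: "You can ask me to scan a project or find code clones.",
    "vulnerability_detection": lambda: "Which project would you like me to scan?",
    "clone_detection": lambda: "Which project should I check for duplicated code?",
    "cancel": lambda: "Okay, I cancelled the running task.",
    "exit": lambda: "Goodbye.",
}

def route(intent: str, session: Session) -> str:
    """Dispatch a recognized intent, or fall back and eventually end gracefully."""
    handler = HANDLERS.get(intent)
    if handler is None:  # fallback: no intent matched the utterance
        session.misses += 1
        if session.misses >= MAX_CLARIFICATIONS:
            return "Sorry, I still did not understand. Goodbye."
        return "Sorry, could you rephrase that?"
    session.misses = 0  # a successful match resets the fallback counter
    return handler()
```

For example, `route("vulnerability_detection", Session())` returns the scan prompt, while three consecutive unrecognized intents on the same session end the conversation.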
Fig. 2. Dialogflow Intents used by Proposed System
To capture and scan code that is actively being developed in an IDE, an IDE plugin is used to establish communication between the IDE and the web service. This plugin works as an intermediary tool that invokes a code analytics tool, fulfills the user’s request, and enqueues the output to a messaging queue. The web service residing on the user’s machine monitors the queue and sends the messages to the Google app, which interprets them and relays the appropriate responses to the user.
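The hand-off between the plugin and the local web service can be pictured with an in-process queue. This is a simplified sketch under stated assumptions: the real system uses a messaging queue between separate processes, and the function names, message fields, and stand-in analyzer below are hypothetical.

```python
import queue

def plugin_scan(project, analyzer, outbox):
    """Stand-in for the IDE plugin: run the analyzer and enqueue its output."""
    outbox.put({"project": project, "findings": analyzer(project)})

def drain(outbox):
    """Stand-in for the local web service: collect every pending message."""
    messages = []
    while True:
        try:
            messages.append(outbox.get_nowait())
        except queue.Empty:
            return messages

def fake_analyzer(project):
    """Hypothetical analyzer that returns a fixed finding."""
    return [f"{project}: example finding"]

outbox = queue.Queue()
plugin_scan("WebGoat", fake_analyzer, outbox)
pending = drain(outbox)  # messages the web service would forward to the app
```

Decoupling the two sides through a queue lets the plugin finish a scan without waiting for the assistant, which fits the multi-tasking use case described in the text.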
4 Case Study
In this section, we use a case study to demonstrate a prototype of our proposed system. The goal of this study was to implement a fully-functioning Google Actions app, which can be executed using Google Assistant on mobile devices, as a demonstration of teaming humans with a virtual agent to interactively scan a project for vulnerabilities. The main question we wanted to answer was the following: Can a virtual agent be integrated with software development environments to interactively scan code for vulnerabilities, take actions requested by the user, and report findings to the user? To carry out the study, we implemented a Google Assistant app using the Java programming language. Driving this app are the intents shown in Fig. 2. We then tested the app using the Google Actions Simulator before releasing it in alpha mode where it can be tested on a smart device. We tentatively named the application My CodeAnalyzer so that it can be launched on a device with Google Assistant using the following phrase: Ok, Google, Talk to My Code Analyzer. Figure 3 shows the resulting Google Actions App displayed on an iPhone after being launched using speech. To test the ability of the system to work with humans during software development, we created IDE plugins for IntelliJ IDEA and Eclipse. Both plugins were packaged as Java ARchive (JAR) files and installed using their respective installers. To answer our question of interest, we selected OWASP WebGoat [17] as a target application for vulnerability analysis. WebGoat is a deliberately insecure application that allows developers to test vulnerabilities commonly found in Java-based applications that use common and popular open source components [17]. To establish a basis for the expected behavior of a virtual agent that would interact with a code analyzer, we first tested the project with two code analyzers: FindBugs (version 3.0.1) and PMD (version 6.51.0). 
FindBugs is an open source tool for static code analysis of Java programs, which scans byte code for bug patterns to find defects and/or suspicious code [7]. PMD is a static source code analyzer for finding common programming flaws in Java and Apex code [19]. Since PMD requires a ruleset in order to scan code for vulnerabilities, we configured it to use the error-prone ruleset provided by the maintainers of the PMD project [18]. Error-prone rules are rules to detect constructs that are either broken, extremely confusing or prone to runtime errors [18]. FindBugs was executed with its default settings. By using a plug-and-play approach, we incorporated PMD into our code analysis framework (see Fig. 1). The goal was to ascertain whether a virtual agent can be used to capture the output from a code analyzer, summarize the results and report findings to the user via speech.
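Capturing an analyzer's output for summarization might look like the sketch below, which counts violations per priority in a PMD-style XML report. The sample report is a hand-written assumption about the report shape (`violation` elements carrying a `priority` attribute); it is not output from the study and should be checked against a report produced by the PMD version in use.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-written sample in an assumed PMD-style shape (not real scan output).
SAMPLE_REPORT = """<pmd xmlns="http://pmd.sourceforge.net/report/2.0.0">
  <file name="Example.java">
    <violation beginline="10" rule="RuleA" priority="1">message one</violation>
    <violation beginline="22" rule="RuleB" priority="3">message two</violation>
    <violation beginline="31" rule="RuleB" priority="3">message three</violation>
  </file>
</pmd>"""

def count_by_priority(report_xml):
    """Count violation elements per numeric priority, ignoring XML namespaces."""
    root = ET.fromstring(report_xml)
    counts = Counter()
    for element in root.iter():
        local_name = element.tag.rsplit("}", 1)[-1]  # drop "{namespace}" prefix
        if local_name == "violation":
            counts[int(element.get("priority"))] += 1
    return counts

counts = count_by_priority(SAMPLE_REPORT)
```

The per-priority counts are the kind of compact summary a voice interface can read aloud, with the full report left for email.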
Fig. 3. Resulting Mobile App Launched using Google Assistant
5 Results and Discussion
In this section, we compare the results obtained by using PMD and FindBugs independently to scan the WebGoat project versus incorporating PMD into our framework driven by a virtual assistant on a smart phone. Table 1 shows the results of scanning the WebGoat project using PMD and FindBugs in addition to the spoken message provided by the assistant working with the PMD analyzer. PMD uses the classification of low-high to rank bugs whereas FindBugs uses a rank of 1–20. These ranks are then grouped into the categories scariest (rank 1–4), scary (rank 5–9), troubling (rank 10–14), and of concern (rank 15–20) [8]. As shown in the table, PMD was able to find 9 errors of high priority whereas FindBugs was able to find 1 scary error. Figure 4 shows a full transcription of a sample communication between a user and the virtual assistant while scanning the same WebGoat project. As can be seen in the transcription, the virtual assistant was able to identify GitHub projects referenced in the user’s browser. Additionally, it was able to recognize that the WebGoat project was opened in the Eclipse IDE. The user was then given the option to help contextualize their request by selecting the project they would like to scan using their voice. After the project was selected, the assistant invoked the PMD analyzer, captured the output and summarized the results, drawing attention to 9 errors that were of high priority and giving the user the option of receiving a more detailed report by email. Because the virtual assistant was able to capture the same errors that PMD was able to report independently, these results demonstrate that a virtual agent can be teamed with a human agent to verbally and interactively scan a project for vulnerabilities, complete tasks on behalf of the user (emailing a report in this context), and report findings in a user-friendly manner.

Table 1. Results of using FindBugs and PMD vs PMD with a Virtual Assistant to Scan the WebGoat Project

PMD
  High    Medium    Medium High    Low
  9       515       11             702

FindBugs
  Scariest    Scary    Troubling    Of Concern
  1           0        4            291

PMD with Virtual Assistant
  1,237 errors were found in the WebGoat project. (9 of High priority). Say “Give me more details”, if you would like me to summarize the errors found. You can also say “Email me a report”, if you prefer that option. Or, you can say “No, thank you” to end the conversation
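The rank grouping and the spoken summary can be reproduced with a short sketch. The function names are hypothetical; the rank boundaries come from the text, and the counts are the four PMD values reported in Table 1 (9, 515, 11, 702), of which only the total and the High count are used.

```python
def bug_rank_category(rank):
    """Map a FindBugs bug rank (1-20) to the category names given in the text."""
    if not 1 <= rank <= 20:
        raise ValueError(f"rank must be between 1 and 20, got {rank}")
    if rank <= 4:
        return "scariest"
    if rank <= 9:
        return "scary"
    if rank <= 14:
        return "troubling"
    return "of concern"

def spoken_summary(project, high, other_counts):
    """Build the opening sentence of the assistant's spoken report."""
    total = high + sum(other_counts)
    return (f"{total:,} errors were found in the {project} project. "
            f"({high} of High priority).")

# PMD counts from Table 1: 9 High, plus the three remaining categories.
message = spoken_summary("WebGoat", high=9, other_counts=[515, 11, 702])
```

The four PMD counts sum to 1,237, which agrees with the total spoken by the assistant in the table.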
5.1 Limitations
We now discuss some of the limitations of the current implementation of our proposed framework. One major limitation is that the system in its current form does not have a built-in code analyzer. At this prototypical stage, only the PMD analyzer has been configured to work with the virtual assistant. PMD is a rule-based code analyzer, which requires users to select the rulesets to be used when scanning a project. Consequently, the limitations of this code analyzer would be reflected in the results generated by the virtual assistant. However, because the framework is modular, other code analyzers could be selected by the user in a plug-and-play approach. This is possible because most code analyzers can export findings in common formats such as XML, JSON, and HTML. The system can then be configured to mine these file formats and report findings. It is our plan to incorporate a recommender system into the framework that will be able to recommend a scanner based on the type of project the user selects for scanning. Because some code analyzers require users to select configurations for them to work properly, the recommender system would be able to present a set of preset configurations from which the user can choose, or to recommend one based on the nature of the system under test. This would help minimize the time required to configure existing scanners and thus dampen the reservations programmers commonly have about using code analyzers.

Fig. 4. Sample Communication between a Person and the Google Assistant Agent
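One way to picture the plug-and-play idea is a registry of per-format parsers that normalize each analyzer's export into a common list of findings. This is a speculative sketch, not part of the described prototype; the registry, the JSON schema (an `issues` array with `file` and `severity` fields), and all names are hypothetical.

```python
import json

PARSERS = {}  # maps a format name to a parser function

def register(fmt):
    """Decorator that registers a parser for one export format."""
    def decorator(func):
        PARSERS[fmt] = func
        return func
    return decorator

@register("json")
def parse_json(text):
    """Normalize a hypothetical JSON export into a list of finding dicts."""
    return [{"file": issue["file"], "severity": issue["severity"]}
            for issue in json.loads(text)["issues"]]

def load_findings(fmt, text):
    """Look up the parser for a format and apply it to the exported text."""
    try:
        parser = PARSERS[fmt]
    except KeyError:
        raise ValueError(f"no parser registered for format: {fmt}") from None
    return parser(text)

sample = '{"issues": [{"file": "A.java", "severity": "high", "line": 3}]}'
findings = load_findings("json", sample)
```

Supporting another analyzer would then amount to registering one more parser (for example for an XML export), leaving the summarizing and reporting code unchanged.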
6 Conclusion
Overcoming human error in computing systems and software is still a major challenge. Because of this, there is a need to team humans with virtual agents to help secure computers and their underlying networks. In this research, we proposed, created and tested a framework that teams humans with virtual assistants on smart devices to allow developers to verbally interact with an agent that can help them create more secure software. Currently, some systems exist that allow developers to communicate with devbots or chatbots using written messages. However, to the best of our knowledge, no system currently exists that allows virtual assistants to interact verbally via smart devices with developers as they develop software. The use of virtual assistants in everyday activities, such as purchasing products and searching for places of interest, has been growing rapidly. Because these agents are now viewed as friends, we demonstrate that they can be employed in the field of cybersecurity and software security where the use of security analytics tools is often frowned upon. Our proposed framework uses the natural language processing capabilities of Google Assistant and Dialogflow to interface with code analyzers to scan code for vulnerabilities as programmers multi-task. This integration has the potential to increase productivity while simultaneously hardening software projects, making them more resilient to attacks. Future work will involve conducting usability studies to ascertain the usefulness of this technology in helping developers produce more secure software as well as investigating security and privacy issues related to the use of virtual assistants.
References
1. Dialogflow (2022). https://cloud.google.com/dialogflow/docs. Accessed 19 Aug 2022
2. Abdellatif, A., Badran, K., Shihab, E.: MSRBot: using bots to answer questions from software repositories. Empirical Softw. Eng. 25(3), 1834–1863 (2020). https://doi.org/10.1007/s10664-019-09788-5
3. Adamopoulou, E., Moussiades, L.: An overview of chatbot technology. In: Maglogiannis, I., Iliadis, L., Pimenidis, E. (eds.) AIAI 2020. IAICT, vol. 584, pp. 373–383. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-49186-4_31
4. Alenezi, M., Javed, Y.: Developer companion: a framework to produce secure web applications. Int. J. Comput. Sci. Inf. Secur. 14(7), 12 (2016)
5. Austerjost, J., et al.: Introducing a virtual assistant to the lab: a voice user interface for the intuitive control of laboratory instruments. SLAS Technol. Translating Life Sci. Innov. 23(5), 476–482 (2018)
6. Dongre, S., Mishra, S., Romanowski, C., Buddhadev, M.: Quantifying the costs of data breaches. In: ICCIP 2019. IAICT, vol. 570, pp. 3–16. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-34647-8_1
7. FindBugs: FindBugs - Static Code Analysis of Java (2022). https://www.methodsandtools.com/tools/findbugs.php. Accessed 19 Aug 2022
8. FindBugs: FindBugs 2 (2022). https://findbugs.sourceforge.net/findbugs2.html. Accessed 19 Nov 2022
9. Gupta, R., Pal, S., Kanade, A., Shevade, S.: DeepFix: fixing common C language errors by deep learning. In: Thirty-First AAAI Conference on Artificial Intelligence (2017)
10. Johnson, B., Song, Y., Murphy-Hill, E., Bowdidge, R.: Why don’t software developers use static analysis tools to find bugs? In: Proceedings of the 2013 International Conference on Software Engineering, ICSE ’13, pp. 672–681, Piscataway, NJ, USA. IEEE Press (2013)
11. Juergens, E., Deissenboeck, F., Hummel, B., Wagner, S.: Do code clones matter? In: 2009 IEEE 31st International Conference on Software Engineering, pp. 485–495 (2009)
12. Kong, D., Zheng, Q., Chen, C., Shuai, J., Zhu, M.: ISA: a source code static vulnerability detection system based on data fusion. In: Proceedings of the 2nd International Conference on Scalable Information Systems, InfoScale ’07, pp. 55:1–55:7. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering) (2007)
13. Kremenek, T., Ashcraft, K., Yang, J., Engler, D.: Correlation exploitation in error ranking. In: ACM SIGSOFT Software Engineering Notes, SIGSOFT ’04/FSE-12, pp. 83–93, New York, NY, USA. ACM (2004)
14. Marginean, A., et al.: SapFix: automated end-to-end repair at scale. In: 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), pp. 269–278 (2019)
15. Nembhard, F.: A recommender system for improving program security through source code mining and knowledge extraction. PhD thesis, Florida Institute of Technology (2018)
16. Nembhard, F.D., Carvalho, M.M., Eskridge, T.C.: Towards the application of recommender systems to secure coding. EURASIP J. Inf. Secur. 2019(1), 9 (2019)
17. OWASP WebGoat. OWASP WebGoat (2022). Accessed 19 Aug 2022
18. PMD. PMD Error Prone Ruleset (2022). Accessed 19 Aug 2022
19. PMD Source Code Analyzer Project. PMD source code analyzer project (2022). Accessed 19 Aug 2022
20. Tony, C., Balasubramanian, M., Díaz Ferreyra, N.E., Scandariato, R.: Conversational DevBots for secure programming: an empirical study on SKF chatbot. In: Proceedings of the International Conference on Evaluation and Assessment in Software Engineering 2022, EASE ’22, pp. 276–281, New York, NY, USA (2022). Association for Computing Machinery
21. Urli, S., Yu, Z., Seinturier, L., Monperrus, M.: How to design a program repair bot? Insights from the repairnator project. In: 2018 IEEE/ACM 40th International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP), pp. 95–104. IEEE (2018)
22. Weimer, W., Fry, Z.P., Forrest, S.: Leveraging program equivalence for adaptive program repair: models and first results. In: 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 356–366 (2013)
Spaces of Interpretations: Personal, Audience and Memory Spaces
Yehuda Roth
Oranim Academic College, K. Tivon 36006, Israel
yehuda [email protected]
Abstract. The quantum measurement scenario plays a significant role in quantum theory. This study uses these measurement principles to define an interpretation process, described below. In a quantum measurement process, an observer plays a crucial role: he is the one who selects the measuring device with which he describes reality. For example, in exploring a particle, the observer may choose to measure the particle’s momentum or its location. This selection generates the following scenario: consider a measuring device not adjusted to measure the actual observables of the measured object (for example, a location measurement of a particle with a defined momentum). In that case, a quantum collapse occurs in which the system randomly collapses into one of the measurement states defined by the device. For that scenario, we can say that the measuring instrument does not detect nature’s observables but interprets them according to the selected measuring device. This study presents a procedure describing this process of “quantum-like interpretation”.
Keywords: Personal Space · Audience Space · Memory Space · Interpretation

1 Introduction
Interpretation plays a crucial role in many aspects of life. We can say that without being interpreted, observed data are meaningless [1]. This was the motivation for a previous paper in which a quantum-like formalism describing the process of interpretation was suggested [2] where the phrase “quantum-like approach” refers to applying quantum theory’s tools to a non-quantum system [3]. Previous papers introduced a quantum-like formalism to describe such process. The author in [4,5], quantum-like approach modeled the concepts of cognition, decision making, and rationality. The author in [2] introduced a quantum-like procedure for an interpreting process where the focus was on an individual interpretation. This process of individual interpretation was purely mathematical, but because we based it on quantum tools such as measurement theory [6–9] and various representations of states [2], we proposed the idea of implementing it as a quantum-based machine. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): SAI 2023, LNNS 711, pp. 577–585, 2023. https://doi.org/10.1007/978-3-031-37717-4_36
Y. Roth
This study extends [2] in that it describes the space in which individuals who have already interpreted exchange information. In addition, we formulate a mathematical procedure describing the transfer of interpreted information, thereby making room for newly interpreted data. To demonstrate our approach to interpretation, we analyze an ambiguous figure, as presented in Fig. 1.
Methods
This study is theoretical. We base our research on the mathematics implemented in quantum theory, as described below:
1. Hilbert spaces describe the processes of personal interpretation. We represent the interpretation activities by operators acting on states.
2. Fock space describes individuals who exchange information. Creation and annihilation operators represent the alternating observations an individual observer performs.
3. The collapse process: when several interpretations exist, the collapse scenario determines the selected interpretation.
2 Preview: Activities Within a Personal Space
This part reviews the interpretation process performed by a single observer [2]. We demonstrate the process with the ambiguous figure presented in Figs. 1 and 2: can the indefinite collection of points composing the image in Fig. 1 be interpreted as the letter B or as the digits 13?
Fig. 1. An Ambiguous Figure: Is it the Letter B or the Digits 13?
Fig. 2. Observing the Column in the Middle, the Image is Interpreted as a Letter, while the Line 12, 13, 14 Associates the Image with the Concept of Digits
Following [2], we review the four stages in the interpretation process of a single observer. Suppose we have an item to be interpreted; that is, a mere object that acquires meaning through the process of interpretation.
Spaces of Interpretations
1. State Construction. In state construction, the system transforms the item to be interpreted into a state in a Hilbert space. This state is denoted as $|\pi(\text{Item})\rangle$. The details of transforming an object (such as a simple image) into a state are not discussed in this study; however, a procedure for generating coherence was described in [14], where a nonlinear approach was implemented. In addition, in [2] we show how to represent an image in terms of states by implementing a pixel representation. In our demonstration, the item is the ambiguous image of Fig. 1, and the constructed state is $|\pi(\text{Item})\rangle$.
2. Classified representation. The system defines the states (concepts) to be used to interpret the information received. These concepts are defined by the states $|I_i\rangle$, where $i$ labels the state in the corresponding Hilbert space. In our demonstration, the classified concepts generate a space composed of two subspaces, letters and digits, with the corresponding states $|l_i\rangle$ and $|d_i\rangle$, respectively; the two states relevant to Fig. 1 are the letter state $|B\rangle$ and the digit state $|13\rangle$.
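3. Representation. The constructed state is represented in terms of the classification states. Defining

$$R \overset{\text{def}}{=} \sum_i |I_i\rangle\langle I_i| \tag{1}$$

as a classification operator, we obtain:

$$R\,|\pi(\text{Item})\rangle = \sum_i \alpha_i\,|I_i\rangle, \tag{2}$$

with

$$\alpha_i = \langle I_i|\pi(\text{Item})\rangle. \tag{3}$$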
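In our demonstration, we consider $|I_j\rangle = |B\rangle$ and $|I_k\rangle = |13\rangle$, where $\alpha_i = 0$ for all $i \neq j, k$, to obtain:

$$R\,|\pi(\text{Item})\rangle = \alpha_j\,|B\rangle + \alpha_k\,|13\rangle, \tag{4}$$

where, according to the quantum formalism, $|\alpha_j|^2$ and $|\alpha_k|^2$ are the probabilities for the item to be interpreted as the letter B or the digits 13, respectively.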
4. Determination. The state collapses into one of the classification states to complete the interpretation. We use the observable:

$$D = \sum_i \iota_i\,|I_i\rangle\langle I_i|, \tag{5}$$

where, similar to [15], the $\iota_i$ are eigenconcepts that serve as the measurement output. These are the interpretation results sent to the observer. In our demonstration:

$$D = 13\,|13\rangle\langle 13| + B\,|B\rangle\langle B|, \tag{6}$$

where 13 and B are the eigenconcepts.
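The four stages above can be illustrated with a small numerical sketch. It assumes a two-dimensional concept space spanned by the letter B and the digits 13; the amplitudes 0.6 and 0.8 and the function names are illustrative assumptions, not values or an implementation taken from the paper.

```python
import random

# Hypothetical amplitudes alpha_i = <I_i | pi(Item)> for the two concepts
# of the demonstration; the numbers are illustrative assumptions.
amplitudes = {"B": 0.6, "13": 0.8}


def probabilities(alphas):
    """Weights |alpha_i|^2, normalized so that they sum to 1."""
    weights = {concept: abs(a) ** 2 for concept, a in alphas.items()}
    total = sum(weights.values())
    return {concept: w / total for concept, w in weights.items()}


def determine(alphas, rng=random):
    """Determination stage: collapse onto one concept and return its eigenconcept."""
    probs = probabilities(alphas)
    concepts = list(probs)
    return rng.choices(concepts, weights=[probs[c] for c in concepts], k=1)[0]


probs = probabilities(amplitudes)
outcome = determine(amplitudes, random.Random(0))
print(probs, outcome)
```

Repeated calls to `determine` return B in roughly 36% of the runs and 13 in roughly 64%, which mirrors the role of $|\alpha_j|^2$ and $|\alpha_k|^2$ in Eq. 4.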
3 Spaces of Various Interpreting Observers: Audience Space of Measurements
The previous review described the interpretation process conducted by a single observer. In this section, we extend our discussion to an audience of observers who perform measurements, where the term audience refers to observers who observe the same event. When different individuals interpret the same event, each of them (according to their own interpretation system) can make a personal interpretation that may differ from those of other observers. Note, however, that a similar understanding of the same event is essential for effective communication between observers. In a previous article [2], “quantum-like measurement” was presented as the fundamental phenomenon in the determination process, in which the observed item collapses into one result out of several possible outcomes. The term “quantum-like measurement” refers to a process containing all elements of quantum measurement, particularly the collapse phenomenon, but without identifying it as a pure quantum process. Assuming that all observers obtain the data through quantum-like measurements, each observer is associated with a quantum-like state that indicates the number of measurements he has performed. The space spanned by these states is referred to as the space of the measuring audience (hereafter, the audience space). Consider $N$ observers labeled by the index $o$, where $o = 1, 2, 3, \ldots, N$. Each o-observer is associated with a variable $\tau_o$ that takes the discrete values $\tau_o = 0, 1, 2, \ldots$, where $\tau_o$ represents the number of measurements performed by the o-observer. In the audience space, a state describing distinguishable observers is
$$|\Psi\rangle = \prod_{o=1}^{N} |\tau_o\rangle, \qquad \tau_o = 0, 1, 2, \ldots \tag{7}$$
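We consider $|\tau_o\rangle$ to be a state in a bosonic-like Fock space; i.e., the state is subject to the creation and annihilation operators $b^\dagger$ and $b$ such that

$$\begin{aligned}
b_o^\dagger\,|\tau_p\rangle &= \sqrt{\tau_o + 1}\;|\tau_o + 1\rangle\,\delta_{o,p},\\
b_o\,|\tau_p\rangle &= \sqrt{\tau_o}\;|\tau_o - 1\rangle\,\delta_{o,p}.
\end{aligned} \tag{8}$$

The first relation describes an o-observer who performs a measurement that follows the previous $n$ measurements; the second describes an o-observer who disregards the last measurement.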
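The action of the operators in Eq. 8 on an audience of several observers can be sketched as follows. The tuple representation of the audience state, the zero-based observer index, and the function names are assumptions made for illustration only.

```python
import math


def create(state, o):
    """Apply b_o^dagger to |tau_1, ..., tau_N>; only observer o changes (delta_{o,p})."""
    tau = state[o]
    new_state = state[:o] + (tau + 1,) + state[o + 1:]
    return math.sqrt(tau + 1), new_state


def annihilate(state, o):
    """Apply b_o; on the vacuum of observer o the result is the zero vector."""
    tau = state[o]
    if tau == 0:
        return 0.0, None  # b_o |0_o> = 0: no state remains
    new_state = state[:o] + (tau - 1,) + state[o + 1:]
    return math.sqrt(tau), new_state


audience = (0, 2, 5)                 # measurement counts of three observers
coef, after = create(audience, 1)    # observer with index 1 measures once more
print(coef, after)
```

Each call returns the numerical prefactor together with the new occupation numbers, so the counts of the other observers are visibly left untouched.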
The non-measuring state $|0_o\rangle$ is considered as a vacuum state for the o-observer.

3.1 Single-Measurement Interpretation of o-Observer
We start with $\tau = 0$ (the vacuum state); that is, no measurement has yet been performed. Consider the input state $|\pi(\text{Item})\rangle$, which is shared by all interpreting observers. An o-observer who performs his first measurement explores the state

$$|🔍_o\rangle = |\pi(\text{Item})\rangle\,|0_o\rangle, \tag{9}$$

where the magnifying-glass icon represents an observing o-observer. We omitted the o-subscript from $|\pi(\text{Item})\rangle$ because it is a state common to all observers, and $|0_o\rangle$ represents an o-observer who has performed no measurements. By performing a single measurement (denoted as [1] in the following equation) to interpret the image, the o-observer applies the operator product:

$$I_o[1] = D_o\,C_o\,b_o^\dagger, \tag{10}$$
where we recall that $C$ and $D$ are the operators that represent the classification and determination stages in the interpretation process, in accordance with Eq. 1 (where the classification operator is denoted $R$) and Eq. 5. By applying $I_o[1]$ to $|🔍_o\rangle$, we obtain

$$I_o[1]\,|🔍_o\rangle = D_o\,C_o\,|\pi(\text{Item})\rangle\,|1_o\rangle = D_o \sum_i \alpha_{i,o}\,|I_{i,o}\rangle\,|1_o\rangle. \tag{11}$$

By operating with the observable (determination term) $D_o$ on the state $\sum_i \alpha_{i,o}\,|I_{i,o}\rangle$, the o-observer interprets the state as one specific state (say $|I_{c,o}\rangle$), which is detected (interpreted) with a probability of $|\alpha_{c,o}|^2$. Because the coefficients are indexed with $o$, the probability of detecting $I_c$ varies among different observers. Moreover, since each o-observer has individual eigenconcepts, we conclude that different observers may conceive the same item differently (individually).
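As a worked illustration of this observer dependence, consider two hypothetical observers of the ambiguous figure. The amplitudes below are assumed values chosen only for the arithmetic; they do not come from the paper.

```latex
% Illustrative amplitudes (assumptions), restricted to the concepts B and 13
\begin{align*}
\text{observer } o=1:&\quad \alpha_{B,1} = 0.8,\ \alpha_{13,1} = 0.6
  &&\Rightarrow\ |\alpha_{B,1}|^2 = 0.64,\ |\alpha_{13,1}|^2 = 0.36,\\
\text{observer } o=2:&\quad \alpha_{B,2} = 0.6,\ \alpha_{13,2} = 0.8
  &&\Rightarrow\ |\alpha_{B,2}|^2 = 0.36,\ |\alpha_{13,2}|^2 = 0.64.
\end{align*}
```

Both states are normalized (0.64 + 0.36 = 1), yet observer 1 is more likely to read the letter B, while observer 2 is more likely to read the digits 13.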
4 Memory Space Within Personal Category
In Sect. 3 we defined $\tau$ as a parameter representing a single measurement from a hierarchy of measurements performed by the observer. In this section, in which we describe how observed events are “burned” into memory, the same $\tau$ plays the role of a timestamp dating the observed event. After the determination stage ends, when the data to be interpreted collapses into a single state $|I_c\rangle$, the system must activate a reset process to make room for the next event to be interpreted. Correspondingly, a mechanism is activated that transfers the presently analyzed data from the current space into what we refer to as the memory space. This process contains two stages: a recording stage, in which the interpreted data is transferred to the memory space, and a removal stage, in which it is removed from the interpreting system. As $\tau$ counts the number of measurements conducted by the observer (see Eq. 7), according to our approach the chronological order of $\tau$ defines dating. For example, $\tau = 3$ dates the event that follows the second measurement, i.e., $\tau = 2$. The operator that transfers the data from what we refer to as the “consciousness-like-machine component” to the memory component is:

$$M\{\tau\} \overset{\text{def}}{=} \sum_i |\mu\{I_i\}\rangle\langle I_i|\; b^\dagger b \tag{12}$$
For example, in our demonstration, if the observer interprets the image as the digits 13 in the output of the fourth measurement, we obtain

$$M\{\tau\}\,|13\rangle\,|4\rangle = 4\,|\mu\{13\}\rangle\,|4\rangle. \tag{13}$$

In general, for an interpretation result $|\rho\rangle = |\phi\rangle\,|\tau\rangle$, we obtain:

$$M\{\tau\}\,|\rho\rangle = \tau\,|\mu\{\phi\}\rangle\,|\tau\rangle. \tag{14}$$
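The factor 4 in Eq. 13 can be checked directly from Eqs. 8 and 12. The short derivation below is a sketch that assumes the classification states are orthonormal, $\langle I_i|I_j\rangle = \delta_{ij}$, which the text uses implicitly but does not state.

```latex
\begin{align*}
b^\dagger b\,|4\rangle &= b^\dagger\left(\sqrt{4}\,|3\rangle\right)
  = \sqrt{4}\,\sqrt{3+1}\,|4\rangle = 4\,|4\rangle,\\
M\{\tau\}\,|13\rangle|4\rangle
  &= \sum_i |\mu\{I_i\}\rangle\langle I_i|13\rangle\; b^\dagger b\,|4\rangle
  = |\mu\{13\}\rangle \cdot 4\,|4\rangle
  = 4\,|\mu\{13\}\rangle|4\rangle.
\end{align*}
```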
Focusing on the $|\tau\rangle$ state and the number operator $b^\dagger b$, we see that the timestamp of the event is the eigenvalue $\tau$. By associating the number operator with a measurement, we can say that $\tau$ is the answer to the question, “when did this event happen?”

4.1 Removal of Interpreted Information
After the interpreted item is determined and the event is recorded in the observer's memory, the interpreted item must be removed from the main system. The removal process is represented by the following operation:

$$b_o\,\langle I_{i,o}|\; I_o[1]\,|🔍_o\rangle. \tag{15}$$

Considering the vacuum state $|0_o\rangle$ as the state before measurement, and assuming that in the determination stage the state collapses into $|I_{c,o}\rangle$, we obtain (with $\langle I_{i,o}|I_{c,o}\rangle = 1$ for $i = c$):

$$b_o\,\langle I_{i,o}|\; I_o[1]\,|🔍_o\rangle \;\xrightarrow{\text{After determination}}\; b_o\,\langle I_{i,o}|I_{c,o}\rangle\,|1_o\rangle = |0_o\rangle. \tag{16}$$

Thus, implementing the operation of Eq. 15 yields the zero-measurement state $|0_o\rangle$; that is, no trace of the measurement remains.
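The measure, record, and remove cycle of Sects. 3.1 to 4.1 can be summarized in a bookkeeping sketch. It tracks only the measurement count $\tau$ and the memory records; the function names, the list-based memory, and the assumed outcome "13" are hypothetical choices for illustration, not the author's implementation.

```python
import math


def measure(tau):
    """b^dagger: register one more measurement; returns (coefficient, tau + 1)."""
    return math.sqrt(tau + 1), tau + 1


def record(memory, concept, tau):
    """Memory transfer: b^dagger b contributes the factor tau, used as a timestamp."""
    memory.append({"concept": concept, "timestamp": tau})
    return tau  # eigenvalue of the number operator on |tau>


def remove(tau):
    """b: discard the last measurement; returns (coefficient, tau - 1)."""
    if tau == 0:
        raise ValueError("nothing to remove from the vacuum state")
    return math.sqrt(tau), tau - 1


memory = []
coef_up, tau = measure(0)            # |0> -> |1>
stamp = record(memory, "13", tau)    # the determined concept is an assumed outcome
coef_down, tau = remove(tau)         # |1> -> |0>
print(memory, tau)
```

After the cycle the interpreting system is back in the vacuum state, while the memory keeps the concept together with its timestamp.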
5 Results
In this and previous studies, we defined a quantum-like process to describe interpretation. The purpose of this research is to enable the design of a machine, based on quantum mechanisms, that is able to perform personal interpretation. Because it is based on quantum principles, such a device requires describing the interpretation process with quantum tools. Indeed, we described the scenarios using quantum tools, as detailed below:
1. A state in a Hilbert space represents the data intended for interpretation.
2. The selection of interpretation alternatives is expressed by representing the measured state as a superposition of the possible interpretation alternatives.
3. The interpretation is determined through the quantum collapse.
4. A deleting operator erases information that is no longer needed, and a recording operator stores it elsewhere (in the memory space).
5. Fock space describes the audience of communicating individuals. Creation and annihilation operators track the number of measurements the observers perform to exchange information.
To summarize, we presented the interpretation process as a quantum procedure that could be implemented in a machine capable of interpretation if the appropriate technology becomes available.
6 Conclusion
When solving the Schrödinger equation for the amplitude $\psi(\mathbf{r}) = \langle \mathbf{r}|\psi\rangle$, one gets a predictable time evolution. It is the measurement process, with the corresponding collapse scenario, that introduces randomness [16,17]. Here and in a previous paper [2], we present an additional perspective on this randomness: we suggest treating it as part of a comprehensive process that can be implemented as an interpretation process. As already mentioned in [2], our description of interpretation is purely mathematical and therefore does not necessarily represent a quantum system. However, since we base our framework upon the same mathematical tools and concepts as quantum systems, we can consider a quantum implementation to construct a quantum-based machine capable of interpreting reality. As a measurement-based process, this interpretation description does not remove the need to define an observer who reads the results of the interpretation [2]. Note that in the traditional literature, the observer's role is limited to measuring and reading the outputs; now we can be more specific and say that reading the measurement results means reading the interpretation. We can associate consciousness with an observer reading the eigenconcepts after the determination stage. However, consider a model where we replace the eigenconcepts with operators representing a follow-up activity in response to an observed event. We can refer to this response as a reflective activity.
Acknowledgment. This work is dedicated to my beloved son Lior, who passed away at the age of 34. Lior was an individual with special needs who had unconventional and unique interpretations of his surroundings. He is the inspiration for this study.
References
1. Willig, C.: Interpretation and analysis. In: Flick, U. (ed.) The SAGE Handbook of Qualitative Data Analysis, pp. 136–145. Sage Publishing (2014). ISBN 978-1-4462-0898-4
2. Roth, Y.: Quantum Approach for Planning an Interpreting Machine. Available at SSRN: https://ssrn.com/abstract=4258709
3. Facco, E., Fracas, F.: De Rerum (Incerta) natura: a tentative approach to the concept of quantum-like. Symmetry 14, 480 (2022). https://doi.org/10.3390/sym14030480
4. Haven, E., Khrennikov, A.: Statistical and subjective interpretations of probability in quantum-like models of cognition and decision making. J. Math. Psychol. 74, 82–91 (2016). https://hdl.handle.net/2381/37068
5. Khrennikov, A.: Quantum-like modeling: cognition, decision making, and rationality. Mind Soc. 19(2), 307–310 (2020). https://doi.org/10.1007/s11299-020-00240-6
6. Marinescu, D.C., Marinescu, G.M.: Chapter 2 - Measurements and quantum information. In: Classical and Quantum Information, pp. 133–220. Academic Press (2012). ISBN 9780123838742. https://doi.org/10.1016/B978-0-12-383874-2.00002-3
7. Murdoch, A.: Objectivity in classical continuum physics: a rationale for discarding the ‘principle of invariance under superposed rigid body motions’ in favour of purely objective considerations. CMT 15, 309–320 (2003). https://doi.org/10.1007/s00161-003-0121-9
8. Ghirardi, G., Bassi, A.: Collapse theories. In: Zalta, E.N. (ed.) The Stanford Encyclopedia of Philosophy (Summer 2020 Edition). https://plato.stanford.edu/archives/sum2020/entries/qm-collapse/
9. Bassi, A.: Philosophy of Quantum Mechanics: Dynamical Collapse Theories. Oxford Research Encyclopedia of Physics (2022). https://oxfordre.com/physics/view/10.1093/acrefore/9780190871994.001.0001/acrefore-9780190871994-e-77
10. Kornmeier, J., Bach, M.: Ambiguous figures - what happens in the brain when perception changes but not the stimulus. Front. Hum. Neurosci. 6, 1–13 (2012). ISSN 1662-5161. https://doi.org/10.3389/fnhum.2012.00051
11. Eppler, M.J., Mengis, J., Bresciani, S.: Seven types of visual ambiguity: on the merits and risks of multiple interpretations of collaborative visualizations. In: 2008 12th International Conference Information Visualization, pp. 391–396 (2008). https://doi.org/10.1109/IV.2008.47
12. Moreno-Sanchez, M., Antonio Aznar-Casanova, J., Torro-Alves, N.: How does the mind handle uncertainty in ambiguous figures? Psychol. Res. 6(1), 1–13 (2016). https://doi.org/10.17265/2159-5542/2016.01.001
13. Boring, E.G.: Sensation and Perception in the History of Experimental Psychology. Appleton Century, New York (1942)
14. Roth, Y.: J. Phys.: Conf. Ser. 574, 012085 (2015). https://doi.org/10.1088/1742-6596/574/1/012085
15. Roth, Y.: J. Phys. Commun. 3, 045002. https://doi.org/10.1088/2399-6528/ab128d
16. Bassi, A., Ghirardi, G.: Phys. Rept. 379, 257–426 (2003). https://doi.org/10.1016/S0370-1573(03)00103-0
17. Dürr, D., Goldstein, S., Zanghì, N.: Quantum mechanics, randomness, and deterministic reality. Phys. Lett. A 172(1–2), 6–12 (1992). ISSN 0375-9601. https://doi.org/10.1016/0375-9601(92)90181-K, https://www.sciencedirect.com/science/article/pii/037596019290181K
An Application Based on the Concept of Gamification to Promote Cultural Tourism in the Municipality of San Diego in the Department of Cesar, Colombia
Paola Patricia Ariza-Colpas1,3(B), Marlon Alberto Piñeres-Melo2,3, Roberto-Cesar Morales-Ortega1,6, Andres Felipe Rodriguez-Bonilla3, Shariq But-Aziz4, Diego Armando Rodriguez-Parra3, Ileana Rodriguez-Bonilla3, and Leidys del Carmen Contreras Chinchilla5
1 Department of Computer Science and Electronics, Universidad de la Costa CUC, 080002 Barranquilla, Colombia {pariza1,rmorales1}@cuc.edu.co
2 Department of Systems Engineering, Universidad del Norte, 081001 Barranquilla, Colombia [email protected]
3 Blazing Soft Company, 081001 Barranquilla, Colombia {andres.rodriguez,diego.ramirez,directora.proyectos}@blazingsoft.com
4 Department of Computer Science, University of South Asia, Lahore, Pakistan
5 Faculty of Engineering and Technology, Universidad Popular del Cesar, 200004 Valledupar, Cesar, Colombia [email protected]
6 Certika Company, 081001 Barranquilla, Colombia [email protected]
Abstract. The Department of Cesar is an important tourist destination in Colombia, which most of its visitors choose for vacation, recreation, and leisure. Despite this position, several weaknesses have been identified: a lack of public-private articulation and of participation by academia in the development of projects with greater regional impact, weaknesses in the working capital of the sector, and little entrepreneurship and innovation around the generation of new tourism products. It is therefore necessary to strengthen and expand the coverage of national and regional programs, projects, and initiatives on the proposed routes, corridors, and infrastructure projects, to develop information systems that allow assertive decision-making, and to provide pertinent regulation within the sustainability framework. Tourism is one of the sectors that requires support to strengthen and boost the economy. The purpose of this article is to show the development of an application based on gamification that supports the cultural strengthening of the region, generating appropriation processes in the community that is the object of the intervention, as well as economic revitalization.
Keywords: Tangible and Intangible Heritage · Cultural Heritage · Experiential Experiences · Augmented Reality · Software Application · Gamification
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): SAI 2023, LNNS 711, pp. 586–597, 2023. https://doi.org/10.1007/978-3-031-37717-4_37
1 Introduction
In recent years, tourism has experienced significant growth in Colombia, making the country one of the main tourist attractions, with the objective of positioning it as an international destination. To that end, the new tourism law was issued on December 31, 2020; it is considered a useful tool to reactivate the tourism sector, which was affected by the restrictions imposed during the pandemic. “This new Law, an initiative of Mincomercio, is going to allow tourism to be put on the new path of economic reactivation after a difficult year, and the most interesting thing about this new Tourism Law, Law 2068 of 2020, is that it is a combination of short-, medium- and long-term measures.” This law promotes innovation, competitiveness, and business growth, given the benefits it will generate for both tourists and businesspeople in this sector. The year 2020 was one of the most critical for the sector due to the pandemic restrictions: the number of tourists amounted to 791,673, of which 1,201 (about 0.15%) visited the department of Cesar. Social and cultural tourism has become a central theme for local and departmental governments and is an axis that has been strongly implemented in the national agenda. According to data from the Banco de la República, tourism was the second largest generator of foreign currency in the country, surpassing products such as coffee, flowers, and bananas. Income from this activity has increased by 68.2% in the last 10 years: the country went from receiving US$3,440 million in 2010 to US$5,787 million in 2017. More than $5.7 billion has been invested in hotels in the last 8 years, and nearly 2 million formal jobs related to this work have appeared. The department of Cesar contributed 1.68% to the national GDP in 2020; the composition of the commerce, hotels, and repair sector was 16.3 in Colombia and 11.9 in the department of Cesar, according to the DANE departmental accounts of June 25, 2021.
Based on the previous statistics, the 2020–2023 development plan “We do it better” sets the objective “to strengthen the institutionality of the tourism, culture, art and heritage sector, with firm and innovative support for departmental actors, both from the tourism industry as creative, to build the tourism development strategy, under the mission of internal and external promotion, which consolidates tourism, which contributes to the social and economic growth, productivity and competitiveness of Cesar”. To achieve this, the goal of the tourism program is to improve the regional tourist offer with new products and with quality levels that meet competitive standards, in order to attract local, national, and international tourists to the department of Cesar. To achieve the proposed objectives and position the department as one of the best tourist attractions in the country, not only for its natural but also for its cultural wealth, it is necessary to design tourist and cultural products that support dissemination and the tourist offer. This is where technology plays a fundamental role, given that it facilitates promotion not only regionally but also nationally and internationally. It is therefore required that the secretariats of development, environment, and tourism establish budgets and programs for the development of mobile applications and other technological tools that promote tourism in the department.
This article shows the process of conception and development of a virtual learning object that serves to socialize the key components of cultural and historical tourism in the municipality of San Diego in the department of Cesar. First, conceptual information about OVAs is presented; then some tools used for the development of OVAs are described; finally, the operation of the OVA for the municipality of San Diego is shown.
2 Conceptual Information
An OVA (virtual learning object; from the Spanish objeto virtual de aprendizaje) is defined as a set of digital resources that can be used in various contexts for an educational purpose and is made up of at least three internal components: content, learning activities, and contextualization elements [1]. In addition, the learning object must have an external information structure (metadata) to facilitate its storage, identification, and retrieval. OVAs are part of the ICT tools that are producing changes in teaching and learning methodologies, in the way teachers and students relate to knowledge, and in the way the agents involved in the educational process interact [2]. Rather than being a definable object, an OVA is a complex and multifaceted technological construction, a larger technological puzzle, because it converges with the pedagogical and curricular aspects that derive from the practices of educational technology and of information and communication technology, among others. It is therefore necessary to note that this investigation, which elaborates an OVA to develop numerical skills, is based on the ADDIE model as well as on various learning theories: information theory, which generates a fragmentation process that leaves the information divided into small pieces [3], and the theory of connectionism (Schneider, 1987), where learning is generated from practice and depends on the number of connections learned in previously encountered situations. Similarly, it draws on Bruner's constructivist theory, for whom learning is an active process in which students build new ideas or concepts based on their current and past knowledge. In addition, the student selects and transforms information, builds hypotheses, and makes decisions based on a cognitive structure that provides meaning and organization to experiences and allows the individual to go beyond the given information [4].
The foregoing agrees with what Piaget proposed, namely that what is new is always built from what has been acquired and transcends it [5]: presenting the information in a virtual learning object, with games and in a way different from the traditional one, would be significant. According to [6], the individual builds meanings and mental representations related to the contents; learning is knowing and understanding the meaning, and this is possible to the extent that the new material is anchored as a product of motivation, needs, and desires, since “learning is active” [4]. Similarly, Vygotsky [7] says that culture plays a major role in human development; in the case of this work, that means everything encompassed by the current digital culture. Hence, virtual learning objects could be an interactive strategy for learning and teaching. SCORM (Shareable Content Object Reference Model) is defined as a set of specifications that propose a content aggregation model (CAM), a run-time environment (RTE), and the sequencing and navigation (SN) of the contents [8]. Likewise, it is argued that SCORM is currently the standard with the greatest impact in the industry, since it has been implemented as a reference model in the largest number of systems [9]. Thus, authoring tools such as eXeLearning allow virtual learning objects to be built and exported as a SCORM package, whose XML manifest contains the information needed to launch the content in an LMS [10].
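To make the idea of a SCORM package concrete, the following sketch assembles a minimal archive with an XML manifest and a single launch page. It is a simplified illustration that assumes SCORM 1.2 conventions; the identifiers, title, and file names are hypothetical, and a package accepted by a real LMS would normally also include the schema files and metadata that are omitted here.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by SCORM 1.2 manifests.
IMSCP = "http://www.imsproject.org/xsd/imscp_rootv1p1p2"
ADLCP = "http://www.adlnet.org/xsd/adlcp_rootv1p2"
ET.register_namespace("", IMSCP)
ET.register_namespace("adlcp", ADLCP)


def build_manifest(title, launch_file):
    """Return a simplified imsmanifest.xml (as bytes) with one launchable SCO."""
    manifest = ET.Element(f"{{{IMSCP}}}manifest", {"identifier": "ova-demo", "version": "1.0"})
    orgs = ET.SubElement(manifest, f"{{{IMSCP}}}organizations", {"default": "org1"})
    org = ET.SubElement(orgs, f"{{{IMSCP}}}organization", {"identifier": "org1"})
    ET.SubElement(org, f"{{{IMSCP}}}title").text = title
    item = ET.SubElement(org, f"{{{IMSCP}}}item", {"identifier": "item1", "identifierref": "res1"})
    ET.SubElement(item, f"{{{IMSCP}}}title").text = title
    resources = ET.SubElement(manifest, f"{{{IMSCP}}}resources")
    res = ET.SubElement(resources, f"{{{IMSCP}}}resource", {
        "identifier": "res1", "type": "webcontent",
        f"{{{ADLCP}}}scormtype": "sco", "href": launch_file,
    })
    ET.SubElement(res, f"{{{IMSCP}}}file", {"href": launch_file})
    return ET.tostring(manifest, encoding="utf-8", xml_declaration=True)


def build_package(title, launch_file, html):
    """Zip the manifest and the launch page into an in-memory package."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("imsmanifest.xml", build_manifest(title, launch_file))
        zf.writestr(launch_file, html)
    return buffer.getvalue()


package = build_package("San Diego heritage tour", "index.html", "<html><body>Demo</body></html>")

# Read the package back, as an LMS importer would, and locate the launch resource.
with zipfile.ZipFile(io.BytesIO(package)) as zf:
    names = zf.namelist()
    root = ET.fromstring(zf.read("imsmanifest.xml"))
resource = root.find(f".//{{{IMSCP}}}resource")
print(names, resource.get("href"))
```

The manifest ties the organization (the course outline shown to the learner) to the resource that the LMS launches, which is the role the text attributes to the package's XML content.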
3 State of the Art
On the web we can find various types of Learning Objects (LOs): video, audio, slides, conceptual maps, and simulations, among others. In many cases an LO brings together different formats in the same delivery, for example, a Flash file with animated content and interactive exercises. In any case, some standards must be met when building an object for education, and they can be found in the SCORM documentation. This allows us to document metadata to identify the created object, define a structure, and even communicate with the LMS (Learning Management System, e.g., Moodle). A virtual course implemented in an LMS is made up of one or several Learning Objects. In turn, there are various tools for creating virtual learning objects (known as authoring tools); the best known on the market are reviewed next. Vcasmo [11] is software limited to generating videos made up of a slide and a presenter side by side. JCLIC is a pioneering authoring tool for creating activities and games; the objects it generates can be viewed on the web (mobile devices are not supported), it must be installed on the computer before use, and it does not allow adding theoretical content, only developing activities. Another tool is Claro [12], developed by the company DominKnow, which was awarded as the best tool in 2013. It produces multilanguage results, supports animations and actions on the incorporated elements, and has a bank of predefined contents; its drawback is that it has many options and requires a long learning curve for the teacher who wants to use it. Articulate [13] is oriented toward displaying avatars with texts and allows the configuration of events and working with layers. This is precisely what makes it difficult for a standard user to obtain results in a short time; it has the option to export to HTML5 for viewing on mobile devices. GoAnimate [14] offers avatars, animations, voice synchronization, and other options arranged on a timeline to develop various skills within the course, but it does not support the development of activities. Adobe Captivate [15] is another well-known tool, but it requires installation on the computer and time to learn how to use it. Lastly, we highlight Zenler, which provides a PowerPoint plugin to build SCORM content from this office automation tool. The systematic review of the literature also identified applications that promote cultural development through gamification processes and allow cultural appropriation in different contexts; they are described in Table 1:
Table 1. Applications of Virtual Learning Objects to the Tourism Sector

| Reference | Description |
|---|---|
| [16] | An application was developed that supports the appropriation of tourism related to bird species in Bogotá, Colombia. Through this tool, bird tourism is encouraged, allowing the social appropriation of environmental knowledge related to birds |
| [17, 18] | Application based on virtual learning objects for teaching care methods at home. As a result of the implementation, it was possible to identify an improvement in the processes of patient care through the implementation of educational technology |
| [19] | Virtual reality-based application in higher education at Universitas Negeri Jakarta, Indonesia. One group of students used new technologies in learning, while a control group followed traditional learning processes. The experiment determined the impact of the use of new technologies on cognitive and learning processes in students who used the application, in contrast to those who did not participate in the process |
| [20, 21] | A pedagogical strategy based on gamification processes was carried out to improve reading comprehension in the rural area of Pitalito, Huila, Colombia. It stimulates the motivational process of learning and reading comprehension, interacting with new knowledge and allowing the creation of knowledge networks |
| [22] | A virtual learning object that supports the learning of cardiovascular and respiratory responses was developed and validated by a group of 24 nurses, and its usability was verified by public university students. The categories of the virtual learning object were defined according to the analysis of three basic components (introduction, cardiovascular responses, and lung responses), obtaining an acceptance level of 85% |
| [23, 24] | A virtual learning object was developed to support the restructuring of the life project of young users of psychoactive substances, through different strategies of appropriation and continuous improvement |
| [25] | Development of a virtual learning object for dental students to identify the different components and materials used in the theoretical-practical implementation process. To evaluate the effectiveness of the VLO, an analysis based on a pretest and post-test was carried out, in which the students were characterized and it was possible to contrast the evolution of those who had had contact with the technology for the apprehension of the concepts |
| [26, 27] | A virtual learning object was developed to improve English learning skills in secondary school boys and girls, showing an improvement in the use of different grammatical concepts for the development of reading, comprehension, and speaking skills in the English language |
As a result of this review, it was possible to identify that the application that is later defined, in addition to having not only relevant information from the municipality, also allows generating support for the dynamic process of understanding the tangible and intangible heritage of the department of Cesar.
4 Methodology
The development of this application comprises the following phases.
4.1 Creation of a Learning Object Model
In this phase, the structure of the Learning Objects to be built with the web platform is designed, complying with the technical standards proposed by SCORM and integrating different types of multimedia resources. The goal is a Learning Object that can present content in different formats and support the ongoing development of different types of activities. The conceptual model and the technical model will be documented, the latter with the support of UML.
4.2 Management and Design of Knowledge Transfer Strategies in E-learning and SCORM
One of the main characteristics of the authoring tool to be built is the transfer of knowledge related to e-learning and the SCORM standard. A teacher with no training in ICT in education will be able to learn about the subject through the platform while building their first Learning Object. In this phase, the base material is selected and the strategies for presenting the contents throughout the software are designed.
4.3 Design of the Business Model
In this phase, the “Software as a Service” (SaaS) strategy is analyzed and a dedicated model is designed to offer access to the authoring tool through different plans, ranging from free but limited access to resources to unlimited access for universities and all their teachers. A technological surveillance process will also be carried out to identify sustainability, marketing, and scalability strategies for its functionalities.
4.4 Identification and Analysis of Software Requirements
A detailed document will be prepared for each of the functional characteristics that the platform must have to respond to the “Learning Object Model”, the “Knowledge Transfer” strategies, and the proposed “business model”. The functional requirements must cover the Learning Object builder and the virtual community. Subsequently, the documented requirements will be analyzed in order to validate them in different scenarios and apply technological prospecting.
4.5 Design and Development of the Web Platform
In this phase, the entire software engineering process is carried out, covering database design, modeling of the application architecture, creation of algorithms, coding, and testing of the system.
4.6 Design and Development of Templates for the Construction of Learning Objects
In this phase, the different template options that will be available on the platform are designed, built, and categorized. The aim is to offer templates of all kinds so that users can easily locate the one they need. In addition, the API will be documented so that new templates can continue to be developed later.
4.7 Construction of a Multimedia Resources Bank
First, resources with a free license are collected, and then the list of new resources to be developed (images, photos, and animations) is defined. A certain number of resources in different formats and for different areas of knowledge should be considered. The acquisition of paid image banks is also contemplated. Finally, all the resources are organized by category so that teachers can locate them easily.
4.8 Testing and Analysis of Results
In this last phase, the web platform is deployed, and people from the different municipalities will be able to access the virtual learning object.
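As a rough illustration of the learning object model described in Sect. 4.1, the following Python sketch shows one possible in-memory structure for a learning object that holds content in different formats and several evaluation activities. All class names, field names, and file paths (Resource, Activity, LearningObject, media/...) are hypothetical; they are taken neither from the platform nor from the SCORM specification, and the sketch does not produce a SCORM package. The two sample questions reuse geographic facts stated in Sect. 5.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Resource:
    """A multimedia resource attached to a learning object (hypothetical)."""
    title: str
    media_type: str  # e.g. "image", "video", "audio", "text"
    uri: str


@dataclass
class Activity:
    """A multiple-choice evaluation activity (hypothetical)."""
    question: str
    options: List[str]
    correct_index: int

    def is_correct(self, chosen_index: int) -> bool:
        return chosen_index == self.correct_index


@dataclass
class LearningObject:
    """Groups resources and activities under one title (hypothetical)."""
    title: str
    resources: List[Resource] = field(default_factory=list)
    activities: List[Activity] = field(default_factory=list)

    def media_types(self) -> List[str]:
        """Sorted list of the distinct media formats used."""
        return sorted({r.media_type for r in self.resources})

    def score(self, answers: List[int]) -> float:
        """Fraction of activities answered correctly (0.0 if there are none)."""
        if not self.activities:
            return 0.0
        correct = sum(a.is_correct(ans) for a, ans in zip(self.activities, answers))
        return correct / len(self.activities)


# Illustrative content only; the paths below are placeholders.
ova = LearningObject(
    title="San Diego Tourism OVA",
    resources=[
        Resource("Map of the municipality", "image", "media/san_diego_map.png"),
        Resource("Traditions of the municipality", "video", "media/traditions.mp4"),
    ],
    activities=[
        Activity("Which river lies between San Diego and Valledupar?",
                 ["Mocho River", "Cesar River"], correct_index=1),
        Activity("Which river marks the limit with La Paz?",
                 ["Mocho River", "Cesar River"], correct_index=0),
    ],
)
```

Calling `ova.media_types()` lists the formats in use, and `ova.score([1, 0])` returns the fraction of activities answered correctly. Keeping resources and activities as separate lists reflects the stated goal of presenting content in different formats while adding activities independently.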
5 San Diego's Tourism OVA
The municipality of San Diego belongs to the north-eastern area of the department of Cesar and has a land area of 670 square kilometers. It borders the municipality of La Paz to the north, east, and south, and the municipality of Valledupar to the west, with the Cesar River in between. It is located 180 m above sea level, with an average temperature of 27 °C in the municipal capital; in the foothills of the Serranía del Perijá the temperature ranges between 15 °C and 20 °C.
The municipal territory comprises two clearly defined regions. The first is a flat, low, high-temperature zone located in the plain of the Cesar River, near which there are flood-prone lands that are very suitable for tropical agriculture and livestock. The second is the mountainous area of the foothills of the Serranía del Perijá, where the districts of Tocaima, Media Luna, and El Rincón are located; because of its climate, it has lands favorable for medium-climate agriculture.
The municipality is watered by a fluvial system made up of a set of streams, most of them tributaries of the Cesar River. This water wealth, although not used rationally, has facilitated irrigation for technified agriculture and livestock on an almost permanent basis. The main streams are El Chiriaimo, El Tocaimo, El Salado, El Perú, El Piecito, and El Ceras, together with a branch of acequias (irrigation ditches) connected to these streams, in addition to the Mocho River, which marks the limit with the territory of La Paz.
The topographic setting of the settlement of San Diego has the same characteristics as all the municipal capitals of the department. The urban layout follows the Spanish grid system, with regular blocks of 80 × 80 m and uneven sidewalks and platforms, which are at different levels in each case.
In the old sector, that is, the center of the town, the grid is less precise. The Spanish grid that has been the pattern of urbanization and ordering in San Diego does not obey a rational principle but rather a tradition. This has produced a dispersed urban layout with large, underutilized areas of land in the interior of the blocks, which in many cases favors their productive potential. For this reason, the urban area is large compared to the density of its population.
This project is developed within the framework of a project for the inclusion of new technologies and augmented reality for the 25 municipalities of the department of Cesar. The application first displays a general map of the department of Cesar, see Fig. 1.
Fig. 1. Initial Screenshot of the Municipalities of the Department of Cesar
Once the municipalities of the department are displayed, the user can select the municipality of San Diego, located in the north of the department of Cesar, see Fig. 2. Clicking on the chosen municipality opens the initial welcome screen for the municipality whose tourist component the user wants to learn about, see Fig. 3. After entering, the user sees the cultural and historical information about the municipality, defined from the survey of the information to be shown to tourists, see Fig. 4. Additionally, audiovisual content showing the traditions of the municipality can be viewed within the OVA, for example architectural places of interest, food, and tangible and intangible traditions, see Fig. 5. There are also different learning evaluation strategies that cover the key aspects of the information the OVA provides about the municipality and make it possible to verify whether users have acquired the required knowledge, see Fig. 6.
Fig. 2. Initial Screenshot of the Municipality of San Diego
Fig. 3. Initial Screenshot of the San Diego OVA
Fig. 4. Cultural and Historical Information of the Municipality of San Diego
Fig. 5. Examples of Multimedia Content within the OVA
Fig. 6. Activities in OVA
6 Results
The OVA digital resource, when implemented, promotes learning and understanding of the culture of the municipality of San Diego in the department of Cesar, facilitating the strengthening and appropriation of Caribbean culture. The OVA brings together pedagogical and didactic elements that support the teaching-learning process with the help of ICT. All these elements allow autonomous, individualized learning at each learner's own pace. However, these digital resources do not work without the interest of the learner and the acquisition of certain basic technological skills. The OVA implementation process was found to have a positive impact on the strengthening of culture in the department of Cesar.
7 Conclusions
With the development of this application, it was possible to showcase the different tourist and cultural sites of the municipality of San Diego and to support the economic revitalization expected from the use of the application. As a result, different members of the community were able to learn about many intangible traditions of the municipality.
The education of the 21st century requires the use of all these types of digital tools to educate, present ideas, and cooperatively synthesize processes. The platform implementation process revealed the following options for improving the learning processes, which can be taken up as future work:
• Consider the needs of the community when planning pedagogical activities, in order to develop the didactic units consistently and to improve conditions and quality of life.
• Make use of technological tools and integrate them into the daily tasks of school learning. Innovating with new methods that enrich multimedia language to inform and express ideas, and encouraging the expression of newly acquired knowledge so as to improve critical thinking and the development of various competencies, is the task of the educational community.
• Implement a strategy with the OVA, a resource under permanent construction, to promote cultural identity in different scenarios.
References
1. Tovar, L.C., Bohórquez, J.A., Puello, P.: Propuesta metodológica para la construcción de objetos virtuales de aprendizaje basados en realidad aumentada. Formación universitaria 7(2), 11–20 (2014)
2. Cabrera, J.: Un Objeto Virtual de Aprendizaje (OVA) para el Movimiento Armónico Simple (M.A.S) y sus Aplicaciones. Entornos 28, 71–85 (2014). https://doi.org/10.25054/01247905.526
3. Kurts, C., Kosaka, H., Carbone, F.R., Miller, J.F., Heath, W.R.: Class I-restricted cross-presentation of exogenous self-antigens leads to deletion of autoreactive CD8+ T cells. J. Exp. Med. 186(2), 239–245 (1997)
4. Bruner, J.: Actos de significado más allá de la revolución cognitiva. Alianza (2006)
5. Hernandez Suarez, C.A., Rojas Suárez, J.P., Albarracín, C.Z.: Objeto virtual de aprendizaje para desarrollar las habilidades numéricas: una experiencia con estudiantes de educación básica. Panorama 14(26), 111–133 (2020)
6. González Carreño, A.: La OVA como recurso didáctico para la enseñanza de las operaciones matemáticas básicas (2019)
7. Vygotsky, L.: La mente en la sociedad: El desarrollo de los procesos psicológicos superiores. Harvard University Press, Cambridge, MA (1978)
8. Hilera, J., Hoya, R.: Estándares de E-learning: Guía de consulta. Universidad de Alcalá, España (2010)
9. Mayorga, M., Alfonso, D., Escamilla, R.: Los contenidos educativos digitales como apoyo al proceso de aprendizaje en programas virtuales de la Universidad Antonio Nariño. En XIV Encuentro internacional Virtual Educa, Medellín, Colombia (2013)
10. Alvarez-Marin, A., Castillo-Vergara, M., Pizarro-Guerrero, J., Espinoza-Vera, E.: Realidad aumentada como apoyo a la formación de ingenieros industriales. Formación universitaria 10(2), 31–42 (2017)
11. Vcasmo Studio. Vcasmo App. https://www.vcasmo.com
12. dominKnow. Claro App. https://www.dominknow.com/elearning-authoring-tools
13. Articulate Inc. Articulate App. https://articulate.com
14. Vyond Inc. GoAnimate App.
https://think.vyond.com/get-started-today
15. Adobe Inc. Adobe Captivate. https://www.adobe.com/co/products/captivate.html
16. Forero, J.A.M., Rodríguez, A.C.P.: Community-based avitourism as a tool for environmental appropriation. In: International Conference on Tourism Research, pp. 399–407. Academic Conferences International Limited (2021)
17. Atingo, P.: Use of Technology in Teaching Hospitality (2020)
18. Piñeres-Melo, M.A., Ariza-Colpas, P.P., Nieto-Bernal, W., Morales-Ortega, R.: SSwWS: structural model of information architecture. In: International Conference on Swarm Intelligence, pp. 400–410. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-26354-6_40
19. Kustandi, C., Fadhillah, D., Situmorang, R., Prawiladilaga, D., Hartati, S.: VR use in online learning for higher education in Indonesia (2020)
20. Astaiza, M.N.: Comprensión lectora inferencial a través de una estrategia pedagógica apoyada en un OVA. Revista Dialogus, pp. 31–43 (2023)
21. Ariza-Colpas, P.P., Piñeres-Melo, M.A., Nieto-Bernal, W., Morales-Ortega, R.: WSIA: web ontological search engine based on smart agents applied to scientific articles. In: International Conference on Swarm Intelligence, pp. 338–347. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-26354-6_34
22. Rueda Díaz, L.J., Mercado Miranda, D.A., Padilla García, C.I.: A reusable learning object for assessment of cardiovascular and respiratory responses. Investigación y Educación en Enfermería 40(2) (2022)
23. Pinzón Barragán, E.F., Rodríguez Pinilla, A.N.: Objeto virtual de aprendizaje (OVA), para formación logoterapéutica y proyecto de vida en adolescentes, con problemática de consumo de sustancias psicoactivas (2022)
24. Ariza-Colpas, P., Herrera-Tapias, B., Piñeres-Melo, M., Guerrero-Cuentas, H., Consuegra-Bernal, M., De-la-Hoz Valdiris, E., Morales-Ortega, R.C.: Cyclon language first grade app: technological platform to support the construction of citizen and democratic culture of science, technology and innovation in children and youth groups. In: International Conference on Intelligent Human Computer Interaction, pp. 270–280. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-44689-5_24
25. De Cesare, F., Araújo, G.S., Tubelo, R.A., Rodrigues, S.B., Leitune, V.C.B., Collares, F.M.: Desenvolvimento e avaliação do uso de um objeto virtual de aprendizagem com simulação virtual sobre alginato. Revista da ABENO 22(2), 1949 (2022)
26. Correa Martínez, R., Sarmiento Fontecha, K.M.: OVA un recurso digital para facilitar el aprendizaje del inglés en los estudiantes de grado sexto del colegio para hijos de empleados de la Contraloría General de la República (2022)
27. Ariza-Colpas, P., Leon-Jacobus, A., De-la-Hoz, S., Piñeres-Melo, M., Guerrero-Cuentas, H., Consuegra-Bernal, M., Collazos Morales, C.A.: Glyph reader app: multisensory stimulation through ICT to intervene literacy disorders in the classroom. In: International Conference on Intelligent Human Computer Interaction, pp. 259–269. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-44689-5_23
Platform Based on Augmented Reality to Support Cultural Tourism in the Department of Cesar, Colombia
Paola Patricia Ariza-Colpas1,3, Marlon Alberto Piñeres-Melo2,3, Roberto-Cesar Morales-Ortega1,6, Andres Felipe Rodriguez-Bonilla3, Shariq But-Aziz4, Leidys del Carmen Contreras Chinchilla5(B), Maribel Romero Mestre5, and Ronald Alexander Vacca Ascanio5
1 Department of Computer Science and Electronics, Universidad de la Costa CUC, 080002 Barranquilla, Colombia [email protected], [email protected]
2 Department of Systems Engineering, Universidad del Norte, 081001 Barranquilla, Colombia [email protected]
3 Blazing Soft Company, 081001 Barranquilla, Colombia [email protected]
4 Department of Computer Science, University of South Asia, Lahore 44000, Pakistan
5 Faculty of Engineering and Technology, Universidad Popular del Cesar, 200004 Valledupar, Cesar, Colombia {leidyscontreras,maribelromero,ronaldalexandervacca}@unicesar.edu.co
6 Certika Company, 081001 Barranquilla, Colombia [email protected]
Abstract. Tourism is one of the sectors most affected by the Covid-19 pandemic: according to the UNWTO, its income fell by more than half in 2020 compared to the previous year. In Colombia the picture is no different, with sales in the hotel sector and travel agencies falling by close to 70% compared to 2019. Even so, it is one of the sectors on which many regions rely for post-pandemic economic recovery. In the context of the current Covid-19 contingency, it is therefore necessary to redefine the tourist experience. Various international organizations and authors propose innovation and digital transformation as key strategies for the resilience of the tourism sector. National policies for economic reactivation bet on the promotion of tourism, strengthened through science, technology, and innovation (CTI) activities and the integration of the private sector and academia. This article presents the development of an application based on augmented reality, the first software application to promote cultural tourism in the department of Cesar. The application was developed using agile methodologies that allow direct interaction with the client, which contributed to a significant impact on economic development in the department of Cesar.
Keywords: Tangible and Intangible Heritage · Cultural Heritage · Economic Reactivation · Cultural Tourism · Experiential Experiences · History · Augmented Reality
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): SAI 2023, LNNS 711, pp. 598–612, 2023. https://doi.org/10.1007/978-3-031-37717-4_38
1 Introduction
The tourism sector is one of the sectors on which many regions rely for post-pandemic economic recovery. In 2016, the UNWTO (World Tourism Organization) reported that goods and services associated with tourism accounted for 7% of total world exports and for more than 5% of total GDP in Latin American countries such as Panama, Costa Rica, and Mexico, among others. Similarly, at the global level, the UNWTO has announced actions for the relaunch of tourism in the Americas, among which is the commitment to: “Innovation and digital transformation as essential factors for tourism to be more resilient and advance sustainably determined in the implementation of new development models” [2].
The increased use of technology will play a key role in the “new normal” of the tourism sector, in a context where technological solutions will multiply and diversify naturally thanks to the internet, IoT, 5G networks, and smartphones. These technologies are already influencing the way trips are planned and experienced, and how tourist destinations are promoted. Technology is already the protagonist in experiential or immersive tourism, where the traveler becomes the main actor of the experience, using virtual reality (VR), augmented reality (AR), and 360° videos.
In 2020, the tourism sector was one of the most affected by the pandemic, with international tourism revenues falling by 64% in real terms (local currencies, constant prices) according to the UNWTO barometer. In Colombia, losses were close to six billion pesos in the hotel sector, and travel agencies saw sales drop by 70.4%. In this regard, the economic reactivation strategies proposed by CONPES 3999 [1] for the initial response to the Covid-19 contingency comprise four thematic lines, among which the national and international tourism promotion campaigns stand out.
The application detailed in this paper was developed for the department of Cesar, one of Colombia's 32 departments, located in the northeast of the country, see Fig. 1. The department is named after the Cesar River, which rises in the foothills of the Sierra Nevada de Santa Marta and runs through the department from north to south, where it forms the Zapatosa swamp. The magic of Valledupar and Cesar is mainly linked to its topography: its bright light and especially its deep blue sky, caused by the reflections of the sun on the snows of the Sierra Nevada; its unique acoustics and sound; and its geographical conformation, which make it one of the departments with the greatest cultural and artistic wealth in all of Colombia.
Even though the department of Cesar is clearly identified as one of the places with the greatest cultural and historical wealth in Colombia, it lacks many of the technological tools that would strengthen its economy in terms of tourism. Because of this identified weakness, the following research question was posed: How can technologies based on Augmented and Virtual Reality strengthen the processes of economic and tourist revitalization of the department of Cesar?
This article is organized as follows. First, conceptual information on augmented reality is presented; then the state of the art of augmented reality applications in the tourism sector is reviewed; next, the application development methodology is described; and finally, the characteristics of the software development are presented.
Fig. 1. Location of the Department of Cesar in Colombia
2 Conceptual Information
2.1 Augmented Reality Definition
Augmented Reality (AR) is a relatively recent technique, in use for approximately 25 years; however, it has grown rapidly, driven by the massive use of mobile devices. It is precisely on these devices that new functionalities have been developed, allowing a growing body of content to be generated for augmented reality. AR has been applied in many areas, most notably education and cultural and territorial heritage; it is in these sectors that it has found the greatest applicability, strengthening tourism and showcasing the cultural strengths of a particular place.
One of the great advantages of AR is that, being derived from computer science, it can be adapted to different contents and methods and can permeate applications ranging from advertising to medicine. This advantage allows content to be presented fluidly and improves both the user experience and the understanding of the phenomena displayed.
The concept of augmented reality is framed around techniques that allow reality and digital representation to interrelate. AR was initially defined as a combination of virtual and real information displayed through a computer, in which what is observed is complemented with digitally generated content [3–10]. Augmented reality uses the real world as support and context for the information to be digitized. This representation is achieved by complementing the real world with digital and virtual data, forming a complex experience in which it is possible to visualize information that the user cannot normally see.
2.2 Historical Applications of Augmented Reality
The applications that mark the evolution of this technology throughout history are detailed below:
• The idea was born from fiction in 1901, in the work “The Master Key” by L. Frank Baum, which presents an electronic device capable of displaying information over data from reality [11].
• Sensorama, devised by the filmmaker Morton Heilig, was a device that combined sounds, smells, and visualizations.
• “The Ultimate Display”, created by Ivan Sutherland in the 1960s, materialized the ideas of fiction: the user could interact with a special environment, different from reality, that emulated the laws of physics.
• The “head-mounted display”, also created by Ivan Sutherland, allows a user to see an image that takes into account the movements of the head [12].
• VideoPlace, created in 1975 by Myron Krueger, allowed users to interact with computer-generated objects; it was the first user-technology interaction [10, 13].
• “Wearable 1”, devised by Steve Mann of the University of Toronto, was a visual system in which text and images could be superimposed on reality, making it, in this context, the first application of augmented reality. This device was the precursor to Google Glass [14].
• Synoptic weather maps: a system developed in 1981 that incorporated digital information into maps to show weather and climate forecasts, using TV cameras.
• “Super Cockpit” was a system installed in the helmet of aircraft pilots to display relevant information about the status of the flight on a display screen [13].
• In the 1990s, Boeing researchers Tom Caudell and David Mizell introduced the term Augmented Reality, referring to a new technology that could display information by augmenting the visual field, taking data from a space observed in reality [10–15].
• “Virtual Fixture”, developed by the researcher Louis Rosenberg and the United States Air Force Research Laboratory in 1992, was one of the first functional AR systems [16].
• “Karma”, developed by Steven Feiner, Blair MacIntyre, and Dorée Seligmann, is considered the first complex AR application [4].
It was only with the publication of the scientific article “A taxonomy of mixed reality visual displays” by Paul Milgram and Fumio Kishino in 1994 that a clear conceptual framework was established for differentiating virtual and real environments, under a concept later adopted as the “virtuality continuum”, which in turn encompasses mixed, virtual, and augmented environments [8]. Milgram, Takemura, Utsumi, and Kishino, in their publication “Augmented Reality: a class of displays on the reality-virtuality continuum”, defined and described the environments that a computer can generate by combining real and virtualized information. The widespread use of the term augmented reality (AR) was strengthened by Ronald Azuma's 1997 publication “A survey of augmented reality”, in which Azuma describes the characteristics that this technology should have.
In 1998, Hirokazu Kato and Mark Billinghurst developed the ARToolKit framework, which allowed the developer community to create augmented reality content and applications. It was distributed under the GPL (General Public License), developed in collaboration with researchers at the Nara Institute of Science and Technology, Japan, and released by the University of Washington [13]. Also in 1998, the researchers Ramesh Raskar, Greg Welch, and Henry Fuchs introduced the term Spatial Augmented Reality, in which elements of the physical environment are complemented with images integrated into the way objects are displayed.
At the end of the 1990s, this type of technology began to spread widely in the scientific community through the creation of world-class conferences where researchers from different latitudes converge to present their application experiences. Among the most popular are:
• IWAR - IEEE International Workshop on Augmented Reality.
• ISMR - International Symposium on Mixed Reality.
• ISMAR - IEEE International Symposium on Mixed and Augmented Reality.
During the 21st century, a set of video games and applications based on augmented reality technology emerged, such as:
• ARQuake 2, an application developed by Bruce Thomas, used GPS together with a kind of digital compass to detect the location of players and elements. This allowed the user to obtain points and marks of interest based on a real environment and to participate in each of the challenges established in the game [17].
• The AR-PDA system, developed by Jürgen Freund for Personal Digital Assistants (PDAs), allowed the visualization of augmented reality. The application augmented a captured real image with digitized virtual information [18].
• Michael Kalkusch developed a system based on augmented reality to guide users' movement; the application relied on virtual markers present in physical space [19, 20].
• Mobile AR Authoring System, developed by Guven and Feiner [21], includes multimedia and hypermedia elements in a portable 3D system.
The increase in mobile devices with more storage and processing capacity has allowed a boom in applications based on augmented reality:
• The first mass-market application was developed by Mathias Möhring. It is based on 3D markers for mobile telephony and uses graphics rendering technology for each piece of information [22].
• In 2009, the companies Wikitude and Metaio developed mass-market AR applications through the generation of content and platforms based on this technology; these were the first for smart mobile devices [23].
3 State of the Art
Currently, in the department of Cesar, there are no freely accessible mobile or web applications supported by augmented and/or virtual reality for cultural tourism scenarios. The increase in the use of smartphones in everyday life and the easy access to information thanks to the internet [24] have changed how users experience and explore the tourist and cultural content of different locations, not only at the local level but also nationally and internationally. This is where the proposed project gains relevance: it addresses the lack of freely accessible local implementations for cultural tourism, supported by the wide portfolio of places and services that the department of Cesar has and by the wide coverage of mobile internet connectivity. These factors create a favorable scenario for mobile applications that offer experiences beyond what can be perceived tangibly, reaching intangible implementations that awaken new sensations, experiences, and emotions in the users of the applications to be developed.
Since the mid-2000s, the use of instrumental technologies in Cultural Heritage has been extended to immersive technologies, the latter understood as: “a collective term for augmented-, virtual-, and mixed-reality technologies, which provide sensory experiences through various combinations of real and digital content” [25]. The authors also classify the applications of immersive technology in cultural and archaeological heritage into five categories: education, exhibition enhancement, exploration, reconstruction, and virtual museums. In times of social distancing, several museums and platforms have adopted immersive technologies to attract visitors who are in different geographical locations, such as Google Expeditions, Google Arts and Culture, or the Uxart project in consortium with IBM [26]. In the field of tourism, various services have emerged that use immersive technologies such as augmented reality combined with gamification to improve the experience of visitors, such as the projects carried out by the Spanish startups Play&Go and Visyon 360.
Likewise, the study by Bernard Conde [27] on the use of the Fuendetodos app, which seeks to expand and stimulate knowledge about Goya and his presence in the town of Fuendetodos, shows that AR not only promotes learning about cultural heritage but also allows users to participate in the process. At the national level, there are several examples of technology applied to the dissemination of heritage and history. One is the project of heritage plaques for the digital age, run by the Mayor's Office of Pereira through the Municipal Institute of Culture and Promotion of Tourism, whose objective was the dissemination and appropriation of heritage, as well as the improvement of the city's tourist offer, through QR codes placed along three historical routes in the center. More recent initiatives can be found in the city of Bogotá, with the call launched by the Ministry of Culture, Recreation and Sports (SCRD) in 2020 to reactivate the economic chains of the tourism sector for the locality of La Candelaria in the historic center; the winning application proposed virtual tours through immersive 360° videos [28]. Bilateral cooperation agencies such as the UNDP also include immersive technologies in their agendas, as is the case of the augmented development project of the UNDP acceleration lab, which seeks to accelerate digital connectivity for and with lagging territories [29]. At the regional level, the Mapuka museum [30] of the Universidad del Norte presents an attractive proposal in which, through a mobile application, the user can not only obtain additional information about the exhibited objects and the history of the Karib cultures, but also interact with and manipulate them virtually, as well as play games and create audiovisual content. All these experiences show the uses
P. P. Ariza-Colpas et al.
Table 1. Applications of Augmented Reality to Different Sectors

| Reference | Description |
|---|---|
| [32] | Iskandar.my is a multimedia-based application that presents different sites of interest and lets users rate their satisfaction with the services offered around each place |
| [33, 34] | A virtual reality application for “Lawang Sewu”, in Java, Indonesia, developed with Unity, Vuforia, and Maya, which offers an interactive scenario to each visitor to this attraction |
| [33] | The cARica application is a mobile augmented reality application for Wonosobo, in Central Java, Indonesia, which shows the different tourist attractions and uses Google Maps so that users can travel through the region and get to know its attractions |
| [35, 36] | The application “AR The Gods of Olympus” enlivens the teaching of Greek mythology through the identification and study of the gods of Olympus, bringing people in Greece closer to the basic concepts of this history |
| [37] | The Globe 360 application lets different actors view a staging that energizes the centralization of events in the advancement of theater, making use of 21st-century technologies |
| [38, 39] | The AMOR project focuses on the conservation and valorization of cultural heritage through mixed reality, with which tourists can access different tourist attractions and learn about culture; it is applied to the Ancient Baths of Caracalla in Rome, Italy |
| [40] | The AR-sculptures application uses augmented reality to teach about the components of different sculptures and their history, placing the user in the historical moment in which the work of art was created |
| [41, 42] | A support platform for the development of halal tourist activities that stimulates different sectors and points of interest to encourage the entry of residents into the region |
of AR in cultural tourism, making available information and interactions that would not be easily accessible, enriching the user experience in the city setting. “Regarding the use of information technologies, it is very common for visitors to use accommodation search engines with offers to enjoy their stay. Some applications facilitate the trip for tourists, such as the case of Colombia Travel App, or Colombia Map, which offers tourists information on the different cities of the country and the most tourist sites. These applications do not focus on the user experience, nor do they show the information stored and verified, as the National Tourism Registry (rnt) does, which gives tourists full certainty that the service provided is legal and has a conduit … regularly to guarantee the quality of the service.” [31].
Platform Based on Augmented Reality to Support Cultural Tourism
Within the framework of the literature review, the following authors stand out for the contributions their developments have made to augmented reality; they are summarized in Table 1.
4 Methodology
Currently, in the department of Cesar, there are no freely accessible mobile or web applications supported by augmented and/or virtual reality for cultural tourism scenarios. The increase in the use of smartphones in everyday life and the easy access to information, thanks to the internet [7], have changed how users live and explore the tourist and cultural content of different locations, not only at the local level but also nationally and internationally. It is here that the proposed project takes on greater relevance, by filling the gap left by the lack of local, freely accessible implementations for cultural tourism, hand in hand with the wide portfolio of places and services that the department of Cesar has and the wide coverage of mobile internet connectivity. These factors create a favorable scenario for exploiting the capabilities of mobile applications that allow experiences beyond what we can perceive tangibly, reaching intangible implementations that awaken new sensations, experiences, and emotions in each user of the applications to be developed.
To fulfill the general objective of the project, “Improve the user experience of cultural tourism in the department of Cesar using augmented reality technologies”, the following methodology was proposed to meet the specific objectives. In the first phase, the pertinent information collection techniques are established to characterize the profiles of users who carry out cultural tourism in the department of Cesar. Once these techniques have been identified, the instruments to be applied to the target population are designed. Once the instruments needed for the characterization have been defined, they are applied to the selected sample to collect the information for subsequent analysis.
After the information capture for the selected sample is finished, the results of the previously applied instruments are analyzed to determine the factors that affect a satisfactory user experience. In the conceptual design phase of the products to be developed, the use cases of the system are identified based on the characterization resulting from phase 1, the user experience, the software artifacts, and the audiovisual components. Subsequently, the deliverables of the design stage are produced, such as the system components, the class diagrams, the database structures, and the audiovisual components necessary for the development of the web application and the augmented reality mobile application. In the coding phase of the project, the deliverables of phase 2 are materialized as program code, considering access to both the web application and the mobile application. All the components required for the correct functioning of the different systems and applications are developed. At the beginning of this phase, the development tools, frameworks, graphics engines, multimedia software, and other software artifacts necessary for the correct development of the components obtained in phase 2 are selected, based on the analysis of those deliverables.
In the same way, different tests are carried out on each of the software and audiovisual components developed in this phase, and on the integration between them, to correct any errors that may occur. Within this testing phase, a closed beta version will be generated, and different actors in the value chain will be allowed to use the products and validate their usability. In parallel, a digital dissemination strategy will be implemented for the different actors in the project’s value chain, to build expectation about the impact of the products on the scheme and experience of cultural tourism in the department of Cesar. Then, each of the resulting software products (augmented reality mobile application and web application) is documented. In the same way, the web application is installed, assembled, and configured in the cloud, and the augmented reality mobile application is published in the iOS (App Store) and Android (Play Store) application stores, to give the target audience free access through an open beta version and to proceed with the validation of the products developed to improve the experience of users who carry out cultural tourism in the department of Cesar. During this stage, the development team will constantly supervise the operation of the applications (web and mobile), analyze the generated data, and produce tables and reports for analysis by the administration and process management, all to support correct decision-making during deployment so that the objective of this critical stage of the project, user acquisition and retention, can be met.
5 Application’s Feature
Fig. 2. Principal Screen of Fall in Love with Cesar Application
The “Enamorate del Cesar (fall in love with Cesar)” application is the first tourist application in the department of Cesar in Colombia. It was developed jointly by the Popular University of Cesar and the development companies Blazing Soft and Certika, with resources from the SGR (General System of Royalties). This application has
been developed in Unity, which allows it to run on multiple platforms so that different users can access it. It is built from interactive environments that place the user spatially at 50 tourist places across the 25 municipalities of the department of Cesar. Fig. 2 shows the initial loading page of the application.
Fig. 3. Visualization of the Map of the Department of Cesar by Municipalities
On entering the application, the user sees the map of the department of Cesar in Colombia with its 25 municipalities: Valledupar, Aguachica, Agustín Codazzi, Bosconia, Chimichagua, El Copey, San Alberto, Curumaní, El Paso, La Paz, Pueblo Bello, La Jagua de Ibirico, Chiriguaná, Astrea, San Martín, Pelaya, Pailitas, Gamarra, Manaure Balcón del Cesar, Río de Oro, Tamalameque, Becerril, San Diego, La Gloria, and González, as can be seen in Fig. 3. When a particular municipality is selected, the application shows the tourist and cultural sites it includes, which are linked to the in-situ augmented reality experience. The user can then select the place of their choice to carry out the cultural experience, see Fig. 4. The functionality of the application includes the concept of “time capsules”, which show how the place looked in the past and offer a projection of how it might look in the future; 360° videos can also be shown that present the different cultural and historical components of the selected place, see Fig. 5. To promote the social appropriation of knowledge, in addition to the cultural and historical information about the site of interest, the platform has components focused on community learning about the tourist site: virtual learning objects that deliver conceptual information through gamification strategies, encouraging local, national, and international tourists to learn about the place under intervention, see Fig. 6.
Fig. 4. Location of Places of Interest within the Map of Municipalities of the Department of Cesar
Fig. 5. Location of the Site of Interest
Fig. 6. Link to Videos, OVAS and Games Related to the Intervention Site
6 Discussions
With respect to the research question, the implementation of augmented reality has had an impact on the tourism sector in this department of the country, with the following advances standing out:
• Characterization of the tourist sites that best highlight the Cesarense culture in Colombia.
• Definition of material and immaterial cultural values that promote economic revitalization.
• Compilation of information from the different tourist sectors through descriptive and non-descriptive instruments to strengthen cultural identity.
• Development of OVAS that strengthen the social appropriation of knowledge of Caribbean culture by future generations.
• Development of an interactive application for the strengthening and use of augmented reality by the population of the department under intervention.
7 Conclusions
Colombia is a country that stands out for the megadiversity of its heritage. According to MinCIT (2018), the tourism authorities point out that some relevant attractions of the national tourism inventory need to improve their status, since a deterioration in their quality and meaning has been observed. It is important to refer to the contribution made by Michael Porter between 1993 and 1994, who indicated that tourism competitiveness in Colombia was beginning to be planned, that it had great potential, and that it was necessary to work in clusters, that is, in organizations of agglomerated companies with available infrastructure and resources. Based on this reference, it should be noted that the offer of tourist services requires investment, control, and monitoring. The provision of natural resources, safe access routes, public services, transport services, and physical and technological infrastructure plays a fundamental role when it is time for tourists
to identify a destination of interest. That is why the municipal and departmental secretaries of development, environment, and tourism must develop programs that promote social and cultural tourism in the regions. The software application made it possible to stimulate knowledge of places of interest in the department of Cesar and to support the post-pandemic economic revitalization of tourism, generating higher income and benefiting the community in general.
References
1. UNWTO, UNWTO World Tourism Barometer and Statistical Annex, December 2020, UNWTO World Tour. Barom. 18(7), 1–36 (2020). https://doi.org/10.18111/wtobarometereng.2020.18.1.7
2. Colombia + Competitiva, 4 departamentos beneficiados en la primera fase de Destinos + Competitivos + Sostenibles – Colombia más competitiva (2021). https://www.colombiamascompetitiva.com/4-departamentosbeneficiados-en-la-primera-fase-de-destinos-competitivos-sostenibles/. Accessed on 24 Jun 2021
3. Azuma, R.: A survey of augmented reality. Presence 6(4), 355–385 (1997)
4. Azuma, R., Baillot, Y., Behringer, R., Feiner, S., Julier, S., MacIntyre, B.: Recent advances in augmented reality. IEEE Comput. Graphics Appl. 21(6), 34–47 (2001). https://doi.org/10.1109/38.963459
5. Billinghurst, M.: Augmented reality in education (2002). Retrieved 22 July 2014 from http://www.it.civil.aau.dk/it/education/reports/ar_edu.pdf
6. García González, J.A.: El lenguaje visual y cartográfico en las enseñanzas humanísticas. Planos de Metro de Albacete. Cartografías utópicas. Ensayos: Revista de la Facultad de Educación de Albacete 28, 101–115 (2013)
7. González, C., Martín-Gutiérrez, J., Domínguez, M.: Improving spatial skills: an orienteering experience in real and virtual environments with first year engineering students. In: 2013 International Conference on Virtual and Augmented Reality in Education (2013). Retrieved from http://udv.ull.es/vare/data/vare2013_ID_73_FULL%20PAPER.pdf
8. Milgram, P., Kishino, F.: A taxonomy of mixed reality visual displays. IEICE Trans. Inf. Syst. 77(12), 1321–1329 (1994)
9. Milgram, P., Takemura, H., Utsumi, A., Kishino, F.: Augmented reality: a class of displays on the reality-virtuality continuum. In: Photonics Industr. Appl., pp. 282–292. International Society for Optics and Photonics (1995)
10. Ruiz Torres, D.: La realidad aumentada y su aplicación en el patrimonio cultural. Gijón: Trea (2013)
11. Johnson, J.: The Master Key: L. Frank Baum envisions augmented reality glasses in 1901 (2012). Retrieved 11 July 2014 from https://archive.today/4jTOk
12. Sutherland, I.E.: A head-mounted three dimensional display. In: Proceedings of the December 9–11, 1968, fall joint computer conference, part I, pp. 757–764. ACM (1968). Retrieved from http://dl.acm.org/citation.cfm?id=1476686
13. Sherman, W.R., Craig, A.B.: Understanding Virtual Reality: Interface, Application, and Design (Edition: Revised, Update.). Amsterdam; Boston: Morgan Kaufmann (2002)
14. Mann, S.: Eye Am a Camera: Surveillance and Sousveillance in the Glassage. Time (2012). Retrieved from http://techland.time.com/2012/11/02/eye-ama-camera-surveillanceand-sousveillance-in-the-glassage/
15. Lee, K.: Augmented reality in education and training. TechTrends 56(2), 13–21 (2012)
16. Rosenberg, L.B.: Virtual fixtures as tools to enhance operator performance in telepresence environments, vol. 2057, pp. 10–21 (1993). https://doi.org/10.1117/12.164901
17. Thomas, B., et al.: ARQuake: an outdoor/indoor augmented reality first person application. In: 4th International Symposium on Wearable Computers, Atlanta, GA, pp. 139–146 (2000). Retrieved from http://www.tinmith.net/papers/thomas-iswc-2000.pdf
18. Freund, J., Geiger, C., Grafe, M., Kleinjohann, B.: The augmented reality personal digital assistant. In: Proceedings of the Second International Symposium on Mixed Reality, pp. 85–94 (2001)
19. Cepal, El turismo de Centroamérica y la República Dominicana ante las tecnologías digitales: retos y oportunidades para las mipymes. www.cepal.org/apps
20. Departamento Administrativo de Información Estadística. Available at: https://www.dane.gov.co/
21. Guven, S., Feiner, S.: Authoring 3D hypermedia for wearable augmented and virtual reality. In: Seventh IEEE International Symposium on Wearable Computers, 2003 Proceedings, pp. 118–126 (2003). https://doi.org/10.1109/ISWC.2003.1241401
22. Möhring, M., Lessig, C., Bimber, O.: Video see-through AR on consumer cellphones. In: Proceedings of the 3rd IEEE/ACM International Symposium on Mixed and Augmented Reality, pp. 252–253. IEEE Computer Society (2004). Retrieved from http://dl.acm.org/citation.cfm?id=1033722
23. Jamali, S.S., Shiratuddin, M.F., Wong, K.W.: A review of augmented reality (AR) and mobile augmented reality (mAR) technology: learning in tertiary education. Int. J. Learn. High. Educ. 20(2), 37–54 (2014)
24. Imbert-Bouchard, D., Llonch Nayra, M., Osàcar Eugeni, C.: Turismo cultural y APPs. Un breve panorama de la situación actual. Her. Mus. Heritage Museography 13, 44–54 (2013)
25. Bekele, M.K., Pierdicca, R., Frontoni, E., Malinverni, E.S., Gain, J.: A survey of augmented, virtual, and mixed reality for cultural heritage. J. Comput. Cult. Herit. 11(2), 1–36 (2018). https://doi.org/10.1145/3145534
26. de los Á Orfila, M.: Un museo sin muros que escondió obras virtuales en Colonia, El País, 23 September 2020
27. Suache Merchán, J.C.: Candelaria en 360°: una iniciativa para la reactivación turística del sector, Alcaldía de Bogotá, 23 May 2021
28. VISYON, Augmented Reality & Holograms: Emerging Technologies.
29. Blazing Soft and Team Toon Studio, Museo MAPUKA - Universidad del Norte, Aplicación de realidad aumentada (2021). https://www.youtube.com/watch?v=Bfsbpdy4aYs. Accessed on 10 June 2021
30. UNWTO, The One Planet Sustainable Tourism Programme. https://www.unwto.org/es/sustainabledevelopment/one-planet. Accessed on 24 June 2021
31. Secretaría de Estado de Turismo del Gobierno de España and SEGITTUR, Qué es un DTI - Red de Destinos Turísticos Inteligentes. https://www.destinosinteligentes.es/que-es-un-dti/. Accessed on 24 June 2021
32. Mohd, N.S., Abdul Aziz, M., Ismail, H.N.: Iskandar.my: framework of mobile augmented reality travel app. In: Technology Application in Aviation, Tourism and Hospitality, pp. 113–127. Springer, Singapore (2023). https://doi.org/10.1007/978-981-19-6619-4_9
33. Pranoto, H., Saputra, P.P., Sadekh, M., Darmadi, H., Yanfi, Y.: Augmented reality navigation application to promote tourism to local state attraction “Lawang Sewu.” Procedia Comput. Sci. 216, 757–764 (2023)
34. Piñeres-Melo, M.A., Ariza-Colpas, P.P., Nieto-Bernal, W., Morales-Ortega, R.: SSwWS: structural model of information architecture. In: International Conference on Swarm Intelligence, pp. 400–410. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-26354-6_40
35. Ventoulis, E., Xinogalos, S.: AR the gods of olympus: design and pilot evaluation of an augmented reality educational game for Greek mythology. Multimodal Technol. Interact. 7(1), 2 (2023)
36. Ariza-Colpas, P.P., Piñeres-Melo, M.A., Nieto-Bernal, W., Morales-Ortega, R.: WSIA: web ontological search engine based on smart agents applied to scientific articles. In: International Conference on Swarm Intelligence, pp. 338–347. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-26354-6_34
37. Pye, V.C.: Shakespeare’s Globe “360”: transmedial performance, virtual tourism, and the reconstructed playhouse. In: Shakespeare and Tourism, pp. 251–265. Routledge (2023)
38. Dore, N., Cochetti, F., Catapano, I., Ludeno, G., Gennarelli, G., Corrado, M.E., Zampognaro, F.: The AMOR project: when technology meets cultural heritage. In: International Conference Florence Heri-Tech: the Future of Heritage Science and Technologies, pp. 257–265. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-17594-7_19
39. Paola, A.C., et al.: Strengthening the teaching of the narrative genre: story and fable in primary school children in the Department of Magdalena-Colombia. A commitment to the use of ICT Games and BayesianLogisticRegression. Procedia Comput. Sci. 191, 379–384 (2021)
40. Sovhyra, T.: AR-sculptures: issues of technological creation, their artistic significance and uniqueness. J. Urban Culture Res. 25, 40–50 (2025)
41. Mukherjee, A., Rajendran, S.D., Wahab, S.N.: Technology strategy in boosting halal tourism activities. In: Technology Application in Aviation, Tourism and Hospitality, pp. 41–56. Springer, Singapore (2023). https://doi.org/10.1007/978-981-19-6619-4_4
42. Paola, A.C., et al.: GlyphReader app: a support game for the application of the ortongillingham method with datamining techniques. Procedia Comput. Sci. 191, 373–378 (2021)
Assessment of Human Personality Traits Using Smartphone Sensing
Sehrish Rafique1, Muhammad Ehatisham-ul-Haq1(B), Kainat Ibrar2, Amanullah Yasin1, Fiza Murtaza1, and Muhammad Awais Azam3
1 Department of Creative Technologies, Air University, Islamabad, Pakistan
[email protected]
2 Department of Computer Science, University of Wah, Wah, Pakistan
3 Technology Innovation Research Group, School of Information Technology, Whitecliffe, Wellington, New Zealand
Abstract. Assessing human personality through gait patterns is one of the newest and most widespread research areas. Gait refers to an individual’s walking style, which represents a person’s identity and can reflect reliable information about mood, emotion, and the intrinsic personality traits under scrutiny. Based on gait data, this work focuses on deriving key personality traits of individuals along the Big Five model of personality. A new dataset of 22 participants is collected using smartphones and labeled according to the users’ Big-Five Personality Traits (BFPT) profiles. A Random Forest classifier is trained on the dataset and used for classification. Experimental results and performance evaluations of the classifier demonstrate the effectiveness of the proposed scheme for all Big Five personality traits.
Keywords: Gait Recognition · Machine Learning · Mobile Computing · Personality Assessment · Smart Sensing
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): SAI 2023, LNNS 711, pp. 613–622, 2023. https://doi.org/10.1007/978-3-031-37717-4_39
1 Introduction
Gait refers to a person’s walking pattern and is unique to each individual. Bipedal movement, commonly known as ‘walking’, is a complex task accomplished by systematic and well-established movements of the human limbs. Gait recognition analyzes human walking patterns and is known as “the methodical study of human locomotion” [1]. The way a person walks reveals different aspects of their personality [2], such as shyness, lack of confidence, or openness to experience. Similarly, presentations are also a great tool for assessing an individual’s personality. Specifically, judging a presentation given with an audience and one given without an audience reveals specific traits about a person’s consciousness and stage fear. Automated gait recognition is among the most popular and active research topics. It has numerous applications in biomedicine, soft biometric recognition, security and surveillance, rehabilitation, marketing of products, sports, etc. [3]. For the collection and recognition of gait patterns, there are some traditional techniques [4], including methods based on Video-Sensor (V-S) [5], Ambient-Sensor (A-S) or Floor-Sensor [6], and Wearable-Sensor (W-S) [7]. The most recent and
proliferating method is the collection and recognition of gait patterns using built-in smartphone sensors such as GPS, proximity, gyroscope, and accelerometer [8]. Users no longer need to attach sensors to multiple locations on the body [6]. Inferring personality traits from actions has been an important area of human-centric computing [11], and researchers have been working in this domain for over a decade. Behavioral observation techniques have been used effectively for personality trait attribution [12]. Body movement is a medium used by observers to make judgments about an individual, especially if face and body appearance are indistinguishable or the person is at a distance. Most personality assessment tasks are carried out using questionnaires and surveys. However, these methods are not fully reliable because of doubts about the veracity of the answers and cultural biases. In [13], the authors examined personality based on people’s body movements. They captured movement with a Kinect camera and conducted a pilot study to assess the personality traits described in the Big-Five Personality Model. The results show that the traits of Extraversion and Conscientiousness best fit the model. Moreover, several other patterns were identified that characterized the five personality traits. These results showed the feasibility of assessing a person’s personality through movements and opened the way for further research in this field. In [14], the authors processed wearable sensor data to analyze the motion characteristics of movements. Deep learning methods were used to differentiate among several types of movements. The wearable sensor dataset provided useful insights for human action and personality recognition. Personality assessment is an emerging idea, and it opens up exciting ways to gain insights into complicated behavioral aspects of personality [15].
Machine learning models have been used to predict a person’s Big Five personality traits from a broad range of data sources, including social networking platforms (Instagram and Facebook), digital footprints, and mobile sensing data. Researchers are beginning to identify other psychological constructs in digital data using unsupervised ML techniques [16]. The study in [17] analyzed the extent to which a person’s personality dimensions are predictable from six classes of behavior: 1) music consumption, 2) communication and social behavior, 3) mobility, 4) app usage, 5) day- and night-time activity, and 6) overall phone activity. The cross-validation results showed which Big Five personality aspects can be predicted and which particular patterns of behavior are indicative of which dimensions. The results of this work [17] demonstrated the possibility of acquiring information about a person’s traits from data collected through smartphones. The research in [9] investigated sleeping attributes among different people. Smartphone sensors were used to collect data from 597 participants over many weeks. The significant finding of this research was that the previous day’s schedule affects conscientiousness. Moreover, the authors found individual differences in the overall range of night-time rest. The research in [6] demonstrated that smartphone sensing offers new, efficient, and ecological approaches that can assist in studying human personality. Studies show that smartphone usage correlates with user personality traits [6]. The research in [18] was evaluated with data collected over six months from 739 Android smartphone users, along with a 50-item Big Five Personality Trait questionnaire. The analysis focused on category-level aggregated applications for predicting Big Five personality traits, achieving an average accuracy of 86–91%. The study concluded that user personality has a fundamental impact
on smartphone application usage and its application categories. Reflecting the potential of future personality-driven research, this study demonstrated the importance of application categories in achieving reasonable accuracy for general traits in personality assessment research. The proposed research aims to assess key personality traits in line with the big-five personality model based on the recognition of individuals’ gait patterns. The Big Five inventory is a measure of the widely used big-five traits: neuroticism, extraversion, openness to experience, agreeableness, and conscientiousness. It is the most scientifically validated and reliable personality model in the domain of individual differences research [9]. Researchers in [10] described the big-five model traits in detail. We maintained the Big-Five Personality Test (BFPT) profiles and then collected the gait data of all participants for three different gait patterns: regular walking, presentation with an audience, and presentation without an audience. To extract meaningful data for further processing, we pre-processed the data, extracted features, and then applied Random Forest (RF) classifiers. The results obtained after classification demonstrated the efficiency of the scheme. The main contributions of our work are as follows:
• We conducted the BFPT of participants using revisions suggested by the Big Five inventory [19]. The BFPT is a 50-item questionnaire, with a scale of 1–5, developed for the measurement of the big-five factors reported in the article [2].
• We collected our dataset of 22 participants, including 10 males and 12 females aged between 19 and 26 years, using an Android application called “Data Collector” on smartphones in multiple positions, i.e., one in a pocket and the other in the hand, in different contexts.
• For the first time, we proposed a method for inferring the key personality traits of individuals in line with the big-five model of personality through gait pattern recognition.
• We evaluated the proposed approach for each user by comparing the inferred personality traits to the personality scores obtained through BFPT. Performance evaluation showed promising results, highlighting the efficiency of the scheme.
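The BFPT labels mentioned in these contributions (a 50-item questionnaire answered on a 1–5 scale, later turned into a binary profile per trait as described in Sect. 2.1) can be illustrated with a minimal sketch. The paper does not give the item-to-trait key or the exact thresholding rule, so the sketch below makes two loudly hypothetical assumptions: item i belongs to trait i mod 5 with no reverse-keyed items, and a user is marked 1 (‘Positive’) when their trait score is at or above the cohort average for that trait. It is not the authors’ scoring procedure.

```python
from statistics import mean

TRAITS = ["N", "E", "O", "A", "C"]  # the five big-five factor markers


def trait_scores(answers):
    """Sum one participant's 1-5 answers per trait.

    Assumption (hypothetical key): 50 answers, item i belongs to
    TRAITS[i % 5], and no item is reverse-keyed.
    """
    if len(answers) != 50 or any(a not in (1, 2, 3, 4, 5) for a in answers):
        raise ValueError("expected 50 answers on a 1-5 scale")
    scores = {t: 0 for t in TRAITS}
    for i, a in enumerate(answers):
        scores[TRAITS[i % 5]] += a
    return scores


def binarize(all_scores):
    """Label each participant 1/0 per trait against the cohort average.

    Assumption: 'Positive' (1) means score >= average over all participants.
    """
    labels = [{} for _ in all_scores]
    for t in TRAITS:
        avg = mean(s[t] for s in all_scores)
        for s, lab in zip(all_scores, labels):
            lab[t] = 1 if s[t] >= avg else 0
    return labels


# Two made-up participants: one answering 5 everywhere, one answering 1.
profiles = binarize([trait_scores([5] * 50), trait_scores([1] * 50)])
```

With these two synthetic participants, every trait of the first is labeled 1 and every trait of the second is labeled 0; a real scoring key would also have to handle reverse-keyed items.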
2 Methodology
This section contains a detailed overview of the proposed system methodology and explains the techniques adopted for inferring personality traits based on gait pattern recognition using smartphone sensors. The proposed method consists of four distinct steps, i.e., data acquisition, data pre-processing, feature extraction, and classification (personality traits recognition based on gait pattern), as shown in Fig. 1. Each phase of the proposed scheme is described in the subsequent sections.
2.1 Building Users’ BFPT Profiles
Before data recording, each participant completed the BFPT using revisions suggested by the Big-Five Inventory [20, 21]. The BFPT questionnaire consists of 50 questions and measures the big-five factor markers, which include Neuroticism (N), Extraversion (E), Openness to experience (O), Agreeableness (A), and Conscientiousness (C). We
Fig. 1. Proposed Methodology Diagram for Personality Assessment
have marked each question on a scale of 1–5, ranging from strongly disagree to strongly agree. To maintain users’ BFPT profiles, we used the method given in [18, 22] to calculate a score for each trait for every user. After scoring the five traits separately for all participants, we calculated the average value of each trait. Based on these average scores, we marked every user as 1 (‘Positive’) or 0 (‘Negative’) for every trait. These BFPT profiles were used later to train the machine learning classifiers.
2.2 Dataset Details
To validate the proposed scheme, we collected our own dataset for personality trait recognition using an Android application, “Data Collector” [23]. We used two smartphones, one held in the hand and the other in the pocket. Data were collected from 22 subjects (10 males and 12 females) aged between 19 and 26 years and weighing between 39 and 70 kg. The dataset contains time-series data generated by the GPS, accelerometer, magnetometer, and gyroscope sensors from two different perspectives, covering altitude, gravity, user acceleration, and rotation rate, collected at a 50 Hz sampling rate. We selected three gait patterns: normal walk, presentation with an audience, and presentation without an audience. Each gait pattern was recorded for 3–5 min and split into segments of 20 s, giving enough examples for our analysis; each gait data file contained approximately 7500 instances.
2.3 Pre-processing and Feature Extraction
After data acquisition, we applied an average smoothing filter along all three dimensions. After denoising the signal, we segmented the gait data into chunks of 20 s. After pre-processing, we extracted 20 time-domain features selected from the literature. The size of the feature matrix was N × 20, where N is the total number of users (22), so the feature matrix size was 22 × 20. The features include Maximum Amplitude, Minimum Amplitude, Maximum Latency, Minimum Latency, Mean, Variance, Kurtosis,
Assessment of Human Personality Traits Using Smartphone Sensing
Skew, Latency to Amp. Ratio, Absolute Amp., Abs. Latency to Amp. Ratio, Peak-to-Peak Value, Mean of the Absolute Value of the 1st Difference, Mean of the Absolute Value of the 2nd Difference, Mean Cumulative Sum, Inter Stride, Movement Angle, Range, Autoregression Coefficient, and Correlation Coefficient. After feature extraction, the next step is to pick an appropriate classification algorithm for training on the dataset.
2.4 Classification and Personality Traits Prediction
The RF classifier is used to validate the proposed scheme on the self-collected dataset, recognizing the big-five personality traits of users based on their gait. The RF classifier was chosen mainly for its strong performance in existing work [12, 24, 25].
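The pre-processing steps of Sect. 2.3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the smoothing window width, the use of population variance, and all function names are assumptions made for illustration, and only a subset of the 20 listed features is computed. The one quantity taken directly from the text is the chunk length, since 20 s at 50 Hz gives 1000 samples per chunk.

```python
from statistics import fmean, pvariance

SAMPLE_RATE_HZ = 50    # sampling rate stated in Sect. 2.2
SEGMENT_SECONDS = 20   # chunk length stated in Sect. 2.3
SMOOTHING_WINDOW = 5   # assumption: the filter width is not given in the text


def moving_average(signal, window=SMOOTHING_WINDOW):
    """Smooth one sensor axis with a centred moving average.

    Near the edges the window is truncated, so the output has the
    same length as the input.
    """
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        smoothed.append(fmean(signal[lo:hi]))
    return smoothed


def segment(signal, rate_hz=SAMPLE_RATE_HZ, seconds=SEGMENT_SECONDS):
    """Split a signal into non-overlapping chunks; a trailing partial chunk is dropped."""
    size = rate_hz * seconds
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]


def mean_abs_diff(values):
    """Mean of the absolute differences between consecutive samples."""
    return fmean(abs(b - a) for a, b in zip(values, values[1:]))


def time_domain_features(chunk):
    """Compute a subset of the listed time-domain features for one chunk."""
    first_diff = [b - a for a, b in zip(chunk, chunk[1:])]
    return {
        "max_amplitude": max(chunk),
        "min_amplitude": min(chunk),
        "mean": fmean(chunk),
        "variance": pvariance(chunk),  # population variance (an assumption)
        "peak_to_peak": max(chunk) - min(chunk),
        "mean_abs_first_diff": mean_abs_diff(chunk),
        "mean_abs_second_diff": mean_abs_diff(first_diff),
    }


# Usage on a synthetic 60 s recording (3000 samples at 50 Hz).
recording = [float(i % 10) for i in range(3000)]
chunks = segment(moving_average(recording))
feature_rows = [time_domain_features(c) for c in chunks]
print(len(chunks), len(chunks[0]))  # 3 chunks of 1000 samples each
```

Each dictionary in `feature_rows` corresponds to one row of a feature matrix; in practice the same functions would be applied to every axis of every sensor stream.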
3 Experimental Results
Table 1. Accuracy Measure (%) of the Big-Five Personality Traits Recognition using RF Classifier

Attribute                 Walk     Presentation with audience   Presentation without audience
Activities using Smartphone-1 (Hand Position)
Agreeableness             94.13    91.98                        91.22
Extraversion              92.83    88.54                        89.31
Conscientiousness         93.15    87.02                        90.07
Neuroticism               91.85    89.69                        90.45
Openness to Experience    95.43    93.51                        89.31
Activities using Smartphone-2 (Pocket Position)
Agreeableness             92.83    91.54                        89.80
Extraversion              94.13    92.27                        93.93
Conscientiousness         94.13    90.07                        92.01
Neuroticism               97.06    91.54                        89.35
Openness to Experience    93.15    91.91                        92.28
We evaluated the proposed methodology with a 5-fold cross-validation scheme. Hence, all the instances of gait patterns in the dataset are randomly split into five equal parts. Our proposed scheme achieved promising results with a data chunk of 20 s. An overall comparison of the accuracy obtained from the hand and pocket positions shows that the results for the pocket position are more encouraging than those for the hand position. Tables 1, 2, and 3 present the detailed performance evaluation of the recognition of all big-five personality traits against all selected gait patterns for the data segment of 20 s. Accuracy, F1-score, and precision are used for the performance
S. Rafique et al.
evaluation of the proposed scheme. All selected gait patterns were found effective, but the most desirable performance was achieved with the walk and presentation patterns. A detailed analysis of every attribute is given below.

Table 2. F1 Measure of the Big-Five Personality Traits Recognition using RF Classifier

Attribute                 Walk     Presentation with audience   Presentation without audience
Activities using Smartphone-1 (Hand Position)
Agreeableness             0.94     0.92                         0.92
Extraversion              0.93     0.89                         0.89
Conscientiousness         0.91     0.88                         0.91
Neuroticism               0.92     0.89                         0.90
Openness to Experience    0.95     0.94                         0.89
Activities using Smartphone-2 (Pocket Position)
Agreeableness             0.93     0.92                         0.90
Extraversion              0.94     0.92                         0.94
Conscientiousness         0.94     0.90                         0.92
Neuroticism               0.97     0.92                         0.88
Openness to Experience    0.93     0.92                         0.92
Table 3. Precision Measure of the Big-Five Personality Traits Recognition using RF Classifier

Attribute                 Walk     Presentation with audience   Presentation without audience
Activities using Smartphone-1 (Hand Position)
Agreeableness             0.94     0.92                         0.92
Extraversion              0.93     0.89                         0.89
Conscientiousness         0.93     0.89                         0.90
Neuroticism               0.92     0.89                         0.90
Openness to Experience    0.96     0.94                         0.89
Activities using Smartphone-2 (Pocket Position)
Agreeableness             0.93     0.92                         0.90
Extraversion              0.94     0.92                         0.94
Conscientiousness         0.94     0.90                         0.92
Neuroticism               0.97     0.92                         0.89
Openness to Experience    0.93     0.92                         0.92
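The evaluation protocol behind Tables 1, 2, and 3 can be illustrated with a minimal Python sketch of a random five-fold split and of the three reported metrics for a binary trait label (1 = ‘Positive’, 0 = ‘Negative’). The function names and the fixed random seed are assumptions for illustration. The text does not say whether precision and F1 were computed for the positive class or averaged over both classes; the sketch assumes the positive class. Note that the tables report accuracy as a percentage, while the sketch returns a fraction between 0 and 1.

```python
import random


def five_fold_indices(n_instances, k=5, seed=0):
    """Shuffle instance indices and deal them into k near-equal folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)  # seed is an assumption, for repeatability
    return [indices[i::k] for i in range(k)]


def binary_metrics(y_true, y_pred):
    """Accuracy, precision and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "f1": f1}


# Usage: each fold serves once as the test set, the other four as training data.
folds = five_fold_indices(23)
print([len(f) for f in folds])  # [5, 5, 5, 4, 4]

scores = binary_metrics([1, 1, 1, 0, 0, 0, 1, 0],
                        [1, 1, 0, 0, 0, 1, 1, 0])
print(scores)
```

In a full evaluation, the metrics would be computed on each held-out fold and then averaged across the five folds.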
3.1 Agreeableness
This trait accounts for a person’s pleasant, warm, and friendly nature. High scores for this trait indicate that a person has an optimistic nature and gets along well with others. Results show that the accuracy obtained using RF for regular walking is 94.13%, which is the highest among all four classifiers. For precision, the best result for agreeableness is achieved for the walking activity with the smartphone held in the hand. Presentation with an audience and presentation without an audience both achieved a precision of 0.92 with the smartphone in the hand position. With the smartphone in the pocket position, presentation with an audience and walking achieved a precision of 0.92 and 0.93, respectively. For the scenarios of presentation with and without an audience, K-NN performed as the best classifier, with accuracies of 93.12% and 92.74%, respectively. The walking activity gave the most promising results for agreeableness recognition on all metrics used for performance analysis. Analysis of the F1-score of this trait against all three physical activities shows that, using smartphone-1, the scenarios of walking, presentation with an audience, and presentation without an audience provide improved F1-scores compared to SVM and KNN.
3.2 Extraversion
Extraversion highlights the energetic, frank, assertive, and confident nature of a person. A high score for extraversion shows that a person is adventurous, talkative, and physically active. For extraversion, the accuracy achieved for presentation with an audience is 88.54%, which is the highest accuracy acquired using the RF classifier. Normal walking, presentation with an audience, and presentation without an audience all achieved good results in terms of the F1-score. The highest F1-scores obtained for walking and for presentation with and without an audience are 0.956, 0.929, and 0.923, respectively. For extraversion, the highest precision achieved is 0.94, for walking and for presentation without an audience with the smartphone in the pocket position.
3.3 Conscientiousness
Conscientiousness accounts for the organized, careful, hardworking, and diligent nature of a person. Individuals with high scores in this trait tend to be highly efficient, organized, and balanced in their lives. Results showed that the accuracy obtained using RF is the highest among all four classifiers for presentation without an audience, at 90.07%. Additionally, the results obtained for all remaining metrics using RF are also very efficient for both of these activities. The most effective activities for conscientiousness recognition are walking and presentation with and without an audience, which provide the most dominant results overall. For these activities, we obtained very promising F1-scores in every data segment. Walking gives the best precision for both smartphone positions.
3.4 Neuroticism
Neuroticism is considered one of the higher-order Big Five personality traits in the field of psychology. Individuals with high neuroticism scores are more likely than individuals with average scores to experience mood swings, anger, frustration, worry, anxiety, guilt, envy, and jealousy. For the presentation scenarios, K-NN provided an accuracy of 90.45% for presentation with an audience and 94.27% for presentation without an audience. The highest F1-scores for walking and presentation with an audience are 0.97 and 0.93, respectively, for the 20 s segment. The most remarkable precision among all activities and both smartphone positions was for walking, which achieved 0.97 for neuroticism.
3.5 Openness to Experience
Openness to experience is one of the Big-Five personality traits used to describe human personality. It highlights active imagination, aesthetic sensitivity, variety and novelty in thinking, intellectual curiosity, and alertness to one’s feelings. High scores for this trait indicate the adventurous, bold, and challenging nature of a person. The highest accuracy acquired for walking using RF is 95.43%. Furthermore, using the RF classifier, the highest accuracy achieved is 93.51% for presentation with an audience and 92.36% for presentation without an audience. The results are highly encouraging for all activities, but the most dominant scores are attained from presentation without an audience, walking, and presentation with an audience. For Openness to experience, RF performed best among all four multi-class machine learning classifiers for all activities. For presentation with an audience, the accuracy obtained was 91.91%. Comparing the results of both smartphones, the most dominant and noteworthy results are obtained for this attribute. The best precision for this attribute was 0.96, for walking with the smartphone held in the hand. Presentation with an audience achieved a precision of 0.94 and 0.92 for the hand and pocket positions, respectively.
4 Conclusion and Future Work
This research focused on inferring key personality traits (neuroticism, extraversion, openness to experience, agreeableness, and conscientiousness), in line with the big-five model of personality, from the gait patterns of individuals recorded with smartphone sensors. A novel dataset of 22 participants was collected using smartphones for experimentation. Corresponding to the users’ Big-Five Personality Traits (BFPT) profiles, four multi-class machine learning algorithms were applied for training and classification. The results obtained using RF were the most desirable among all. The performance analysis showed the efficiency and effectiveness of the proposed scheme. It has many real-life applications in the fields of biomedicine, security and surveillance systems, soft biometric identification, rehabilitation, marketing of products, criminal investigation, and sports. However, work in this domain is rare, and more effort is required for dataset collection and for drawing insights about human personality assessment. The dataset and analysis used in this research could be extended to achieve higher accuracy and precision. Performing the analysis with other data chunk lengths might be useful in this regard.
References
1. Duan, L.T., Lawo, M., Wang, Z.G., Wang, H.Y.: Human lower limb motion capture and recognition based on smartphones. Sensors (Basel) 22(14), 5273 (2022). https://doi.org/10.3390/s22145273
2. Wood, J.K., Anglim, J., Horwood, S.: A less evaluative measure of big five personality: comparison of structure and criterion validity. Eur. J. Pers. 36(5), 809–824 (2021). https://doi.org/10.1177/08902070211012920
3. Renggli, D., et al.: Wearable inertial measurement units for assessing gait in real-world environments. Front. Physiol. 11(February), 90 (2020). https://doi.org/10.3389/fphys.2020.00090
4. Minh Dang, L., Min, K., Wang, H., Jalil Piran, M., Hee Lee, C., Moon, H.: Sensor-based and vision-based human activity recognition: a comprehensive survey. Pattern Recognit. 108, 107561 (2020). https://doi.org/10.1016/j.patcog.2020.107561
5. Pandey, M., Mishra, G.: Types of Sensor and Their Applications Advantages and Disadvantages, vol. 814, Springer Singapore (2019). https://doi.org/10.1007/978-981-13-1501-5_69
6. Balabka, D., Shkliarenko, D.: Human activity recognition with AutoML using smartphone radio data. In: UbiComp/ISWC 2021 - Adjunct Proceedings 2021 ACM International Joint Conference Pervasive Ubiquitous Computing Proceedings 2021 ACM International Symposium Wearable Computers, pp. 346–352 (2021). https://doi.org/10.1145/3460418.3479377
7. Hasan, M.A.M., Al Abir, F., Al Siam, M., Shin, J.: Gait recognition with wearable sensors using modified residual block-based lightweight CNN. IEEE Access 10, 42577–42588 (2022). https://doi.org/10.1109/ACCESS.2022.3168019
8. Ibrar, K., Azam, M.A., Ehatisham-Ul-Haq, M.: Personal attributes identification based on gait recognition using smart phone sensors. In: ACM International Conference Proceeding Series, no. May, pp. 94–97 (2020). https://doi.org/10.1145/3384613.3384642
9. Schoedel, R., et al.: To challenge the morning lark and the night owl: using smartphone sensing data to investigate day–night behaviour patterns. Eur. J. Pers. 752(May), 733–752 (2020). https://doi.org/10.1002/per.2258
10. Barańczuk, U.: The five-factor model of personality and generalized self efficacy: a meta-analysis. J. Individ. Differ. 42, 183–193 (2021). https://doi.org/10.1027/1614-0001/a000345
11. Abdelbaky, A., Aly, S.: Two-stream spatiotemporal feature fusion for human action recognition. Vis. Comput. 37(7), 1821–1835 (2020). https://doi.org/10.1007/s00371-020-01940-3
12. Utami, N.A., Maharani, W., Atastina, I.: Personality classification of facebook users according to big five personality using svm (support vector machine) method. Procedia Comput. Sci. 179, 177–184 (2021). https://doi.org/10.1016/j.procs.2020.12.023
13. Delgado-Gómez, D., Masó-Besga, A.E., Aguado, D., Rubio, V.J., Sujar, A., Bayona, S.: Automatic personality assessment through movement analysis. Sensors 22(10), 3949 (2022). https://doi.org/10.3390/s22103949
14. Gil-Martín, M., San-Segundo, R., Fernández-Martínez, F., de Córdoba, R.: Human activity recognition adapted to the type of movement. Comput. Electr. Eng. 88, 106822 (2020). https://doi.org/10.1016/j.compeleceng.2020.106822
15. Phan, L.V., Rauthmann, J.F.: Personality computing: new frontiers in personality assessment. Soc. Personal. Psychol. Compass 15(7), 12624 (2021). https://doi.org/10.1111/spc3.12624
16. Stachl, C., et al.: Personality research and assessment in the era of machine learning. Eur. J. Pers. 34(5), 613–631 (2020). https://doi.org/10.1002/per.2257
17. Stachl, C., et al.: Predicting personality from patterns of behavior collected with smartphones. Proc. Natl. Acad. Sci. 117(30), 17680–17687 (2020). https://doi.org/10.1073/pnas.1920484117
18. “Administering IPIP Measures, with a 50-item Sample Questionnaire.”
19. Athar, M.E., Ebrahimi, A.: Psychometric properties and factor structure of the personality inventory for DSM-5 – brief form (PID-5-BF) in Iranian student and clinical samples, pp. 1–10 (2021)
20. York, N., Guilford, N.Y., John, O.P., Naumann, L.P., Soto, C.J., John, O.P.: Paradigm shift to the integrative big five taxonomy. Handb. Personal. Theory Res. 3(2), 114–158 (2008)
21. Costa, P.T., McCrae, R.R.: The revised NEO personality inventory (NEO-PI-R). In: The SAGE Handbook of Personality Theory and Assessment: Volume 2 - Personality Measurement and Testing, pp. 179–198. SAGE Publications Inc. (2008). https://doi.org/10.4135/9781849200479.n9
22. Arias, J.T., Higuita, J.C., Castrillón, O.D.: The big five personality test. Cuad. Administración 23(41), 81–105 (2010). https://doi.org/10.1017/S1477175612000073
23. Shoaib, M., Scholten, H., Havinga, P.J.M.: Towards physical activity recognition using smartphone sensors. In: 10th International Conference on Ubiquitous Intelligence and Computing, UIC 2013 and IEEE 10th International Conference on Autonomic and Trusted Computing, ATC 2013, pp. 80–87 (2013). https://doi.org/10.1109/UIC-ATC.2013.43
24. Salsabila, G.D., Setiawan, E.B.: Semantic approach for big five personality prediction on twitter. J. RESTI (Rekayasa Sist dan Teknol Informasi) 5(4), 680–687 (2021). https://doi.org/10.29207/resti.v5i4.3197
25. Valanarasu, R.: Comparative analysis for personality prediction by digital footprints in social media. J. Inf. Technol. Digit. World 3(2), 77–91 (2021). https://doi.org/10.36548/jitdw.2021.2.002
An Approach to Mobile App Design and Development Combining Design Thinking, User Experience, and Iterative-Incremental Development
Iris Iddaly Méndez-Gurrola1, Ramón Iván Barraza-Castillo1(B), Abdiel Ramírez Reyes1, and Alejandro Israel Barranco-Gutiérrez2
1 Universidad Autónoma de Ciudad Juárez, Ciudad Juárez, México
{iris.mendez,ramon.barraza,abdiel.ramirez}@uacj.mx
2 Cátedras CONACyT-TecNM Celaya, Celaya, México
[email protected]
Abstract. Mobile applications have grown at an accelerated pace, closely following the technological development of mobile devices; however, such applications sometimes fail to meet customer expectations. Agile methodologies have been used to face the complexity of the development process and to cope with the changing environment and the fast delivery that the market demands. According to the literature, one recurring problem is that developers do not actually consider the real needs of users; therefore, efforts have been made to apply hybrid development models. This paper presents a methodology proposal that integrates design thinking, user experience, and iterative-incremental software development with the aim of developing competitive products that offer an adequate user experience, with the user as the main axis. The methodology involves 7 phases, which are described in detail: empathize, define, analyze and ideate, design, prototype, evaluate and refine. In addition, the article presents the results of the development of two mobile applications: the first addresses the stray dog problem in a city, and the second focuses on improving the communicative functionality of customers in a cafeteria through the use of augmented reality. Both applications were verified through usability tests and evaluated against their initial requirements. The results of this research can help developers consider a software creation alternative that improves the proposed solutions and is more user oriented.
Keywords: Mobile Applications · Design Thinking · User Experience
1 Introduction
Project development has been approached from different angles depending on the discipline; for example, projects are proposed under the research methodology in health sciences and in the chemical industry, among others. In the field of teaching, there are methodologies such as Project-Based Learning (PBL), Flipped Classroom,
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): SAI 2023, LNNS 711, pp. 623–638, 2023. https://doi.org/10.1007/978-3-031-37717-4_40
I. I. Méndez-Gurrola et al.
Cooperative Learning, Gamification, among many others. On the other hand, the design discipline offers problem-solving methodologies such as Design Thinking (DT) [1], which is concerned with generating ideas, solving problems creatively, and expanding the number of innovative solutions. There is also the design method, a set of procedures used during a work process to solve a design problem, consisting of a series of necessary operations arranged in a logical order dictated by experience. When there is a work team and all the resources must be managed, the need for project management methodologies arises. These are systems of principles, methods, techniques, and procedures generally used by people in specific disciplines; their objective is to produce deliverables and establish workflows. The main difference between these methodologies is the way in which they are structured, and some are not strictly called methodologies but models, life cycles, or frameworks, among others. In addition, a project can be approached from different perspectives, such as Agile, the Waterfall Model, Scrum, Kanban, Scrumban, Six Sigma, the Critical Path Method (CPM), and Lean. It is worth mentioning that the Project Management Institute produced a guide to the fundamentals of project management called the PMBOK [2]. Software engineering also offers a variety of methodologies, which are generally divided into two large groups: first, traditional methodologies such as waterfall, prototyping, spiral, incremental, and rapid design of applications; second, agile methodologies such as Kanban, Scrum, Lean, and Extreme Programming, among others. Mobile application development is among the fields that software engineering addresses.
Cuello and Vittone [3] classify them into three groups: native applications, web applications, and hybrid applications, each with specific characteristics. Mobile applications have grown at an exponential rate, almost on par with the technological development of cell phones, tablets, and other portable devices. As already mentioned, various software development methodologies exist for creating such applications: the traditional ones and the so-called agile methodologies. The former use rigid plans, negotiate under contract, and value processes, while the latter offer flexibility in the face of change, collaborate with the client, and value people. In many of them, however, although the client or end user is contemplated, this is not done systematically, since the user is not considered a primary factor in development. There is a lot of research on the agile development of mobile applications and on how these methodologies adapt to the changing and at times confusing environment of new artifacts and new problems. However, beyond the development of a specific solution, there is a problem that precedes it: on many occasions the solution implemented for a particular problem does not contemplate the user, or does so only to a very small extent, and this is perhaps why the final product does not achieve the expected success. Some companies and businesses create services and products that they then try to introduce to a target group of customers through sales strategies and emotive advertising; in this traditional approach, however, customers are seen as categories of buyers and not as people.
An Approach to Mobile App Design and Development
For this reason, some approaches have emerged that change the focus of attention by placing the user as the central actor in solving the problem. Methods have been created that consider this style from different angles. One is User Experience (UX), an approach that focuses not on the solution to the problem itself but on the design that enables a user experience through an artifact, service, or process; another is the Design Thinking (DT) method, which encourages innovation. With them, developers can approach people in a more empathetic way, since this implies understanding their real needs and being genuinely interested in the details. In the development of mobile applications, regardless of the type of application, development is generally carried out using the so-called agile methodologies (for example, Kanban and Scrum) because of their benefits in reduced development time and increased flexibility. However, a comprehensive and systematic review of the literature carried out in [4] highlighted that a limitation mentioned in the body of work related to Scrum is a lack of attention to design. As mentioned above, such limitations affect companies when they release flawed products to the public. From the above, it can be observed that the ways in which a project is approached and developed (whatever it may be) are diverse, and that the digital and technological field requires specific skills and knowledge involving both the design and the technological aspects of product development. A great variety of scientific, technological, and social problems can be aided through the development of mobile applications, and there are well-defined methods for the design of artifacts that support such problems, among them DT and UX.
Beyond the well-known agile development techniques for the construction of prototypes, these artifact, service, or process design methods have contributed to efficient and flexible implementation, since they place the user as the main axis. This is where the proposal of this project arises: the development of an alternative approach that combines the UX and DT methods and allows the creation of mobile applications with a greater focus on the user, so that the technology adapts to people and their needs and not the other way around. The objective of this study is to integrate UX and DT design techniques, analyzing their definitions, phases, and objectives, so that they can be combined with iterative-incremental software development, and at the same time to propose an alternative approach to the development of mobile applications that are user-centered and strengthen innovative design. This article presents a proposed methodology that can be used to design and develop mobile applications in any field. Section 2 covers the literature review of related works on UX, DT, and agile methodologies and their combinations with other existing methodologies. Section 3 addresses the proposed methodology, an alternative approach that aims to adopt a systemic perspective and consists of 7 specifically described phases. Section 4 presents the synthesis of two case studies in which the proposed methodology was applied, as well as the resulting final products. Finally, conclusions are presented and some guidelines for future work are provided.
2 Related Work
This first stage of the research consisted of searching for and identifying the main works related to the field of study. According to Snyder [5], the process for conducting a literature review is divided into four phases: 1) designing the review, 2) conducting the review, 3) analysis, and 4) writing the review. This methodology was used for this review, and each phase is described below.
Phase I Design of the Review. For this research, the semi-systematic review was selected because the objective is to obtain an overview of a topic, in this case: how do the DT and UX methods complement each other in software development? Once the research question has been identified and a general review approach has been considered, it is necessary to develop a search strategy to identify the literature; this includes selecting appropriate search terms and databases and deciding on the inclusion and exclusion criteria [5]. The search was carried out in IEEE Xplore, Google Scholar, Scielo, DOAJ, and SCOPUS. The search strategy used the terms “Design thinking”, “User experience”, and “Software development” in both Spanish and English, combinations of these terms, and the phrases “design thinking agile development”, “design thinking in engineering”, “user centered agile methodologies”, and “user centered software development”. The inclusion criteria were that the articles be related to the proposed themes; they could include a review of case studies. The exclusion criterion was that the review and/or the proposal was aimed at only one of the topics.
Phase II Carrying Out the Review. In this phase, the actual review is carried out. Snyder [5] points out that it is preferable to use two reviewers to select articles, in order to guarantee the quality and reliability of the search protocol.
The present study was carried out by two reviewers, the 5 electronic databases were used, and 30 potentially relevant articles were found, once duplicate elements were eliminated and a preliminary analysis was carried out, based on the title, abstract and exclusion criteria, to select the documents that could be eligible for this study in this selection a total of 18 papers remained. Phase III Analysis. Here the works were analyzed, and the data was extracted in the form of descriptive information, such as authors, years of publication, subject or study type or topics covered in the work. Phase IV Write the Review. In this phase it is important to point out the different contributions that can be valuable in the field of study. Below is a general description of the findings of each study found, grouped into different areas. In the educational field, some works were found that involve Design Thinking such as the study by Chen and Huang [6], where they use design thinking in a curricular framework for creating K-12 games with App Inventor. The work of Costa, Silva and Conte [7] is an empirical study with 17 postgraduate students in the context of mobile application design. The article by Palacin et al. [8] describes the design of the Casptone course, at Lappeenranta University of Technology, the course designed is a “human-centric SE cornerstone”, infusing design thinking methods and agile practices into knowledge of
An Approach to Mobile App Design and Development
627
the project life cycle. In addition, there are works such as that of Alahmari and Anandhavalli [9] that show the application of the Design Thinking approach in the development of an information system. In the work of Apat [10] the problem of the ITBA is posed, in which the university decided to generate a series of initiatives with the aim of being able to provide answers to the needs of the students, among the initiatives was to build a mobile university. For this, an app was made with a student-centered approach, the design process based on Design Thinking was used and Scrum was used to build the final product. Pereira and Russo [11], carried out a systematic literature review in which agile methodologies for software development and Design Thinking are combined, a work in the same sense is that of Corral and Fronza [12] however, the latter is not a literature review, the authors in that work study the use of Design Thinking as a methodological approach for the instruction of Software Engineering at undergraduate level, in courses that have the particular aim of creating innovative software products from scratch. We describe the similarities and differences between Design Thinking and Software Development Processes, taking as instance Agile Practices. The work of Schön et al. [13] that the aim of this study is to capture the current state of the art of the literature related to Agile RE with focus on stakeholder and user involvement. In particular, the authors investigate what approaches exist to involve stakeholder in the process, which methodologies are commonly used to present the user perspective and how requirements management is been carried out. Now, in regards of User Experience the work of Ferris [14] is a book that gathers the decade-long results of presenting the User-Centered Design (UCD) methodology for hundreds of companies. The work of Stanziola et al. 
[15] also encompasses usercentered design as a design methodology and philosophy that considers the needs, goals, and success of the end user. In [16] the authors intend to show how Design Thinking can be applied to the generation of UX user experiences. Each of the phases involved in the generation of the user experience is also explained: Research, Organization, Prototyping, Testing and Design. In the healthcare field, studies such as that involving user-centered design was found to be included in the work of Gray et al. [17] in the creation of the electronic PatientReported Outcomes (ePRO) tool, there is also the work of Nystrom et al. [18] for the design of a patient-oriented laboratory test result interface prototype. In their work, Berlanga et al. [19] describe the different stages of the process and development of mobile applications, particularly in the field of health (eHealth), as well as all the aspects that must be considered all the way to distribution and updating. Gasca et al. [20] propose a working method for the development of mobile applications. The method is based on the conceptualization of agile technologies and methodologies for software development. The method is developed in five stages: analysis, design, development, functional testing and delivery. In addition, the article presents the results of the development of an mHealth service for Android and J2ME using the proposed method. The following works involve the combination of various methodologies. Dobrigkeit et al. [21] developed InnoDev, which is an approach that combines Design Thinking, Lean Startup and Scrum to create an agile software development process
that can deliver the innovative, customer-oriented products and services required by competitive companies. The study describes InnoDev in detail by depicting all its phases. The objective of the work of Zorzetti et al. [22] is to characterize a development approach that combines Agile Software Development, UCD, and Lean Startup, showing how the three approaches can be intertwined in a single development process and how they affect development. Perhaps the article that comes closest to the approach proposed here is that of Fauquex et al. [23], which presents a methodology based on Design Thinking and user-centered design methods to create what they call “people-aware” IoT applications; however, that methodology targets Internet of Things applications. From the previous works it can be concluded that no general methodologies have been proposed, although there are methodologies that can be applied in several specific domains. It is also relevant to note that some methodologies complement each other, but some of them have forgotten the user as the main axis, for whom the solutions are developed.
3 Methodology
Given the conditions mentioned in the previous section, and considering that design-based methodologies such as Design Thinking (DT) [1] and User Experience (UX) [24] focus on satisfying user needs, a methodology was developed that combines DT, UX and the bases of iterative-incremental software development (IID) [25] for mobile application development. This methodological proposal adopts a systemic perspective that involves both design approaches and widely tested software development processes. The methodology was called UDI, a contraction of UX-DT-IID. It consists of seven phases: the first three are performed only once, whereas the rest are repeated one after the other, as in iterative-incremental development, to achieve a functional product at each iteration of the process. The developed methodology is shown in Fig. 1. Each phase of the UDI methodology is described below.
Phase 1 Empathize: This phase involves acquiring basic knowledge about the users and about the overall situation or problem. The target audience is investigated (needs, motivations, characteristics, habits, mental model, activities). Recognized techniques for empathizing include interviews, focus groups, actor maps and empathy maps. The empathy map is a way to define a rough user personality and to characterize the target users in order to make effective design decisions. Attention must be paid to the needs, objectives, expectations, behavior, habits, etc., of the users. Nielsen Norman Group [26] defines the empathy map as a tool used to articulate what we know about a particular type of user, divided into four quadrants: Says, Thinks, Feels and Does.
Phase 2 Define: It is necessary to understand the strategic dimension of the challenge to be faced. The knowledge generated around the situation is synthesized to produce new and interesting perspectives.
It is necessary to identify where the problems are and to understand them, which allows structuring the opportunities that will define the innovative solutions. This phase seeks to clarify and specify the problem to be addressed; the definition of the problem is essential. In addition, a typical user is created for whom the solution or product is being designed.
Fig. 1. The Seven Phases of UDI Methodology
Phase 3 Analyze and Ideate: Conduct a literature analysis to identify existing solutions to the problem posed. It is also convenient to perform a competitive analysis on different platforms (app stores), looking for existing products with similar audiences and functions that address the problem. Then generate all possible solution ideas and select the most promising one.
Phase 4 Design: Once the idea has been selected, the product to be built must be designed. The objective of this phase is to capture the solution through diagrams or schemes, considering the best alternative by integrating technical and functional aspects. In this phase, design decisions are made starting from its most general dimension
(information architecture and interaction design) to its most specific dimension (detailed graphic design and interactions) (Norman and Draper, cited in [24]).
Phase 5 Prototyping: This phase consists of creating prototypes that implement the design in a software product. The first versions can be built with tools that allow rapid prototyping of an app without functionality; many such tools are available on the Internet. In subsequent iterations, however, it is necessary to enter the coding stage so that the prototype begins to implement the required functionalities one by one.
Phase 6 Evaluate: The critical design and development decisions are tested through evaluation methods that involve users. Errors during the design and prototyping phases are inevitable; the most important thing is to identify them through testing and to learn from users’ reactions to the different prototypes.
Phase 7 Refine: The results are transformed into new adjustments and requirements for the next product cycle in the iterative loop of the methodology. This user feedback is essential, as it allows the prototype to be improved in each iteration. Depending on the users’ evaluation, it may be necessary to go back several times until the result meets their expectations.
It is important to stress again that the Design, Prototyping, Evaluate and Refine phases are cyclical and incremental. This means that everything that is designed must be constantly evaluated through prototyping, so that usability errors can be addressed early in the development process. The process ends when the implementation of the solution meets the client’s needs, at which point the final release or delivery is made.
The UDI methodology inherits principles developed by the theory and practice of iterative-incremental software development. It is also important to note that DT, like User Experience, is a cyclical process based on the observation of human behavior to solve problems. These two methodologies, seen from a design perspective, and IID, seen from a development perspective, complement each other. That is why the applications developed under this approach have the user as the main axis, and why development involves a prototype that grows and improves in each iteration of the loop.
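The control flow of the seven phases can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the names `run_udi`, `UdiRun` and the `meets_needs` callback are hypothetical and do not come from the paper; the callback stands in for the outcome of the user evaluation in each iteration.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Phases 1-3 are performed once; phases 4-7 form the iterative loop.
ONE_TIME_PHASES = ["Empathize", "Define", "Analyze and Ideate"]
CYCLIC_PHASES = ["Design", "Prototyping", "Evaluate", "Refine"]


@dataclass
class UdiRun:
    """Records which phases were executed, in order (illustrative only)."""
    log: List[str] = field(default_factory=list)
    iterations: int = 0


def run_udi(meets_needs: Callable[[int], bool], max_iterations: int = 10) -> UdiRun:
    """Run phases 1-3 once, then repeat phases 4-7 until users are satisfied.

    `meets_needs(i)` is a hypothetical callback that reports whether the
    prototype of iteration `i` (1-based) satisfied the users.
    """
    run = UdiRun()
    run.log.extend(ONE_TIME_PHASES)
    for i in range(1, max_iterations + 1):
        run.iterations = i
        run.log.extend(f"{phase} (iteration {i})" for phase in CYCLIC_PHASES)
        if meets_needs(i):
            break  # final release or delivery
    return run


if __name__ == "__main__":
    # Mimics a project that needs two iterations, as in Case Study 1.
    result = run_udi(lambda i: i >= 2)
    print(result.iterations, len(result.log))
```

The `max_iterations` bound is a design choice of this sketch, added so the loop always terminates; the methodology itself only states that the cycle ends when the solution meets the client’s needs.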
4 Results
This section describes the application of the methodology in two case studies developed by two undergraduate students as their degree projects. The first concerns the design and development of an app that addresses the problem of stray dogs in Cd. Juárez; Sect. 4.1 briefly describes how each of the phases was performed. The second concerns the design and development of an augmented reality app to improve communication with the customers of a coffee shop; Sect. 4.2 briefly describes the developed solution but does not detail the application of the methodology step by step.
4.1 Case Study 1
There are several reasons why owners abandon their pets, particularly dogs, on the streets; one of them may be a lack of information about animal care. The aim of the project is to provide a communication mechanism that offers adequate information to counteract this problem and also allows lost or stray dogs to be reported.
Phase 1 Empathize: To acquire basic knowledge about the users, a survey was conducted with Google Forms among young people between 11 and 15 years old, the target audience. A sample of 20 users was selected. The survey consisted of 10 questions, ranging from whether young people know the responsibilities of having a dog as a pet to the reasons why dogs are abandoned.
Phase 2 Define: The problem centers on the lack of information on responsible pet ownership, as well as on the lack of a way to report lost dogs. A typical user for whom the product was designed was created: a teenager between 11 and 15 years of age who attends high school. This type of user was targeted because teenagers are still young, can be persuaded in a positive way and can have an impact in the future.
Phase 3 Analyze and Ideate: A competitive analysis was carried out by surveying similar applications already on the market. Thirteen related apps were found in two of the main app stores, Google Play and the App Store. It was decided to build an informative app that included the possibility of reporting lost dogs.
Subsequently, the iterative-incremental phases were entered. In this case two iterations were carried out: the first produced the first prototype, which did not include the lost-dog report functionality; the second prototype covered it.
First Prototype
Phase 4 Design: The sections of the application were defined (see Fig. 2).
Each section is important for the app, since it covers a portion of the problem. During this phase, the first sketches of each section were made, as well as the low- and high-fidelity views. The contents of each section were also established, and the identity of the application was developed, which led to the creation of the logo, color palettes and font families, among others.
Fig. 2. App Sections in Case Study 1, First Prototype
Phase 5 Prototyping: For the construction of the first prototype, iBuild App [27] was used, a simple online tool that allows the creation of complex applications for both
Android and iOS. For this application, however, it was also necessary to include some functionalities through HTML and JavaScript. Fig. 3 shows the implementation of the sections “What are you doing here?” and “Contact”.
Fig. 3. Sections of: “What are you doing here?” and “Contact” in Case Study 1, First Prototype
Phase 6 Evaluate: The design and development were verified from the user’s point of view through usability tests of the app with the previously selected sample of teenagers from 11 to 15 years old. As a result of these first usability tests, some sections were readjusted and some even changed their functionality.
Phase 7 Refine: Once the usability tests had been carried out, the second iteration was started to improve the prototype based on user feedback.
Second Prototype
The second version of the application involved the redesign of some features. The four cyclic phases of the UDI methodology in the second prototype are described below.
Phase 4 Design: Some sections of the app were renamed; Fig. 4 shows the new sections.
Fig. 4. Scheme of Redefinition in Case Study 1, Second Prototype
Additionally, it was necessary to design a database to store the information entered by users when they fill out a lost-dog report. Before building the database, an entity-relationship diagram was drawn to represent the entities of the application and the relationships between them (Fig. 5).
Fig. 5. Entity-Relationship Diagram “Apperro” Application
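To make the idea of storing and querying lost-dog reports concrete, the following Python sketch uses the standard `sqlite3` module with an in-memory database. The schema is an assumption made purely for illustration: the tables `reporter` and `lost_dog_report`, their columns, and the helper functions are hypothetical and are not taken from the diagram in Fig. 5 or from the actual Android implementation.

```python
import sqlite3

# Hypothetical schema: one reporter can file many lost-dog reports.
SCHEMA = """
CREATE TABLE reporter (
    id    INTEGER PRIMARY KEY,
    name  TEXT NOT NULL,
    phone TEXT
);
CREATE TABLE lost_dog_report (
    id          INTEGER PRIMARY KEY,
    reporter_id INTEGER NOT NULL REFERENCES reporter(id),
    dog_name    TEXT,
    description TEXT NOT NULL,
    last_seen   TEXT NOT NULL
);
"""


def create_db() -> sqlite3.Connection:
    """Create an in-memory database with the illustrative schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def add_report(conn, reporter_name, phone, dog_name, description, last_seen) -> int:
    """Insert a reporter and the associated report; return the report id."""
    cur = conn.execute(
        "INSERT INTO reporter (name, phone) VALUES (?, ?)",
        (reporter_name, phone),
    )
    reporter_id = cur.lastrowid
    cur = conn.execute(
        "INSERT INTO lost_dog_report (reporter_id, dog_name, description, last_seen) "
        "VALUES (?, ?, ?, ?)",
        (reporter_id, dog_name, description, last_seen),
    )
    conn.commit()
    return cur.lastrowid


def list_reports(conn):
    """Return the rows a list view of lost dogs could display."""
    return conn.execute(
        "SELECT r.dog_name, r.description, r.last_seen, p.name "
        "FROM lost_dog_report AS r JOIN reporter AS p ON p.id = r.reporter_id "
        "ORDER BY r.id"
    ).fetchall()
```

Separating the reporter from the report is one possible reading of an entity-relationship design; a single-table layout would also work for a prototype.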
Phase 5 Prototyping: For this second prototype, the Android Studio IDE was selected to develop the “Apperro” application for Android platforms. Using the first prototype as a base, the layouts for the main navigation destinations were created. Following the proposed design, seven vertical layouts and two detail views were developed for the mobile version. The horizontal variants of the application were not designed, so some adjustments were proposed to adapt the elements of the application when the device is rotated horizontally. The layouts used for the screens varied according to their particular needs. The “home” screen is shown as an example: ConstraintLayout was used to accommodate and adapt the buttons properly on most Android devices, as shown in Fig. 6. Once the database was built, its information was queried in the Lost Dogs section. The data was displayed in a RecyclerView, which showed relevant information about each dog and allowed the user to navigate to a detailed view by clicking on an item (see Fig. 7).
Fig. 6. Front End Home Screen Development in Case Study 1, Second Prototype
Fig. 7. List of Lost Dogs and their Detailed Information with Information from the Database in Case Study 1, Second Prototype
Phase 6 Evaluate: A second usability test was conducted with 20 people who could be potential customers. None of them had heard about the project and all were completely unfamiliar with the application. In the test, users were given instructions, interacted with the application and provided feedback on the design and usability of the software. There were five specific tasks in total. The usability tests collected data on three metrics: accuracy, time and satisfaction. The results were excellent: most of the requested tasks were completed successfully, and the flaws found in the app were minor design and development issues.
Phase 7 Refine: The mobile application was improved by addressing the small details reported by users, such as wait and pause signs. More information about the development of the application and the results of the usability tests can be found in [28].
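As an illustration of how the three usability metrics (accuracy, time and satisfaction) could be aggregated per task, the following Python sketch summarizes a list of observations. The record layout, the 1 to 5 satisfaction scale and the function name `summarize` are assumptions of this sketch, and any data passed to it would be illustrative rather than the results reported in [28].

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class TaskResult:
    participant: str
    task: str
    completed: bool    # accuracy: was the task finished correctly?
    seconds: float     # time on task
    satisfaction: int  # assumed scale: 1 (low) to 5 (high)


def summarize(results: List[TaskResult]) -> Dict[str, Dict[str, float]]:
    """Aggregate completion rate, mean time and mean satisfaction per task."""
    by_task: Dict[str, List[TaskResult]] = {}
    for r in results:
        by_task.setdefault(r.task, []).append(r)

    summary: Dict[str, Dict[str, float]] = {}
    for task, rows in by_task.items():
        summary[task] = {
            "completion_rate": sum(r.completed for r in rows) / len(rows),
            "mean_seconds": mean(r.seconds for r in rows),
            "mean_satisfaction": mean(r.satisfaction for r in rows),
        }
    return summary
```

A per-task summary of this kind makes it easier to see which task caused the readjustments in the Refine phase, which an overall average would hide.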
4.2 Case Study 2
The “Casa Cafetzin Tostadores Artesanales” coffee shop faces several problems, ranging from consumers interested in buying bags of coffee for their businesses to its menu offering and the understanding of its own processes. The shop already has a website whose sole purpose is to display information about the establishment, but not about its products, and no one is in charge of updating the content. The owners wanted a new system that addressed these issues by showcasing information about their key products in a visually attractive way, which is why Augmented Reality technology was selected for this project. An informative app was therefore designed and developed that shows a menu of the shop’s products, lets the user visualize them in three-dimensional space, and presents the information and preparation method of each product through 3D animations, which covers the stated requirements. Figure 8 shows some of the products offered in the shop as 3D models visualized in the mobile augmented reality app. The usability tests were satisfactory: the customers who used the app said they had more information about the products, since augmented reality provides greater communication and interaction with them. More information about the development of the application and the results of the usability tests can be found in [29].
Fig. 8. Screens of Some Drinks with Augmented Reality in Case Study 2 [29]
5 Conclusions
Building mobile applications that involve the end user during development is a difficult task. Products sometimes fail to consider the real needs of users and do not involve them in the development process. A good product is one that satisfies the needs of its users, and achieving this is one of the challenges of software development. This research proposed combining two methods that are used in design precisely to place the user, rather than the final product, at the center: user experience design and Design Thinking. Combining both methods within the iterative-incremental software development process helped to address this problem to a great extent, by taking the user into account at several stages, from the conception of the main idea through the development of the application itself. In the present work, a software development methodology was formulated that joins these two methods with a development cycle that has been widely tested in application development, iterative-incremental software development. The result is a hybrid methodology, called UDI, that adapts to needs and resources and in which the requirements and the solution evolve through the collaboration and evaluation of users. After the methodology was developed, it was applied in two case studies: a mobile application to tackle the problem of stray dogs in Ciudad Juárez, and a mobile application that improves communication between customers and a coffee shop by enriching the information presented through augmented reality.
The usability tests of the developed applications, carried out with various users, allowed the product to be improved at each stage of the development cycle. The positive results on timely task completion, together with the satisfaction findings, show that the methodology provides satisfactory results and that the applications developed were friendly, easy to use and therefore useful. As future work, a test protocol based on heuristic evaluation will be established to evaluate the prototype. Qualitative data will be collected from interviews and post-use observations of the experts. This protocol will also help to identify which heuristics or indicators are the most accurate for the developed applications.
References
1. Meinel, C., von Thienen, J.: Design thinking. Informatik-Spektrum 39(4), 310–314 (2016). https://doi.org/10.1007/s00287-016-0977-2
2. Project Management Institute: El estándar para la dirección de proyectos e Guía de los fundamentos para la dirección de proyectos (Guía del PMBOK). Séptima edición. Project Management Institute Inc, Newtown Square, Pennsylvania (2021)
3. Cuello, J., Vittone, J.: Diseñando apps para móviles. Primera edición. Catalina Duque Giraldo (2013)
4. Dingsøyr, T., et al.: A decade of agile methodologies: towards explaining agile software development. J. Syst. Softw. 85(6), 1213–1221 (2012)
5. Snyder, H.: Literature review as a research methodology: an overview and guidelines. J. Bus. Res. 104, 333–339 (2019)
6. Chen, P., Huang, R.: Design thinking in app inventor game design and development: a case study. In: Paper presented at IEEE 17th International Conference on Advanced Learning Technologies (ICALT), Timisoara, Romania (2017). https://doi.org/10.1109/ICALT.2017.161
7. Costa, N., Silva, W., Conte, T.: The students’ perspectives on applying design thinking for design of mobile applications. In: Paper presented at IEEE/ACM 39th International Conference on Software Engineering: Software Engineering Education and Training Track (ICSE-SEET), Buenos Aires, Argentina (2018). https://doi.org/10.1109/ICSE-SEET.2017.10
8. Palacin, M., Khakurel, J., Happonen, A., Hynnien, T., Porras, J.: Infusing design thinking into a software engineering capstone course. In: Paper presented at IEEE 30th Conference on Software Engineering Education and Training (CSEE&T), Savannah, Georgia (2017). https://doi.org/10.1109/CSEET.2017.41
9. Alahmari, F., Anandhavalli, M.: User design thinking in information system development: a survey. In: Paper presented at 21st Saudi Computer Society National Computer Conference (NCC), Riyadh, Saudi Arabia (2015). https://doi.org/10.1109/NCG.2018.8593149
10. Apat, J.: Aplicaciones móviles para estudiantes a través de Design Thinking y SCRUM. Séptima Conferencia de Directores de Tecnología de Información, TICAL 2017 Gestión de las TICs para la Investigación y la Colaboración, San José (2017)
11. Pereira, J.C., Russo, R.: Design thinking integrated in agile software development: a systematic literature review. Procedia Comput. Sci. 138(1), 775–782 (2018). https://doi.org/10.1016/j.procs.2018.10.101
12. Corral, L., Fronza, I.: Design thinking and agile practices for software engineering innovation. In: Paper presented at 19th Annual SIG Conference on Information Technology Education, Lauderdale, Florida, pp. 26–31 (2018). https://doi.org/10.1145/3241815.3241864
13. Schön, E., Thomaschewski, J., Escalona, M.: Agile requirements engineering: a systematic literature review. Comput. Stand. Interfaces 49, 79–91 (2017). https://doi.org/10.1016/j.csi.2016.08.011
14. Ferris, T.: User-centered design: an integrated approach. IEEE 47(1), 75–77 (2004). https://doi.org/10.1109/TPC.2004.824283
15. Stanziola, E., et al.: User-centered design of health care software development: towards a cultural change. In: Sarkar, I.N., Georgiou, A., De Azevedo, P.M. (eds.) EHealth-enabled Health, pp. 368–371. Buenos Aires, Argentina: IOS Press (2015)
16. Vargas Márquez, B.L., Inga Hanampa, L.A., Maldonado Portilla, M.G.: Design thinking applied to user experience design. Revista Innovación y Softw. 2(1), 6–19 (2021)
17. Gray, C.S., Khan, A.I., McKillop, I., Sharpe, S., Cott, C.: User-centred co-design with multiple user groups: the case of the electronic patient reported outcome (ePRO) mobile application and portal. Int. J. Integr. Care 19(4) (2019). https://doi.org/10.5334/ijic.s3439
18. Nystrom, D., Singh, H., Giardina, T., Sittig, D.: Methods for patient-centered interface design of test result display in online portals. Ubiquity Press 6(1) (2018). https://doi.org/10.5334/egems.255
19. Berlanga Fernández, S., Villa, G.L., Dolado, M.C., Rodríguez, B.O., Fabrellas, P.N.: Creando una aplicación móvil en salud. Rev ROL Enferm. 40(6), 428–434 (2017)
20. Gasca Mantilla, M.C., Camargo Ariza, L.L., Medina Delgado, B.: Metodología para el desarrollo de aplicaciones móviles. Tecnura 18(40), 20–35 (2014)
21. Dobrigkeit, F., de Paula, D., Uflacker, M.: InnoDev: a software development methodology integrating design thinking, scrum and lean startup. In: Meinel, C., Leifer, L. (eds.) Design Thinking Research. UI, pp. 199–227. Springer, Cham (2019). https://doi.org/10.1007/978-3-319-97082-0_11
22. Zorzetti, M., Signoretti, I., Salerno, L., Marczak, S., Bastos, R.: Improving agile software development using user-centered design and lean startup. Inf. Softw. Technol. 141, 1–14 (2022). https://doi.org/10.1016/j.infsof.2021.106718
23. Fauquex, M., Goyal, S., Evequoz, F., Bocchi, Y.: Creating people-aware IoT applications by combining design thinking and user-centered design methods. In: Paper presented at IEEE 2nd World Forum on Internet of Things (WF-IoT), pp. 57–62. Milan, Italy (2015). https://doi.org/10.1109/WF-IoT.2015.7389027
24. Montero, H.Y.: Experiencia de Usuario: Principios y Métodos (2015). http://yusef.es/Experiencia_de_Usuario.pdf
25. Larman, C., Basili, V.: Iterative and incremental development: a brief history. Computer 36(06), 47–56 (2003). https://doi.org/10.1109/MC.2003.1204375
26. Nielsen Norman Group. https://www.nngroup.com/. Accessed 05 May 2021
27. iBuild App. https://es.ibuildapp.com/. Accessed 21 Nov 2021
28. Méndez-Gurrola, I., Portillo-Payan, A., Barraza-Castillo, R.: An information mechanism to counteract the problem of stray dogs. In: Proceedings of the 23rd World Multi-Conference on Systemics, Cybernetics and Informatics (WMSCI 2019), vol. 4, pp. 150–155. Orlando, Florida, USA (2019)
29. Sánchez-Corral, E., Méndez-Gurrola, I., Rodríguez-Garay, G.: Augmented reality mobile application for communicative functionality with casa cafetzin customers. Pistas Educativas 43(139), 729–746 (2021)
Deploying Digital Twin in Manufacturing Systems: Scope and Requirements
Nada Ouahabi1(B), Ahmed Chebak1, Mouna Berquedich1, Oulaid Kamach1,2, and Mourad Zegrari1,3
1 Green Tech Institute, University of Mohamed VI Polytechnic, Benguerir, Morocco
[email protected]
2 Laboratory of Innovative Technologies, Abdelmalek Essaâdi University, Tangier, Morocco
3 Hassan II University of Casablanca, Casablanca, Morocco
Abstract. To cope with rapid technological and social development, changing customer behavior and increasing production intricacy, digitalization is believed to ultimately become the primary way for manufacturing companies to achieve high efficiency and productivity. The digital twin is a burgeoning form of digitalization that provides an in-depth understanding and enhanced management of manufacturing systems based on real-time physical-virtual convergence. Driven by real-time connectivity and virtual models, the digital twin has rejuvenated many manufacturing tasks such as what-if scenario analysis, operation monitoring and optimization, diagnosis and prognosis. However, the diversity of interpretations and applications of this concept obscures its understanding and hinders the promises of the digital twin. This paper is intended to provide a better understanding of the digital twin within the manufacturing context, including its application scenarios and main requirements. Some enabling technologies and methods for each requirement are also discussed. Finally, the paper concludes with some research and implementation gaps that impede the widespread adoption of the digital twin.
Keywords: Smart Manufacturing · Digital Twin · Enabling Technologies
1 Introduction
In view of the ongoing evolution in information and communication technologies (ICT), a new industrial revolution is in full swing. Coined Industry 4.0, the fourth industrial revolution offers new ways to increase efficiency and productivity and transforms the industrial landscape from a conventional production system into a decentralized, flexible, self-organizing and automated production system, creating a smart manufacturing paradigm [1]. The use of models, data and information throughout the product lifecycle is the cornerstone of implementing “smartness” in manufacturing environments. Industry 4.0 technologies, such as artificial intelligence, the Industrial Internet of Things (IIoT) and cloud computing, make it possible to collect supply, manufacturing and sales data in production and to obtain knowledge from these data, which supports the transformation to intelligent, adaptive and autonomous manufacturing systems.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): SAI 2023, LNNS 711, pp. 639–650, 2023. https://doi.org/10.1007/978-3-031-37717-4_41
Recently, a mega-trend of using digital twins to underpin digital transformation can be observed. The digital twin was first introduced in 2003 by Michael Grieves [2] in his product lifecycle management course at the University of Michigan. Thereafter, the digital twin has gained widespread popularity as a transition from expensive physical tests toward virtual high-fidelity simulations, sensor measurements and intelligent control. Primarily embraced by the military and aerospace sectors, the digital twin has subsequently attracted research and implementation efforts in more and more fields, such as smart cities, healthcare, agriculture and manufacturing. This popularity was reinforced by Gartner, which in 2020 named the digital twin a promising technology for the coming 5–10 years [3], after listing it among the top ten strategic technology trends for three successive years, ranking it fifth in 2017 [4] and fourth in 2018 [5] and 2019 [6]. Currently, the digital twin is a concept adopted by a growing number of manufacturing companies and investigated by an increasing number of academics. A digital twin is a virtual facsimile of a living physical system that fuses physical and virtual information to provide an in-depth understanding of a specific phenomenon, which makes it possible to enhance the current and/or future performance of the underlying system (as shown in Fig. 1). There is no consensus on the definition of the digital twin. In manufacturing, the digital twin is a virtual copy that attempts to mirror its physical counterpart (e.g., machine, workshop, shop floor and worker) as closely as possible so as to provide information in a consistent format that supports decision-making [7]. Driven by the real-time nexus between virtual and physical assets, the digital twin enables users to make more reliable decisions to enhance throughput, product quality, efficiency, safety, or equipment availability.
Fig. 1. Holistic View of a Digital Twin
In recent years, the digital twin has been applied in many scenarios in the manufacturing context, such as predictive maintenance, online monitoring, production scheduling and parameter optimization. However, a consistent and consolidated view of what a digital twin is remains scarce, since no clear-cut definition has been given. To the best of our knowledge, no studies have delved into the
requirements that form a digital twin. Hence, the purpose of this paper is to provide a better understanding of the digital twin concept in the manufacturing context, including its application scenarios and requirements. The paper is organized as follows: Sect. 2 presents an overview of the current research status of the digital twin and discusses some of its application scenarios in manufacturing. Section 3 describes the main requirements that need to be met when establishing a digital twin of a manufacturing asset, followed by a conclusion and a discussion of potential research opportunities in Sect. 4.
2 Literature Review
Since the first definition of the digital twin given by NASA in 2010 [8], abundant definitions have been proposed, and the original concept has been diluted because there is no common understanding of the term, whose interpretation varies with the application context. Table 1 summarizes some of the generic definitions of the term digital twin in the literature. For a complete list of definitions, the reader is referred to [9]. From these definitions, we can conclude that the main feature of the digital twin is the correlation and integration between the physical entity, virtual models, specific system applications, and real-time data exchanged over a bidirectional communication channel. On this basis, a consolidated and generic definition of the digital twin is formulated here as “a virtual, dynamic, tailored bespoke and multidisciplinary representation of a physical system that guarantees a trade-off between high fidelity, low complexity and low implementation cost”, which is consistent with most of the definitions proposed in the literature.
Table 1. Definitions of the Term Digital Twin
Reference | Definition
[8] | A digital twin is an integrated multi-physics, multi-scale, probabilistic simulation of a vehicle or system that uses the best available physical models, sensor updates, fleet history, etc., to mirror the life of its flying twin
[10] | A digital copy of a real factory, machine, worker etc., that is created and can be independently expanded, automatically updated as well as being globally available in real time
[11] | Digital twins are software representations of assets and processes that are used to understand, predict, and optimize performance in order to achieve improved business outcomes
[12] | A digital twin is a virtual instance of a physical system (twin) that is continually updated with the latter’s performance, maintenance, and health status data throughout the physical system’s life cycle
[13] | Digital twin can be regarded as a paradigm by means of which selected online measurements are dynamically assimilated into the simulation world, with the running simulation model guiding the real world adaptively in reverse
N. Ouahabi et al.
It is noteworthy that the digital twin is not an entirely new concept. It is an evolution of previous technologies, such as simulation and 3D modeling. Technically speaking, the digital twin is not restricted to one fixed technical means; it is an integrated concept that leverages many technologies in tandem to achieve digital twin modeling and data management (e.g., data acquisition, preprocessing, privacy and security). For instance, driven by the incoming IIoT data and depending on the communication criticality between the physical and virtual entities, a digital twin can be deployed at different layers of the communication pyramid, i.e., in the cloud near the application, at the edge near the data sources, or as a collaboration between these two layers. Machine Learning (ML) algorithms, as pattern-matching tools, can be used to emulate a specific phenomenon in the physical system and thus establish its virtual image. Moreover, ML algorithms can preprocess the incoming raw data of physical assets, which improves the digital twin's performance. Furthermore, blockchain technology can be leveraged in the digital twin to ensure secure data management and trustworthiness.
The digital twin is applied across a broad spectrum of manufacturing activities, to name but a few:
(1) Real-time monitoring; based on real-time data alongside historical data, the digital twin can track the past, supervise the present and predict the future, which helps to provide timely decision support to correct any production issue.
(2) Prognostics and health management (PHM); the digital twin improves the reliability and safety of operators and manufacturing assets through timely anomaly monitoring and efficient maintenance decisions based on its diagnosis/prognosis outputs.
(3) Process evaluation and optimization; with the dynamic changes in the machining process and the uncertain manufacturing environment, product quality must be dynamically monitored. Real-time data gives the digital twin awareness of the actual equipment state and machining process, and corrective measures are performed according to the processing deviation.
(4) Production planning; the uncertainty in the manufacturing environment renders static production planning methods obsolete. The digital twin can dynamically optimize and/or verify production plans once a disturbance is detected on the shop floor.
Owing to the hierarchical structure of industrial entities, digital twin models are categorized into three levels, namely, unit level, system level and system of systems (SoS) level [14]. A unit-level digital twin enables individual-level functions such as state optimization, real-time monitoring and prognosis. System-level and SoS-level digital twin models can be established by integrating and coupling multiple subsystems so as to enhance the overall performance and advance the implementation of the “smart factory”. In a shop-floor environment, the equipment, the production line and the shop floor, respectively, constitute these three levels. From another perspective, a single piece of equipment can itself be considered at the SoS level: for example, a mechanical machine contains the main body, feed system and spindle system, which are at the system level, while functional components such as drivers and motors, which constitute the feed system, are at the unit level. Hence, the hierarchical structure of a digital twin model is relative to the complexity of the manufacturing environment and the precision required by the owners of the digital twin.
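The relativity of the three levels can be illustrated with a small sketch. It is not taken from [14]: the class `TwinNode`, the rule that derives a level label from the decomposition depth, and the component names are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TwinNode:
    """One digital twin model in a hierarchy (hypothetical structure)."""
    name: str
    children: List["TwinNode"] = field(default_factory=list)

    def depth(self) -> int:
        """Number of levels in the subtree rooted at this node."""
        return 1 + max((child.depth() for child in self.children), default=0)

    def level(self) -> str:
        """Level label, relative to how finely the node is decomposed.

        Assumption: a leaf is a unit, a node made only of units is a
        system, and anything deeper is a system of systems.
        """
        d = self.depth()
        if d == 1:
            return "unit"
        if d == 2:
            return "system"
        return "system of systems"


# Equipment view: the machine is decomposed down to its components.
feed_system = TwinNode("feed system", [TwinNode("driver"), TwinNode("motor")])
spindle_system = TwinNode("spindle system", [TwinNode("spindle motor")])
machine = TwinNode("machine", [feed_system, spindle_system])

# Shop-floor view: machines are modelled as undivided units.
line = TwinNode("production line", [TwinNode("machine A"), TwinNode("machine B")])
shop_floor = TwinNode("shop floor", [line])

for node in (machine, feed_system, shop_floor, line, line.children[0]):
    print(f"{node.name}: {node.level()}")
```

In this sketch the same kind of asset (a machine) is labelled a unit in the shop-floor view and a system of systems in the equipment view, which mirrors the point that the level depends on the modeling granularity chosen by the owner of the digital twin.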
Deploying Digital Twin in Manufacturing Systems
3 Requirements of Digital Twin
A digital twin is the projection of physical assets in cyberspace. The requirements below should be satisfied so that the digital twin reflects reality more closely. This section discusses the main requirements that define a digital twin in industrial settings, namely, interconnectivity, high fidelity, extensibility, standardization, interoperability and reusability.
3.1 Interconnectivity
A digital twin is frequently confused with a “digital model” and a “digital shadow” since they all fall within the scope of digitalization. However, the degree of connectivity with the physical entity differs at each level. As seen in Fig. 2, the digital model represents the foundation of digitalization and does not involve any automatic data stream between the image created in cyberspace and the physical entity. The digital shadow goes beyond the digital model by using an automated one-way data stream from the physical entity to the digital model. In contrast, the digital twin must have a real-time two-way data binding between the physical entity and the digital model, which entails a reciprocal impact on the state of both entities (i.e., the virtual entity and the physical entity). These three levels can also be considered as stages in the lifecycle of a digital twin as its physical counterpart transitions from the design phase to the use phase. Industrial communication protocols provide the data communication rules between physical and digital assets. Protocols such as Message Queuing Telemetry Transport (MQTT), Modbus, Open Platform Communications Unified Architecture (OPC UA) and Contract Net Protocol (CNP) provide a bidirectional communication channel to transmit data and communicate decisions in real time and thus support the digital twin's implementation. It is noteworthy that a digital twin can be interconnected to its physical counterpart via a proxy, for example a database that contains real-time data from the physical asset or a remote server in the cloud.
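The distinction between the three levels reduces to which data streams are automated. The following minimal sketch encodes that rule; the names `Level` and `classify` are hypothetical, and the handling of the case that the text does not cover (an automated stream only from the virtual to the physical side) is an assumption.

```python
from enum import Enum


class Level(Enum):
    DIGITAL_MODEL = "digital model"
    DIGITAL_SHADOW = "digital shadow"
    DIGITAL_TWIN = "digital twin"


def classify(physical_to_virtual_auto: bool, virtual_to_physical_auto: bool) -> Level:
    """Classify a digital representation by its automated data streams."""
    if physical_to_virtual_auto and virtual_to_physical_auto:
        # Automated two-way data binding.
        return Level.DIGITAL_TWIN
    if physical_to_virtual_auto:
        # Automated one-way stream from the physical entity to the model.
        return Level.DIGITAL_SHADOW
    # No automated stream from the physical entity. A virtual-to-physical
    # stream alone is not discussed in the text; treating it as a digital
    # model is an assumption of this sketch.
    return Level.DIGITAL_MODEL


print(classify(False, False).value)
print(classify(True, False).value)
print(classify(True, True).value)
```

The sketch deliberately ignores the real-time requirement, which would need a latency criterion that the text does not quantify.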
Fig. 2. The Three Levels of Digitalization
3.2 High-Fidelity
Model fidelity emphasizes the realism of the model when compared with its physical counterpart in a certain aspect. The model, as a success driver of digital twin applications, can be a physics-based computational model or a data-driven model, applied standalone or in tandem. Physics-based models (e.g., finite element models) offer high fidelity with fine-scale spatiotemporal resolution and can mirror high-complexity manufacturing systems with low uncertainty. Additionally, they do not require large amounts of data and they retain the interpretability of a model, thus capturing the dynamics of a system and generating physically consistent data. However, physics-based models are generally limited by their high computational burden, which restricts the application of the digital twin in manufacturing scenarios with rapid response requirements, such as real-time monitoring. On the other hand, data-driven models (e.g., machine learning) are lightweight, albeit with higher uncertainty, and apply to both low-complexity and high-complexity manufacturing systems since they can extract complex spatiotemporal patterns from data. This relieves the computational burden and modeling effort and makes them suited to real-time manufacturing applications. However, machine learning models are data-hungry tools, as they require more training data to offset the lack of a mathematical structure (materialized as physical laws), and they have a limited capacity for extrapolation, i.e., the trained model may fail in unseen scenarios. Therefore, a synergistic approach that combines both should be adopted when modeling a digital twin, specifically for complex manufacturing systems, so as to establish a high-fidelity model with real-time interaction with the corresponding physical system. For more details about hybrid data-driven physics-based models in the manufacturing context, the reader is referred to [15].
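One common way to combine the two model families is to let a data-driven model learn only the residual that the physics-based model fails to explain. The sketch below illustrates this idea under strong assumptions: the "physics" is a linear stiffness law, the residual model is an ordinary least-squares line, and all numbers and function names are hypothetical. It is not the method of [15].

```python
from statistics import mean


def physics_deflection(force_n: float, stiffness_n_per_mm: float = 200.0) -> float:
    """Physics-based part: deflection (mm) of an ideal linear spring."""
    return force_n / stiffness_n_per_mm


def fit_residual(forces, measured):
    """Data-driven part: least-squares line through the physics residuals."""
    residuals = [m - physics_deflection(f) for f, m in zip(forces, measured)]
    f_bar, r_bar = mean(forces), mean(residuals)
    sxx = sum((f - f_bar) ** 2 for f in forces)
    sxy = sum((f - f_bar) * (r - r_bar) for f, r in zip(forces, residuals))
    slope = sxy / sxx
    intercept = r_bar - slope * f_bar
    return slope, intercept


def hybrid_deflection(force_n: float, slope: float, intercept: float) -> float:
    """Hybrid prediction: physics model plus learned residual correction."""
    return physics_deflection(force_n) + slope * force_n + intercept


# Synthetic "measurements": the physics law plus an unmodelled linear effect.
forces = [100.0, 200.0, 300.0, 400.0]
measured = [physics_deflection(f) + 0.001 * f + 0.05 for f in forces]

slope, intercept = fit_residual(forces, measured)
print(physics_deflection(250.0), hybrid_deflection(250.0, slope, intercept))
```

The physics term keeps the prediction interpretable and physically grounded, while the residual term needs far less data than a purely data-driven model because it only has to capture the discrepancy.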
While promoting model fidelity, model simplicity must also be considered. That is, a digital twin model should be a compromise between fidelity and simplicity. In current digital twin modeling efforts, because of the term ‘twin’, being ‘the same’ as the physical counterpart is frequently overemphasized, which results in superfluous complexity in the digital twin model. Indeed, it makes little sense to establish a multidisciplinary model to study a purely mechanical phenomenon. Thus, a digital twin model must meet a specific requirement rather than be an exact copy of its physical counterpart.
3.3 Extensibility
The SoS evolves over time, gradually integrating new objects and new features in each object. Therefore, a digital twin must be easily reconfigurable to maintain synchronization with the corresponding physical system. This feature can be provided by adopting a modular approach when designing and building an SoS-level digital twin model (e.g., the shop-floor digital twin), which employs several independent software blocks, i.e., unit-level digital twin models. This eases reconfiguration when upgrading, adding or removing individual elements. Modular architectural patterns such as service-oriented architectures (SOA) and microservices are commonly adopted to develop flexible, evolutionary and distributed systems. Service-oriented digital twin implementation entails breaking the system into separate service modules, which in our context can be seen as interactive digital twins that gather data, execute decision models, perform actions
and share information to improve decision-making not only for individual systems but also for the system of systems as a whole. The microservices architecture is an evolution of SOA in which services are more independent from each other, which provides more elasticity (i.e., the ability to frequently upgrade the digital twin without interrupting its operation) and improves fault tolerance by reducing the risk of cascading failure in the SoS-level digital twin model [16]. However, notwithstanding these advantages, modular design creates significant interoperability, security and privacy issues that ought to be addressed.
3.4 Standardization
Standardization serves as a basis for any product or process development, as it addresses the extensibility, interoperability and reusability of digital twin models across their lifecycles. In the digital twin implementation process, standardization can cover many modeling aspects, including the model running environment, development tools and data specifications. Some efforts have already been made to define standards for the digital twin's implementation in the manufacturing context. Through the lens of the well-known five-dimension architecture of the digital twin (i.e., physical entity, virtual entity, digital twin data, connection and services) proposed in [17], we can map each dimension to a set of digital twin-related standards:
(1) Physical entity; a physical entity is either a data source (sensor) or an action driver (actuator) for a digital twin model. Standards such as IEEE 1451, a series of smart transducer interface protocols, provide common and network-independent communication interfaces for smart transducers (actuators and sensors) [18]. IEEE 2888 defines a series of interfaces between physical and cyber spaces, including vocabulary, metrics, requirements, data specifications and APIs enabling the acquisition of sensor measurements [19].
(2) Virtual entity; a virtual entity is a set of virtual information constructs that describe the multi-scale features (i.e., space-scale and time-scale) of the modeling object. To reflect the physical entity comprehensively, the virtual entity can integrate geometric models, physical models, behavior models and rule models [20]. The geometric models are used as a front-end display since they reflect the external features of the physical entity. Physical and behavior models are used for simulation and analysis in the back-end since they reflect the internal features of the physical entity. Rule models describe laws or constraints in geometric, physical and behavior models. Furthermore, the digital representation of a physical entity may be a fusion of multidisciplinary heterogeneous models such as kinematic models and control models. Thus, standards must cover all the above aspects of digital twin modeling. In the manufacturing context, IEC 63278-1 (Asset Administration Shell for Industrial Applications, Part 1: Asset Administration Shell Structure) defines a standardized virtual representation of an industrial asset and ensures interoperability through a formal data exchange. ISO 23247-3 [21] employs some existing modeling standards (e.g., IEC 62264 “Enterprise-control system integration” [22], IEC 62714 “Automation Markup Language (AML)” [23]), as they satisfy many use cases when common industrial data formats such as OPC UA, JSON, RDF and XML are used.
(3) Data; data is the driving force of a digital twin. Standards must cover data integration, data fusion, data preprocessing, data storage, data exchange and synchronization through the standardization of data structures and properties. Existing
standards like ISO 29002, IEC 61360-1, ISO 13584 and IEC 61987 can be adopted in digital twin implementation since they provide guidelines to support data exchange and standardize the data library, data structures and data type elements for a manufacturing asset. However, standards must also consider the real-time feedback loop between virtual and physical entities.
(4) Services; in a digital twin, services include business services, which convey relevant information to users to enable decision-making, and functional services, which maintain the proper operation of the digital twin system. ISO 23247-4 defines three use cases describing digital twins of product, process, and resources for dynamic scheduling between a set of robots. In manufacturing, digital twin-related standards must also address other services (e.g., online monitoring and predictive maintenance), covering digital twin service testing, service management, service QoS and service levels.
(5) Connection; to implement a digital twin system, a real-time bidirectional connection is required between the physical and virtual entities. IEC and IEEE provide solutions for wired communication, such as the IEC 61784-2 and IEEE 802.3 standards series, to achieve the synchronization and bandwidth required by the digital twin. Moreover, wireless communication is addressed in several standards such as IEC 62948 (WIA-FA) and IEC 62601 (WIA-PA). On the other hand, standards that are not specific to digital twins (e.g., 5G, 6LoWPAN and SmartMesh) can also be used, as they can solve digital twin problems. However, standards must cover infrastructure specifications, including equipment type and parameters, as well as the integration of heterogeneous networks, since many different communication technologies could coexist in the same manufacturing environment. 
Standards must also emphasize the “real-time” connection, as it is the key feature of any digital twin; thus, methods for the evaluation and implementation of a real-time connection must be defined and standardized.
3.5 Interoperability
The heterogeneity of the system-level digital twins involved in an SoS-level digital twin and the need for seamless interoperation between different digital twin models make interoperability another important feature to ensure when implementing a digital twin. Ontology engineering and semantic modeling are among the efforts to mitigate interoperability issues in a heterogeneous information system. Ontologies allow the construction of formal information models based on a shared vocabulary so as to underpin reasoning, common understanding and thus information sharing between heterogeneous systems and domains. Leveraging ontologies in digital twins has recently led to what is known as the cognitive twin (CT) or cognitive digital twin. The CT concept is defined as a digital twin with augmented semantic capabilities for identifying the dynamics of virtual model evolution, promoting the understanding of interrelationships between virtual models and enhancing decision-making based on the digital twin [24]. In the literature, several efforts have been devoted to establishing an interoperable digital twin system based on standard ontologies, such as the Semantic Sensor Network (SSN) Ontology [25, 26], on taxonomies from a set of international standards, like ISO 2041, ISO 13372 and ISO 17359:2011 in [27] and IEC 61968, IEC 61970 and IEC 62325 (CIM) in [28], or on standard architectures, such as RAMI 4.0 in [29], the well-known Reference Architecture Model Industrie 4.0, which involves the concept of an Industry 4.0 component,
consisting of an Asset and an Asset Administration Shell (AAS), that supports the implementation of digital twins for Industry 4.0 and promotes interoperability by digitally describing an asset in a standard fashion. Still, the heterogeneity of information, models and technologies in an SoS-level digital twin system makes creating interoperable digital twins extremely challenging and a promising avenue for investigation.
3.6 Reusability
Digital twin models are typically customized for specific artifacts and scenes. Therefore, if the same digital twin model is reused in another working condition, its performance degrades and the model cannot be executed effectively. At the same time, reconstructing the digital twin model from scratch leads to poor modeling efficiency and a waste of resources. Reusability refers to the ability to reuse current knowledge in new working conditions or in a new application at the lowest cost. To allow reusability in different applications, an engineering strategy called software reuse is commonly employed, in which the development process relies on reusing existing software components. In our context, generic services or functions of a digital twin system can be encapsulated as components and stored in a so-called model library, which enables users to reuse digital twins already developed in previous projects. Unlike databases, model libraries also store the relationships between heterogeneous models rather than just each model's description. On the other hand, digital twins are not static: they must evolve to decrease the discrepancy between physical and virtual entities, and thus they must be reusable in different working conditions, such as those resulting from changeovers or performance degradation of the physical entity. One straightforward way to deal with model reusability in nonstationary environments is to deploy knowledge transfer approaches, namely, incremental learning and transfer learning. 
The major difference between these two approaches lies in how the existing knowledge is exploited. Incremental learning attempts to retain existing knowledge as much as possible while learning new knowledge. Transfer learning, on the other hand, uses existing knowledge only as a means to acquire new knowledge: once learning is achieved, it only attempts to improve the performance of the new knowledge and neglects the performance of the existing knowledge. Transfer learning performs better when the model must be established quickly and the working environment varies significantly; hence, it has already been employed in many digital twin studies such as [30, 31].
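The difference between the two approaches can be made concrete with a deliberately small sketch. It assumes a one-parameter model y = w * x trained by gradient descent, implements transfer learning as fine-tuning from the previously learned parameter on new data only, and implements incremental learning through rehearsal (replaying stored samples of the old working condition). Rehearsal is only one of several incremental learning strategies, and none of this reproduces the methods of [30, 31]; all names and numbers are hypothetical.

```python
def fit(w, data, lr=0.05, steps=200):
    """Gradient descent on the mean squared error of the model y = w * x."""
    for _ in range(steps):
        grad = sum(2 * x * (w * x - y) for x, y in data) / len(data)
        w -= lr * grad
    return w


def mse(w, data):
    """Mean squared error of the model y = w * x on a data set."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


xs = [1.0, 2.0, 3.0]
old_data = [(x, 2.0 * x) for x in xs]  # previous working condition
new_data = [(x, 3.0 * x) for x in xs]  # new working condition

# Existing knowledge: a model trained on the previous working condition.
w_old = fit(0.0, old_data)

# Transfer learning: start from w_old, optimise for the new condition only.
w_transfer = fit(w_old, new_data)

# Incremental learning (rehearsal): learn the new condition while replaying
# old samples, so that the old knowledge is retained as far as possible.
w_incremental = fit(w_old, new_data + old_data)

for name, w in (("transfer", w_transfer), ("incremental", w_incremental)):
    print(name, round(w, 3), round(mse(w, old_data), 3), round(mse(w, new_data), 3))
```

With a single shared parameter, the incremental model can only settle on a compromise between the two conditions; it nevertheless shows the trade-off described above: the transferred model fits the new condition best and forgets the old one, whereas the incremental model keeps a lower error on the old condition at the cost of a higher error on the new one.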
4 Conclusion
Since the introduction of the term “digital twin” and its first application in aerospace, the digital twin has been quickly evolving from theoretical research to pragmatic implementation. However, the polysemous nature of the term digital twin impedes its understanding and the realization of its full potential. Hence, this paper has first sought to provide a consolidated view of the digital twin, including its applications in the manufacturing context and its hierarchical structure. Then, the paper proposed a novel definition of a digital twin as: “a virtual, dynamic, tailored bespoke and multidisciplinary representation of a physical system that guarantees a trade-off between high fidelity, low complexity and low
implementation cost.”, which is universal as it is domain and sector independent. Moreover, the paper discussed the main requirements that constitute a digital twin system, highlighting some enabling technologies and methods for each requirement. To underpin a consistent evolution of the digital twin, several open issues should be resolved. First, questions regarding the ability of the digital twin to provide tangible improvements remain open, since there is a lack of successful demonstrations of the added value of the concept, which hampers its implementation. Second, quantitative metrics, both domain-dependent and domain-independent, need to be formally established so as to understand and improve the performance of a digital twin. Third, digital twin-related standards and protocols must be developed and/or updated to cover the latest developments and include up-to-date manufacturing technologies such as blockchain technology and the cloud and edge computing paradigms. Additionally, more effort ought to be devoted to refining the ability of digital twin models to reflect reality with a high synchronization level and to ensuring interoperability, extensibility and fault tolerance. Finally, a special focus should be given to data management, especially data privacy and security in the two-way communication channel of the digital twin, so as to protect physical entities from malicious attacks that may tamper with the exchanged data and damage them.
References
1. Büchi, G., Cugno, M., Castagnoli, R.: Smart factory performance and industry 4.0. Technol. Forecast. Soc. Chang. 150, 119790 (2020). https://doi.org/10.1016/j.techfore.2019.119790
2. Grieves, M., Vickers, J.: Digital twin: mitigating unpredictable, undesirable emergent behavior in complex systems. In: Kahlen, F.-J., Flumerfelt, S., Alves, A. (eds.) Transdisciplinary Perspectives on Complex Systems, pp. 85–113. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-38756-7_4
3. 5 Trends Drive the Gartner Hype Cycle for Emerging Technologies, 2020. Gartner. https://www.gartner.com/smarterwithgartner/5-trends-drive-the-gartner-hype-cycle-for-emerging-technologies-2020. Accessed 01 Sept 2022
4. Gartner's Top 10 Technology Trends (2017). https://www.gartner.com/smarterwithgartner/gartners-top-10-technology-trends-2017. Accessed 01 Sept 2022
5. Gartner Top 10 Strategic Technology Trends for 2018. www.gartner.com/smarterwithgartner/gartner-top-10-strategic-technology-trends-for-2018/. Accessed 29 Jan 2021
6. Gartner Top 10 Strategic Technology Trends for 2019. www.gartner.com/smarterwithgartner/gartner-top-10-strategic-technology-trends-for-2019/. Accessed 29 Mar 2021
7. Ouahabi, N., Chebak, A., Zegrari, M., Kamach, O., Berquedich, M.: A distributed digital twin architecture for shop floor monitoring based on edge-cloud collaboration. In: 2021 Third International Conference on Transportation and Smart Technologies (TST), pp. 72–78, May 2021. https://doi.org/10.1109/TST52996.2021.00019
8. Shafto, M., et al.: Modeling, simulation, information technology & processing roadmap. Natl. Aeronaut. Space Adm. 32(2012), 1–38 (2012)
9. Digital Twin: Generalization, characterization and implementation. Decis. Support Syst. 145, 113524 (2021). https://doi.org/10.1016/j.dss.2021.113524
10. Brenner, B., Hummel, V.: Digital twin as enabler for an innovative digital shopfloor management system in the ESB logistics learning factory at Reutlingen - University. Procedia Manufact. 9, 198–205 (2017). https://doi.org/10.1016/j.promfg.2017.04.039
11. What is a Digital Twin? https://www.ge.com/digital/blog/what-digital-twin. Accessed 29 Sept 2022
12. Madni, A.M., Madni, C.C., Lucero, S.D.: Leveraging digital twin technology in model-based systems engineering. Systems 7(1), 1 (2019). https://doi.org/10.3390/systems7010007
13. Wang, P., Yang, M., Peng, Y., Zhu, J., Ju, R., Yin, Q.: Sensor control in anti-submarine warfare—a digital twin and random finite sets based approach. Entropy 21(8), 8 (2019). https://doi.org/10.3390/e21080767
14. Tao, F., Qi, Q., Wang, L., Nee, A.Y.C.: Digital twins and cyber-physical systems toward smart manufacturing and industry 4.0: correlation and comparison. Engineering 5(4), 653–661 (2019). https://doi.org/10.1016/j.eng.2019.01.014
15. Wang, J., Li, Y., Gao, R.X., Zhang, F.: Hybrid physics-based and data-driven models for smart manufacturing: modelling, simulation, and explainability. J. Manuf. Syst. 63, 381–391 (2022). https://doi.org/10.1016/j.jmsy.2022.04.004
16. Microservices vs SOA: The Differences Explained, Talend - A Leader in Data Integration & Data Integrity. https://www.talend.com/resources/microservices-vs-soa/. Accessed 17 Sept 2022
17. Tao, F., Zhang, M., Cheng, J., Qi, Q.: Digital twin workshop: a new paradigm for future workshop. Comput. Integr. Manuf. Syst. 23(1), 1–9 (2017)
18. Song, E.Y., Burns, M., Pandey, A., Roth, T.: IEEE 1451 smart sensor digital twin federation for IoT/CPS research. In: 2019 IEEE Sensors Applications Symposium (SAS), pp. 1–6 (2019). https://doi.org/10.1109/SAS.2019.8706111
19. 2888.1, IEEE 2888. https://sagroups.ieee.org/2888/ieee-2888-1/. Accessed 22 Sept 2022
20. Digital twin driven prognostics and health management for complex equipment. CIRP Ann. 67(1), 169–172 (2018). https://doi.org/10.1016/j.cirp.2018.04.055
21. ISO 23247-1: Automation Systems and Integration - Digital Twin Framework For Manufacturing - Part 1: Overview And General Principles. https://global.ihs.com/doc_detail.cfm?&document_name=ISO%2023247%2D1&item_s_key=00822357&item_key_date=991231. Accessed 30 Sept 2022
22. IEC 62264-3:2016, ISO. https://www.iso.org/cms/render/live/en/sites/isoorg/contents/data/standard/06/74/67480.html. Accessed 30 Sept 2022
23. EN IEC 62714-1:2018 - Engineering data exchange format for use in industrial automation systems. https://standards.iteh.ai/catalog/standards/clc/e797da52-af4d-4d8e-bc65-89586972108f/en-iec-62714-1-2018. Accessed 30 Sept 2022
24. Cognitive Twins for Supporting Decision-Makings of Internet of Things Systems | SpringerLink. https://doi.org/10.1007/978-3-030-46212-3_7. Accessed 25 Sept 2022
25. Hosamo, H.H., Svennevig, P.R., Svidt, K., Han, D., Nielsen, H.K.: A digital twin predictive maintenance framework of air handling units based on automatic fault detection and diagnostics. Energy Build. 261, 111988 (2022). https://doi.org/10.1016/j.enbuild.2022.111988
26. Liu, C., Jiang, P., Jiang, W.: Web-based digital twin modeling and remote control of cyber-physical production systems. Robot. Comput.-Integr. Manufact. 64, 101956 (2020). https://doi.org/10.1016/j.rcim.2020.101956
27. Zinnikus, I., et al.: Integrated semantic fault analysis and worker support for cyber-physical production systems. In: 2017 IEEE 19th Conference on Business Informatics (CBI), vol. 01, pp. 207–216 (2017). https://doi.org/10.1109/CBI.2017.54
28. Grebenyuk, G.G., Kalyanov, G.N., Kovalyov, S.P., Krygin, A.A., Lukinova, O.V., Nikishov, S.M.: Technological infrastructure management models and methods based on digital twins. In: 2021 14th International Conference Management of Large-Scale System Development (MLSD), pp. 1–5 (2021). https://doi.org/10.1109/MLSD52249.2021.9600185
29. Towards a Digital Twin Platform for Industrie 4.0 | IEEE Conference Publication | IEEE Xplore. https://ieeexplore.ieee.org/abstract/document/9468204. Accessed 25 Sept 2022
30. Xu, Y., Sun, Y., Liu, X., Zheng, Y.: A digital-twin-assisted fault diagnosis using deep transfer learning. IEEE Access 7, 19990–19999 (2019). https://doi.org/10.1109/ACCESS.2018.2890566
31. Liu, S., Lu, Y., Zheng, P., Shen, H., Bao, J.: Adaptive reconstruction of digital twins for machining systems: a transfer learning approach. Robot. Comput. Integr. Manufact. 78, 102390 (2022). https://doi.org/10.1016/j.rcim.2022.102390
Toward the Selection of a Lightweight Authentication Technique for the Security of Smart Homes: Framework Architecture Based on a User Centric Design
Tanya Koohpayeh Araghi 1,2 (B), David Megías 1,2, and Andrea Rosales 1
1 Internet Interdisciplinary Institute (IN3), Universitat Oberta de Catalunya, Barcelona, Spain
{tkoohpayeharaghi,dmegias,arosales}@uoc.edu
2 Center for Cybersecurity Research of Catalonia (CYBERCAT), Barcelona, Spain
Abstract. The Internet of Things (IoT) is known today as a substantial technology due to the variety of its applications, which results in a huge circulation of users’ private information. Thus, it poses vital security concerns for the users of IoT products. Consequently, preserving data integrity is a necessity in IoT applications. In this paper, we explore and analyze existing techniques that counter data integrity and authentication threats. The aim is to select the most lightweight and secure techniques to secure data transmission in a proposed framework for smart homes and, more specifically, to preserve the integrity of smart meter data. Our focus in this paper is on designing the different phases and the architecture of the proposed framework to preserve data integrity in these appliances. Moreover, the proposed framework follows a user-centric design that takes into account people with different cultural backgrounds from two continents, Asia and Europe (Iran and Spain), in order to alleviate user resistance to accepting this technology and to design a system compatible with the users’ cultural diversity. To the best of our knowledge, little effort has been made in the field of secure data transmission in smart homes that considers users’ security concerns through a multicultural approach. We believe that this research can be motivating and encouraging for researchers and developers of IoT smart home appliances.
Keywords: Digital Watermarking · Smart Home · Authentication · Data Integrity · User Experience Study · IoT Security
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): SAI 2023, LNNS 711, pp. 651–667, 2023. https://doi.org/10.1007/978-3-031-37717-4_42
1 Introduction
The Internet of Things (IoT) is defined as a collection of smart electronic devices and communication networks [1]. IoT networks serve various application domains, including environmental monitoring, healthcare, smart cities, military usages, and intelligent transportation systems [2, 3]. One of the most remarkable applications of the IoT is the smart grid, which has become an extremely attractive field among researchers due to the possibility of a
T. K. Araghi et al.
real-time system management and easy access. Nevertheless, cyber-attacks remain the main challenge to date. False data injection attacks are examples of these threats in smart grids [4]. For instance, meter readings can be compromised by false data injection to cause unnoticed errors in state variables and state estimates [5], which can affect the assessment of power control or lead to decisions that disrupt power transmission and distribution [1]. Generally, a data integrity threat is defined as any tampering that changes the original data, such as illegitimate data insertion, deletion or manipulation. Much recent research has sought to counter data injection attacks and data integrity threats in the IoT. Securing IoT devices against data integrity attacks requires accounting for their limited computational power and their need for low-latency operation [6]. The existing techniques have weaknesses: some are complex [7], computationally expensive, or need extra storage [8], and some are application- or context-dependent [9, 10]. Other techniques satisfy the simplicity requirements imposed by battery constraints but are not secure [11–18]. Although those techniques can solve data integrity problems to some extent, they do not remain secure while meeting the simplicity requirements imposed by battery constraints [19]. Therefore, research effort should be devoted to designing a new low-complexity yet secure data integrity scheme that can detect and filter out false data injection attacks, which is still an open issue [20]. Furthermore, acceptance of the designed security frameworks by end users is another important issue: without it, the final product will not be successful enough for developers to demonstrate a strong impact, either scientifically or commercially. 
To that end, our first goal in this research is to design a lightweight authentication framework that respects the limitations of IoT devices. Our second goal is to address end users' security and privacy concerns through user-centric research that accounts for cultural diversity and requirements in two countries, Iran and Spain, representing Asia and Europe respectively. In this paper we propose a conceptual framework to address the above-mentioned problems. Furthermore, we provide some answers to the question of how users' privacy concerns and vulnerability perception affect resistance to new information technology services from a multicultural perspective, and how that resistance could be counterbalanced. A foremost strength of this project is that it combines these two research activities, which offers more generalizable results than pursuing them separately and leads to usable outcomes and an integrated, user-centric prototype that fulfills strategic lines of research on the interaction between technology and the human and social sciences.
The rest of the paper is organized as follows. Section 2 presents preliminary definitions, including the IoT, its architecture, and the smart home. Section 3 reviews state-of-the-art techniques for data integrity preservation. Section 4 gives background on the techniques used in the proposed framework. Section 5 describes the project design, the envisioned framework, and the research phases. Section 6 describes the expected impacts, Sect. 7 compares the framework with related work, and Sect. 8 presents conclusions and future work.
Toward the Selection of a Lightweight Authentication Technique
2 Preliminaries
This section explains some preliminary concepts: the Internet of Things and its architecture, smart homes, and smart meters.
2.1 Internet of Things (IoT)
The Internet of Things (IoT) is a paradigm in which everyday objects are interconnected and communicate with each other over the Internet [21]. Since its introduction, the IoT has had repercussions all over the world: considerable human and material resources have been invested in supporting research, and remarkable achievements have been obtained [22]. Security and privacy of IoT devices are the most notable challenges [23, 24]. The basic characteristics of the IoT are comprehensive perception, reliable transmission, and intelligent processing of information, and the key is to realize information interaction between humans and things or between things [5, 25]. IoT devices have limited power and computing capabilities [26] and, since their communications are typically wireless, they are often prone to various cyber-attacks. Furthermore, the number of devices exposed to the public network is steadily increasing, and the devices interact directly with the physical world to gather data. All this makes these devices attractive targets for malicious users. Hence, it is essential to assure data authenticity and to ensure that each device is operating in its expected state and is not affected by malware. As IoT devices are built on various technologies, such as power management and sensors, their security requirements vary from one application to another [27].
2.2 IoT Architecture, Smart Homes and Smart Meters
The architecture of the IoT is defined as three layers: the perception, network, and application layers [19]. These layers are presented for a typical smart home in Fig. 1. The perception layer includes sensors, webcams, and smartphones, whose task is to perceive and collect information about objects in an environment.
The network layer is responsible for transmitting the sensed data to the application layer over the Internet, satellite networks, or wireless networks such as 5G. The application layer supports platforms such as distributed parallel computing, data mining, and cloud computing. In both the application and network layers, the security of the sensed data is largely ensured by applying existing technology. In the perception layer, however, securing IoT signals is highly challenging, since the disclosure of perception data can lead to the disclosure of critical information across the whole network, with immeasurable consequences [22]. Our research focuses on the security and authenticity of users' data in smart home smart meters in the perception layer.
Fig. 1. Three Layers of IoT Architecture [28]
One of the most attractive fields for IoT systems is homes, offices, and, in general, the environments in which people spend most of their time. Smart homes are the residential extension of building automation and involve the control and automation of all their embedded technologies. A smart home is a residence whose appliances, lighting, heating, air conditioning, TVs, computers, entertainment systems, large home appliances such as washers/dryers and refrigerators/freezers, and security and camera systems can communicate with each other and can be controlled remotely by a time schedule, a phone or mobile device, or the Internet. These systems consist of switches and sensors connected to a central hub that the resident controls through a wall-mounted terminal or a mobile unit connected to Internet cloud services [29].
Smart meters are regarded as one of the key elements of a smart grid. A smart meter can facilitate data communication between the appliances in the smart home and the electricity grid [30]. Compared with analogue metering systems, these devices offer a range of advantages for both utilities and consumers, such as eliminating the cost of reading meters, providing abundant meter readings for time-based pricing, remote load management, monetary savings for customers, and more efficient responses to power outages [31, 32]. However, the deployment of smart meters and related digital technologies raises various societal concerns [33, 34]. While smart home IoT services are expected to increase the efficiency and comfort of users' daily lives, more communication capabilities also bring more opportunities for attack. For instance, an intruder with access to energy consumption data can infer at what times of day the owner is away from home. Additionally, many devices could be used by attackers as zombies [35, 36]. Despite the growing development of technology, particularly information and communication technologies (ICT), and its integration into users' private and professional lives, whether users accept or reject it remains an open question [37].
3 Literature Review
Sensor nodes in IoT applications are commonly installed in potentially unreliable and hostile environments, where they are threatened by various kinds of attacks, including data modification, packet replay, packet forgery, and packet drop attacks [38, 39]. These threats arise mainly from the ease of joining the network [40] in the perception layer, the resource-constrained nature of the sensors, and their low computational and storage capabilities [38]. In order to select the most suitable and lightweight technique for data
integrity preservation in the proposed framework, we analyzed the existing research in this field. The countermeasures fall into five major groups [12, 20, 41–43]: 1) software attestation, 2) anomaly detection, 3) trust management, 4) signature-based techniques, and 5) digital watermarking techniques. In software attestation techniques, a sensor node receives a challenge request that checks the integrity of the node's software; integrity is verified if the node returns a correct response to the challenge. The main weakness of such techniques is that they consider only software-based vulnerabilities, so they cannot secure a network against a physical attacker [41, 44–46]. The advantage of anomaly detection techniques is that they can easily be deployed with any application. However, they require a precise definition of the boundary of malicious behavior [10, 47, 48]. Trust management techniques are usually based on the location of the sensor nodes and can guarantee data integrity by using keys tied to sensor locations. Different types of keys must be generated, which requires key management, and usage is limited because the sensors need static locations [49–53]. Signature-based techniques are well suited to data integrity, which has made them the focus of a large body of research [42, 43, 45, 54]. However, these techniques impose additional computational costs for data processing. Another technique for ensuring data integrity is digital watermarking.
Regardless of the characteristics of the information sequences that IoT devices produce, digital watermarks can authenticate the source of data and detect and locate tampering. The hidden information travels embedded in the main data, so no additional field is needed to communicate the authentication code [12, 20, 55]. However, no specific watermarking solution thwarts all of the mentioned attacks in heterogeneous IoT systems [20]. Table 1 summarizes the pros and cons of these techniques. Their main disadvantages concern complexity, cost, and extra storage needs; on the other hand, techniques kept simple to meet battery constraints tend not to be secure. As the table shows, watermarking offers more advantages and is a good candidate for the proposed framework, because it fulfills the specifications required to design a lightweight authentication system for data integrity with almost no overhead. Consequently, the technique chosen for our proposed framework is digital watermarking. With digital watermarking, not only is the integrity of the data collected by sensors preserved, but sensors that maliciously try to join the trusted smart home appliances can also be discovered. Additionally, it may be possible to locate the tampered data. Digital watermarking technology has been used extensively in multimedia security, and its use is also becoming pervasive in IoT applications. Since the data values carried by sensors are relatively close to each other, the same techniques used for multimedia data are applicable [56]. Moreover, digital watermarking techniques do not impose additional
cost and communication overhead, while they are much lighter than other techniques. Thus, they are suitable for sensors with constrained resources.
Table 1. State-of-the-Art Techniques
Technique | Advantages | Disadvantages
Software attestation [41, 44–46] | • Consider vulnerabilities & provide controls • Check for threats | • Mainly consider software-based attacks
Anomaly detection [10, 47, 48] | • Can be easily deployed on top of any application • Detecting any deviation from normal profile • Low burden imposed on bandwidth | • Need to accurately define the boundary for expected behavior • High false positive alarm rate • Cannot detect unknown attacks without history
Trust management [49–52] | • Ensure data integrity and authenticity | • Relying on a static, configured location parameter • Limited usage to a specific number of applications
Signature based [42, 43, 45, 54] | • Easy to implement | • Additional cost for data processing and communication • Low speed
Node monitoring [57] | • High accuracy | • High computational load • Impose burden on bandwidth
Watermarking [12, 20, 55] | • Authenticating of source of data • Identification and localization of distortions • Transfer of service information embedded in the information • Absence of an additional field to authenticate code transmission | • No specific solution for preventing attacks
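The communication cost that Table 1 attributes to signature-based techniques can be made concrete with a small sketch. The code below is not taken from any of the cited schemes: it uses an HMAC-SHA256 tag as a stand-in for the signature-based family, and the 2-byte reading format, the pre-shared key, and the function names are assumptions made for illustration. The point it shows is that the authentication tag travels in an extra field, so a 2-byte reading grows to 34 bytes on the wire.

```python
import hashlib
import hmac
import struct

TAG_LEN = 32  # length of an HMAC-SHA256 tag in bytes


def sign_reading(key: bytes, reading: int) -> bytes:
    """Pack a reading (assumed to fit in 16 bits) and append an HMAC tag."""
    payload = struct.pack(">H", reading)  # hypothetical 2-byte big-endian format
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return payload + tag  # the tag is the additional field


def verify_reading(key: bytes, message: bytes) -> bool:
    """Recompute the tag over the payload and compare in constant time."""
    payload, tag = message[:-TAG_LEN], message[-TAG_LEN:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    return hmac.compare_digest(tag, expected)
```

Under these assumptions every message carries 32 bytes of overhead regardless of how small the reading is. A watermark, by contrast, is carried inside the reading itself, which is the property that motivates the choice made in this section.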
4 Background
This section briefly explains the techniques that will be used throughout the project and in our proposed framework.
4.1 Digital Watermarking
Digital watermarking is a technique for protecting digital data from illegal access and illegitimate modification. It embeds secret signals, called watermarks, in digital data, called cover data, for the purposes of copyright protection and authentication [58, 59]. The bits carrying the watermark are spread over the whole cover so that they cannot be detected without authorized access. The embedding technique must not perceptibly change the original information, and authorized parties must execute an extraction algorithm to recover the watermark data [60]. To conceal data that identifies the owner of the original content, a number of characteristics of the cover are altered, and both the cover and the watermark data pass through the steps of the watermarking technique until the watermarked data is obtained [61]. Digital watermarks are divided into three types: robust, fragile, and semi-fragile [12]. Research on wireless sensor networks (WSNs) has mostly used the first two types, while semi-fragile watermarking has not been addressed in WSNs [20].
4.2 User Resistance
User resistance is an important aspect of the new technology adoption process. At the core of resistance theory is the idea that a new technology must endure a process of resistance before it is adopted by the user [62]. From this perspective, user resistance theory provides a basis for understanding why home IoT technology has failed to spread: the failure was essentially caused by an insufficient understanding of consumer needs, privacy and data protection issues, and high costs [63]. By exploiting technological vulnerabilities, unauthorized people could gain access to the system, potentially stealing personal information, violating privacy, or blackmailing users by obtaining control of the smart home IoT [64].
Leaked personal information may also be used for other crimes, such as intrusions via identity theft or individual profiling [65]. User resistance behavior due to privacy challenges takes various forms: it may be passive, with users disseminating inaccurate information and avoiding use, or aggressive, with users voicing complaints or publicizing problems [66]. Thus, understanding the level of user resistance due to security challenges is integral to smart home IoT companies' response to that resistance and to the development of improved technologies [36]. Gaps in current techniques, which cannot prevent loss of data integrity under malicious sensor node attacks, make users reluctant to accept this technology. We therefore include a user experience approach in the proposed framework, so that the lightweight, secure watermarking scheme for detecting and preventing false data injection attacks is designed according to our user-centric study.
5 Project Design, Envisioned Framework and Research Phases
This project is divided into two focal areas:
1. Design and implementation of a watermarking scheme that preserves data integrity in the perception layer and increases the security of the smart home by recognizing unauthenticated malicious sensors and preventing them from joining the smart home's appliances (smart meters). Appending extra bits would require memory beyond that needed for the main data and would place more computational load on the sensors when evaluating whether data has been tampered with. Instead, we hide recognition features (resulting from objective IV) imperceptibly in the main data, in order to save memory and avoid extra computational load. The aim, therefore, is to guarantee data integrity and to improve efficiency in memory, power, and energy consumption through a secure, lightweight watermarking process in which a unique watermark is embedded in every datum before it is sent to the aggregators. Consequently, the aggregators can verify data integrity.
Fig. 2. Conceptual Framework High Level Architecture (figure labels: Phase 1, Phase 2, Phase 3)
2. A case study that includes the cultural dimension and centers our design and implementation on real users' needs and behavior, to increase trust and, consequently, acceptability of the proposed scheme. To consider the factors related to acceptance of the system, this project will use a user-centric design approach from an intercultural perspective. The user study will be conducted in Iran and Spain, two countries that represent diverse cultural backgrounds, with participants that represent diverse
life trajectories. This will allow us to design the proposed watermarking scheme with an inclusive and global aim. Hence, this project follows an interdisciplinary approach by proposing a comprehensive user experience study that puts the user at the center of the research. This study will analyze the cultural and behavioral aspects that are essential for the proposed scheme to be accepted for use in smart homes. People from Iran and Spain will participate in the study to provide relevant data, so that we can adapt the design and implementation of our proposed framework and measure its success in the two countries. Figure 2 shows the conceptual framework for the project. As shown in the figure, the conceptual framework consists of three phases. The first two phases are not necessarily sequential and can be carried out concurrently until implementation.
5.1 Phase 1: User Experience Study
In the first phase, users' concerns and hesitations about accepting the proposed watermarking scheme for data authentication and integrity will be investigated. We consider technology acceptance according to the Unified Theory of Acceptance and Use of Technology (UTAUT) model [67]. Table 2 shows the most important elements of the UTAUT model. Factors 4 and 6 (facilitating conditions and self-efficacy) do not play a significant role in our case study. A questionnaire will therefore be designed around the remaining five factors to collect users' positive and negative comments and opinions about installing smart meters in their homes, and users will be interviewed about any security breaches they have experienced in this regard. We will pursue three objectives related to the social and interdisciplinary aspects of the project. As mentioned above, they can be carried out alongside the technical objectives in Phase 2.
In order to incorporate the users' point of view, and the cultural knowledge gathered around it, into the development of the proposed scheme, a case study will be conducted. It will inform the design, implementation, and evaluation of the acceptance of the considered security mechanisms in two cultural contexts, Iran and Spain, based on the experiences of diverse potential users. This case study aims at:
1. Providing a detailed analysis of the cultural factors related to the security of smart home appliances, that is, how users interact with smart home appliances in general and, in particular, with respect to the security, credibility, and integrity of information. Sociodemographic characteristics such as age, gender, cultural background, information consumption practices, and digital literacy will be taken into account in investigating how the proposed scheme can contribute to (1) increasing security by checking the authorization of smart meters installed in smart homes, and (2) preserving the integrity of data transmitted to the upper layers of the IoT architecture.
2. Analyzing, from a human-computer interaction point of view, the best possible ways to offer the proposed scheme to the consumer. For instance, the study will examine which type of watermark is more effective, how sensitive users are to different types of information, and when and which kinds of information must be kept secret.
Table 2. Technology Acceptance According to the UTAUT Model
No. | UTAUT model factor | Constructs
1 | Performance expectancy | • Perceived usefulness • Intrinsic motivation • Job fit • Relative advantage • Outcome expectation
2 | Effort expectancy | • Perceived ease of use • Complexity • Ease of use
3 | Social influence | • Subjective norms • Social factors • Image
4 | Facilitating conditions | • Perceived behavioral control • Facilitating conditions • Compatibility
5 | Attitude toward using technology | • Attitude toward behavior • Intrinsic motivation • Affect toward use • Affect
6 | Self-efficacy |
7 | Anxiety |
3. Providing results that are relevant to the two countries participating in this project (Iran and Spain), which will allow micro-level comparative research. For this purpose, the case study will be adapted to the cultural background of each country.
This case study is scheduled to start at the beginning of the project, and the main objectives associated with this area are:
• Identify the cultural factors that should be considered for acceptance of, and persuasion toward, our proposed watermarking scheme for authenticating users' data in smart meters in smart homes.
• Make decisions for optimizing the design and implementation of the proposed scheme according to the findings related to objectives IV and V of the technical part of the project in Phase 2.
• Evaluate the potential impact on increasing acceptability of the proposed scheme among users.
5.2 Phase 2: Watermarking Tools
The results of this phase fulfill the following three objectives related to the technical part of the project:
1. Identify the key factors that maximize the effectiveness of a watermarking approach, given the heterogeneous nature of IoT devices. Raising performance is crucial, so we identified several elements as influential factors for assessing system performance and achieving optimum values for maximum security. These elements are efficient security, time and energy overhead, watermark type, watermark embedding method, and false positive and false negative effects. The next step is to design and implement the scheme according to these findings, which leads to the next objective.
2. Propose and develop a digital watermarking scheme that can detect and prevent unauthenticated access to data in the perception layer. Because sensors have limited capacity for complex calculations, such a scheme can impose a high computational load and high power consumption on them; this will be offset by achieving the first objective (IV).
3. Test and evaluate the proposed scheme against sensor node compromise attacks. To demonstrate performance, attack scenarios will be implemented to assess the scheme's efficacy. Unwanted modifications, such as the effect of channel noise on the transmitted data and modification of the watermark features, will also be investigated to check whether the proposed scheme produces false positives.
The purposes of the design and implementation of the watermarking tools are shown in Table 3. The proposed watermarking scheme will be lightweight, so that the battery usage for watermark generation, embedding, and verification is kept as low as possible.
Table 3. Purpose of Watermarking Design
No. | Purpose of design
1 | User privacy
2 | Information leakage prevention
3 | Tamper detection
4 | Data restoration
5 | Authentication & integrity
6 | Energy saving
7 | Tamper location
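As a rough illustration of design purposes 3, 5, and 7 in Table 3 (tamper detection, authentication and integrity, and tamper location), the following sketch shows one possible fragile watermark for integer meter readings. It is an example built on assumptions, not the scheme this project will produce: the 4-bit watermark width, the use of HMAC-SHA256 to derive the mark, the pre-shared key, and all function names are hypothetical choices. The sensor overwrites the low-order bits of each reading with a keyed mark computed from the reading's position and its remaining high-order bits, and the aggregator recomputes the mark to find the positions that fail.

```python
import hashlib
import hmac

WM_BITS = 4                    # assumed number of low-order bits that carry the mark
WM_MASK = (1 << WM_BITS) - 1   # 0b1111


def _mark(key: bytes, index: int, high_bits: int) -> int:
    """Derive the keyed mark from the position and the untouched high bits.

    Assumes non-negative values that fit in 4 bytes each.
    """
    msg = index.to_bytes(4, "big") + high_bits.to_bytes(4, "big")
    return hmac.new(key, msg, hashlib.sha256).digest()[0] & WM_MASK


def embed(key: bytes, readings: list[int]) -> list[int]:
    """Sensor side: replace the low-order bits of every reading with its mark."""
    marked = []
    for i, reading in enumerate(readings):
        high = reading >> WM_BITS
        marked.append((high << WM_BITS) | _mark(key, i, high))
    return marked


def locate_tampering(key: bytes, received: list[int]) -> list[int]:
    """Aggregator side: return the indices whose embedded mark does not verify."""
    return [
        i
        for i, value in enumerate(received)
        if (value & WM_MASK) != _mark(key, i, value >> WM_BITS)
    ]
```

No extra field is transmitted, so the message length is unchanged. With the assumed 4-bit width, each reading is distorted by at most 15 units, and a reading whose high-order bits are altered by an attacker without the key escapes detection with a probability of about 1/16. Both figures follow from the chosen width and would be tuning parameters in a real design.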
5.3 Phase 3: Comparative Study
In this phase, a comparative study will be performed. The implementation results will be disclosed to the participants in the case study to show the ability of the proposed watermarking scheme to detect and prevent attacks such as data tampering (deletion, addition, and modification), packet replay, selective forwarding, packet forgery, and eavesdropping, in comparison with systems that lack these abilities. We will then analyze how far the participants are persuaded to use smart meters. The percentage of user acceptance after implementation of the proposed scheme will be calculated and analyzed to see how effective the scheme is.
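The acceptance percentage described above reduces to simple arithmetic. The sketch below assumes that each participant's answer is recorded as a yes/no value before and after the results are disclosed; the function names are hypothetical, and any values passed to them in an example would be illustrative rather than data from the study.

```python
def acceptance_rate(responses: list[bool]) -> float:
    """Percentage of participants who say they would accept smart meters."""
    if not responses:
        raise ValueError("at least one response is required")
    return 100.0 * sum(responses) / len(responses)


def acceptance_change(before: list[bool], after: list[bool]) -> float:
    """Change in acceptance, in percentage points, after the results are disclosed."""
    return acceptance_rate(after) - acceptance_rate(before)
```

Computing the rate separately for the participants from each country would give the per-country figures needed for the comparison this phase describes.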
6 Expected Impact
The implementation of this project is expected to have a threefold impact, described as follows.
6.1 Scientific Point of View
From a scientific point of view, the outcome of this project will be a lightweight authentication scheme for data integrity in smart meters in smart homes, capable of tamper detection and localization using watermarking and encryption of the user's data. This will result in several publications in the fields of information security, watermarking, and data hiding, as well as publications presenting knowledge about the social and cultural aspects that affect the acceptance of IoT appliances.
6.2 From Users and Developers Point of View
Since this project is based on the cultural factors of two different countries, one Asian and one European, it can inspire researchers and IoT smart home developers to extend the research to other countries. Developers can accordingly adapt their products to the preferences of people in different countries, or work toward a unified standard for the security of their products. Some previous papers have studied the acceptance of IoT appliances [68–72], but none of them examined multicultural aspects: they investigated the cultural aspects of technology acceptance based on interviews in a single geographical area or country, whereas this research is based on people from two different backgrounds on two continents, Europe and Asia.
7 Comparison and Discussion
Although many articles address data integrity in IoT smart homes, and smart meters in particular, to the best of our knowledge our proposed framework is one of the few that considers both scientific and cultural factors in order to encourage technology acceptance by end users. Table 4 shows a brief comparison between the features proposed by other authors and our proposed watermarking framework.
Table 4. Comparison with Other Works
Articles | Data authentication and integrity | Tamper detection and location | Consideration of cultural factors
Jain et al. [73] | Yes | No | No
Nakamura and Nishi [74] | Yes | No | No
Proposed framework | Yes | Yes | Yes
8 Conclusion and Future Work
With the fast growth of IoT applications in smart homes, smart healthcare, and other domains, the need for security solutions that ensure data secrecy and integrity and protect users' data from manipulation is rising rapidly. At the same time, end users of these applications need to be able to trust the technology. To fulfill these requirements, in this paper we presented a framework for detecting and preventing tampering with users' data in smart meters for smart homes by means of digital watermarking techniques. This research is part of an ongoing project named LAWS, which started in January 2022. The ultimate goal of LAWS is to develop a watermarking tool that protects data from tampering, so that it can not only detect and prevent malicious manipulation of data but also locate the part of the data that has been tampered with. Along with the technical part, a case study based on user experience, covering diversity in users' geographical location, age, gender, and other factors, is planned in order to bring users' input into the design and implementation and so increase acceptance of the proposed scheme. This paper represents the initial steps toward, first, the selection of a lightweight authentication technique for secure data transfer in smart home infrastructure and, second, the design of a framework for achieving the project's goals. In summary, we focus on the specification of the framework's architecture and on exploring different security techniques to find a lightweight model for our security design. Future work will create the major modules of the platform, including an implementation of the most appropriate digital watermarking technique. Finally, we aim to observe its effect on persuading people to use IoT smart meters, according to the usability of the proposed framework in detecting and preventing malicious attacks.
Acknowledgements. The authors acknowledge the funding obtained by the Detection of fake newS on SocIal MedIa pLAtfoRms project from the EIG CONCERT-Japan with grant PCI2020–120689-2 (Government of Spain), and by the RTI2018-095094-B-C22 "CONSENT" and PID2021-125962OB-C31 "SE-CURING" projects granted by the Spanish Ministry of Science and Innovation.
References
1. Rashed, M., Kamruzzaman, J., Gondal, I., Islam, S.: Vulnerability assessment framework for a smart grid. In: 2022 4th Global Power, Energy and Communication Conference (GPECOM), pp. 449–454 (2022)
2. Kumar, J.S., Patel, D.R.: A survey on internet of things: security and privacy issues. Int. J. Comput. Appl. 90, 20–26 (2014)
3. Campolo, C., Genovese, G., Iera, A., Molinaro, A.: Virtualizing AI at the distributed edge towards intelligent IoT applications. J. Sens. Actuator Netw. 10, 13 (2021)
4. Chen, S., Xu, H., Liu, D., Hu, B., Wang, H.: A vision of IoT: applications, challenges, and opportunities with China perspective. IEEE Internet Things J. 1, 349–359 (2014)
5. Miorandi, D., Sicari, S., De Pellegrini, F., Chlamtac, I.: Internet of things: vision, applications and research challenges. Ad Hoc Netw. 10, 1497–1516 (2012)
6. Ferdowsi, A., Saad, W.: Deep learning-based dynamic watermarking for secure signal authentication in the Internet of Things. In: 2018 IEEE International Conference on Communications (ICC), pp. 1–6 (2018)
7. Bartariya, S., Rastogi, A.: Security in wireless sensor networks: attacks and solutions. Environment 5, 214–220 (2016)
8. Shi, X., Xiao, D.: A reversible watermarking authentication scheme for wireless sensor networks. Inf. Sci. 240, 173–183 (2013)
9. Ren, K., Lou, W., Zhang, Y.: LEDS: providing location-aware end-to-end data security in wireless sensor networks. IEEE Trans. Mob. Comput. 7, 585–598 (2008)
10. Illiano, V.P., Lupu, E.C.: Detecting malicious data injections in wireless sensor networks: a survey. ACM Comput. Surv. (CSUR) 48, 1–33 (2015)
11. Hameed, K., Khan, A., Ahmed, M., Reddy, A.G., Rathore, M.M.: Towards a formally verified zero watermarking scheme for data integrity in the Internet of Things based-wireless sensor networks. Futur. Gener. Comput. Syst. 82, 274–289 (2018)
12. Lalem, F., et al.: Data authenticity and integrity in wireless sensor networks based on a watermarking approach. In: The Twenty-Ninth International Flairs Conference (2016)
13. Tiwari, A., Chakraborty, S., Mishra, M.K.: Secure data aggregation using irreversible watermarking in WSNs (2013)
14. Boubiche, D.E., Boubiche, S., Bilami, A.: A cross-layer watermarking-based mechanism for data aggregation integrity in heterogeneous WSNs. IEEE Commun. Lett. 19, 823–826 (2015)
15. Sun, X., Su, J., Wang, B., Liu, Q.: Digital watermarking method for data integrity protection in wireless sensor networks. Int. J. Secur. Appl. 7, 407–416 (2013)
16. Rouissi, N., Gharsellaoui, H.: Improved hybrid LEACH based approach for preserving secured integrity in wireless sensor networks. Procedia Comput. Sci. 112, 1429–1438 (2017)
17. Guan, T., Chen, Y.: A node clone attack detection scheme based on digital watermark in WSNs. In: 2016 First IEEE International Conference on Computer Communication and the Internet (ICCCI), pp. 257–260 (2016)
18. Ding, Q., Wang, B., Sun, X., Wang, J., Shen, J.: A reversible watermarking scheme based on difference expansion for wireless sensor networks. Int. J. Grid Distrib. Comput. 8, 143–154 (2015)
19. Tewari, A., Gupta, B.: Security, privacy and trust of different layers in Internet-of-Things (IoTs) framework. Futur. Gener. Comput. Syst. 108, 909–920 (2020)
20. Alromih, A., Al-Rodhaan, M., Tian, Y.: A randomized watermarking technique for detecting malicious data injection attacks in heterogeneous wireless sensor networks for Internet of Things applications. Sensors 18, 4346 (2018)
21. Hameed, S., et al.: A scalable key and trust management solution for IoT sensors using SDN and blockchain technology. IEEE Sensors J. 21, 8716–8733 (2021)
Toward the Selection of a Lightweight Authentication Technique
665
22. Zhang, G., Kou, L., Zhang, L., Liu, C., Da, Q., Sun, J.: A new digital watermarking method for data integrity protection in the perception layer of IoT. Secur. Commun. Netw. 2017, 1–12 (2017) 23. Barhamgi, M., Perera, C., Yolum, P.: Introduction to the special section on human-centered security, privacy, and trust in the Internet of Things, ACM New York, NY, USA (2021) 24. Huber, B., Kandah, F.: Behavioral model based trust management design for IoT at scale. In: 2020 International Conferences on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData) and IEEE Congress on Cybermatics (Cybermatics), pp. 9–17 (2020) 25. Dash, S., Prusty, D.: Domain-specific IoT applications. In: Pani, S.K., Pandey, M. (eds.) Internet of Things: Enabling Technologies, Security and Social Implications, pp. 27–36. Springer Singapore, Singapore (2021). https://doi.org/10.1007/978-981-15-8621-7_3 26. Spanos, G., et al.: A lightweight cyber-security defense framework for smart homes. In: 2020 International Conference on Innovations in Intelligent Systems and Applications (INISTA), pp. 1–7 (2020) 27. Kavianpour, S., Shanmugam, B., Azam, S., Zamani, M., Narayana Samy, G., De Boer, F.: A systematic literature review of authentication in internet of things for heterogeneous devices. J. Comput. Netw. Commun. 2019, 1–14 (2019) 28. Mao, J., Lin, Q., Bian, J.: Application of learning algorithms in smart home IoT system security. Math. Found. Comput. 1, 63 (2018) 29. Domb, M.: Smart home systems based on Internet of Things. In: IoT and Smart Home Automation. IntechOpen (2019) 30. Hess, D.J.: Smart meters and public acceptance: comparative analysis and governance implications. Health Risk Soc. 16, 243–258 (2014) 31. Stephens, J.C., Wilson, E.J., Peterson, T.R.: Smart grid (R) evolution. Cambridge University Press (2015) 32. 
Babangida, L., Perumal, T., Mustapha, N., Yaakob, R.: Internet of Things (IoT) based activity recognition strategies in smart homes: a review. IEEE Sens. J. 22, 8327–8336 (2022) 33. Lee, D., Hess, D.J.: Data privacy and residential smart meters: comparative analysis and harmonization potential. Utilities Policy 70, 101188 (2021) 34. Miglani, A., Kumar, N., Chamola, V., Zeadally, S.: Blockchain for Internet of energy management: review, solutions, and challenges. Comput. Commun. 151, 395–418 (2020) 35. Baldini, G., et al.: Internet of Things: IoT governance, privacy and security issues. IERCEuropean Research Cluster on the Internet of Things, Position Paper Activity Chain, vol. 5 (2015) 36. Lee, H.: Home IoT resistance: extended privacy and vulnerability perspective. Telematics Inform. 49, 101377 (2020) 37. Maranguni´c, N., Grani´c, A.: Technology acceptance model: a literature review from 1986 to 2013. Univ. Access Inf. Soc. 14(1), 81–95 (2014) 38. Casado-Vara, R., Martin-del Rey, A., Affes, S., Prieto, J., Corchado, J.M.: IoT network slicing on virtual layers of homogeneous data for improved algorithm operation in smart buildings. Future Gener. Comput. Syst. 102, 965–977 (2020) 39. Haque, M.A., Kumar, K., Haque, S., Singh, N.K.: A comprehensive study of cyber security attacks, classification, and countermeasures in the Internet of Things. In: Digital Transformation and Challenges to Data Security and Privacy, p. 63 (2021) 40. Araghi, T.K., Zamani, M., Manaf, A.B.A., Abdullah, S.M., Bojnord, H.S., Araghi, S.K.: A secure model for prevention of black hole attack in wireless mobile ad hoc networks. In: 12th WSEAS International Conference on Applied Computer and Applied Computational Science, Malaysia (2013)
666
T. K. Araghi et al.
41. Asokan, N., et al.: Seda: scalable embedded device attestation. In: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 964–975 (2015) 42. Di Pietro, R., Michiardi, P., Molva, R.: Confidentiality and integrity for data aggregation in WSN using peer monitoring. Secur. Commun. Netw. 2, 181–194 (2009) 43. Cui, J., Shao, L., Zhong, H., Xu, Y., Liu, L.: Data aggregation with end-to-end confidentiality and integrity for large-scale wireless sensor networks. Peer-to-Peer Networking Appl. 11, 1022–1037 (2017) 44. Gupta, P., Sinha, A., Srivastava, P.K., Perti, A., Singh, A.K.: Security implementations in IoT using digital signature. In: Innovations in Electrical and Electronic Engineering, pp. 523–535. Springer (2021) 45. Alzubi, J.A.: Blockchain-based Lamport Merkle digital signature: authentication tool in IoT healthcare. Comput. Commun. 170, 200–208 (2021) 46. Ammar, M.: Software-based trusted computing architecture for resource-constrained Internet of Things devices (2021) 47. Yahyaoui, A., Abdellatif, T., Yangui, S., Attia, R.: READ-IoT: reliable event and anomaly detection framework for the Internet of Things. IEEE Access 9, 24168–24186 (2021) 48. Cauteruccio, F., et al.: A framework for anomaly detection and classification in multiple IoT scenarios. Futur. Gener. Comput. Syst. 114, 322–335 (2021) 49. Yang, X., Lin, J., Yu, W., Moulema, P.-M., Fu, X., Zhao, W.: A novel en-route filtering scheme against false data injection attacks in cyber-physical networked systems. IEEE Trans. Comput. 64, 4–18 (2013) 50. Yan, Z., Zhang, P., Vasilakos, A.V.: A survey on trust management for Internet of Things. J. Netw. Comput. Appl. 42, 120–134 (2014) 51. Ahmed, K.I., Tahir, M., Lau, S.L.: Trust management for IoT security: taxonomy and future research directions. In: 2020 IEEE Conference on Application, Information and Network Security (AINS), pp. 26–31 (2020) 52. 
Araghi, T.K., Zamani, M., Manaf, A.B.A., Abdullah, S.M., Lyastani, S.G., Araghi, S.K.: A survey for prevention of black hole attacks in wireless mobile AdHoc networks using trusted neighbor nodes. In: 12th WSEAS International Conference on Applied Computer and Applied Computational Science, pp. 176–191 (2013) 53. Araghi, T.K., Zamani, M., Manaf, A.A., Araghi, S.K.: An Access Control Framework in an Ad Hoc Network Infrastructure, pp. 747–754. Cham (2015) 54. Kumar, M., Verma, S., Lata, K.: Secure data aggregation in wireless sensor networks using homomorphic encryption. Int. J. Electron. 102, 690–702 (2015) 55. Bordel, B., Alcarria, R., Robles, T., Iglesias, M.S.: Data authentication and anonymization in IoT scenarios and future 5G networks using chaotic digital watermarking. IEEE Access 9, 22378–22398 (2021) 56. Li, X., Peng, J., Obaidat, M.S., Wu, F., Khan, M.K., Chen, C.: A secure three-factor user authentication protocol with forward secrecy for wireless medical sensor network systems. IEEE Syst. J. 14, 39–50 (2019) 57. Panoff, M., Dutta, R.G., Hu, Y., Yang, K., Jin, Y.: On sensor security in the era of IoT and CPS. SN Comput. Sci. 2(1), 1–14 (2021) 58. Araghi, T.K., Manaf, A.B.T.A.: Evaluation of digital image watermarking techniques. pp. 361–368. Cham (2018) 59. Araghi, T.K., Manaf, A.A., Alarood, A., Zainol, A.B.: Host feasibility investigation to improve robustness in hybrid DWT+ SVD based image watermarking schemes. Adv. Multimedia 2018, 1–9 (2018) 60. Araghi, T.K., Manaf, A., Zamani, M., Araghi, S.K.: A survey on digital image watermarking techniques in spatial and transform domains. Int. J. Adv. Image Process. Tech. 3, 6–10 (2016)
Toward the Selection of a Lightweight Authentication Technique
667
61. Araghi, T.K., Alarood, A.A., Araghi, S.K.: Analysis and evaluation of template based methods against geometric attacks: a survey. In: Saeed, F., Mohammed, F., Al-Nahari, A. (eds.) IRICT 2020. LNDECT, vol. 72, pp. 807–814. Springer, Cham (2021). https://doi.org/10.1007/9783-030-70713-2_73 62. Kim, J., Kim, S., Nam, C.: User resistance to acceptance of in-vehicle infotainment (IVI) systems. Telecommun. Policy 40, 919–930 (2016) 63. Ciesielska, M., Li, F.: The connected home: from market barriers to business model solutions. In: Conference on e-Business, e-Services and e-Society, pp. 189–199 (2011) 64. Komninos, N., Philippou, E., Pitsillides, A.: Survey in smart grid and smart home security: Issues, challenges and countermeasures. IEEE Commun. Surv. Tutorials 16, 1933–1954 (2014) 65. A. GROUP: The Internet of Things: the future of consumer adoption ACQUITY GROUP (2014) 66. Son, J.-Y., Kim, S.S.: Internet users’ information privacy-protective responses: a taxonomy and a nomological model. MIS Q. 32, 503–529 (2008) 67. Venkatesh, V., Morris, M.G., Davis, G.B., Davis, F.D.: User acceptance of information technology: toward a unified view. MIS Q. 27, 425–478 (2003) 68. Mashal, I., Alsaryrah, O., Chung, T.-Y., Yuan, F.-C.: A multi-criteria analysis for an internet of things application recommendation system. Technol. Soc. 60, 101216 (2020) 69. Cho, M., Lee, S., Lee, K.-P.: How do people adapt to use of an IoT air purifier?: From low expectation to minimal use. Int. J. Des. 13, 21–38 (2019) 70. Zheng, S., Apthorpe, N., Chetty, M., Feamster, N.: User perceptions of smart home IoT privacy. Proc. ACM Human-Comput. Interact. 2, 1–20 (2018) 71. de Boer, P.S., van Deursen, A.J., Van Rompay, T.J.: Accepting the Internet-of-Things in our homes: the role of user skills. Telematics Inform. 36, 147–156 (2019) 72. Al-Husamiyah, A., Al-Bashayreh, M.: A comprehensive acceptance model for smart home services. Int. J. Data Netw. Sci. 6, 45–58 (2022) 73. 
Jain, H., Kumar, M., Joshi, A.M.: Intelligent energy cyber physical systems (iECPS) for reliable smart grid against energy theft and false data injection. Electr. Eng. 104, 331–346 (2021). https://doi.org/10.1007/s00202-021-01380-9 74. Nakamura, Y., Nishi, H.: Digital watermarking for anonymized data with low information loss. IEEE Access 9, 130570–130585 (2021)
Evaluating the User Experience of Music Streaming Services

Roko Fumić, Mateo Rumac, and Tihomir Orehovački
Faculty of Informatics, Juraj Dobrila University of Pula, Zagrebačka 30, 52 100 Pula, Croatia
[email protected]

Abstract. Good user experience (UX) is an important signifier of users’ intention to employ music streaming services such as Spotify, YouTube Music, and Deezer. To examine an interplay among pragmatic and hedonic UX dimensions which constitute a research model for the evaluation of these services, an empirical study was carried out. The sample of study participants was composed of randomly selected regular users of music streaming services. Data was collected employing a questionnaire that was administered using Google Forms. Validity and reliability of the research model were tested using the partial least squares (PLS) structural equation modeling (SEM) method. The reported findings uncovered that, in the context of music streaming services, perceived ease of use and perceived usefulness are significant determinants of satisfaction, security and user interface aesthetics significantly affect perceived usefulness, and user interface aesthetics significantly contributes to perceived ease of use.

Keywords: Music Streaming Services · User Experience · Evaluation · PLS-SEM · Questionnaire · Empirical Study
1 Introduction

With the improvement of internet availability and web-based technologies, streaming has overtaken file sharing as the main method for digital multimedia distribution [40]. This paradigm has been essential to the rise of various multi-billion-dollar companies such as YouTube, Netflix, and Hulu, which has compelled several industries to embrace streaming as a means of product delivery. One such industry is the music industry, where physical and digital ownership has been almost entirely replaced by the streaming model [6]. With this distribution method, listeners get digital access to music content, without explicitly purchasing or owning the files, while usually under a subscription-based contract [47]. When overall usage is considered, YouTube is far ahead of all other music streaming services with 2.5 billion users [4]. On the other hand, with 150 million subscribers, Spotify is the most popular music streaming service in terms of paid users [4]. It is projected that the revenue in the music streaming segment will reach 15.22 billion USD in 2023 where most of it will be generated in the United States [41]. The findings of a study carried out by Lee et al. [20] revealed that digital music consumption positively contributes to physical album sales. Given the popularity and influence of music streaming services, we must seek to explore their particularities and improve their features. One of the ways we can do this is by evaluating their user experience (UX).

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): SAI 2023, LNNS 711, pp. 668–683, 2023. https://doi.org/10.1007/978-3-031-37717-4_43
Definitions of UX vary, as researchers often have similar but marginally different interpretations of it, with most of them fitting into one of two schools of thought. The first one considers UX to be no different from usability. Bevan [3], for example, suggests that UX can be measured as the user’s satisfaction with achieving pragmatic and hedonic goals, which is increased by improving elements of usability such as learnability, accessibility, and safety; usability evaluation is therefore necessary to better understand user needs so that overall UX can be improved. The second common interpretation sees UX as completely separate from usability [15], with its purpose being to fulfill users’ needs and desires.

Several studies have explored elements of UX with digital music streaming services. One such qualitative study was carried out by Lee and Price [19], in which participants described their feelings towards different aspects of UX during interviews. The authors found that participants exhibited very different personalities and attitudes toward music services. By analyzing their research model, Yeoh et al. [49] found that perceived usefulness and perceived ease of use positively affect users’ attitudes toward music streaming services. On the other hand, Mariano-Melo et al. [22] discovered that, in the context of music streaming acceptance in Brazil, perceived ease of use contributes to behavioral intention, while perceived usefulness and perceived enjoyment appeared to be statistically insignificant in that respect.

This paper aims to explore the interplay of five diverse facets of UX in interaction with digital music streaming services. The remainder of the paper is structured as follows. A brief literature review of current advances in the field is provided in the next section. The research model and corresponding hypotheses are proposed in the third section. A description of the methodology used to gather data is provided in the fourth section. The results of an empirical study are reported in the fifth section. Study findings and limitations are discussed in the sixth section. Conclusions are drawn in the last section.
2 Background to the Research

Quantitative studies related to music streaming in recent years have been mainly focused on understanding users’ acceptance as well as their willingness to pay for these services. Chen and Leon [5] compared two different purchase motivations for users of music streaming services: social influence and hedonic performance expectancy. They found that social influence impacts users’ attitudes towards music streaming, which in turn drives purchase intention. The same authors also revealed that the continuance intention related to paid streaming is driven by the hedonic performance expectancy of consumers, rather than consumer attitude. Wagner and Hess [46] examined whether and why users of music streaming services would be willing to pay for a premium version of a service even if a free version is available. To test their hypotheses, they used the Theory of Planned Behavior [1] framework, adapted to fit the music streaming context. Their findings show that the intention to use the free service negatively influences the intention to use the premium service, while price value, innovativeness, and preference are important determinants of attitude towards premium services. They also discovered that music streaming services implement a free trial period to persuade customers to pay for a premium version.
By analyzing survey data with an extended version of the Unified Theory of Acceptance and Use of Technology 2 (UTAUT2) [45] model, together with semi-structured interviews, Barata and Coelho [2] determined which factors influence users’ intention to recommend and pay for music streaming services. In particular, they discovered that habit, performance expectancy, and price value significantly influence the behavioral intention to use a paid service. The same authors also found that behavioral intention is a strong predictor of recommendation and that habit has an essential role in influencing the intention to use a paid music streaming service. Wulandari et al. [48] identified which elements of music streaming services influence continuance intention the most. The results of their study suggest that perceived ease of use, perceived enjoyment, and entertainment have a significant positive influence on satisfaction, which in turn, along with habit, has a significant positive effect on continuance intention, while social influence has a significant negative influence on it. Pal and Triyason [36] conducted a study to examine user attitudes toward music streaming by employing a modified version of the Technology Acceptance Model 3 [44]. They found that perceived satisfaction had a significant positive effect on perceived enjoyment, perceived ease of use had a significant positive influence on perceived usefulness, and perceived enjoyment had a significant effect on behavioral intention. Neither perceived usefulness nor perceived ease of use had a significant influence on behavioral intention, which might suggest that music streaming services are used purely for enjoyment rather than utility.

Unlike the previously mentioned research, this study aims to develop a research framework for UX evaluation of music streaming services composed of factors designed for measuring both pragmatic and hedonic aspects of UX.
3 Research Model and Hypotheses

Based on the findings of our current work dealing with the evaluation of a version control system [31], intelligent personal assistants [32], IoT ecosystem [33], cloud computing applications [27], web mashups [30], educational tool [34], educational artifacts [35], mobile applications [29], and various social web applications [23–26, 28], we identified five constructs (satisfaction, user interface aesthetics, security, perceived ease of use, and perceived usefulness) that constitute the research model for evaluating the user experience of music streaming services.

User interface aesthetics (UIA) denotes the extent to which a user interface of a music streaming service is visually appealing [8, 25]. Applications with low visual complexity are more attractive to users [43]. According to the findings of an empirical study on evaluating UX in the context of social Web applications [24], user interface aesthetics is a significant determinant of hedonic UX facets such as satisfaction and pleasure. Current studies suggest that user interface aesthetics affects both perceived ease of use and perceived usefulness [8, 21]. In that respect, the following hypotheses were proposed:

H1. User interface aesthetics positively influences perceived usefulness in the context of music streaming services.
H2. User interface aesthetics positively influences perceived ease of use in the context of music streaming services.

Security (SEC) represents the degree to which a music streaming service protects users’ data and information from unauthorized access and use [28]. This UX dimension can be further decomposed into integrity (the extent to which users’ information and data are trustworthy and accurate) and confidentiality (the degree to which data and information can be accessed only by authorized users) [27]. Findings of a recent study on e-learning adoption [10] suggest that security has a positive impact on perceived usefulness. Therefore, the following hypothesis was proposed:

H3. Security positively influences perceived usefulness in the context of music streaming services.

Perceived ease of use (PEOU) denotes the extent to which a user believes that interaction with a music streaming service is free from effort [9, 23]. While findings of empirical studies carried out by Wulandari et al. [48] and Orehovački et al. [29] indicate that perceived ease of use has a positive impact on satisfaction, results of a study conducted by Pal and Triyason [36] did not confirm a significant influence in that respect. Therefore, the following hypothesis was proposed:

H4. Perceived ease of use positively influences satisfaction in the context of music streaming services.

Perceived usefulness (PU) represents the degree to which a user believes that interaction with a music streaming service enhances his/her job performance [9, 23]. Because music streaming services are more hedonic systems, rather than utilitarian ones, we argue that users will be more satisfied with a service if they find it advantageous. According to findings reported in [34], usefulness is a significant determinant of users’ behavioral intentions. While most current studies (e.g. [26, 29, 32]) have demonstrated that perceived usefulness contributes to satisfaction, in some of them (e.g. [31]) the opposite was found. In that respect, the following hypothesis was proposed:

H5. Perceived usefulness positively influences satisfaction in the context of music streaming services.

Satisfaction (SAT) refers to the degree to which interaction with a music streaming service meets users’ expectations [23]. Although it is considered an important dimension of quality in use [18], satisfaction is also commonly examined as a relevant aspect of software quality [30], user experience [33], usability [35], and adoption [31]. According to findings reported in [29], security, usefulness, and ease of use are significant predictors of user satisfaction. When users are satisfied with a certain piece of software, they will continue to use it and recommend it to others [26, 32]. It is therefore of particular importance that music streaming services meet the requirements of this UX dimension.

The research framework in the form of a conceptual model which illustrates the aforementioned hypotheses is presented in Fig. 1.
Fig. 1. Research Framework
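For readers who prefer a machine-readable view of the conceptual model, the five hypotheses can be written down as a small directed graph. The sketch below is illustrative only: the construct abbreviations and the direction of each path come from H1 to H5, while the dictionary layout and the helper name `predictors_by_outcome` are assumptions made for this example.

```python
# Hypothesised structural paths H1-H5 as (predictor, outcome) pairs.
# Abbreviations follow the paper; the data structure itself is hypothetical.
HYPOTHESES = {
    "H1": ("UIA", "PU"),
    "H2": ("UIA", "PEOU"),
    "H3": ("SEC", "PU"),
    "H4": ("PEOU", "SAT"),
    "H5": ("PU", "SAT"),
}


def predictors_by_outcome(paths):
    """Group predictors under the endogenous construct they point to."""
    grouped = {}
    for predictor, outcome in paths.values():
        grouped.setdefault(outcome, []).append(predictor)
    return {outcome: sorted(preds) for outcome, preds in grouped.items()}


endogenous = predictors_by_outcome(HYPOTHESES)
# Exogenous constructs are predictors that never appear as an outcome.
exogenous = sorted({p for p, _ in HYPOTHESES.values()} - set(endogenous))

print(endogenous)
print(exogenous)
```

Grouping the paths by outcome shows which sets of predictors enter the same partial regression, which is the grouping that matters later when collinearity and explained variance are examined.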
4 Methodology

To evaluate the proposed research model, an empirical study was conducted. Data were collected with a questionnaire over the course of a week, during which 53 valid responses were gathered. The survey was implemented using the Google Forms service and was shared using various social media and messaging platforms, such as Facebook, WhatsApp, and Slack. The questionnaire consisted of two sections. The first section was composed of five items related to participants’ demographics, such as age, gender, and frequency of using music streaming services. The second section contained twenty-five items designed to measure various aspects of the five facets of user experience proposed in the research model, as shown in Table 1. Answers to the questionnaire items were given on a five-point Likert scale (1 – strongly disagree, 5 – strongly agree). Some questionnaire items related to perceived usefulness and perceived ease of use were adopted from [9], but the majority of the others were worded by the authors of this paper to be more relevant to the context of music streaming services.

The assessment of the research model was done using the partial least squares structural equation modeling (PLS-SEM) method [12] in the SmartPLS 4.0.8.7 software tool [37]. Our decision to use PLS-SEM instead of its covariance-based (CB-SEM) alternative is based on the following three main reasons: (1) PLS-SEM does not require a comprehensive theoretical backbone, which makes it appropriate for exploratory studies [16]; (2) PLS-SEM achieves high levels of statistical power even when the sample size is relatively small [42]; (3) when data significantly deviate from a normal distribution, the PLS-SEM algorithm transforms them following the central limit theorem, thus making parameter estimations highly reliable [13].
Table 1. The Initial Pool of Questionnaire Items Grouped by UX Constructs

UX construct: Perceived ease of use (PEOU)
  PEOU1. It is easy to find the desired music or podcast using the service
  PEOU2. It is easy to adjust the sound volume on the service
  PEOU3. The service allows the easy creation of playlists
  PEOU4. Using the service is easy
  PEOU5. I can use the service without additional instructions

UX construct: Perceived usefulness (PU)
  PU1. The service is useful
  PU2. Using the service increases my productivity
  PU3. Using the service saves me time
  PU4. Using the service improves my music listening experience
  PU5. Using the service positively contributes to my everyday life

UX construct: Satisfaction (SAT)
  SAT1. The service meets my expectations
  SAT2. I am satisfied with the price of the service
  SAT3. I am satisfied with the content of the service
  SAT4. I am satisfied with the audio quality of the service
  SAT5. I am satisfied with using the service

UX construct: Security (SEC)
  SEC1. I believe that my personal information is well protected by the service
  SEC2. The service communicates well how it uses my personal information
  SEC3. The way the service collects personal data is not invasive
  SEC4. I believe that the service does not abuse my data
  SEC5. My user account on the service is well protected

UX construct: User interface aesthetics (UIA)
  UIA1. The service’s interface looks nice
  UIA2. The service’s interface is consistent
  UIA3. The colors of the interface are well combined
  UIA4. The service’s interface is understandable
  UIA5. The design of the service’s interface is creative
5 Results

5.1 Participants

The sample was composed of 54.72% male and 45.28% female respondents. They ranged in age from 18 to 53 years, with a mean of 24.85 years and the mode and median both being 22 years. Out of all study participants, 56.60% declared that they regularly use YouTube Music, 26.42% commonly interact with Spotify, 5.66% use Apple Music on a daily basis, 3.77% usually use Deezer, while the remaining 7.55% of respondents ordinarily interact with various other music streaming services.

5.2 Model Assessment

The psychometric features of the introduced research model were tested with the PLS-SEM method, which draws on an iterative two-step algorithm. Firstly, it approximates the measurement model parameters, after which it evaluates the standardized partial regression coefficients of the structural model [12]. To evaluate the quality of the measurement model, the reliability of the manifest and latent variables, convergent validity, and discriminant validity were examined.

The reliability of manifest variables was tested by analyzing the standardized outer loadings of manifest variables on their corresponding latent variables. As proposed by Hulland [17], a manifest variable should be excluded from the model if its standardized outer loading is lower than 0.708. Initial results indicated that 8 out of 25 manifest variables were below the recommended cut-off value. After removing items PEOU3, PEOU4, PEOU5, PU1, SAT2, and UIA4, the loadings of all remaining items were above the proposed threshold value, as shown in Table 2. The standardized loadings of manifest variables ranged from 0.714 to 0.893, which implies that the latent variables explain between 50.98% and 79.74% of the variance of their corresponding manifest variables.

Table 2. Standardized Factor Loadings (Bold Values on the Diagonal) and Cross-Loadings of Items

        PEOU    PU      SAT     SEC     UIA
PEOU1   0.874   0.375   0.448   0.233   0.337
PEOU2   0.893   0.239   0.409   0.325   0.448
PU2     0.028   0.714   0.360   0.470   0.537
PU3     0.331   0.731   0.453   0.343   0.414
PU4     0.486   0.819   0.606   0.494   0.627
PU5     0.110   0.727   0.437   0.431   0.345
SAT1    0.294   0.413   0.741   0.507   0.639
SAT3    0.370   0.720   0.852   0.585   0.582
SAT4    0.456   0.390   0.866   0.473   0.561
SAT5    0.480   0.492   0.852   0.499   0.541
SEC1    0.374   0.423   0.283   0.755   0.406
SEC2    0.275   0.494   0.505   0.883   0.490
SEC3    0.200   0.471   0.546   0.850   0.514
SEC4    0.146   0.329   0.486   0.806   0.375
SEC5    0.287   0.607   0.686   0.821   0.564
UIA1    0.517   0.511   0.615   0.424   0.891
UIA2    0.435   0.538   0.629   0.548   0.846
UIA3    0.309   0.536   0.632   0.514   0.845
UIA5    0.258   0.666   0.503   0.518   0.840
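The link between an outer loading and the share of item variance explained by its construct is simply the square of the loading, which is also why a cut-off of 0.708 corresponds to roughly 50% explained variance. The following minimal sketch re-derives the reported range from the retained loadings in Table 2; the variable and function names are illustrative assumptions, not part of the authors' SmartPLS workflow.

```python
THRESHOLD = 0.708  # cut-off for standardized outer loadings quoted in the text

# Outer loadings of the 19 retained items (own-construct values from Table 2).
loadings = {
    "PEOU1": 0.874, "PEOU2": 0.893,
    "PU2": 0.714, "PU3": 0.731, "PU4": 0.819, "PU5": 0.727,
    "SAT1": 0.741, "SAT3": 0.852, "SAT4": 0.866, "SAT5": 0.852,
    "SEC1": 0.755, "SEC2": 0.883, "SEC3": 0.850, "SEC4": 0.806, "SEC5": 0.821,
    "UIA1": 0.891, "UIA2": 0.846, "UIA3": 0.845, "UIA5": 0.840,
}


def variance_explained(loading):
    """Share of an item's variance explained by its construct, in percent."""
    return 100 * loading ** 2


# Every retained item should meet the cut-off.
all_retained = all(value >= THRESHOLD for value in loadings.values())

shares = {item: variance_explained(value) for item, value in loadings.items()}
lowest, highest = min(shares.values()), max(shares.values())

print(all_retained)
print(f"{lowest:.2f}% to {highest:.2f}%")
```

Run on the values above, the smallest and largest shares come from PU2 (0.714) and PEOU2 (0.893), matching the 50.98% and 79.74% quoted in the text.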
The reliability of the latent variables was examined using the consistent reliability (rho_a), composite reliability (rho_c), and Cronbach’s alpha (α) coefficients. As suggested by Hair et al. [14], values of these reliability indices should be between 0.70 and 0.95 for all constructs to avoid content validity being compromised, which was the case in the proposed research model. Convergent validity, measured by average variance extracted (AVE), indicates the common variance between the manifest variables and their latent variable. The value of AVE should be higher than 0.5 for all constructs, which was the case in our research model. Findings on internal consistency and convergent validity of latent variables are summarized in Table 3.

Table 3. Internal Consistency and Convergent Validity

        Cronbach’s alpha   Composite reliability (rho_a)   Composite reliability (rho_c)   Average variance extracted (AVE)
PEOU    0.719              0.723                           0.877                           0.781
PU      0.740              0.761                           0.836                           0.561
SAT     0.850              0.877                           0.898                           0.688
SEC     0.883              0.903                           0.914                           0.680
UIA     0.878              0.880                           0.916                           0.732
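As a cross-check, AVE and composite reliability (rho_c) can be recomputed from the standardized loadings alone. The sketch below uses the standard textbook formulas, AVE as the mean squared loading and rho_c as the squared sum of loadings divided by that squared sum plus the summed error variances; it is an assumption that SmartPLS applies exactly these definitions, and the function names are hypothetical. Because the inputs are rounded to three decimals, results can differ from Table 3 in the last digit.

```python
# Own-construct loadings of the retained items, taken from Table 2.
loadings = {
    "PEOU": [0.874, 0.893],
    "PU": [0.714, 0.731, 0.819, 0.727],
    "SAT": [0.741, 0.852, 0.866, 0.852],
    "SEC": [0.755, 0.883, 0.850, 0.806, 0.821],
    "UIA": [0.891, 0.846, 0.845, 0.840],
}


def average_variance_extracted(lams):
    """Mean of the squared standardized loadings."""
    return sum(lam * lam for lam in lams) / len(lams)


def composite_reliability(lams):
    """rho_c = (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances)."""
    total = sum(lams)
    error = sum(1 - lam * lam for lam in lams)
    return total ** 2 / (total ** 2 + error)


summary = {
    construct: (
        round(average_variance_extracted(lams), 3),
        round(composite_reliability(lams), 3),
    )
    for construct, lams in loadings.items()
}

for construct, (ave, rho_c) in summary.items():
    print(f"{construct}: AVE={ave}, rho_c={rho_c}")
```

Cronbach's alpha and rho_a are not reproduced here because they depend on the item-level correlations and PLS weights, which are not reported in the paper.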
To check whether similarities between latent variables in the model are low or non-existent, discriminant validity was examined. The first measure checked in that respect was cross-loadings. For each manifest variable, this value should be highest when it is related to the latent variable it represents, rather than to any other latent variable in the model. This was shown to be the case for all manifest variables in our research model, as displayed in Table 2. The second measure used for evaluating discriminant validity was the Heterotrait-Monotrait ratio of correlations (HTMT). For latent variables to be sufficiently distinct, their HTMT values should not exceed 0.85 when the model is composed of conceptually different constructs [38]. As illustrated in Table 4, HTMT values were in the range from 0.387 to 0.812, thus indicating that the constructs which constitute the introduced research model are sufficiently different.

Table 4. Heterotrait–Monotrait Ratio of Correlations (HTMT)

        PEOU    PU      SAT     SEC     UIA
PEOU
PU      0.441
SAT     0.617   0.749
SEC     0.387   0.689   0.696
UIA     0.555   0.795   0.812   0.648
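Both checks described above are mechanical and can be expressed in a few lines. The sketch below applies the cross-loading rule to three rows copied from Table 2 (the rows with the closest competing loadings) and the 0.85 limit to the HTMT values of Table 4. The names `loads_highest_on_own` and `HTMT_LIMIT` are illustrative assumptions; computing HTMT itself would require the item-level correlation matrix, which the paper does not report.

```python
CONSTRUCTS = ["PEOU", "PU", "SAT", "SEC", "UIA"]
HTMT_LIMIT = 0.85  # limit for conceptually different constructs, as cited in the text

# Selected rows of Table 2: item -> (own construct, loadings in CONSTRUCTS order).
cross_loadings = {
    "PU2": ("PU", [0.028, 0.714, 0.360, 0.470, 0.537]),
    "SAT3": ("SAT", [0.370, 0.720, 0.852, 0.585, 0.582]),
    "UIA5": ("UIA", [0.258, 0.666, 0.503, 0.518, 0.840]),
}


def loads_highest_on_own(own, row):
    """True if the largest loading in the row belongs to the item's own construct."""
    return CONSTRUCTS[row.index(max(row))] == own


cross_loading_ok = all(
    loads_highest_on_own(own, row) for own, row in cross_loadings.values()
)

# HTMT values from Table 4 (lower triangle).
htmt = {
    ("PU", "PEOU"): 0.441,
    ("SAT", "PEOU"): 0.617, ("SAT", "PU"): 0.749,
    ("SEC", "PEOU"): 0.387, ("SEC", "PU"): 0.689, ("SEC", "SAT"): 0.696,
    ("UIA", "PEOU"): 0.555, ("UIA", "PU"): 0.795, ("UIA", "SAT"): 0.812,
    ("UIA", "SEC"): 0.648,
}
violations = [pair for pair, value in htmt.items() if value > HTMT_LIMIT]

print(cross_loading_ok, violations)
```

The pair closest to the limit is UIA and SAT at 0.812, so the conclusion of sufficient distinctness holds with a modest margin under the 0.85 criterion.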
The last measure employed for examining discriminant validity was the Fornell-Larcker criterion, which suggests that the square root of the AVE of each latent variable should be greater than its highest correlation with any other latent variable in the model [11]. Findings reported in Table 5 demonstrate that each latent variable shares more variance with its manifest variables than with other latent variables, which supports the discriminant validity of the proposed research model. All reported results speak in favor of the reliability and validity of the measurement model.

Table 5. Fornell-Larcker Criterion (Note that Bold Values on the Diagonal Represent the Square Root of AVE of Each Construct)

        PEOU    PU      SAT     SEC     UIA
PEOU    0.884
PU      0.344   0.749
SAT     0.483   0.631   0.830
SEC     0.318   0.584   0.626   0.824
UIA     0.447   0.658   0.694   0.584   0.856
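The Fornell-Larcker comparison can be reproduced from the AVE values in Table 3 and the off-diagonal correlations in Table 5. The following is a minimal sketch under that assumption; the helper `highest_correlation` is a hypothetical name, and because the AVE inputs are rounded, the recomputed square roots may differ from the printed diagonal in the third decimal.

```python
import math

# AVE per construct, from Table 3.
ave = {"PEOU": 0.781, "PU": 0.561, "SAT": 0.688, "SEC": 0.680, "UIA": 0.732}

# Latent variable correlations, from the off-diagonal cells of Table 5.
correlations = {
    ("PU", "PEOU"): 0.344,
    ("SAT", "PEOU"): 0.483, ("SAT", "PU"): 0.631,
    ("SEC", "PEOU"): 0.318, ("SEC", "PU"): 0.584, ("SEC", "SAT"): 0.626,
    ("UIA", "PEOU"): 0.447, ("UIA", "PU"): 0.658, ("UIA", "SAT"): 0.694,
    ("UIA", "SEC"): 0.584,
}


def highest_correlation(construct):
    """Largest correlation between a construct and any other construct."""
    return max(value for pair, value in correlations.items() if construct in pair)


# Criterion: sqrt(AVE) of each construct exceeds its highest correlation.
criterion_met = {
    construct: math.sqrt(value) > highest_correlation(construct)
    for construct, value in ave.items()
}

for construct in ave:
    print(
        f"{construct}: sqrt(AVE)={math.sqrt(ave[construct]):.3f}, "
        f"max r={highest_correlation(construct):.3f}, ok={criterion_met[construct]}"
    )
```

The tightest case is PU, where the square root of AVE (about 0.749) is compared with its correlation of 0.658 with UIA.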
Measuring the quality of the structural model was done by examining the collinearity, determination coefficient of endogenous latent variables, significance levels of path coefficients, effect sizes, the predictive relevance of exogenous latent variables, and the predictive power of the research model. To avoid bias in estimated partial regression coefficients, there should be a low level of collinearity in the structural model. The Variance Inflation Factor (VIF) is the indicator most commonly used for checking the extent of collinearity among constructs. According to [14], VIF values should be close to three or lower. As shown in Table 6, the VIF values of all constructs are below this threshold, thus confirming the absence of collinearity in the structural model.

Table 6. Results of Testing Collinearity among Constructs in the Structural Model

        PEOU    PU      SAT     SEC     UIA
PEOU    -       -       1.135   -       -
PU      -       -       1.135   -       -
SAT     -       -       -       -       -
SEC     -       1.518   -       -       -
UIA     1.000   1.518   -       -       -
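For readers who want to see where such values come from, the VIF of a predictor is 1 / (1 - R²), where R² is obtained by regressing that predictor on the other predictors of the same endogenous construct. When a construct has exactly two predictors, this R² is simply their squared correlation. The sketch below applies this to the rounded construct correlations from Table 5; it is a plausibility check under that assumption, not the procedure the software tool runs.

```python
def vif_from_r2(r2_j):
    """VIF of predictor j, given R^2 from regressing it on the other predictors."""
    return 1.0 / (1.0 - r2_j)


def vif_two_predictors(r):
    """With exactly two predictors, R^2 of one on the other is r squared."""
    return vif_from_r2(r ** 2)


# Correlations taken from the off-diagonal cells of Table 5 (rounded values).
vif_sat_predictors = vif_two_predictors(0.344)  # PEOU and PU -> SAT
vif_pu_predictors = vif_two_predictors(0.584)   # SEC and UIA -> PU
vif_single = vif_from_r2(0.0)                   # UIA alone -> PEOU

print(round(vif_sat_predictors, 3), round(vif_pu_predictors, 3), vif_single)
```

Because the inputs are rounded to three decimals, the results agree with Table 6 only to within rounding.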
The proportion of an endogenous latent variable’s variance explained by a set of its predictors is known as the determination coefficient (R²). In empirical studies dealing with software evaluation, R² values of 0.46, 0.34, and 0.15 for endogenous latent variables in a structural model indicate substantial, moderate, and weak explanatory power of their antecedents, respectively [23]. Because adjusted R² corrects the value of R² for the size of the model, it is usually interpreted instead of R² [38]. Findings reported in Table 7 suggest that, in the proposed research model, user interface aesthetics explains 18.40% of the variance in perceived ease of use, security and user interface aesthetics explain 47.30% of the variance in perceived usefulness, while perceived usefulness and perceived ease of use account for 45.80% of the variance in satisfaction. The aforementioned indicates that predictors of perceived usefulness and satisfaction have substantial explanatory power, while the determining factor of perceived ease of use has weak explanatory power.

Table 7. Results of Testing the Explanatory Power of the Research Model

Endogenous constructs     R²      R² Adjusted
Perceived ease of use     0.200   0.184
Perceived usefulness      0.494   0.473
Satisfaction              0.479   0.458
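The adjustment can be illustrated with the standard formula adjusted R² = 1 - (1 - R²)(n - 1) / (n - k - 1), where n is the sample size and k the number of predictors. The sketch below assumes n = 53 (the number of participants reported in the discussion) and takes the predictor counts from the structural paths; since the published R² values are rounded, the results match Table 7 only to within about 0.001.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a model with k predictors estimated on n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


N = 53  # sample size reported in the study

# construct: (R^2 from Table 7, number of predictors in the structural model)
MODELS = {
    "PEOU": (0.200, 1),  # UIA
    "PU": (0.494, 2),    # SEC, UIA
    "SAT": (0.479, 2),   # PEOU, PU
}

for construct, (r2, k) in MODELS.items():
    print(construct, round(adjusted_r2(r2, N, k), 3))
```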
To evaluate the significance of path coefficients between a particular latent variable and a set of its predictors, and to test the hypotheses, the bootstrapping method with 5000 subsamples was employed. As shown in Table 8, user interface aesthetics significantly affects perceived ease of use (β = 0.447, p < 0.001) and perceived usefulness (β = 0.481, p < 0.001), thus supporting H1 and H2, security significantly contributes to perceived usefulness (β = 0.303, p < 0.001), thus confirming H3, while perceived ease of use (β = 0.302, p < 0.05) and perceived usefulness (β = 0.527, p < 0.001) significantly affect satisfaction, thus providing support for H4 and H5, respectively.

Effect size (f²) indicates the relative impact of a predictor on an endogenous latent variable concerning its explanatory power. According to Cohen [7], values of 0.02, 0.15, and 0.35 imply a small, moderate, and large influence of an exogenous latent variable on an endogenous latent variable, respectively. Table 8 shows that perceived ease of use has a small (f² = 0.121) while perceived usefulness has a large impact (f² = 0.453) on satisfaction. Security has a small (f² = 0.115) influence on perceived usefulness, while user interface aesthetics has a moderate influence on both perceived ease of use (f² = 0.250) and perceived usefulness (f² = 0.304).

Assessment of the predictive validity of exogenous latent variables was done with the Q²predict measure, calculated using the PLSpredict algorithm [39] implemented in the SmartPLS 4.0.8.7 software tool [37]. Values of Q²predict greater than 0, 0.25, and 0.5 suggest small, medium, and large predictive relevance of the PLS path model [14]. The changes in Q²predict represent the predictive relevance (q²) of latent variables. As was the case with f², the q² values of 0.02, 0.15, and 0.35 represent a small, medium, and large predictive relevance of an exogenous latent variable. From the calculated values shown in Table 8, it can be reasoned that perceived ease of use and perceived usefulness have small (q² = 0.037) and large (q² = 0.432) relevance in predicting satisfaction, respectively. Furthermore, user interface aesthetics has medium (q² = 0.242) and security has small (q² = 0.066) relevance in predicting perceived usefulness. Finally, user interface aesthetics has medium (q² = 0.195) relevance in predicting perceived ease of use.

Table 8. Results of Testing the Hypotheses, Effect Size, and Predictive Validity

Path          Path coefficient   T statistics   p values   f²      q²      Decision
PEOU -> SAT   0.302              2.280          0.023      0.121   0.037   Accepted
PU -> SAT     0.527              5.985          0.000      0.453   0.432   Accepted
SEC -> PU     0.303              3.521          0.000      0.115   0.066   Accepted
UIA -> PEOU   0.447              4.673          0.000      0.250   0.195   Accepted
UIA -> PU     0.481              4.307          0.000      0.304   0.242   Accepted
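The effect size follows the usual definition f² = (R² included - R² excluded) / (1 - R² included), and the same 0.02 / 0.15 / 0.35 cut-offs are applied to both f² and q². The sketch below is a minimal illustration using the values in Table 8; the function names are assumptions made for this example, and the label "medium" is used where the text also says "moderate".

```python
def f_squared(r2_included, r2_excluded):
    """Effect size of a predictor from R^2 with and without that predictor."""
    return (r2_included - r2_excluded) / (1.0 - r2_included)


def cohen_label(value):
    """Classify an f^2 or q^2 value using the 0.02 / 0.15 / 0.35 cut-offs."""
    if value >= 0.35:
        return "large"
    if value >= 0.15:
        return "medium"
    if value >= 0.02:
        return "small"
    return "negligible"


# path: (f^2, q^2), copied from Table 8
EFFECTS = {
    "PEOU -> SAT": (0.121, 0.037),
    "PU -> SAT": (0.453, 0.432),
    "SEC -> PU": (0.115, 0.066),
    "UIA -> PEOU": (0.250, 0.195),
    "UIA -> PU": (0.304, 0.242),
}

LABELS = {path: (cohen_label(f2), cohen_label(q2))
          for path, (f2, q2) in EFFECTS.items()}
for path, labels in LABELS.items():
    print(path, labels)

# UIA is the only predictor of PEOU, so R^2 without it is 0 (R^2 = 0.200 in Table 7).
print(round(f_squared(0.200, 0.0), 3))
```

For the single-predictor path UIA -> PEOU, the formula reproduces the tabulated f² from the R² in Table 7.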
The predictive power of a model is commonly explored with the root mean squared error (RMSE), but when a highly non-symmetric distribution of prediction errors exists, the mean absolute error (MAE) is used instead [39]. The evaluation procedure consists of comparing the RMSE (or MAE) values of the PLS-SEM predictions for each item with those generated by a naïve benchmark that uses a linear regression model (LM). Visual inspection of error histograms revealed that the distribution of prediction errors is symmetric, so we performed the predictive power evaluation on RMSE. As shown in the third column of Table 9, almost all endogenous construct items (except SAT1) have smaller PLS-SEM_RMSE values when compared to the naïve LM_RMSE benchmark, which indicates that the proposed model has medium predictive power [39].

Table 9. Results of Testing the Predictive Power of the Research Model

Item    Q²predict   PLS-SEM_RMSE   PLS-SEM_MAE   LM_RMSE   LM_MAE
PEOU1   0.093       0.768          0.493         0.956     0.612
PEOU2   0.178       0.434          0.321         0.458     0.348
PU2     0.292       0.796          0.605         0.838     0.639
PU3     0.145       1.032          0.811         1.258     0.998
PU4     0.385       0.686          0.503         0.833     0.616
PU5     0.141       0.723          0.521         0.805     0.607
SAT1    0.340       0.547          0.454         0.509     0.398
SAT3    0.355       0.538          0.424         0.577     0.432
SAT4    0.313       0.532          0.416         0.608     0.479
SAT5    0.299       0.655          0.501         0.707     0.553
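The item-level comparison can be written down in a few lines. The sketch below uses the RMSE columns of Table 9 and the decision rule as it is commonly summarized from the PLSpredict guidelines [39] (all items better than LM: high; a majority or the same number: medium; a minority: low; none: no predictive power). The function name and return format are assumptions made for this example.

```python
# item: (PLS-SEM_RMSE, LM_RMSE), copied from Table 9
RMSE = {
    "PEOU1": (0.768, 0.956), "PEOU2": (0.434, 0.458),
    "PU2": (0.796, 0.838), "PU3": (1.032, 1.258),
    "PU4": (0.686, 0.833), "PU5": (0.723, 0.805),
    "SAT1": (0.547, 0.509), "SAT3": (0.538, 0.577),
    "SAT4": (0.532, 0.608), "SAT5": (0.655, 0.707),
}


def predictive_power(rmse):
    """Return (label, items where PLS-SEM does not beat the LM benchmark)."""
    better = [item for item, (pls, lm) in rmse.items() if pls < lm]
    worse = sorted(set(rmse) - set(better))
    if len(better) == len(rmse):
        label = "high"
    elif 2 * len(better) >= len(rmse):
        label = "medium"
    elif better:
        label = "low"
    else:
        label = "none"
    return label, worse


print(predictive_power(RMSE))
```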
6 Discussion and Limitations

As shown in the previous section, both the validity and reliability of the research model were confirmed. User interface aesthetics, as a subset of usability [18], was shown to have a positive influence on perceived ease of use and perceived usefulness, which is consistent with the findings of earlier studies [8, 21], although both its f² and q² values were moderate. A study conducted by Farooq et al. [10] revealed that security has a positive impact on perceived usefulness. This was also shown to be the case in this study, as security appeared to have a significant positive impact on perceived usefulness. However, its effect size and predictive relevance were both found to be small. The findings of our study indicate that perceived ease of use has a small impact on satisfaction in the context of music streaming services, which is in line with results reported in [29, 48]. A previous study [36] on music streaming services evaluation uncovered that perceived ease of use does not significantly affect satisfaction. The reason for this may be the fact that music streaming services are hedonic systems whose purpose is to let users enjoy listening to music. We also discovered that users are more satisfied with music streaming services if they find them useful, which is consistent with study results discussed in [26, 29, 32].

An important limitation of this study was its sample size. With only 53 study participants, whose average age was 24.85, the reported findings cannot be generalized to all groups of music streaming services users but only to those who took part in this study. Also, the minimum sample size required for PLS-SEM is either ten times the largest number of indicators used to measure one construct or ten times the largest number of structural paths directed at a particular endogenous construct in the structural model [12]. With this in mind, a minimum for the introduced research model would be 50, which the sample size barely exceeds. Considering all the aforementioned, the study results should be interpreted cautiously.
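The ten-times rule mentioned above is simple arithmetic. In the sketch below, the value of five indicators for the largest construct is an assumption inferred from the stated minimum of 50 and from the item labels in Table 9, and the value of two inbound paths follows from Table 8 (perceived usefulness and satisfaction each have two predictors); the function name is illustrative.

```python
def ten_times_rule(max_indicators_per_construct, max_inbound_paths):
    """Minimum sample size under the ten-times rule of thumb for PLS-SEM."""
    return 10 * max(max_indicators_per_construct, max_inbound_paths)


SAMPLE_SIZE = 53  # participants reported in the study
minimum = ten_times_rule(max_indicators_per_construct=5, max_inbound_paths=2)

print(minimum, SAMPLE_SIZE >= minimum, SAMPLE_SIZE - minimum)
```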
7 Conclusion

This paper provides several contributions to the extant body of knowledge. First, we identified which pragmatic (perceived ease of use, perceived usefulness, and security) and hedonic (user interface aesthetics and satisfaction) constructs constitute the user experience of digital music streaming services, such as Spotify, Deezer, and YouTube Music. Requirements of these user experience dimensions can be employed as a set of guidelines during the evaluation of current and design of new music streaming services. To enhance users’ music listening experience and to be perceived as easy to use, music streaming services should have a visually appealing user interface. In addition, to be perceived as advantageous in users’ daily life, music streaming services are required to have implemented mechanisms that prevent unauthorized access to user data and information. Finally, to meet users’ expectations, music streaming services should take little effort to use and interaction with them should increase users’ productivity. Second, we designed a measuring instrument in the form of a post-use questionnaire that can be used for examining user experience of music streaming services at different levels of granularity. Finally, we proposed a valid and reliable conceptual model that can be employed for predicting users’ satisfaction with music streaming services. Both the questionnaire and research model can serve researchers as a foundation for future advances in the field.

To draw sound generalizable conclusions, a repeat of the study using a larger sample size and including heterogeneous users concerning their demographics should be done. The model could also be adapted for evaluating the user experience of other sorts of hedonic systems such as video and movie streaming services like YouTube and Netflix, live streaming services like Twitch TV, as well as other kinds of audio software such as music player programs for mobile phones. An expansion of the model could also be done by measuring more dimensions of user experience and usability or extending the proposed constructs with more items.
References

1. Ajzen, I.: The theory of planned behavior. Organ. Behav. Hum. Decis. Process. 50(2), 179–211 (1991)
2. Barata, M.L., Coelho, P.S.: Music streaming services: understanding the drivers of customer purchase. Heliyon 7(8), e07783 (2021)
3. Bevan, N.: Classifying and selecting UX and usability measures. In: Proceedings of the International Workshop on Meaningful Measures: Valid Useful User Experience Measurement, pp. 13–18. Institute of Research in Informatics of Toulouse (IRIT), Toulouse (2008)
4. Business of Apps: Music Streaming App Revenue and Usage Statistics (2022). https://www.businessofapps.com/data/music-streaming-market/. Accessed 8 Jan 2023
5. Chen, C.C., Leon, S.: Converting music streaming free users to paid subscribers: social influence or hedonic performance. Int. J. Electron. Bus. 14(2), 128–145 (2018)
6. Coelho, M.P., Mendes, J.Z.: Digital music and the “death of the long tail.” J. Bus. Res. 101, 454–460 (2019)
7. Cohen, J.: Statistical Power Analysis for the Behavioral Sciences. Routledge, New York (1988)
8. Cyr, D., Head, M., Ivanov, A.: Design aesthetics leading to m-loyalty in mobile commerce. Inf. Manag. 43(8), 950–963 (2006)
9. Davis, F.D.: Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Q. 13(3), 319–340 (1989)
10. Farooq, A., Ahmad, F., Khadam, N., Lorenz, B., Isoaho, J.: The impact of perceived security on intention to use E-learning among students. In: Proceedings of the 20th International Conference on Advanced Learning Technologies (ICALT), pp. 360–364. IEEE, Tartu (2020)
11. Fornell, C., Larcker, D.F.: Structural equation models with unobservable variables and measurement error: algebra and statistics. J. Mark. Res. 18(3), 328–388 (1981)
12. Hair, J.F., Hult, G.T.M., Ringle, C.M., Sarstedt, M.: A Primer on Partial Least Squares Structural Equation Modeling (PLS-SEM), 3rd edn. Sage, Thousand Oaks (2022)
13. Hair, J.F., Ringle, C.M., Sarstedt, M.: PLS-SEM: indeed a silver bullet. J. Mark. Theory Pract. 19(2), 139–151 (2011)
14. Hair, J.F., Risher, J.J., Sarstedt, M., Ringle, C.M.: When to use and how to report the results of PLS-SEM. Eur. Bus. Rev. 31(1), 2–24 (2019)
15. Hassenzahl, M.: User Experience (UX): towards an experiential perspective on product. In: Proceedings of the 20th Conference on l’Interaction Homme-Machine (IHM), pp. 11–15. ACM, Metz (2008)
16. Henseler, J., Ringle, C.M., Sinkovics, R.R.: The use of partial least squares path modeling in international marketing. Adv. Int. Mark. 20, 277–319 (2009)
17. Hulland, J.: Use of partial least squares (PLS) in strategic management research: a review of four recent studies. Strateg. Manag. J. 20(2), 195–204 (1999)
18. International Standards Organization: ISO/IEC 25010:2011. Systems and software engineering - Systems and software Quality Requirements and Evaluation (SQuaRE) - System and software quality models (2011)
19. Lee, J.H., Price, R.: User experience with commercial music services: an empirical exploration. J. Assoc. Inf. Sci. Technol. 67(4), 800–811 (2016)
20. Lee, M., Choi, H.S., Cho, D., Lee, H.: Can digital consumption boost physical consumption? The effect of online music streaming on record sales. Decis. Support Syst. 135, 113337 (2020)
21. Li, Y.-M., Yeh, Y.-S.: Increasing trust in mobile commerce through design aesthetics. Comput. Hum. Behav. 26(4), 673–684 (2010)
22. Mariano-Melo, A., Ramírez-Correa, P., Ramírez-Rivas, C.: Music streaming acceptance in brazil: a study using structural equations. In: Proceedings of the International Conference on Industrial Engineering and Operations Management, pp. 494–495. IEOM Society International, Sao Paulo (2021)
23. Orehovački, T.: Methodology for evaluating the quality in use of web 2.0 applications. Ph.D. Thesis, University of Zagreb, Faculty of Organization and Informatics, Varaždin, Croatia (2013)
24. Orehovački, T.: Evaluating the interplay of user experience facets in the context of social web applications. In: Proceedings of the 9th International Conference on Software and Information Engineering, pp. 73–78. ACM, Kairo (2020)
25. Orehovački, T., Babić, S.: Identifying the relevance of quality dimensions contributing to universal access of social web applications for collaborative writing on mobile devices: an empirical study. Univers. Access Inf. Soc. 17(3), 453–473 (2018)
26. Orehovački, T., Babić, S.: Predicting students’ continuance intention related to the use of collaborative web 2.0 applications. In: Proceedings of the 23rd International Conference on Information Systems Development (ISD), pp. 112–122. Association for Information Systems, Varaždin (2014)
27. Orehovački, T., Babić, S., Etinger, D.: Identifying Relevance of Security, Privacy, Trust, and Adoption Dimensions Concerning Cloud Computing Applications Employed in Educational Settings. In: Nicholson, D. (ed.) AHFE 2017. AISC, vol. 593, pp. 308–320. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-60585-2_29
28. Orehovački, T., Babić, S., Jadrić, M.: Exploring the validity of an instrument to measure the perceived quality in use of web 2.0 applications with educational potential. In: Zaphiris, P., Ioannou, A. (eds.) LCT, Part I, HCII 2014. Lecture Notes in Computer Science, vol. 8523, pp. 192–203. Springer, Heraklion (2014). https://doi.org/10.1007/978-3-319-07482-5_15
29. Orehovački, T., Blašković, L., Kurevija, M.: Evaluating the perceived quality of mobile banking applications in croatia: an empirical study. Future Internet 15(1), 8 (2023)
30. Orehovački, T., Cappiello, C., Matera, M.: Identifying Relevant Dimensions for the Quality of Web Mashups: An Empirical Study. In: Kurosu, M. (ed.) HCI 2016. LNCS, vol. 9731, pp. 396–407. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-39510-4_37
31. Orehovački, T., Etinger, D., Babić, S.: Modelling the adoption of the version control system: an empirical study. In: Proceedings of the 3rd International Conference on Human Systems Engineering and Design: Future Trends and Applications (IHSED), pp. 45–50. Springer, Pula (2020). https://doi.org/10.1007/978-3-030-58282-1_8
32. Orehovački, T., Etinger, D., Babić, S.: The antecedents of intelligent personal assistants adoption. In: Proceedings of the AHFE 2018 International Conference on Human Factors and Systems Interaction, pp. 76–87. Springer, Orlando (2018). https://doi.org/10.1007/978-3-319-94334-3_10
33. Orehovački, T., Plantak Vukovac, D., Džeko, M., Stapić, Z.: Evaluating relevant UX dimensions with respect to IoT ecosystem intended for students’ activities tracking and success prediction. In: Zaphiris, P., Ioannou, A. (eds.) LCT 2018. LNCS, vol. 10924, pp. 279–293. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-91743-6_22
34. Orehovački, T., Radošević, D.: Exploring the antecedents of verificator adoption. In: Kurosu, M. (ed.) HCII 2021. LNCS, vol. 12764, pp. 400–417. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-78468-3_27
35. Orehovački, T., Žajdela Hrustek, N.: Development and validation of an instrument to measure the usability of educational artifacts created with web 2.0 applications. In: Marcus, A. (ed.) DUXU 2013. LNCS, vol. 8012, pp. 369–378. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39229-0_40
36. Pal, D., Triyason, T.: User intention towards a music streaming service: a Thailand case study. In: Proceedings of the 9th International Conference on Advances in Information Technology (IAIT), pp. 1–16. King Mongkut’s University of Technology Thonburi, Bangkok (2018)
37. Ringle, C.M., Wende, S., Becker, J.-M.: SmartPLS 4. Boenningstedt: SmartPLS. Retrieved from https://www.smartpls.com (2022)
38. Russo, D., Stol, K.-J.: PLS-SEM for software engineering research: an introduction and survey. ACM Comput. Surv. 54(4), 1–38 (2022)
39. Shmueli, G., et al.: Predictive model assessment in PLS-SEM: guidelines for using PLSpredict. Eur. J. Mark. 53(11), 2322–2347 (2019)
40. Silfverberg, S., Liikkanen, L.A., Lampinen, A.: “I’ll press play, but I won’t listen”: profile work in a music-focused social network service. In: Proceedings of the ACM 2011 Conference on Computer Supported Cooperative Work, pp. 207–216. ACM, Hangzhou (2011)
41. Statista: Music streaming – worldwide (2022). https://www.statista.com/outlook/dmo/digital-media/digital-music/music-streaming/worldwide. Accessed 8 Jan 2023
42. Tenenhaus, M., Esposito Vinzi, V., Chatelin, Y.-M., Lauro, C.: PLS path modeling. Comput. Stat. Data Anal. 48(1), 159–205 (2005)
43. Tuch, A.N., Presslaber, E.E., Stöcklin, M., Bargas-Avila, J.A.: The role of visual complexity and prototypicality regarding first impression of websites: Working towards understanding aesthetic judgments. Int. J. Hum. Comput. Stud. 70(11), 794–811 (2012)
44. Venkatesh, V., Bala, H.: Technology acceptance model 3 and a research agenda on interventions. Decis. Sci. 39(2), 273–315 (2008)
45. Venkatesh, V., Thong, J.Y.L., Xu, X.: Consumer acceptance and use of information technology: extending the unified theory of acceptance and use of technology. MIS Q. 36(1), 157–178 (2012)
46. Wagner, T.M., Hess, T.: What drives users to pay for freemium services. Examining people’s willingness to pay for music services. In: Proceedings of the 19th Americas Conference of Information Systems (AMCIS), pp. 1–8. AIS, Chicago (2013)
47. Wikström, P.: A typology of music distribution models. Int. J. Music Bus. Res. 1(1), 7–20 (2012)
48. Wulandari, D., Suhud, U., Purwohedi, U.: The influence factors of continuance intention to use a music streaming application. Int. J. Adv. Sci. Educ. Religion 2(2), 17–25 (2019)
49. Yeoh, S.Y., Yuntavid, X.J., Chin, P.N.: Examining the continuous usage intention and behaviours of music streaming subscribers. Int. J. Electron. Bus. 17(2), 183–203 (2022)
Using Drone and AI Application for Power Transmission Line Inspection and Maintenance: A Case Study in Vietnam

Dinh Cong Nguyen1(B), Le Nhan Tam2, Dinh Hung Phan3, The Cuong Nguyen1, Dung Nguyen Duy4, and Quang Nguyen Xuan4

1 Hong Duc University, Thanh Hoa, Vietnam
{nguyendinhcong,nguyenthecuong}@hdu.edu.vn
2 Microsoft, Ha Noi, Vietnam
[email protected]
3 Thinklabs JSC, Thanh Hoa, Vietnam
[email protected]
4 EVN NPTC2, Da Nang, Vietnam
{nguyenduydung,nguyenxuanquang3}@npt.com.vn
Abstract. The current manual inspection and maintenance of power transmission lines and of the equipment on power towers in Vietnam present a number of challenges, such as danger to workers, long execution times, inaccurate information, and the lack of digitized data for archiving. In this article, we introduce a management information system (MIS) for power transmission lines that supports the digitalization of inspection and maintenance tasks. We also illustrate how drones can be used to inspect the infrastructure automatically and how they can be combined with an artificial intelligence (AI) module that recognizes and classifies the equipment mounted on the lines and power towers and then detects its failures and abnormalities.

Keywords: Digitalization · Inspection and Maintenance · Artificial Intelligence · Drone · Power Line Transmission

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): SAI 2023, LNNS 711, pp. 684–698, 2023. https://doi.org/10.1007/978-3-031-37717-4_44

1 Introduction

In March 2020, the Vietnamese government approved a decision on a national digital transformation programme to 2025, with a vision to 2030. The main objective of the programme is to move Vietnam toward a digital government with stability and prosperity. It also aims to form national digital technology groups that are capable of going global. Consequently, conventional businesses are required to adapt their operations and organization [1]. Due to increased technological competition, enterprises should improve their infrastructure and promote applications based on digital technologies. Moreover, one of the key factors is to provide employees with digital skills so that they can adapt more dynamically to new technologies.

For convenience, we discuss here the definitions of two common terms, digitization and digitalization. Digitization is most commonly described as a technical process [2] that converts analog samples into digital information. By comparison, digitalization, also known as digital transformation, normally refers to the procedures and processes of employing information from digital media [6]. Digitization and digital transformation are crucial trends that influence many aspects of business models as well as business development. In the scope of our project, we have used the digitization process as the backbone to accelerate the digitalization process.

PTC2 (Power Transmission Company 2) is one of the branches of the National Power Transmission Corporation (NPTC) and covers eight provinces in the central region of Vietnam, shown in Fig. 1. Its mission is to manage all the power transmission systems of the region. At this scale, PTC2 has to manage the following units: 6 routes of 500 kV high-voltage power transmission with 2834 towers, 43 routes of 200 kV high-voltage power transmission with 2790 towers, 3 power transformers at 500 kV stations, and 14 power transformers at 200 kV stations. To maintain and operate the system smoothly and effectively, PTC2 needs two thousand workers at different levels of management.
Fig. 1. The Terrain Factors of the Power Line Systems are Divided into many Areas from Forests, High Mountains, Rivers and Lakes, and Coastal Areas.
With the conventional approach, some drawbacks of the workflows of PTC2 can be summarized as follows:

– The interactions between direct managers and workers for daily tasks are completed manually through employee task assignments corresponding to different kinds of tasks. Statistical analysis reports and/or synthesis reports are then submitted through monitoring forms once a day. Every week, month, and year, specialized divisions have to summarize the data and compile the reports. With a large volume of data sheets and documents produced every day, this is really hard to follow.
– All technical reports and records of the maintenance and operations of the PTC2 systems are stored physically on paper and in notebooks. This makes updating, statistical comparison, and storage difficult.
– The power line inspections are periodically carried out with the conventional approach, which is mostly based on naked-eye observation. This work is time-consuming, expensive, and life-threatening, as addressed in [7].

In Vietnam, digital transformation in business models and activities has recently received more attention and discussion [3]. However, digitalization processes are mostly applied in finance and banking [4], health care [5], or education [11]. Digitalization is a demanding task, and it is hard to find a general platform for different domains. To the best of our knowledge, only a few publications discuss digital transformation in power line inspection from a scientific perspective [19,20]. In this paper, we address these weaknesses of the business activities with special attention to PTC2. Our contribution is an end-to-end system to manage the business activities, with a particular focus on periodic inspection and maintenance, one of the most important missions of PTC2, through:

– All the documents related to the operational management system, such as inspection, measurement of equipment, handling of existing issues, and maintenance and repair of the power transmission lines, are digitized and stored. A dashboard is designed to manage the equipment and the infrastructure of the power transmission lines and to automatically present, extract, or aggregate reports according to the prescribed forms.
– An AI platform is integrated into the management information system. It embeds an object detection model based on convolutional neural networks (CNNs). The model automatically detects and classifies general objects in images collected by the UAV (unmanned aerial vehicle) and evaluates them as normal or defective, where defects include insulator faults, corroded connectors, damaged conductors, and so on.
– A solution for controlling the UAV so that it flies automatically along a defined trajectory is presented. The trajectories differ from each other due to the terrain conditions and line infrastructure at different units. This process maximizes accuracy and safety when operating the UAV.

The remainder of the paper is organized as follows. Section 2 explains the methodology used to build the system. Section 3 describes our system in detail. Section 4 evaluates the proposed system. Finally, Sect. 5 concludes the paper, gives perspectives, and provides some detailed information.

Fig. 2. The Methodology for our Context in the Digital Transformation Process of the PTC2.
2 Methodology

Digital transformation with the support of digital technologies can be established by two methods: recombination and invention [8]. Recombination combines new business models with existing ones, while invention deploys an entirely new business model. In our context, we prefer the recombination method, illustrated in Fig. 2.

Figure 2 describes our system development methodology. Our first objective is to gain the best possible understanding of the current situation and requirements of the business. Therefore, the existing problems and the expectations of PTC2 regarding a new management system need to be identified. As illustrated, the related documents and existing processes were carefully studied by our business analysts. In addition, we organized several workshops and seminars with business units, technical workers, and leaders to collect and analyze opinions, suggestions, and comments about a new management information system (MIS). On that foundation, the software system requirements were analyzed and designed in order to build an architectural model that fits the business demands well. After that, a beta version of the MIS was released to PTC2 so that users could experience and test its functions. Throughout this process we obtained valuable feedback, which was used to improve the MIS.
3 Our Proposed Management Information System

As mentioned in Sect. 1, our proposed system consists of three main parts. Section 3.1 describes the applications for inspection and maintenance of the line and tower transmission systems. Section 3.2 presents the applications for information management and the image processing module, which uses artificial intelligence (AI) for object detection. Section 3.3 shows the applications through which users interact with the system. Figure 3 illustrates the overall system.
Fig. 3. The Overall System Includes Three Main Parts.
3.1
Applications for Inspection and Maintenance
The applications for inspection and maintenance aim to serve technical workers in the periodic inspections throughout the year on the website and mobile devices. We configured two different scenarios for the observations. The first scenario, an automatic observation was deployed by the UAV. In this context, we developed a mobile application (app) for tablet devices. Technical workers will interact with the mobile app not only to receive job assignments from leaders, to work alongside with partners but also to control the UAV. Moreover, with the places which cannot be accessed by the human or cannot connect to pilot signals, we contributed to design a program that allows the UAV to fly according to programmed trajectories without controlling by table devices, shown in Fig. 13. The observation results could be downloaded to the management system later. The second scenario, as the conventional approach without the support of the UAV, employees still could react to the application to get the jobs and then to put data as the results of the manual observations directly into the system. Figure 5 presents our pipeline when obtained images. The AI model known as a convolutional neural network (CNN) [9] was employed for object detection and classification. This topic has recently gained lots of attention in the literature with a special attention on images collected by UAV. The author in [7] discussed multi-object detection for power transmission line inspection with a special focus on insulator faults. [13,14] applied either the R-CNN model (Region based convolutional neural networks) or the Faster R-CNN model for insulator detection (Faster region based convolutional neural networks). Other contributions was discovered to detect rip currents from the UAV images [17,18] based on the R-CNN model. For our context to detect 26 labels of objects: 18 labels of general objects and 8 labels of faulty objects, we proposed to use the YOLOv5 model [12]. 
This model is known as the best version compared with the previous versions among the YOLO series, it not only reduced the model size but also accelerated the speed of the training step and detection. Basically, YOLOv5 model was developed
Using Drone and AI Application
with four versions that differ in depth and width: YOLOv5x, YOLOv5l, YOLOv5m, and YOLOv5s. Because it achieves the highest mAP (mean average precision) among the pretrained checkpoints, YOLOv5x is preferred, as shown in Fig. 4.
Fig. 4. Performance of the YOLOv5 Models (https://github.com/ultralytics/yolov5) Compared with EfficientDet.
The assessment of objects is made either by the AI machine or by technical workers/professionals. For example, defective objects such as faulty insulators and corroded connectors are highlighted with colors or icons. The results are updated and stored in the database for subsequent tasks.
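The highlighting step described above can be illustrated with a minimal sketch. It assumes the detector returns a label and a confidence score per object; the label names, the `Detection` type, and the 0.5 threshold are hypothetical and are not taken from our system.

```python
from dataclasses import dataclass

# Hypothetical fault label names; the paper states only that 8 of the
# 26 labels denote faulty objects.
FAULT_LABELS = {"insulator_fault", "connector_corroded", "broken_conductor"}


@dataclass(frozen=True)
class Detection:
    label: str
    confidence: float  # detector score in [0, 1]


def flag_faults(detections, min_confidence=0.5):
    """Return the detections that should be highlighted as defects."""
    return [
        d for d in detections
        if d.label in FAULT_LABELS and d.confidence >= min_confidence
    ]


detections = [
    Detection("insulator", 0.93),
    Detection("insulator_fault", 0.81),
    Detection("connector_corroded", 0.42),  # below the assumed threshold
]
flagged = flag_faults(detections)
print([d.label for d in flagged])
```

In practice, low-confidence fault detections would more likely be routed to a technical worker for review than silently dropped, which matches the two assessment paths (AI machine or professional) described above.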
Fig. 5. The Inspection and Maintenance with the Support of AI Machine.
3.2
Applications for Information Management and Image Processing
This is the main part of our system and includes two key components: the management information system (MIS) for power transmission lines and the AI application system (AAS) for image processing and object detection.
D. C. Nguyen et al.
Fig. 6. The Interaction between the Management Information System and the Users.
– With the MIS, digital transformation is mainly applied here. To build this system, we first digitized documents such as job assignment forms, inspection checklists, evaluation forms, administrative forms, and architectural/structural drawings of towers and power lines. In addition, we designed a dashboard module that presents information visually to users at different levels of management. The dashboard can also generate diverse reports based on the business requirements of the PTC2.
– With the AAS, advanced AI and image processing techniques are embedded into the application. This platform helps technicians and engineers use the AI technologies conveniently. It provides a graphical interface with the following main functions: image/video management, image/video labeling, trained AI model management, and AI inference management.
The MIS and the AAS interact with each other. The input images/videos are first processed by the AAS. The results of the inspection tasks are returned to the MIS. As designed, the MIS then analyzes and visualizes the necessary information and presents it to the users through the dashboard.
3.3
Applications for Interaction with Users
Dashboards in the web- and mobile-based applications were designed for users to access and interact with the MIS and the AI applications. Based on each user's management role and/or level, the system grants access to the corresponding functions and information. For example, leaders or managers can observe overall information about the infrastructure, the progress of inspections and repairs, or the
detailed statistics and reports. Technical workers or engineers can, according to their duties, receive the corresponding tasks, input data, and create standard assessment reports, as illustrated in Fig. 6. In addition, in the AI module for image processing and object detection, specialized engineers can access, create, adjust, and update data sets and choose suitable labels for the objects. At the time of this submission, we had collected about 60 thousand images with 85 thousand labels for 26 objects such as insulators, corona rings, connectors, conductors, discharge guns, and so on. For detailed information, we refer readers to Appendix A.
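The role-dependent access described in this subsection can be sketched as a simple role-to-permission lookup. The role and permission names below are hypothetical and serve only to illustrate the idea; the deployed system may organize access differently.

```python
# Hypothetical roles and permissions, loosely following the examples
# given in the text (managers, technical workers, AI engineers).
ROLE_PERMISSIONS = {
    "manager": {"view_overview", "view_progress", "view_reports"},
    "technician": {"receive_tasks", "input_data", "create_assessment_report"},
    "ai_engineer": {"manage_datasets", "edit_labels"},
}


def can(role, action):
    """Return True if the given role is allowed to perform the action."""
    return action in ROLE_PERMISSIONS.get(role, set())


print(can("manager", "view_reports"))    # allowed
print(can("technician", "edit_labels"))  # not allowed
```

Unknown roles fall back to an empty permission set, so access is denied by default.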
4
Results
As mentioned, we developed the software for both the web and mobile platforms. For illustration, we present and discuss some of its main functions here. Figure 7 shows the dashboard on the mobile platform. This module allows users to follow the progress of tasks such as inspection, monitoring, and maintenance.
Fig. 7. The Dashboard with the Modules of General Situations of the Comprehensive Reports and Charts (The First Image), Inspection Report (The Second Image).
As a new contribution, the AI platform was embedded into the system. The platform will automatically analyze images and detect objects in them,
and then notify users if there are abnormal objects such as insulator faults or broken conductors. The interface of this module is shown in Fig. 8.
Fig. 8. The Interface to Analyze Images, and to use the AI Machine for Object Detection and then to Inform any Unusual Object.
This platform also includes several functions such as object label management, a label-adding tool, an AI model training tool, and a model inference tool (Fig. 9). As the key part of the AI platform, we evaluate here the performance of our trained model, which is based on the YOLOv5x model (Fig. 10). For the training step, the collected dataset contains 59911 images taken by flying-cam devices (Phantom 4 RTK) at very high resolutions in the range of 12–20 Mpx. The captured areas include mountains, forests, and flat terrain. For safety, the images were captured at least 5 m from the devices, from many different shooting angles. They were then automatically uploaded and labeled using our system. For more details about the dataset and our system, see Appendix A. For the testing step, we report the results in Appendix A with the confusion matrix, with precision (%), recall (%), and mean average precision (mAP%) over classes of about 70%, 10%, and 20%, respectively, and some visual examples of our model in Fig. 11.
We also surveyed nearly 2000 leaders/managers, technical workers, and administrative assistants from the PTC2 by questionnaire. The participants' ages ranged from 25 to 55. About 15% of the participants held a Master of Science degree. The others held bachelor's degrees in accountancy, business administration, information technology, electricity, or law. It is worth noting that all workers have the knowledge and skills to use very basic applications on web/mobile platforms. Our goal was to evaluate their satisfaction with our system after it had been deployed in the PTC2 for one year. Figure 12 shows the customer satisfaction indexes from the 15 survey questions.
Fig. 9. The Dataset Management and the Training AI Model Tools.
Fig. 10. Our Trained Model on the Training Set and the Validation Set over 26 Object Labels.
The Excellent index means that our system fit most of the digital transformation objectives within the expected time frame; 85% of participants agreed with this. The Good index means that our system met the majority of the digital transformation goals within the expected time frame, with 8% agreement. The remaining 7% or so believed that our system performed poorly on their tasks. A further evaluation is based on the different UAV scenarios used for the inspection tasks. The automatic observation was configured as in Fig. 13, using the automatic flight mode. Automatic flight control has been studied in works such as [15,16]. However, for commercial needs, we developed our own algorithms to configure the UAV. The other schemes controlled the UAV manually. In the experiments, we used two kinds of UAV: the Mavic 2 Zoom and the Phantom 4 RTK. We took measurements at several places and times and then computed the averages. Moreover, these tasks were carried out by a group of technical
Fig. 11. Some Visual Examples of our Model when Processed with the UAV.
Fig. 12. The Evaluation of Satisfaction with the Proposed System in the PTC2 after One Year of Use.
engineers. They had already been trained and had gained extensive experience in UAV control. Table 1 shows our results, which were obtained over several test runs. As shown, the automatic schemes outperformed the manual ones in terms of energy consumption. They could save from 15% to 20% of the battery power, which is a key performance factor of each UAV. Moreover, a large number of transmission towers are located in forests, in mountains, and even on rivers. The automatic configuration is therefore preferable. It is noticeable that our AI models performed well in most cases, with over 90%. They could detect all the objects in images obtained by the UAV. This results from good models and adequate training data. To measure the economic efficiency of the PTC2 when applying our system to their work, we worked with their Board of Directors and Department of Financial Planning. As a consequence, we obtained the results shown in Fig. 14. As explained, in the first period the company needed more time to become familiar with the system. Because the workers have different information technology backgrounds, the company had to train and educate them to adapt to the system. That explains why the benefits are only slight in
Fig. 13. The Scenario of the Automatic Flight Trajectory of the UAV.
Table 1. Performance Evaluation of the Different Scenarios with Automatic/Manual Flight Configurations
Routes                   UAVs        Energy consumption (%)
                                     Automatic   Manual
Da Nang - Thanh My       Mavic 2     30          50
                         Phantom 4   30          44
Thanh My - Pleiku        Mavic 2     30          46
                         Phantom 4   31          44
Da Nang - Doc Soi        Mavic 2     46          63
                         Phantom 4   48          67
Doc Soi - Pleiku         Mavic 2     54          76
                         Phantom 4   53          78
Quang Trach - Doc Soi    Mavic 2     68          87
                         Phantom 4   71          92
the short term. However, the values are expected to increase sharply in the long term. Our system could replace half of the labor used for inspection and maintenance tasks under the conventional approach.
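As a worked check of the energy comparison in Table 1, the following sketch computes the difference between manual and automatic consumption for each measurement pair. The numbers are copied from Table 1 (two pairs per route, one per UAV); expressing the saving in percentage points of battery is our assumption about how the comparison is meant.

```python
from statistics import mean

# (automatic %, manual %) pairs from Table 1, two per route.
TABLE_1 = {
    "Da Nang - Thanh My": [(30, 50), (30, 44)],
    "Thanh My - Pleiku": [(30, 46), (31, 44)],
    "Da Nang - Doc Soi": [(46, 63), (48, 67)],
    "Doc Soi - Pleiku": [(54, 76), (53, 78)],
    "Quang Trach - Doc Soi": [(68, 87), (71, 92)],
}

# Saving of the automatic scheme, in percentage points of battery.
savings = [
    manual - automatic
    for pairs in TABLE_1.values()
    for automatic, manual in pairs
]

print(f"min saving:  {min(savings)} points")
print(f"max saving:  {max(savings)} points")
print(f"mean saving: {mean(savings):.1f} points")
```

With these values the individual savings range from 13 to 25 percentage points, with a mean of 18.6, which is in line with the 15% to 20% range stated above.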
Fig. 14. The Cost Reduction of the Company since Applying our System in 2021 and the Predicted Benefits in the Following Years.
5
Conclusion and Future Works
In this paper, we presented a system that uses AI applications and UAVs as key technologies for digitalizing the inspection and maintenance of power line systems within the PTC2. Our system has been deployed in the PTC2 since 2021 and has brought many benefits to their business. The proposed system completely changed their conventional working methods, shaping many new processes with outstanding quality. However, during implementation, the proposed system showed a few weaknesses:
– The on-board battery capacity of the drones is limited; therefore, the inspection distances could not be very long.
– The 4G connection was sometimes interrupted in isolated areas such as in the mountains or in forests. As a result, images were downloaded slowly from the drone to the servers.
– The IT backgrounds of the nearly two thousand workers differ. Therefore, training and technology transfer were quite time-consuming.
As perspectives, wireless relay stations could be built to guarantee signal quality in remote places. Moreover, power transmission line inspection robots that hybridize climbing and flying have recently been demonstrated as a promising approach [10]. We could therefore investigate this trend to deal with the limitations of the inspection.
Acknowledgments. This research was funded by ThinkLABs R&D (https://thinklabs.com.vn/) and the sample data was supported by EVN NPTC2 (https://www.npt.com.vn) in Da Nang.
Appendix A Extended Results See Fig. 15.
Fig. 15. The Confusion Matrix of our Model when Inferencing Data.
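For readers who wish to derive per-class precision and recall from a confusion matrix such as the one in Fig. 15, a minimal sketch follows. It assumes rows are true classes and columns are predicted classes, which may differ from the layout of the figure; the 3x3 matrix is a toy example, not our results.

```python
def precision_recall(cm):
    """Per-class (precision, recall) from a square confusion matrix.

    Assumed convention: cm[i][j] counts objects of true class i that
    were predicted as class j.
    """
    n = len(cm)
    scores = []
    for k in range(n):
        true_positive = cm[k][k]
        predicted_as_k = sum(cm[i][k] for i in range(n))  # column sum
        actually_k = sum(cm[k])                           # row sum
        precision = true_positive / predicted_as_k if predicted_as_k else 0.0
        recall = true_positive / actually_k if actually_k else 0.0
        scores.append((precision, recall))
    return scores


# Toy matrix for three classes (illustrative values only).
toy_cm = [
    [8, 2, 0],
    [1, 6, 3],
    [1, 0, 9],
]
for k, (p, r) in enumerate(precision_recall(toy_cm)):
    print(f"class {k}: precision={p:.2f} recall={r:.2f}")
```

If the matrix is stored with predictions in rows instead, transposing it before calling the function gives the same result.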
References
1. Michailidi, E., Michailidis, H.: Digital transformation of small Greek companies during the Covid-19 pandemic. In: DASA, pp. 1103–1108. IEEE (2021)
2. Bican, P.M., Brem, A.: Digital business model, digital transformation, digital entrepreneurship: is there a sustainable “digital”? Sustainability 12(13) (2020)
3. Do, T.D., Pham, H.A.T., Thalassinos, E.I., Le, H.A.: The impact of digital transformation on performance: evidence from Vietnamese commercial banks. J. Risk Financ. Manag. 15(1) (2022)
4. Nguyen, T.T., et al.: Determinants of digital banking services in Vietnam: applying UTAUT2 model. Asian Econ. Financ. Rev. 10(6), 680–697 (2020)
5. Dang, T.H., Nguyen, T.A., Van, M.H., Santin, O.: Patient-centered care: transforming the health care system in Vietnam with support of digital health technology. J. Med. Internet Res. 23(6), e24601 (2021)
6. Ritter, T., Pedersen, C.L.: Digitization capability and the digitalization of business models in business-to-business firms: past, present, and future. Ind. Market. Manag. 86, 180–190 (2020)
7. Nguyen, D.C., Nguyen, T.C., Phan, D.H., Le, N.T.: Multi-object detection by using CNN for power transmission line inspection. In: INISCOM, pp. 337–347 (2021)
8. Priyono, A., Moin, A., Putri, V.N.A.O.: Identifying digital transformation paths in the business model of SMEs during the COVID-19 pandemic. J. Open Innov. Technol. 6(4), 104 (2020)
9. Alom, M.Z., Taha, T.M., Yakopcic, C., Westberg, S.: The history began from AlexNet: a comprehensive survey on deep learning approaches (2018). arXiv:1803.01164
10. Alhassan, A.B., Zhang, X.: Power transmission line inspection robots: a review, trends and challenges for future research. IJEPES (118) (2020)
11. Dung, N.T., Tri, N.M.: Digital transformation meets national development requirements. Linguist. Cult. Rev. 5(S2), 892–905 (2021)
12. Glenn, J., et al.: ultralytics/yolov5: v5.0 - YOLOv5-P6 1280 models, AWS, Supervisely and YouTube integrations, Zenodo (2021)
13. Kang, G., Gao, S., Yu, L., Zhang, D.: Deep architecture for high-speed railway insulator surface defect detection: denoising autoencoder with multitask learning. IEEE Trans. Instrum. Meas. 68(8), 2679–2690 (2018)
14. Ma, L., et al.: Detection method of insulator based on Faster R-CNN. In: Annual International Conference on CYBER Technology in Automation, Control, and Intelligent Systems (CYBER), pp. 1410–1414 (2017)
15.
Zaludin, Z., Gires, E.: Automatic flight control requirements for transition flight phases when converting long endurance fixed wing UAV to VTOL aircraft. In: I2CACIS, pp. 273–278. IEEE (2019)
16. Erdelj, M., et al.: UAVs that fly forever: uninterrupted structural inspection through automatic UAV replacement. Ad Hoc Netw. 94, 101612 (2019)
17. Sun, A.: UAV-video-based rip current detection in near-shore areas. Ph.D. dissertation, University of Miami (2020)
18. de Silva, A., Mori, I., Dusek, G., Davis, J., Pang, A.: Automated rip current detection with region based convolutional neural networks. Coastal Eng. 166 (2021)
19. Jenssen, R., Roverso, D.: Automatic autonomous vision-based power line inspection: a review of current status and the potential role of deep learning. Int. J. Electr. Power Energy Syst. 99, 107–120 (2018)
20. Kim, S., Kim, D., Jeong, S., Ham, J.W., Lee, J.K., Oh, K.Y.: Fault diagnosis of power transmission lines using a UAV-mounted smart inspection system. IEEE Access 8, 149999–150009 (2020)
Artificial Intelligence Traffic Analysis Framework for Smart Cities
Monther Tarawneh, Faisal AlZyoud, and Yousef Sharrab
Computer Science Department, Isra University, Amman, Jordan
{mtarawneh,faisal.alzyoud,Sharrab}@iu.edu.jo
Abstract. The use of artificial intelligence and the transition to smart cities help improve mobility within cities by enhancing traffic flow and decreasing the number of traffic accident deaths. Many studies have addressed traffic management in urban cities by modifying the main dimensions of smart cities in mobility management and by using artificial intelligence for intelligent decision-making. Using networks to build the infrastructure is a major factor in a successful smart city. It enables a variety of applications to run with unlimited storage space and data analysis through cloud and edge computing. Vehicles are becoming smart so that they can operate smartly in a smart city. However, driver safety is the main concern, and remote monitoring of drivers and vehicles is the only way to have a smart city without accidents. In this research, we propose a framework that analyzes traffic data in time to reduce the number of accidents and save people's lives. The framework uses IoT devices and artificial intelligence to diagnose the driver, car performance, and road condition. The simulation shows that the proposed framework is a great help in saving drivers' lives by monitoring them remotely, and in monitoring car performance to reduce cost and prevent damage to the car. In addition, it can be used by drivers to check road status and by governments to control traffic.
Keywords: Artificial Intelligence (AI) · Smart Cities · Sustainability · Traffic Accidents · VANETs · Remote Monitoring
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): SAI 2023, LNNS 711, pp. 699–711, 2023. https://doi.org/10.1007/978-3-031-37717-4_45
1 Introduction
Artificial intelligence is the branch of science that uses computers to mimic and emulate the human brain in solving complex problems, producing intelligent solutions with minimum constraints. John McCarthy, a computer scientist, launched the concept of Artificial Intelligence (AI), defining it as “the science and engineering of making intelligent machines” [1]. Nowadays, AI is applied in all aspects of our lives and has become central to them. Deep learning, machine learning, and game theory are the most used branches of AI. AI can assist in developing several sectors of human life, such as electronic governance and citizen participation, which help fulfill the desired objectives of the smart city, as depicted in Fig. 1. AI already has a positive impact on different activities in our lives. As a result, much research has been launched to implement and apply
Fig. 1. AI Implementation Fields in Smart Cities
AI in most of society's activities [2, 3]. These studies have received great attention, since AI contributes to the sustainability of future smart city development [4]. Since the 19th century, urban planners have organized cities in a manner that utilizes natural resources and enhances quality of life through sustainable development, which needs immediate action from governments, industry, and society as a whole [5]. A smart city includes a wide range of applications in smart transportation, smart health care, smart parking, smart education, and smart government services. These applications require sensors, big data processing, communications, and security [20]. The wide spread of smart devices and AI, together with the development of communications through 5G, contributes to the transformation of conventional cities into smart cities. AI and smart city concepts address these problems through the main dimensions on which smart cities are built: government or public administration development; inhabitants' qualification and satisfaction; economic competitiveness; quality of lifestyle and healthcare; environmental pollution and energy consumption; and mobility and public transportation, as shown in Fig. 2. AI helps in constructing smart cities, as it enhances resource management and improves sustainability by addressing energy and pollution problems through efficient use of city resources [6]. Population growth around the world has an impact on large urban cities, since it increases the number of vehicles, CO2 emissions, noise, and other problems.
Traffic is considered a main challenge for most governments around the world, since it causes health and economic problems in addition to lost productivity and
Fig. 2. Smart Cities Main Dimensions
an increased number of deaths. The spread of IoT devices and improved communication, combined with AI, has a positive effect on constructing smart cities that will replace existing urban cities, since AI can help predict traffic jams, driver behavior, and the probability of accidents. These predictions will help planners design solutions to eliminate these problems. A smart transportation system requires vehicles to be made smart in order to improve driving and reduce the number of accidents. New smart vehicles include devices such as sensors as well as communication and processing capabilities.
2 Literature Review
The vehicle industry is growing rapidly all over the world, and demand for transport is high. Vehicles greatly benefit our daily lives and include up-to-date technology and entertainment beyond our expectations. However, the increased use of vehicles increases the number of accidents, which cause around one million deaths [21] every year. In addition, road traffic injuries are projected to become the fifth leading cause of mortality by 2030, according to a World Health Organization (WHO) report on road safety [22]. It is known that 95% of accidents are caused by human error. Traffic control inside urban cities is a bottleneck for most smart city developers, who are therefore concerned with smooth traffic flow to reduce transportation time and with road safety to minimize the number of accidents; these issues are considered the main challenges for Intelligent Transportation Systems (ITSs) [7]. Several studies handle traffic congestion using sensors and IoT devices together with new communication technologies [8, 9]. Most of these studies develop dynamic traffic control scenarios that route vehicles using smart solutions to enhance traffic flow on roads and relieve traffic jams and city gridlock. Networks of vehicles with sensors and physical remote access devices, connected via the internet to reduce human involvement, have been proposed as applications of vehicular ad hoc networks (VANETs), although these are hampered by the rapid movement changes of
VANET nodes [10, 11]. Using wireless mobile ad hoc networks or VANETs in traffic management, either Vehicle-to-Vehicle (V2V) or Vehicle-to-Roadside (V2R), helps prevent accidents and improves monitoring, as vehicles gain AI-based self-management. The main objectives of VANET implementation are driver safety and traffic jam minimization [12], but this requires all vehicles on the road to be able to communicate. Machine learning (ML) is an application of AI. It is used in vehicle management to enable intelligent systems that learn from previously stored data and make decisions without human assistance or explicit programming [13]. ML helps minimize travel cost and increase traffic flow, which reduces travel time delay. Data fusion has been proposed to raise decision accuracy in intelligent traffic systems, since it combines data from one or multiple sources to produce a single intelligent decision [14, 20]. A smart solution for accident management was proposed using the CupCarbon simulator, which provides a good simulation technique for smart city and IoT applications. That research aims to improve secure operation, with Jordan as a case study, by determining the minimum time delay for rescue operations; it combines knowledge representation techniques with heuristic programming using answer set programming [15]. A Convolutional Neural Network (CNN) for spatial data, using a camera on an Unmanned Aerial Vehicle, was developed together with Long Short-Term Memory (LSTM) frameworks to monitor the traffic system [16, 17]; the achieved accuracy was satisfactory. Most of the proposed traffic management approaches suffer from communication delays between vehicles and roadside units and from a lack of smooth traffic flow, which decreases road safety.
In this research, we build an AI model to extract the main factors behind accidents after cleaning the large set of car accident data collected from traffic accident records in Jordan for 2016–2020 [18]. The driver is the main factor in preventing accidents. Driver inattention is any event that causes the driver to lose attention to the surrounding road environment [23]. Reading the driver's vital signs could help prevent accidents. An unpredictable health condition may arise at any time, so monitoring the driver's health status is important to avoid deaths and reduce the number of accidents. Vital signs are collected by embedded sensors, which send all readings to a cloud server. Sensors are used to detect many things inside and outside the car, such as the number of occupied seats, outside/inside temperature, passenger weight, the driver's heart rate, and more. Furthermore, real-time cameras focused on the driver's face are used to detect tiredness: any changes in the mouth, eyes, or head that indicate tiredness alert the driver immediately without any further action [24]. In the same way, sensors and cameras are embedded in smart cars to detect fuel level, distance between cars, engine temperature, oil level, and more. A smart car monitoring system is useful for preventing road accidents and reducing maintenance cost. Most of the proposed systems are based on AI and IoT [25]. Road type and condition affect the driver's health and car performance. Cameras are used to detect moving objects around the car and their speed to
prevent collisions [26]. Automated driving in foggy environments has been implemented based on image enhancement techniques that increase visibility in bad weather [28]. Many separate studies aim to provide safe driving and reduce accidents. In this paper, we combine three major objects that affect driving: the driver, the car, and the road. We then design a framework that collects data about the three objects in order to analyze it and take action or warn the driver. The framework helps drivers share their data with others to maintain a safe driving environment in smart cities.
3 System Framework
A smart city may have a variety of smart services such as transportation, health, lighting, parking, and more. These smart applications use the same smart city architecture. Smart transportation is an important part of a smart city, and traffic is becoming more and more complex over time. The smart car system is a solution that reduces traffic, collects data, helps eliminate traffic accidents, and helps people reach their destinations quickly and safely. The proposed framework for the smart car system is shown in Fig. 3.
Fig. 3. Smart Car System Architecture
The three phases of the proposed smart car system are consistent with the architecture of a smart city, so the system can be embedded easily in the smart city model. The three phases are independent: any phase can be modified without affecting the others.
1. Collection Phase: Sensors embedded in the car collect a variety of information, which is gathered in an internal gateway and then forwarded to cloud storage.
• Driver Condition: We expect that most future cars will be able to read the driver's status through cameras and sensors on the dashboard and seats. These sensors track tiredness and health signals such as temperature, heartbeat, blood pressure, and sugar level. In case of unusual readings such as closed eyes or high blood pressure, the car may take over and park safely. Driver-monitoring systems (DMS) are implemented in all smart cars [27].
• Car Condition: Sensors are embedded in different places to collect information about the car itself, such as engine temperature, water level, oil level, fuel level, tire pressure, and more.
• Road Condition: This data includes the weather conditions on the road, the road surface, whether it is dark or light, whether the road is narrow or wide, the speed limit, and whether the road is curved or straight. All this data could be used to prevent accidents and advise the driver on the right road.
• Location: The location is used in the computing phase to plot the data on the road map using Google Maps coordinates. This data is used to derive statistics for that location on the road, such as accident probability, road status at a specific time of day, and natural disasters at a specific time.
2. Security Phase: The integrity and confidentiality of the data are important, and security is essential to protect data confidentiality and privacy. Each car should have a private key that is used to encrypt data before transmission. The cloud storage identifies the car and looks up its public key to decrypt the data for processing in the computing phase. However, security is not the focus of this paper.
3. Computing Phase: This is the core of the framework, where all the collected data is filtered and analyzed. The cloud storage acts like a data center for a smart city. Processing this big data, which comes from different sources, is time-consuming. Therefore, it is performed on distributed cloud processors in order to increase throughput and data availability. The computing phase has the following stages:
• Data Preparation: The cloud storage receives a huge amount of data from different sources. Given this quantity of data, the focus is on data quality: data should be stored at the highest quality before processing.
First, clean the data of missing values, remove fixed values that will not affect the analysis, delete repeated data from different sources for the same location and time, and convert the data into XML format:
XML was chosen because it is standard and easy to read and process.
• Classification: Data needs to be organized in a way that makes processing easier. Data is classified into three main categories: driver, road, and car. The above formatted
features in the XML will be merged from different sources and classified into car, road, or driver. The XML file will be as below.
…..
…..
…..
• Data Analysis: The data organized in the XML file is analyzed to make a decision related to the driver or the car, or to plot points on the map. We can consider the XML file as the DNA of the driver or of the car: if there is an unusual mutation in the sequence, a decision has to be made based on the situation. From the collected data, we can track the driver's condition. The driver's signs over some period are compared to current data to see whether there is tiredness. In addition, health data such as temperature, heartbeat, blood pressure, and sugar level can indicate whether the driver is able to drive safely. This can be embedded in smart cars so that they pull over safely or start automated driving. Car information such as engine temperature, water level, oil level, fuel level, tire pressure, and more is used to predict the car's ability to drive and to warn the driver to take action on the problem. However, if the driver continues and the data shows that the fuel is not enough to reach the next station, or that the water
level may raise the car temperature, a smart pilot system may be used to take action to prevent major car problems or accidents. Road condition and location are useful for specifying the speed and conditions that help the driver pass this location safely. All this is based on previously collected data. Figure 4 describes the analysis part of the system.
Fig. 4. Analysis Part of the Proposed Computing Layer
Data analysis is not limited to driver safety, car care, and recording road conditions to prevent accidents. It can also support other applications, such as an automatic traffic ticketing system for unfastened seat belts, mobile phone use while driving, and speeding, or a remote health monitoring system for patients in which the car may drive to a hospital in case of emergency. In addition, the data are plotted on Google Maps using the collected location coordinates to draw statistics for every location and to inform drivers about dangerous locations.
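The pre-processing and classification steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature names, the `CATEGORY_OF` mapping, the record layout and the XML tag names are all hypothetical, and the removal of constant features is left out for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from feature name to category (assumption, not the paper's schema).
CATEGORY_OF = {
    "heart_rate": "driver",
    "temperature": "driver",
    "fuel_level": "car",
    "tyre_pressure": "car",
    "lanes": "road",
    "rain": "road",
}


def clean(readings):
    """Drop readings with missing values and repeated readings
    for the same location, time and feature."""
    seen = set()
    kept = []
    for r in readings:
        if r.get("value") is None:
            continue  # missing value
        key = (r["location"], r["time"], r["feature"])
        if key in seen:
            continue  # same reading reported by another source
        seen.add(key)
        kept.append(r)
    return kept


def to_xml(readings):
    """Group cleaned readings under <driver>, <road> and <car> elements."""
    root = ET.Element("record")
    groups = {name: ET.SubElement(root, name) for name in ("driver", "road", "car")}
    for r in clean(readings):
        category = CATEGORY_OF.get(r["feature"])
        if category is None:
            continue  # unknown features are skipped in this sketch
        item = ET.SubElement(groups[category], r["feature"],
                             location=r["location"], time=r["time"])
        item.text = str(r["value"])
    return root


readings = [
    {"feature": "heart_rate", "value": 72, "location": "L1", "time": "08:00"},
    {"feature": "heart_rate", "value": 72, "location": "L1", "time": "08:00"},   # duplicate
    {"feature": "fuel_level", "value": None, "location": "L1", "time": "08:00"},  # missing
    {"feature": "rain", "value": 1, "location": "L1", "time": "08:00"},
]
xml_text = ET.tostring(to_xml(readings), encoding="unicode")
```

Keeping the three categories as fixed top-level elements means later analysis code can address, for example, all driver readings with a single path expression.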
4 Results and Discussion
The main aim of applying smart technology is to save people's lives. We expect more applications of technology in all fields in the future, an effort that has been launched under the name of smart cities. A smart city should have a data center that stores a variety of data from different sources and analyses them in order to make decisions in all fields, such as remote health care, energy saving, education, economy, and more. The UAE leads the region in applying smart technology in most fields. Traffic accident records in Jordan for 2016–2020 indicate that the number of vehicles reached 1729343 and that 117743 transit vehicles entered Jordan [18]. The study shows that the main factors for accidents in Jordan are related to driver status, road condition, vehicle condition, and weather condition. Figure 5 compares the numbers of accidents in Jordan during 2016–2020. The population and the number of vehicles increased exponentially during the studied interval, causing an increase in the number of road accidents, injuries and deaths. In 2020, there was a large decrease in the number of road accidents due to the country's lockdown in response to COVID-19. A large amount of data was obtained from the statistical department in Jordan. The data show that the main factors are related to driver condition, car condition, and road condition. The driver's attention is affected by his or her health status. The car has to be in good condition in order to have a safe trip. The road may have different conditions that can cause accidents, such as signs, number of lanes, obstacles, sand and rain. Moreover, the road condition affects the driver's status and attention.
Fig. 5. Road Accidents Information during 2016–2020 in Jordan
The simulations of the framework were conducted in Cisco Packet Tracer. Python was used to implement the security and computing phases. The Packet Tracer environment enabled us to design the framework and run simulations on random readings for driver health signs (temperature, blood pressure, heart rate, glucose level, and tiredness), car condition (temperature, speed, weight, fuel, tube air, and balance), and road condition (number of lanes, sand, rain, snow, heat and crowding). We used three road locations and three cars, and randomly assigned different values at different times for each key. Figure 6 shows part of the randomly generated XML file. The keys and values in the generated XML file were used in the computing phase, which performed well on the simulated values. The computing phase analyses the driver features first because the driver is the main cause of accidents. It gives an initial diagnosis by comparing the collected health readings with normal values and then recommends an action, as shown in Table 1. After that, the car and road condition readings are combined to indicate whether the trip is safe or not. However, most of the actions in this part are warnings and reports.
Fig. 6. Randomly Generated XML File
Table 1. Sample of the Computing Phase
Vital signs | Value    | Prediction | Action
Heart rate  | 190      | illness    | Auto driver to hospitals
Temperature | 38       | normal     |
Sugar level | 400      | dizzy      |
Weight      | 85       | –          |
Age         | 45       | –          |
History     | Diabetes | –          |
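The kind of rule-based check that Table 1 illustrates can be sketched as below. The thresholds, the `fever` label and the rule that any abnormal reading triggers the hospital action are assumptions chosen only so that the sketch reproduces the sample rows; they are neither clinical guidance nor values taken from the paper.

```python
# Illustrative thresholds only (assumptions, not clinical guidance, not from the paper).
RULES = {
    "heart_rate": lambda v: "illness" if v > 150 else "normal",
    "temperature": lambda v: "normal" if v <= 38 else "fever",
    "sugar_level": lambda v: "dizzy" if v > 300 else "normal",
}


def diagnose(readings):
    """Return a prediction per vital sign for which a rule exists."""
    return {name: RULES[name](value)
            for name, value in readings.items() if name in RULES}


def recommend(predictions):
    """Hypothetical policy: any abnormal prediction triggers the hospital action."""
    if any(label != "normal" for label in predictions.values()):
        return "Auto driver to hospitals"
    return "No action"


sample = {"heart_rate": 190, "temperature": 38, "sugar_level": 400, "weight": 85}
predictions = diagnose(sample)
action = recommend(predictions)
```

Readings without a rule, such as weight, age or history, are ignored here, which mirrors the dashes in the Prediction column.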
However, accuracy and performance cannot be tested on simulated data. Research will continue in order to apply this framework, partially or fully, in a real setting. A driver can extract information about the safety of the road at a specific location. Figure 7 shows the risk for the three locations. The data are taken from the randomly generated XML file. A driver may display statistics for one location only. Data can be plotted on a map to provide an interactive view.
Fig. 7. Risk Analysis for Three Locations all Day Times
The percentage is calculated using Eq. 1. The driver is given more weight because the driver is the main factor in accidents and is affected by the other factors.
\[
\text{risk} = \frac{\text{roadrisk} + (\text{driverrisk} \times 4) + \text{carrisk}}{6} \tag{1}
\]
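Equation 1 is a weighted mean in which the driver risk counts four times as much as the road and car risks (weights 1, 4 and 1, which sum to the denominator 6). A minimal sketch, assuming all three inputs are expressed on the same scale (for example, fractions between 0 and 1); the function and parameter names are illustrative:

```python
def risk(road_risk, driver_risk, car_risk):
    """Weighted risk from Eq. 1: the driver risk carries weight 4,
    road and car risk carry weight 1 each, normalised by 6."""
    return (road_risk + 4 * driver_risk + car_risk) / 6


# Example: a risky driver dominates the combined score.
combined = risk(road_risk=0.3, driver_risk=0.9, car_risk=0.3)
```

Because the weights sum to 6, equal inputs return the same value, so the result stays on the scale of the inputs.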
5 Conclusion and Future Work
The spread of wireless sensors and the proliferation of IoT devices, driven by the rapid development of communication technologies, has increased the tendency to develop urban cities into smart ones. Smart cities are developed along several dimensions that maintain sustainability and improve the environment, economy, mobility, quality of life, and public administration. Many frameworks have been built to help urban cities achieve smart city goals [19]. Traffic management is a crucial factor in city development, as it has a direct influence on the economy, productivity, the environment, and the quality of life of inhabitants. In this research, we used AI techniques to extract the main reasons for car accidents, with Jordan as a case study. These reasons can guide smart city planners in finding ways to minimize the number of accidents, which will enhance mobility in cities. We have proposed a traffic data analysis framework that uses AI to diagnose data collected on driver, car, and road conditions. The system compares previous values with current values and gives recommendations based on the given facts. A simulation was carried out in the Cisco Packet Tracer environment, where the computing phase was implemented using Java. The simulation suggests that the proposed framework could help reduce the number of accidents. In the future, we are going to extend our research to study network problems and solutions and focus more on the security layer.
References
1. Srivastava, S., Bisht, A., Narayan, N.: Safety and security in smart cities using artificial intelligence—a review. In: 2017 7th International Conference on Cloud Computing, Data Science & Engineering-Confluence. IEEE (2017)
2. Gonzalez, R.A., et al.: Government and governance in intelligent cities, smart transportation study case in Bogotá Colombia. Ain Shams Eng. J. 11, 25–34 (2020)
3. Floridi, L., Cowls, J.: A unified framework of five principles for AI in society. In: Machine Learning and the City: Applications in Architecture and Urban Design, pp. 535–545 (2022)
4. Navarathna, P.J., Malagi, V.P.: Artificial intelligence in smart city analysis. In: 2018 International Conference on Smart Systems and Inventive Technology (ICSSIT). IEEE (2018)
5. Silvestre, B.S., Tîrcă, D.M.: Innovations for sustainable development: Moving toward a sustainable future. J. Clean. Prod. 208, 325–332 (2019)
6. Ortega-Fernández, A., Martín-Rojas, R., García-Morales, V.J.: Artificial intelligence in the urban environment: smart cities as models for developing innovation and sustainability. Sustainability 12(19), 7860 (2020)
7. Saleem, M., et al.: Smart cities: Fusion-based intelligent traffic congestion control system for vehicular networks using machine learning techniques. Egyptian Inf. J. 23(3), 417–426 (2022)
8. Nwankwo, W., Olayinka, A.S., Ukhurebor, K.E.: The urban traffic congestion problem in Benin City and the search for an ICT-improved solution. Int. J. Sci. Technol. 8(12), 65–72 (2019)
9. Ullo, S.L., Sinha, G.R.: Advances in smart environment monitoring systems using IoT and sensors. Sensors 20(11), 3113 (2020)
10. Qureshi, K.N., et al.: Internet of vehicles: Key technologies, network model, solutions and challenges with future aspects. IEEE Trans. Intell. Transp. Syst. 22(3), 1777–1786 (2020)
11. Günay, F.B., Öztürk, E., Çavdar, Y.T., Hanay, S., Khan, A.U.R.: Vehicular ad hoc network (VANET) localization techniques: a survey. Arch. Comput. Meth. Eng. 28(4), 3001–3033 (2020). https://doi.org/10.1007/s11831-020-09487-1
12. Yan, G., Rawat, D.B.: Vehicle-to-vehicle connectivity analysis for vehicular ad-hoc networks. Ad Hoc Netw. 58(1), 25–35 (2017)
13. Meena, G., Sharma, D., Mahrishi, M.: Traffic prediction for intelligent transportation system using machine learning. In: 3rd International Conference on Emerging Technologies in Computer Engineering: Machine Learning and Internet of Things (ICETCE), pp. 145–148
14. Meng, T., Jing, X., Yan, Z., Pedrycz, W.: A survey on machine learning for data fusion. Inform. Fusion 57(1), 115–129 (2020)
15. Alzyoud, F., Sharman, N.A.L., Al-Roosan, T., Alsalah, Y.: Smart accident management in Jordan using cup carbon simulation. Eur. J. Sci. Res. 152, 128–135 (2019)
16. Khan, Q.T.A., Abbas, S., Khan, M.A., Fatima, A., Alanazi, S., et al.: Modelling intelligent driving behaviour using machine learning. Comput. Mater. Continua. 68(3), 3061–3077 (2021)
17. Tabassum, N., Ditta, A., Alyas, T., Abbas, S., Alquhayz, H., et al.: Prediction of cloud ranking in a hyperconverged cloud ecosystem using machine learning. Comp. Mater. Continua. 67(1), 3129–3141 (2021)
18. Al-Rousan, T.M., Umar, A.A., Al-Omari, A.A.: Characteristics of crashes caused by distracted driving on rural and suburban roadways in Jordan. Infrastructures 6(8), 107 (2021)
19. Sharma, B., Maherchandani, J.K.: Review of recent developments in sustainable traffic management system. In: Reddy, A.N.R., Marla, D., Favorskaya, M.N., Satapathy, S.C. (eds.) Intelligent Manufacturing and Energy Sustainability. SIST, vol. 265, pp. 401–409. Springer, Singapore (2022). https://doi.org/10.1007/978-981-16-6482-3_40
20. Habibzadeh, H., Soyata, T., Kantarci, B., Boukerche, A., Kaptan, C.: Sensing, communication and security planes: a new challenge for a smart city system design. Comput. Netw. 144, 163–200 (2018)
21. Savaş, B.K., Becerikli, Y.: Real time driver fatigue detection system based on multi-task ConNN. IEEE Access 8, 12491–12498 (2020)
22. World Health Organization: World report on road traffic injury prevention: summary. In: World Report on Road Traffic Injury Prevention: Summary, pp. ix–52. (2004)
23. Regan, M.A., Lee, J.D., Young, K.: Driver Distraction: Theory, Effects, and Mitigation. CRC Press, Boca Raton (2008)
24. Raju, J.V.V.S.N., Rakesh, P., Neelima, N.: Driver drowsiness monitoring system. In: Reddy, A.N.R., Marla, D., Simic, M., Favorskaya, M.N., Satapathy, S.C. (eds.) Intelligent Manufacturing and Energy Sustainability. SIST, vol. 169, pp. 675–683. Springer, Singapore (2020). https://doi.org/10.1007/978-981-15-1616-0_65
25. Bedi, P., Goyal, S.B., Kumar, J., Choudhary, S.: Smart automobile health monitoring system. In: Kumar, R., Sharma, R., Pattnaik, P.K. (eds.) Multimedia Technologies in the Internet of Things Environment, Volume 2. SBD, vol. 93, pp. 127–146. Springer, Singapore (2022). https://doi.org/10.1007/978-981-16-3828-2_7
26. Singh, T., Sheikh, F., Sharma, A., Pandya, R., Singh, A.: A smart driver assistance system for accident prevention. In: Chen, J.I.-Z., Wang, H., Du K-L, V., Suma (eds.) Machine Learning and Autonomous Systems: Proceedings of ICMLAS 2021, pp. 255–269. Springer Nature Singapore, Singapore (2022). https://doi.org/10.1007/978-981-16-7996-4_18
27. Kumari, S., et al.: Intelligent driving system at opencast mines during foggy weather. Int. J. Min. Reclam. Environ. 36(3), 196–217 (2022)
28. Nees, M., Liu, C.: Mental Models of Driver Monitoring Systems: Perceptions of Monitoring Capabilities. Transp. Res. Part F Traffic Psychol. 91, 484–498 (2022)
Predictability and Comprehensibility in Post-Hoc XAI Methods: A User-Centered Analysis
Anahid Jalali¹(✉), Bernhard Haslhofer², Simone Kriglstein¹,³, and Andreas Rauber⁴
¹ Austrian Institute of Technology (AIT), Giefinggasse 4, 1210 Vienna, Austria, {anahid.jalali,simone.kriglstein}@ait.ac.at
² Vienna Complexity Science Hub, Vienna, Austria, [email protected]
³ Masaryk University, Brno, Czech Republic
⁴ Vienna University of Technology, Vienna, Austria, [email protected]
Abstract. Post-hoc explainability methods aim to clarify predictions of black-box machine learning models. However, it is still largely unclear how well users comprehend the provided explanations and whether these increase the users’ ability to predict the model behavior. We approach this question by conducting a user study to evaluate comprehensibility and predictability in two widely used tools: LIME and SHAP. Moreover, we investigate the effect of counterfactual explanations and misclassifications on users’ ability to understand and predict the model behavior. We find that the comprehensibility of SHAP is significantly reduced when explanations are provided for samples near a model’s decision boundary. Furthermore, we find that counterfactual explanations and misclassifications can significantly increase the users’ understanding of how a machine learning model is making decisions. Based on our findings, we also derive design recommendations for future post-hoc explainability methods with increased comprehensibility and predictability.
Keywords: eXplainable Artificial Intelligence · Machine Learning Interpretability · Human Computer Interaction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): SAI 2023, LNNS 711, pp. 712–733, 2023. https://doi.org/10.1007/978-3-031-37717-4_46
1 Introduction
The opacity of machine learning models is a well-known problem in application areas such as health care systems, financial services, or industrial applications [2], where transparency and accountability are fundamental requirements. When models are not transparent, users and machine learning (ML) experts have difficulty explaining how models arrive at their predictions. Therefore, ongoing research in the field of Interpretable Machine Learning (IML), also known as eXplainable Artificial Intelligence (XAI), focuses on implementing methods that
either examine the inner structure of models (model-specific) or explain predictions of a trained model based on a training dataset (post-hoc) [21]. Well-known post-hoc explanation techniques are Local Interpretable Model-Agnostic Explanations (LIME) [26], DeepLift [31], COVAR [9], and SHapley Additive exPlanations (SHAP) [15]. They provide different types of explanations to the user and aim at fulfilling the goals of interpretability, which are: following a model’s prediction for a given dataset, being easy to comprehend, and being efficient [28]. However, as [2] stated, there is ambiguity in the definition of interpretability and in how it should be measured and evaluated. The authors further argued that application-grounded evaluation is the most appropriate as “it assesses interpretability in the end goal with the end-users”. Several user studies have already evaluated model interpretations and explanations: [26,27] measured comprehensibility and trust by asking users to explain the best model, the most suitable features, as well as model behaviors and irregularities. Other studies focused on measuring comprehensiveness [23,24], usefulness [19,26], and trustworthiness of explanations [13,23,24,29]. However, they all evaluated a single XAI method and either aimed to improve its accuracy for a particular task or manipulated explanations to mislead users and measure their bias when presenting a fidelity or model accuracy score for a given task. Furthermore, they mainly studied annotation tasks on image or text datasets. Our work is motivated by [29], who measured comprehensibility and trust based on the users’ interaction time in a text classification task. However, it remains unclear how users make their judgments: do they blindly follow the provided model explanations, or do they base their judgments on the real-world meaning of data points, which were words in this case?
Furthermore, current studies do not infer recommendations that could inform future, improved model explainability methods. The need for XAI user studies has been pointed out by [12], who argue that XAI techniques must align with the mental model of ML practitioners. Also, [18] stated that “the ultimate goal is for people (experts and/or users) to understand the models, and it is, therefore, essential to involve human feedback and reasoning as a requisite component for design and evaluation of interpretable-ML systems.” We acknowledge the work of Jacovi et al. [11] on formalizing human-AI trust, in which they stated “defining trust as the user’s attempt to predict the impact of the model behavior under risk and uncertainty [10] is a goal but not necessarily a symptom of trust”. Moreover, Mohseni et al. [20] define one of the desired properties of explainer systems as “predictability”, which is the ability of these systems to “support building a mental model of the system that enables user to predict system behavior.” Therefore, in this work, we aim to measure comprehensibility and predictability in order to compare two well-known and widely used XAI approaches, SHAP and LIME. For this purpose, we first refine our notions of comprehensibility and predictability in classification tasks as follows:
Comprehensibility: denotes the user’s ability to transfer information on feature contributions obtained from model explanations across samples of the same class.
Previous studies [5,14,26,29] treat trust as “trust in model correctness” and evaluate the user’s ability to guess a sample’s label correctly, given model explanations. We additionally consider the notions of predictability [20] and simulation [8], which is the user’s ability to guess a model’s prediction on a new sample correctly, and broaden the definition of predictability as follows.
Predictability: denotes the explainer’s ability to support users in anticipating model predictions, which can be correct or incorrect, on a new sample, given model explanations for a sample that the model predicted correctly with high confidence.
In the following, we aim to evaluate SHAP and LIME explanations qualitatively and quantitatively as part of a user study. We formulate our research questions as follows:
RQ1. To what extent do users comprehend the explanations provided by different XAI methods, and are they able to predict the decision made by the model?
When interacting with these tools, we noticed that it is essential to understand how features impact a model decision and that this is easier to comprehend when a model is more confident about a decision. We tested these hypotheses and answered the above question in a comparative study, which we describe in more detail in Section RQ1: Comprehensibility. We found evidence that supports our hypotheses for SHAP. Furthermore, we observed that users who understand model predictions for the two different classes found it easier to classify unexplained and unlabeled samples. Therefore, we hypothesize that users’ predictability increases when they can classify new samples using the explanations from samples of different classes. In Section RQ1: Predictability, we elaborate on this in more detail and show that users are able to predict the decisions made by the model equally well with SHAP and LIME.
Given these findings, we further investigate the effect of counterfactual and misclassified samples on the users’ ability to predict the model’s decision. More precisely, we consider the following research question:
RQ2. To what extent can visualizations of counterfactual and misclassified samples improve the user’s predictability?
We answered this question by testing the following hypothesis: adding explanations of misclassified and counterfactual samples can improve predictability and support anticipating the model’s behavior. We describe our experiment in more detail in Section RQ2: Improving Predictability with Visualizations and report evidence that supports this hypothesis for both SHAP and LIME. We found that users have higher predictability with LIME than with SHAP explanations. Moreover, throughout our experiments, we asked users to provide their subjective feedback on the given explanations. Their responses allowed us to answer our third and final research question:
RQ3. To what extent can visualizations of local XAI explanations guide users in finding global explanations?
Throughout our experiments, we collected user feedback for both LIME and SHAP and asked the participants to explain possible shortcomings and sources of confusion. As detailed in Section RQ3: Qualitative Analysis, we observed that negative feedback provides valuable insight into how users interpret explanations. For example, the users complained about inconsistent explanations of values and different scaling of visualizations. Our experiments showed that this reduced the users’ ability to understand the behavior of a model.
In the next section, Related Work, we summarize existing XAI evaluation approaches and highlight our contribution relative to them. In Section Methodology, we describe our methodology for measuring the comprehensibility and predictability of the explanations. Each of the sections that follow the Methodology, as mentioned under each RQ above, analyses the results of our user study. The findings of our comparative user study can substantially contribute to the design of new XAI methods or the refinement of existing ones; therefore, we propose a set of design recommendations, which we discuss in more detail in Section Discussion. We close by summarizing our findings and our recommendations for designing XAI outputs in Section Conclusion.
2 Related Work
Previous studies, such as [2], [37], [35] or [32], discuss the various properties of explainability and define evaluation criteria for qualitative (e.g. measuring explanations’ comprehensiveness and trustworthiness) and quantitative (e.g. measuring explanations’ accuracy, fidelity and consistency) approaches. The need for an explainability baseline for evaluating the quality of explanations was raised by [19], who measured the trustworthiness of explanations and quantitatively compared LIME with Grad-CAM explanations on three different baselines: human-attention mask, segmentation mask, and human judgment ratings. Their findings indicate human biases in ratings and significant differences in evaluation scores, which they assume to be caused by “clear non-uniform distribution of weights in human attention masks”. Our work builds on previous user studies that evaluated model interpretations and explanations. We summarize them in Table 1 and roughly divide them into three categories. The first category ([1,24,29]) evaluates explainability quantitatively and measures the success of the user with and without explanations. The second category ([8,13,23,24,26,27]) measures comprehensiveness and trustworthiness of explanations quantitatively by aligning meaningful and meaningless or manipulated explanations with human logic. These studies point out that manipulated explanations can increase user trust in biased models and lead to mistrust when explanations do not match the decision-making process of users. The third category ([12,30,33,36]) measures the comprehensibility and trust of
users by considering system interactions. These approaches focus on understanding how users perceive information and how they diagnose and refine AI systems. Their qualitative results are often used as recommendations for designing XAI approaches that improve users’ trust. In summary, previous studies focused mainly on image or text data and on misleading users with manipulated explanations in binary annotation tasks. User considerations of the decision-making process and design recommendations for improving explanations were mostly out of scope. For more detailed information on each of the mentioned works, please refer to the summary in Table 1. Therefore, we do not limit ourselves to a quantitative evaluation but also evaluate XAI approaches qualitatively and consider user feedback to understand the reasons behind possible confusion in decision making. This helps us understand what users actually need and interpret the quantitative scores more reliably. Moreover, we include explanations of counterfactual and misclassified samples to test users’ predictability using the explanations.
Table 1. A Summary of Existing Studies for Evaluating XAI Approaches
Evaluation Paper
Metric
Explanations
Approach Data
Manipulated
Show
Model Examples Preds Labels [1]
Compreh.
Quan.
Image
✕
✕
✓
✕
[8]
Simulation
Quan.
Text
✕
✓
✓
✕
Quan.
Tabular
✕
✕
✓
✕
(Trust) [12]
Trust
Qual. [13]
Trust
Quan.
Tabular
✓
✓
✕
✕
[19]
Trust
Quan.
Image
✕
✕
✓
✕
[23]
Trust
Quan.
Image
✕
✓
✕
✕
[24]
Trust
Quan.
Text
✕
✓
✕
✕
Image
✓
✕
✓
✕
Text
Compreh. [26]
Trust
Quan.
Compreh.
Text
[27]
Compreh.
Quan.
Image
✕
✕
✓
✕
[29]
Trust
Quan.
Text
✓
✕
✓
✕ ✕
Compreh. [30]
Trust
Quan.
Text
✕
✕
✓
[33]
Compreh.
Qual.
Image
✕
✕
✓
✕
[36]
Trust
Qual.
Tabular
✕
✕
✓
✕
✕
✕
✓
✓
(Medical) Our Work Predictability Quan. Compreh.
Qual.
Tabular
3 Methodology
We conducted a user study to evaluate comprehensibility and predictability in explanations provided by two widely-used XAI methods: SHAP and LIME. Our experimental setup follows a between-subject design, with the XAI method as the primary varying condition. That means we exposed each participant to only one condition (LIME or SHAP), which shortened the duration of experimental sessions for each participant (c.f., [16]). Figure 1 depicts our overall experimental setup, which started with participant recruitment and a pre-test survey phase. As part of the central survey, we defined, for each research question, several assignments and tasks to be solved by the participants. The first assignment measured the user’s comprehensibility for a given XAI method, and the second one the explainer’s predictability. The third assignment investigated the effect of adding explanations of misclassified and counterfactual samples on explainer’s predictability. In each task, we presented visualizations of SHAP- or LIME-explanations to the participant and asked them to answer four questions, in which they had to interpret the visualizations for the given test sample.
Fig. 1. An Overview of the Assignments of our User Study. We asked Participants to Work on Assignments and Specific Tasks, Answering a Specific Research Question in the main Survey. Assignment One and Two Contained Three Samples and their Explanations, Depicted as EXxx, Provided by Either LIME or SHAP. The Users Studied the Samples and Explanations and had to Answer Questions for a Test Sample, Depicted as Sample Cx or Tx.
Dataset and Implementation. We chose the Boston Housing dataset [6] because of its simplicity and transparency. This dataset estimates the median price of apartments in Boston. We transferred this regression task into a classification task by categorizing the estimated prices into three classes: 1) low-price, 2) medium-price, and 3) high-price while preserving the feature correlations with the target variable. We used five features that give our model the highest accuracy: average number of rooms, pupil and teacher ratio, air pollution level, crime rate, and the zone where an apartment is located. In an initial trial experiment, we found that participants tend to project their interpretation of feature labels (e.g., crime rate) onto an explanation instead of
interpreting the information provided by either LIME or SHAP. Therefore, we anonymized the feature names to F1, F2, F3, F4, and F5, respectively. We further min-max normalized each feature and trained a three-layer fully connected neural network (64 units, ReLU activation, and a softmax at the output layer). We optimized the model’s weights with Stochastic Gradient Descent (learning rate 0.001). The model had 93% accuracy and median prediction probabilities of 70.21%, 46.81%, and 72.87% for the low-price, medium-price, and high-price classes, respectively. However, we did not provide the participants with this information. To set up our experiment, we used Python 3.7; for the explainability visualizations from LIME and SHAP, we used the LimeTabularExplainer (lime library version 0.2.0.1) and the KernelExplainer (shap library version 0.34.0).
Participants. We recruited participants with ML and Data Science experience and technical backgrounds in Computer Science, Mathematics, and Physics. The participants were scientists and practitioners from eight different institutions in five countries, recruited via their LinkedIn profiles. Overall, 47 participants took part in our experiment, and we randomly assigned one XAI approach, either SHAP or LIME, to each participant. We compensated participants with €20 for their approximately one-hour effort. We had to remove one user, who answered “I do not know” and stated afterwards that he was not focused and could not participate in this study. This data cleaning step left us with a total of 46 participants: 30 male and 16 female, with an average age of 31; 24 users evaluated LIME, and 22 users evaluated SHAP. When asked about their experience with explainability approaches and interpreting machine learning models, they often stated that they interpret models by looking at feature importance plots of decision trees or at the coefficients of linear regression models.
On the other hand, only twelve users had experience with XAI approaches such as LIME, Layer-wise relevance-propagation (LRP), heatmaps, or Google’s Language Interpretability Tool (LIT) [34]. Therefore, we can assume that most of our participants (34 of 46) were non-experts in XAI. Survey Procedure. We implemented our survey using the [25] platform. After the users gave their consent to the overall experimental design, they started the pre-test survey, in which we asked them about their demographic information, background, and data science experience. Then, we measured their experience by presenting them with 12 data science and machine learning know-how questions such as bias and variance trade-off or distribution functions. Afterwards, we acquainted the participants with the overall survey procedure in an initial training phase, in which we explained the dataset, the tasks, the visualizations provided by each explainability approach, and the structure of the assignments. Then, in the second part of the survey, the participants started working on the three assignments, each comprising two tasks with four questions. Thus, each participant had to answer 24 questions related to interpretability in total.
Finally, we presented the participants with the NASA Task Load Index (TLX) [7] questionnaire to obtain insight into the mental, physical and temporal demand of the survey as well as the participants’ success, effort, and frustration.
Data Coding. For the quantitative part of our analysis, we assigned scores to each multiple-choice answer and computed the sum over all answers to compare assignment results. We gave each correct answer a score of 2, each wrong answer a score of -1, and each “I do not know” answer a score of 0. This scoring scheme allows us to distinguish participants who seriously tried to answer the questions, even if incorrectly, from those who just checked “I do not know”. For the qualitative evaluation, we interviewed the participants and transcribed their feedback throughout each interview session. We followed the Mayring qualitative analysis coding rules described in [17] to categorize the transcribed participant feedback.
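The scoring scheme described under Data Coding can be sketched as follows. This is an illustration only: the outcome labels `correct`, `wrong` and `unknown` and the function name are hypothetical, while the score values 2, -1 and 0 come from the text.

```python
# Score per answer outcome, as stated in the Data Coding paragraph.
SCORES = {"correct": 2, "wrong": -1, "unknown": 0}


def assignment_score(outcomes):
    """Sum the scores of all answers a participant gave in one assignment."""
    return sum(SCORES[outcome] for outcome in outcomes)


# One assignment has two tasks with four questions each, i.e. eight answers.
answers = ["correct"] * 5 + ["wrong"] * 2 + ["unknown"]
total = assignment_score(answers)
```

Under this scheme, a participant who answers every question with “I do not know” scores 0, whereas one who attempts every question and gets half of them right still scores above 0.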
4 RQ1: Comprehensibility
This section addresses our first research question, RQ1, which seeks to understand the relationship between how comprehensible a set of three explanations is to the user and the prediction confidence of a machine learning model. For this purpose, we randomly assigned each user to an XAI method (LIME or SHAP) and presented them with three explained samples (EX01, EX02, and EX03), each representing an apartment that the machine learning model classified as belonging to the class low-price (coded as 0). Figure 2 illustrates one of these three samples and shows how we presented and explained it to the users. We chose these samples based on the information they carry and ensured that several features contributed to their low-price classification. Moreover, we chose two additional test samples from the Boston Housing dataset, C1 and C2, with features similar to the explained samples, which the model correctly classified as low-price. However, the prediction confidence for C1 was higher than that for C2, indicating that C1 is further away from the model’s decision boundary. We use the model’s Probability Distribution Delta (PDD) to quantify the model’s confidence. The users studied the explanations and tried to use the information they learned from EX01, EX02, and EX03 to answer the following multiple-choice questions, first for C1 and then for C2:
1. Choose two features that highly influence the prediction of class “low-price (0)”.
2. How does the value of F2 (together with F1 and F5) influence the model’s decision on class “low-price (0)”?
3. How does the value of F1 affect the probability of class “low-price (0)”?
4. How does the value of F3 in this sample (w.r.t. explanations) affect the probability of class “low-price (0)”?
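This section does not restate how the PDD is computed. Purely for illustration, the sketch below assumes one plausible reading, namely the gap between the two largest class probabilities; the function name and the probability values are hypothetical and are not taken from the study.

```python
def probability_gap(class_probs):
    """Gap between the two largest class probabilities (an assumed stand-in for PDD)."""
    top, runner_up = sorted(class_probs, reverse=True)[:2]
    return top - runner_up

# Hypothetical probabilities for (low, medium, high) price.
c1_probs = [0.85, 0.10, 0.05]
c2_probs = [0.50, 0.40, 0.10]

# A larger gap means the sample lies further from the decision boundary.
print(probability_gap(c1_probs) > probability_gap(c2_probs))
```

Under this assumption, a sample such as C1 with a large gap is classified with high confidence, while a sample such as C2 with a small gap sits close to the boundary between two classes.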
720
A. Jalali et al.
Fig. 2. Example Explanation (EX01) Provided to the User. On the Left-Hand Side, it Shows a Sample (an Apartment) that the Model Classified as being Low-Price (Label 0). The X-Axis of the Bar Plot Shows the Sample’s Attributes and the Y-Axis their Values. On the Right-Hand Side, a LIME Explanation Describes the Decision of the Model.
With the above questions, we wanted to measure whether users can interpret how feature values increase or decrease the probability of a sample being classified as “low-price”. We also formulated control questions to avoid random answers and ensure participants focused on the tasks. For instance, the answer to the third question should match the answer to the second question. If this is not the case, we know that an answer is random and that we should remove it from our analysis. However, we found no such inconsistencies, so our users did not answer the questions randomly. We computed scores for all responses, giving an overall comprehensibility score for each user and each assignment. We also compared the individual question scores of C1 and C2 to measure quantitatively whether the presented visualization helps the user comprehend each feature’s contribution to the model’s decision. Furthermore, we coded the participants’ interview responses into three categories: (i) C1 was more difficult than C2, (ii) C2 was more difficult than C1, and (iii) C1 and C2 were equally challenging.
Results. We collected responses from 46 participants, each answering the questions above for either LIME (24) or SHAP (22), and computed the overall comprehensibility score for each user. Figure 3 shows the minimum, maximum, and sample median as well as the first and third quartiles of the comprehensibility scores for both methods. A post-hoc comparison of mean values using a two-sample t-test revealed no significant difference (t=−1.56, p=0.12) between the mean comprehensibility scores of SHAP (13.73) and LIME (11.42). This result suggests that participants found LIME and SHAP similarly comprehensible. Next, we tested whether a model’s confidence in its prediction, which we measure by the PDD between possible classes, affects the users’ comprehensibility. Recall that C1 is further away from the model’s decision boundary than C2.
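The LIME versus SHAP comparison reported in the Results above relies on a two-sample t-test. The sketch below shows how such a t statistic can be computed with the Python standard library. It assumes the pooled-variance (Student) form, which the text does not specify, and the scores are hypothetical rather than the study data. Obtaining a p-value additionally requires the t distribution, which a statistics package would provide.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(a, b):
    """Student's two-sample t statistic and degrees of freedom (equal variances assumed)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df

# Hypothetical comprehensibility scores, not the study data.
lime_scores = [10, 12, 8, 14, 11, 13]
shap_scores = [15, 13, 12, 16, 14, 11]

t, df = pooled_t(lime_scores, shap_scores)
print(f"t = {t:.2f}, df = {df}")
```

A negative t here only reflects the order of the arguments (first group minus second group); the magnitude is what is compared against the critical value.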
Fig. 3. LIME and SHAP’s Comprehensibility Score. We Included the Mean Values with Stars, and Included the Median Values above the Median Lines of each Box-Plot. No Significant Difference is Visible between these Two XAI Methods.
Since the responses to these tasks represent variables from repeated-measures groups, we follow [3] and compare means by first calculating an adjustment factor for each user. We computed that factor by subtracting the participant’s mean (pMean) from the grand mean over both C1 and C2. We then added these adjustment factors to our participants’ actual comprehensibility scores and compared C1 and C2. For LIME, we again did not see a significant difference (t=−0.80, p=0.42) between the mean comprehensibility scores of C1 (5.42) and C2 (6.0). However, as shown in Fig. 4, for SHAP, we measured a significant decrease (t=5.54, p=0.00) between the scores of C1 (8.18) and C2 (5.55). These results show that SHAP’s explainability visualizations were less comprehensible to the participants when the model’s prediction was closer to the decision boundary. Moreover, we see a significant difference (t=4.51, p=0.00) between the mean comprehensibility scores of SHAP (8.18) and LIME (5.42) for C1. That difference indicates that the comprehensibility of SHAP users was higher than that of LIME users for C1. The low score can be explained by the “I do not know” answers on the F3 feature contribution. In our first intermediate analysis, we noticed the difference between SHAP and LIME’s comprehensibility scores. Therefore, to obtain further qualitative insight into the effect of the decision boundary distance on the participants’ comprehensibility, we started asking them the following question right after the first assignment was over:
Q1. Which task did you find more difficult (between C1 and C2)? And why?
We collected the feedback from 20 participants and categorized their answers into three groups: i) “C1 more difficult than C2”, ii) “C2 more difficult than C1”, and iii) “C1 and C2 similarly difficult”. Fifteen participants stated that they found answering the questions for C1, which has a higher distance from the decision boundary than C2, more complicated than for C2. One participant, for instance, stated “I did not know how to answer the questions and work with visualizations for C1. But for C2, it was clear for me how to use the visualizations and find my answers”.
This result indicates that a learning effect was involved and that participants became familiar with the
Fig. 4. C1 and C2 Scores for Both LIME and SHAP. For LIME, we Observe No Significant Difference between C1 and C2 Scores. On the Other Hand, we Observe a Significant Decrease from C1 to C2 for SHAP’s Comprehensibility Tasks.
visualizations over time while answering these samples’ questions. We compared their feedback with their scores and noticed that none of these 15 participants answered the questions for C2 correctly, which indicates a mismatch between perceived and actual comprehensibility.
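The repeated-measures adjustment used in this section (adjustment factor = grand mean minus participant mean, added to each of that participant’s scores) can be sketched as follows. The function name and the scores are hypothetical and only illustrate the arithmetic; they are not the study data.

```python
from statistics import mean

def adjust_repeated_measures(scores):
    """Remove between-participant variation from paired scores.

    scores maps a participant id to a (C1, C2) score pair.
    """
    grand_mean = mean(s for pair in scores.values() for s in pair)
    adjusted = {}
    for pid, pair in scores.items():
        factor = grand_mean - mean(pair)  # adjustment factor for this participant
        adjusted[pid] = tuple(s + factor for s in pair)
    return adjusted

# Hypothetical scores, not the study data.
raw = {"p1": (8, 4), "p2": (6, 6), "p3": (2, 4)}
adjusted = adjust_repeated_measures(raw)
for pid, pair in adjusted.items():
    print(pid, pair)
```

After the adjustment every participant has the same mean across C1 and C2, while the condition means of C1 and C2 are unchanged, so the remaining spread reflects only the within-participant difference between the two tasks.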
5 RQ1: Predictability
Having reported on the comprehensibility aspect of the first research question, RQ1, in the previous section, we now focus on the predictability aspect. For that purpose, we presented the second assignment to the participants (see Fig. 1) to analyze whether they can predict the model’s behavior and detect misclassifications using the information they receive from the explanations. Figure 5 shows one of the three SHAP explanations we presented to the user. In contrast to the previous assignment, each explained sample was classified differently, as low-, medium-, or high-price. In two tasks, we also introduced misclassified samples, from medium- to low-price (T1) and from high- to medium-price (T2). We asked the participants to answer the following multiple-choice questions for the test samples in each task:
1. What would the model predict when considering only the value of F1?
2. What would the model predict when we consider the values of F2 and F1?
3. What is the effect on class probabilities if we consider F1, F2, and F3 (or F5) values?
4. What does the model predict for this sample?
The fourth question is about the model’s behavior and follows a different scoring mechanism than the other questions. Table 2 lists the possible answers and the scoring rules we applied. Following our comprehensibility scoring, we
Fig. 5. Example SHAP Explanation (EX05) Shown to the User as Part of Assignment 2. The Sample is Classified as Medium-Price (Label 1) and Depicted as Bar Plot with its Actual Label and the Model’s Prediction next to its Plot.
Table 2. Scoring Rules for the Last Question in Assignment 2. The Rules are based on what the User Detected
| Score | Detected correct label | Detected prediction correctly |
|---|---|---|
| 4 | yes | yes |
| 1 | no | yes |
| 1 | yes | no |
| -2 | no | no |
| 1 | User suspects the misclassification when the model misclassifies a sample | |
| 0 | User answers with “I do not know” | |
give each correct answer (yes) a score of 2 and each wrong answer (no) a score of -1. For example, a participant who guesses the label correctly but predicts the model’s prediction incorrectly receives a score of 1 (2 + (-1)). However, participants who suspect the misclassification receive a score of 1 for correctly guessing the model’s failure.
Results. Analogous to the comprehensibility score reported in the previous section, we computed a predictability score for both XAI methods. As shown in Fig. 6 (RQ1) and confirmed by a two-sample t-test (t-value=−0.553, p=0.582), we found no significant difference between the mean predictability scores of SHAP (5.64) and LIME (6.38), which were both drawn from Gaussian distributions that were not significantly different from each other. We further applied the Mann-Whitney test to analyze the difference between the LIME and SHAP answer category distributions. We categorized the answers into three groups: first, correct guesses of the model prediction; second, incorrect guesses of the model prediction; and third, the neutral category “I do not know”. For SHAP, with 22 participants and their answers for two tasks, this categorization results in 15 counts for the first category, 20 counts for the second
category, and 9 counts for the neutral category. The categorization for LIME, with 24 participants, results in 20 counts for the first category, 23 counts for the second category, and 5 counts for the neutral category. We find no significant difference between the LIME and SHAP answer category distributions for this assignment (U=922.5, p=0.128). We also analyzed the participants’ answers to see whether they could predict the model’s behavior using this set of explained samples. We compared the first question of samples T1 and T2, which only considers the values of feature F1. We noticed that participants using LIME answered correctly more often than those using SHAP and that SHAP users struggled to find a threshold to decide whether the feature value increased the classification probability. LIME explanations, on the other hand, provide such a threshold range. For the second question, we noticed that participants using SHAP scored better than those using LIME. According to the LIME participants’ feedback, the effect of the F1 and F2 values on the medium-price class was unclear to them. As part of the third question, we asked the participants about the impact of the value of feature F3. We wanted to know whether it pushes a decision towards one class and away from another. The average scores of both participant groups, SHAP and LIME, were low for this question, with mean scores of 0.27 and 0.66, respectively. Fourteen participants using SHAP answered incorrectly, and 4 answered with “I do not know”. In contrast, 8 participants using LIME answered incorrectly, and 6 answered with “I do not know”. Finally, the participants performed equally well when answering the fourth question for SHAP and LIME. Breaking down the scores and looking deeper, we noticed that SHAP participants often predicted the label of a sample correctly (15 of 22 participants) but failed to predict the model’s behavior.
On the other hand, LIME participants detected model misclassifications more often and predicted labels correctly (15 out of 24 participants).
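The scoring rules of Table 2 for the fourth question can be written down compactly. The following is a sketch only: the function name and the answer encoding are hypothetical, and it assumes that the two special rows (“suspects the misclassification” and “I do not know”) take precedence over the yes/no rows.

```python
def score_last_question(answer):
    """Score the fourth question of Assignment 2 following Table 2.

    answer is "dont_know", "suspects_misclassification", or a
    (label_correct, prediction_correct) pair of booleans.
    """
    if answer == "dont_know":
        return 0
    if answer == "suspects_misclassification":
        return 1
    # Each "yes" is worth 2 and each "no" is worth -1.
    return sum(2 if ok else -1 for ok in answer)

print(score_last_question((True, True)))    # 2 + 2 = 4
print(score_last_question((True, False)))   # 2 + (-1) = 1
print(score_last_question((False, False)))  # (-1) + (-1) = -2
```

The second call reproduces the worked example from the text: a participant who guesses the label correctly but the model’s prediction incorrectly receives 2 + (-1) = 1.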
6 RQ2: Improving Predictability with Visualizations
We now address our second research question, RQ2, and examine how visualizations of misclassified and counterfactual samples can improve the users’ predictability. For that purpose, Assignment 3 uses the same set of questions and test samples as Assignment 2 but presents explanations of the misclassified and correctly classified samples to the user. This experiment allows us to compare predictability scores across assignments and measure how they are affected by additional visualizations. We illustrate the tasks and the assignment in Fig. 1. All chosen samples have low PDD values and are close to the model’s decision boundary.
Results. After checking that the sample scores for both methods follow a Gaussian distribution, we compared the mean scores of both methods using a two-sample t-test and found that LIME’s predictability score was significantly higher (t-value=−2.263, p=0.029) than that of SHAP. This result is also visible in Fig. 6,
which also shows that LIME’s visualization of misclassified samples has a more substantial effect on improving the explainer’s predictability. We further compared the total sum of the achieved predictability scores of Assignment 3 with those of the previously conducted Assignment 2 and see, as shown in Fig. 6, significant improvements for LIME (t-value=−4.387, p=0.0) but not for SHAP (t-value=−1.97, p=0.055). Since the responses in Assignments 2 and 3 also represent variables from repeated-measures groups, we follow, analogous to the comprehensibility score in RQ1, the method of Field et al. [3] for comparing means of repeated measures: we first calculate an adjustment factor for each user and add this factor to their predictability scores. As shown in Fig. 7, after this adjustment we see a significant improvement for both LIME (t-value=−4.301, p=0.0) and SHAP (t-value=−2.245, p=0.030). This result indicates that presenting explanations around the model’s decision boundary helps the user understand the model’s decision-making.
Fig. 6. LIME’s and SHAP’s Predictability Score before and after the Users Study the Explained Counter-Factual and Misclassified Samples. LIME Shows Significant Improvement
Analogous to our predictability analysis for RQ1, we apply the Mann-Whitney test to analyze the difference between the LIME and SHAP answer category distributions and find no significant difference (U=961.0, p=0.203). The categorization of SHAP answers results in 23 counts for the first category, 17 counts for the second category, and 4 counts for the neutral category. The categorization of LIME answers results in 31 counts for the first category, 15 counts for the second category, and 2 counts for the neutral category. We further compare the category distributions of the LIME answers between RQ1 and RQ2 and notice that they are significantly different (U=702.0, p=0.0). We find the same when comparing SHAP’s answer category distributions between RQ1 and RQ2 (U=685.5, p=0.004). Moreover, we observed that participants correctly identified sample labels and predicted the correct class for the misclassified test sample. From only 4
Fig. 7. LIME’s and SHAP’s Calculated Adjusted Mean of Predictability Score before and after each User Studies the Explained Counter-Factual and Misclassified Samples. Both Methods Show Significant Improvements
participants using LIME and 7 participants using SHAP, we see an improvement to 15 participants using LIME and 12 participants using SHAP, who answered the third assignment correctly.
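The Mann-Whitney comparison of answer categories used in this and the previous section can be sketched with plain Python. The example uses the Assignment 2 counts reported for RQ1 (SHAP: 15, 20, 9; LIME: 20, 23, 5). How the categories were coded numerically is not stated, so the sketch assumes the codes 1, 2, 3 in the order the categories are listed; the function names are hypothetical. Under that assumption the U statistic for LIME evaluates to 922.5, the value reported for that comparison. The p-value is omitted because it requires a normal approximation or a statistics package.

```python
def mann_whitney_u(x, y):
    """U statistic of sample x against sample y; ties count one half."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

def expand(counts):
    """Turn per-category counts into a list of category codes 1, 2, 3 (assumed coding)."""
    return [code for code, n in enumerate(counts, start=1) for _ in range(n)]

# Assignment 2 counts: (correct guess, incorrect guess, "I do not know").
shap = expand((15, 20, 9))
lime = expand((20, 23, 5))

u_lime = mann_whitney_u(lime, shap)
u_shap = mann_whitney_u(shap, lime)
print(u_lime, u_shap)  # 922.5 1189.5
```

The two U values always sum to the product of the sample sizes (here 48 × 44 = 2112), which is a quick sanity check on any implementation.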
7 RQ3: Qualitative Analysis
In this section, we answer our third research question, RQ3, by studying the participants’ feedback to derive guidelines for improving the design of XAI methods with local explanations. We seek to answer this question by qualitatively analyzing the participants’ input. Therefore, at the end of the third assignment, we presented the participants with the following question:
Q2. How confident were you when answering the questions? When did you feel less confident?
We followed Mayring’s qualitative coding rules [17] and coded the participants’ responses into the following categories: i) high, ii) average, and iii) low self-confidence. This categorization resulted in roughly balanced classes: 17 participants had high self-confidence (8 using LIME and 9 using SHAP), 13 participants had average self-confidence (6 using LIME and 7 using SHAP), and 16 participants had low self-confidence (10 using LIME and 6 using SHAP). We further compared the participants’ scores with their self-confidence category. We found that the SHAP participants’ confidence categories match their scores, indicating that participants who had high self-confidence using SHAP also answered the assignments correctly. However, we observed no correlation between the participants’ self-confidence and the LIME scores. The median comprehensibility score (median score =
14.0) for 7 participants with average self-confidence was higher than that of both the high-confidence (median score = 10.0) and the low-confidence (median score = 11.0) participants. LIME’s predictability scores for both assignments two and three stayed the same and did not decrease with the participants’ confidence category. We also compared the participants’ expertise (years of experience in ML and data science) with their scores for each category. We noticed a positive correlation for participants using SHAP, indicating that more experienced participants could better interpret the explanations than participants with lower expertise. However, participants who used LIME and had low expertise achieved higher scores than the more experienced users. We considered the time participants needed to answer the questions and noticed that those who took more time achieved a higher score. The participants with high expertise and higher confidence often took less time to answer LIME’s questions and had lower scores than the other users, who needed longer to answer the questions. We noticed that participants using LIME agreed that misclassification information increased their confidence in their answers. However, participants using SHAP stated that the visualizations did not improve their confidence, and 6 participants stated that the misclassification information provided by SHAP did not help them at all and confused them more. Their feedback also correlates with their scores. We continued our qualitative analysis and presented the participants with the following question:
Q3. How much did the visualizations (of the second assignment) help to get an insight into the model and how it comes to its decisions?
We categorized the participants’ feedback on helpfulness, including negative feedback, into the categories i) helpful to answer the test sample, ii) only helps to understand the explained samples, and iii) not helpful at all.
We further clustered the negative feedback from the second and third categories to construct the improvement guidelines based on participants’ needs. According to our coding, 36 of 46 participants (18 LIME and 17 SHAP) stated that the information was not enough to transfer to a new sample. One participant who used LIME found the visualizations helpful for transferring the information to a new sample. Ten participants (6 SHAP and 4 LIME) stated that they could not use the information and answered with very low confidence. We then moved on to our fourth question to understand whether the misclassified and counterfactual samples help the participants gain more insight into the model’s decision-making process.
Q4. How much did the explanations of counter-factual and misclassified samples help to answer the questions of the third assignment?
We cluster the participants’ feedback based on their similarities to understand what confused them and caused them to fail in solving the assignments. We
clustered their feedback into three categories: i) inconsistency of the explained values, ii) missing information from the visualizations, and iii) not at all helpful. We first present the negative feedback from participants who used LIME. Two participants stated that assignments two and three were confusing. One had a low expertise rank, and the other had average expertise and no XAI experience. However, the scores of both increased significantly after they saw the visualizations of misclassified samples and counterfactual explanations. One participant stated that “the way all the information was presented at once made it very difficult to understand the differences of the samples” (user id=27), and another that “I do not understand what the visualizations are trying to point out, they are all very similar to each other and identifying the correlation of the feature values on the target classes was not intuitive at all. I always used the bar plot and compared my sample with the labeled samples.” (user id=39). Four LIME users were in the second category (missing information from the visualizations) and stated that they needed explanations of more misclassified samples. One of them mentioned that “the visualizations helped me understand why and when a misclassification might happen. Still, it was not enough to assume the model’s behavior confidently.” (user id=1). These participants also had significantly higher scores in the third assignment, indicating that the visualizations helped identify misclassifications regardless of their subjective feedback. Finally, only one LIME user stated that the information was completely confusing. The user could not make sense of the inequality range given by LIME’s visualization and why some features had the same inequality range, even though they were classified into two different classes.
When we explained that the inequalities depend on the feature interactions and why these changes stay the same for one feature, the user confirmed that they understood it. Still, it was not intuitive for the user at the time of the assignment. We conclude that the reason behind most users’ confusion was the unexplained, inconsistent range of inequalities for samples from the same class. Moreover, LIME’s local explanations only show that a feature value reduces the probability of the predicted class but do not reveal which class’s probability increases. SHAP plots present this information. We now move on to the negative feedback given by participants who used SHAP. No participant stated that SHAP’s visualizations were not at all helpful. Only two stated that more explanations of misclassified samples would have helped them transfer the explanations to a new sample (category (ii)); they had a non-significantly lower score (t=1.0, p=0.5) for assignment three. Moreover, eight participants, who also achieved a higher score for the third assignment, mentioned that the visualizations confused them very much: “There were too many bars and numbers and finding the contribution of features to each class and the effect of feature values on other classes was very demanding and tiring” (user id=26). All these participants also stated that the inconsistency of the plot’s scaling fooled them, and they had to invest more time to find the real contribution of feature values towards each class.
We noticed that SHAP still plots a long bar for each feature even when all features have a low contribution. If users do not look at the probability scales carefully, they might assume that all these features contribute equally strongly to the respective class.
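One way to avoid the scaling problem described above is to derive a single axis limit from all explanations that are shown together, instead of letting each plot scale itself. The following sketch illustrates the idea independently of any plotting library; the function name, the margin, and the contribution values are hypothetical.

```python
def shared_axis_limit(explanations, margin=1.05):
    """Return one symmetric axis limit that covers every explanation."""
    largest = max(abs(v) for contributions in explanations for v in contributions.values())
    return largest * margin

# Hypothetical per-feature contributions for two samples (not study data).
strong = {"F1": 0.40, "F2": -0.25, "F3": 0.10}
weak = {"F1": 0.04, "F2": -0.02, "F3": 0.01}

limit = shared_axis_limit([strong, weak])
for name, sample in (("strong", strong), ("weak", weak)):
    # Bar length as a fraction of the shared axis, instead of per-plot autoscaling.
    print(name, {f: round(abs(v) / limit, 2) for f, v in sample.items()})
```

With the shared limit, the bars of the weakly contributing sample stay visibly short, whereas per-plot scaling would stretch its largest bar to the full width of the plot.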
8 Discussion and Design Recommendation
We now summarize the key findings of our user study and propose a set of design recommendations that can substantially contribute to the design of new or the refinement of existing XAI methods:
– Explanations should be Consistent. This may seem obvious, but when evaluating LIME, we noticed that many participants were confused by inconsistent inequality ranges for the same feature in different samples of the same class (e.g., the model predicted two instances as low-price because one instance had 0.26 < F1-value < 0.36, while another instance had F1-value < 0.26). Such inconsistencies could be explained by showing the correlation between feature values.
– Explanations should have Fixed Scales. Participants working with SHAP tended to interpret feature explanations incorrectly for features with smaller scales. The visualizations “fooled” them, and they assumed that a feature with a higher bar has a greater influence on a sample, even though its SHAP value was much smaller than in another sample.
– Explanations should also Provide Counterfactual Examples. Our quantitative and qualitative results show that participants predict model behavior better using the explanations if counterfactual samples are presented. These additional explanations could also help users understand the differences between the classes more clearly.
– Explanations should Contain Misclassified Samples close to the Decision Boundary. From our experiments, we learned that participants predict the model’s decisions from explanations better when they see misclassified examples and understand why the model made a wrong decision. This can be achieved by presenting samples close to a model’s decision boundary, which, of course, are often subject to misclassification.
– Explanations should Contain Correctly Predicted Samples close to the Decision Boundary. Our results also indicate that participants achieved higher scores when presented with explanations of correct predictions. These samples can be identified by selecting correctly classified samples with low prediction confidence.
Our work is, of course, currently limited to comparing the local explanations of two XAI approaches, LIME and SHAP. Also, the Boston housing dataset we used in our experiments has been simplified to tabular data with a few dimensions. Moreover, users neither had the option to choose the samples themselves nor did we allow them to interact with the model and different outputs of the XAI approaches. However, we argue that these restrictions were necessary to
reduce the number of confounding variables in our user study, which could influence our variables through events that are not causally related. We also believe that our experimental setup is general enough to be transferred to the evaluation of other explanation tools. Potential future work could extend our approach to other types of data (e.g., acoustic or sensor data) and to other emerging XAI techniques (e.g., LORE [4]) that we have not yet considered in our explanations. Moreover, one could compare local and global explanation designs and study how they affect users’ comprehensibility and predictability. Another potentially interesting research direction is to investigate how active learning techniques could be used to support users in comprehending and predicting model decisions [22]. To our knowledge, no other work has focused on users’ predictability of model outcomes by combining qualitative and quantitative methods. We believe this combination helps to improve the output of XAI approaches towards a better representation of the inner learning logic of the model and how it comes to its decisions, especially for samples that are not trivially distinguishable. Overall, we believe that user studies should become an integral part of the improvement of existing and the development of new XAI methods. Since explanations must ultimately be understood by users, their perceptions and interpretations of explanations should also be systematically analyzed and understood. This can improve both the XAI methods and the user interaction with these methods.
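The last two recommendations in the list above call for explained samples close to the decision boundary, both correctly classified and misclassified. As a rough sketch of how such samples could be picked from model outputs, the code below uses the largest class probability as a confidence proxy. That proxy, the threshold of 0.6, the function name, and the sample values are all assumptions made for illustration.

```python
def near_boundary(samples, max_confidence=0.6):
    """Split low-confidence samples into correctly and wrongly classified ones.

    Each sample is a dict with an id, a true label, and a list of class probabilities.
    """
    correct, misclassified = [], []
    for s in samples:
        confidence = max(s["probs"])
        predicted = s["probs"].index(confidence)
        if confidence > max_confidence:
            continue  # far from the decision boundary, skip
        (correct if predicted == s["label"] else misclassified).append(s["id"])
    return correct, misclassified

# Hypothetical model outputs for classes (low, medium, high); not study data.
samples = [
    {"id": "a", "label": 0, "probs": [0.90, 0.07, 0.03]},
    {"id": "b", "label": 0, "probs": [0.52, 0.40, 0.08]},
    {"id": "c", "label": 1, "probs": [0.48, 0.45, 0.07]},
    {"id": "d", "label": 2, "probs": [0.10, 0.35, 0.55]},
]
print(near_boundary(samples))  # (['b', 'd'], ['c'])
```

Sample "a" is dropped because the model is confident about it, "b" and "d" are correctly classified near-boundary samples, and "c" is a near-boundary misclassification of the kind the recommendations suggest explaining to users.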
9 Conclusion
In this paper, we conducted a user study to investigate how well users comprehend the explanations provided by two widely used tools, LIME and SHAP, and how well they can predict model behavior from them. We measured the comprehensibility and predictability that participants gained from interpreting a given set of local explanations and applying the captured information to a new, unexplained sample. First, we formulated our first research question to measure comprehensibility and predictability and showed that the comprehensibility of SHAP explanations significantly decreases for samples close to the decision boundary. Second, we studied the information participants require to gain a more global interpretation of the model behavior and to increase their predictability using explanations. We observed that explaining misclassified and counterfactual samples to the participants can significantly improve their predictability (especially with LIME explanations). They recognized the model behavior for unexplained samples close to the model’s decision boundary. Furthermore, our qualitative analysis of participants’ feedback indicated that they require additional information, such as a justification of the explained values (LIME inequalities or SHAP values), to correctly interpret the model behavior and move towards a more global interpretation of the model’s decision boundary. Finally, we learned that the users’ confidence in interpreting the explanations strongly relies on the diversity and quantity of the explained samples. The more different instances participants studied, the more accurately they could interpret the outputs of the explainability approaches and predict the decisions of the model.
Acknowledgment. We thank all the volunteers and all the reviewers who provided helpful comments on previous versions of this document. We especially thank our colleagues, Clemens Heistracher and Denis Katic, for their constructive feedback on the structure of this work. We further thank Dr. Philipp Wintersberger and Dr. Jasmin Lampert for their constructive feedback and insight into this work. We also thank the Austrian Research Promotion Agency (FFG) for funding this work, which is part of the industrial project DeepRUL, project ID 871357, and acknowledge the funding from the European Union’s H2020 research and innovation program as part of the STARLIGHT (GA No 101021797) project.
References
1. Alqaraawi, A., Schuessler, M., Weiß, P., Costanza, E., Berthouze, N.: Evaluating saliency map explanations for convolutional neural networks: a user study. In: Proceedings of the 25th International Conference on Intelligent User Interfaces, IUI ’20, pp. 275–285. Association for Computing Machinery, New York, NY, USA (2020)
2. Carvalho, D.V., Pereira, E.M., Cardoso, J.S.: Machine learning interpretability: a survey on methods and metrics. Electronics 8(8), 832 (2019)
3. Field, A.P., Miles, J., Field, Z.: Discovering statistics using R. SAGE Publications, London, England (2012)
4. Guidotti, R., et al.: Local rule-based explanations of black box decision systems. arXiv preprint arXiv:1805.10820 (2018)
5. Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., Pedreschi, D.: A survey of methods for explaining black box models. ACM Comput. Surv. (CSUR) 51(5), 1–42 (2018)
6. Harrison, D., Rubinfeld, D.L.: Hedonic housing prices and the demand for clean air. J. Environ. Econ. Manag. 5(1), 81–102 (1978)
7. Hart, S.G., et al.: Development of NASA-TLX: results of empirical and theoretical research. In: Hancock, P.A., Meshkati, N. (eds.) Human Mental Workload (1988)
8. Hase, P., Bansal, M.: Evaluating explainable AI: which algorithmic explanations help users predict model behavior? Association for Computational Linguistics (ACL) (2020)
9. Haufe, S., et al.: On the interpretation of weight vectors of linear models in multivariate neuroimaging. Neuroimage 87, 96–110 (2014)
10. Hoffman, R.R.: A taxonomy of emergent trusting in the human-machine relationship. Cognitive systems engineering: the future for a changing world, pp. 137–164 (2017)
11. Jacovi, A., Marasović, A., Miller, T., Goldberg, Y.: Formalizing trust in artificial intelligence: prerequisites, causes and goals of human trust in AI. In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 624–635 (2021)
12. Kaur, H.: Interpreting interpretability: understanding data scientists’ use of interpretability tools for machine learning. In: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, CHI ’20, pp. 1–14. Association for Computing Machinery, New York, NY, USA (2020)
732
A. Jalali et al.
13. Lakkaraju, H., Bastani, O.: how do i fool you?: manipulating user trust via misleading black box explanations. In: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, AIES ’20, page 79–85, New York, NY, USA (2020). Association for Computing Machinery 14. Lipton, Z.C.: The mythos of model interpretability: machine learning, the concept of interpretability is both important and slippery. Queue 16(3), 31–57 (2018) 15. Lundberg, S.M., Su-In Lee, S.-I.: A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pp. 4768–4777, Red Hook, NY, USA (2017). Curran Associates Inc 16. Martin, D.W.: Doing psychology experiments. pp. 148–170 (2007) 17. Mayring, P.: Qualitative content analysis. A Companion Qual. Res. 1(2), 159–176 (2004) 18. Mohseni, S.: Toward design and evaluation framework for interpretable machine learning systems. In: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 553–554 (2019) 19. Mohseni, S., Ragan, E.D.: A human-grounded evaluation benchmark for local explanations of machine learning. arXiv preprint arXiv:1801.05075 (2020) 20. Mohseni, S., Zarei, N., Ragan, E.D.: A multidisciplinary survey and framework for design and evaluation of explainable AI systems. ACM Trans. Interact. Intell. Syst. (TIIS) 11(3–4), 1–45 2021 21. Molnar, C., Casalicchio, G., Bischl, B.: Interpretable Machine Learning – A Brief History, State-of-the-Art and Challenges. In: Koprinska, I., et al. (eds.) ECML PKDD 2020. CCIS, vol. 1323, pp. 417–431. Springer, Cham (2020). https://doi. org/10.1007/978-3-030-65965-3 28 22. Mondal, I., Ganguly, D.: Alex: active learning based enhancement of a classification model’s explainability. In: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 3309–3312 (2020) 23. 
Nourani, M., Kabir, S., Mohseni, S., Ragan, E.D.: The effects of meaningful and meaningless explanations on trust and perceived system accuracy in intelligent systems. In: Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, vol. 7, pp. 97–105 (2019) 24. Papenmeier, P., Englebienne, G., Seifert, C.: How model accuracy and explanation fidelity influence user trust. AI IJCAI Workshop on Explainable Artificial Intelligence (2019) 25. Qualtrics. Copyright year: 2021, location: Provo, utah, usa 26. Ribeiro, M.T., Singh, S., Guestrin, C.: why should i trust you? explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144 (2016) 27. Ribeiro, M.T., Singh, S., Guestrin, C.: Anchors: high-precision model-agnostic explanations. In: Proceedings of the 32nd AAAI International Conference on Artificial Intelligence, vol. 18, pp. 1527–1535 (2018) 28. Stefan R¨ uping. Learning interpretable models (2006) 29. Schmidt P., Biessmann, F.: Quantifying interpretability and trust in machine learning systems. In: AAAI-19 Workshop on Network Interpretability for Deep Learning (2019) 30. Shin, D.: The effects of explainability and causability on perception, trust, and acceptance: Implications for explainable AI. Int. J. Hum.-Comput. Stud. 146, 102551 (2021)
Predictability and Comprehensibility
733
31. Shrikumar, A., Greenside, P., Kundaje, A.: Learning important features through propagating activation differences. In: Proceedings of the 34th International Conference on Machine Learning - Vol. 70, ICML’17, pp. 3145–3153, Sydney, NSW, Australia (2017). JMLR.org 32. Sokol, K., Flach, P.: Explainability fact sheets: a framework for systematic assessment of explainable approaches. In: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, FAT* ’20, pp. 56–67, New York, NY, USA (2020). Association for Computing Machinery 33. Spinner, T., Udo, S., Hanna, S., Mennatallah, E-A.: explainer: a visual analytics framework for interactive and explainable machine learning. IEEE Trans. Vis. Comput. Graph. 26(1), 1064–1074 (2019) 34. Tenney, I., et al.: The language interpretability tool: Extensible, interactive visualizations and analysis for NLP models. pp. 107–118 (2020) 35. Tonekaboni, S., Joshi, S., McCradden, M.D., Anna Goldenberg. A.: What clinicians want: Contextualizing explainable machine learning for clinical end use. In: DoshiVelez, F. eds., Proceedings of the 4th Machine Learning for Healthcare Conference, vol. 106 of Proceedings of Machine Learning Research, pp. 359–380, Ann Arbor, Michigan (2019). PMLR 36. Wang, D., Yang, Q., Abdul, A., Lim, B.Y.: Designing Theory-Driven User-Centric Explainable AI, pp. 1–15. Association for Computing Machinery, New York, NY, USA (2019) 37. Zhou, J., Gandomi, A.H., Chen, F., Holzinger, A.: Evaluating the quality of machine learning explanations: a survey on methods and metrics. Electronics 10(5), 593 (2021)
Multi-sensor Failure Recovery in Aero-Engines Using a Digital Twin Platform: A Case Study
A. Manuja1, Saurav Anilkumar1, V. V. Varun1, A. Mathew1, S. P. Sureshkumar2, and R. George3(B)
1 IVA Pvt. Ltd., Technopark, Trivandrum, India
2 Bangalore, India
3 Department of Cyber-Physical Systems, Clark Atlanta University, Atlanta, USA
[email protected]
Abstract. Digital twin technology has found wide application in the aerospace industry for fleet monitoring, diagnostics, predictive maintenance, and manufacturing. Sensing the aero-engine in a test environment is challenging because extreme conditions produce noisy and failed sensors. Downstream diagnostics and prognostics require that key sensed values be available for the duration of the test for both real-time and off-line analysis. Virtual sensors that mimic the behavior of the physical sensors provide a resilient solution in the presence of such failures. This paper describes virtual sensors designed using a low-cost digital twin platform with off-the-shelf analytical software components and libraries. The platform is a no-code environment that permits users to rapidly experiment with different analytical models and configurations. We show that virtual sensors, whose design is facilitated by the Digital Twin platform, can emulate the behavior of the physical sensor(s) with high accuracy in the event of multiple physical sensor failures. From a process standpoint, the Digital Twin brings several advantages to the organization, including the breaking down of departmental data silos, the reevaluation of key assumptions regarding system design, and the standardization of the monitoring process. The digital twin approach is seen to be a catalyst for reengineering the design and monitoring lifecycle for industrial organizations.
Keywords: Aero-Engines · Multi-sensor failure · Virtual Sensors · Digital Twin
S. P. Sureshkumar: Independent Consultant, Bangalore, India.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): SAI 2023, LNNS 711, pp. 734–741, 2023. https://doi.org/10.1007/978-3-031-37717-4_47
1 Introduction
The worldwide industrial landscape is undergoing a steady digital transformation, with the adoption of concepts such as Industry 4.0/5.0 [1]. Industry 4.0/5.0 is driven by enablers such as IoT (the Internet of Things), digitization, and the automation prospects of cyber-physical systems and, organizationally, by the disappearance of boundaries between Information Technology and Operations and the integration of people and teams
with systems. This transformation is manifest in different aspects of the aerospace industry, including aero-engine design and manufacturing and aircraft operations, and has resulted in safer and more efficient aircraft systems [2]. In design and implementation scenarios, digitization has improved the reliability of engines, raised employee productivity, and reduced time to market. In the operational space, the implementation of these concepts has resulted in the deployment of electronic and informational sub-systems such as Digital Avionic Systems [7], Health and Usage Monitoring Systems (HUMS) [9, 10], and systems of sensors as integral components of the aircraft. In these scenarios, the large number of sensed parameters within the system increases the difficulty of analyzing and understanding system behavior.
Fig. 1. Digital Twin Block Diagram
A digital twin is a set of high-fidelity virtual constructs that mimics the structure, context, and behavior of the deployed physical entity [3]. The digital twin acquires, correlates, and analyzes data and performs sense-making, providing continuous evaluation of the corresponding physical entity. The use of the digital twin enables quicker and more precise assessments, diagnostics, and predictions than traditional methods. Advances in Artificial Intelligence, IoT, and cloud infrastructure are driving the adoption of digital twins in the aerospace industry, with the dominant manufacturers and government agencies heavily committed to this technology. The concept of the digital twin was introduced by Michael Grieves in relation to Product Lifecycle Management (PLM) [4]. NASA and the U.S. Air Force were early adopters of this technology, developing several aerospace-related applications. (Fig. 1 shows the primary components and users of a Digital Twin for an Aero-Engine [5] test facility.) In the past few years, several research organizations and companies have taken up the challenge of extending the theory and practice of this technology to new industry segments [11, 12].
Virtual Sensors [6, 7] are software representations of sensor hardware that aggregate signals from physical sensors. They overcome the drawbacks of physical sensors with lower operating costs, increased reliability, and the ability to indirectly measure properties. In this project we take a black-box approach to the definition of the virtual sensor, without assuming any knowledge of the underlying physical processes that drive the operation of the sensors. In this research effort we present a Proof of Concept (PoC) of a digital twin [8] that aggregates data from multiple sensors to create a virtual sensor, which is used to recover from the failure of multiple key sensors. The focus of this paper
is to demonstrate a novel application of analytical models in an aero-engine test facility using a digital twin platform. The analytical models are used to create a virtual sensor that duplicates the behavior of the physical sensor(s) when one or more of them fail. The scope of this research is limited to demonstrating the feasibility of defining and using virtual sensors, which are shown to accurately predict physical sensor outputs in the event of multiple sensor failures. We present the approach, with a detailed process flow for the digital twin, in Sect. 2 and show results based on the model predictions in Sect. 3. We conclude this paper in Sect. 4 with the open research questions as applied to aerospace test facilities and draw lessons on the applicability of the digital twin to such problems.
2 Approach
2.1 Aero-Engine Test Environment
The development and test environment is part of an aero-engine R&D effort in which a high level of metering and sensing of parameters is maintained during the engine testing stage. Sensors are used to measure parameters including temperature, pressure, vibration, strain, and lubrication, with further specialized sensors for each category. For example, specialized pressure sensors are used to monitor core pressure and wall pressure, while dedicated temperature sensors measure gas and lubricant temperature. The test setup uses 600 sensors of these types. Further complexity is introduced by the sampling rates of the different sensors, which depend on sensor position and type and range from 10/sec to 75k/sec. Tests typically run 8–10 h continuously, and the environment is characterized by high temperature, pressure, and vibration. These harsh conditions cause sensors to fail, critically affecting downstream engineering, production, and maintenance-related operations. Strain gauges, which monitor the fan blades of the engine, record the most critical measurements during a test. Damage to the blades through uncontrolled operations has severe schedule and financial implications.
Virtual sensors are defined to replicate the behavior of the physical sensors using a combination of historical and real-time sensor measurements from the full range of sensors used for metering the entire system. Analytical models are used to impute the missing values from the physical sensor, even in the event of multiple sensor failures. This allows for the prediction of missing sensor values, thereby facilitating downstream decision-making even in the absence of real data.
2.2 The Digital Twin Platform
The design and evaluation of the Virtual Sensors is performed using a generalized Digital Twin platform developed for this project (note that Fig. 2 shows only the AI/Machine Learning component of the Digital Twin). The platform is a web-based industrial digitization tool for visualizing, analyzing, and processing the large data sets generated in such applications. It provides a no-code environment for the definition of virtual sensors (and other applications) and provides support for downstream decision making.
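The virtual-sensor idea of Sect. 2.1 (imputing a failed strain reading from the sensors that are still working) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the two synthetic input channels, and the use of regularized least squares are not taken from the platform, which is a no-code tool whose internals the paper does not describe.

```python
from typing import List, Optional, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data is left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_virtual_sensor(features: Sequence[Sequence[float]],
                       target: Sequence[float],
                       ridge: float = 1e-6) -> List[float]:
    """Fit target ~ w0 + w . features by regularized least squares."""
    rows = [[1.0] + list(f) for f in features]  # prepend the intercept column
    k = len(rows[0])
    # Normal equations: (X^T X + ridge * I) w = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
            for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def read_strain(physical: Optional[float],
                features: Sequence[float],
                weights: Sequence[float]) -> float:
    """Return the physical reading, or the virtual estimate if it is missing."""
    if physical is not None:
        return physical
    return weights[0] + sum(w * f for w, f in zip(weights[1:], features))


# Synthetic history (made-up numbers): two healthy channels and a strain
# value generated as 2 + 3 * channel_1 - 0.5 * channel_2.
history_features = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
                    (1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]
history_strain = [2.0, 5.0, 1.5, 4.5, 7.5, 4.0]
```

Calling `fit_virtual_sensor(history_features, history_strain)` once and then `read_strain(None, current_features, weights)` whenever the gauge drops out mirrors the fallback role the paper assigns to the virtual sensor; a real deployment would use the models compared in Sect. 3 rather than this linear stand-in.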
Fig. 2. Digital Twin Platform Functional Components
The main components are briefly detailed below:
Data Pre-Processing. A high level of pre-processing is needed for signal conditioning and synchronization before sampling. Noise reduction techniques using high-pass and band-pass filters are initially applied to the raw signals, followed by windowed averaging techniques. The synchronization of the signals is challenging since the rotational frequency is the sole criterion for the alignment, and it varies by sensor type and position. Upsampling and downsampling are both evaluated after time-synchronizing the sensor signals. There are downstream implications to the use of approximation approaches in pre-processing, since there are stringent accuracy requirements for the virtual sensors. Several iterations of pre-processing are required to reach the accuracy requirements.
Feature Engineering. The feature engineering module converts pre-processed observations into features using statistical and machine learning approaches. In this context, this involves the discovery of the most relevant set of signals, from the larger set of 600 sensor signals, that are useful to predict the virtual strain sensor.
Model Selection. The goal is to create a set of virtual strain sensors using information from other sensors in the system. The Model Selection component allows the user to evaluate a variety of machine learning and deep learning models to select the model with the best accuracy and/or evaluation metrics.
Model Tuning and Evaluation. The Model Tuning and Evaluation component is used to improve the performance of the selected model using machine learning tuning methodologies.
Model Deployment. The tuned virtual sensor is deployed using the Model Deployment component, which creates a standalone (without dependencies) software component that may be integrated into the user’s IT infrastructure. The component is used via an API or as a web service.
Presentation. This module offers the user the ability to view the output in a variety of alternate modes, including dashboards, reports, and visualization.
Model Operational Evaluation. The Model Operational Evaluation module is used to track the performance of the defined virtual sensor in its operational phase.
Note that for the purpose of the present study not all components of the Digital Twin platform are used.
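The pre-processing and feature-engineering steps listed above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the platform's routines: the high-pass and band-pass filtering stage is omitted, resampling uses plain linear interpolation, and ranking candidate sensors by absolute Pearson correlation with the strain signal is only one of many possible "statistical approaches". All function and sensor names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def windowed_average(signal: Sequence[float], window: int) -> List[float]:
    """Average consecutive non-overlapping windows (reduces the rate by `window`)."""
    return [
        sum(signal[i:i + window]) / window
        for i in range(0, len(signal) - window + 1, window)
    ]


def resample_linear(signal: Sequence[float],
                    src_rate: float, dst_rate: float) -> List[float]:
    """Resample onto a uniform grid at dst_rate using linear interpolation."""
    duration = (len(signal) - 1) / src_rate
    n_out = int(math.floor(duration * dst_rate + 1e-9)) + 1
    out = []
    for k in range(n_out):
        pos = k / dst_rate * src_rate  # fractional index into the source
        lo = min(int(math.floor(pos)), len(signal) - 1)
        hi = min(lo + 1, len(signal) - 1)
        frac = pos - lo
        out.append(signal[lo] * (1 - frac) + signal[hi] * frac)
    return out


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either signal is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def rank_sensors(candidates: Dict[str, Sequence[float]],
                 target: Sequence[float], top_k: int) -> List[str]:
    """Keep the top_k candidate sensors most correlated with the target."""
    ranked = sorted(candidates,
                    key=lambda name: abs(pearson(candidates[name], target)),
                    reverse=True)
    return ranked[:top_k]
```

In this sketch every signal would first be brought to a common rate with `resample_linear` (upsampling or downsampling as needed), smoothed with `windowed_average`, and only then compared with the strain signal in `rank_sensors`, since the correlation is only meaningful for time-aligned series of equal length.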
3 Results
Virtual Strain Sensors for the aero-engine are designed and implemented on the Digital Twin platform using the test setup described in Sect. 2. The data pre-processing steps described previously are performed first. Recovering the virtual strain sensors in the presence of multiple sensor failures is challenging, and several regression paradigms based on machine learning and deep learning were tried. The results presented here detail the experiments to create virtual analogues of three physical strain sensors using the functioning sensors of the sensor bank (including the strain sensors themselves). The Machine Learning approaches tested require one regression model per virtual sensor, while the Long Short-Term Memory (LSTM) deep learning model requires only a single model for the entire set of virtual sensors. Table 1 summarizes the metrics of the different Machine Learning-based Virtual Sensors, and Fig. 3 shows the LSTM outputs of the Virtual Sensors. It is seen that the LSTM model tracks the physical sensors accurately, although all models fell short in their ability to capture sharp peaks. The LSTM-based Virtual Strain Sensors were picked for the operational environment, based on their accuracy and lower training requirements (only one model is required to represent the complete scenario).

Table 1. Metrics on Machine Learning Regression Models
Model              MAE     MSE      RMSE    RMSLE   R²
Gradient Boost     14.26   484.70   22.01   3.09    0.86
Random Forest      3.10    142.55   11.94   2.47    0.91
Ridge Regression   5.65    81.95    9.05    2.20    0.83
Note: MAE, MSE, RMSE, and RMSLE refer to the Mean Absolute Error, Mean Squared Error, Root Mean Squared Error, and Root Mean Squared Logarithmic Error, respectively.
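For readers who want to reproduce the columns of Table 1 on their own data, the five metrics can be computed as in the sketch below. The definitions are the common textbook ones and are an assumption here, since the paper does not state its exact formulas; in particular, RMSLE is taken as the root mean squared difference of log(1 + value), which requires values greater than -1.

```python
import math
from typing import Dict, Sequence


def regression_metrics(actual: Sequence[float],
                       predicted: Sequence[float]) -> Dict[str, float]:
    """Compute MAE, MSE, RMSE, RMSLE, and R^2 for paired samples."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    # log1p(x) = log(1 + x); undefined for x <= -1, so signed strain
    # values would need shifting before this metric is meaningful.
    rmsle = math.sqrt(
        sum((math.log1p(p) - math.log1p(a)) ** 2
            for a, p in zip(actual, predicted)) / n
    )
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    # R^2 is undefined when the actual signal is constant (ss_tot == 0).
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "RMSLE": rmsle, "R2": r2}
```

Because MAE weights all errors equally while MSE and RMSE penalize large errors more heavily, a model can rank first on one column and not on another, which is worth keeping in mind when reading Table 1.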
Fig. 3. LSTM Virtual Sensor(s) Predicted vs. Physical Sensor (LSTM Outputs from a Single Model). Legend: Physical Sensor (Actual), Virtual Sensor (Predicted).
The goal of this project was to define and test the use of Virtual Sensors to recover from failures in physical strain sensors in an aero-engine test facility. It is seen that the Virtual Sensors provide sufficiently accurate results that may be used for both downstream predictive and prognostic analysis. The model has been operationalized and is currently being used for real-time test tasks. While immediate concerns such as failure recovery and sensor accuracy were the initial drivers of the project, the practical reach of the project is significant. The availability of an integrated environment for analysis and visualization has provided Design Engineers with an unprecedented insight into the
operation of the aero-engine test environment, revealing inconsistencies in the positional assignment of sensors, anomalous sensor behavior, and several redundant sensors. These artifacts result from “stove piped” design and test tasks distributed among different Departments and functional units. The use of a lifecycle platform for tasks that range from design to disposal has implications for collaborative engineering work that cuts across individual specialties, with the prospect of more robust and more efficiently engineered products.
4 Conclusion
The Virtual Sensor development on a Digital Twin platform was conceived as a Proof-of-Concept project for an Aero-Engine Test Facility. Building a generalized Digital Twin platform is a successful approach, since many alternate analytical configurations could be evaluated rapidly. Of these alternate approaches, the LSTM model is seen to be the most applicable to the Virtual Sensor approach. This effort is currently being extended to sensor minimization and optimal sensor positioning. Both these projects will use the same Digital Twin infrastructure, with a consistent set of data pre-processing routines and the ability to define, test, and validate numerous analytical models. It should be noted that this functionality is available with the AI/Machine Learning component alone; the addition of Physical Modeling with Simulation would enhance the usefulness of the Digital Twin further. While aerospace is a challenging vertical and usually a first adopter of new technological paradigms, this project suggests that Digital Twin technology is a necessary adjunct to the digitization of any industry sector.
Acknowledgments. This research is funded in part by NSF Grants No. FAIN-1901150, NSF 1926806, and DOEd Grant P116Z220008 (1). Any opinions, findings, and conclusions expressed here are those of the author(s) and do not reflect the views of the sponsor(s).
References
1. Fourth Industrial Revolution, wikipedia.com. https://en.wikipedia.org/wiki/Fourth_Industrial_Revolution. Accessed 5 Nov 2022
2. Yin, H., Wang, Z.L.: Application and development prospect of digital twin technology in aerospace. IFAC-PapersOnLine 53(5), 732–737 (2020). https://doi.org/10.1016/j.ifacol.2021.04.165
3. Digital Twin: Definitions & Value, AIAA and AIA Position Paper, December 2020
4. Grieves, M.: Origins of the Digital Twin Concept (2016). https://www.researchgate.net/publication/307509727_. Accessed 25 Sept 2020
5. Wikipedia, Components of jet engines. https://en.wikipedia.org/wiki/Components_of_jet_engines. Accessed 15 Oct 2022
6. Brunello, A., Urgolo, A., Pittino, F., Montvay, A., Montanari, A.: Virtual sensing and sensors selection for efficient temperature monitoring in indoor environments. Sensors 21(8) (2021)
7. Martin, D., Kühl, N., Satzger, G.: Virtual sensors. Bus. Inf. Syst. Eng. 63(3), 315–323 (2021). https://doi.org/10.1007/s12599-021-00689-w
8. Trusova, K.V.: Synthesis and analysis of avionics functions digital twins using machine learning classification algorithms. In: Velichko, E., Kapralova, V., Karaseov, P., Zavjalov, S., Angueira, P., Andreev, S. (eds.) International Youth Conference on Electronics, Telecommunications and Information Technologies. SPP, vol. 268, pp. 3–18. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-81119-8_1
9. Tiainen, T., Miettinen, J., Viitala, R., Hiekkanen, K., Kuosmanen, P.: Digital twin and virtual sensor for a rotor system. In: Proceedings of the 30th International DAAAM Symposium Intelligent Manufacturing & Automation, October 2019
10. Xiong, M., Wang, H., Fu, Q., Xu, Y.: Digital twin–driven aero-engine intelligent predictive maintenance. Int. J. Adv. Manuf. Technol. 114(11–12), 3751–3761 (2021). https://doi.org/10.1007/s00170-021-06976-w
11. Xiong, M., Wang, H.: Digital twin applications in aviation industry: a review. Int. J. Adv. Manuf. Technol. 121, 5677–5692 (2022). https://doi.org/10.1007/s00170-022-09717-9
12. da Silva Mendonça, R., et al.: Digital twin applications: a survey of recent advances and challenges. Processes 10(4), 744 (2022)
Hierarchical Joint Entity Recognition and Relation Extraction of Contextual Entities in Family History Records
Daniel Segrera1(B), Chetan Joshi1, Lawry Sorenson1, Stephen Hood1, Timothy Brown1, Mark Clement1, Joseph Price1, Eric Burdett2, and Stanley Fujimoto2
1 Brigham Young University, Provo, UT 84602, USA
[email protected]
2 Ancestry.com LLC, 1300 West Traverse Parkway, Lehi, UT 84043, USA
[email protected]
Abstract. Entity extraction is an important step in document understanding. Higher-accuracy extraction of fine-grained entities can be achieved by combining the utility of Named Entity Recognition (NER) and Relation Extraction (RE) models. In this paper, a cascading model is proposed that implements NER and RE. This model utilizes relations between entities to infer context-dependent fine-grained named entities in text corpora. The RE module runs independently of the NER module, which reduces error accumulation from sequential steps. This process improves the fine-grained NER F1-score over the existing state of the art from 0.4753 to 0.8563 on our data, albeit on a strictly limited domain. This provides the potential for further applications in historical document processing. These applications will enable automated searching of historical documents, such as those used in economics research and family history.
Keywords: Natural Language Processing · Named Entity Recognition · Relation Extraction · Information Extraction
1 Introduction
Named Entity Recognition (NER), also called entity extraction or entity identification, is a natural language processing (NLP) technique that automatically identifies named entities (names, places, or dates, for example) in a text and classifies them into predefined categories. It is often sufficient to identify an entity as a coarse-grained entity (e.g., name) if the application is attempting to identify employees working for a company from a paragraph of text. Although this coarse-grained entity recognition is sufficient for many applications, fine-grained classification is necessary for family history applications, where it is necessary to know the relationship between different entities in addition to deriving their classification.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): SAI 2023, LNNS 711, pp. 742–752, 2023. https://doi.org/10.1007/978-3-031-37717-4_48
There are many companies and research organizations in the fields of family history and historical document understanding. Family history work helps people learn about their heritage and form connections with their ancestors. Automatic extraction of information en masse from historical documents provides a valuable service to these efforts. One part of this process is called entity extraction. For example, in family history, digital text is searched for particular entities. These entities include names of parents, names of children, birth dates, marriage dates, etc. These entities are extracted and compared to each other to build family tree charts in a process sometimes called indexing (as illustrated in Fig. 1). To precisely index historical documents, coarse-grained labels such as name or date are not enough; fine-grained labels, such as spouse name and marriage date, are necessary. Furthermore, these fine-grained labels often rely on the document’s internal context.
Fig. 1. Transcription on Records into Family Trees. Example of a Handwritten Birth Record, Written in French, which after Transcription (using the Handwriting Recognition Model) is Passed through our Joint Entity Relation Model in Order to Identify the Entities and their Respective Relations. Note that the Transcription is not Perfect as it Contains Spelling Mistakes, Grammatical Inconsistencies, etc. that make it Harder for the Model to find Correct Entities and Relations
Unfortunately, these entities are often fine-grained and contextual to relationships between entities within a record. These organizations do not have models
that can accurately extract such context-dependent fine-grained entities, so they instead extract important words as coarse-grained entities (such as person, place, or date) and manually label them as fine-grained classifications. This problem is even more pronounced for records written in languages such as French with less labeled data.
2 Related Work
In 2018, Bekoulis revolutionized the field of named entity recognition (NER) by conceptualizing joint entity and relationship extraction as a multi-head selection problem [1]. His paper demonstrated that relation extraction was a helpful tool to improve entity extraction accuracy. His original model used a bidirectional LSTM for sentence encoding. That same year, papers saw improved results using ELMo [6] for sentence encoding [8]. After the introduction of BERT [2] in 2019, several papers saw even better results when using BERT for sentence encoding instead of ELMo. Since then, papers in the field have improved entity recognition accuracy by creating joint models that use relation extraction. The ways these joint models are implemented vary. One approach is to have the joint model ask questions about the data [4,15]. Another approach uses distantly supervised data augmentation to reduce the impact of negative labels on the joint model [12]. Jue Wang’s paper sees improved relation extraction by filling an entity-relation table [11]. Hierarchical relationship extraction is particularly effective at detecting hierarchical relationships [3,10,14]. The most successful approaches for entity recognition insert markers into the sentences. This is done in models such as PURE [16]. These markers reduce the need for embeddings, which reduces the memory needed and improves inference speed. Papers that use this approach are in the top 3 micro-F1 scores for both entity extraction and relation extraction accuracy for the benchmark datasets CoNLL 2003, ACE2004, ACE2005, and SciERC [13]. The models from existing research perform well on benchmark datasets. However, they fail to perform on more complicated datasets, such as ones with fine-grained or contextual entities. Most of these models cannot find all the relations in a multi-sentence corpus because they cannot map a relationship between two entities in separate sentences.
However, cross-sentence relation extraction is necessary to find the nuanced relationships in family history records.
3 Problem Description
Existing research performs insufficiently on the task of fine-grained entity extraction for contextual entities. Traditional methods may be able to identify names and dates but are unable to distinguish Mother’s Name from Sister’s Name or to identify the dates corresponding to different events in a record. Much of the difficulty comes from the space between entities in a paragraph.
Combining entity recognition with relation extraction significantly improves the accuracy of contextual fine-grained entity extraction for automatic indexing systems used by researchers performing automated historical document analysis. This contributes a novel solution to significantly reduce the manual annotation effort when indexing records without sacrificing recognition accuracy. The indexing of family history records requires a NER model that identifies contextual fine-grained entities. However, it is difficult to train NER models to do that on their own. Relation extraction can help find the context in the corpus by identifying relations between entities. A combined model is needed that can both extract entities and the relationships between those entities. This kind of model more accurately extracts context-based fine-grained entities in family history records. Additionally, most real-world transcriptions from handwritten documents are full of errors. Common errors include incomplete transcriptions, transcriptions in the wrong language, incorrect labeling, grammar mistakes, and spelling errors. The model built must be able to overcome these difficulties.
Fig. 2. The Joint Entity and Relation Extraction Model. The BERT Embeddings are Passed Simultaneously through the NER and RE Module as an End-to-End Model to Preserve Cross-Sentence Context and Prevent the Compounding of Error.
3.1 Methods
A cascading entity-relation extraction model can be used to extract fine-grained entities from transcriptions resulting from automated handwriting transcription. A language model is needed to encode tokens and obtain contextualized representations for each input token in the training dataset. BERT was selected for our tests. Most joint entity and relation extraction models suffer from cascading error propagation. In this domain, errors in coarse-grained entity extraction can increase errors in the relation extraction phase of the algorithm. These errors can accumulate more rapidly if the two sides of the joint model are too dependent on each other. To prevent this, two sets of BERT-encoded tokens are simultaneously passed through the entity extraction module and the relation extraction module
D. Segrera et al.
as an end-to-end model (as illustrated in Fig. 2). Both modules use cross-entropy for their loss function in order to improve training accuracy. The outputs from both modules are then decoded together to provide the fine-grained labels. This approach preserves cross-sentence context and reduces the opportunity for errors to compound in the pipeline, relative to iterative relationship extraction after named entity recognition.

For each record, the pipeline labels entities with tokens (as seen in Example 1). Simultaneously, all possible pairs of tokens are fed to the relation extraction module (as seen in Example 2). For each pair, the module predicts the relation between the two entities and generates a triplet that represents the predicted relation. From those tokens, the combined model is decoded to classify the entities and generates tuples that represent the entity classifications. The tuples from the relation extraction module and entity extraction module can be used to infer the fine-grained labels of the entities.

Example 1: {ENTSTART=Name} Paul {ENTEND=Name} and {ENTSTART=Name} James Thompson {ENTEND=Name} ran to the store to get milk for their parents in {ENTSTART=Month} March {ENTEND=Month}.

Example 2: {SUBJSTART=Name} Paul {SUBJEND=Name} and {OBJSTART=Name} James Thompson {OBJEND=Name} ran to the store to get milk for their parents in March.

Data Augmentation. One of each format type is taken out of the test data and put in the training data. Training data is generated by augmenting the 5 samples of original data, and the rest of the data is test and evaluation data.

Named Entity Recognition. Named Entity Recognition is performed using the context-aware embeddings produced by BERT. After running the raw text through BERT, the embedding produced for each word is normalized. This normalization process prevents exploding gradients. Each embedding is then run through two linear output layers. The first layer predicts BIO tags [7]. BIO tagging is a common format for tagging tokens in a chunking task in computational linguistics (e.g., named-entity recognition). The B- prefix before a tag indicates that the tag is the beginning of a chunk, and an I- prefix before a tag indicates that the tag is inside a chunk. The B- tag is used only when a tag is followed by a tag of the same type without O tokens between them. An O tag indicates that a token belongs to no entity/chunk. The BIO tagging helps keep entities separate even if they are the same entity category. The last layer completes the process and predicts the entity type among Name, Date, Gender, Age, and a None class.
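The tagging scheme described above can be illustrated with a minimal sketch that groups tagged tokens back into entity spans. The function name and the sample tags are illustrative assumptions, not part of the described system. Because B- is used here only to separate adjacent chunks of the same type, the sketch lets a chunk open with either B- or I-.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group tokens into (entity_type, text) spans from B-/I-/O tags."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        # A new chunk starts on B-, or on I- when the entity type changes.
        starts_new = prefix == "B" or (prefix == "I" and label != current_type)
        if prefix == "O" or starts_new:
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if prefix in ("B", "I"):
            current_type = label
            current_tokens.append(token)
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hypothetical sample: two adjacent names are kept apart by the B- tag.
tokens = ["Paul", "James", "Thompson", "ran", "in", "March"]
tags = ["I-Name", "B-Name", "I-Name", "O", "O", "I-Month"]
print(bio_to_spans(tokens, tags))
```

Without the B- tag on "James", the three name tokens would merge into a single Name span, which is the situation the prefix is meant to avoid.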
Joint Entity-Relation Extraction
Relation Extraction. Initially, our relation extraction model was heavily inspired by the PURE model [16]. PURE works by adding markers to highlight the two objects within the sentence. These markers help the BERT part of the model extract context-dependent embeddings. As such, these markers reduce the need for embeddings, which reduces the memory needed and improves inference speed. The model then classifies relationships based on the embeddings for those markers. Training with these markers has the advantage of maintaining the embeddings of other words, which helps the PURE model generalize to unseen words.

However, the historical records differ largely from the records that PURE was trained on. Generally, relationships need to be found between 10–13 entities in those records. Re-encoding the sentence and running it through BERT for each possible pair of entities would cause a large computational overhead. Additionally, these records have a relatively constrained vocabulary, which reduces the generalization benefit of using markers. Due to these differences, the markers are dropped in the modified model. The relationships are instead identified by running the embeddings of the first word in each entity through two linear layers. One of these layers represents a transform for subject entities, while the other represents the transform for object entities. The inner product between all subject entities and object entities is then taken, and the result is run through a linear layer to give the final relationship classification. This system was derived from self-attention.

Dropping the markers also allows for sharing the BERT embeddings between the NER and RE steps, such that the raw text is passed through BERT once. We tested our system with both shared BERT embeddings and separate embeddings. We found that using separate embeddings performed better when using the described self-attention for the relationship output layer, and we report our results for this architecture.
Otherwise, shared embeddings performed better when the relationship extraction step used only a linear layer rather than the described self-attention. The computational overhead of running this classifier on all possible pairs of words in the sentence is negligible compared to the complexity of BERT itself. Due to this difference, the model can consider all possible relationships for all entity types in an end-to-end fashion. The final confidence of a relationship tuple is the geometric mean of the confidence of the relationship with the confidence of the entities involved.

Extracting Fine-Grained Entities. After the relations between entities are found in a record, they can be converted into fine-grained entities. By assigning one person as the primary subject of the record, these entity relationships can be used to convert the coarse-grained labels on entities (such as Name) into fine-grained entities (such as FatherName). For example, if “Ted” is the main subject of the record, then:
(“Susan”, SpouseOf, “Ted”) → (“Susan”, SpouseName)
(“1830”, BirthOf, “Ted”) → (“1830”, BirthYear)
(“Thirty”, AgeOf, “Ted”) → (“Thirty”, SelfAge)
(“Male”, GenderOf, “Ted”) → (“Male”, SelfGender)
(“Luxembourg”, BirthPlaceOf, “Ted”) → (“Luxembourg”, BirthPlace)
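The conversion above can be sketched as a simple lookup over relation triples. This is only an illustration under stated assumptions: the mapping table, function names, and confidence values are hypothetical, the lookup is keyed on the relation alone (a fuller version would presumably also use the coarse entity type), and the geometric mean is taken over three confidences (the relation and its two entities).

```python
# Hypothetical lookup: relation to the primary subject -> fine-grained label.
RELATION_TO_LABEL = {
    "SpouseOf": "SpouseName",
    "BirthOf": "BirthYear",
    "AgeOf": "SelfAge",
    "GenderOf": "SelfGender",
    "BirthPlaceOf": "BirthPlace",
}


def tuple_confidence(relation_conf, subject_conf, object_conf):
    """Geometric mean of the relation confidence and both entity confidences."""
    return (relation_conf * subject_conf * object_conf) ** (1.0 / 3.0)


def fine_grained_labels(triples, primary):
    """Relabel every entity that is related to the primary subject."""
    labels = []
    for value, relation, target in triples:
        if target == primary and relation in RELATION_TO_LABEL:
            labels.append((value, RELATION_TO_LABEL[relation]))
    return labels


triples = [
    ("Susan", "SpouseOf", "Ted"),
    ("1830", "BirthOf", "Ted"),
    ("Thirty", "AgeOf", "Ted"),
    ("Male", "GenderOf", "Ted"),
    ("Luxembourg", "BirthPlaceOf", "Ted"),
]
print(fine_grained_labels(triples, primary="Ted"))
print(tuple_confidence(0.9, 0.8, 0.95))  # illustrative confidences only
```

Because each fine-grained label depends on both an entity prediction and a relation prediction, a low confidence in either one lowers the confidence of the resulting tuple.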
4 Experiments

4.1 Setup
To verify the effectiveness of this system for cascading entity and relation extraction, comprehensive experiments are conducted on a corpus of French birth records and marriage records. Our joint model uses the same training dataset and test dataset for both the NER component and the relation extraction component. Using data augmentation on the French records, a dataset of 400 artificial birth and marriage records is generated at the time of execution. This is used as the training dataset. The test dataset is the original 270 hand-labeled French birth records and marriage records. However, this test dataset comes with many complications. These include inaccurate labels, spelling mistakes, missing punctuation, myriads of categories, and lengthy sentences. French also has fewer existing language models and more complicated grammar rules, adding to the complications. All these anomalies make it harder for neural networks to understand the data. By contrast, the benchmark datasets (in relation extraction) consist of nearly perfect English sentences with fewer entity and relation types. These qualities make the benchmark datasets less noisy than our French dataset.

Table 1. Entity Recognition Accuracy Metrics on a French Birth and Marriage Dataset from the 19th Century. It is Interesting to Note that Precision and Recall are more than 90% for All the Entities. The Date Entity has been Further Divided into the Year, Month and Day and Still its Accuracy Metrics are not much Compromised.
Entity       Precision   Recall   F1
Name         98.92       97.40
Year         98.62       98.62
Month        92.41       94.81
Day          94.12       93.02
Gender       99.99       99.99
Age          95.52       99.99
Micro Avg.   98.49       97.25    97.87

4.2 Entity Extraction Results
The coarse-grained entity extraction component of our joint model has a total micro F1-score of approximately 97.87% (as seen in Table 1). This is a marked
improvement over the benchmark model in Entity Recognition, the PL-Marker model, which has a micro F1-score of 91.1% [13]. The higher accuracy of our entity recognition model also benefits the relation extraction component of our joint model. Notably, our entity extraction architecture trains successfully with relatively little training data and performs remarkably well. This also allows our model to train very quickly. One reason for the strong entity extraction results is our use of BERT. Its ability to parse sentences bidirectionally allows for embeddings to be built bidirectionally.

Table 2. All Models are Run on our Data. In these Tests our Model Performs Better than the Best Models in the Field of NER (97.9%) and the Field of Relation Extraction (97.4%)

Model                 NER Micro F-1   Relation Micro F-1
Ours                  97.9            97.4
PL-Marker [13]        91.1            73.0
PURE [16]             90.9            69.4
Table-Sequence [11]   89.5            67.6
TriMF [9]             87.6            66.5
TablERT [5]           88.0            66.1
Table 3. Relation Recognition Accuracy Metrics on French Birth and Marriage Dataset from the 19th Century. It is Interesting to Note that Precision and Recall are more than 90% for all the Relations

Relation     Precision   Recall   F1
GenderOf     99.99       99.99
AgeOf        96.33       98.13
FatherOf     91.428      98.63
MotherOf     97.26       97.10
SpouseOf     92.86       99.99
BirthOf      99.99       99.99
MarriageOf   99.99       99.99
Micro Avg.   96.04       98.83    97.42
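As a quick arithmetic check, the micro-averaged F1 values in Tables 1 and 3 agree with the harmonic mean of the reported micro precision and recall. The helper name below is an illustrative assumption; the inputs are the Micro Avg. rows of the two tables.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# Micro Avg. rows: (precision, recall, reported F1).
reported = {
    "Table 1 (entities)": (98.49, 97.25, 97.87),
    "Table 3 (relations)": (96.04, 98.83, 97.42),
}
for name, (p, r, f1) in reported.items():
    print(name, f1_score(p, r), "reported:", f1)
```

Both computed values match the reported F1 to within rounding of the second decimal place.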
Table 4. Comparison of Fine-Grained Entity Extraction between a State-of-the-Art NER Model (FlairNER) and our Proposed Joint Model. Our Model Stands as a Clear Winner with an F1 Score being 85.63 whereas the Regular NER Model’s F1 Score is 47.53. FlairNER Seemingly does Better on Dictionary based Entities and Non-Relation-Based Entities as it Runs by not Factoring in Relationships. Note our Model’s Huge Improvement in Important Fine-Grained Entities like SelfName, FatherName, MotherName, SpouseName, FatherAge, MotherAge. Also, there is a huge Improvement in the Recall for Marriage Dates and Birth

State-of-the-art Fine-grained Entity Extraction (FlairNER)

Fine-Entity         Precision   Recall   F1
SelfGender          98.04       99.99
SelfName            16.67       0.55
FatherName          36.84       28.00
MotherName          42.86       45.00
SpouseName          41.74       47.52
SpouseFatherName    34.12       38.16
SpouseMotherName    44.29       48.44
OtherPersonName     3.80        88.97
SelfAge             76.71       73.68
FatherAge           51.90       49.40
MotherAge           56.07       64.52
SpouseAge           83.87       72.22
SpouseFatherAge     33.61       16.67
SpouseMotherAge     35.71       35.71
BirthDay            72.73       77.78
BirthMonth          76.09       83.33
BirthYear           90.34       81.12
MarriageDay         47.37       45.00
MarriageMonth       53.33       30.00
MarriageYear        52.46       56.14
Micro Avg.          65.29       37.36    47.53

Our model (Joint NER-Relation Extraction)

Fine-Entity         Precision   Recall   F1
SelfGender          99.99       96.15
SelfName            89.29       97.40
FatherName          81.25       90.28
MotherName          89.74       92.11
SpouseName          75.86       84.62
SpouseFatherName    46.15       57.14
SpouseMotherName    56.00       58.33
OtherPersonName     91.80       84.85
SelfAge             75.81       99.99
FatherAge           76.92       76.92
MotherAge           91.67       88.00
SpouseAge           60.71       77.27
SpouseFatherAge     36.35       38.75
SpouseMotherAge     20.00       33.33
BirthDay            66.67       99.99
BirthMonth          66.67       99.99
BirthYear           99.99       99.99
MarriageDay         99.99       99.99
MarriageMonth       57.143      99.99
MarriageYear        50.00       99.99
Micro Avg.          83.98       87.35    85.63
4.3 Relation Extraction Results
The benchmark in the field for relation extraction, the PL-Marker model [13], has a micro F1 score of 73% when run on our data (as seen in Table 2). Our model has achieved an RE micro F1 score of approximately 97.42% (as seen in Table 3). This is an exceptional improvement in relation extraction performance. Our model achieves this level of improvement due to the nature of joint entity-relation extraction models, where training entity extraction improves relation extraction and, to a lesser degree, training relation extraction improves entity extraction. Much like the Entity Extraction component of our model, the Relation Extraction component also trains quickly and accurately, with relatively little training data. These results of the relation extraction module (in conjunction with the entity extraction module) demonstrate the value of our model.

4.4 Final Results
Next, the predicted relationships and entities are used to infer the fine-grained entities. Whereas the entity recognition module and relation extraction module avoid compounding errors by using an end-to-end architecture, inferences made after training are very susceptible to compounding errors. This explains why certain fine-grained entities, such as the age of the subject’s spouse’s mother, have such low accuracy metrics; they are too sensitive to compounding errors. However, the final results (as seen in Table 4) show the joint model is a marked improvement over isolated NER (like FlairNER) when indexing family history documents.
5 Conclusion
The model succeeded in indexing the fine-grained entities in family history records within an acceptable margin of accuracy. Results were validated by recording the model’s NER micro F1-score, relation extraction micro F1-score, and the F1 score for each fine-grained entity type at the end of the pipeline. We then compared the NER micro F1-score and RE micro F1-score against the corresponding metrics for benchmark joint entity-relation extraction papers. The results of our joint model are better than the benchmark models in the field, even though our data was much noisier than the data they used. This additionally suggests our model is robust to noise in the data. As such, our model contributes to the current literature on joint entity-relation extraction in terms of family history records.
References
1. Bekoulis, G., Deleu, J., Demeester, T., Develder, C.: Joint entity recognition and relation extraction as a multi-head selection problem. Expert Syst. Appl. 114, 34–45 (2018). arXiv:1804.07847
2. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding, May 2019. arXiv:1810.04805 [cs]
3. Han, X., Yu, P., Liu, Z., Sun, M., Li, P.: Hierarchical relation extraction with coarse-to-fine grained attention. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2236–2245. Association for Computational Linguistics, Brussels, Belgium (2018)
4. Li, X., et al.: Entity-relation extraction as multi-turn question answering. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1340–1350, Florence, Italy. Association for Computational Linguistics (2019)
5. Ma, Y., Hiraoka, T., Okazaki, N.: Named entity recognition and relation extraction using enhanced table filling by contextualized representations, January 2022. arXiv:2010.07522 [cs]
6. Peters, M.E., et al.: Deep contextualized word representations. arXiv:1802.05365 [cs], March 2018
7. Ramshaw, L.A., Marcus, M.P.: Text Chunking Using Transformation-Based Learning, pp. 157–176. Springer, Dordrecht (1999). https://doi.org/10.1007/978-94-017-2390-9_10
8. Sanh, V., Wolf, T., Ruder, S.: A hierarchical multi-task approach for learning embeddings from semantic tasks. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 6949–6956 (2019)
9. Shen, Y., Ma, X., Tang, Y., Lu, W.: A trigger-sense memory flow framework for joint entity and relation extraction, April 2021. arXiv:2101.10213 [cs]
10. Takanobu, R., Zhang, T., Liu, J., Huang, M.: A hierarchical framework for relation extraction with reinforcement learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 7072–7079 (2019)
11. Wang, J., Lu, W.: Two are better than one: joint entity and relation extraction with table-sequence encoders. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1706–1721, Online. Association for Computational Linguistics (2020)
12. Xie, C., Liang, J., Liu, J., Huang, C., Huang, W., Xiao, Y.: Revisiting the negative data of distantly supervised relation extraction. arXiv:2105.10158 [cs], May 2021
13. Ye, D., Lin, Y., Sun, M.: Pack together: entity and relation extraction with levitated marker. arXiv:2109.06067 [cs], October 2021
14. Zhang, K., et al.: Open hierarchical relation extraction. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 5682–5693, Online. Association for Computational Linguistics, June 2021
15. Zhao, T., Yan, Z., Cao, Y., Li, Z.: Asking effective and diverse questions: a machine reading comprehension based framework for joint entity-relation extraction. In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pp. 3948–3954, Yokohama, Japan. International Joint Conferences on Artificial Intelligence Organization, July 2020
16. Zhong, Z., Chen, D.: A frustratingly easy approach for entity and relation extraction. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 50–61, Online. Association for Computational Linguistics, June 2021
Coupled-Tensor Generated Word Embeddings and Their Composition
Matej Cibula and Radek Marik
CTU in Prague, FEE, Technicka 2, Prague 6, Czech Republic
[email protected]
Abstract. Contemporary methods of computing vector-space embeddings of words are able to accurately capture both their semantic and syntactic properties. Methods for computing n-gram embeddings do, however, come with downsides. They either require high resources during training or estimation, or come with other disadvantages, such as loss of information about individual positions of words in phrases. We propose two novel approaches to training word vectors enabling a composition of word embeddings into n-gram embeddings. Both methods are based on coupled CP decomposition of tensors that are generated by a sequence of time-shifted word embeddings. We compare our methods with SGNS and show that they provide superior performance on word-analogy tasks.

Keywords: Tensor Decomposition · Word Analogy · Word Embeddings

1 Introduction
Recently, we have seen many deep learning approaches with an ever-increasing number of parameters, achieving state-of-the-art results on a wide range of natural-language-processing tasks [3,10]. The transition from treating words as atomic units to using real-valued vector representations as word proxies has been an essential factor in improving language models in the last decade. Those word embeddings can capture both semantic and syntactic meanings of individual words while using unlabeled, although large, corpora. Unfortunately, current large models, such as GPT3 [5], are resource-intensive during both their training and evaluation phases. Hence, simpler techniques are still of interest in many applications. Many of those methods take pre-trained word vectors as their input, creating a need for an improvement in this area. Traditionally, word vectors were evaluated based on the cosine distance between individual words, clustering semantically and syntactically more similar words closer together. This can be seen as creating a word embedding model with added similarity mapping. The work of Mikolov et al. [19] introduced the use of linear relationships between word vectors and provided a model that better retains the information about linear regularities between words. This approach has been further improved upon in the work of Pennington et al. [23],
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): SAI 2023, LNNS 711, pp. 753–767, 2023. https://doi.org/10.1007/978-3-031-37717-4_49
where they were able to further improve accuracy and propose a new training algorithm based on the modified decomposition of the word co-occurrence matrix using the least squares method. Their framework has more directly connected the word embeddings and word co-occurrence statistics. The work of Zhao et al. [31] explored the addition of n-grams into word-embedding spaces. They, too, used the decomposition of the global co-occurrence matrix, with the distinction of including even individual n-grams. Although their embeddings retain some linear regularities both between n-grams and between n-grams and unigrams, n-gram embeddings had to be trained as if they were separate ’words’. Hence, they did not provide an operation to compose unigram embeddings into an n-gram one. In other words, their ngram2vec class of models could not handle out-of-vocabulary n-grams even though they may recognize all the unigrams they are composed of. The problem of embeddings of out-of-vocabulary words was addressed by Bojanowski et al. [4], however, only in the case of unigrams.

In Sect. 2, we provide a summary of the previous works concerning word embeddings. Their differences are pointed out and, at the end of the section, we provide a brief comparison to our work. Section 3 contains basic definitions and notation used throughout this work. It concisely introduces the reader to word embeddings and tensor decomposition. In Sect. 4 we propose our two models, StackEmb and PermEmb. We explain the motivation for the introduction of the time-shift operation and subsequently describe its use in the word-composition process. We experimentally evaluate and compare our models to word2vec in Sect. 5. We describe the datasets and hyperparameters used. Finally, we compare our approaches to the original models. The proposed models, their results and limitations are discussed in Sect. 6. Lastly, Sect. 7 concludes the work and summarizes the results and the direction of further research.
2 Related Work
Models from the word2vec [19,20] package, especially the skip-gram model with negative sampling (SGNS), are baselines of word embeddings. Their counterpart, GloVe [23], estimates word embeddings by directly working with the word co-occurrence matrix. Extending both of those [31], even each n-gram can be given its own embedding vector. However, as opposed to this work, each n-gram embedding has to be trained separately. Phrase or sentence embeddings created from word embeddings have already been used before [15]. However, there are significant differences from our approach. Previous methods involved summation of individual word vectors together, resulting in a bag-of-words phrase representation as opposed to our approach that uses fixed n-grams, preserving word order. Yu and Dredze [29] used the Hadamard product to combine word vectors into phrases but use additional information, such as part-of-speech tags, that is either presented with the given phrase or is inferred using other independently-working methods.
The work of Huang et al. [12] uses a two-phase approach involving max-pooling convolutions, which allows them to use various-sized phrases. However, their approach is not as computationally efficient and does not allow the easy inclusion of additional words into phrases without repeating the whole computation. The next group of word-embedding combination models are deep neural networks, such as convolutional networks [13] or long short-term memory models [28]. Yet another approach to sentence/phrase embeddings involves the use of deep bidirectional language models [26]. Those embeddings provide superior performance on semantic-similarity tasks. Nonetheless, those models have relatively high resource requirements for both training and estimation, which limits their use in many applications. The sequence of words (phrase, sentence, n-gram) has to be evaluated together, and one cannot append a new word to the sequence without having to repeat the whole computation anew, which contrasts with the approach presented in this work. For those reasons, we omitted a comparison with those models from this work. Lastly, there has been work done in the area of coupled CP tensor decomposition [1,22], which involves decomposition of multiple tensors that share some of their components. For the StackEmb model, we adopt this approach, repeating multiple components between multiple tensors. In the case of our PermEmb model, there is really only one component that is being transformed and reused between multiple tensors. Nevertheless, we do not directly estimate values of those tensors, as we use an “online” iterative learning approach and the exact values of those tensors are not considered important in this work. As stated above, all previous approaches either introduced high computational complexity, or limited the expressivity potential of the model by omitting information about words’ order.
We propose two novel approaches to training word embeddings, inspired by methods from the field of tensor decompositions [14]. Our primary contribution is the introduction of the composition operation into the word-vector model [19]. The proposed approach allows an easy composition of n-gram embeddings solely from pretrained tensor-based unigram embeddings. Our models are able to represent both unigrams and n-grams in the same vector space. The n-gram embeddings are either accumulated from positional embeddings (StackEmb), or they are directly computed from the embeddings of individual unigrams (PermEmb) using only permutations and Hadamard (element-wise) multiplication. We evaluate our proposed models on various benchmarks, showing the benefits even for embeddings at the unigram level.
3 Preliminaries
In Subsect. 3.1 we review the word representation vectors, focusing on the skip-gram model with negative sampling from the word2vec package [19]. Subsection 3.2 introduces the concepts of tensor decompositions, focusing on CP decomposition [14].
3.1 Pretrained Word Representations
Word embeddings have become a key component of many natural-language understanding models, thanks to their ability to capture complex relationships between words of a language’s corpora. This approach is generally performed by training (semi-)unsupervised models on large-scale datasets. Based on their training algorithms, we can distinguish two classes of models: a) global matrix/tensor decompositions [7,23], and b) “online” or batch training methods [19,24]. While the “online” methods iterate over the training dataset, the global methods first estimate the whole-corpus statistics and afterwards use them to train a model. It has been shown [16] that the SGNS method is implicitly factorizing a word-context pointwise-mutual-information (PMI) matrix shifted by a constant. Arguably, the most crucial property of word embeddings is that the similarity between a pair of words can be relatively easily estimated. This estimation is commonly done by calculating the cosine similarity between their embedding vectors. In the case of normalized embeddings, this amounts to calculating the dot product of their embeddings:
\[ \cos(\angle ab) = \frac{a^{\top} b}{\|a\|_2 \, \|b\|_2}. \tag{1} \]
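A minimal sketch of Eq. (1) follows, using plain Python lists. The vectors and function names are illustrative assumptions. It also checks that, for normalized embeddings, the cosine similarity reduces to a dot product.

```python
import math


def dot(a, b):
    """Inner product a^T b of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Eq. (1): a^T b divided by the product of the Euclidean norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(a):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(dot(a, a))
    return [x / norm for x in a]


# Two made-up 2-dimensional embeddings.
a_vec = [3.0, 4.0]
b_vec = [4.0, 3.0]
print(cosine_similarity(a_vec, b_vec))
print(dot(normalize(a_vec), normalize(b_vec)))
```

Both printed values agree up to floating-point rounding, which is why normalized embeddings allow similarity search with dot products alone.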
The dot product is the core operation of most matrix and tensor decompositions, such as SVD, NMF, or CP decomposition. This property was already utilized in the formulation of the GloVe [23] model. While the original SGNS model was trained using SGD, GloVe introduced a pre-processing step during which it estimated the co-occurrence matrix. During the actual optimization phase, they were solving the weighted least-squares regression model. Another important property of word embeddings is the linear relationship between analogous words. It is used to answer questions of the type: b is to a what d is to c, find d. Or in mathematical terms: a − b + c = d. The work of Levy et al. [16] showed that the SGNS method is implicitly factorizing a word-context pointwise-mutual-information (PMI) matrix, shifted by a constant. This contribution provided a connection between global and “online” word-embedding methods.

3.2 Tensor Decomposition
CP tensor decomposition [14] is based on the idea of representing a general tensor as the finite sum of rank-1 tensors. An order-$N$ tensor $X \in \mathbb{R}^{d_1 \times \cdots \times d_N}$ is rank-1 if it can be computed as the outer product of $N$ vectors, i.e., $X = x^{(1)} \circ x^{(2)} \circ \cdots \circ x^{(N)}$. For a general rank-$R$ tensor $X$ this is typically denoted as
\[ X = \sum_{r=1}^{R} x_r^{(1)} \circ x_r^{(2)} \circ \cdots \circ x_r^{(N)} = [[X^{(1)}, X^{(2)}, \ldots, X^{(N)}]], \tag{2} \]
where matrices $X^{(i)}$ are components of the decomposition and a vector $x_j^{(i)}$ is the $j$-th column of $X^{(i)}$. We are, however, interested in rows of those matrices. In the case of an order-2 tensor, this relation can be rewritten as
\[ X = X^{(1)} {X^{(2)}}^{\top}. \tag{3} \]
As can be seen, rows of individual components each represent a “value” (or a “position”) in the given dimension. The Hadamard product is the element-wise multiplication. The Hadamard product of two vectors, $x$, $y$, is denoted as $x * y$. We could adapt the definition of the CP decomposition to use the Hadamard product instead of the vector inner product. By doing this, we get the following identity:
\[ [X]_{ab \ldots z} = \sum_{i=1}^{n} [a * b * \cdots * z]_i. \tag{4} \]
In other words, a tensor can be considered to be generated by vectors representing individual elements of its index space. Coupled decomposition of heterogeneous tensors [1,27] is a simultaneous decomposition of multiple tensors in such a way that they share one or more components; for example, matrices $A$, $B$ could be of the forms
\[ A = [[X, C^{(A)}]], \quad B = [[X, C^{(B)}]]. \tag{5} \]
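The row-wise view in Eq. (4) can be checked against the column-wise sum of rank-1 terms in Eq. (2) with a small sketch. The component matrices below are made-up values and the function names are illustrative assumptions; rows play the role of the vectors a, b, ..., z.

```python
from itertools import product


def hadamard(*vectors):
    """Element-wise product of any number of equally long vectors."""
    result = [1.0] * len(vectors[0])
    for vec in vectors:
        result = [r * v for r, v in zip(result, vec)]
    return result


def entry_from_rows(rows):
    """Eq. (4): sum the elements of the Hadamard product of component rows."""
    return sum(hadamard(*rows))


def entry_from_rank1_terms(components, index):
    """Eq. (2): sum over r of the product of the r-th column entries."""
    rank = len(components[0][0])
    total = 0.0
    for r in range(rank):
        term = 1.0
        for matrix, i in zip(components, index):
            term *= matrix[i][r]
        total += term
    return total


# Rank-2 components of an order-3 tensor with shape 2 x 3 x 2.
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.5, 1.0], [2.0, 0.0], [1.0, 1.0]]
C = [[1.0, -1.0], [2.0, 3.0]]

for a, b, c in product(range(2), range(3), range(2)):
    via_rows = entry_from_rows([A[a], B[b], C[c]])
    via_sum = entry_from_rank1_terms([A, B, C], (a, b, c))
    assert abs(via_rows - via_sum) < 1e-12
print(entry_from_rows([A[0], B[0], C[0]]))
```

Reading the decomposition by rows is what allows each row to be treated as the embedding of one index value, such as one word.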
4 Proposed Models

4.1 Word Representations as Decompositions
We start by noticing that SGNS is an implicit decomposition of a matrix consisting of PMI values (up to a constant) for word pairs [16]. Hence, we can write the following:
\[ P_{PMI} = [[X, C]], \tag{6} \]
where the $i$-th row of $X$ contains an embedding vector for a word that was assigned a hash value $i$ (we refer to it as the word $i$ in the following text), and the $i$-th row of $C$ contains the context vector for the word $i$. To extend our model to n-grams, we assume the existence of another decomposition that is coupled to the one above:
\[ P_{bi\text{-}PMI} = [[Y, X, C]], \tag{7} \]
where matrices X, C are the same in both decompositions. In fact, for high enough embedding dimension, this decomposition always exists. Just like this, we can construct a sequence of tensors of strictly increasing order. Each tensor would be composed of all the components from its predecessor plus a new one that represents the additional dimension. All of those decompositions have the same context component - C - ensuring that n-grams will be comparable to unigrams based on the context. Within this paper, we will refer to this extension of the basic model as the StackEmb model.
Unfortunately, the StackEmb approach requires us to estimate and store in memory one or more additional matrices (as compared to SGNS). If we combined all word matrices ($Y$, $X$, ...) into one, we would end up with the method of Zhao et al. [31], i.e., with a matrix containing all unigrams, bigrams, etc. We would be interested in a more efficient method. We hypothesize that with a high-enough embedding dimension, there exists a function $f(x)$, which we call the time-shift operation, that could connect matrices $X$, $Y$, $Z$ as
\[ Y = f(X), \quad Z = f(f(X)). \tag{8} \]

4.2 Time-Shift Operation
We start by requiring the existence of a time-shift operation $f : \mathbb{R}^n \to \mathbb{R}^n$. We then impose several requirements on this function:

1. time-shift needs to be reversible, i.e. $\exists f^{-1}\ (\forall x \in \mathbb{R}^n)\ f^{-1}(f(x)) = x$,
2. combination (of time-shifted vectors) is associative, i.e. $(\forall x, y, z)\ (f(x) * y) * z = f(x) * (y * z)$,
3. time-shift is distributive over combination, i.e. $(\forall x, y)\ f(x * y) = f(x) * f(y)$.

Ad 1, time-shift needs to be reversible so that we can always shift the time context in both directions, which allows for the ad-hoc n-gram extension. Another consequence of this property is that no information can be lost by time-shifting. This is in contrast to many neural network models for sequential data, such as RNNs, which exhibit exponential decay of past inputs.

Ad 2, as we use the time-shift operation to incorporate relative-time information, we require the combination operation to be associative. Thanks to this property, the context window can be updated ad hoc instead of having to rebuild the whole series of combinations for each new sub-gram.

Ad 3, we expect that a combination of time-shifted vectors gives the same result as time-shifting an already computed combination. This condition is an essential part of the requirement that our model allow the continuous expansion of n-grams.

As requirement 2 is trivially satisfied, we are interested in the other two requirements. To make the time-shift as computationally simple as possible, we assume that it has the form of a linear function:

\[ f(x) = Ax + b, \tag{9} \]

where A and b are a parameter matrix and a parameter vector, respectively.
Following condition 3, we get

\[ A(x * y) + b = (Ay + b) * (Ax + b). \tag{10} \]

After rewriting this relationship into its index-wise form, we get

\[ \forall i:\quad a_i^\top (x * y) + b_i = \left(a_i^\top x + b_i\right)\left(a_i^\top y + b_i\right), \tag{11} \]

and by expanding dot products and Hadamard products in the above, we get the condition

\[ b_i + \sum_{j=1}^{n} a_{ij} x_j y_j = b_i^2 + b_i \sum_{j=1}^{n} a_{ij} (x_j + y_j) + \sum_{j=1}^{n} \sum_{k=1}^{n} a_{ij} a_{ik} x_j y_k, \]

and lastly, by further reordering, we get the following equality:

\[ \sum_{j=1}^{n} a_{ij} x_j y_j + b_i = \sum_{j=1}^{n} a_{ij}^2 x_j y_j + \sum_{j=1}^{n} \sum_{\substack{k=1 \\ k \neq j}}^{n} a_{ij} a_{ik} x_j y_k + b_i \sum_{j=1}^{n} a_{ij} (x_j + y_j) + b_i^2. \]
As this must be satisfied by all vectors x, y, we get the following requirements for the time-shift operation assumed above:

1. $(\forall j, k)\ (j \neq k \implies a_{ij} a_{ik} = 0)$,
2. $a_{ij} = a_{ij}^2$,
3. $b_i = 0$.

In other words, each row of the matrix A may contain at most one non-zero element, and by the second condition all elements are equal to 0 or 1 (the value −1 is excluded, since it does not satisfy $a_{ij} = a_{ij}^2$). Returning to requirement 1 of the time-shift, to enforce invertibility of the matrix, each row has to contain a non-zero element, and all rows have to be linearly independent. Only one group of matrices satisfies all of those conditions, namely a particular subcategory of generalized permutation matrices. We define P as a permutation matrix and D as a diagonal matrix that satisfies $D^2 = I$, and we get that the time-shift operation has to be of the form

\[ f(x) = DPx. \tag{12} \]
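The requirements on the time-shift can be checked numerically for the case D = I. The sketch below is only an illustration, assuming NumPy and a randomly drawn permutation; the names `time_shift` and `inverse_time_shift` are hypothetical and not taken from the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Hypothetical example: a random permutation matrix P (the case D = I).
perm = rng.permutation(n)
P = np.eye(n)[perm]  # row i of P is the unit vector e_{perm[i]}


def time_shift(x):
    """f(x) = P x: a linear time-shift with b = 0."""
    return P @ x


def inverse_time_shift(x):
    """f^{-1}(x) = P^T x, because a permutation matrix is orthogonal."""
    return P.T @ x


x, y = rng.normal(size=n), rng.normal(size=n)

# Requirement 1: reversibility, f^{-1}(f(x)) = x.
reversible = np.allclose(inverse_time_shift(time_shift(x)), x)

# Requirement 3: distributivity over the Hadamard product,
# f(x * y) = f(x) * f(y).
distributive = np.allclose(time_shift(x * y), time_shift(x) * time_shift(y))

print(reversible, distributive)
```

Because `P @ x` only reorders the components of `x`, reordering an element-wise product gives the same result as multiplying the reordered vectors, which is why both checks succeed.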
In this work, we assume D = I.

4.3 Models
In this paper, we will refer to the method utilizing the time-shift operation as the PermEmb model. We can look at the PermEmb model as a decomposition algorithm for a series of tensors of the form

\[ [\![\, X(P^\top)^n, \ldots, X(P^\top)^2, XP^\top, X, C \,]\!] \qquad \forall n \in \mathbb{N}_0. \tag{13} \]
The properties of such a series depend on the exact choice of the permutation matrix P. If it were the identity, the model could not distinguish between different n-grams containing the same set of words. Similarly, if the order of the permutation is too small, the permuted components will periodically repeat themselves. A similar approach could be adopted to extend the context matrix C; however, we do not explore it in this work.

4.4 Composition
Following Eq. 13, the last dimension always represents the context using the matrix C. All the dimensions "built up" from X represent individual positions of words relative to the context. In other words, all n-grams from all of the tensors are aligned to the same context, and hence they can be compared with each other. We can combine all non-context components of the tensor into one large matrix using the Khatri-Rao product. Since the i-th row of the matrix X is the word vector for the word i, to compose word embeddings $x_i$ and $x_j$ into an n-gram embedding $x_{ij}$, based on Equations 7, 8 and 12 we calculate

\[ x_{ij} = (P x_i) * x_j. \tag{14} \]
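Equation 14 translates almost directly into code. The following is a minimal sketch, assuming NumPy, toy 4-dimensional vectors, and a hypothetical helper named `compose`; the cyclic-shift matrix `P` is just one possible permutation, not the one used in the experiments. It also checks numerically that composing a bigram with a third word agrees with shifting every word by its distance from the end of the trigram.

```python
import numpy as np


def compose(P, left, right):
    """Eq. 14: time-shift the left embedding and combine it with the right one."""
    return (P @ left) * right


n = 4
# Hypothetical permutation: a single cycle of length n (cyclic shift).
P = np.roll(np.eye(n), 1, axis=0)

# Toy word vectors (illustrative values only).
x_i = np.array([1.0, 2.0, 3.0, 4.0])
x_j = np.array([0.5, -1.0, 2.0, 1.0])
x_k = np.array([2.0, 0.0, 1.0, 1.0])

x_ij = compose(P, x_i, x_j)  # bigram "i j"
x_ji = compose(P, x_j, x_i)  # bigram "j i": a different vector

# Trigram "i j k", built recursively from the bigram ...
x_ijk = compose(P, x_ij, x_k)
# ... and built directly from the three word vectors.
x_ijk_direct = (P @ P @ x_i) * (P @ x_j) * x_k

print(x_ij, x_ji)
print(np.allclose(x_ijk, x_ijk_direct))
```

With the identity matrix in place of `P`, `compose` would return the same vector for both word orders, which is why a non-trivial permutation is needed to keep word order distinguishable.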
Thanks to our choice of the time-shift operation, we can use this relationship recursively for higher n-grams, i.e.

\[ x_{ijk} = \bigl(P((P x_i) * x_j)\bigr) * x_k = \bigl((P^2 x_i) * (P x_j)\bigr) * x_k = P^2 x_i * P x_j * x_k = (P x_{ij}) * x_k. \tag{15} \]

4.5 Optimization
Our models closely follow the optimization process of SGNS [19]. We use negative sampling only on context vectors. The general loss function is

\[ \log \sigma\left(z^\top c\right) + \sum_{i=1}^{k} \mathbb{E}_{c_i \sim P_n} \log \sigma\left(-z^\top c_i\right), \tag{16} \]
where $P_n$ is the sampling distribution. We adopt the same setting of the sampling distribution as SGNS. In the case of StackEmb, we iterate over the training data, decomposing all tensors considered. In the case of a bigram ij, we optimize in sequence for $z = x_i$, $z = x_j$, and $z = y_i * x_j$. In the case of PermEmb on bigrams, for each bigram ij we would optimize for $z = x_i$, $z = x_j$, and finally $z = (P x_i) * x_j$.
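As an illustration of Eq. 16, the sketch below evaluates the expression for one training pair, assuming that the expectation is approximated by k sampled negative context vectors (as is usual for SGNS). The function names, the random toy vectors, and the random permutation are hypothetical; this is not the authors' training code.

```python
import numpy as np


def log_sigmoid(t):
    """Numerically stable log(sigmoid(t)) = -log(1 + exp(-t))."""
    return -np.logaddexp(0.0, -t)


def negative_sampling_objective(z, c, negatives):
    """log s(z.c) + sum_i log s(-z.c_i) over the sampled negatives c_i."""
    positive = log_sigmoid(z @ c)
    negative = log_sigmoid(-(negatives @ z)).sum()
    return positive + negative


rng = np.random.default_rng(0)
dim, k = 8, 5

# Toy word vectors x_i, x_j and a toy context vector c.
x_i, x_j, c = rng.normal(scale=0.1, size=(3, dim))
# k negative context vectors, standing in for samples from P_n.
negatives = rng.normal(scale=0.1, size=(k, dim))
# Hypothetical permutation matrix for the PermEmb bigram term.
P = np.eye(dim)[rng.permutation(dim)]

# The three targets optimized in sequence for a PermEmb bigram "i j".
targets = [x_i, x_j, (P @ x_i) * x_j]
values = [negative_sampling_objective(z, c, negatives) for z in targets]
print(values)
```

Each value is a sum of log-probabilities and therefore never positive; training would adjust the vectors to push it towards zero.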
5 Experiments

In this paper, we focused on exploring possible additions (extensions) to the structure of the underlying space of word embeddings. Hence, in the experiments, we considered it important to show that our model is comparable with its predecessors. We emphasize that our embeddings are extended with additional structure in the form of a composition operation that turns word embeddings into n-gram embeddings using Eq. 14.
Fig. 1. Overview of the Accuracy of Embedding Models on the BATS Benchmark
5.1 Datasets

For word similarity, we used publicly available datasets consisting of word pairs with human-assigned similarity scores. The datasets used are: the partitions of WordSim353 [8], namely WordSim Similarity [30] and WordSim Relatedness [2]; the Mechanical Turk dataset [25]; MEN [6]; Rare Words [18]; and finally SimLex-999 [11]. Models are evaluated by computing the cosine similarity between word pairs and measuring its correlation with the human-assigned values using Spearman's rank correlation coefficient.

To compare models on word analogy tasks, we used the MSR [21] and Google's analogy [19] datasets. We removed problems involving words that are out-of-vocabulary with respect to our training data. Analogy datasets consist of four-word tasks, where we want to predict the fourth word based on the mutual relationship between the first two words. We evaluated those tasks using the 3CosAdd additive function [17].

Finally, we test our models on the BATS dataset [9], which is useful for analyzing how well embeddings capture various kinds of linguistic relations. It is a more balanced benchmark for testing word analogies, consisting of 40 tests in 4 categories.

All models were trained on a Wikipedia corpus. Words that appeared fewer than 5 times were ignored.

5.2 Hyperparameters
Unless stated otherwise, each model has been trained for 3 iterations with a starting learning rate equal to 0.025. High-frequency words were skipped with the sub-sampling parameter equal to $10^{-5}$, and for each positive example we sample 5 negative ones. The exact choice of window size, embedding dimension and permutation (in the case of PermEmb) can be seen in the tables. The implementation of the proposed models will be made publicly available.
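To make the role of the sub-sampling parameter concrete, the sketch below collects the stated settings in a dictionary and computes a discard probability for frequent words. It assumes the discard rule from the word2vec paper [20], P(discard) = 1 − sqrt(t / f), which the text above does not spell out; the dictionary keys and the function name are hypothetical.

```python
import math

# Settings stated in the text; the key names are illustrative only.
HYPERPARAMETERS = {
    "iterations": 3,
    "initial_learning_rate": 0.025,
    "subsampling_threshold": 1e-5,
    "negative_samples": 5,
    "min_word_count": 5,
}


def discard_probability(relative_frequency, threshold=1e-5):
    """Probability of skipping one occurrence of a word.

    Assumed rule (word2vec paper): 1 - sqrt(threshold / frequency),
    clipped at 0 so that rare words are never skipped.
    """
    if relative_frequency <= 0:
        raise ValueError("relative_frequency must be positive")
    return max(0.0, 1.0 - math.sqrt(threshold / relative_frequency))


threshold = HYPERPARAMETERS["subsampling_threshold"]
for frequency in (1e-2, 1e-3, 1e-5, 1e-6):
    print(frequency, discard_probability(frequency, threshold))
```

Under this assumed rule, a word that makes up 0.1% of the corpus would be skipped about 90% of the time, while words at or below the threshold frequency are always kept.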
Table 1. Accuracy of the PermEmb on Word Similarity Tasks based on the Choice of Permutation

Dim  Permutation             MEN   Rare Words  Mech. Turk  SL-999  WS353  WS Sim.  WS Rel.
300  (10,20,30,40,50,70,80)  .648  .382        .578        .285    .612   .535     .707
300  (300)                   .652  .378        .584        .299    .605   .518     .691
300  (20 * 15)               .655  .375        .573        .307    .600   .520     .701
300  (3 * 100)               .651  .365        .564        .296    .594   .512     .700
300  (10 * 30)               .650  .373        .564        .303    .599   .524     .695
Table 2. Accuracy of the PermEmb on Word Analogy Tasks based on the Choice of Permutation

Dim  Permutation             Google  MSR   Google Sem.  Google Syn.
300  (10,20,30,40,50,70,80)  .389    .354  .334         .422
300  (300)                   .379    .346  .336         .404
300  (20 * 15)               .384    .346  .328         .418
300  (3 * 100)               .391    .345  .346         .417
300  (10 * 30)               .382    .356  .339         .408
Table 3. Accuracy on Word Similarity Tasks for Various Models with Fixed Embedding Dimension Equal to 500

Model              MEN   Rare Words  Mech. Turk  SL-999  WS353  WS Sim.  WS Rel.
PermEmb (10 * 50)  .668  .384        .575        .304    .608   .546     .701
StackEmb           .625  .370        .555        .299    .587   .494     .670
SGNS               .650  .385        .607        .337    .673   .614     .732
Table 4. Accuracy on Word Analogy Tasks for Various Models with Fixed Embedding Dimension Equal to 500

Model              Google  MSR   Google Sem.  Google Syn.
PermEmb (10 * 50)  .418    .360  .374         .444
StackEmb           .434    .396  .367         .473
SGNS               .402    .378  .306         .459

5.3 Dependence on the Choice of Permutation
Since it can be shown that permutations with the same set of independent cycles are equivalent from the point of view of our model, we characterize each permutation by the lengths of its independent cycles. For example, a (3 ∗ 10, 20)-permutation is composed of 4 independent cyclic permutations, 3 of which have an orbit length equal to 10 and one of which has an orbit length equal to 20. This permutation can only correspond to a model with an embedding dimension equal to 3 · 10 + 20 = 50.
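The cycle-length notation can be made concrete with a short sketch. The code below is an illustration only, assuming NumPy; the helper names `permutation_from_cycles` and `permutation_order` are hypothetical. It builds the (3 ∗ 10, 20)-permutation from the example and computes after how many time-shifts the permuted components repeat, which is the least common multiple of the cycle lengths.

```python
import math
from functools import reduce

import numpy as np


def permutation_from_cycles(cycle_lengths):
    """Index array of a permutation whose disjoint cycles have the given lengths."""
    perm = []
    start = 0
    for length in cycle_lengths:
        # Inside a block, position p maps to the next position, wrapping around.
        perm.extend(start + (p + 1) % length for p in range(length))
        start += length
    return np.array(perm)


def permutation_order(cycle_lengths):
    """Smallest m >= 1 with P^m = I: the lcm of the cycle lengths."""
    return reduce(lambda a, b: a * b // math.gcd(a, b), cycle_lengths, 1)


# The (3 * 10, 20)-permutation from the text.
cycles = [10, 10, 10, 20]
perm = permutation_from_cycles(cycles)
dimension = len(perm)
order = permutation_order(cycles)

# Permutation matrix acting as (P x)[i] = x[perm[i]].
P = np.eye(dimension, dtype=int)[perm]
print(dimension, order)
```

A small order means that repeated time-shifts start producing the same vectors again after only a few positions, which is the periodic repetition mentioned in Sect. 4.3.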
The results can be seen in Tables 1 and 2. Outside of degenerate cases, we have not noticed much dependence of the model's quality on the choice of the permutation matrix.

5.4 Comparison to Other Methods
We compare the quality of our embeddings to those of the skip-gram model with negative sampling (SGNS) from word2vec [20]. The results of the basic benchmarks can be seen in Tables 3 and 4. Although word2vec has superior performance on most of the similarity datasets, our models outperform it on word analogy tasks. The results of the BATS benchmark are shown in Fig. 1. While StackEmb is the best performer in the first half of the tasks, the second half is dominated by the PermEmb model. The first 20 tasks of the benchmark focus on inflectional and derivational morphology; the other 20 focus on lexicographic and encyclopedic semantics.

5.5 Mutual Comparison of Phrase Embeddings
As stated in previous sections, our proposed models can be used to generate embeddings of phrases, i.e., n-grams. This is done iteratively by composing the embedding vector of an (n − 1)-gram with the embedding vector of the preceding (or following) word. The whole computation consists of only the multiplication and time-shift operations (refer to Eq. 14). We compare our models with the Phrase Skip-Gram word2vec model [20]. This model first identifies commonly occurring phrases and afterwards replaces them with new unique tokens. We trained our proposed models in the standard way and then compared them based on their phrase embeddings. In this experiment, we consider the word2vec model to be the ground truth, as each phrase is given its own independent embedding vector. Our goal is to show that our models are able to implicitly express the same information with a considerably lower memory footprint. To compare the phrase embeddings of two different models, we start by finding the 5 closest words to each phrase in both models. Afterwards, we compute the Jaccard index J(A, B) between those two sets, where

\[ J(A, B) = \frac{|A \cap B|}{|A \cup B|}. \tag{17} \]
Then, to compare individual models, we calculate their average Jaccard index over all phrases. The ground-truth model (Phrase Skip-Gram) has a Jaccard-index average equal to 1. For all the other models, the higher the Jaccard-index average, the more similar they are to the ground truth. This way, we are able to quantify the similarity between the induced topologies of all models.
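The comparison procedure can be sketched in a few lines. The example below is a minimal illustration, assuming that the 5 nearest words of each phrase are already available as plain Python collections; the phrases, neighbour lists, and function names are invented for the example and do not come from the experiments.

```python
def jaccard_index(a, b):
    """Eq. 17: |A intersect B| / |A union B| for two collections of words."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        raise ValueError("Jaccard index is undefined for two empty sets")
    return len(a & b) / len(union)


def average_jaccard(neighbours_model, neighbours_reference):
    """Mean Jaccard index over the phrases present in both mappings."""
    shared = sorted(neighbours_model.keys() & neighbours_reference.keys())
    if not shared:
        raise ValueError("the two models share no phrases")
    total = sum(
        jaccard_index(neighbours_model[p], neighbours_reference[p]) for p in shared
    )
    return total / len(shared)


# Hypothetical nearest-word sets (5 words per phrase) for two models.
reference = {
    "new york": ["city", "manhattan", "brooklyn", "boston", "chicago"],
    "ice cream": ["dessert", "chocolate", "cake", "milk", "candy"],
}
model = {
    "new york": ["city", "manhattan", "brooklyn", "boston", "london"],
    "ice cream": ["dessert", "chocolate", "cake", "milk", "candy"],
}

print(jaccard_index(model["new york"], reference["new york"]))
print(average_jaccard(model, reference))
```

In the first phrase, 4 of the 5 words match, so the intersection has 4 elements and the union has 6, giving a Jaccard index of 4/6 = 2/3.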
Table 5. The Average of Jaccard Indices of 5-Closest-Words-to-Phrase Sets. The Embedding Dimension of All Models is Equal to 500

Model              Jaccard-Index Average
PermEmb (10 * 50)  .48
StackEmb           .71
Phrase Skip-Gram   1
The results of the phrase embedding experiment are shown in Table 5. The StackEmb model has attained a higher score than PermEmb. This is most likely caused by its higher expressivity, as it has twice as many learnable parameters as PermEmb. Nonetheless, both models seem to provide satisfactory results (if 4 words out of 5 match, the intersection contains 4 words and the union 6, so the Jaccard index equals 2/3).
6 Discussion

In this work, we have compared our models only with other similar approaches. It would be more rigorous to use state-of-the-art large language models as the baseline for our comparisons. However, we consider such a comparison to be unfair, as we used limited training data and computational resources. Instead, we focused on presenting a new approach to language modelling and left large-model applications to (possible) future work. One of the issues we dealt with is the comparison of n-gram embeddings with each other. We have chosen to set one model as the "ground truth" and compare all the other ones against it (see Sect. 5.5). This approach assumes that one of the models is the correct one, which is not a valid assumption. The other option would be an indirect comparison through an external benchmark. Unfortunately, most of the benchmarks do not rely directly on phrase embeddings, which can cause high variance in their results.
7 Conclusions

We have introduced two computationally efficient ways of training vector representations of words and composing them into n-gram representations. The primary goal of this paper is a proof of concept that allows easy and efficient composition of single-word representations into n-gram representations based on tensor-decomposition methods. Our experimental results on unigrams demonstrate the effectiveness of our models, especially on word analogy tasks, where they outperform SGNS. Compared to the prior art, we have lowered the memory footprint of n-gram representations by introducing a composition operation on the embedding space, which means that in the case of the PermEmb model, only unigram vectors need to be stored in memory.

In future work, we would like to introduce a better evaluation framework for n-gram embeddings and focus on similar extensions in the context of transformer-like large-scale neural network models.

Acknowledgments. Sponsored by the project for SGS ČVUT No: SGS22/163/OHK3/3T/13: Generative Models of Categorical Data Sequences.
References

1. Acar, E., Kolda, T.G., Dunlavy, D.M.: All-at-once optimization for coupled matrix and tensor factorizations. arXiv, abs/1105.3422 (2011)
2. Agirre, E., Alfonseca, E., Hall, K., Kravalova, J., Paşca, M., Soroa, A.: A study on similarity and relatedness using distributional and WordNet-based approaches. In: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 19–27, Boulder, Colorado, June 2009. Association for Computational Linguistics (2009)
3. Beltagy, I., Peters, M.E., Cohan, A.: Longformer: the long-document transformer. arXiv, abs/2004.05150 (2020)
4. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017)
5. Brown, T.B., et al.: Language models are few-shot learners. arXiv, abs/2005.14165 (2020)
6. Bruni, E., Boleda, G., Baroni, M., Tran, N.-K.: Distributional semantics in technicolor. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 136–145, Jeju Island, Korea, July 2012. Association for Computational Linguistics (2012)
7. Deerwester, S.C., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.A.: Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41, 391–407 (1990)
8. Finkelstein, L., et al.: Placing search in context: the concept revisited. ACM Trans. Inf. Syst. 20, 116–131 (2002)
9. Gladkova, A., Drozd, A., Matsuoka, S.: Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn't. In: Proceedings of the NAACL Student Research Workshop, pp. 8–15, San Diego, California, June 2016. Association for Computational Linguistics (2016)
10. He, P., Liu, X., Gao, J., Chen, W.: DeBERTa: decoding-enhanced BERT with disentangled attention. arXiv, abs/2006.03654 (2021)
11. Hill, F., Reichart, R., Korhonen, A.: SimLex-999: evaluating semantic models with (genuine) similarity estimation. Comput. Linguist. 41(4), 665–695 (2015)
12. Huang, F., Anandkumar, A.: Unsupervised learning of word-sequence representations from scratch via convolutional tensor decomposition. arXiv, abs/1606.03153 (2016)
13. Kalchbrenner, N., Grefenstette, E., Blunsom, P.: A convolutional neural network for modelling sentences. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 655–665, Baltimore, Maryland, June 2014. Association for Computational Linguistics (2014)
14. Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Rev. 51, 455–500 (2009)
15. Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: International Conference on Machine Learning, pp. 1188–1196. PMLR (2014)
16. Levy, O., Goldberg, Y.: Neural word embedding as implicit matrix factorization. In: Advances in Neural Information Processing Systems, vol. 27 (2014)
17. Levy, O., Goldberg, Y., Dagan, I.: Improving distributional similarity with lessons learned from word embeddings. Trans. Assoc. Comput. Linguist. 3, 211–225 (2015)
18. Luong, T., Socher, R., Manning, C.: Better word representations with recursive neural networks for morphology. In: Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pp. 104–113, Sofia, Bulgaria, August 2013. Association for Computational Linguistics (2013)
19. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: Bengio, Y., LeCun, Y. (eds.) 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, 2–4 May 2013, Workshop Track Proceedings (2013)
20. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, pp. 3111–3119, Red Hook, NY, USA, Curran Associates Inc. (2013)
21. Mikolov, T., Yih, W.-T., Zweig, G.: Linguistic regularities in continuous space word representations. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746–751, Atlanta, Georgia, June 2013. Association for Computational Linguistics (2013)
22. Naskovska, K., Lau, S., Korobkov, A.A., Haueisen, J., Haardt, M.: Coupled CP decomposition of simultaneous MEG-EEG signals for differentiating oscillators during photic driving. Front. Neurosci. 14 (2020)
23. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, Doha, Qatar, October 2014. Association for Computational Linguistics (2014)
24. Peters, M.E., et al.: Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237, New Orleans, Louisiana, June 2018. Association for Computational Linguistics (2018)
25. Radinsky, K., Agichtein, E., Gabrilovich, E., Markovitch, S.: A word at a time: computing word relatedness using temporal semantic analysis. In: Proceedings of the 20th International Conference on World Wide Web, pp. 337–346 (2011)
26. Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992, Hong Kong, China, November 2019. Association for Computational Linguistics (2019)
27. Smilde, A.K., Westerhuis, J.A., Boqué, R.: Multiway multiblock component and covariates regression models. J. Chemometrics 14 (2000)
28. Tai, K.S., Socher, R., Manning, C.D.: Improved semantic representations from tree-structured long short-term memory networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1556–1566, Beijing, China, July 2015. Association for Computational Linguistics (2015)
29. Yu, M., Dredze, M.: Learning composition models for phrase embeddings. Trans. Assoc. Comput. Linguist. 3, 227–242 (2015)
30. Zesch, T., Müller, C., Gurevych, I.: Using wiktionary for computing semantic relatedness. In: AAAI, vol. 8, pp. 861–866 (2008)
31. Zhao, Z., Liu, T., Li, S., Li, B., Du, X.: Ngram2vec: learning improved word representations from ngram co-occurrence statistics. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 244–253, Copenhagen, Denmark, September 2017. Association for Computational Linguistics (2017)
Credibility Analysis for Social Media Content Using Sentence Transformer Based Machine Learning

Sanjeev Roka and Danda B. Rawat (✉)

Department of Electrical Engineering and Computer Science, Howard University, Washington, DC 20059, USA
[email protected], [email protected]
Abstract. Lately, people have been impacted by the increased use of social media platforms like Twitter for disseminating news and information. However, much of this content has been shown to be false or unreliable, and manually evaluating its veracity takes a lot of time. As a result, research has been conducted to develop classifiers or credibility analyzers for fake content and disinformation using different machine learning and deep learning techniques. The majority of existing approaches rely on data that has been manually classified as authentic or fraudulent, while ignoring the semantic context of the content. Therefore, in this paper, we study a method that analyzes the semantic meaning of the content and uses content published on multiple legitimate news websites as our source of truth. The core architecture of our approach consists of six components that can be generalized to assess the credibility of text content published on social media. Specifically, we employ a fine-tuned sentence transformer model to extract the semantic meaning from texts, use web scraping to extract news similar to the tweets from multiple sources, and then compare them to assign a credibility score. Our results show that our approach is more effective than existing approaches since it immediately validates social media content against multiple reliable news media. In this paper, we use Twitter to evaluate our approach, but the approach can be generalized to other social media.
Keywords: Social Media Content Credibility · Tweets Credibility · Fine Tuning Sentence Transformer

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): SAI 2023, LNNS 711, pp. 768–784, 2023. https://doi.org/10.1007/978-3-031-37717-4_50

1 Introduction

With easy access to the internet and the growing use of social media such as Twitter, many people share news and information in the form of tweets. Such content tends to spread broadly, which leads many people to follow the information regardless of its validity. There are users with millions of followers, and each of their tweets has a high reach. Tweets can relate to different domains, such as events, world politics, sports, entertainment, and so on. On average, around 500 million tweets are posted on Twitter each day, and the numbers are only increasing [6]. People try to convey content through popular social networking sites. However, shared information is not always true. People can intentionally or unintentionally share untrue information that spreads just like true news. This is now a critical problem on such social media platforms. It is very important to know how credible a tweet is, along with the user who is posting it. This is one of the challenges for primary news-sharing social media platforms such as Twitter.

Several approaches have been proposed so far for assessing the credibility of tweets or detecting false information. Most of these systems depend entirely on building a machine learning classifier based on manual labeling of the tweets and do not consider the actual semantic meaning of the tweets. These approaches rely mostly on user-based features that have nothing to do with the tweet content itself. In this paper, we utilize the concept of semantic similarity with transformer models in Natural Language Processing (NLP) and use the news posted by multiple authentic news sites as the base credible news. By using transformer models to embed the content, we take the semantic meaning of the content, specifically tweets, into account. Also, we directly compare the tweets to authentic and credible news posted by verified news media. Hence, our approach is more effective in finding the credibility of tweets. Along with the tweets, we can also assess the credibility of the Twitter user, by taking the mean credibility score of the tweets fetched and assessed for that user. Moreover, this framework can be generalized to other social media platforms as well. However, in this paper, we focus specifically on Twitter. The major contributions of this paper are as follows:

1. We provide a generalized framework for the credibility analysis of news on social media that considers its semantic meaning and takes the news posted by multiple authentic news sites as the base truth, using social media content posted on Twitter.
2. We provide a fine-tuned sentence transformer model, trained on a Twitter-specific dataset, for finding semantic similarity.

We organize the rest of the paper as follows. In Sect. 2, we review the literature on analyzing the credibility of tweets and the fake news they carry. In Sect. 3, we discuss the problem that exists in present methods for tweet credibility analysis. Section 4 explains the proposed approach, which is further divided into six subsections. Section 5 highlights the results obtained by using the proposed approach. The challenges of the proposed approach and potential solutions are discussed in Sect. 6. Finally, we conclude the paper along with possible future directions in Sect. 7.
2 Related Work
There have been several previous research works on the analysis of the credibility of tweets. In this section, we discuss some of the major works related to our approach, including works that evaluate the credibility of event-based tweets. The survey by Majed et al. [3] shows that there are three major approaches to credibility assessment: the automated approach, the human-based approach, and the hybrid approach. Yang et al. [23] used Hurricane Harvey Twitter data and proposed an automated credibility framework based on the URLs and retweets for a given tweet. Gupta et al. [11] proposed a semi-supervised ranking model using SVM-rank to assess credibility, based on training data obtained from six high-impact crisis events. The authors consider a set of forty-five features categorized as tweet meta-data features (e.g. source of the tweet, whether the tweet contains geo-coordinates, etc.), tweet content features (e.g. number of characters, words, URLs, hashtags, smileys, etc.), user-based features (e.g. number of followers, time since joining Twitter, etc.), network features (e.g. number of retweets, mentions, etc.), linguistic features (e.g. presence of swear words, emotion words, etc.) and external resource features (e.g. WOT score for a URL, ratio of likes/dislikes for videos, etc.). To assign a credibility score, they used the ranking toolkits RankLib and SVM-rank. Majed et al. [2] proposed a real-time, web-based credibility assessment tool called Credfinder that calculates a credibility score for a tweet based on the content and the author. In another work, Majed et al. [4] proposed a hybrid approach that combines a reputation-based technique, a credibility technique, ranking features, and user expertise to obtain a trustworthiness value for a tweet.
To do so, the authors use text features such as the number of replies, number of retweets, hashtags, mentions and URLs, sentiment features, and user-level features such as age, gender, etc. In [18], Namihira et al. present two different measures to analyze the credibility of tweets at the topic level: the sentiment of opinions on the topic and the author's expertise. The tweet credibility calculator utilizes a tweet opinion classifier to identify the topic and opinion of the given tweet. Then, it performs a majority decision over contrary opinions on the same topic by using the tweet opinion database. A hybrid approach proposed by Tarek et al. [13] considers both the features of the users (e.g., created at, username, profile image, status counts, description, etc.) and their social graph (followers-following graph) to analyze the credibility of tweets. These features are fed into a binary machine learning classifier to make the prediction. Yudith et al. [7] propose an automated general framework, designed as a Google Chrome extension, that works in real time to analyze the credibility of tweets. The credibility model depends on two main components: the post's content and the author. The features extracted from these components are fed to the actual credibility model. Further, the model consists of three credibility measures: text credibility (analyzes the content syntactically), user credibility (analyzes only the user in isolation), and social credibility (analyzes followers and following). They extended their work in [15] by integrating the detection of the topic in the tweet and calculating a topic credibility measure by considering hashtags. For topic detection, the authors considered three techniques: clustering, matrix factorization, and probabilistic techniques. Krzysztof et al. [16] presented an automated credibility assessment built using a random forest classifier trained on a manually built dataset. The research focused on external link features, checking whether the link's content matches the tweet content or leads to an interactive ad instead. Other features were the following-to-follower ratio, the length of the tweet, whether the account was verified or not, etc. Hasan et al. [14] introduce a classification model based on supervised machine learning techniques and word-based N-gram analysis to automatically classify Twitter messages as credible or not credible. The authors used five different supervised classification techniques: Linear Support Vector Machines (LSVM), Logistic Regression (LR), Random Forests (RF), Naive Bayes (NB), and K-Nearest Neighbors (KNN). The research investigates two feature representations (TF and TF-IDF) and different word N-gram ranges, and concludes that the best performance is achieved using a combination of unigrams and bigrams, LSVM as the classifier, and TF-IDF as the feature extraction technique. The training is done for both English and Arabic tweets. One of the approaches most similar to ours was proposed by Al-Khalifa et al. [5], who developed a model to measure the credibility of Twitter messages and assign a credibility level (high, low, or moderate) to each tweet. The proposed approach is based on the similarity between Twitter messages and authorized news sources like Aljazeera.net. However, the system was built for Arabic texts only.
Almost every work done in the field mainly considers the user features and content features only at the surface level. However, they do not consider the actual semantic meaning of the tweet. Also, they do not consider the actual news based on which the tweet can be marked as credible. Hence, to overcome this, we propose our method in this paper. Table 1 compares our approach to existing approaches on the basis of what features they use, what they consider as the ground truth, and whether they consider semantic meaning or not.
3
Problem Statement
Twitter, more than a social networking site, has become more of a news and information-sharing platform around the globe. Millions of people and organizations utilize the platform to convey their thoughts and local or international news in the form of tweets. Especially, Twitter has been observed to be very effective during international events, global concerns, natural disasters, global pandemics, and so on. The truthfulness of the tweets becomes more critical in such sensitive cases. Hence, there must be a proper technique to find out how credible a tweet is given any circumstances. For this, there are several systems built as discussed in the previous section, that mainly consider the user-based features and contentbased features such as retweets, attached URLS, tweet lengths, punctuation, etc.
S. Roka and D. B. Rawat

Table 1. Comparison to Related Works

| Ref. # | Credibility Features | Ground Truth | Approach | Semantic |
|---|---|---|---|---|
| [2, 4, 11] | Number of chars, words, URLs, tweet sources, smileys, hashtags, retweets, etc. | Human Annotated Data | Semi-Supervised Ranking Model | No |
| [18] | Topic Level and author’s expertise | – | Decision Based on Opinions | No |
| [13, 14] | Username, profile image, status counts, followers-following graph etc. | Human Annotated Data | ML Classifier | No |
| [7] | Text syntactic features, author features and topic | – | Topic Detection with Dictionaries | No |
| [16] | External Link features, account verified etc. | Human Annotated Data | Random Forest Classifier | No |
| [5] | Similarity with verified content, POS tags, Inappropriate words | Arabic News Sites (Aljazeera and Saudi Press Agency) | Similarity Score | No |
| Our Approach | Semantic Similarity using verified News Content | Popular News Sites | Semantic Similarity with SBERT | Yes |

to assess the credibility of tweets. However, existing works do not consider the actual semantics of the tweets and lack the ground truth based on which the credibility is assessed. Moreover, many other systems require the manual labeling of the data that needs prior knowledge of the domain to classify the tweets. Thus, the automated credibility analysis of tweets along with the users, which considers the semantic aspect of the tweet based on the news published on some authentic news sites as the ground truth, is the primary aim of this research.
4
Proposed Approach
We propose an approach to assess the credibility of tweets based on what popular news sites report on the same subject. Rather than using user-based features and syntax-based features, we consider the semantic meaning of the
Social Media Content Credibility Using Machine Learning
content and what it conveys. For each tweet, credibility is calculated from its semantic features and from how close it is to what a few popular news sites say about the same context. Capturing the semantic meaning of the tweets and the news content is a crucial task. It is achieved by converting the texts into vectors that encode their semantic meaning. The process starts with fetching a user's tweets using the Twitter API; the tweets are then pre-processed. For each tweet, we extract key phrases and use them as queries to fetch news from some popular news sites. We again pre-process both the tweet and the news content fetched for that tweet. We then vectorize the pre-processed texts and use cosine similarity to calculate a semantic similarity score. Finally, we select the news item most similar to the given tweet and, based on it, assign a credibility score to the tweet.
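The flow described above can be sketched as a small orchestration function. This is only an illustrative outline under stated assumptions, not the authors' implementation: the name `assess_tweet` and all of its callable parameters (`extract_key_phrases`, `search_news`, `preprocess`, `embed`, `similarity`) are hypothetical placeholders for the stages detailed in the following subsections, and returning a score of 0 when no news is found is an assumption.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def assess_tweet(
    tweet: str,
    extract_key_phrases: Callable[[str], List[str]],
    search_news: Callable[[str], List[str]],
    preprocess: Callable[[str], str],
    embed: Callable[[str], Vector],
    similarity: Callable[[Vector, Vector], float],
) -> Tuple[float, Optional[str]]:
    """Return (credibility score in 0..100, most similar article or None)."""
    # Key phrases taken from the tweet are used as search queries.
    articles: List[str] = []
    for phrase in extract_key_phrases(tweet):
        articles.extend(search_news(phrase))
    if not articles:
        return 0.0, None  # assumption: no matching news means the lowest score

    # Tweet and articles go through the same pre-processing and embedding.
    tweet_vec = embed(preprocess(tweet))
    scored = [(similarity(tweet_vec, embed(preprocess(a))), a) for a in articles]

    # The closest article determines the score of the tweet.
    best_similarity, best_article = max(scored, key=lambda pair: pair[0])
    return 100.0 * best_similarity, best_article
```

Passing the stages in as callables keeps the outline independent of any particular key-phrase extractor, scraper, or embedding model.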
Fig. 1. Block Diagram for Proposed Approach
Figure 1 presents the block diagram of our proposed architecture for tweet credibility assessment. Our approach can be described in the following six main stages: Fetching Tweets, Key-Phrase Extraction, News Sites Scraping, Pre-processing, Text Embedding, and Score Assessment.
4.1
Fetching Tweets
To analyze the tweets, the first task is to fetch them. This is done using the Twitter API v2, which is provided by Twitter itself [1]. We signed up for Essential access to the Twitter API, which allowed us to fetch tweets as per our requirements. We use the third-party library Tweepy [21], which provides methods for calling the Twitter API easily. For our research, we fetched tweets for a particular user so that we can evaluate each tweet of that user, assign it a credibility score, and finally assess the credibility of the user as well.
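A minimal sketch of this step with Tweepy's API v2 client is given below. The helper names `fetch_user_tweets`, `make_client`, and `FetchedTweet` are hypothetical, and the choice to request only the `created_at` field is an assumption made because the publication date is needed later; the paper does not list the fields it requested.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, List, Optional


@dataclass
class FetchedTweet:
    tweet_id: int
    text: str
    created_at: Optional[datetime]


def fetch_user_tweets(client: Any, username: str, limit: int = 50) -> List[FetchedTweet]:
    """Return up to `limit` recent tweets of `username`.

    `client` is expected to behave like `tweepy.Client` (Twitter API v2).
    """
    user = client.get_user(username=username)
    response = client.get_users_tweets(
        user.data.id,
        max_results=max(5, min(limit, 100)),  # the endpoint accepts 5..100 per request
        tweet_fields=["created_at"],
    )
    tweets = response.data or []  # `data` is None when nothing is returned
    return [
        FetchedTweet(t.id, t.text, getattr(t, "created_at", None))
        for t in tweets[:limit]
    ]


def make_client(bearer_token: str) -> Any:
    """Build a real Tweepy client; imported lazily so the sketch loads without Tweepy."""
    import tweepy

    return tweepy.Client(bearer_token=bearer_token)
```

With valid credentials, `fetch_user_tweets(make_client(token), "ReutersWorld", limit=80)` would return the most recent tweets of that account in a single request; more than 100 tweets would need pagination, which is omitted here.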
4.2
Key-Phrase Extraction
As our approach requires searching for the news that is most similar to each tweet, it is necessary to feed proper queries to the news sites so that the returned news is relevant to the given tweet. For this task, we have a component called the key-phrase extractor. Its major function is to take a tweet and extract its most important keywords as phrases. The component is implemented using the KeyBERT module [10]. KeyBERT is a lightweight and simple keyword extraction technique that uses BERT [8] embeddings to find the key phrases most similar to the given text document, in our context a tweet. KeyBERT works as follows. First, the given tweet is embedded with BERT using the sentence-transformers package [20]. Then the N-gram words/phrases are extracted and embedded with the same technique. Finally, cosine similarity is used to find the N-gram words/phrases most similar to the given tweet. These phrases can be used as representatives of the tweet, instead of the whole tweet, for fetching news articles. While fetching the key phrases, the handling of stop words was crucial. When we excluded the general stop words from the N-grams, our manual evaluation showed that the resulting key phrases were quite blunt and had very little impact when used as queries to the news sites. We therefore defined a custom stop-word set and excluded only those words from our key phrases. In our approach, we extract three key phrases for each tweet. A parameter is also defined to diversify the extracted key phrases from each other.
4.3
News Sites Scraping
Our approach to assessing the credibility of tweets rests on the assumption that popular and well-known news sites publish credible news. For a given tweet, we use the key phrases extracted in Section 4.2 as queries to scrape news from news sites that we selected based on their popularity. Retrieving news items that are relevant to the corresponding tweet is important for evaluating its credibility. Based on popularity, we selected the following sites:
– American Broadcasting Company (ABC) News: (www.abcnews.com)
– Aljazeera: (www.aljazeera.com)
– Associated Press (AP) News: (www.apnews.com)
– British Broadcasting Corporation (BBC): (www.BBC.com/news)
– Cable News Network (CNN): (www.CNN.com)
– Reuters: (www.reuters.com)
Each news site is queried with the key phrases, and the returned news is collected. The relevance of a news article may also depend on the date on which it was published. Hence, we kept only the news published within 30 days before or after the day the tweet was published. For every news story,
we fetched its headline and its content, which are later compared with the corresponding tweet.
4.4
Pre-processing
Once tweets and their corresponding news articles have been fetched, we need to handle the noise present in the raw text documents. Tweets in particular are very short and unstructured and contain a lot of noise such as acronyms, emojis, symbols, URLs, and so on. News articles are less noisy and more structured than tweets. However, in our context, we passed both the tweets and the news articles through the same text pre-processing pipeline. The major tasks carried out in this step are lowercasing, stop-word removal, emoji-to-text conversion, removal of punctuation, etc. Moreover, some text contractions are expanded, such as replacing I’m with I am, we’re with we are, isn’t with is not, and so on. Some tweets contained usernames preceded by the @ symbol, so we removed the @ and kept only the username. Tweets also usually contain URLs, and we found that a URL contributes very little to the semantic meaning of the tweet, so we removed URLs. We also noticed that many tweets use hashtags to represent trends or information about the tweet itself. The hash symbol (#) can alter our vocabulary because hashtags are processed differently by our tokenizer, so we trimmed the hash symbol from the hashtags. In this way, we pre-processed both the tweets and the news content.
4.5
Text Embedding
Text embedding refers to the representation of text in the form of a real-valued vector that carries its semantics. One important property is that words with similar meanings, or words that are closer in context, are expected to have embeddings that differ little. Equivalently, two words that are close in the vector space are most likely similar in meaning. The major reason for embedding texts is to be able to analyze them once they are encoded as real-valued vectors. This step can be considered the most crucial part of our approach because we depend largely on the semantic meaning of two texts, a tweet and the corresponding news article, to assess the credibility of the tweet. There are different ways to embed texts into vectors. Earlier, statistical methods were used to vectorize text documents, such as Bag of Words, Term Frequency-Inverse Document Frequency, Latent Semantic Analysis, etc. With the rise of deep learning, the trend moved towards learning word embeddings that capture the meaning of the word, as in word2vec, GloVe, etc. However, such representations learn word meaning statically, neglecting the context in which the word is used. For example, the word bank can have different meanings and should then have different embeddings. Such issues can be addressed using Recurrent Neural Networks (RNNs), which embed words based on their context and are well suited to sequence data. However,
RNNs suffer from short-term memory, which is caused by the vanishing gradient problem. This issue is addressed by LSTMs and GRUs, whose gates learn which parts of the data are important to keep and which to discard. These models can nevertheless still suffer from vanishing and exploding gradients. The transformer [22] is a novel architecture that solves sequence-to-sequence tasks while handling long-term dependencies with ease. The transformer model relies entirely on self-attention to compute representations of its input and output, without using recurrent or convolutional neural networks. LSTMs process the data sequentially, which makes them very slow. Transformers, on the other hand, process sequential data in parallel, making the architecture faster. BERT, short for Bidirectional Encoder Representations from Transformers, provides state-of-the-art results on eleven NLP tasks and is pre-trained on unlabeled data with two unsupervised tasks, Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). It stacks a series of encoders to build a language model that can be further fine-tuned with just one additional output layer to perform a specific task [9]. However, to compare two sentences, BERT requires both to be fed into the network together, which causes massive computational overhead. Finding the most similar pair in a collection of 10,000 sentences takes about 65 h with BERT. Hence, for our proposed approach, we use Sentence-BERT (SBERT), which makes a slight modification to the pre-trained BERT model and uses a Siamese network structure to obtain semantically meaningful sentence embeddings [20].
4.5.1
Fine Tuned Sentence-BERT
SBERT is a modification of the pre-trained BERT network that uses Siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using simple similarity metrics such as cosine similarity.
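The cost quoted above for plain BERT follows from the number of sentence pairs that must each be passed through the network. The short calculation below takes the 10,000 sentences and the 65 h from the text and derives the rest; the per-pair time is therefore an implied average, not a measurement, and the function name is hypothetical.

```python
def pairwise_comparisons(n_sentences: int) -> int:
    """Number of unordered sentence pairs, n * (n - 1) / 2."""
    return n_sentences * (n_sentences - 1) // 2


pairs = pairwise_comparisons(10_000)
implied_ms_per_pair = 65 * 3600 / pairs * 1000  # 65 h spread over all pairs

print(pairs)                          # 49995000
print(round(implied_ms_per_pair, 2))  # 4.68
```

With sentence embeddings, each of the 10,000 sentences is encoded once and the pairs are compared with an inexpensive vector operation, which is the saving SBERT aims for.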
The main aim of SBERT is to reduce the computational overhead caused by repeatedly feeding sentences to the network while maintaining the accuracy of BERT [20]. SBERT adds a (mean) pooling layer to the output of the BERT model to produce a fixed-size embedding that semantically represents the input sentence. This embedding can then be used for different applications. In our approach, we use SBERT to find the semantic similarity between tweets and news articles. Many models are available that are pre-trained on huge datasets, mostly for the English language. In our scenario, however, we are dealing with tweets, which are quite unstructured and contain many more irregularities than the standard English on which those models are trained. We therefore propose to fine-tune the model using domain-specific data, in our case tweets, to find the semantic similarity between tweets and news articles.
Dataset Description: The dataset we used to fine-tune the model for semantic similarity between tweets and news articles is the Twitter URL dataset [17]. It consists of a training set collected between October and December 2016 and a test set collected in January 2017, both of which are subsets of the raw data with human annotation. Both files are in a format where each line contains:
Sentence1, Sentence2, (n,6) and URL, using tab as a delimiter. Each sentence pair is manually annotated by 6 workers, and the value (n,6) signifies that n out of 6 annotators consider the pair a paraphrase. In our approach, we decided to discard the pairs for which n is 3, because 3 of the annotators judged the pair to be a paraphrase and the other 3 judged it to be a non-paraphrase. To resolve this conflict, we simply exclude such pairs from our training dataset.
Base Model-TweetBERT: Training natural language models requires a huge amount of data, which makes the training very expensive. The hardware and other resources required to train certain models can cost up to a million. However, transfer learning allows us to take models that are pre-trained on enormous data and use them instead of training a whole new model from scratch. On top of that, these models can be fine-tuned for the desired task. We use TweetBERT in our approach. TweetBERT, published under the name BERTweet [19], is a large-scale pre-trained language model for English Tweets that comes with two domain-specific language representation models pre-trained on millions of tweets. Since we are dealing with the task of embedding tweets, we use TweetBERT’s base model, which is trained on around 5 million English tweets. We use TweetBERT-base as our base model and fine-tune it for calculating the semantic similarity between tweets and news articles. Fine-tuning is driven by our dataset, which is binary labeled: each sentence pair is labeled either 1 for paraphrase or 0 for non-paraphrase. Hence, we fine-tuned the model on the task of paraphrase detection. Since this task also relies on capturing the semantic similarity of two sentences and comparing them to decide whether they are paraphrases, it is very close to our task of finding semantic textual similarity.
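The filtering of the Twitter URL dataset described above can be sketched as follows. This is an illustrative reading of the file format, with assumptions: the vote field is taken to look like `(n, 6)`, pairs with a majority of paraphrase votes are labeled 1 and pairs with a minority are labeled 0, and the helper names `parse_line` and `load_pairs` are hypothetical.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Matches an annotation field such as "(4, 6)" or "(4,6)".
VOTE_PATTERN = re.compile(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def parse_line(line: str) -> Optional[Tuple[str, str, int]]:
    """Turn one tab-separated line into (sentence1, sentence2, label).

    Returns None for malformed lines and for tied votes such as (3, 6).
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        return None
    match = VOTE_PATTERN.search(fields[2])
    if match is None:
        return None
    yes_votes, total_votes = int(match.group(1)), int(match.group(2))
    if yes_votes * 2 == total_votes:
        return None  # annotators are split evenly, so the pair is dropped
    label = 1 if yes_votes * 2 > total_votes else 0  # assumption: majority vote
    return fields[0], fields[1], label


def load_pairs(lines: Iterable[str]) -> List[Tuple[str, str, int]]:
    """Keep only the lines that yield a usable labeled pair."""
    parsed = (parse_line(line) for line in lines)
    return [pair for pair in parsed if pair is not None]
```

The resulting triples have the binary labels that the contrastive loss and the evaluator discussed in this section expect.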
While using the base model, we use its tokenizer, vocabulary, and several other parameters.
Loss Function-Contrastive Loss: Since the training sentence pairs are labeled 1 or 0, we used contrastive loss. Contrastive loss expects as input a sentence pair and a label of either 0 or 1. If the label is 1, the distance between the embeddings of the two sentences is reduced, and if the label is 0, the distance between the embeddings is increased. The loss is given by Eq. 2, as defined in [12].

$$D_W = D_W(\vec{X_1}, \vec{X_2}) = \operatorname{cosine}(\vec{X_1}, \vec{X_2}) \tag{1}$$

$$L(W, Y, \vec{X_1}, \vec{X_2}) = (Y)\,\frac{1}{2}\,(D_W)^2 + (1 - Y)\,\frac{1}{2}\,\max\left(0,\; m - (D_W)^2\right) \tag{2}$$

where $\vec{X_1}$ and $\vec{X_2}$ are a pair of input vectors, $Y$ is a binary label assigned to the pair ($Y = 0$ if $\vec{X_1}$ and $\vec{X_2}$ are marked dissimilar, and $Y = 1$ if they are marked similar), $D_W$ is the parameterized distance function to be learned between $\vec{X_1}$ and $\vec{X_2}$, for which we take the cosine distance, and $m > 0$ is a margin that defines the minimum distance that must be maintained between dissimilar pairs [12].
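A plain-Python sketch of a contrastive loss on a single pair is given below, using the label convention of this section (1 for similar, 0 for dissimilar). Several points are assumptions rather than details from the paper: the cosine distance is computed as 1 minus the cosine similarity, the margin value 0.5 is only a placeholder because the paper does not report m, and the whole hinge term is squared as in Hadsell et al. [12]; if Eq. 2 is read with the square applied to the distance only, the second term would change accordingly.

```python
import math
from typing import Sequence


def cosine_distance(x1: Sequence[float], x2: Sequence[float]) -> float:
    """Cosine distance, taken here as 1 - cosine similarity (non-zero vectors)."""
    dot = sum(a * b for a, b in zip(x1, x2))
    norm = math.sqrt(sum(a * a for a in x1)) * math.sqrt(sum(b * b for b in x2))
    return 1.0 - dot / norm


def contrastive_loss(
    x1: Sequence[float], x2: Sequence[float], y: int, margin: float = 0.5
) -> float:
    """Loss for one embedding pair; y = 1 means similar, y = 0 dissimilar."""
    d = cosine_distance(x1, x2)
    similar_term = 0.5 * d ** 2  # pulls similar pairs together
    dissimilar_term = 0.5 * max(0.0, margin - d) ** 2  # pushes dissimilar pairs apart
    return y * similar_term + (1 - y) * dissimilar_term
```

A dissimilar pair contributes nothing once its distance exceeds the margin, so training effort concentrates on dissimilar pairs that are still too close.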
Fig. 2. SBERT Architecture
Evaluator: We use the Binary Classification Evaluator for our fine-tuned model. The evaluator evaluates a model based on the similarity of the embeddings by calculating the accuracy of identifying similar and dissimilar sentences. We use cosine similarity as our metric, and the evaluator returns the accuracy of the model. This evaluator requires the labels to be 0 for dissimilar pairs and 1 for similar pairs.
Final Model Architecture: The final model, fine-tuned on the Twitter URL dataset, consists of two main layers: the Transformer Layer and the Mean Pooling Layer. The architecture of the model is shown in Fig. 2 [20], and the parameters set for the model are shown in Table 2. The model performed well, with an accuracy of 81.58 when tested on the Twitter URL test dataset. In this paper, we use the name tweeturl-bertweet-base-3 for our fine-tuned model.

Table 2. Fine Tuned Model Parameters

| Parameter | Value |
|---|---|
| Transformer Model | RobertaModel |
| Max Sequence Length | 256 |
| Word Embedding Dimension | 768 |
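To make the evaluator's accuracy figure more concrete, the sketch below shows one simplified way such a number can be obtained from cosine similarities and binary labels: every observed similarity is tried as a decision threshold and the best resulting accuracy is reported. This is a stand-in written for illustration; the function name is hypothetical, and the evaluator actually used may choose its threshold differently and report further metrics.

```python
import math
from typing import Sequence, Tuple


def accuracy_at_best_threshold(
    similarities: Sequence[float], labels: Sequence[int]
) -> Tuple[float, float]:
    """Return (best accuracy, threshold) for the rule 'similar if score >= threshold'."""
    if not similarities or len(similarities) != len(labels):
        raise ValueError("need one label per similarity score")
    best_accuracy, best_threshold = -1.0, math.inf
    # math.inf covers the case where predicting 'dissimilar' for every pair is best.
    for threshold in sorted(set(similarities)) + [math.inf]:
        correct = sum(
            (score >= threshold) == (label == 1)
            for score, label in zip(similarities, labels)
        )
        accuracy = correct / len(labels)
        if accuracy > best_accuracy:
            best_accuracy, best_threshold = accuracy, threshold
    return best_accuracy, best_threshold
```

For example, `accuracy_at_best_threshold([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])` returns `(1.0, 0.8)`, because a threshold of 0.8 separates the two classes perfectly.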
4.5.2
Pre-trained Models
Along with the fine-tuned model, we also use a pre-trained transformer model, paraphrase-MiniLM-L6-v2, trained by sentence-transformers [20]. This model is designed for semantic search and maps sentences and paragraphs to a 384-dimensional dense vector space. We use it to see how differently our fine-tuned model performs on semantic similarity for tweets compared with a standard pre-trained model trained for generic tasks. The architecture of this pre-trained model is the same as the one shown in Fig. 2, but its parameters differ slightly from those of our fine-tuned model: it has a max sequence length of 128, and its pooling layer generates embeddings of dimension 384. The model is based on the BERT model and does not lowercase the input before embedding.
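The parameters mentioned for the pre-trained model can be checked directly on a loaded model. The sketch below assumes the sentence-transformers package is installed and that the model can be downloaded under the name `paraphrase-MiniLM-L6-v2`; the helper names `load_sentence_encoder` and `describe_encoder` are hypothetical.

```python
from typing import Any, Dict


def load_sentence_encoder(model_name: str = "paraphrase-MiniLM-L6-v2") -> Any:
    """Load a sentence-transformers model (weights are downloaded on first use)."""
    from sentence_transformers import SentenceTransformer  # imported lazily

    return SentenceTransformer(model_name)


def describe_encoder(model: Any) -> Dict[str, int]:
    """Report the two parameters compared in the text for a loaded model."""
    return {
        "max_seq_length": model.max_seq_length,
        "embedding_dimension": model.get_sentence_embedding_dimension(),
    }
```

Calling `describe_encoder(load_sentence_encoder())` should report the sequence length and embedding dimension stated above for the pre-trained model; the same helper could be pointed at a locally saved fine-tuned model for comparison.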
4.6
Score Assessment
Once the embeddings of the tweets and news articles are generated, the final task is to assess the credibility score of each tweet. For this, we propose a simple approach that uses cosine similarity to find the news article closest to the given tweet. The cosine similarity of two vectors is defined as their dot product divided by the product of their magnitudes:

$$\cos(u, v) = \frac{u \cdot v}{|u|\,|v|} \tag{3}$$
The calculated cosine score ranges from 0 to 1, and we normalize it to fall between 0 and 100. For each tweet, we calculate the cosine similarity between that tweet and every news article scraped for it, and we select the article with the highest similarity score. Our assumption is that if a tweet is credible, then some news media must have published similar news. In other words, we first capture the semantic meaning of the tweet and then find the news article that is most similar to it. If that article is very similar to the tweet, it has a high similarity score, and the tweet is considered more credible. If even the most similar article has a very low similarity score, no such news was posted by the selected news media; in this case, the tweet is considered less credible and is assigned a lower credibility score. For example, if the system assigns a score of 60 to a tweet, we consider it to be 60% credible. This assessment is performed for each embedding model discussed previously, and we present the results produced by each model.
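The scoring rule can be sketched in a few lines of plain Python. Two choices here are assumptions rather than details from the paper: negative cosine values, which are possible for arbitrary vectors, are clamped to 0 before scaling, and a tweet with no scraped articles receives a score of 0. The function names are hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product divided by the product of the magnitudes (Eq. 3)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is treated as unrelated
    return dot / (norm_u * norm_v)


def credibility_score(
    tweet_vec: Sequence[float], article_vecs: Sequence[Sequence[float]]
) -> Tuple[float, Optional[int]]:
    """Return (score in 0..100, index of the most similar article or None)."""
    if not article_vecs:
        return 0.0, None  # assumption: no news found means lowest credibility
    sims = [cosine_similarity(tweet_vec, article) for article in article_vecs]
    best_index = max(range(len(sims)), key=sims.__getitem__)
    clamped = min(1.0, max(0.0, sims[best_index]))
    return 100.0 * clamped, best_index
```

Returning the index of the best article alongside the score makes it possible to inspect which news item a given score is based on.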
5
Performance Evaluation and Discussion
We carried out the credibility analysis on several Twitter accounts. First, we analyzed the Twitter accounts of some of the news sites that we selected. We fetched around fifty to a hundred tweets for each selected account and fed them to our credibility analysis system. For these accounts, we expect the credibility score to be very high, since we depend on these sites to analyze the credibility of other tweets. The results are shown as scatter plots in Fig. 3, Fig. 4, Fig. 5 and Fig. 6, where the X-axis gives the identifier (index) of each tweet and the Y-axis gives the credibility score of the corresponding tweet in percent. Figure 3a shows the credibility scores for 80 tweets of @ReutersWorld assigned using the selected pre-trained model (paraphrase-MiniLM-L6-v2), and Fig. 3b shows the same with our fine-tuned model (tweeturl-bertweet-base-3). As expected, most of the tweets are assessed to be 100% or more than 80% credible, as there was a direct match between the tweets and the news that we scraped. In contrast, for around 75 tweets from Aljazeera’s official Twitter account (@AjEnglish), the average credibility score of the account is around 45% as evaluated by the pre-trained model (paraphrase-MiniLM-L6-v2) and 65% as evaluated
Fig. 3. Reuters News Credibility Score for Pre-Trained Model and our Model
Fig. 4. Aljazeera Credibility Score for Pre-Trained Model and our Model
by our fine-tuned model (tweeturl-bertweet-base-3), as shown in Fig. 4a and Fig. 4b respectively. When we analyzed the reason for this score difference between the two accounts (@ReutersWorld and @AjEnglish), we noticed that Aljazeera uses more images and videos in its tweets. Since our technique does not consider the image and video components of tweets, it becomes challenging for the system to discover news articles that exactly match them. As a result, it gives such tweets a lower credibility score, which lowers the account’s overall score. Furthermore, we can also notice the difference between the scores assigned by the pre-trained model (paraphrase-MiniLM-L6-v2) and our fine-tuned model (tweeturl-bertweet-base-3). This is because our fine-tuned model uses TweetBERT [19] as a base model, which is designed specifically for embedding tweets. On top of that, we used a Twitter-specific dataset (Twitter URL dataset [17])
to fine-tune the model for the task of paraphrase detection on tweets. Due to this, our fine-tuned model (tweeturl-bertweet-base-3) performs better on tweets than the selected pre-trained model (paraphrase-MiniLM-L6-v2), which is trained on standard English texts. We can see the same difference for the other accounts that we evaluated. Apart from these two, we analyzed a few other Twitter accounts, such as the official accounts of the Ministry of Foreign Affairs of Russia (@mfa_russia) (Fig. 5) and the Ministry of Defense of Ukraine (@DefenceU) (Fig. 6). For each of these accounts, we considered only the 50 most recent tweets, as our system takes a lot of time to analyze each tweet. The average credibility score for @mfa_russia is around 35% as evaluated by the paraphrase-MiniLM-L6-v2 model (Fig. 5a) and around 50% as evaluated by tweeturl-bertweet-base-3 (Fig. 5b). Similarly, for @DefenceU the credibility scores are around 40% and 60% as assigned by the paraphrase-MiniLM-L6-v2 model (Fig. 6a) and tweeturl-bertweet-base-3 (Fig. 6b) respectively. This means that, based on around 50 tweets analyzed by our system using the fine-tuned model, the Twitter account @mfa_russia is 50% credible whereas @DefenceU is 60% credible.
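The account-level figures quoted above are averages of per-tweet scores. A minimal sketch of that aggregation follows; the function name is hypothetical, the sample scores are made up purely for illustration, and returning `None` for an account without scored tweets is an assumption.

```python
from statistics import mean
from typing import Optional, Sequence


def account_credibility(tweet_scores: Sequence[float]) -> Optional[float]:
    """Average the per-tweet credibility scores (0..100) of one account."""
    if not tweet_scores:
        return None  # assumption: no tweets means no account-level score
    return mean(tweet_scores)


# Illustrative values only, not results from the paper.
print(account_credibility([40.0, 60.0, 50.0]))  # 50.0
```

Because a plain mean is used, a few tweets with no matching news (for instance image-only tweets) can noticeably lower the score of an account.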
Fig. 5. mfa_russia Credibility Score for Pre-Trained Model and our Model
To ensure the reliability of the results obtained, we carefully examined a few of the tweets manually. For each tweet, we took the output of our model: the credibility score and a link to the news source determined to be most similar to the tweet. We then visited the website and compared its news content with the content of the tweet. We conducted this analysis on a subset of the tweets. We observed that many tweets included a link that led to news articles with remarkably similar content, and our model gave such tweets a high credibility score. For certain tweets, however, the news at the retrieved URLs was not very similar. Our analysis revealed a few reasons for this. The foremost reason is the one
Fig. 6. DefenceU Credibility Score for Pre-Trained Model and our Model
we mentioned earlier, i.e., tweets that carry mostly image and video content rather than text. Our system ignores images and videos, as well as any external URLs, which prevents it from finding relevant news articles that match the tweet content. Another reason was that some tweets expressed personal opinions and statements rather than news-like content. For such tweets, our system could not scrape relevant news articles and therefore gave them a lower credibility score.
6
Challenges and Perspectives
The work in this paper analyzes tweets against credible real-world news, which appears more promising than relying on user-based and tweet-based features that can be altered under different scenarios. Using a transformer-based model to calculate the similarity also gives the advantage of considering the actual semantic meaning of the tweet. However, one of the major challenges of this system is that we need to rely on authentic and credible news sites as the ground truth. Moreover, we currently consider only six news sites. This can be addressed by adding more authentic news sites, with more advanced web scraping, to obtain more reliable news content. Another challenge is that our system does not handle tweets that contain only images, videos, or external URLs. A potential solution is to use suitable techniques to extract information from images, videos, or external URLs and then apply the same methodology to assess them.
7
Conclusion and Future Work
In this paper, we have proposed a method that is more appropriate and fairer for determining the credibility of tweets, as we have used the tweet’s actual semantics
and contrasted them with real-world news reported by reliable news sites. Additionally, embedding tweets using a fine-tuned transformer model provides a semantic overview of the tweets and improves the validity of the analysis over using only user-based and syntax-based characteristics. The study may be further expanded by extracting and analyzing the text from tweets that contain images and videos using image-based transformer models.
Acknowledgment. This work was supported in part by the DoD Center of Excellence in AI and Machine Learning (CoE-AIML) at Howard University under Contract W911NF-20-2-0277 with the U.S. Army Research Laboratory, NSF grant # 1828811 and VMware research gift fund. However, any opinion, finding, and conclusions or recommendations expressed in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the funding agencies.
References
1. Twitter API Documentation | Docs | Twitter Developer Platform
2. AlRubaian, M., Al-Qurishi, M., Al-Rakhami, M., Hassan, M.M., Alamri, A.: Credfinder: a real-time tweets credibility assessing system. In: 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp. 1406–1409 (2016)
3. Alrubaian, M., Al-Qurishi, M., Alamri, A., Al-Rakhami, M., Hassan, M.M., Fortino, G.: Credibility in online social networks: a survey. IEEE Access 7, 2828–2855 (2019)
4. Alrubaian, M., Al-Qurishi, M., Hassan, M.M., Alamri, A.: A credibility analysis system for assessing information on Twitter. IEEE Trans. Depend. Secure Comput. 15(4), 661–674 (2018)
5. Al-Khalifa, H.S., Al-Eidan, R.M.: An experimental system for measuring the credibility of news content in Twitter. Int. J. Web Inf. Syst. 7(2), 130–151 (2011). Publisher: Emerald Group Publishing Limited
6. Aslam, S.: Twitter by the numbers: stats, demographics & fun facts (2018). http://Omnicoreagency.com
7. Cardinale, Y., Dongo, I., Robayo, G., Cabeza, D., Aguilera, A., Medina, S.: T-creo: a Twitter credibility analysis framework. IEEE Access 9, 32498–32516 (2021)
8. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding (2018)
9. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding, May 2019. arXiv:1810.04805 [cs]
10. Grootendorst, M.: KeyBERT: minimal keyword extraction with BERT (2020)
11. Gupta, A., Kumaraguru, P., Castillo, C., Meier, P.: TweetCred: real-time credibility assessment of content on Twitter. In: TweetCred: Real-Time Credibility Assessment of Content on Twitter, pp. 228–243, November 2014
12. Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), vol. 2, pp. 1735–1742 (2006)
13. Hamdi, T., Slimi, H., Bounhas, I., Slimani, Y.: A Hybrid Approach for Fake News Detection in Twitter Based on User Features and Graph Embedding, pp. 266–280. Distributed Computing and Internet Technology, December 2019
14. Hassan, N., Gomaa, W., Khoriba, G., Haggag, M.: Credibility detection in twitter using word n-gram analysis and supervised machine learning techniques. Int. J. Intell. Eng. Syst. 13, 291–300 (2020)
15. Hernandez-Mendoza, M., Aguilera, A., Dongo, I., Cornejo-Lupa, J., Cardinale, Y.: Credibility analysis on twitter considering topic detection. Appl. Sci. 12, 9081 (2022)
16. Krzysztof, L., Jacek, S.-W., Jankowski-Lorek, M., Amit, G.: Automated credibility assessment on Twitter. Comput. Sci. 16, 157 (2015)
17. Lan, W., Qiu, S., He, H., Xu, W.: A continuously growing dataset of sentential paraphrases. In: Proceedings of the 2017 Conference on Empirical Methods on Natural Language Processing (EMNLP), pp. 1235–1245. Association for Computational Linguistics (2017)
18. Namihira, Y., Segawa, N., Ikegami, Y., Kawai, K., Kawabe, T., Tsuruta, S.: High precision credibility analysis of information on twitter. In: 2013 International Conference on Signal-Image Technology and Internet-Based Systems, pp. 909–915 (2013)
19. Nguyen, D.Q., Vu, T., Nguyen, A.T.: BERTweet: a pre-trained language model for English tweets. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 9–14, Online, October 2020. Association for Computational Linguistics (2020)
20. Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, November 2019
21. Roesslein, J.: Tweepy: Twitter for Python! (2020). https://github.com/tweepy/tweepy
22. Vaswani, A., et al.: Attention Is All You Need, December 2017. arXiv:1706.03762 [cs]
23. Yang, J., Yu, M., Qin, H., Lu, M., Yang, C.: A Twitter data credibility framework - hurricane harvey as a use case. ISPRS Int. J. Geo-Inf. 8(3), 111 (2019). Number: 3 Publisher: Multidisciplinary Digital Publishing Institute
Cryptocurrency Valuation: An Explainable AI Approach
Yulin Liu² and Luyao Zhang¹
¹ Data Science Research Center and Social Science Division, Duke Kunshan University, Suzhou 215316, Jiangsu, China
² SciEcon CIC, London WC2H 9JQ, UK
Abstract. Currently, there are no convincing proxies for the fundamentals of cryptocurrency assets. We propose a new market-to-fundamental ratio, the price-to-utility (PU) ratio, utilizing unique blockchain accounting methods. We then proxy various existing fundamental-to-market ratios using Bitcoin historical data and find that they have little predictive power for short-term bitcoin returns. However, the PU ratio predicts long-term bitcoin returns more effectively than alternative methods. Furthermore, we verify the explainability of the PU ratio using machine learning. Finally, we present an automated trading strategy advised by the PU ratio that outperforms the conventional buy-and-hold and market-timing strategies. Our research contributes to explainable AI in finance in three ways: first, our market-to-fundamental ratio is based on classic monetary theory and the unique UTXO model of Bitcoin accounting rather than being ad hoc; second, the empirical evidence supports the buy-low and sell-high implications of the ratio; finally, we distribute the trading algorithms as open-source software via the Python Package Index for future research, which is exceptional in finance research.
Keywords: Asset Valuation · Machine Learning · Bitcoin · Cryptocurrency · Market Timing · Automated Trading · Explainable AI · Store of Value · UTXO · Python · Open Source
1 Introduction
The market cap of cryptocurrency or crypto tokens1 has steadily increased from nearly zero to more than $1 trillion2 in the last decade [38–41]. In addition, institutional and private investors are adding cryptocurrency to their portfolios, and retailers are increasingly accepting cryptocurrency for payment. As most cryptocurrencies have very volatile prices, it is worth investigating the fundamentals of a cryptocurrency to determine an indicator that reflects its market valuation relative to its fundamentals; that is, is the cryptocurrency overvalued or undervalued? Currently, the existing literature is far from identifying a reliable proxy for token fundamentals.

Drawing on classic monetary theory, we first propose a new market-to-fundamental ratio, which we call the price-to-utility (PU) ratio, utilizing factors specific to cryptocurrency markets. As cryptocurrency is one type of digital currency, we refer to canonical monetary economics as a theoretical foundation to pin down the fundamentals of cryptocurrencies. More specifically, we construct fundamental proxies considering three major currency functions: medium of exchange, store of value, and unit of account. Innovatively, we proxy store of value by staking ratios, which are available thanks to the unique unspent transaction output (UTXO) accounting on the Bitcoin blockchain [50].

Then, we construct proxies for a variety of fundamental-to-market ratios using bitcoin (BTC) historical data. We find that, unlike in the stock market, fundamental-to-market ratios have little predictive power for short-term BTC returns, consistent with existing literature. However, the PU ratio, together with several other ratios that proxy fundamentals by user adoption [17], predicts long-term BTC returns more effectively than alternative methods. An effective valuation ratio can thus inform investment strategies: buy (sell) when the asset is undervalued (overvalued). Theoretically, when the market-to-fundamentals ratio is high (low), the asset is overvalued (undervalued).

1 In this article, these two terms are synonyms. The term “token” has a few different meanings in general use, which are beyond the scope of this paper. See Cong and Xiao (2021) [20] for token categories.
2 Data source: https://coinmarketcap.com.
The authors are listed in alphabetical order by last name, and both authors are joint first and corresponding authors.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): SAI 2023, LNNS 711, pp. 785–807, 2023. https://doi.org/10.1007/978-3-031-37717-4_51
We use machine learning to verify that the PU ratio has explanatory power for future returns. We find that buying low and selling high based on PU ratio valuation is associated with higher investment returns than other strategies. Finally, we propose an automated trading strategy guided by the PU ratio and find that it outperforms conventional buy-and-hold and market-timing strategies. Our research contributes to explainable AI in finance in three ways: first, our market-to-fundamental ratio is grounded in classic monetary theory and the unique UTXO model of Bitcoin accounting rather than being ad hoc; second, the empirical evidence supports the buy-low and sell-high implications of the ratio; finally, we distribute the trading algorithms as open-source software via the Python Package Index for future research, which is exceptional in finance research.

We organize the rest of the paper as follows. Subsections 1.1 and 1.2 introduce the related literature and the data basics, respectively. Section 2 introduces the methods of cryptocurrency valuation and compares the predictive performance of alternative methods. Section 3 further evaluates the explainability of the PU ratio using machine learning methods. Section 4 demonstrates and compares the automated trading strategies. Section 5 concludes and discusses.
Fig. 1. BTC Return Distributions. Notes: This figure plots the distributions of daily, weekly, 30-day, 90-day, 180-day, and 360-day BTC returns from July 13, 2011, to December 31, 2020 (3,460 days in total).
1.1 Related Literature
Our research contributes to three lines of literature: cryptocurrency valuation, machine learning for economics, and explainable AI in finance.

Cryptocurrency Valuation. Our research adds to the literature on cryptocurrency valuations. Some of the literature provides partial evidence for the validity of our indicators by relating BTC prices to a proxy for one of the monetary functions. For instance, Biais et al. (2018) [10] provide an overlapping generation equilibrium model and emphasize BTC’s role as a medium of exchange. By using a dynamic asset pricing model, Cong, Li, and Wang (2021) [17] derive the value of platform tokens by aggregating heterogeneous users’ demand for cryptocurrency as a medium of exchange. Alabi (2017, 2020) [2,3] and Wheatley et al. (2019) [67] use Metcalfe’s law [55] to measure the fundamental value of Bitcoin. Metcalfe’s law stipulates that the value of a network is proportional to the square of the number of its active users. While this law captures the value of a given token as a medium of exchange by measuring the number of its active users, it ignores the relatively inactive users who hold cryptocurrencies as a store of value. Athey et al. (2016) [9] document how trading volume, another proxy for the medium of exchange, affects the token price. Their results also provide evidence that BTC is mostly used as a store of value, which underscores the necessity of considering the store of value in BTC valuations. Additionally, current literature has found other predictors of BTC prices, such as attractiveness and
popularity on social media,3 which are not directly related to the fundamentals considered in our research.4

Machine Learning for Economics. Our research also contributes to the emerging literature that applies machine learning to economics in a variety of fields to improve prediction and inform decisions (e.g., Athey 2017, 2019 [5,6]; Athey and Imbens 2017, 2019 [7,8]; Mullainathan and Spiess 2017 [57]). Machine learning techniques are shown to constitute a powerful analytical tool for big data and are effective in modeling complex relationships (see Varian 2014 [66]). Among these techniques, the most relevant to our research are those that can be applied to return predictions in the stock market. For instance, Gu, Kelly, and Xiu (2020) [35] report evidence of significant returns to investments using machine learning predictions that in some cases offer twice the performance of leading regression-based strategies. Gu, Kelly, and Xiu (2021) [36] developed an autoencoder asset pricing model that first applies advanced unsupervised machine learning methods in finance and delivers much lower out-of-sample pricing errors than traditional factor models. Our research differs from the preceding studies by applying machine learning to the valuation of cryptocurrency; that is, the previous research uses machine learning to advance asset pricing in the stock markets, while ours uses machine learning to pioneer asset pricing in the cryptocurrency market. Moreover, existing literature in general uses supervised machine learning methods in asset pricing, while ours shows how unsupervised machine learning can also inform economics.

Explainable AI in Finance. Furthermore, by integrating economic theory into machine learning, our research joins the emerging literature in explainable artificial intelligence (XAI) (e.g., Gunning 2019 [37]; Adadi and Berrada 2018 [1]).5 The XAI literature critiques the application of machine learning to prediction and decision-making as a black box that human experts cannot understand and that stakeholders do not perceive as trustworthy. Our study provides three solutions. First, we check the consistency between the economic intuition and the algorithm results. For instance, we verify that the function of our simple indicator to inform a “buy-low and sell-high” strategy is, in theory, aligned with the unsupervised machine learning analysis of the BTC data. Second, using one supervised machine learning method with high explainability, we show that the simple indicator alone is efficient in predicting bull markets. Third, we propose an efficient automated trading strategy with high interpretability based on the simple indicator alone for further applications of automation in the industry.

In the spirit of explainable AI in finance, our study also contributes to the finance literature comparing the passive investment strategy of buy-and-hold6 with the active investment strategy of market-timing7 (e.g., Malkiel 2003 [53]; Shilling 1992 [65]; Sharpe 1975 [62]). Existing literature shows that a long-term buy-and-hold strategy tends to outperform the market-timing strategy. Since much of the market’s greatest returns or declines are concentrated in a short time frame, the market-timing strategy often makes use of high-frequency trading technology with tens of thousands of trades per second. However, trading directly on the Bitcoin network is very slow. Depending on the network congestion, the finality time usually lasts one hour or longer. Trading on exchanges involves high trading fees (e.g., the largest crypto exchange, Binance, charges a 0.1% trading fee) and third-party risk (e.g., deposit hacking, exchange bankruptcy, and internal manipulation by insiders).

3 In exploring the impact of social media on bitcoin price, Mai et al. (2018) [52] and Georgoula et al. (2015) [31] find that bullish forum sentiment is positively correlated with the price of bitcoin futures, while Polasik et al. (2015) [59] find in an empirical study of Bitcoin’s payment and investment functions that bitcoin price is mainly driven by its popularity on social media. Ciaian, Rajcaniova, and Kancs (2015) [16] find that the main drivers of the bitcoin price are the market forces of the bitcoin supply and demand and the attractiveness of Bitcoin for investors and users. Liu and Tsyvinski (2021) [50] show that high investor attention, as indicated by factors such as increased Google searches, predicts high future returns over 1- to 2-week horizons for BTC. Borri (2019) [12] finds that cryptocurrencies are highly correlated with each other, but poorly correlated with other global assets, including gold. Blau (2017) [11] identifies that speculative trading does not contribute to the unprecedented rise and subsequent crash in BTC prices.
4 Fantazzini et al. (2016) [25], Fry and Cheah (2016) [27], Corbet, Lucey, and Yarovaya (2018) [22], Fry (2018) [26], and references therein used methods developed for stock markets to test bubbles in the cryptocurrency market. The fundamentals used are a function of historical prices, which differs from our monetary theory approach.
Our automated trading strategy involves as few trades as possible while achieving higher returns than the conventional buy-and-hold strategy. Moreover, other underlying disadvantages of market-timing strategies, such as maintenance, labor cost, and the opportunity cost of daily attention, are also less of a concern in our trading strategy. To elaborate further, since our approach is an automated trading strategy, it can be delegated to decentralized exchanges (DEXs).8 A DEX is publicly verifiable and transparent computer code that executes orders automatically when conditions are satisfied. Therefore, users could connect their wallets to a DEX and set their trading strategies in advance.

5 Cong et al. (2020, 2021) [18, 19] propose an explainable AI approach for portfolio management in finance.
6 Buy-and-hold is an investment strategy in which the investor may use an active strategy to select securities or funds but then locks them in to hold them for the long term.
7 Market-timing is an investment strategy in which a market participant attempts to beat the market by predicting its movements and buying and selling accordingly.
8 A DEX is a smart contract. Smart contracts are computer codes that can be automatically executed when certain conditions are fulfilled. A smart contract does not need a centralized entity to maintain its functions. Once deployed, it can run timely and objectively, as long as the underlying platform, such as Ethereum, is secure and decentralized.
Table 1. Summary Statistics

Panel A: Summary statistics by various frequencies

                   Mean      SD         t-Stat   Sharpe   Skewness   Kurtosis   %>0
daily return       0.33%     4.71%      4.21     0.07     −0.21%     13.33      49.23
weekly return      2.36%     13.20%     7.72     0.18     1.12%      7.29       51.46
30-days return     11.83%    43.06%     11.50    0.27     4.16%      28.81      52.35
90-days return     51.75%    139.30%    15.55    0.37     4.19%      23.87      56.54
180-days return    139.43%   326.14%    17.88    0.43     5.83%      50.87      60.61
360-days return    808.55%   2485.32%   13.57    0.33     5.99%      41.54      71.49

Panel B: Extreme events of daily returns

Disasters   …        …        …        30%
Counts      309      96       16       4
%           8.93%    2.77%    0.46%    0.12%

Notes: This table documents the summary statistics of the BTC returns. Panel A reports the daily, weekly, 30-day, 90-day, 180-day, and 360-day summary statistics of BTC returns. The mean, standard deviation, t-statistics (Newey-West adjusted with n − 1 lags), Sharpe ratio, skewness, kurtosis, and the percentage of observations that are positive are reported. Panel B reports the percentage of extreme events based on the daily BTC returns. The Bitcoin returns are from July 13, 2011, to December 31, 2020 (3,460 days in total).
1.2 The Data and Basic Characteristics
Bitcoin is the first cryptocurrency9 and has the largest market share.10 Thus, in this article, we construct and evaluate valuation proxies using bitcoin historical data. The data are sampled daily and downloaded from three open-source platforms: CoinMetrics (CM),11 CoinMarketCap (CMC),12 and Google BigQuery.13 We include the historical data on CM from January 3, 2009, to December 31, 2020, and on CMC from April 29, 2013, to December 31, 2020, since April 29, 2013, is the first date on which CMC recorded off-chain BTC transactions. The data from BigQuery are from January 3, 2009, to August 31, 2020. The total volume used in this research is the sum of the on-chain exchange volume collected from CM and the off-chain exchange volume from CMC.

9 Bitcoin was launched on January 3rd, 2009.
10 Since its birth, Bitcoin’s market share is mostly more than half of the total cryptocurrency market cap. See https://coinmarketcap.com.
11 https://coinmetrics.io/data-downloads-2.
12 https://coinmarketcap.com/currencies/bitcoin/historical-data.
13 https://www.kaggle.com/bigquery/bitcoin-blockchain.
We now document the main statistical properties of the time series of BTC returns. Figure 1 shows the return distributions of BTC at daily, weekly, 30-day, 90-day, 180-day, and 360-day frequencies. Table 1 reports statistics of the BTC returns comparable to Table 1 in Liu and Tsyvinski (2021) [49]. Panel A shows the mean, standard deviation, Sharpe ratio, skewness, and kurtosis of BTC returns at various frequencies, together with the percentage of positive returns. The skewness is positive at all frequencies except for daily returns. Panel B shows that the BTC market is highly volatile: the probability of a 5% drop in a day is more than 7%, and the probability of a 10% daily loss is more than 2%. However, the probability of a positive movement is larger than that of a negative movement of the same size.14
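The return statistics described above can be illustrated with a minimal sketch. It assumes simple (not log) N-day returns and takes the Sharpe ratio to be the mean divided by the standard deviation, which agrees with the values in Table 1 but is not stated explicitly in the text. The function names and the toy price path are hypothetical, and skewness, kurtosis, and the Newey-West adjustment are left out.

```python
import statistics


def n_day_returns(prices, n):
    """Simple n-day returns: P[t+n] / P[t] - 1 for every valid start day t."""
    return [prices[t + n] / prices[t] - 1.0 for t in range(len(prices) - n)]


def summarize(returns):
    """Mean, sample SD, Sharpe ratio (assumed here to be mean / SD, with no
    risk-free adjustment), and the percentage of strictly positive returns."""
    mean = statistics.fmean(returns)
    sd = statistics.stdev(returns)
    return {
        "mean": mean,
        "sd": sd,
        "sharpe": mean / sd,
        "pct_positive": 100.0 * sum(r > 0 for r in returns) / len(returns),
    }


def tail_share(returns, threshold):
    """Percentage of returns at or below a negative threshold such as -0.05."""
    return 100.0 * sum(r <= threshold for r in returns) / len(returns)


# Hypothetical toy price path, not BTC data.
prices = [100.0, 110.0, 99.0, 108.9, 98.01, 107.811]
daily = n_day_returns(prices, 1)
stats = summarize(daily)
disaster_share = tail_share(daily, -0.05)
```

Applying `n_day_returns` with `n` set to 7, 30, 90, 180, or 360 gives the longer-horizon series, and `tail_share` corresponds to the kind of count reported in Panel B.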
2 Cryptocurrency Valuation Ratios
In Sect. 2.1 we present the existing cryptocurrency valuation ratios in both academic literature and industry practice. Section 2.2 introduces the PU ratio and its implications. Section 2.3 assesses the predictive power of various fundamental-to-market ratios on future BTC returns.

2.1 Industry Cryptocurrency Valuation Ratios
An optimal valuation ratio should reflect the position of the market valuation of the cryptocurrency relative to its fundamentals. When the fundamental-to-market value is high (low), the asset is undervalued (overvalued). The market valuation of a cryptocurrency can be precisely measured by its market cap. However, the fundamentals are difficult to determine. Liu and Tsyvinski (2021) [49] construct proxies for the fundamental-to-market ratio in the cryptocurrency market. Among these proxies, the (negative) past 100-week cumulative return (Past100) is based on a strong correlation between the fundamental-to-market value and the negative of the cumulative past returns [23,24,56]. The other set of measures, including the user-to-market ratio (UMR), the address-to-market ratio (AMR), the transaction-to-market ratio (TMR), and the payment-to-market ratio (PMR), are inspired by the dynamic cryptocurrency asset pricing model developed in Cong, Li, and Wang (2021) [17] and proxy fundamentals by user adoption. Below, we review other cryptocurrency valuation ratios adopted in industry practice, including the price-to-earnings (PE) ratio, the network-value-to-transactions (NVT) ratio, and the price to Metcalfe’s (PM) ratio, and discuss their pros and cons.15

14 Readers can refer to the working paper version on arXiv for an Appendix: https://arxiv.org/abs/2201.12893. The data and code are available on GitHub: https://github.com/SciEcon/CV XAI.
15 Note that those are market-to-fundamental ratios and the inverse of those are fundamental-to-market ratios.
Price-to-Earnings (PE) Ratio. Earnings per share is a widely used proxy for the fundamentals of a company’s stock among financial practitioners. Dividing the stock price by its earnings per share yields the well-known PE ratio (Shiller 2000) [64]. Usually, a high PE ratio indicates overvaluation of the stock, and a low ratio indicates undervaluation. Similarly, one could use miners’ earnings as the proxy for token fundamentals. Miners’ earnings consist of the block rewards and the transaction fees paid by users. However, using transaction fees and block rewards to proxy for fundamentals is misleading because these fees are earned by block makers (i.e., miners) and not by token holders. Furthermore, using transaction fees could lead to ludicrous results. For example, a low PE ratio resulting from a hike in transaction fees implies an undervaluation of the token. Intuitively, however, high transaction fees usually impede broader adoption of cryptocurrencies and thus indicate an overvaluation of tokens instead. Figure 2 shows that the BTC PE ratio informs neither BTC valuation nor price movement. The BTC price moves up and down, while the PE ratio increases quasi-linearly with three abrupt jumps. The PE ratio is therefore not a good indicator for assessing BTC valuation.16

Fig. 2. The BTC P/E Ratio and Price in USD

Network-Value-to-Transactions (NVT) Ratio. The NVT ratio was introduced by Willy Woo, an industry pioneer of on-chain analysis.17 It is defined as the ratio of the market value to the token transaction volume in USD over the past 24 h. That is, the fundamental value of a token derives from how frequently it is used for transactions (i.e., from its function as a medium of exchange). A high NVT ratio indicates that the market cap of the token outpaces the transacted value on its blockchain, indicating speculation. A low NVT ratio caused by an increasing transaction volume indicates a relatively high usage of the token, which, in turn, serves as a signal to buy the undervalued token. The limitation of the NVT ratio lies in the assumption that the fundamental value of a cryptocurrency derives only from its function as a medium of exchange. The usage of cryptocurrencies as a store of value is completely overlooked in this model. To illustrate the point, an increase in the NVT ratio due to a reduction in transaction volume does not necessarily imply an overvaluation of the token. It could also be the result of an increasing number of long-term holders hoarding more tokens, thus causing a decline in transaction volume. In other words, the NVT ratio offers a glimpse of what is happening, but it is not completely informative regarding buying and selling decisions. This explains why the NVT ratio only captures short-term noise and is not indicative of the general pattern of BTC price movement (Fig. 3).

Fig. 3. The BTC Price and NVT Ratio in USD

Price to Metcalfe’s (PM) Ratio. In the 1980s, Robert Metcalfe, the inventor of Ethernet, proposed that the value of a network is proportional to the square of the number of its users [61]. The underlying logic of this proposal is that the number of connections in a network with $n$ nodes is $n(n-1)/2$, which is asymptotically proportional to $n^2$. Since its inception, Metcalfe’s law has become an influential tool for studying network effects. Figure 4 presents the PM ratio, counting the number of active addresses as $n$. Compared with the PE ratio and the NVT ratio, the PM ratio has better potential as a valuation indicator for cryptocurrencies. However, it treats users holding different amounts of tokens identically and overlooks the inactive users who hoard cryptocurrencies as a store of value.

Fig. 4. BTC Price in USD and PM Ratio

16 BTC supply increases steadily while the block rewards are halved every four years, which explains the quasi-linear growth of the BTC P/E ratio and the three jumps on the halving dates.
17 https://academy.glassnode.com/indicators/nvt/nvt-ratio.

2.2 Price-to-Utility Ratio
A comprehensive assessment of the fundamentals of a cryptocurrency should incorporate all the utilities it provides as a currency, namely, medium of exchange, store of value,18 and unit of account. In this section, we propose the price-to-utility (PU) ratio that consists of proxies for all three utilities. We define Token Utility (TU) and its proxies as in Eq. 1:

$$\text{Token utility} = \frac{\text{token velocity} \times \text{staking ratio}}{\text{price volatility} \times \text{dilution rate}} \tag{1}$$
Figure 5 presents the mapping between proxies and BTC utilities. Token velocity serves as the proxy for the medium of exchange. It measures the percentage of tokens transacted over the past 24 h relative to the total token supply. A large token velocity signifies frequent usage of and demand for the token. The staking ratio is a proxy for the store of value, and it is defined as the percentage of tokens that are older than one year. As shown in Fig. 6, a substantial amount of BTC tokens have been inactive for more than a year. These tokens serve as a store of value and can be referred to as “staked tokens.” A high staking ratio implies that more users are placing long-term faith in the Bitcoin system. Figure 6 shows that the BTC staking ratio has been steadily increasing with periodic booms and busts.

Fig. 5. The Three Utilities of Currencies and Their Proxies

Fig. 6. The Percentage of Staked BTC at 1, 2, 3, 4, 5, and 10 Years

In a nutshell, token velocity represents the utility of the token in its role as a medium of exchange that is demanded by daily active users, while the staking ratio represents the utility of the token as a store of value demanded by long-term holders. The inverse of the dilution rate is another proxy for the store of value. The dilution rate measures the annual growth rate of the token supply. New BTC comes from the block rewards. Ceteris paribus, a high dilution rate makes the token less attractive as a store of value, thus leading to a lower token utility.19 The inverse of price volatility20 is the proxy for the unit of account. Note that due to the high volatility of the token price, most cryptocurrencies, except for stablecoins (see Senner and Sornette 2018) or coins with high stability [32],21 are limited in their utility as a unit of account. For simplicity, we leave price volatility out of the following illustration of the PU ratio; however, we include it in the robustness check in the machine learning section.

Dividing the BTC price by its utility yields the PU ratio of BTC (Fig. 7). Bitcoin’s first halving event in November 2012 broke the BTC supply-demand equilibrium and led to a price hike from around $10 to $100 in the first half of 2013. The BTC price soared from around $100 in September 2013 to $1,100 within a couple of months. Such a sharp price spike lured the long-term holders into selling their staked BTC (see the drop in the BTC staking ratio in Fig. 6). Correspondingly, the BTC token utility also plunged. The price hysteria, together with the significant drop in the token utility, drove the PU ratio to unsustainably high levels at the end of 2013, indicating the overvaluation of BTC. Figure 7 shows that the BTC price experienced a whole-year drop from above $1,000 to around $200 in 2014. From the beginning of 2015 to mid-2016, the BTC price steadily increased from $200 to around $500 along with a firm growth of the BTC token utility, rendering a relatively stable PU ratio in the undervaluation range. The BTC price hike in this period can be justified by the increase in its fundamentals: a moderate increase in BTC velocity and staking ratio.

18 In this article, “store of value” refers to the function of a token to keep or increase its purchasing power over time.
After the second halving event in July 2016, the BTC dilution rate dropped to around 4%. The formation of the new BTC supply-demand equilibrium led to the instability of the PU ratio between mid-2016 and mid-2017. The BTC price reached a historical high of $3,000 in June 2017. Long-term holders started to sell their BTC, leading to a drop in the BTC staking ratio in the second half of 2017. Meanwhile, the BTC velocity experienced a moderate drop. These two factors led to the drop in token utility. However, the BTC price rapidly grew to $20,000. The drop in BTC token utility and the skyrocketing BTC price resulted in a sharp spike in the Bitcoin PU ratio at the end of 2017. It took a whole year for the price to consolidate to around $6,000 in 2018. The price even plunged to $3,100 in early 2019, a period dubbed the “crypto winter.” Since March 2019, token utility has increased as a result of the rise in the token velocity and the staking ratio, along with the decrease in the dilution rate. Increasing demand boosted the BTC price to around $15,000 in the summer of 2019, which signifies the end of the bearish crypto winter. The BTC price hovered around $10,000 over the next year and achieved a new all-time high of $30,000 at the end of 2020. The PU ratio once again poked into the yellow zone, which indicates a slight overvaluation of the BTC price.

Fig. 7. The BTC Price in USD (Blue Line, Left Axis) and PU ratio (Green Line, Right Axis). Note: Red zone (PU > 100) indicates overvaluation, yellow zone (60 < PU < 100) indicates a normal range, and green zone (PU < 60) indicates undervaluation.

19 The annualized dilution rate is defined as 365 times the moving average of the newly minted bitcoin over the last 90 days, divided by the total bitcoin supply.
20 The 180/90/60/30-day volatility is measured as the standard deviation of the natural log of daily returns over the past 180/90/60/30 days.
21 Gersbach (2019) [32] introduces flexible majority rules for cryptocurrency issuance and shows that the flexible majority rules could foster the stability of a cryptocurrency.
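To make the construction in Eq. 1 concrete, the following is a minimal sketch of the token utility, the PU ratio, and the zones listed in the note to Fig. 7. The function names are hypothetical, the input values are made-up numbers rather than BTC data, and any unit conventions or rescaling behind the published PU levels are not specified in the text and are therefore not reproduced here. Price volatility defaults to 1, mirroring the simplification used in the illustration above, and the treatment of values exactly at 60 or 100 is an assumption.

```python
def token_utility(velocity, staking_ratio, dilution_rate, volatility=1.0):
    """Eq. 1: (token velocity * staking ratio) / (price volatility * dilution rate).

    volatility defaults to 1.0, i.e. it is left out of the ratio.
    """
    if dilution_rate <= 0 or volatility <= 0:
        raise ValueError("dilution_rate and volatility must be positive")
    return (velocity * staking_ratio) / (volatility * dilution_rate)


def pu_ratio(price, utility):
    """Price-to-utility ratio: token price divided by token utility."""
    return price / utility


def valuation_zone(pu):
    """Zones from the note to Fig. 7; boundary values are assigned to 'normal'
    here, which is an assumption because the note uses strict inequalities."""
    if pu > 100:
        return "overvalued"
    if pu < 60:
        return "undervalued"
    return "normal"


# Hypothetical inputs chosen only to give round numbers.
utility = token_utility(velocity=2.0, staking_ratio=50.0, dilution_rate=4.0)
zones = [valuation_zone(pu_ratio(p, utility)) for p in (1000.0, 2000.0, 3000.0)]
```

With these toy inputs the utility is held fixed, so the three prices fall into the undervalued, normal, and overvalued zones in turn, which is the buy-low and sell-high reading of the ratio.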
3 Cryptocurrency Valuation Ratio: Predictive Power on Future Returns
In this section, we document the predictive regressions of BTC future 1-week, 30-day, 90-day, 180-day, and 360-day returns on proxies for the BTC fundamental-to-market ratio. The proxies for the BTC valuation ratio include the inverse of the PE ratio (EPR), the inverse of the NVT ratio (TVN), the inverse of the PM ratio (MPR), and the inverse of the PU ratio (UPR). To make a comparison with the results reported by Liu and Tsyvinski (2021), we also include the (negative) past 100-week cumulative BTC returns (Past100), the AMR,22 the TMR,23 the PMR,24 and the first principal component of the previous eight proxies (FPC). We regress the BTC returns on the lagged cryptocurrency fundamental-to-market ratios, and the results are reported in Table 2.

Our regression differs from Liu and Tsyvinski (2021) [49] in three aspects. First, we add four more fundamental-to-market ratios: EPR, TVN, MPR, and UPR. Second, we test the predictive power on long-horizon returns up to 360 days, while Liu and Tsyvinski (2021) [49] only test for up to 8 weeks. Third, our ratios are at daily frequencies, while theirs are at weekly frequencies.

We have new findings aside from those consistent with existing literature. First, none of these ratios has enough predictive power on future BTC returns in the short run (i.e., 1-week and 30-day return on investment (ROI)25), which supports the finding in Liu and Tsyvinski (2021) [49]. Second, AMR, PMR, TMR, and UPR together provide a good prediction of future BTC returns at longer horizons, and the predictive power increases with the horizon. Third, UPR best predicts long-term returns at the 90-day, 180-day, and 360-day horizons. The latter two findings supplement the conclusions of Liu and Tsyvinski (2021) [49]. We conclude that there is a strong relationship between future BTC returns and several fundamental-to-value ratios, especially the inverse of the PU ratio, at longer horizons.

22 The number of active addresses to market cap.
23 The number of transaction counts to market cap.
24 The number of payments (transfers) to market cap.
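The predictive regressions described in this section can be sketched as a univariate least-squares fit of the future N-day return on the lagged ratio. This is a simplified illustration only: it assumes an intercept is included, uses hypothetical function names and toy data, and reports the slope and R² without the Newey-West adjusted t-statistics used in Table 2.

```python
def lagged_pairs(ratio, prices, horizon):
    """Pair the ratio observed on day t with the return from t to t + horizon."""
    n = min(len(ratio), len(prices)) - horizon
    xs = [ratio[t] for t in range(n)]
    ys = [prices[t + horizon] / prices[t] - 1.0 for t in range(n)]
    return xs, ys


def ols(xs, ys):
    """Univariate OLS with intercept; returns (intercept, slope, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy ** 2) / (sxx * syy) if syy > 0 else 0.0
    return intercept, slope, r_squared


# Hypothetical toy series in which the next-day return equals 0.1 * ratio.
ratio = [1.0, 2.0, 3.0, 4.0, 5.0]
prices = [100.0, 110.0, 132.0, 171.6, 240.24]
xs, ys = lagged_pairs(ratio, prices, horizon=1)
intercept, slope, r_squared = ols(xs, ys)
```

Repeating the fit for each proxy and each horizon gives one coefficient and one R² per cell, which is the layout of Table 2.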
4 Cryptocurrency Valuation: Explainable Machine Learning
An asset is overvalued (undervalued) when the ratio of the market cap to its fundamentals is high (low). The ratio is then useful in informing investors to follow the golden rule-“buy low and sell high.” In this section, we design K-means Clustering [51], an unsupervised machine learning method to assess further the explainability of the PU ratio in asset valuation. We identify a pattern: A high ROI is associated with a strategy of buying when the PU ratio is low and selling when the PU ratio is high, which is consistent with the theoretical implication of the PU ratio. We represent each N-day investment by its PU ratios at buy-in and sell-out date: (2) x(i) = (P U Ratiobuy−in − P U Ratiosell−out ) Then we label the data in four clusters. The indicator’s theoretical implication is deemed consistent with empirical data if the following two criteria are satisfied: 1. There exists a cluster k ∗ such that the PU ratio at the buy-in date is low and the PU ratio at the sell-out date is high; that is, for the centroid µk∗ of cluster k ∗ , the map of µk∗ on the x-axis is larger than that on the y-axis: µk∗ |x < µk∗ |y. 2. The investments in cluster k ∗ have a higher ROI than investments in other clusters. Figure 8 shows that only the PU ratio satisfies the two criteria. In Fig. 8 A, the x-axis represents the PU ratio on the buy-in date and the y-axis represents 25
In Table 2, we can see that the R squared values in the first two columns are all below 6%.
Cryptocurrency Valuation: An Explainable AI Approach
799
Fig. 8. (A) PU Ratio Clustering (90-Day ROI) (B) 90-Day ROI by Clustering Categories (PU Ratio)
800
Y. Liu and L. Zhang
Table 2. Comparison of the Cryptocurrency Valuation Indicators
(Columns report the future ROI over the stated horizon.)

Proxy    Statistic    1-week      30-days     90-days     180-days    360-days
Past100  Coefficient  −0.036      −0.137**    0.361***    0.663**     −2.217*
         t-statistic  (−1.229)    (−1.663)    (3.144)     (2.822)     (−2.260)
         R2           0.002       0.002       0.002       0.002       0.001
AMR      Coefficient  0.084***    0.448***    2.813***    7.521***    44.139***
         t-statistic  (4.229)     (7.095)     (5.982)     (9.370)     (10.711)
         R2           0.007       0.017       0.076       0.197       0.347
TMR      Coefficient  0.253**     1.000***    3.587***    7.285***    33.053***
         t-statistic  (6.295)     (8.371)     (8.289)     (12.205)    (10.173)
         R2           0.038       0.052       0.075       0.112       0.118
PMR      Coefficient  0.102***    0.614***    3.767***    9.286***    56.428***
         t-statistic  (4.172)     (7.068)     (6.478)     (9.343)     (10.688)
         R2           0.007       0.022       0.093       0.206       0.388
EPR      Coefficient  0.060*      0.327***    1.174***    4.755**     15.392***
         t-statistic  (3.523)     (5.162)     (6.632)     (9.794)     (12.045)
         R2           0.007       0.018       0.026       0.154       0.082
TVN      Coefficient  0.032       0.144***    1.586***    6.925***    19.048***
         t-statistic  (1.318)     (2.160)     (8.783)     (14.654)    (10.754)
         R2           0.002       0.003       0.038       0.260       0.101
MPR      Coefficient  0.016       −0.021      −0.202      0.184       5.906***
         t-statistic  (0.845)     (−0.321)    (−1.253)    (0.652)     (5.595)
         R2           0.000       0.000       0.000       0.000       0.007
UPR      Coefficient  0.071***    0.455**     3.906***    8.159***    49.486***
         t-statistic  (4.816)     (8.245)     (8.624)     (15.006)    (17.897)
         R2           0.007       0.026       0.211       0.335       0.628
FPC      Coefficient  0.045***    0.238***    1.334***    3.873***    18.349***
         t-statistic  (5.111)     (8.166)     (7.158)     (11.990)    (11.974)
         R2           0.010       0.025       0.086       0.264       0.302
Notes: This table reports the predictive regressions of BTC future 1-day, 1-week, 30-day, 90-day, 180-day, and 360-day returns on proxies for BTC fundamental-to-market ratio. The proxies for BTC valuation ratio include the (negative) past 100-week cumulative BTC returns, the address-to-market ratio, the transaction-to-market ratio, the payment-to-market ratio, the inverse of PE ratio (EPR), the inverse of NVT ratio (TVN), the inverse of PM ratio (MPR), the inverse of PU ratio (UPR), and the first principal component (FPC) of the previous eight proxies. The Newey-West adjusted t-statistics with n − 1 lags are reported in parentheses. *, **, and *** denote significance at the 10%, 5%, and 1% levels. The data frequency is daily.
Fig. 9. (A) Buy and Sell Signals for the PU Ratio. (B) Buy and Sell Signals for the MA Crossover Rule. (C) Buy and Sell Signals for the Buy-and-Hold Strategy.
the PU ratio on the sell-out date. Figure 8 B shows the ROI by labeled clusters. Cluster 1 (in green) exhibits a buy-low (highest PU ratio on the buy-in date: 117.39) and sell-high (lowest PU ratio after 90 days: 257.47) pattern, and its mean ROI dominates that of the three other clusters. Our results are robust to varying the trading period. We fail to identify similar patterns for the NVT and PM ratios.
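The clustering check described in this section can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it uses plain Lloyd's algorithm with a deterministic farthest-point initialization (an assumption, since the initialization is not stated in the text), and the toy data, ROI values, and function names are all hypothetical.

```python
from statistics import mean


def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def farthest_point_init(points, k):
    """Deterministic seeding (assumed): repeatedly pick the point farthest from chosen seeds."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    return centroids


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm on (PU at buy-in, PU at sell-out) pairs."""
    centroids = farthest_point_init(points, k)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each investment joins its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append((mean(m[0] for m in members), mean(m[1] for m in members)))
            else:
                updated.append(centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels


def consistent_cluster(rois, centroids, labels):
    """Return k* if the highest-ROI cluster also has centroid x < y, else None."""
    mean_roi = {}
    for j in range(len(centroids)):
        cluster_rois = [r for r, lab in zip(rois, labels) if lab == j]
        if cluster_rois:
            mean_roi[j] = mean(cluster_rois)
    best = max(mean_roi, key=mean_roi.get)   # criterion 2: highest mean ROI
    cx, cy = centroids[best]
    return best if cx < cy else None          # criterion 1: buy low, sell high


# Hypothetical toy data: four groups of (buy-in PU, sell-out PU) with made-up ROIs.
offsets = [(-5, -3), (0, 4), (6, -2), (-2, 5), (3, 1)]
groups = [((110, 250), 0.80), ((250, 110), -0.40), ((110, 110), 0.05), ((250, 250), 0.10)]
points, rois = [], []
for (bx, by), roi in groups:
    for dx, dy in offsets:
        points.append((bx + dx, by + dy))
        rois.append(roi + dx / 100)

centroids, labels = kmeans(points, k=4)
k_star = consistent_cluster(rois, centroids, labels)
```

On this toy data the highest-ROI cluster is the one with a low buy-in and a high sell-out PU ratio, so both criteria hold; with real data the same check would be repeated for each valuation ratio and trading period.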
5 Automated Trading Strategies
In this section, we propose an automated trading strategy based on the PU ratio and compare its performance to a conventional buy-and-hold strategy and a moving average (MA) crossover rule in finance.26 We find that the strategy based on the PU ratio outperforms the other two conventional strategies. Our strategy generates a buy (sell) signal when the PU ratio is equal to or lower (higher) than the 0.1 (0.9) quantile of its historical record. The MA crossover rule generates a buy (sell) signal when the short-window MA of the BTC price crosses above (below) its long-window MA. Finally, the buy-and-hold strategy buys at the start date and holds until the end date of the investment period. Figure 9 demonstrates the trading strategies with buy and sell signals. We evaluate the performance of the three strategies with an initial capital of 100,000 USD, a 0.1% transaction fee, and a transaction limit of 100 BTC at each buy or sell signal.27 We consider the timeframe from the first date of the BTC off-chain transaction on December 27, 2013, to December 31, 2020. At the end of the trading period, the automated trading strategy based on the PU ratio generates a gross ROI of 6,245.83% and an annualized Sharpe ratio of 3.65, which is higher than both the market timing strategy based on the MA crossover rule (gross ROI: 2,670.37%; annualized Sharpe ratio: 3.29) and the buy-and-hold strategy (gross ROI: 3,920.09%; annualized Sharpe ratio: 2.87). The result is robust across a variety of parameter settings and is even better when we do not set a transaction limit.
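The quantile rule and the evaluation setup above can be sketched as follows. This is a hedged illustration, not the released package: the expanding-window definition of the "historical record", the warm-up length, the way the fee is charged, and all data values are assumptions made for the example.

```python
def quantile(values, q):
    """Quantile with linear interpolation between order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def pu_signals(pu, low_q=0.1, high_q=0.9, warmup=5):
    """Return +1 (buy), -1 (sell), or 0 (hold) per day.

    Thresholds are computed on the history up to and including day t
    (an assumption), so no future information is used.
    """
    signals = []
    for t in range(len(pu)):
        if t + 1 < warmup:
            signals.append(0)  # too little history to estimate quantiles
            continue
        history = pu[: t + 1]
        if pu[t] <= quantile(history, low_q):
            signals.append(1)
        elif pu[t] >= quantile(history, high_q):
            signals.append(-1)
        else:
            signals.append(0)
    return signals


def backtest(prices, signals, capital=100_000.0, fee=0.001, max_btc=100.0):
    """Execute signals with a proportional fee and a per-signal BTC limit; return ROI."""
    cash, btc = capital, 0.0
    for price, sig in zip(prices, signals):
        if sig == 1 and cash > 0:
            qty = min(max_btc, cash / (price * (1 + fee)))  # affordable amount, capped
            cash -= qty * price * (1 + fee)
            btc += qty
        elif sig == -1 and btc > 0:
            qty = min(max_btc, btc)
            cash += qty * price * (1 - fee)
            btc -= qty
    final_value = cash + btc * prices[-1]  # mark remaining BTC to the last price
    return final_value / capital - 1.0


# Hypothetical PU ratios and prices, for illustration only.
pu = [10, 12, 11, 13, 12, 5, 12, 13, 30, 12]
prices = [100, 110, 105, 120, 115, 60, 118, 125, 300, 120]
signals = pu_signals(pu)
roi = backtest(prices, signals)
```

Setting `max_btc` to a very large number corresponds to the variant without a transaction limit; the MA crossover and buy-and-hold baselines would be run through the same `backtest` function with their own signal series.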
6 Conclusion and Discussion
Our results imply that considering features unique to the crypto market could contribute significantly to cryptocurrency valuation. For instance, in this study, we take advantage of the blockchain UTXO accounting method to construct a proxy for store-of-value. This approach could be seamlessly applied to other UTXO blockchains such as Bitcoin Cash,28 Dash,29 Dogecoin,30 Litecoin,31 and Zcash.32 Moreover, our approach can adapt to updates of blockchain mechanisms and generate further research findings. In recent years, cryptocurrencies based on proof-of-work have been criticized as a waste of energy. Most of the electricity consumed by miners is converted into hashing power to compute cryptographic puzzles instead of routing and executing transactions. To address this issue, proof-of-stake (PoS) blockchain projects, such as EOS,33 Tezos,34 and Cosmos,35 have gained momentum in recent years. In the PoS protocol, a certain fraction of native tokens are staked in the system by validators (i.e., miners, see Saleh 2020 [60]). Validators do not need to solve cryptographic puzzles to win the block rewards. The more tokens staked by a validator, the higher the chance that the validator becomes a block maker and earns the block rewards. Moreover, PoS projects have a flexible token dilution rate. For example, the dilution rate is dynamically adjusted by the Cosmos system to achieve a 66.7% staking ratio. When the staking ratio drops below the target, the block rewards automatically increase to attract more tokens staked in the system and vice versa. In such a case, the staking ratio and the dilution rate as proxies for store-of-value would be interdependent.

Besides store-of-value, other features unique to the crypto market, such as the decentralization level [68], the network features of peer-to-peer transactions [4], sentiment about the blockchain ecosystem [28,69], and upgrades of blockchain mechanisms [48,71], might also affect the values of cryptocurrencies. It would be interesting to explore further in this direction. Like all other asset prices, cryptocurrency prices are also affected by macroeconomic developments, which are beyond the scope of this paper. For example, on March 12, 2020, the cryptocurrency market crashed together with the stock market and the gold market due to the panic selling caused by the COVID-19 pandemic. Bitcoin prices plunged by more than 60%, almost twice as much as the stock market (e.g., Dow Jones Industrial Average and S&P 500 Index). However, one and a half months after the market collapse, the BTC price quickly recovered to its pre-crisis level while the stock market still stagnated at around 80% of its pre-crisis level. It would be interesting to study what unique attributes of the cryptocurrency market led to the different market trajectories and what the implications are for investment strategies.

26 The moving average rules (Gartley 1935 [29]) give a buy (sell) signal when the short-window moving average of the current price indicator moves above (below) its long-window moving average. Literature in finance (e.g., Brown and Jennings 1989 [14]; Brock, Lakonishok, and LeBaron 1992 [13]; Neely et al. 2014 [58]) shows evidence that investors benefit more by following the moving average rules than the buy-and-hold strategies. The intuition is that the moving average rules predict a short-term positive (negative) fluctuation in price when its short-term indicator moves above (below) its long-term indicator.
27 The 0.1% transaction fee is aligned with the standard fee structure at Binance, the largest crypto exchange platform.
28 https://www.bitcoincash.org/whitepaper.
29 https://docs.dash.org/en/stable/introduction/about.html.
30 https://cryptoverze.com/dogecoin-whitepaper.
31 https://www.allcryptowhitepapers.com/litecoin-whitepaper.
32 https://whitepaper.io/coin/zcash.
33 https://www.allcryptowhitepapers.com/eos-whitepaper.
34 https://wiki.tezosagora.org/whitepaper.
35 https://v1.cosmos.network/resources/whitepaper.

In the current paper, the indicators that we develop for cryptocurrency valuation are based on macro variables and functions of currency in monetary theory. However, every macro phenomenon is a manifestation of micro behavior in an aggregate form, which in general is derived from two micro facets: the choices of each individual based respectively on rationality or heuristic bias contingent on the state of the world. For the first, to model rational choices, each individual must envision at least all hypothetical scenarios in Decision under Ambiguity [45] and also form subjective probabilities for all possible scenarios in Decision under Uncertainty [34]. However, since crypto-economics is rapidly
evolving, both requirements are difficult, if not far-fetched, for the majority, which makes the problems more similar to decisions under ignorance [33,44,46,54], a vastly uncultivated field even at the foundations of microeconomic theory. Second, little literature seeks to understand behavioral patterns of bounded rationality [21,47]. For instance, Gemayel and Preda (2021) [30] find evidence of heuristic biases, including the disposition effect [63], self-attribution bias [43], and the gambler’s fallacy [15], among cryptocurrency traders. We see these two directions as promising next frontiers for future research.

Hey et al. (2009) [42] envision the fourth paradigm shift in scientific discovery to be data-driven. Our study shows that interdisciplinary research in machine learning could verify the explainability of valuation ratios by establishing consistency between their theoretical implications and empirical results. To facilitate future research, we distribute the trading algorithms as open-source software via the Python Package Index [1] (PyPI).36

Acknowledgments. We thank Prof. Campbell Harvey, Prof. Lin William Cong, and Prof. Ye Li for their insightful comments. Luyao Zhang thanks the Corporate Finance Institute (CFI) Workshop on Mathematical Finance and Cryptocurrencies at the University of Toronto’s Fields Institute, the 29th Annual Global Finance Conference, and the Workshop on Explainable AI in Finance and the Workshop on Women in AI and Finance, co-located with the 3rd ACM International Conference on AI in Finance (ICAIF), for hosting her presentations and inspiring discussions among the participants. We thank the anonymous referees at the Computing Conference for their professional and thoughtful comments. The corresponding author, Luyao Zhang, is supported by the National Science Foundation China on the project entitled “Trust Mechanism Design on Blockchain: An Interdisciplinary Approach of Game Theory, Reinforcement Learning, and Human-AI Interactions” (Grant No. 12201266). Luyao Zhang is also with SciEcon CIC, a not-for-profit organization in the United Kingdom aiming at cultivating interdisciplinary research of both profound insights and practical impacts. Yulin Liu is also with Shiku Foundation and Bochsler Finance, Switzerland.
36 Refer to the latest version: https://test.pypi.org/project/AlgorithmicTradingCV, and a related data science pipeline paper [70].

References
1. Adadi, A., Berrada, M.: Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 6, 52138–52160 (2018)
2. Alabi, K.: Digital blockchain networks appear to be following metcalfe’s law. Electron. Commer. Res. Appl. 24, 23–29 (2017)
3. Alabi, K.: A 2020 perspective on “digital blockchain networks appear to be following metcalfe’s law”. Electron. Commer. Res. Appl. 40, 100939 (2020)
4. Ao, Z., Horvath, G., Zhang, L.: Are decentralized finance really decentralized? A social network analysis of the Aave protocol on the Ethereum blockchain. arXiv preprint arXiv:2206.08401 (2022). https://doi.org/10.48550/arXiv.2206.08401. https://arxiv.org/abs/2206.08401
5. Athey, S.: Beyond prediction: using big data for policy problems. Science 355(6324), 483–485 (2017)
6. Athey, S.: 21. the impact of machine learning on economics. In: The Economics of Artificial Intelligence, pp. 507–552. University of Chicago Press (2019)
7. Athey, S., Imbens, G.W.: The state of applied econometrics: causality and policy evaluation. J. Econ. Perspectives 31(2), 3–32 (2017)
8. Athey, S., Imbens, G.W.: Machine learning methods that economists should know about. Ann. Rev. Econ. 11, 685–725 (2019)
9. Athey, S., Parashkevov, I., Sarukkai, V., Xia, J.: Bitcoin pricing, adoption, and usage: Theory and evidence (2016)
10. Biais, B., Bisiere, C., Bouvard, M., Casamatta, C., Menkveld, A.J.: Equilibrium bitcoin pricing. Available at SSRN 3261063 (2020)
11. Blau, B.M.: Price dynamics and speculative trading in bitcoin. Res. Int. Bus. Financ. 41, 493–499 (2017)
12. Borri, N.: Conditional tail-risk in cryptocurrency markets. J. Empir. Financ. 50, 1–19 (2019)
13. Brock, W., Lakonishok, J., LeBaron, B.: Simple technical trading rules and the stochastic properties of stock returns. J. Financ. 47(5), 1731–1764 (1992)
14. Brown, D.P., Jennings, R.H.: On technical analysis. Rev. Financ. Stud. 2(4), 527–551 (1989)
15. Chen, D.L., Moskowitz, T.J., Shue, K.: Decision making under the gambler’s fallacy: evidence from asylum judges, loan officers, and baseball umpires. Q. J. Econ. 131(3), 1181–1242 (2016)
16. Ciaian, P., Rajcaniova, M., Kancs, d.: The economics of bitcoin price formation. Appl. Econ. 48(19), 1799–1815 (2016)
17. Cong, L.W., Li, Y., Wang, N.: Tokenomics: dynamic adoption and valuation. Rev. Financ. Stud. 34(3), 1105–1155 (2021)
18. Cong, L.W., Tang, K., Wang, J., Zhang, Y.: Alphaportfolio for investment and economically interpretable ai. SSRN (2020)
19. Cong, L.W., Tang, K., Wang, J., Zhang, Y.: Deep sequence modeling: development and applications in asset pricing. J. Financ. Data Sci. 3(1), 28–42 (2021)
20. Cong, L.W., Xiao, Y.: Categories and functions of crypto-tokens. In: Pompella, M., Matousek, R. (eds.) The Palgrave Handbook of FinTech and Blockchain, pp. 267–284. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-66433-6_12
21. Conlisk, J.: Why bounded rationality? J. Econ. Literat. 34(2), 669–700 (1996)
22. Corbet, S., Lucey, B., Yarovaya, L.: Datestamping the bitcoin and ethereum bubbles. Financ. Res. Lett. 26, 81–88 (2018)
23. De Bondt, W.F., Thaler, R.H.: Further evidence on investor overreaction and stock market seasonality. J. Financ. 42(3), 557–581 (1987)
24. Fama, E.F., French, K.R.: Common risk factors in the returns on stocks and bonds. J. Financ. Econ. 33(1), 3–56 (1993)
25. Fantazzini, D., Nigmatullin, E., Sukhanovskaya, V., Ivliev, S.: Everything you always wanted to know about bitcoin modelling but were afraid to ask. https://mpra.ub.uni-muenchen.de/71946/ (2016)
26. Fry, J.: Booms, busts and heavy-tails: the story of bitcoin and cryptocurrency markets? Econ. Lett. 171, 225–229 (2018)
27. Fry, J., Cheah, E.T.: Negative bubbles and shocks in cryptocurrency markets. Int. Rev. Financ. Anal. 47, 343–352 (2016)
28. Fu, Y., Zhuang, Z., Zhang, L.: Ai ethics on blockchain: Topic analysis on twitter data for blockchain security. arXiv preprint arXiv:2212.06951 (2022). https://arxiv.org/abs/2212.06951
29. Gartley, H.M.: Profits in the stock market. Health Research Books (1935)
30. Gemayel, R., Preda, A.: Performance and learning in an ambiguous environment: a study of cryptocurrency traders. Int. Rev. Financ. Anal. 77, 101847 (2021)
31. Georgoula, I., Pournarakis, D., Bilanakos, C., Sotiropoulos, D., Giaglis, G.M.: Using time-series and sentiment analysis to detect the determinants of bitcoin prices. Available at SSRN 2607167 (2015)
32. Gersbach, H.: Flexible majority rules for cryptocurrency issuance (2019)
33. Giang, P.H.: Decision making under ignorance. In: Rogova, G., Scott, P. (eds.) Fusion Methodologies in Crisis Management, pp. 435–454. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-22527-2_20
34. Gilboa, I.: Theory of decision under uncertainty, vol. 45. Cambridge University Press (2009)
35. Gu, S., Kelly, B., Xiu, D.: Empirical asset pricing via machine learning. Rev. Financ. Stud. 33(5), 2223–2273 (2020)
36. Gu, S., Kelly, B., Xiu, D.: Autoencoder asset pricing models. J. Econometrics 222(1), 429–450 (2021)
37. Gunning, D., Stefik, M., Choi, J., Miller, T., Stumpf, S., Yang, G.Z.: XAI-explainable artificial intelligence. Sci. Robot. 4(37) (2019)
38. Haeringer, G., Halaburda, H.: Bitcoin: a revolution? In: Ganuza, J., Llobert, G. (eds.) Economic Analysis of the Digital Revolution. FUNCAS (2018)
39. Halaburda, H., Haeringer, G., Gans, J.S., Gandal, N.: The microeconomics of cryptocurrencies. J. Econ. Literat. forthcoming (2022)
40. Härdle, W.K., Harvey, C.R., Reule, R.C.: Understanding cryptocurrencies (2020)
41. Harvey, C.R., Ramachandran, A., Santoro, J.: DeFi and the Future of Finance. Wiley (2021)
42. Hey, A.J., Tansley, S., Tolle, K.M., et al.: The fourth paradigm: data-intensive scientific discovery, vol. 1. Microsoft Research, Redmond, WA (2009)
43. Hoffmann, A.O., Post, T.: Self-attribution bias in consumer financial decision-making: how investment returns affect individuals’ belief in skill. J. Behav. Exp. Econ. 52, 23–28 (2014)
44. Hogarth, R.M., Kunreuther, H.: Decision making under ignorance: arguing with yourself. J. Risk Uncertain. 10(1), 15–36 (1995)
45. Karni, E., Maccheroni, F., Marinacci, M.: Ambiguity and nonexpected utility. Handbook of Game Theory with Economic Applications 4, 901–947 (2015)
46. Karni, E., Vierø, M.L.: “Reverse bayesianism”: a choice-based theory of growing awareness. Am. Econ. Rev. 103(7), 2790–2810 (2013)
47. Levin, D., Zhang, L.: Bridging level-k to nash equilibrium. Rev. Econ. Stat. 104(6), 1329–1340 (2022)
48. Liu, Y., Lu, Y., Nayak, K., Zhang, F., Zhang, L., Zhao, Y.: Empirical analysis of eip-1559: Transaction fees, waiting times, and consensus security. In: Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, CCS 2022, pp. 2099–2113. Association for Computing Machinery, New York (2022). https://doi.org/10.1145/3548606.3559341. https://arxiv.org/abs/2305.02552
49. Liu, Y., Tsyvinski, A.: Risks and returns of cryptocurrency. Rev. Financ. Stud. 34(6), 2689–2727 (2021)
50. Liu, Y., Zhang, L., Zhao, Y.: Deciphering bitcoin blockchain data by cohort analysis. Sci. Data 9, 136 (2022). https://doi.org/10.1038/s41597-022-01254-0
51. MacQueen, J., et al.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297, Oakland, CA, USA (1967)
52. Mai, F., Shan, Z., Bai, Q., Wang, X., Chiang, R.H.: How does social media impact bitcoin value? a test of the silent majority hypothesis. J. Manag. Inf. Syst. 35(1), 19–52 (2018)
53. Malkiel, B.G.: Passive investment strategies and efficient markets. Eur. Financ. Manag. 9(1), 1–10 (2003)
54. Maskin, E.: Decision-making under ignorance with implications for social choice. In: Game theory, social choice and ethics, pp. 319–337. Springer, Cham (1979). https://doi.org/10.1007/978-94-009-9532-1_9
55. Metcalfe, B.: Metcalfe’s law after 40 years of ethernet. Computer 46(12), 26–31 (2013)
56. Moskowitz, T.J.: Asset pricing and sports betting. J. Financ. 76(6), 3153–3209 (2021)
57. Mullainathan, S., Spiess, J.: Machine learning: an applied econometric approach. J. Econ. Perspectives 31(2), 87–106 (2017)
58. Neely, C.J., Rapach, D.E., Tu, J., Zhou, G.: Forecasting the equity risk premium: the role of technical indicators. Manage. Sci. 60(7), 1772–1791 (2014)
59. Polasik, M., Piotrowska, A.I., Wisniewski, T.P., Kotkowski, R., Lightfoot, G.: Price fluctuations and the use of bitcoin: an empirical inquiry. Int. J. Electron. Commer. 20(1), 9–49 (2015)
60. Saleh, F.: Blockchain without waste: proof-of-stake. Rev. Financ. Stud. 34(3), 1156–1190 (2021)
61. Shapiro, C., Varian, H.R.: Information rules: A strategic guide to the network economy. Harvard Business Review Press (1998)
62. Sharpe, W.F.: Likely gains from market timing. Financ. Anal. J. 31(2), 60–69 (1975)
63. Shefrin, H., Statman, M.: The disposition to sell winners too early and ride losers too long: theory and evidence. J. Financ. 40(3), 777–790 (1985)
64. Shiller, R.J.: Irrational Exuberance. Princeton Univ (2000)
65. Shilling, A.G.: Market timing: better than a buy-and-hold strategy. Financ. Anal. J. 48(2), 46–50 (1992)
66. Varian, H.R.: Big data: new tricks for econometrics. J. Econ. Perspectives 28(2), 3–28 (2014)
67. Wheatley, S., Sornette, D., Huber, T., Reppen, M., Gantner, R.N.: Are bitcoin bubbles predictable? combining a generalized metcalfe’s law and the log-periodic power law singularity model. Royal Soc. Open Sci. 6(6), 180538 (2019)
68. Zhang, L., Ma, X., Liu, Y.: Sok: blockchain decentralization. arXiv preprint arXiv:2205.04256 (2022). https://doi.org/10.48550/arXiv.2205.04256. https://arxiv.org/abs/2205.04256
69. Zhang, L., Sun, Y., Quan, Y., Cao, J., Tong, X.: On the mechanics of NFT valuation: AI ethics and social media (2023). https://doi.org/10.31219/osf.io/qwpdx. https://doi.org/10.31219
70. Zhang, L., Wu, T., Lahrichi, S., Salas-Flores, C.G., Li, J.: A data science pipeline for algorithmic trading: a comparative study of applications for finance and cryptoeconomics. In: 2022 IEEE International Conference on Blockchain (Blockchain), pp. 298–303 (2022). https://doi.org/10.1109/Blockchain55522.2022.00048
71. Zhang, L., Zhang, F.: Understand waiting time in transaction fee mechanism: An interdisciplinary perspective. arXiv preprint arXiv:2305.02552 (2023). https://doi.org/10.48550/arXiv.2305.02552. https://arxiv.org/abs/2305.02552
The Path to Autonomous Learners
Hanna Abi Akl
Data ScienceTech Institute, Paris, France
[email protected]
Abstract. In this paper, we present a new theoretical approach for enabling domain knowledge acquisition by intelligent systems. We introduce a hybrid model that starts with minimal input knowledge in the form of an upper ontology of concepts, stores and reasons over this knowledge through a knowledge graph database and learns new information through a Logic Neural Network. We study the behavior of this architecture when handling new data and show that the final system is capable of enriching its current knowledge as well as extending it to new domains.
Keywords: Neuro-Symbolic AI · Knowledge Graphs · Ontology · First-Order Logic

1 Introduction
Artificial intelligence has taken strides in enabling machines to perform tasks at a near-human level. This raises a question about the relation between machines and knowledge, specifically how far machines are from acquiring knowledge in an intelligent way. Today’s intelligent systems do not show the same signs of learning as humans, and while our methods of learning are by no means optimal, they at least enable us to adapt to new domains. Machines, on the other hand, still struggle when introduced to a new environment or domain and remain limited in performing human tasks. The field of natural language, for example, has considerably benefited from advances in neural models such as Large Language Models (LLMs). While these models have been able to rival human performance on many NLP tasks [1,13,16], they still present shortcomings, most notably in their inability to reason and their over-reliance on huge amounts of training data to learn. These shortcomings have led to questions about the long-term capabilities of LLMs, most notably how much data they should be provided to generalize their learning to almost any domain, how much scaling is required to accommodate this increase in data, and whether these factors are enough to exhibit some sort of domain-adaptive learning like that of humans [18].

In the face of these problems with neural models, a tide of symbolic applications has resurfaced, and research combining both neural and symbolic models (dubbed neuro-symbolic) has emerged. These models aim to leverage the power of neural networks while imbuing them with a rule-based framework [17]. The main upsides are to enhance the learning capabilities of these models (and to be able to trace the way they learn, something that is notoriously difficult to do with black-box language models) and to enable them to perform while relying less on data, thereby reducing the massive data set requirement. In the vein of neuro-symbolic research, Logic Neural Networks (LNNs) have emerged as natural candidates to reconcile the neural and symbolic learning approaches [15]. While harnessing the capabilities of neural networks, LNNs are also capable of reasoning based on first-order logical rules, which makes it inherently easy for them to understand and derive concepts such as equivalence, negation and implication. In order to make sense of acquired knowledge, knowledge graphs have been shown to map information faithfully by modeling concepts as nodes and connections between them as edges [7,8]. This makes them attractive tools for storing and linking data from different sources.

In this paper, we propose a system intended to reshape the manner in which machines acquire and extrapolate knowledge. We model our approach by placing human learning at its center, i.e., we draw inspiration from human intelligence by breaking down human learning into three main components: minimal knowledge, reasoning and learning. Each component is treated in a section as part of our final solution. We demonstrate how machines can benefit from this approach by starting with minimal knowledge and presenting two use cases: reasoning to enrich an existing knowledge set and extending learning to new domains. The rest of this paper is organized as follows. We discuss related work, introduce the technical details of our proposed approach, present practical use cases and finally provide the conclusions of our work.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): SAI 2023, LNNS 711, pp. 808–830, 2023. https://doi.org/10.1007/978-3-031-37717-4_52
2 Related Work
Graph Neural Networks (GNNs) have recently gained traction in artificial intelligence research. They have been shown to boost graph representation learning and achieve state-of-the-art performance on many human tasks such as computer vision and language [4,20]. In particular, GNNs have been useful in explaining prediction performance for Deep Learning models and in shifting away from the black-box architecture of learning systems [10]. The field of Neuro-symbolic Computing has also surged and paved the way for Neuro-Symbolic Artificial Intelligence [19]. This branch of AI emphasizes a knowledge-driven paradigm that promotes strong generalization ability and interpretability [21]. Neuro-symbolic systems have also shown promising performance on many human tasks, which may make them key to unlocking next-generation AI [3]. However, these models present their own pitfalls, most notably their inability to handle unstructured data, their weak robustness, which can make them difficult to scale, and their slowness in reasoning, which might raise performance issues depending on the application. Both GNNs and Neuro-symbolic systems seem to possess advantageous qualities that merit further exploration in the endeavor to make artificial models more intelligent. Recent studies point to a combined effort to reconcile both technologies to inherit the semantic representation of graph networks as well as the logical framework that powers Neuro-symbolic models [9]. This combination effectively makes for more interpretable models that operate on clear logical rules and are capable of symbolically representing information in an inter-operable network. This paper follows the logical progression of these hybrid models and builds on them.
3 Proposed Approach
Humans are intuitive and logical beings. They are capable of many intelligent functions such as retaining and storing information for later retrieval, finding relationships between different objects and explaining new ideas using mechanisms like deduction or comparison. Our approach aims to align machine learning with human learning and to emulate the features of the latter. For that, we model a pipeline that approximates the human stages of learning. We identify three principal stages:
– Minimal Input Knowledge. The basic amount of information humans start with. In our pipeline, this is the starting data containing the basic knowledge a machine should have. Here we base our approach on the hypothesis that every being starts with a set of information that constitutes the foundation for future learning, and we implement our system accordingly. The intuition behind this approach is that humans always seek to expand their knowledge, but no matter how much new information they retain, it represents an infinitesimal amount of the existing knowledge in the world. By identifying and attributing a minimal knowledge set to our system, we aim to emulate the scenario of a human being that is continually learning.
– Reasoning. This phase assumes the system already possesses some data or information that represents knowledge. At this stage, the system should be capable of reasoning, i.e., making sense of the existing data. This implies defining and differentiating data points (which we loosely call concepts), as well as correctly finding and tracing the relations that bind them, i.e., reasoning over concepts that are linked together.
– Learning. The system should be prepared to deal with changing reference frames. For that, it must be able to adapt to new domains. Adaptability is a hallmark of intelligence and a necessary ingredient for learning. A system that is capable of adapting to new types and sources of information is a system that showcases thinking and can therefore extend the scope of its knowledge. There should also be a feedback mechanism to incorporate every new piece of knowledge and update the existing information of the system.
Each principal stage is modeled by a component in our overall framework. The general architecture is shown in Fig. 1. We elaborate on each component in the following subsections.
Fig. 1. Framework of our Proposed Approach.
3.1 Minimal Input Knowledge
In order to imbue our system with initial knowledge, we need to identify a basic set of representational knowledge that will serve as a foundation for future learning. The criteria to consider in choosing this knowledge set are that it should be small (i.e., like a human baby, the system need not start with huge amounts of data) and that the data should be general enough to prevent specialization in a particular domain. The problem of selecting or constructing the right knowledge set falls under the umbrella of Knowledge Engineering. This problem has already been addressed at the task level, i.e., constructing the right knowledge set to perform well on a specific task [12]. Specifically, there is a branch of research in the field of knowledge engineering that tries to answer the question of deriving a general, consolidated knowledge set that represents the foundational knowledge of the universe. So far no such set has been derived, but several representative sets called Upper Ontologies have been compiled for that purpose [6]. An ontology is a formal representation of a set of information falling under the same theme or domain. An upper ontology is a high-level representation that transcends a specific domain and is general enough to cover multiple domains [11]. Among the available upper ontologies, none constitutes the point of reference on its own, yet each of them possesses interesting features that make it a reference depending on the case of study. For our system, we do not treat the problem of consolidating these ontologies, but rather focus our efforts on selecting the best one as a sufficient starting point for our Minimal Input Knowledge. Our selection criteria are two-fold: size and availability. It is important that our system starts with a small information set like an early-stage human. Moreover, this knowledge set should be readily accessible to us. The choice falls on the Proton ontology specifically since it ticks both boxes: it is freely available and small compared to the other upper ontologies. Proton consists of 25 classes and 77 properties and covers most of the high-level world concepts.

3.2 Reasoning
The purpose of the Reasoning module is to make sense of existing data. By representing the Proton ontology as a knowledge graph, we can map classes and properties to nodes and the relations linking them to edges. The advantage of using a knowledge graph is that it enforces a dynamic structure on the available information and allows easy navigation between concepts. In our architecture, nodes represent classes or high-level concepts, e.g., Person. Edges represent relationships between classes, e.g., Person is a subclass of Agent. Here the subclass of relationship enables the system to extend the specific concept of Person to the more general Agent concept. This allows the system to classify new information by identifying the most likely class (or node) it might belong to. Properties add a layer of precision to the representation of the knowledge set. They provide additional criteria to distinguish classes and can help the system to better define them. In the knowledge graph representation of the Proton ontology, properties are also represented as nodes and are linked to their relevant classes by edges. The Reasoning module serves a dual purpose: retaining existing information and dynamically updating its network to accommodate new information. Whenever the system is exposed to a new concept, it will first treat this concept as an instance. We define an instance as a sub-node, i.e., a node derived from a class node. The system will first try to connect the instance somewhere along the existing graph network using the information it possesses on class nodes and their properties. If it fails to find a suitable class to link the instance to or has insufficient information about the instance, then it will treat it as a class by itself and add it as a concept node to the existing graph. 
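The placement step described above can be illustrated with a minimal sketch. It is not the implementation used in the paper: the `KnowledgeGraph` class, its `place` method and the property-overlap heuristic for choosing "a suitable class" are all assumptions made for illustration. The class and property names are taken from Tables 1 and 2.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeGraph:
    """Tiny in-memory graph: typed nodes plus labelled, directed edges."""
    nodes: dict = field(default_factory=dict)  # name -> "class" | "property" | "instance"
    edges: set = field(default_factory=set)    # (source, relation, target) triples

    def add_node(self, name, kind):
        self.nodes[name] = kind

    def add_edge(self, source, relation, target):
        self.edges.add((source, relation, target))

    def properties_of(self, cls):
        """Names of the property nodes linked to class `cls`."""
        return {s for (s, r, t) in self.edges if r == "propertyOf" and t == cls}

    def place(self, concept, observed_properties):
        """Link `concept` to the class sharing the most properties with it.

        Returns the chosen class, or None when no class matches, in which
        case `concept` is added to the graph as a new class node.
        (The overlap heuristic is an assumption, not the paper's method.)
        """
        observed = set(observed_properties)
        best, best_overlap = None, 0
        for name, kind in self.nodes.items():
            if kind == "class":
                overlap = len(self.properties_of(name) & observed)
                if overlap > best_overlap:
                    best, best_overlap = name, overlap
        if best is None:
            self.add_node(concept, "class")
        else:
            self.add_node(concept, "instance")
            self.add_edge(concept, "isinstanceOf", best)
        return best


kg = KnowledgeGraph()
for cls in ("Entity", "Agent", "Person", "Location"):
    kg.add_node(cls, "class")
kg.add_edge("Person", "subClassOf", "Agent")
for prop, cls in (("latitude", "Location"), ("longitude", "Location"),
                  ("last name", "Person")):
    kg.add_node(prop, "property")
    kg.add_edge(prop, "propertyOf", cls)

paris_class = kg.place("paris", ["latitude", "longitude"])  # matches a known class
justice_class = kg.place("justice", ["fairness"])           # no match: becomes a class
print(paris_class, kg.nodes["paris"], justice_class, kg.nodes["justice"])
```

In this sketch a concept with recognisable properties becomes an instance sub-node, while a concept the graph knows nothing about is promoted to a concept node of its own, mirroring the two outcomes described in the text.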
The reasoning required to make these informed decisions relies on a logical framework, i.e., a set of rules, that is provided by the Learning module and is communicated to the Reasoning component through the Feedback module represented in Fig. 1.
3.3 Learning
The Learning module is represented by a Logic Neural Network (LNN). The network instantiates a world model in which it learns the existing concepts and the connections between them through a set of first-order logic rules. These rules can be seen as the instruction set of the module to handle existing as well as incoming data, and they enable it to update the graph knowledge base by creating associations with new concepts. For that, the module runs an inference on the set of rules (i.e., axioms) and the existing knowledge graph to derive meaning for the nodes and edges. Then, information can be queried in the form of predicates that evaluate to either True or False based on the rule set of the LNN. Using these mechanics, we can plan ahead for the system and anticipate its exposure to a new domain or environment by writing first-order logic rules that build on its Minimal Input Knowledge.
3.4 Feedback
The Feedback module is wrapper code that establishes a connection with the graph knowledge base and provides an API to manipulate graph objects. The purpose of this component is to provide swift communication between the Reasoning and Learning modules. It does so by transforming logic rule inferences into graph database operations such as query, insert and update, and by mapping symbolic representations of new and existing data (i.e., variables in first-order logic) to graph objects, i.e., nodes and edges.
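As a rough illustration of this translation layer, the sketch below maps ground binary predicates to edges and answers True/False queries against them. The `FeedbackBridge` class, its method names and the textual predicate format are hypothetical, and a plain Python set stands in for the graph database; none of this is taken from the paper's code.

```python
import re

# Ground binary predicates such as "subClassOf(Person, Agent)".
PREDICATE = re.compile(r"^\s*(\w+)\(\s*(\w+)\s*,\s*(\w+)\s*\)\s*$")


class FeedbackBridge:
    """Hypothetical wrapper between logic predicates and a graph store."""

    def __init__(self):
        self.edges = set()  # stand-in for a graph database connection

    @staticmethod
    def to_edge(predicate):
        """Map relation(source, target) to a (source, relation, target) edge."""
        match = PREDICATE.match(predicate)
        if match is None:
            raise ValueError(f"not a ground binary predicate: {predicate!r}")
        relation, source, target = match.groups()
        return (source, relation, target)

    def insert(self, predicate):
        """Store an inferred fact as an edge; return True if it was new."""
        edge = self.to_edge(predicate)
        is_new = edge not in self.edges
        self.edges.add(edge)
        return is_new

    def query(self, predicate):
        """Truth value of a ground predicate against the stored edges."""
        return self.to_edge(predicate) in self.edges


bridge = FeedbackBridge()
bridge.insert("subClassOf(Person, Agent)")
bridge.insert("isinstanceOf(alice, Person)")   # "alice" is an invented example
known = bridge.query("isinstanceOf(alice, Person)")
unknown = bridge.query("isinstanceOf(alice, Agent)")
print(known, unknown)
```

Note that the second query stays False here because the bridge only stores facts; deriving it would be the job of the rule inference in the Learning module.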
4 Experiments
To test the behavior of our framework, we devise 2 experiments designed to tackle different aspects of the learning process: enriching the current knowledge set with new information and integrating a new body of knowledge. We develop each experiment in the following subsections.
4.1 Enriching Knowledge Set
We define the Proton upper ontology as our starting knowledge set. We store the data in a knowledge graph such that classes and properties are represented by nodes. Class-class, class-property and property-property relations are represented by edges. In the Proton ontology, the only type of class-class relation is the subClassOf relationship. Class-property relations are represented differently: each property has a domain and a range, i.e., a mapping from one class to another. The domain references the source class and the range represents the target class of the property. Finally, for property-property relations, we choose to focus on the two most represented types: subPropertyOf and inverseOf. Table 1 shows how class-subclass relationships are stored. In Table 2, we list all class-property relations. Tables 3 and 4 showcase the subPropertyOf and inverseOf relations between properties in our graph network.

Table 1. Class subClassOf Relations
Child               Parent      Child         Parent
Language            Abstract    Event         Entity
Language            Entity      JobPosition   SocialPosition
Happening           Entity      JobPosition   Situation
Organization        Group       JobPosition   Happening
Organization        Agent       JobPosition   Entity
Organization        Object      Group         Agent
Organization        Entity      Group         Object
ProductModel        Object      Group         Entity
ProductModel        Entity      Document      Inf.Res
Location            Object      Document      Statement
Location            Entity      Document      Object
ContactInformation  Abstract    Document      Entity
ContactInformation  Entity      Abstract      Entity
Inf.Res             Statement   Role          Situation
Inf.Res             Object      Role          Happening
Inf.Res             Entity      Role          Entity
Service             Object      Agent         Object
Service             Entity      Agent         Entity
Number              Abstract    Topic         Abstract
Number              Entity      Topic         Entity
TimeInterval        Happening   Object        Entity
TimeInterval        Entity      Statement     Object
Situation           Happening   Statement     Entity
Situation           Entity      GeneralTerm   Abstract
SocialPosition      Situation   GeneralTerm   Entity
SocialPosition      Happening   Person        Agent
SocialPosition      Entity      Person        Object
Event               Happening   Person        Entity

We also add the following axioms (i.e., first-order logic rules) to our system as the basic rule set to enable it to make sense of its existing data and learn new information in the context of its graph network:
– propagate-class-instance-to-superclass (Axiom 1): ∀x∀y∀z (isinstanceOf(x, y) ∧ subClassOf(y, z) ⇒ isinstanceOf(x, z))
– propagate-class-property-to-instance (Axiom 2): ∀x∀y∀z (isinstanceOf(x, y) ∧ propertyOf(z, y) ⇒ propertyOf(z, x))
– propagate-subproperty-to-class (Axiom 3): ∀x∀y∀z (subPropertyOf(x, y) ∧ propertyOf(y, z) ⇒ propertyOf(x, z))
– propagate-inverse-to-class (Axiom 4): ∀x∀y∀z (inverseOf(x, y) ∧ propertyOf(y, z) ⇒ propertyOf(x, z))
Using these axioms, the model should be able to make deductions such as: following a chain of subClassOf relations from parent to child to grandchild node and deducing that the grandchild is a subclass of its parent and grandparent; propagating class properties to all instances of that class; and understanding that a subproperty and an inverse property are tied to a property and tracing it back to its relevant class.
We test our model’s understanding by introducing 2 examples:
– english: We define this information as an instance of the Language class. Since Language is a subclass of Abstract and Abstract is a subclass of Entity, the model correctly deduces that english is also an instance of both Abstract and Entity (Axiom 1) and adds these relations to its network.
– paris: We define this information as an instance of the Location class. The model applies the same reasoning to deduce that paris is also an instance of Object and Entity since Location is a subclass of Object and Object is a subclass of Entity (Axiom 1). Additionally, Location has the following properties: “nima gns unique feature identifier”, “longitude”, “population count”, “subregion of”, “nima gns designator” and “latitude”. The instance paris also inherits these properties (Axiom 2).
4.2 Extending to a New Domain
To test the adaptability of our system, we introduce it to a new set of knowledge. We choose another top-level ontology, the Basic Formal Ontology (BFO), to imbue our framework with as much general knowledge as possible. Another possibility would have been to provide a more domain-specific ontology to try to specialize our system. The BFO ontology is another freely-available resource that contains 34 categories and eight relations. By integrating this ontology into the framework’s knowledge graph, we aim to measure where and how BFO concepts intersect with Proton concepts. Since the BFO is designed to promote interoperability between domains, its categories consist mainly of general concepts much like the Proton ontology. Figure 2 displays the hierarchical structure of the BFO.
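One simple way to picture how two ontologies can share a graph and where their concepts intersect is sketched below. It assumes that concepts are matched by label, which the paper does not state explicitly. The Proton edges come from Table 1. The BFO fragment is an abbreviated chain written from the commonly published BFO hierarchy and should be checked against Fig. 2. The helper names `concepts` and `merge` are invented for this sketch.

```python
# (child, parent) subClassOf pairs.
PROTON = {("Object", "Entity"), ("Agent", "Object"),
          ("Person", "Agent"), ("Location", "Object")}
# Assumed, abbreviated BFO chain; labels capitalised to match Proton's style.
BFO = {("Object", "Material Entity"),
       ("Material Entity", "Independent Continuant"),
       ("Independent Continuant", "Continuant"),
       ("Continuant", "Entity")}


def concepts(edges):
    """All concept labels mentioned in a set of (child, parent) pairs."""
    return {name for edge in edges for name in edge}


def merge(**ontologies):
    """Union the edge sets, remembering which ontology contributed each edge."""
    merged = {}
    for source, edges in ontologies.items():
        for edge in edges:
            merged.setdefault(edge, set()).add(source)
    return merged


shared = concepts(PROTON) & concepts(BFO)
graph = merge(Proton=PROTON, BFO=BFO)
object_parents = {parent: sources
                  for (child, parent), sources in graph.items()
                  if child == "Object"}

print(sorted(shared))
for parent, sources in sorted(object_parents.items()):
    print("Object subClassOf", parent, "from", sorted(sources))
```

Under the label-matching assumption, a shared node such as Object ends up with parents from both ontologies, so any rule that walks subClassOf edges sees both hierarchies at once.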
Fig. 2. Structure of the Basic Formal Ontology (BFO).
H. Abi Akl
We see that the BFO defines Entity and Object concepts, much like the Proton ontology. Our model should be able to identify these similarities and dynamically extend its network accordingly. Leaving the axioms in our system unchanged, we introduce a new term, square, and define it as an instance of Object. But which Object are we referring to? Is it the concept belonging to the Proton ontology or that of the BFO? The answer is both. With the rule set at its disposal, our model should be capable of handling this ambiguity, deriving the connections and properties learned from both ontologies and attributing them to the square instance. The system produces the correct information and deduces that square is an instance of Entity from the Proton ontology and of Material Entity from the BFO (Axiom 1). It also attributes the “is owned by” and “has contact info” Object properties to square (Axiom 2).
4.3 Results
From our experiments, we see that our system is capable of both reasoning and learning. The suggested framework shows that the model can reason since it can handle new incoming information and tie it to its data by using its graph network of classes and properties to augment its existing knowledge set. Our architecture also proves the overall system can learn since it can incorporate additional domain information like new ontologies and integrate it into its knowledge base through a set of logical rules. Since our experiments are performed exclusively with upper ontologies, we find that our proposed system also enables swift integration between them. This may lead to future work on the topic of creating a master top-level ontology that serves as a unique reference for all general knowledge. Finally, to verify the sanity of our model’s reasoning, we make the system output its learned network including the Proton ontology, the BFO and the examples we used for our experiments. The full model log results can be found in the Appendix section. We also publicly share the code containing the framework and experiments at https://github.com/HannaAbiAkl/AutonomousLearner.
4.4 Limitations
Most data in the real world is unstructured, as opposed to formally defined ontologies. One such form of unstructured data is natural language text. We can then ask how well our system is equipped to handle natural language, for example to extract relevant information from it or to answer queries. If the ultimate objective of our proposed framework is to get closer to human intelligence, then these are some of the multitude of tasks it should be able to deal with. However, this remains an open question that we have not explored yet.
Handling natural language requires at least an intermediate process to tokenize the words in the text or transform it into a logical set of information, for example by means of relation extraction methods. There is also the question of the accuracy of such methods in retaining all the information from the original text. This is why this process is outside the scope of our research. Our question assumes such a functioning pipeline exists and asks how this information can and should be handled by our system.
Unfortunately, the strength of our framework might also be its greatest weakness. Since the model reasons in first-order logic, it expects predicative statements to be able to draw inferences from them. Transforming natural language text to first-order logic is an ongoing research topic [5,14], but for now, this may well prove to be a limitation of our system. In case this transformation cannot happen, this weakness can be seen as a constraint rather than a liability in the sense that we will be required to formalize unstructured text before ingesting it in our framework.
Another approach would be to derive information from natural language using a grammar template. A problem with this method is that these templates should exist for all grammars and all languages. An example template can be found in the SUMO ontology, another upper-level ontology that is made available for us to use [2,22]. The SUMO ontology is designed especially for research and applications in search, linguistics and reasoning. It is also mapped to the WordNet lexicon and is the largest public ontology in existence today with 13457 terms, 193812 axioms and 6055 rules. The English grammar template is a subgraph of the ontology and consists of nodes and edges that can derive meaning from elements in text by linking them to semantic concepts. This implementation merits further exploration, but our initial observation when trying to integrate the SUMO ontology in our framework is that it makes the model inference very slow due to its large size.
Keeping our set of logical rules unchanged, ingestion of the Proton and BFO ontologies takes a few seconds compared to ingesting the SUMO ontology which takes several hours. This performance degradation presents itself as a limitation of our framework that raises optimization questions.
5 Conclusion and Future Work
In this paper, we propose a new theoretical approach for better machine learning. We draw inspiration from human intelligence and leverage the power of knowledge graphs and logic neural networks to create a hybrid framework capable of reasoning and learning with minimal input knowledge. We show that our system is capable of enriching its knowledge set by associating concept properties with new instances of its known classes. We also prove that the model is capable of extending its knowledge by integrating new domain information in its knowledge base and forming connections between related concepts via logical rule inference. These results, while early, deliver on the promise of adopting a neuro-symbolic approach in artificial intelligence and pave the way for future experiments to address interesting applications such as compiling a reference general ontology or understanding natural language more seamlessly. We hope this paper is a step toward creating autonomous learners truly capable of leveraging humanlike intelligence.
Appendix
This section presents the full log results of the model’s knowledge. The logs summarize the information retained by the system as well as the inferences drawn from the logic rules. They demonstrate how the model defines objects and their connections. Figure 3 shows all class-subclass relations by propagating the subClassOf relationship through the concept class hierarchy. Figure 4 shows all class-instance connections and class-property relationships. Figure 5 displays the model’s inference on Axiom 4. Figure 6 displays the model’s inference on Axioms 1 and 3. Figure 7 displays the model’s inference on Axiom 2. Figure 8 showcases the subPropertyOf relation propagated throughout the model. Figures 9 and 10 show the propertyOf relation between two nodes. Figures 11 and 12 log all Property nodes. Figures 13 and 14 show the subClassOf relation between two nodes. Figure 15 displays the instanceOf relationship between two nodes. Figures 16 and 17 show all existing Class and Instance nodes in the network.
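The kind of inference these logs record can be reproduced in miniature with a naive forward-chaining loop over the four axioms of Sect. 4.1. This is a plain Python sketch, not the Logic Neural Network used in the paper. The `forward_chain` function and the triple encoding are assumptions, and the base facts are a small fragment picked from Tables 1 to 4 plus the paris example.

```python
def forward_chain(facts):
    """Apply Axioms 1-4 until no new fact appears; return the closed fact set.

    Facts are (predicate, first_argument, second_argument) triples.
    """
    facts = set(facts)
    while True:
        derived = set()
        for (p1, a, b) in facts:
            for (p2, c, d) in facts:
                # Axiom 1: isinstanceOf(x, y) and subClassOf(y, z) => isinstanceOf(x, z)
                if p1 == "isinstanceOf" and p2 == "subClassOf" and b == c:
                    derived.add(("isinstanceOf", a, d))
                # Axiom 2: isinstanceOf(x, y) and propertyOf(z, y) => propertyOf(z, x)
                if p1 == "isinstanceOf" and p2 == "propertyOf" and b == d:
                    derived.add(("propertyOf", c, a))
                # Axioms 3 and 4: subPropertyOf(x, y) or inverseOf(x, y),
                # together with propertyOf(y, z) => propertyOf(x, z)
                if p1 in ("subPropertyOf", "inverseOf") and p2 == "propertyOf" and b == c:
                    derived.add(("propertyOf", a, d))
        if derived <= facts:      # fixpoint reached: nothing new was derived
            return facts
        facts |= derived


BASE = {
    ("subClassOf", "Location", "Object"),
    ("subClassOf", "Object", "Entity"),
    ("propertyOf", "Latitude", "Location"),
    ("propertyOf", "Longitude", "Location"),
    ("propertyOf", "Located in", "Entity"),
    ("propertyOf", "Subs. Org. of", "Organization"),
    ("subPropertyOf", "Subregion of", "Located in"),
    ("inverseOf", "Parent Org. of", "Subs. Org. of"),
    ("isinstanceOf", "paris", "Location"),
}

closed = forward_chain(BASE)
for predicate, first, second in sorted(closed - BASE):
    print(f"{predicate}({first}, {second})")
```

The nested loop compares every pair of facts on every pass, which is adequate for a fragment of this size but grows quadratically with the number of facts per pass; a production reasoner would index facts by predicate and argument.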
Fig. 3. Propagation of the subClassOf Relation.
Fig. 4. Propagation of Class Instances and Properties.
Fig. 5. Propagation of Axiom 4.

Table 2. Class-Property hasProperty Relations
Class        Property         Class        Property
Happening    End Time         Happening    Pcpt in
Happening    Ent. Pcptng      Happening    Start Time
Entity       Located in       Entity       Name
Entity       Involved in      Entity       Main Label
Entity       Part of          Entity       Ent. Invld in
Entity       Description      Org.         Business as
Org.         Etbld in         Org.         Nb of Empl
Org.         Parent Org. of   Org.         Etbld Date
Org.         Subs. Org. of    Org.         Rgstrd in
Pdct. Mdl.   Produced by      Location     Latitude
Location     NIMA GNS Des     Location     Pop. Count
Location     NIMA GNS UFI     Location     Longitude
Location     Subregion of     Inf.Res.     has Subject
Inf.Res.     in Language      Inf.Res.     Res. Format
Inf.Res.     Dvd from Src     Inf.Res.     Inf.Res. Cov
Inf.Res.     has Contributor  Inf.Res.     Inf.Res. Rghts
Inf.Res.     Inf.Res. Id      Inf.Res.     has Date
Inf.Res.     Title            Inf.Res.     Res. Type
Service      Operated by      Social Pos.  Soc. Pos. Hldr
Job Pos.     Holder           Job Pos.     Held from
Job Pos.     within Org       Job Pos.     Held to
Group        has Member       Document     Doc. Abstract
Document     Dcmt Subttle     Role         Role Holder
Role         Role in          Agent        Involved in
Agent        is Legal Entity  Agent        Part. Controls
Topic        Subtopic of      Object       is Owned by
Object       Cnt. Info        Statement    Valid from
Statement    Valid until      Statement    Stated by
Person       is Boss of       Person       has Relative
Person       Soc. Pos         Person       Last Name
Person       Given Name       Person       has Pos
Person       First Name
Table 3. Property subPropertyOf Relations
Source          Target          Source           Target
Etbld in        Located in      Doc. Abstct      Desc
Rgstrd in       Located in      has Creator      has Contr
has Parent      has Relative    Held to          End Time
has Old Name    Name            has Siblg        has Reltve
Invld in        Ent. Invld in   Doc. Subttle     Laconic Desc
First Name      Name            Title            Name
has Employee    has Member      Subregion of     Part of
Held from       Start Time      Doc. Author      has Creator
Given Name      Name            Last Name        Name
has Spouse      has Relative    Part. Owns       Part. Controls
has Child       has Relative    Owns             Part. Owns
Subs. Org. of   Part of         Laconic Desc.    Desc
Subregion of    Located in      Pcpt in Happng   Ent. Pcptng
has Leader      has Member      Parent Org. of   Part. Controls
Doing Bsns as   Name            Controls         Part. Controls

Table 4. Property inverseOf Relations
Source             Target
Soc. Pos. Holder   has Soc. Pos
has Parent         has Child
has Soc. Pos.      Soc. Pos. Holder
has Position       Holder
Parent Org. of     Subs. Org. of
Pcpt in Happng     Involved in
Ent. Pcptng        Entity Involved in
Fig. 6. Propagation of Axioms 1 and 3.
Fig. 7. Propagation of Axiom 2.
Fig. 8. Propagation of the subPropertyOf Relation.
Fig. 9. Propagation of the propertyOf Relation.
Fig. 10. Propagation of the propertyOf Relation - Continued.
Fig. 11. Log of All Property Nodes.
Fig. 12. Log of All Property Nodes - Continued.
Fig. 13. Node-Node subClassOf Relations.
Fig. 14. Node-Node subClassOf Relations - Continued.
Fig. 15. Node-Node instanceOf Relations.
Fig. 16. Log of All Class and Instance Nodes.
Fig. 17. Log of All Class and Instance nodes - Continued.
References
1. Aher, G., Arriaga, R.I., Kalai, A.T.: Using large language models to simulate multiple humans (2022)
2. Allen, R.B.: Semantic modeling with sumo (2020)
3. Bouneffouf, D., Aggarwal, C.C.: Survey on applications of neurosymbolic artificial intelligence (2022)
4. Chen, C., et al.: A survey on graph neural networks and graph transformers in computer vision: a task-oriented perspective (2022)
5. Chen, Z., Gao, Q., Moss, L.S.: Neurallog: natural language inference with joint neural and logical reasoning (2021)
6. Elmhadhbi, L., Karray, M.H., Archimède, B.: Toward the use of upper level ontologies for semantically interoperable systems: an emergency management use case. In: 9th Conference on Interoperability for Enterprise Systems and Applications I-ESA2018 Conference, pp. 1–10, Berlin, Germany, March 2018
7. Hur, A., Janjua, N., Ahmed, M.: A survey on state-of-the-art techniques for knowledge graphs construction and challenges ahead (2021)
8. Ji, S., Pan, S., Cambria, E., Marttinen, P., Philip, S.Y.: A survey on knowledge graphs: representation, acquisition, and applications. IEEE Trans. Neural Networks Learn. Syst. 33(2), 494–514 (2022)
9. Luis, C., Lamb, A.G., Gori, M., Prates, M., Avelar, P., Vardi, M.: A survey and perspective, Graph neural networks meet neural-symbolic computing (2020)
10. Li, Y., Zhou, J., Verma, S., Chen, F.: A survey of explainable graph neural networks, Taxonomy and evaluation metrics (2022)
11. Mascardi, V., Cordì, V., Rosso, P.: A comparison of upper ontologies. In: WOA (2007)
12. McShane, M., English, J., Nirenburg, S.: Knowledge engineering in the long game of artificial intelligence: The case of speech acts (2022)
13. Min, B., et al.: Recent advances in natural language processing via large pre-trained language models: a survey (2021)
14. Muresan, S.: Ontology-based semantic interpretation as grammar rule constraints. In: Gelbukh, A. (ed.) CICLing 2010. LNCS, vol. 6008, pp. 137–149. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-12116-6_12
15. Riegel, R., et al.: Logical neural networks (2020)
16. Sejnowski, T.: Large language models and the reverse turing test (2022)
17. Susskind, Z., Arden, B., John, L.K., Stockton, P., John, E.B.: Neuro-symbolic AI: an emerging class of AI workloads and their characterization (2021)
18. Valmeekam, K., Olmo, A., Sreedharan, S., Kambhampati, S.: Large language models still can’t plan (a benchmark for LLMS on planning and reasoning about change) (2022)
19. Wang, W., Yang, Y.: Towards data-and knowledge-driven artificial intelligence: a survey on neuro-symbolic computing (2022)
20. Zonghan, W., Pan, S., Chen, F., Long, G., Zhang, C., Philip, S.Y.: A comprehensive survey on graph neural networks. IEEE Trans. Neural Networks Learn. Syst. 32(1), 4–24 (2021)
21. Yu, D., Yang, B., Liu, D., Wang, H.: A survey on neural-symbolic systems (2021)
22. Álvez, J., Gonzalez-Dios, I., Rigau, G.: Applying the closed world assumption to sumo-based FOL ontologies for effective commonsense reasoning (2018)
Can Mobile Device Use in the Classroom Facilitate Student Engagement in Higher Education
Michael Bass1(B) and Perry Hessenauer2
1 Sheffield Hallam University, Sheffield, UK
[email protected]
2 Nazarbayev University, Astana, Kazakhstan
Abstract. Bring Your Own Device (BYOD) has been heralded by many in academia and industry as a means of integrating new technology into existing processes, thus enhancing the learning experience without the need for significant capital outlay or resources. One potential benefit of BYOD is an improvement in student engagement in the classroom. This research paper seeks to identify whether this is the case and, if so, by which means it should be applied. As education is an increasingly global enterprise, the research is conducted from an international perspective to assess the impact of mobile technology across different countries. What can be inferred from these findings is that, from a student’s perspective, there are surprisingly few differences in why and how students use their mobile devices in the classroom; what they share is that they want to. What is notable, however, is the variation in institutional outlook: those countries with more established HE provision are less receptive to the idea of using mobile technology in their classrooms than their less established counterparts.
Keywords: Learning Environment · BYOD · Higher Education · Personalised Learning · e-Learning
1 Introduction
Mobile devices have experienced exponential growth across the globe, with many countries now having reached a plateau in terms of device ownership; in many developed economies it is common for households to have more mobile devices than people and, in fact, globally more people own a mobile phone than a flushing toilet [1]. As Kraut, Brynin, & Kiesler concluded, the pervasive use of technology has become the norm and society now expects an element of technology to be involved in all aspects of life [2]. This growth, however, has not been replicated in higher education. Mobile devices have been introduced to enhance the learning experience, but they have not truly been embedded into the learning processes officially; rather, they are seen as a later addition. Institutions are too cautious: they are fearful of students using their devices in class for inappropriate purposes and, given the limited empirical evidence, they are merely using IT to enhance existing teaching rather than to transform how teaching is designed and delivered [3].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): SAI 2023, LNNS 711, pp. 831–840, 2023. https://doi.org/10.1007/978-3-031-37717-4_53
This research paper investigates the infrastructure and learning environments of higher education institutions across five different countries: Australia, China, Kazakhstan, UK, and the USA. By analysing the varying degrees of mobile device use, the researchers intend to ascertain the extent to which mobile technology can facilitate student engagement. How institutions in each of these countries use mobile devices to support their students will be compared from both an academic and a student perspective. The study shall look beyond the typical research of technology use in terms of academic achievement and instead assess the impact on the individuals, specifically in terms of how they are able to access education, combine it with family commitments and overcome barriers to learning. Given the increasing pressures on graduates across the world due to increased competition for graduate jobs, the financial strain of tuition fees and changing cultural patterns, more people now take alternative paths into HE. The researchers shall investigate how mobile devices can help support students through their learning journey and overcome these barriers.
2 Comparative Studies
There are countless published papers that have validated the benefits of using technology within the HE sector from an academic’s perspective, although only a select few have taken a student’s perspective. Eun oh and Gwizdka, who researched HE students’ use of tablets in a classroom, are one such exception [4]. Their study of American students found unexpected technology uses that can be explained by the characteristics of the student group, the Net generation, namely, their impatient multi-tasking and opportunistic behaviour. Their research found that students liked using laptops to take notes during class discussions but tablets were seen as inconvenient for note taking due to the difficulties of typing on the screen and limited functionality. Taking advantage of learning opportunities was also a theme within Dorit and Currie’s work; their comparative study showed that there is no one-size-fits-all solution: some students preferred email and Dropbox whereas others preferred messaging apps and social media [5]. True collaborative communities used a range of tools to suit the circumstances and different tools worked for different students. Ultimately, though, the result was the same: greater connectedness, collaboration and more intense relationships that resulted in higher quality outcomes for the Australian students they had researched. Using mobile devices to facilitate student engagement and create a collaborative, technology-driven learning environment has been central to the findings of Romero. According to their research, technology promotes collaboration and facilitates round-the-clock access to materials on Virtual Learning Environments (VLEs) to cater for those who work unusual shift patterns [6].
When working in dispersed locations, as many students now do, particularly given the increase in HE students attending university in conjunction with their employer through newly developed degree apprenticeships, the need for electronic communication is immeasurable. Whilst still quite niche, Virtual Reality (VR) is becoming more mainstream and opens up opportunities for truly virtual learning environments with students creating their own virtual personas to interact with their peers [3]. The fundamental constraint behind so many of these good intentions is either the lack of imagination at course planning level or the lack of any true understanding of the costing and resource implications [3]. A wide range of tools and techniques are required to truly integrate mobile technology into the classroom and facilitate student engagement; however, the systems and infrastructure often cannot fully support this due to insufficient wireless access points, power outlets or network configurations that do not permit easy access to remote drives [7]. Despite these barriers, enforced by an institution’s culture where research is often prioritised over teaching [3], many external organisations are continuing to invest in these areas. Microsoft has been keen to develop the research capabilities at Chinese universities, supporting students with new technology to support their studies and research efforts [8]. The continued growth of China’s economy by 10% per annum from the 1970s through to the early 2000s has been driven in part by the significant growth in technology-led positions and highly skilled workers who have integrated new technological advancements into increasing elements of society and the workplace [9]. Nowhere is this more so than within universities, which have strived to develop integrated curriculums that utilise the most advanced technologies so that their students have the necessary skills to contribute to this economic growth. 
Previous research into Chinese HE institutions found a correlation between the teachers’ positive attitude towards ICT and the success of students in their class. These findings are similar to those made by Vidacek-Hains, who found that university students have demonstrated a direct link between the use of ICT for learning activities and improvements in their learning outcomes [10]. Even after investigating different countries, Australia and Eurasia, both Bower and Ali concluded that the cultural and procedural difficulties of developing quality learning materials using new technology are what hold back many new initiatives [11] and [3].
Having a clear and coherent governance structure would, according to Ali, ensure that technology was properly integrated into the curriculum for the benefit of academics and students alike. Currently HE institutions are costly and ineffective; they have not sufficiently adapted to cater for the massification of HE, simply adding extra tutorial groups or running lectures in ever larger lecture theatres does not provide the same student experience as a cohort half the size [3] and [13]. Technology can drive cost benefits and improve student satisfaction scores if applied correctly [14] and [15]. Where technology has been proactively applied in higher education there are significant cost benefits to be realised. Previous research by the Arizona State University showed how by using digital technology to reduce costs they were able to offer annual fees of $10000 against the average annual fees of US universities of $31000 [12].
3 Methodology To truly investigate the cultural, political and geographical differences of BYOD on student engagement in HE it was necessary to conduct research across as many continents as possible. As such students from Sheffield Hallam University, UK; Nazarbayev University, Kazakhstan; University of Illinois at Chicago, USA; The University of Western Australia, Perth and Wuhan University, China have been involved. Prior to the main research activity, a small-scale study of current undergraduate students from Sheffield Hallam University was undertaken. This enabled the questions to be finalised and any problems or uncertainties resolved before the main research activity was conducted.
M. Bass and P. Hessenauer
Participants at their respective institutions were invited to take part in the study by completing an online questionnaire during one of their taught sessions. The selection criteria were that they must be over the age of 18 and a current student of a HE institution, in either an undergraduate or a post-graduate capacity. Before any data was gathered, all participants were informed that the information would be held securely and that no reference would be made to them individually as part of the study. Participants were free to leave the study at any time. Gender and other personally identifiable information were not requested, as analysis of these factors would have been outside the scope of this research. A total of 174 students chose to be involved and completed the questionnaire.
4 Results and Discussion
Unsurprisingly the proportion of students who use a mobile device as part of their studies (Table 1) is similar to the overall penetration of mobile device usage in their respective countries. What was unexpected was the comparatively low use of mobile devices by Australian students and the higher than anticipated use by students from Kazakhstan.
Table 1. Mobile Device Use by Institution
Institution | Proportion of students who use a mobile device as part of their studies
Sheffield Hallam University, UK | 95%
Nazarbayev University, Kazakhstan | 82%
University of Illinois at Chicago, USA | 97%
The University of Western Australia, Perth | 84%
Wuhan University, China | 100%
The most notable differences can be seen in Fig. 1, which shows a clear variation in the participants’ preferred mobile device depending on their country. When this data is plotted against mobile device use there is a clear correlation (see Fig. 2), particularly so when the preferred device is a smartphone and conversely so when that device is a laptop. The higher the proportion of students using a mobile device as part of their studies, the greater the likelihood of that device being a smartphone or tablet. Regression analysis on these data values gives the coefficients of determination shown in Table 2. An R² of 0.72 is a strong result for data on human behaviour, and when analysing the data in more detail it is apparent that in countries where mobile device ownership is ubiquitous, such as China, the UK or the USA, students’ preference for their device of choice shifts to what could be viewed as more up-to-date technology in the form of a smartphone or tablet. In those economies where mobile technology ownership has yet to reach critical mass, by contrast, the preference for mobile device is much more
Can Mobile Device Use in the Classroom
Fig. 1. Mobile Device Preference
Fig. 2. Mobile Device Correlation
Table 2. Regression Analysis
Preferred Device | R² Value
Laptop | 0.25595
Smartphone | 0.72096
Tablet | 0.68778
conservative and traditional in the form of laptops. Using this information, academics and course planners can plan teaching activities according to their students’ preference for mobile device. These findings concur with those of Eun Oh and Gwizdka, who concluded that 94% of students agreed that using technology during classroom sessions was beneficial and increased interaction between academic staff and students [4]. They also found an interesting positive effect of the use of tablet computers in the classroom: whilst some students using laptops were engaged in non-course-related activities, those using a tablet used it only for course-related activities. They concluded that the difficulty of hiding what you are doing on a tablet screen was the reason why those using a tablet remained focused on the task. Whilst a significant proportion of students do use their mobile devices for their studies, there is still a minority who do not, and only by investigating these in more detail could we find potential solutions that would benefit all. Figure 3 shows that the ownership status of a mobile device is not of concern; rather, the individual preference for using a desktop computer instead of a mobile device was the primary reason. Such a positive outcome is testament to the desire of students to use some form of technology to enhance their studies, and it is merely the limiting factors of these devices that influence their choice. A facility to enable students to access specialist software remotely on their own device, or the option to hire a range of mobile technologies, could encourage students to engage with classroom sessions by making an active contribution.
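The coefficient of determination reported in Table 2 can be illustrated with a minimal sketch of a simple least-squares regression. The usage figures are those of Table 1; the smartphone preference values are hypothetical placeholders, since the per-country values behind Fig. 1 are not tabulated here, and the function name `r_squared` is likewise an assumption made for illustration. The sketch is not the authors' analysis script and is not expected to reproduce the published values.

```python
from statistics import mean


def r_squared(x, y):
    """Coefficient of determination of a least-squares line fitted to (x, y)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual and total sums of squares
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Table 1: percentage of students using a mobile device (UK, KZ, USA, AU, CN)
usage = [95, 82, 97, 84, 100]
# HYPOTHETICAL percentages preferring a smartphone (illustration only)
smartphone_pref = [40, 22, 45, 30, 48]

print(round(r_squared(usage, smartphone_pref), 3))
```

With only five institutions, a single country can move the fitted line considerably, so an R² computed in this way is best read as indicative.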
Fig. 3. Reasons for not using a Mobile Device
The limited variation in the reasons why students do not use a mobile device as part of their studies stands in stark contrast to the reasons why students do use a mobile device for their studies (Fig. 4). For American students the social element of learning is a key priority; having social connectivity and being able to share ideas is vital to their learning experience. The opposite is true for Australian, British and Chinese students, who see completing their work as the primary reason for using mobile
devices in the classroom, a fact some educators may find hard to comprehend given the impediments some associate with using mobile devices in their classrooms [16]. Chinese students noted that aesthetics were an essential aspect: being seen with the latest technology is vital, and the majority of students actually used multiple devices for their learning, namely a laptop to complete their main tasks, a tablet to do secondary research and a smartphone to connect with peers. Despite the cultural differences and variations in prior education, students from the HE institutions researched share a common approach to ICT use as part of their studies: its role is as important as any other teaching method and a key component of their successful learning.
Fig. 4. Mobile Device Use
The desire is to encourage students from all countries, irrespective of their ability or background, to engage actively in classroom activities by using their own mobile devices. The key starting point for this lies within Fig. 5. By knowing what students would actually like to use their mobile devices for in the classroom, academics and course planners can take these wishes into account when designing and delivering their teaching. Overall, students want more interactive teaching activities that use the full capability of their devices, rather than just using them to type Word documents or Google a question. However, when analysing the data in more detail there are geographical disparities. Australian students felt that the network and infrastructure were a limiting factor that did not really support their devices, and American students wanted more access to power outlets so they could charge their devices throughout the day.
Fig. 5. Increased Mobile Usage
5 Conclusion
This research paper sought to ascertain whether mobile device use in the classroom can facilitate student engagement within a HE setting across different international institutions. From the data gathered there is clearly a desire on the part of students to use a range of devices in support of their learning. There are, however, challenges when using mobile devices in the classroom; one drawback often cited is the distraction these devices can facilitate by encouraging users to multitask [16]. The problem with multitasking is an overall decrease in the quality of the individual activities [17]. However, from the students’ own comments, they appreciated being able to multitask and found completing several different activities using multiple technologies to be the most convenient approach. Many problems arise from this contradiction between the empirical evidence and the students’ perception: as educators we seek to base our practice on evidence without always considering how this may be perceived by the students we are attempting to teach. Teacher training could be used to help overcome some of these barriers to technology adoption and to increase awareness of student perceptions, but this forced training is seen by some members of academia as merely a tick-box exercise and is not taken seriously. Despite this, the researchers strongly urge academic policy makers and work planners to encourage their academic delivery staff to make adequate provision for students to bring their own devices into the classroom. This research has clearly demonstrated that there is a desire among students for this to happen, and any student-led activities are likely to have greater success. Students from all five countries commented that using mobile devices as part of the classroom enhanced the learning experience rather than diminishing or interfering with the learning process.
A number were frustrated that, whilst mobile technology had come on in leaps and bounds over the past 10 years, HE had failed to keep pace and there were too many instances of technology being seen as an intrusion, particularly so
in those countries where HE is well established. Those countries whose HE institutions are relatively new appear to have a far more open perspective towards the introduction of mobile technology in the classroom and are embracing the possibilities it can bestow. The limitations of the data set used to form these correlations lie primarily in its focus on a single institution within each of the participating countries. The type and background of students attending HE vary by institution, which could have an impact on the reliability of the findings. However, given that the institutions involved included relatively new universities, Nazarbayev (founded in 2010) and Sheffield Hallam (university status in 1992), alongside more established institutions, Western Australia (founded in 1911) and Illinois (founded in 1859), the similarity of student responses to key questions mitigates any such concerns. Notwithstanding these limitations, the study still provides a basis from which higher education institutions and academics can begin to adapt their teaching resources. They can use this information to factor device preference into their teaching resources and activities. For example, an academic delivering a course in those countries that have a preference for using smartphones should design interactive teaching activities that utilise the gyroscopes and touch screen controls of a modern smartphone. Such activities can range from something as simple as a scavenger hunt using geocaching to more complex augmented reality experiences. Even so, these need not be expensive: Google Cardboard can create a quick and effective Virtual Reality headset, with a plethora of content available on YouTube’s VR channels.
This research has analysed mobile device use and student engagement within a HE setting, but it has done so through a social lens, which has demonstrated a need for further research into the precise methods and means by which technology can be truly integrated with the curriculum without being viewed as an intrusion. Consequently, future research will seek to develop a specific technological intervention that is designed and implemented by the students themselves. Placing the participants whom the intervention is intended to help at the core of the design stage will create an opportunity to compare a different method of delivery and, it is hoped, overcome some of the pre-conceived limitations.
References
1. Hartford, T.: How the humble S-bend made modern toilets possible. BBC News. http://www.bbc.co.uk/news/business-41188465 (2017). Retrieved 16 Oct 2017
2. Stirling, E.: Technology, time and transition in higher education – two different realities of everyday Facebook use in the first year of university in the UK. Learn. Media Technol. 41(1), 100–118 (2016). https://doi.org/10.1080/17439884.2015.1102744
3. Ali, N.: The influence of technology on the academic and social lives of students and lecturers in Kuwaiti higher education (HE). http://hdl.handle.net/10871/31851 (2017). 4 Feb 2019
4. Eun Oh, K., Gwizdka, J.: Impatient opportunists: a study of technology use in a higher education classroom. J. Appl. Res. High. Educ. 3, 81–96 (2011)
5. Maor, D., Currie, J.K.: The use of technology in postgraduate supervision pedagogy in two Australian universities. Int. J. Educ. Technol. High. Educ. 14(1), 1–15 (2017). https://doi.org/10.1186/s41239-017-0046-1
6. Romero Alonso, R., Riquelme Plaza, I., Halal Orfali, C.: Barriers in teacher perception about the use of technology for evaluation in higher education. Dig. Educ. Rev. 35(1), 170–185 (2019). https://doi.org/10.1344/der.2019.35.170-185
7. Iqra University: Realising technology in university education. Asian Journal of Engineering, Sciences & Technology (2015)
8. Olsen, F.: Microsoft to Spend Millions at Chinese Universities, 34. The Chronicle of Higher Education (2002)
9. Bucher, T., Helmond, A.: The affordances of social media platforms. The SAGE Handb. Soc. Media 1(1), 233–253 (2018)
10. Vidacek-Hains, V., Appatova, V., Prats, H., Takemura, K., An, L., Bushaty, J., et al.: Implementation of information and communication technology in higher education: Comparative research in Asian, American and European universities. In: Central European Conference on Information and Intelligent Systems, pp. 149–155 (2010)
11. Bower, M., Torrington, J.: Typology of Free Web-Based Learning Technologies. Macquarie University, Australia (2020). https://search.datacite.org/works/10.13140/rg.2.2.11064.16647
12. Anon: The log-on degree; Technology and universities. The Economist, pp. 29–30 (2015)
13. Kleinke, S., Lin, Y.: Application of adult learning theory to STEM education in online learning environment (2019)
14. Sang, G., Valcke, M., van Braak, J., Tondeur, J.: Student teachers’ thinking processes and ICT integration: predictors of prospective teaching behaviors with educational technology. Comput. Educ. 54(1), 103–112 (2010). https://doi.org/10.1016/j.compedu.2009.07.010
15. Chamorro-Premuzic, T., Frankiewicz, B.: 6 reasons why higher education needs to be disrupted. Harvard Business Review. https://hbr.org/2019/11/6-reasons-why-higher-education-needs-to-be-disrupted. 19 Nov 2019
16. The President and Fellows of Harvard College: Devices in the classroom. Harvard University. https://bokcenter.harvard.edu/technology-and-student-distraction (2019). Retrieved 20 Nov 2022
17. Top Hat Staff: 20 pros & cons of technology in the classroom in 2021. Top Hat. https://tophat.com/blog/technology-in-the-classroom-pros-and-cons/ (2021). Retrieved 20 Nov 2022
An e-Learning Course on Artificial Intelligence in Production – Development of a Target Group-Oriented Continuing Education Format for Technical Innovations
Erik Voigt, Marietta Menner, and Julia Thurner-Irmler
Universität Augsburg, Universitätsstraße 1a, 86159 Augsburg, Germany
{erik.voigt,marietta.menner,julia.thurner}@uni-a.de
Abstract. Numerous training courses impart knowledge about artificial intelligence (AI) – yet companies complain about a lack of skilled workers and further training opportunities. Moreover, many formats convey content but have not been developed specifically for the target group. This paper will show an approach for a target group-oriented design of further education formats for technical innovations, using the example topic of artificial intelligence. This is done utilizing a concrete example in the form of an e-learning course developed according to the design-based research (DBR) approach, which contains both the basics of AI and a focus on production. Initial testing and feedback from the target group show that the course still needs to be improved, especially in terms of intuitive handling and design. However, potential is seen in the increased incorporation of gamification elements.
Keywords: e-Learning · Artificial Intelligence · Gamification
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): SAI 2023, LNNS 711, pp. 841–852, 2023. https://doi.org/10.1007/978-3-031-37717-4_54
1 The Need for Target Group-Oriented AI Competence Training
Owing to the coronavirus pandemic, among other things, digitalization has developed rapidly in many areas of life in the last two years. As a result, technical innovations are finding their way into working life much more quickly, and society must adapt to this “mechanization” of its living environment. The need for people with technical skills is growing in Germany. By 2026, more than 780,000 people with expertise in areas ranging from data analytics and AI to hardware/robotics development will be needed. Universities, in particular, face the task of promoting the education and training of technological competencies even more strongly than before. In this context, attention must be focused not only on graduates; the further training of existing workers and the role of universities as providers of further training are also of great importance [13]. Not least for these reasons, the “AI Production Network” was launched in 2021, which deals with the optimization of production processes through artificial intelligence. The cross-cutting topic of “education and training” was also anchored in this network. In this field, a concept is being developed to prepare skilled workers for using artificial
intelligence in production. There are already various training offers in German-speaking countries on artificial intelligence, for example, “AI Campus”, which bundles online courses, videos, and podcasts in this area [8]. However, there is usually no target group-oriented approach that enables employees without prior knowledge to get started and professionals with prior knowledge to continue their education. Therefore, it would make sense to develop the necessary offers together with the target group to specifically address the need for further training on artificial intelligence in production. Within the network framework, a continuous pilot model was launched in which professionals are trained in the basics of working with artificial intelligence and sensitized with a particular focus on production. The question is: How must further training courses be designed to convey content on the topic of artificial intelligence with a specific focus on production in a target group-oriented way? As a basis for the pilot model, an e-learning course is being developed within the framework of a design-based research (DBR) approach; its concept and contents are the main subject of what follows. Finally, based on the initial evaluation results of the prototype, the resulting and planned steps for both the course and the continuous education program will be discussed. In this paper, we examine the first intervention, that is, the first iteration of the e-learning module. To this end, the module and its contents are briefly explained. Afterward, we give an overview of the DBR approach and its uses, especially in continuing education. Here, we examine why target group-oriented measures must be developed in close contact with the recipients of the content. Additionally, we explore the benefits of gamification elements in teaching technical innovations. Lastly, the results of the first iteration are discussed and reflected upon, and an outlook for further interventions is given.
2 The e-Learning Course
In order to prevent problematic differences in the participants’ prior knowledge of artificial intelligence (in production), an e-learning course is created. The course is called “Basics of AI” and is intended to develop a common knowledge base. Since the AI education program intends to take a holistic view of artificial intelligence in production, a common baseline must exist on which the more advanced offerings can build. To convey the essential basics in a compact and motivating way, we chose the e-learning course format. Here, classic methods are combined with gamification approaches, which are emphasized more strongly in the advanced modules of the program. In addition, the e-learning course design allows one to work through the basics at one’s own pace and in a differentiated manner according to prior knowledge.
2.1 Idea and Goals
The objective of the integrated AI education program is to enable employees who work in the manufacturing industry, now or in the future, to actively shape the change towards an AI-supported production of the future. In doing so, they should be able to initiate change processes themselves in their respective companies and act as self-determined actors in relation to the use of new technologies, which, sooner or later, will have a
lasting impact on their everyday working lives. Therefore, the course aims to provide the contents and skills the target group needs. To achieve this objective, the learning modules are developed and constantly re-evaluated based on feedback from the target group (see Sect. 3.1). The e-learning course is the cornerstone of the continuous education program and is intended to provide a low-threshold introduction and initial encounter with the topic of artificial intelligence. The course aims to build a foundation of knowledge about central concepts, functionalities, and terminology of artificial intelligence – especially in the context of production. This will ensure that the participants can act as active co-creators rather than merely passive recipients of knowledge content in further advanced and in-depth courses in the future. Furthermore, they should already be sensitized to the potential opportunities – but also limitations – of artificial intelligence so that they can derive relevant content and, thus, in-depth courses for themselves and their companies.
2.2 Target Group
The e-learning course “Basics of AI” is primarily aimed at trainees and workers from the manufacturing industry. Still, its content is not so specific to that industry that the secondary target group, people interested in AI who do not come from manufacturing, cannot benefit from participating in the course. The course is intentionally designed so that participation does not require prior knowledge of programming, artificial intelligence, or manufacturing. Depending on the level of prior knowledge, however, the respective target groups can delve more deeply into the content and make the imparted knowledge usable for themselves. Furthermore, the production-related content of the introductory course is so general that participation in the course is also beneficial across all industries.
Thus, the course can be used both on a horizontal level between different sectors and on a vertical axis within an industry or a company, such that workers in production and the management level can benefit equally from the knowledge content.
2.3 Contents
The selection of the contents of the first intervention was based on preliminary talks with workers and decision-makers from small and medium-sized enterprises (SMEs) and scientists working in our research network. A brief definition of the term, in which the topic of artificial intelligence is placed in the current scientific discourse as well as in the context of the course and the manufacturing industry, leads into the introductory content modules and, at the same time, provides an overview of the content covered in the course. Here, the participants should primarily deal with the question of why there is added value in dealing with artificial intelligence at the individual as well as the company level. The course is designed as a self-study course that does not prescribe a specific learning speed or a specific depth or sequence of processing and is, therefore, also barrier-free. Consequently, it is up to the participants to decide at what rate and in what order the modules will be worked through. At the beginning of the course, a recommendation is given to deal with them in the order presented since content dealt with later may well
build on the knowledge content of the primary modules. The flexible design leaves it up to the participants to skip parts of the course based on their prior knowledge or to use them as a mere refresher for already existing knowledge. At the same time, it ensures that a lack of prior knowledge can be compensated for by a slow and in-depth approach to the content. As a result, among other things, no fixed processing time for the course can be determined. In the conception, however, it was assumed that the learner would complete the course without any prior knowledge. For such a person, one hour of processing time should be estimated per module, in addition to the voluntary deepening of the content through further research or exercises. Self-study of the course is mainly supported by interactive and gamified learning elements (see Figs. 1 and 2), which are available in the form of H5P elements or Jupyter notebooks. Also excluded from the time frame are collaboration and exchange with other participants via the forums.
Fig. 1. Interactive Videos are Used to Explain Complex Concepts Clearly and with Visual Support. Here: How does Supervised Learning Work?
Accordingly, the second module deals with the topic of data and AI. Many companies, especially small and medium-sized enterprises, face a significant problem: the generation and handling of data [4, 15]. Good data is essential for the use of many algorithms. For
Fig. 2. Use of Interactive and Gamified Elements to Reinforce the Engagement and Understanding of more Theoretical Aspects. Here, the Participants must Sort Different Situations from the Manufacturing Context into the Categories of Hammond’s Periodic Table of AI (Adapted from Periodensystem-ki.de)
this reason, this module closely examines what good data is and how data preparation for learning algorithms works. Modules 3 and 4 then address the central topic of the course: machine learning. The participants get an overview of the different areas of machine learning so that they can better classify different use cases. Code examples that are easy to use and understand, with intuitive graphical user interfaces (GUIs), keep the extraneous load of working through complex code or workflows to a minimum. The participants can run the code directly in the browser. Additionally, advanced participants are given suggestions for changes they could make to the code. The examples, provided as Jupyter notebooks and p5.js-editor sketches, help to sharpen a realistic picture of the technology. Confronting participants not only with the abstract idea of artificial intelligence but also with a concrete algorithm, which they can change and thus comprehend, helps to demystify the technology and pursues the goal of further lowering initial inhibitions on the part of the participants. In order to take a closer look at non-technical aspects, a separate module is dedicated to AI ethics. This module compiles and provides content on ethical aspects of artificial intelligence in multimodal form, including classic text-based methods and interactive video lessons by renowned professors. Considering the ethical and social aspects is particularly important because many fears and obstacles are of social origin. For example, 50% of German companies expect the use of artificial intelligence to be associated with the loss of many existing jobs [14].
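To give an impression of the kind of short, changeable example that the data-preparation and machine-learning modules describe, the following is a minimal sketch of supervised learning on a production-style task. It is not taken from the course notebooks: the sensor readings, the labels "ok" and "faulty", and all function names are hypothetical and chosen only for illustration. The data is first rescaled (data preparation), then a nearest-centroid classifier is fitted to labeled examples and applied to a new reading.

```python
from math import dist


def make_scaler(rows):
    """Return a function that rescales each feature to the range [0, 1]."""
    lows = [min(col) for col in zip(*rows)]
    highs = [max(col) for col in zip(*rows)]

    def scale(row):
        return tuple(
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        )

    return scale


def fit_centroids(features, labels):
    """Supervised step: average the labeled examples of each class."""
    groups = {}
    for feature, label in zip(features, labels):
        groups.setdefault(label, []).append(feature)
    return {
        label: tuple(sum(col) / len(col) for col in zip(*members))
        for label, members in groups.items()
    }


def predict(centroids, feature):
    """Assign the class whose centroid lies closest to the new example."""
    return min(centroids, key=lambda label: dist(centroids[label], feature))


# HYPOTHETICAL readings: (temperature in degrees C, vibration in mm/s)
readings = [(60, 1.0), (62, 1.2), (58, 0.9), (85, 3.5), (90, 4.0), (88, 3.8)]
labels = ["ok", "ok", "ok", "faulty", "faulty", "faulty"]

scale = make_scaler(readings)
centroids = fit_centroids([scale(r) for r in readings], labels)
print(predict(centroids, scale((61, 1.1))))
```

An advanced participant could, for instance, remove the scaling step and observe how the feature with the larger numeric range then dominates the distance.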
After the more general part of the introductory course has been completed, the production focus of the self-study course is supported by a series of (interactive) video lessons in interview form with scientists active in current research. These provide a bridge between the knowledge content already learned and the production reality. In addition, based on the topics covered in this module, participants can derive relevant questions and content for themselves or their companies.
3 Methodology
In order to set up the individual components of the program in a target group-oriented and consistent manner, the development of the training modules and, thus, also of the e-learning course is accompanied by a design-based research (DBR) approach. In addition, the course is enriched by gamification elements to achieve certain effects. Both points are briefly presented and explained in this chapter.
3.1 Design-Based Research Approach
So far, two main paradigms have been followed in educational research when designing educational media: the hermeneutic approach, which can be described as theoretical development without direct reference to practice in the procedure, and the empirical approach, which takes quantitative and qualitative measurements, for example, of the effects of a finished medium [10]. The latter procedure is also called practice-oriented evaluation and includes a standard evaluation procedure for educational media [6]. However, in this form of evaluation, the target group is not involved in the creation of a new format, so there is a risk of developing a medium that misses the target group’s needs and subsequently must be redesigned at enormous additional expense, without knowing whether the redesigned medium then meets the needs of the users. For this reason, a small-step evaluation model was chosen, namely the DBR approach. The DBR process can help solve two major problems we identified in developing further education measures for SMEs: the limited time spent on further education (1) and the lack of a precise definition of concrete needs (2).
1. Although SMEs are generally interested in offering their employees further qualification measures, they often cannot provide them due to time or economic constraints.
2. Many SMEs lack an in-depth understanding of what AI is and where its potential lies. This often results in companies not correctly defining their needs and expectations of AI.
Thus, further education measures ought to be designed in a way that is tailored to the industries’ and workers’ needs. To avoid the unnecessary cost of developing a large product that is only marginally suited to the target group’s needs, the DBR approach is implemented to adjust incrementally to the needs of the target group while, at the same time, providing high-quality educational offers. The use of the DBR approach is well suited when a practically relevant educational problem prevails for which, for example, a teaching method, an educational concept,
or a technical tool must be developed [11]. Often, sustainable innovations in everyday education are produced as a result. Systematic design, implementation, review, and re-design interact to produce theoretical knowledge about the design process as well as improvements for practice [12]. Euler (2014) proposes a cyclical model that leads from problem specification through experience evaluation, design development, testing, and formative evaluation to possible generalization. Then, after several possible cycles or re-designs, the intervention can be summatively evaluated before a new problem is specified [5]. The continuing education program will be designed according to this approach and is currently in the formative evaluation phase (see Fig. 3). In the process, the e-learning course described here and the other (advanced) offerings will be coordinated with one another and will interact both in terms of content and with regard to the results and findings of the respective tests and evaluations.
Fig. 3. The Current Status of the Formative Process of the AI Pilot Model in the DBR Cycle (Adapted from [5])
3.2 Platform and Didactic Mediation – Implementing Gamification

The e-learning course is offered as a MOOC (Massive Open Online Course) via the learning management system Moodle. On the one hand, this offers the advantage that the course content can be made available to a wide range of people efficiently and effectively, and on the other hand, Moodle allows for a strongly multimodal and multimedia mediation strategy. To convey the knowledge content appealingly, the course makes use of various formats of media mediation. These range from classic text contributions and audio and video formats to collaborative tasks, feedback, and knowledge assurance tools. In particular, using H5P formats, which Moodle allows to be embedded, promotes an
interactive learning environment in which participants are encouraged to engage actively with the knowledge content. To anchor the subject content sustainably, the participants are also encouraged to deal with the topic of machine learning and AI in a practical way, trying out hands-on what they have just learned with external tools such as Jupyter notebooks in Google Colaboratory, Teachable Machine, and the p5.js editor.
Fig. 4. Gamified Mediative Elements are Used to Enrich Learning Contents that could be Perceived as “dry” or Boring otherwise, such as Lists and Definitions
In addition to the pure transfer of knowledge, gamification principles and methods were used – where appropriate – to promote intrinsic motivation for engaging with the topic and to maintain the motivation to learn while working through the course (see Fig. 4).
Gamification describes “the use of game design elements in non-game contexts” [3]. Game elements are often used to evoke the positive experiences and motivational possibilities that games generate and use, thus influencing players’ long-term behavior [7]. Frequently used game elements are points, levels, badges, time (pressure), progress bars, etc. [2]. This can also be illustrated by concrete examples from everyday life, such as fitness watches or apps that use badges to motivate people to exercise more, or loyalty points collected when shopping to increase the probability of a repeat visit [9]. However, it is essential to emphasize that intrinsic motivation is not activated automatically [1]. Especially at work, gaming is not usually associated with productivity and efficiency. However, the fundamental purpose is not playing instead of working – it “is to engage and encourage participation” [2]. Therefore, many educational offers already profitably use gamification to motivate learners to deal in a self-active and self-directed way with learning objects that would otherwise be perceived as boring [1]. In addition, game elements can be used to understand and use specific triggers for certain behaviors – even in the workplace [2]. In order to facilitate the introduction to the complex topic of artificial intelligence in production, to increase the participants’ level of activity, and to motivate them to keep engaging with the content, low-threshold and easy-to-complete game elements were embedded in the e-learning course. They range from quizzes to matching tasks in which points can be earned, both as interactive elements in videos and as stand-alone elements in the course.
4 Evaluation of the First Intervention

Forty participants took part in the first trial of the e-learning course. They received the access data and were able to complete the course independently. This was done in preparation for a face-to-face course in which different AI research directions and practical offers were presented and tested. The e-learning course generated a consistent base of knowledge and an initial basic understanding of the topic. All 40 invited participants (male adults between 25 and 45 years old) accepted the invitation to the course and completed it.

4.1 Results of the Questionnaire

Following the face-to-face event, an evaluation was conducted. This alternated between open assessment questions/comment fields and items that could be rated as “fully disagree”, “disagree”, “partially agree”, “agree”, and “fully agree”. The questionnaire was answered in full by a total of n = 24 respondents. Five items related to the e-learning course, mainly to its use and handling. The item “I get the explanations I need on how to use the online course” was answered in the affirmative by eleven participants, eight agreed in part, and five answered in the negative. The answers were similar for the item “I found the handling of the online course intuitive”, with which only ten persons agreed or fully agreed. A similar picture emerged for the item “I found the online course design intuitive”: four people disagreed or fully disagreed, ten partially agreed, eight agreed, and two fully agreed.
Finally, the item “The online course used in the run-up to the event supports my learning process in a meaningful way” was agreed to by a total of six people. In contrast, twelve only partially agreed, and six did not agree at all with the statement. For the last item, “The preparatory course increases my motivation to deal with the course content on AI and digitalization”, a different picture was obtained: two fully agreed, ten agreed, nine partially agreed, three disagreed, and none fully disagreed. Respondents left three comments in the free-text field for further comments. There, the content of the online course was rated very well but as too extensive for an introduction to the topic. In addition, one respondent felt there was no direct link between the online content and what was offered in the face-to-face course.

4.2 Discussion

The initial testing and survey of the first offering from the AI pilot model show that the e-learning course was rated moderately to well. Above all, the effect of increasing motivation is to be emphasized positively. However, the course needs to be revised regarding handling, design, and scope or integration into other continuing education offerings. Based on the answers given about the e-learning course, it became clear that the handling was not clearly explained for more than half of the participants in the evaluation. This shows that additional design, handling cues, and assistance must be incorporated into the course for the target group to make it more intuitive. The lack of intuitiveness may also influence the answers to whether the online course has supported the learning process. On the one hand, this could indicate that the course’s implementation and design are not yet intuitive enough. On the other hand, however, it may also indicate that the participants are still missing knowledge content. However, it seems that the course could increase the motivation to deal with the topic in more detail.
This is also shown by the fact that all participants completed the course before the face-to-face event. During the attendance day, the participants could go through a practical module designed in a game-based manner. Despite initial skepticism, this was very well received and got much positive feedback. The challenge, however, is to achieve the right degree of gamification because the enrichment of a learning subject with game-like elements and the actual activation of the learners to engage with the subject matter to be taught do not have to go hand in hand [1]. However, this adaptation may also influence the perceived support of the learning process. This effect should be promoted with pre- and post-knowledge tests playfully integrated into the course. Nevertheless, the evaluation questions need to be adapted or specified to determine the consequences accurately. It was agreed that in the future, the evaluation of the course should not only be tied to the face-to-face event. On the one hand, the number of people who filled out the evaluation (24) compared to the number of participants (40) shows that the interest in filling out a questionnaire after a one-day offer is not very high. On the other hand, not every course run is linked to a face-to-face event. Therefore, a significantly extended survey should be inserted directly afterward and integrated into the course or course sections. In this course, the answer options also must be revised. It became clear that between one-third and one-half of the respondents always chose the middle answer option “partly”. It remains to be tested whether this is also the case
when the evaluation is not filled out after an attendance offer. Different scales should be tested with the target group, but this in turn raises the issue of the comparability of the test results across the different cycles. Also, the sample size and the homogeneity of the respondents are points of criticism of the first testing – thus, the results can only provide a first orientation. However, with the DBR approach, it is essential to remember that it is not primarily about large samples but that the goal is to create an intervention that provides a practical educational benefit and a gain in theoretical knowledge [11]. Nevertheless, the goal is to attain a more heterogeneous group for the following survey. Since companies expect high-quality offerings and rarely release their employees to develop continuing education offerings, using a control group that can receive a traditional continuing education approach proved impractical. To circumvent this problem, we draw on regional chambers of commerce and industry and their instructors and classes to independently evaluate the offering one more time before making it available to the industry as a finished product. In this sense, the chambers’ expertise substitutes for a control group. This way, an attempt can be made to minimize bias in evaluating the results.
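The proportions discussed in this section can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original evaluation: it collapses the reported answers into three groups (negative, partial, positive), and the item labels and variable names are hypothetical. The item on handling is left out because only its count of positive answers was reported.

```python
from fractions import Fraction

N_PARTICIPANTS = 40  # people who completed the e-learning course
N_RESPONDENTS = 24   # people who filled out the questionnaire

# (negative, partial, positive) counts, collapsed from the answers reported
# in Sect. 4.1. The labels are hypothetical shorthand for the item wording.
ITEMS = {
    "explanations": (5, 8, 11),
    "design_intuitive": (4, 10, 10),
    "supports_learning": (6, 12, 6),
    "increases_motivation": (3, 9, 12),
}


def summarize(items):
    """Return the share of each answer group per item as exact fractions."""
    rows = {}
    for name, counts in items.items():
        # Every item must account for all respondents.
        assert sum(counts) == N_RESPONDENTS
        negative, partial, positive = (Fraction(c, N_RESPONDENTS) for c in counts)
        rows[name] = {"negative": negative, "partial": partial, "positive": positive}
    return rows


summary = summarize(ITEMS)
response_rate = Fraction(N_RESPONDENTS, N_PARTICIPANTS)

for name, row in summary.items():
    shares = "  ".join(f"{key}={float(value):.1%}" for key, value in row.items())
    print(f"{name:22s} {shares}")
print(f"response rate = {float(response_rate):.0%}")
```

Under these assumptions, the share of the middle option lies between one-third and one-half for each listed item, and 13 of 24 respondents did not clearly affirm the item on explanations, which matches the "more than half" statement above.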
5 Outlook on the Next Steps in the AI Pilot Model

The points of criticism will be taken up and incorporated in the subsequent re-design phase before the next test or evaluation. The first idea is to divide the course into basics and a production focus and to create more subchapters in the basics. Depending on the focus of the target group, the level of knowledge, or a subsequent face-to-face event, specific chapters could thus be unlocked or referred to more easily. In terms of handling and design, it is planned to incorporate more gamification elements. These points will be implemented in the next cycle. In addition, (advanced) offerings will be developed in parallel, tested with the target group, and adapted. Thus, the network’s continuing education offerings will be steadily expanded – with the goal of designing a variety of target group-oriented offerings for the AI experts of tomorrow.
References

1. Beißwenger, M., Meyer, L.: Gamification als Schlüssel zu „trockenen“ Themen? Beobachtungen und Analysen zu einem webbasierten Planspiel zur Förderung orthographischer Kompetenz. In: Beckers, K., Wassermann, M. (eds.) Wissenskommunikation im Web. Sprachwissenschaftliche Perspektiven und Analysen, pp. 203–240. Internationaler Verlag der Wissenschaften, Berlin (2020)
2. Dale, S.: Gamification: making work fun, or making fun of work? Bus. Inf. Rev. 31(2), 82–90 (2014). https://doi.org/10.1177/0266382114538350
3. Deterding, S., Dixon, D., Khaled, R., Nacke, L.: From game design elements to gamefulness: defining “gamification”. In: MindTrek ’11: Proceedings of the 15th International Academic MindTrek Conference: Envisioning Future Media Environments, pp. 9–15 (2011). https://doi.org/10.1145/2181037.2181040
4. Dukino, C., Friedrich, M., Ganz, W., Hämmerle, M., Kötter, F., et al.: Künstliche Intelligenz in der Unternehmenspraxis: Studie zu Auswirkungen auf Dienstleistung und Produktion. Fraunhofer Verlag, Stuttgart (2020)
5. Euler, D.: Design Principles als Kristallisationspunkt für Praxisgestaltung und wissenschaftliche Erkenntnisgewinnung. In: Euler, D., Sloane, P.F.E. (eds.) Design-based Research. Zeitschrift für Berufs- und Wirtschaftspädagogik, pp. 97–112. Stuttgart (2014)
6. Gollwitzer, M., Jäger, R.S.: Evaluation kompakt, 1st edn. Beltz Verlag, Weinheim (2009)
7. Högberg, J., Hamari, J., Wästlund, E.: Gameful Experience Questionnaire (GAMEFULQUEST): an instrument for measuring the perceived gamefulness of system use. User Model. User-Adap. Inter. 29(3), 619–660 (2019). https://doi.org/10.1007/s11257-019-09223-w
8. KI Campus Website. https://ki-campus.org/. Last accessed 10 Oct 2022
9. Korn, O., Schulz, A.S., Hagley, B.J.: Gamification: Grundlagen, Methoden und Anwendungsbeispiele. In: Becker, W., Metz, M. (eds.) Digitale Lernwelten – Serious Games und Gamification: Didaktik, Anwendungen und Erfahrungen in der Beruflichen Bildung, pp. 43–63. Springer Fachmedien Wiesbaden, Wiesbaden (2022). https://doi.org/10.1007/978-3-658-35059-8_4
10. Malmberg, I.: Die Blackbox ausleuchten. Potenziale von Design-Based Research für Phasen der Lehrerinnen- und Lehrerprofessionalisierung. Beiträge zur Lehrerinnen- und Lehrerbildung 38(1), 79–93 (2020). https://doi.org/10.25656/01:21776
11. Reinmann, G.: Design-based research. In: Schemme, D., Novak, H. (eds.) Gestaltungsorientierte Forschung – Basis für soziale Innovationen, pp. 49–62. W. Bertelsmann Verlag GmbH & Co. KG, Bielefeld (2017)
12. Reinmann, G.: Innovation ohne Forschung? Ein Plädoyer für den Design-Based Research-Ansatz in der Lehr-Lernforschung. Unterrichtswissenschaft 33(1), 52–69 (2005). https://www.pedocs.de/volltexte/2013/5787/pdf/UntWiss_2005_1_Reinmann_Innovation_ohne_Forschung.pdf. Last accessed 4 Oct 2022
13. Stifterverband für die Deutsche Wissenschaft e.V.: Tech-Spezialisten gesucht! Bedarf an Personal mit technologischen Kompetenzen wächst (Diskussionspapier 4). https://www.stifterverband.org/download/file/fid/10558 (2021). Last accessed 10 Oct 2022
14. TÜV-Verband: Künstliche Intelligenz in Unternehmen – Chancen nutzen – Risiken begegnen. https://www.tuev-verband.de/?tx_epxelo_file[id]=824697&cHash=897f8c02b9e77813ccb907cec7751333 (2020). Last accessed 13 Oct 2022
15. Wangermann, T.: KI in KMU – Rahmenbedingungen für den Transfer von KI-Anwendungen in kleine und mittlere Unternehmen. Analysen & Argumente (383). Konrad-Adenauer-Stiftung, Berlin. https://www.kas.de/documents/252038/7995358/K%C3%BCnstliche+Intelligenz+in+kleinen+und+mittleren+Unternehmen.pdf/1894a732-8ead-46f7-90b4-72c0e1a6fe2b?version=1.1&t=1580810247109 (2020). Last accessed 14 Oct 2022
Design and Implementation of a Postgraduate Micro-credential in Software Development

David Parsons
The Mind Lab, academyEX, Auckland, New Zealand

Abstract. This article focuses on curriculum issues by describing the design of a micro-credential in software development, intended to provide a postgraduate, stackable, industry-relevant unit of study that provides a unique offering in the education marketplace. The key philosophy of this micro-credential was that it would bridge the often-siloed disciplines of academic study in areas of software development, drawing instead from the more integrated approaches of contemporary school curricula, and enable professionals from across technology-enabled organizations to participate in software development processes more effectively. The article describes the rationale and context for the micro-credential, reports on the outcomes of the consultation process with stakeholders, discusses the evolution of the design through this process, and then presents the final structure and indicative content of the micro-credential. It is hoped that this case study will contribute to the necessary ongoing debate around what role micro-credentials serve in the education ecosystem, and what design considerations are important when creating micro-credentials that can address the needs of both students and other stakeholders.

Keywords: Micro-Credential · Software Development · Curriculum · Education
1 Introduction – Micro-Credentials

In a world in which education providers and national administrations are constantly looking for new ways to meet the needs of both students and wider society, the concept of micro-credentials has been widely embraced in recent years, alongside other potentially disruptive forms of credentialing such as nano degrees, digital and open badges [1, 2]. Micro-credentials provide qualifications that are smaller in quantity than traditional vocational or higher education qualifications but of equal quality, focusing on specific in-demand subject areas. Being small/short courses they can provide rapid upskilling, particularly for mature learners already in the workforce [3]. What constitutes a micro-credential is something that has been evolving and becoming more diverse over time. However, the key elements are identified by the name, in the sense that ‘micro’ implies a small qualification and ‘credential’ implies that unlike, for example, a commercial training course, the qualification is recognized by an accredited qualifications body. Beyond this there is still limited academic research in the field of implementing and sustaining micro-credentials in higher education [2, 4], so work such as the case study reported in this article may provide some relevant contribution to this ongoing debate.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): SAI 2023, LNNS 711, pp. 853–864, 2023. https://doi.org/10.1007/978-3-031-37717-4_55
1.1 Questions Relating to Micro-Credentials

There are many questions raised by micro-credentials in terms of their character in practice and their overall philosophy. Areas of debate include: what levels of education they should address, how many hours of learning they should be, how much they should cost, whether or not students may be able to seek scholarships or loans to support their study, and whether these qualifications could be in some way ‘stackable’ to become components of a larger, and perhaps more traditional, type of qualification [2, 5–7]. There are many reasons why potential students might want to study micro-credentials as opposed to other types of qualification. These include: wanting to increase income, apply skills to practice, and develop professionally using a rigorous framework for learning [5]. However, there is also an expectation that the topic of a micro-credential is one that serves an identified need in the marketplace from a stakeholder (e.g. employer) perspective, rather than just from a student perspective [3]. Reflecting on their role specifically in higher education in information systems subject areas, micro-credentials meet the need for units of learning that are very thematically focused, updated frequently, and provide an easily shareable, informationally transparent digital object [8]. There are also design considerations such that different micro-credentials might emphasize different design elements based on their content. They might be self-directed using online materials (the instructional design and online platform both need consideration), job-embedded, competency-based, and/or research-based, and they should not have a one-size-fits-all approach [5].

1.2 The Context of this Article

This article is a case study based on the design and development of a postgraduate micro-credential in broad aspects of software development.
It addresses several questions, such as how this qualification was developed from a curriculum perspective, that is, how it takes its cue from traditional higher education curricula in subjects related to software development, as well as from extensive moves in school education internationally to bring in topics such as computational thinking across the school curriculum to encourage a greater engagement of students with STEM (Science, Technology, Engineering, and Mathematics) subjects. It addresses some questions around domains of knowledge and how these might be reinterpreted through the lens of a micro-credential to address an unmet need in the marketplace.
2 The Curriculum as it Relates to Software Development

This section addresses the curriculum from two perspectives. First, from traditional higher education and domains of knowledge that have become well-established in that area, then from international school curricula that have in recent years been focusing on bringing in more extensive and pervasive coverage of digital technologies and skills.
2.1 Software Development in the Higher Education Curriculum

For many years now there have been specific disciplines in higher education that divide the concerns of software development into three broad areas: Computer Science, Information Systems, and Software Engineering, which can be seen embodied in the ACM and IEEE curricula [9], and the software engineering body of knowledge [10] (associated with these are curricula for Computer Engineering, Information Technology, and Cybersecurity that are less relevant to software development). Although these were developed originally from a United States perspective, they have nevertheless been adopted widely around the world. The problem with these curricula is that they take very different views of software development. They encourage different areas of research, and certainly from an academic perspective they have little to do with one another, sometimes being taught in entirely different schools in the same university. However, in the world of practice, this division is meaningless, because for software development projects to succeed they need components of all three disciplines to be well-integrated. The problem, then, is that these administrative structures tend to silo students into specific domains of knowledge where they remain largely ignorant of the concerns of other domains, even though they are all contributing to the same overall area of human creativity and productivity. Students taking classes from across these different domains tend to be the exception rather than the rule. The practice of software development is evolving all the time, and professional practice and project management are becoming increasingly important, with an emphasis on the essential soft skills [11].
2.2 Software Development in the School Curriculum

At a school level there is a very different view that can be identified from international work in recent years on curricula, intended to ensure that students in schools are given suitable exposure to technology subjects, including aspects of software. There is an intent that this knowledge is not confined to siloed subject areas but, at least theoretically, is applicable across the curriculum for all learners – what has, in some cases, been called the ‘entitlement curriculum’, whereby every student is entitled to be exposed to areas of learning focused on digital technologies [12]. Of course, these curricula vary from country to country. For example, in the New Zealand, English, and Kenyan curricula there is a clear focus on developing digital fluency from an early age, while within the Australian and U.S. curricula there is a strong focus on designing creative solutions. There are also specific local conditions impacting on particular countries such as China, which has been adapting to a rapid spread of technology in society, and Kenya, which faces problems in catching up with the infrastructures and digital tools of the developed world [13]. Despite these variations in school curricula, they provide a much more unified concept of what software might be like than the rigid domains of the universities, even though, as soon as a curriculum needs to be broken down into topic areas, some of the same divisions seem to appear. In New Zealand, for example, the digital technologies curriculum divides into two areas: Computational Thinking for Digital Technologies and Designing and Developing Digital Outcomes. Nevertheless, students will be at least theoretically covering both areas in parallel so they can be seen to be complementary rather than conflicting.
3 Understanding Software Development Processes and Tools – A Micro-Credential Case Study

This section introduces a case study, based on the design of a micro-credential, that is at the heart of this article. The context in which the micro-credential was developed was a small private graduate school in New Zealand that focuses on upskilling adult professionals in education, innovation, and leadership. With an existing portfolio of postgraduate courses that included three master’s degrees, the institution began delivering micro-credentials in early 2020, initially at undergraduate degree level but more recently changing strategy so that all new micro-credentials would be at postgraduate level. Each of these micro-credentials is relatively large at 15 credits, which equates to 150 hours of directed and self-directed learning. One of the motivations for this approach was to enable the stacking of these micro-credentials within larger qualifications so they could perform the dual role of being both standalone credentials and components of longer-term learning journeys. It was evident from feedback from students on other programs that there was a demand for a micro-credential that could increase their understanding of how software is typically developed, often in support of their own innovation projects in master’s programs. Against this background, the Understanding Software Development Processes and Tools micro-credential was developed for first delivery in 2022. The motivation for this micro-credential was to provide something that was not readily available in the marketplace, that is, a qualification for people who felt they needed to better understand how software was developed, even if they were themselves not directly involved in software development.
This understanding needed to be broad and integrated rather than detailed and over-technical, in a sense to combine the key thinking from information systems, computer science, and software engineering into a realistic, user-focused understanding of the key concepts, processes, tools, and techniques that are used in contemporary software development. From the models used in school curricula it was intended to provide integrated skills across a broad range of technology areas to support professional roles across an organization. While being aware of the criticism that micro-credentials may be no more than ‘gig credentials for the gig economy’ [14], we believe that a micro-credential of this scale and level provides opportunities for meaningful career enrichment and social benefit through improved software development interactions.

3.1 The Rationale for the Micro-Credential

To go out for consultation with potential stakeholders it was necessary to draft an initial proposal and justification for the micro-credential. The rationale for the credential was to create a learning experience that could support professionals, organizations, and communities from a range of practices to understand the fundamental principles and technology areas of software development practice and develop the skills necessary to apply this understanding within their own context. Specifically, the micro-credential aimed to provide learners with the knowledge, skills and appropriate frameworks to engage effectively with software professionals and participate in software development
processes or other aspects of Information and Communication Technology (ICT) integration, while taking into account individual and socio-cultural factors that can impact on accessibility and equity. The context of developing the credential was the belief that in the contemporary environment there is not a single discipline, profession, or field of study where an understanding of how software works and is developed is not relevant. The content was therefore designed for anybody who wanted to understand how software development works and how the process of development is managed. It was aimed at a broad community of professionals (or budding professionals) who interact with software development in some way, whether, for example, as a system end-user, as a manager, in a procurement role, or engaged as clients of software developers. In terms of its potential audience, it was not primarily intended for those who already know how to develop software, though some of those people might find it relevant for understanding broader areas outside their professional focus. Neither was it intended to train people to become a particular type of software developer, though it might nevertheless inspire learners to subsequently pursue other opportunities to develop deeper skills in one of the areas introduced. Instead, we wanted those who completed the course to leave with a clearer understanding of the what?, why? and how? of the terminologies, technologies, tools, techniques, roles, social contexts and professional interactions of contemporary software development. As indicated earlier in this article, there is an expectation that micro-credentials can address skills development in areas for which there is evident demand from industry.
The New Zealand Digital Skills Forum, which aims to ensure New Zealand has the digital skills it needs to grow, produced a report in 2018 – Digital Skills for a Digital Nation – which emphasized the increasingly broad requirement for advanced skills at the professional level. “Traditionally, these were the skills required by ICT professionals, however as digital technologies become more pervasive throughout businesses, it is no longer just IT professionals that need these advanced digital skills” [15]. This is exactly the target market for this micro-credential – professionals who are not directly ICT professionals but need advanced digital skills in other roles. This is an area not well catered for by current provision in tertiary education. While skills development was an essential part of this micro-credential, given that it was aimed at a postgraduate audience, its main focus was on the kind of critical thinking and creative engagement required of professionals in the software industry. The micro-credential therefore focuses on human aspects as much as technology, enabling learners to explore not only how to use technology but, more importantly, how to work with others in complex problem-solving. The micro-credential was designed to provide an opportunity, through challenge-based learning, to develop a deeper understanding of how software is developed, and to enable more valuable and constructive interactions between software developers and their customers/users, taking account of contextual factors such as culture and equity. Examples of individuals who we thought might choose to engage with the program included:

• People who are end-users of software systems and want to give informed feedback to software developers or vendors
• People who work directly with software developers such as analysts, product owners, and project managers
• People who want to integrate software more effectively into their businesses or organizations
• People who are interested in developing a greater understanding of software development to develop new career paths or expand their current roles
• People who already work in some aspect of software development but are seeking a broader perspective
• People who are interested in developing their own applications

Building on the rationale, the initial proposal sent out for consultation was for a twelve-week micro-credential entitled Software Development Tools and Processes. As the title implies, bringing tools to the center meant that there was an expectation that students undertaking the micro-credential would be engaging hands-on with software development tools. Another expectation was that all assessment would be done in teams, since most meaningful software development is also developed in teams, and that team-related skills, particularly those contextualized in the same way as a software development team, would be very valuable for the students. Another expectation was that there would be multiple example systems that the students would engage with and modify as part of the assessment process.

3.2 The Consultation Process

The consultation process revealed the difficulties of addressing such a broad program of study, and our interaction with a range of different stakeholders led us to understand that it would be difficult to address the concerns of all of these within the scope of the proposed micro-credential.
Some of the key issues raised by stakeholders included: 1) that software development is evolving so quickly that it would not be possible to provide a micro-credential that was suitably up-to-date at any point in time, 2) that software development is moving towards a ‘no code’ and ‘low code’ environment, which would render many of the more traditional ideas in the proposal redundant, as end-users would be creating their own applications, and 3) that it would be difficult in a micro-credential, given its limited scale, to address all of the necessary areas of software development to provide a suitably comprehensive understanding of the topic area. Notwithstanding these challenges, the stakeholder consultation also suggested that there was great value in following through with designing this micro-credential. Consultation with the local university suggested that this approach would indeed help to address the siloed faculty problem of university subject areas, and that students from these areas of study, regardless of whether they were computer scientists, information systems students, or software engineers, could benefit from this broader perspective to give a better understanding of these other domains and how all of them related to the same overall mission. Those stakeholders who worked in some way with software development as, for example, end users or customers, felt that this credential would provide great value for them in helping their communication with the software professionals with whom they worked, helping them understand the jargon, the concepts, the processes, and the overall lifecycle.
As a counter to the concern that the micro-credential could not keep up with cutting-edge software technology, another stakeholder asserted that it should not attempt to address this aspect of software development. Rather, it should focus on how current organizations work, making it relevant for the broadest part of industry, while other micro-credentials could deal with areas of study such as disruptive technologies. This course should focus on human aspects as well as technology – how to use technology and work with others in this context. Stakeholders agreed that being able to communicate the right requirements and user stories can enable those in other roles to get the most out of their relationships with development teams. They also noted that many people in big organizations who are not already technically focused can find themselves put into leadership roles within technology domains, for example being required to manage a project where they need to work with others such as solution architects and agile coaches without any special training being provided. It was felt that this was an area where the course could be of benefit. A consistent element of feedback across the stakeholder body was that twelve weeks was simply not long enough to address the range of topics that were being proposed, and that it was not reasonable to expect that everyone who was interested in how software was developed would necessarily want or need to engage in hands-on use of digital tools such as server-side and client-side programming languages and database management code. Many stakeholders felt that the emphasis on tools was inappropriate and that it was the processes that should be brought to the fore. However, there was some focus on technology that did need to be included in the course, and it was also suggested that students should get some experience of version control systems.
3.3 Adapting the Proposal
As a result of the consultation process, many changes were made to the overall approach of the micro-credential. The title was refocused somewhat to Software Development Processes and Tools, a small change but an important acknowledgment that processes should be front and center, supported by tools, rather than vice versa. The assessments were rethought to provide much more flexibility, so that students could focus on areas of the software development life cycle that they felt were most useful for them and they would not necessarily have to engage with coding tools. Nevertheless, that option would also be made available to them. The proposed calendar was extended by three weeks (to make it 15 weeks long), specifically to enable the students to complete a team project over those last weeks for the final assessment, to ensure that there was suitable time for teams to work together and produce something meaningful. It was also decided to scale down the idea about having many different examples, but rather to provide just one example that could be clearly explained and understood. All students would therefore be able to communicate about a common platform that they were all able to engage with and use as the basis for their own work, regardless of whether they focused on analysis, design, implementation, testing, and/or deployment. When the revised proposal was sent to the national qualifications authority for approval, among a few other suggestions they indicated that the micro-credential should be renamed Understanding Software Development Processes and Tools because indeed
that was the whole point of the micro-credential – to help people to understand those processes and tools. Another requirement was that, while the team assessment was acceptable, there also had to be an element of individual reflection on what had been learnt, and what those learning outcomes were for those individuals. After incorporating feedback, the final learning outcomes for the micro-credential were defined as:
Upon completion of the micro-credential, learners will be able to:
1. Analyze the business, technology, data and socio-cultural dimensions of software systems and processes.
2. Critically evaluate the role of software development processes and tools in addressing the changing requirements of organizations.
3. Develop and apply knowledge of software development processes and tools to relevant areas of professional or individual practice.
4 The Structure of the Micro-Credential
Table 1 lays out the overall structure of the micro-credential as it was for the first delivery in 2022. As can be seen from this table, there are four key themes in the design that provide complementary areas of learning: software development fundamentals, software development processes, working with software development processes and tools, and team projects.
Table 1. Understanding Software Development Processes and Tools Content Plan (Themes and Topics)
Week | Theme | Topics
1 | Software development fundamentals | Introduction to Software Development
2 | Software development fundamentals | Software System Fundamentals
3 | Software development fundamentals | Coding and Data
4 | Software development processes | Approaches to Software Development
5 | Software development processes | Analysis and Design
6 | Software development processes | People in Software Development
7 | Software development processes | Requirements, Testing and Security
8 | Assessment 1 | Team Software Process Workshops
9 | Working with software development processes and tools | System Setup and Deployment
10 | Working with software development processes and tools | Working with Web Resources
11 | Working with software development processes and tools | Client-Side Coding
12 | Working with software development processes and tools | Server-Side Coding
13 | Team projects | Project week 1
14 | Team projects | Project week 2
15 | Assessment 2 | Team Project Presentations
In software development fundamentals, learners gain a broad understanding of what programming languages are for, what kinds of problems they are designed to address, and how coding is structured. With these fundamentals in place, the next area of learning focuses on software development processes, enabling learners to engage with activities that give them insights into different aspects of the software design and development process and to be able to work with technical staff to express requirements. This level of work is sometimes known as the software development macro process, addressing the overall software development lifecycle. At the end of these two themes, the first assessment allows learners to demonstrate their learning by running workshops for other students that each address specific aspects of the development process. The next phase of the micro-credential involves elements of the software development micro-process, taking account of activities such as coding, refactoring, testing, running, debugging, packaging, and deploying software. This part of the micro-credential provides students with opportunities to explore different aspects of the software development process. They have access to a small, structurally complete but functionally incomplete web-based application that they can modify and extend within a version-controlled environment to provide insights into how software is implemented and integrated. In response to stakeholder feedback, the modifications and extensions that students undertake can be confined to, for example, requirements or design aspects to cater to students who do not have an interest in engaging directly with coding. Similarly, students may elect to focus entirely on technical changes to the system that might have no impact on its functionality (e.g., refactoring the design). Other teams might choose to take a whole lifecycle view to analyze, design, implement, test, and deploy a new vertical slice of functionality for the system.
4.1 Assessment
The first assessment is a software process workshop, in which each team runs a workshop for other students that demonstrates an aspect of contemporary software development processes, covering both technical and human issues. This assessment is based on the “process miniature” concept, where even large software processes can be simulated in very small workshop activities. This assessment is preceded by a negotiation stage to ensure that each team focuses on a different aspect of software development for their workshop, maximizing learning for all teams. The second assessment is a team software development project (and associated presentation). The final three weeks of the micro-credential allow students to develop and present their group projects. This assessment, where students work as a team with an existing codebase (to serve as a starting point), is contextualized from a customer perspective, with teams defining new requirements through a suitable process and then providing a high-level design to address them. Groups will have the option of focusing on any specific aspect of the process (analysis, design and/or implementation), depending on their own priorities and skills.
5 Initial Review
The first delivery of the micro-credential was completed in December 2022. Overall, the aims and objectives of the course were met, but there were several insights gained that will be taken on board for future delivery. First, it became clear that more structured scaffolding was required in the initial stages to ensure that fundamental questions were addressed, such as “What is software development?”, “What is the software development lifecycle?” and “What is a database table?” The course design had made too many assumptions about the likely pre-existing knowledge of those enrolled. The introduction of computational thinking concepts was also too early. It became clear that a thorough grounding in the process aspects of the course had to be completed before moving onto the tools, even at a conceptual level. The delivery mode, which had been conceptualized as self-paced with synchronous discussions, also had to be changed, as the students requested a more didactic, content-driven approach and preferred knowledge check quizzes to some of the more exploratory activities that had originally been designed. Nevertheless, the outcomes from the student work were highly satisfactory. The group workshop sessions, delivered by student groups to each other, provided excellent learning experiences of topics including version control, cultural differences in software design, algorithmic impact assessment, design thinking, privacy and security by design, and requirements specification and prioritization. Although few groups chose to reengineer parts of the example system for their assessments, those that did (for example, one group rewrote the presentation layer using React) showed the value of offering this option within the micro-credential.
The overall review of the first delivery suggested that most of the content and design was effective, but that additional content was needed to cater for a broader range of student experience, and a more scaffolded sequencing of topics was needed that made fewer assumptions about existing knowledge.
6 Summary and Conclusions
This article has outlined the rationale for, and the design of, a postgraduate micro-credential called Understanding Software Development Processes and Tools. It provided the background motivation for the development of this micro-credential and positioned it within the broader international micro-credential movement, the nature of software development education in universities, and the increasingly technology-aware developments in school curricula. It reported on the initial proposals for this micro-credential, the feedback and consultation process, and the changes that were made because of this consultation. It then gave an overview of the final design of the micro-credential, including its themes, topics, and assessment strategy. Within the context of micro-credential design there are several lessons that might be learnt from the experience of this particular case study. It may be helpful to return to some of the questions asked at the beginning of this article around what a micro-credential might be like in terms of, for example, size, level, stackability, and justification from the stakeholder perspective. It is also worth considering how it might be categorized, given that not all such qualifications are attempting to be of the same type. This micro-credential incorporates several important design principles. Firstly, it is research based but also, in terms of its assessment strategy, it is competency-based. It is
also, in its final form, specifically not a one-size-fits-all design. However, it is not intended to be job embedded. Perhaps its most important characteristic is that it is stackable, and this is the main reason why it is larger and at a higher level than many micro-credentials, because it needs to equate to a postgraduate paper in a more traditional qualification to enable its stackability to be seamless. In terms of its relationship to curricula, its intention is to avoid the traditional silos of software development disciplines within higher education and to bridge concepts from information systems, software engineering, and computer science within an end-user professional context. It therefore draws philosophically more from recent attempts to make the school curriculum focus on digital fluencies, and the curricular integration of digital technologies into applied practice. At the time of writing, the first intake of the micro-credential has recently been completed, and lessons from this experience are being applied to future deliveries. It is hoped that in future work we can provide insights from student feedback over a series of iterations of this micro-credential that will lead to further improvements made in light of experience. From this we hope to create a more refined model of how such a micro-credential can meet the needs of its stakeholders. Based on the experience of delivering the micro-credential, a two-week sampler course and a six-week non-accredited short course have already been developed from it to provide for different audiences and their requirements.
References
1. Selvaratnam, R.M., Sankey, M.: An integrative literature review of the implementation of micro-credentials in higher education: implications for practice in Australasia. J. Teach. Learn. Graduate Employability 12(1), 1–17 (2021). https://doi.org/10.21153/jtlge2021vol12no1art942
2. Lemoine, P.A., Richardson, M.D.: Micro-credentials, nano degrees, and digital badges: new credentials for global higher education. Int. J. Technol. Educ. Market. 5(1), 36–49 (2015). https://doi.org/10.4018/ijtem.2015010104
3. Oliver, B.: Making micro-credentials work for learners, employers and providers. https://www.voced.edu.au/content/ngv:83922. Accessed 15 Oct 2022
4. Hunt, T., Carter, R., Zhang, L., Yang, S.: Micro-credentials: the potential of personalized professional development. Dev. Learn. Organ. 34(2), 33–35 (2019). https://doi.org/10.1108/DLO-09-2019-0215
5. Acree, L.: Seven Lessons Learned From Implementing Micro-credentials. https://www-data.fi.ncsu.edu/wp-content/uploads/2016/02/28152144/microcredentials.pdf. Accessed 15 Oct 2022
6. Hall-Ellis, S.D.: Stackable micro-credentials – a framework for the future. The Bottom Line 29(4), 233–236 (2016). https://doi.org/10.1108/BL-02-2016-0006
7. Commonwealth of Learning: Designing & Implementing Micro-Credentials: A Guide for Practitioners (2019). https://oasis.col.org/colserver/api/core/bitstreams/770ff842-9a5e-424b-a253-0757fa539086/content. Accessed 15 Oct 2022
8. Rubleske, J., Cata, T.: University micro-credentials and the need for agile IS skill development programs. Inform. Syst. 3, 9 (2017)
9. Impagliazzo, J., Pears, A.N.: The CC2020 project – computing curricula guidelines for the 2020s. In: 2018 IEEE Global Engineering Education Conference (EDUCON), pp. 2021–2024 (2018). https://doi.org/10.1109/EDUCON.2018.8363484
10. Bourque, P., Fairley, R.E.: IEEE Computer Society, SWEBOK v.3.0: guide to the software engineering body of knowledge. IEEE (2014)
11. Garousi, V., Giray, G., Tuzun, E.: Understanding the knowledge gaps of software engineers: an empirical analysis based on SWEBOK. ACM Trans. Comput. Educ. 20(1), 1–33 (2019). https://doi.org/10.1145/3360497
12. Unwin, A., Yandell, J.: Rethinking Education: Whose knowledge is it anyway? New Internationalist (2016)
13. Parsons, D., MacCallum, K., Schofield, L., Johnstone, A., Coulter, S.-K.: Next-generation digital curricula for future teaching and learning. In: Yu, S., Ally, M., Tsinakos, A. (eds.) Emerging Technologies and Pedagogies in the Curriculum. BHMFEI, pp. 3–19. Springer, Singapore (2020). https://doi.org/10.1007/978-981-15-0618-5_1
14. Wheelahan, L., Moodie, G.: Gig qualifications for the gig economy: micro-credentials and the ‘hungry mile.’ High. Educ. 83(6), 1279–1295 (2021). https://doi.org/10.1007/s10734-021-00742-3
15. Digital Skills Forum: Digital Skills for a Digital Nation. NZTech. https://nztech.org.nz/reports/digital-skills-for-a-digital-nation/. Accessed 15 Oct 2022
Social Media in Support of Higher Education Teaching and Learning: A Systematic Literature Review
Lily Schoeman and Sunet Eybers
University of Pretoria, Private Bag X20, Hatfield, Pretoria 0028, South Africa
Abstract. Social media, one of many Web 2.0 technologies, is known by most people as a leisure platform used to interact on a social level. However, social media can also be used for educational purposes, for example as a tool to support Higher Education teaching and learning. The Fourth Industrial Revolution has created an ever-changing environment forcing society to adapt or die, which means that failing to adopt technologies such as social media might lead to exclusion. Therefore, Higher Education Institutions should consider adopting social media, not only to stay relevant in their teaching and learning delivery, but also to adapt their approaches to the level of students. The objective of the systematic review is to address the following question: How can social media support Higher Education teaching and learning? A high-level scan of academically published literature and subsequent thematic analysis found that social media can support Higher Education in ten different areas, which can be grouped into two main categories: enhancing the student experience and facilitating university functions. The research is the first step towards an in-depth investigation into the potential benefits of social media as a teaching and learning platform.
Keywords: Social Media · Web 2.0 · Higher Education · Teaching and Learning
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): SAI 2023, LNNS 711, pp. 865–872, 2023. https://doi.org/10.1007/978-3-031-37717-4_56
1 Introduction
The Fourth Industrial Revolution has brought about new technologies, such as Web 2.0 [1], which is characterized by its ability to support and encourage collaboration between users [2]. Social media is one of the components of Web 2.0 as it allows users to communicate and collaborate on the same platform. Before Web 2.0, the internet was a one-way communication mechanism [3] with no opportunity for online collaboration. But with Web 2.0 technologies, any user, irrespective of geographical location, can gain access to the internet given a smart device and internet connectivity [4]. As a result, society has been revolutionized in how we work, live, socialize and subsequently engage in educational environments as part of Higher Education Institutes. Web 2.0, specifically social media, provides Higher Education Institutes with new ways to interact with students as well as new ways for students to interact with one another [5]. This interactivity has disrupted traditional teaching methods used by Higher
Education Institutions [4]. In ancient times, knowledge was shared by word of mouth followed by group gatherings. Eventually, the knowledge was written down for future use [4]. More recently, classrooms, which referred to building structures located at educational institutions, evolved from presenting class content on blackboards to overhead electronic projectors and finally smartboards. Web 2.0 technologies allow for the presentation of online classes and as a result have blurred the boundaries of physical content delivery [6]. There is therefore a need for Higher Education Institutions to adapt their approaches to cater for these new virtual educational environments or cyber-physical environments. Previous academically published literature focusing on social media and Higher Educational Institutions voices different opinions on how social media can support Higher Education learning and teaching. [7] believes that if an institution does not adopt social media as a teaching and learning tool, the institution will slowly become obsolete because its offering will not be of the same academic standard as that of institutions that do adopt social media. On the other hand, [8] warns of the possible risks introduced by social media adoption, among which security breaches are highlighted. As a result, a systematic literature review will help to summarize these different opinions and provide a comprehensive overview of the common thoughts and/or reasons for how social media can be used as a supportive tool in teaching and learning. The main research question for this review is: How can social media support teaching and learning in Higher Education? The remainder of the paper starts with a brief overview of social media in the context of Higher Educational Institutions, followed by the research approach. The next section discusses the results and findings of the thematic analysis, after which the study is concluded.
2 Social Media and Higher Educational Institutions
Social media is defined as a platform that enables users to create and exchange user-generated content [9], allowing users to network and communicate irrespective of geographic location. Social media can be grouped into different types, depending on the main objective of the platform, for example collaborative platforms such as wikis and social bookmarking applications; blogs where authors capture and share their thoughts; content communities where media content is shared, for example video sharing using YouTube; social networking sites where private information is shared amongst personal connections; and virtual game and social worlds where an online persona is created to live in a virtual environment [10]. In a Higher Educational Institution, social media platforms allow students to actively engage with peers and lecturers with the objective of exchanging knowledge or completing team tasks. Instructors, in turn, can make online content available to students. The ease of use and the relatively low requirements of a smartphone and data connectivity make it an accessible tool for financially constrained students. Despite the obvious benefits of social media platforms in Higher Educational Institutions, both students and instructors are faced with challenges. The role of the student has changed from passive listener to active participant [11]. On the other hand, the social media environment and the ability to interactively deliver online content are relatively new concepts to instructors, for which traditional pedagogical approaches might
not be suitable. As a result, instructors need to re-think and re-develop classroom-presented material for it to be suitable for online presentation. Also, traditional classroom activities might not be feasible in an online environment. Finally, customization and personalization of learning content are often expected in an online environment [11].
3 Research Approach
A Systematic Literature Review (SLR) approach was followed to investigate how social media can be used to support teaching and learning in Higher Education Institutions. The SLR followed guidelines as prescribed by [12]. A Systematic Literature Review is a secondary study in which academic evidence, in the form of published primary studies, is identified, assessed, analyzed and synthesized with the objective of answering a research question. This method seemed suitable as it prescribed structured, logical steps whilst still allowing for control over the research objectives and the development of critical appraisal skills [12]. A ‘Preferred Reporting Items for Systematic Review and Meta-Analyses’ (PRISMA) flow chart was used to graphically present the literature review search process [12]. The review process followed four main steps, namely:
Step 1: Identification of articles using a combination of search terms, including social media, Web 2.0, Higher Education and tertiary education. The combinations of search terms included:
(“social media” OR “web 2.0”) AND (“enhance” OR “support”) AND (“Higher Education” OR “university” OR “tertiary education”)
(“social media” OR “technologies”) AND (“Higher Education” OR “university” OR “tertiary education”) AND (“advantages”) AND (“disadvantages”) AND (“peer-reviewed”)
These search terms were used on platforms such as Google Scholar, SpringerLink, Research Gate, Sage Journals and Elsevier and resulted in 1290 records.
Step 2: Screening of articles for suitability. In this step, duplicates were removed, resulting in 514 records. Furthermore, only articles published in English were considered, which resulted in a final article list of 133.
Step 3: The pool of 133 articles was evaluated for eligibility, based on the following criteria: peer-reviewed and published between 2004 and 2022 (Web 2.0 was only introduced in 2004); grey literature was not considered.
Step 4: A total of 20 articles were identified after the eligibility evaluation. The PRISMA flowchart is graphically displayed in Fig. 1.
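The record counts reported in Steps 1 to 4 can be tallied stage by stage to see how many records each screening step excluded. The following is a minimal Python sketch and not part of the original study: the stage labels and the helper name `funnel` are illustrative assumptions, and only the four counts (1290, 514, 133 and 20) are taken from the text.

```python
# Record counts as reported in Steps 1-4; the stage labels are paraphrased
# for illustration and are not taken from the PRISMA chart itself.
stages = [
    ("Records identified through database searches", 1290),
    ("Records after duplicate removal", 514),
    ("Records after English-language filter", 133),
    ("Articles after eligibility evaluation", 20),
]


def funnel(stages):
    """Return (label, remaining, excluded_at_this_stage) for each stage."""
    rows = []
    previous = None
    for label, remaining in stages:
        # Nothing is excluded at the first stage; afterwards the exclusion
        # count is the drop relative to the preceding stage.
        excluded = 0 if previous is None else previous - remaining
        rows.append((label, remaining, excluded))
        previous = remaining
    return rows


rows = funnel(stages)
for label, remaining, excluded in rows:
    print(f"{label:<46} remaining={remaining:>5} excluded={excluded:>5}")
```

Under these counts, most records leave the pool at duplicate removal, and roughly 1.6% of the records initially identified reach the final article pool.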
3.1 Data Analysis
The final research article pool of 20 articles formed the foundation of the thematic analysis approach adopted with the objective of identifying common themes in literature. Thematic analysis is defined as a qualitative research method that involves identifying, analyzing, and interpreting patterns within the search results [13]. A six-phase approach, as prescribed by [14], was followed:
Fig. 1. PRISMA Flow Chart (based on [12])
Phase 1: Familiarize yourself with the data. The article pool was small enough to read all publications. During this process, the researcher highlighted important quotes and wrote down statements that captured the essence or main points of the publication.
Phase 2: Generate initial codes. During this phase, shorthand keywords were generated to summarize the article [14]. Based on the keywords, codes were created whilst keeping the main research question in mind [13]. Ten different codes were generated in this phase, namely: Adaptability, Student Learning, Transformation requires adaptability, Enhance Student Learning, Communication, Access to Content, Impact on Student Learning, Enhance Student Engagement, Exposure to technologies, and Interaction between staff and students.
Phase 3: Search for themes. Themes were created by identifying patterns based on the codes that were generated [14]. The potential themes for this review were generated by writing down all the codes generated in phase 2 and then analyzing them to find common thoughts or opinions with the objective of identifying patterns. Common codes were then grouped together and named using a collective term, namely the facilitation of the learning process by educational institutions (labelled Facilitation, which constitutes 30% of the codes) and themes related to student experience (labelled Student Experience, which constitutes 70% of the codes).
Phase 4: Review potential themes. This phase involves quality-checking the potential themes that started to take form in phase 2 against the codes and the whole dataset to ensure that they were accurate [14]. With a large dataset, there is a possibility that some data could be incorrectly coded. Therefore, this process checks that the coded data is correct in order to produce credible themes.
During this phase, two main questions were considered namely: “Does this theme help answer the research question?”; and “Do the codes, that are grouped in this theme, all provide evidence as to why this theme answers
the research question?”. After reviewing the potential themes, the researcher confirmed that the themes were indeed extensive and correct.
Phase 5: Define and name themes. When defining and naming a theme, it is important that the theme clearly states its uniqueness and captures the essence of all the codes categorized as part of the theme [14]. The theme should focus on one entity whilst remaining unique and related to the overall research question [13]. As a result, the names for the themes identified remained Facilitation and Student Experience.
Phase 6: Produce the report. The discussion of results and findings section contains the outcome of the thematic analysis.
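The grouping of codes into the two themes, and the 30% and 70% shares reported in Phase 3, can be reproduced with a small tally. The following is a minimal Python sketch under stated assumptions, not the authors' procedure: the three codes mapped to Facilitation follow the headings of Section 4.1, while mapping the remaining seven codes to Student Experience is an assumption inferred from the reported split. The dictionary name `theme_of` is illustrative.

```python
from collections import Counter

# Code-to-theme mapping. The Facilitation entries follow Section 4.1; the
# Student Experience entries are ASSUMED from the reported 30%/70% split.
theme_of = {
    "Adaptability": "Facilitation",
    "Transformation requires adaptability": "Facilitation",
    "Communication": "Facilitation",
    "Student Learning": "Student Experience",
    "Enhance Student Learning": "Student Experience",
    "Access to Content": "Student Experience",
    "Impact on Student Learning": "Student Experience",
    "Enhance Student Engagement": "Student Experience",
    "Exposure to technologies": "Student Experience",
    "Interaction between staff and students": "Student Experience",
}

# Count codes per theme and express each count as a share of all codes.
counts = Counter(theme_of.values())
shares = {theme: 100 * n / len(theme_of) for theme, n in counts.items()}

for theme, share in shares.items():
    print(f"{theme}: {counts[theme]} codes ({share:.0f}%)")
```

With ten codes in total, three codes under Facilitation and seven under Student Experience yield the shares stated in the text.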
4 Discussion of Results and Findings
During the thematic analysis, ten main themes were identified, which could be grouped into two main categories, namely Facilitation and enhancing Student Experience.
4.1 Facilitation
Facilitation, for the purpose of this discussion, refers to the support provided to higher education teaching and learning processes, for example administration and communication.
Adaptability: Technological advancements have become entangled in our society, forcing organizations to either “adapt or die”. The Fourth Industrial Revolution has furthermore accelerated the need to stay connected. During the recent COVID-19 pandemic, Higher Education Institutions that had a good social media system in place could continue with educational activities using this platform, whilst institutions without these platforms had to halt activities [7]. The study [5] suggests that having a social media support system in place will help the Higher Education Institution to be more adaptable, no matter what the change or disruption.
Transformation Requires Adaptability: Higher Education Institutions need to provide qualifications that follow good quality teaching and learning practices so that students want to attend their institution. The institution takes pride in the quality of its qualifications and accreditations. Social media can help the institution utilize transformational teaching to keep up with current teaching methods and stay relevant [15]. Transformational teaching is a teaching method that focuses on making an impact on the life of students [15]. Social media can furthermore help educators stay up to date with current teaching methods and new research or discoveries in their field. As a result, the institution will be more attractive to prospective students because of the perceived increase in the quality of education.
Communication: Just as social media can facilitate communication between students it can equally facilitate communication between the institution and its students and staff [16]. An Institution must have the means to quickly communicate information to its students and staff members to ensure the smooth operation of the institution. Social media can facilitate this through broadcast messages or emails.
L. Schoeman and S. Eybers
4.2 Student Experience
Student Experience, for the purpose of the paper, refers to students’ perceived interaction with the institution. Interactions occur on various levels, namely academic, social, emotional, cultural, growth opportunities, sporting, and artistic [6].
Enhance Student Learning: Students who are currently studying generally have social-media-integrated lives, meaning they are on one or more social media sites, for example WhatsApp or Facebook [17]. In the study conducted by [17], the findings suggested that the majority of Higher Education students are comfortable using social media and perceive it as a useful teaching tool. As a result, a tool like YouTube can be used to present course content and subsequently expand classroom walls. This allows for interaction with knowledgeable experts in the field [18]. Through social media, students will also be able to communicate, interact and collaborate with students from other institutions, enriching their knowledge [19].
Access to Content: Social media can be used to publish course content, irrespective of students’ circumstances [3]. Internet connectivity and smart devices allow students to access course content anywhere, anytime and subsequently enhance the student’s experience. Even during pandemic times, students could continue with their studies. Social media also allows people in different stages of their lives to study. Traditionally, tertiary education starts soon after completing school [20]. However, social media enables distance learning, as no physical presence is required, allowing students with full-time responsibilities to study [2]. This enhances the Student Experience because a qualification can be obtained, leading to better career opportunities, without sacrificing full-time responsibilities.
Enhance Student Engagement: [21] conducted an experiment creating a course WhatsApp group in one semester, whilst the second semester had no WhatsApp group.
The results showed an increase in Student Engagement in the second semester. The author suggested that behavioral characteristics like shyness or peer pressure can cause a student not to engage in class, whilst these played no role in WhatsApp groups. Student engagement is an integral part of learning, as higher levels of engagement lead to better comprehension of content.
Exposure to Technologies: Social media can support and prepare a student for the world of work through exposure to workplace-related technologies [22]. For students to be successful in their workplace, they have to be able to work with whatever technology the company is working with [23]. The more exposure to different technologies during their studies, the more prepared they will be for a working environment [24]. Although it is nearly impossible for students to be exposed to every single technology, a higher exposure rate to various technologies will increase confidence.
Impact on Student Learning: Unfortunately, the utilization of social media in Higher Education Institutions is not without disadvantages [8]. Social media can potentially cause security threats to Higher Education Institutions and can affect the physical and mental well-being of students [8]. Unintentional student behavior, such as downloading software or clicking malicious links, can open opportunities for security attacks through social media platforms. Student well-being should also be considered when using social media
Social Media in Support of Higher Education Teaching and Learning
due to the addictive nature and harmful effects (such as deteriorating eyesight) of prolonged electronic device use [8]. Concerns have also been raised about the distraction that other social media platforms can cause when students engage with course content on social media [25].
5 Conclusion
The objective of the study was to investigate how social media can be used to support Higher Education teaching and learning. A structured literature review, followed by a thematic analysis of the final literature article pool, revealed two main themes, namely Facilitation and Student Experience. At a high level, social media can support teaching and learning in Higher Education Institutions by enhancing the Student Experience and by providing support processes. The Student Experience can specifically be enhanced by exposing students to, and engaging them with, course content both in and beyond classroom walls, and by exposing them to technologies that they could possibly use in industry. The findings of the study are similar to those of previously published academic papers, which identified the enhancement of Student Experience when using social media in Higher Education Institutions [2, 18, 19] and the ability of social media to facilitate teaching and learning [1, 7, 16]. The results of the study provide a good starting point for further in-depth investigation into other possible factors that can enhance teaching and learning in Higher Education Institutions using social media and other possible platforms.
References
1. Morrar, R., Arman, H.: The fourth industrial revolution (industry 4.0): a social innovation perspective. Technol. Innov. Manag. Rev. 7(11), 12–20 (2017). https://doi.org/10.22215/timreview/1117
2. Williams, M.L.: The adoption of Web 2.0 technologies in academic libraries: a comparative exploration. J. Librariansh. Inf. Sci. 52, 137–149 (2020). https://doi.org/10.1177/0961000618788725
3. Miranda, P., Isaias, P., Costa, C., Pifano, S.: WEB 2.0 technologies supporting students and scholars in higher education. In: Ozok, A.A., Zaphiris, P. (eds.) OCSC 2013. LNCS, vol. 8029, pp. 191–200. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39371-6_22
4. Livingstone, K.A.: The impact of Web 2.0 in Education and its potential for language learning and teaching (2015)
5. Vandeyar, T.: The academic turn: social media in higher education. Educ. Inf. Technol. 25(6), 5617–5635 (2020). https://doi.org/10.1007/s10639-020-10240-1
6. Gilbert, J., Morton, S., Rowley, J.: e-Learning: the student experience. Br. J. Educ. Technol. 38(4), 560–573 (2007)
7. Chaka, C.: Higher education institutions and the use of online instruction and online tools and resources during the COVID-19 outbreak-An online review of selected US and SA’s universities (2020)
8. Raut, V., Patil, P.: Use of social media in education: positive and negative impact on the students. Int. J. Recent Innovation Trends Comput. Commun. 4, 281–285 (2016)
9. Kaplan, A.M.: Social media, Definition, and History (2018)
10. Kaplan, A., Haenlein, M.: Users of the world, unite! the challenges and opportunities of social media. Bus. Horiz. 53, 59–68 (2010). https://doi.org/10.1016/j.bushor.2009.09.003
11. Holotescu, C., Grosseck, G.: An empirical analysis of the educational effects of social media in universities and colleges. In: Conference proceedings eLearning and Software for Education (2012). https://doi.org/10.5682/2066-026X-12-027
12. Boland, A., Cherry, M.G., Dickson, R.: Doing a Systematic Review: A Student’s Guide. SAGE publications (2017)
13. Joffe, H.: Thematic analysis (2012)
14. Braun, V., Clarke, V.: Thematic analysis. In: Cooper, H., Camic, P.M., Long, D.L., Panter, A.T., Rindskopf, D., Sher, K.J. (eds.) APA handbook of research methods in psychology, vol 2: Research designs: Quantitative, qualitative, neuropsychological, and biological, pp. 57–71. American Psychological Association, Washington (2012). https://doi.org/10.1037/13620-004
15. Rasiah, R.R.: Transformative higher education teaching and learning: using social media in a team-based learning environment. Procedia – Soc. Behav. Sci. 123, 369–379 (2014). https://doi.org/10.1016/j.sbspro.2014.01.1435
16. Ansari, J.A.N., Khan, N.A.: Exploring the role of social media in collaborative learning the new domain of learning. Smart Learn. Env. 7, 9 (2020). https://doi.org/10.1186/s40561-020-00118-7
17. Dunn, L.: Teaching in higher education: can social media enhance the learning experience? (2013)
18. Mourlam, D.: Social media and education: perceptions and need for support. i-manager’s J. Sch. Educ. Technol. 9(3), 23–28 (2014)
19. Gülbahar, Y., Rapp, C., Kilis, S., Sitnikova, A.: Enriching higher education with social media: development and evaluation of a social media toolkit. Int. Rev. Res. Open Distributed Learn. 18 (2017)
20. Kumar, V., Nanda, P.: Social media as a tool in higher education: a pedagogical perspective. In: Tomei, L.A., Carbonara, D.D. (eds.) Handbook of Research on Diverse Teaching Strategies for the Technology-Rich Classroom, pp. 239–253. IGI Global (2020). https://doi.org/10.4018/978-1-7998-0238-9.ch016
21. Lottering, R.A.: Using social media to enhance student engagement and quality. South Afr. J. High. Educ. 35, 109–121 (2020). https://doi.org/10.20853/34-5-4271
22. Alalwan, N.: Actual use of social media for engagement to enhance students’ learning. Educ. Inf. Technol. 27(7), 9767–9789 (2022). https://doi.org/10.1007/s10639-022-11014-7
23. Gammon, M.A., White, J.: (Social) media literacy: challenges and opportunities for higher education. In: Wankel, C. (ed.) Educating Educators with Social Media, pp. 329–345. Emerald Group Publishing Limited (2011). https://doi.org/10.1108/S2044-9968(2011)0000001019
24. Anderson, T.: Challenges and opportunities for use of social media in Higher education. J. Learn. Dev. 6, 6–19 (2019). https://doi.org/10.56059/jl4d.v6i1.327
25. Van Den Beemt, A., Thurlings, M., Willems, M.: Towards an understanding of social media use in the classroom: a literature review. Technol. Pedagogy Educ. 29(1), 35–55 (2020)
Designing an Interactive Learning Suite for Children: Results from a Usability Study with a Multidisciplinary Research Team
Arash Soleimani
Department of Design Computation, School of Architecture, Woodbury University, Burbank, CA 91504, USA
[email protected]
Abstract. This research project is the culmination of multiple years of research aimed at creating modular, interactive learning environments that establish the basis of spatial and computational thinking in young learners. Throughout each stage, multidisciplinary teams of college faculty, undergraduate, and graduate students collaborated to create each design iteration and improve upon the existing prototypes. This research paper describes a year of research and testing of the final prototype, aptly named U-Design. During this timeframe, the research team created an in-depth literature review that explored education and teaching frameworks that relate to cyber-physical learning and knowledge application within STEAM fields. Team members used these principles to support the design and manufacturing process of an effective tool that can explain complex topics to children while teaching methods of creating systematic, efficient solutions to real-world problems. The paper reports on usability studies and heuristic evaluations of the prototype at key intervals throughout the process to ensure design goals were met.
Keywords: Design · Usability · Prototyping · Childhood Education · Creative Learning
1 Introduction
1.1 Motivations Behind the Project
In an increasingly digital society, children are constantly engaged with computers, often more than adults, and can figure out digital interfaces on devices like laptops and tablets more quickly than their parents. Correspondingly, as industries incorporate new technologies into the workforce, effective problem-solving and system design knowledge have become highly sought-after skills. U-Design seeks to answer the call for innovative educational approaches that give young children a better, foundational understanding of these skills in realistic scenarios. Developing a tool that is more complex than a smartphone or tablet yet simple enough for a child to play with shows promise in meeting new educational necessities. U-Design is the realization of an approach that creates an optimized learning tool which enhances children’s utilization of computational thinking and spatial reasoning.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): SAI 2023, LNNS 711, pp. 873–885, 2023. https://doi.org/10.1007/978-3-031-37717-4_57
1.2 U-Design Inspired by CyberPLAYce
U-Design builds upon a continuous research project entitled CyberPLAYce (Fig. 1). CyberPLAYce consists of a modular, interactive tool for engaging children in playful learning that develops computational thinking and spatial reasoning skills which are foundational competencies within STEAM learning. The CyberPLAYce suite combines form and function, allowing children to build physical configurations from modular panels that have interchangeable, programmable inserts [16].
Fig. 1. Previous Prototype (CyberPLAYce) which was Tested with Children in a School Setting. Photograph by Author
The construction suite is designed to support teaching important problem-solving and system design skills by scaffolding comprehension through relevant educational techniques. Through playful learning, children can construct and visualize real-world problems in an engaging, interactive environment. Children will be able to follow instructor-led teaching scenarios and create their own stories within various STEAM subjects. One classic precedent for this type of tool can be found in the House of Cards construction set, created in 1953 by famed American designers Charles and Ray Eames. House of Cards is an oversized deck of playing cards featuring imaginative patterns and pictures on their surfaces that children join to give form to their thoughts through spatial constructions. Adding computation to this recipe, CyberPLAYce puts the emphasis on the process of design rather than only considering the final product. Applied to the problem-solving and planning skills of early learners, this means that the design context can situate children in the mode of thinking like designers, problem solvers, and planners. By focusing on the design context, children employ engineering thought patterns in physical, spatial “diagrams” that they can understand and manipulate. The earliest design attempts were heavy and less configurable, but successive iterations consistently improved the design implementation. Multiple prototypes have been developed to address the needs of multidisciplinary educational fields while also engaging children with a fun interactive tool. The current prototype, named U-Design, is the fourth iteration in the sequence designed and evaluated by college faculty and undergraduate students. The research team consisted of four faculty members with expertise in Architecture, Computer Science, and Education.
There were also six undergraduate students with three from the Architecture Department and three from the Mechatronics and Robotics Engineering Department. The undergraduate students held the position of core investigator and implemented project goals with support from lead professors. Below is an example scenario that envisions how U-Design operates in the classroom setting:
Children are given a story to break down into segments and match them to the U-Design panels and modules. When children break down a story or problem posed by the teacher into smaller, more manageable segments, they better understand, interpret, and construct knowledge (i.e., they think computationally). In this example, Jane’s story is broken down with each story segment defined by concepts and actions. Children match the story segments to electronic modules (Fig. 1), and they collectively construct a tangible story on a table through the physical panels and modules. The tangible story allows for real-time manipulations, spatial rearrangements, and instant feedback. Previous iterations of CyberPLAYce included icon and action cards (color images within Fig. 2) to help children construct pattern sequences and map out story ideas. The updated U-Design models use acrylic faces on the prototypes, with Expo markers for users to write or draw their own “cards” with pictures and descriptions.
Fig. 2. Break-Down of Jane’s Story and Matching with Sensor Modules Representing Different Story Segments
2 Background
2.1 The Role of Computational Thinking in STEAM Education
Recently there has been increased attention to nurturing computational thinking (CT) capacities across curricula by organizations such as the Computer Science Teachers Association (CSTA) and the International Society for Technology in Education (ISTE) [3, 5, 11]. “Computational thinking refers to a set of thinking skills, processes, and approaches to solving complex problems by drawing on concepts from computer science” [18]. CT is relevant not only to Computer Science but also to a range of other subjects. One example is process management, where a company’s workflow is evaluated and optimized. The process is an evaluation of how well a company is running, and it often includes monitoring efficiency and productivity, seeking out areas where improvement can be made, and creating a plan for implementing these changes. These tasks involve carrying out a series of sub-tasks and then making alterations to ensure maximum efficiency. This could be represented as a flowchart or an algorithm-generating program that solves for the minimum cycle time while providing adequate workers. Though this is a specific example, the importance of CT in education and careers can be easily observed in more general cases. CT is viewed as a methodology that can “magnify problem-solving skills needed to address authentic, real-world issues” in order to assist today’s students in meeting “workforce demands of the future” [3]. Although these methods have historically been used primarily in higher education to solve difficult problems, there is a growing consensus that it is best to develop these skills as soon as possible, especially in early childhood. The continued relevance of CT is further demonstrated by the 2015 announcement of the White House’s Summit on Computer Science, Cybersecurity, and Innovation for our future.
This summit recognized one of the major challenges to future STEAM success is a lack of qualified teachers in these subjects [13]. In this, former President Obama stated that “there are only about 50,000 Computer Science degrees handed out each year” compared to approximately 10 million students graduating from high school per year [3], indicating one of several disconnects between teacher demand and student opportunity. In order to make CT more accessible across curricula, the International Society for Technology in Education (ISTE) has stated: “We must provide students with explicit instruction on computational thinking skills beginning in elementary school and continuing throughout middle and high school” [5]. Studies indicate that current teaching and assessment methods for CT are not highly effective. The current teaching evaluation of CT within Computer Science would benefit from a formative evaluation that tests the ability to transfer knowledge by developing a process in an engaging way [15]. Implementing CT practices can be complicated and difficult for even college students to understand, so how can young children be expected to comprehend CT without supporting new teaching methods that solve current learning gaps? Innovative approaches must be developed and optimized to teach these complex skills to young children and scaffold their understanding and utilization of these principles in future courses. U-Design bridges the gap between the growing interest in teaching computational thinking to young learners and the shortcomings of current traditional approaches to childhood education by implementing spatial construction in a tangible form while children tell/construct a story or solve a problem.
2.2 The Role of Spatial Thinking in STEAM Education
Spatial reasoning skills, also called spatial thinking (ST) skills, are pivotal for people to navigate the world around them and construct a meaningful understanding of new skills. “Spatial skills, the ability to encode, remember, and mentally manipulate the spatial features and relations of objects or space, are central to our daily functioning” [2]. Along with its prevalence in daily life, ST is closely correlated to the educational realm and STEAM activities. “A connection between spatial skills and STEAM achievement has been demonstrated even in very young children” [17]. As children play with toys and in games within small groups, they associate learned knowledge with physical models that serve to strengthen their comprehension. The educational theory of Constructionism [14] provides a more thorough description of this as an approach to learning by creating a mental model comprising a complete understanding of the facets within the body of knowledge [4]. Lave explained it as, “what you know, you should be able to take apart, tinker with, explain, and put together again in the same way or another way” [10]. Lave went so far as to propose that learning should be differentiated from knowledge acquisition. He defined learning as the acquisition of skills leading to the application of gained knowledge. Constructionism is both a theory and a practical approach to education with spatial thinking at its heart. It states that “knowledge is not simply transferred from teacher to student, but actively built in the mind of children, and children don’t get the ideas, they make ideas” [8]. Moreover, children are likely to create new understanding and perform creative problem-solving when they are actively learning and interacting with spatial artifacts.
Just as those in higher education and business use diagrams and models to represent complex ideas, physical models help children integrate related knowledge in an easily understandable manner. Many education specialists and scientists noted that to know is to relate and to know better, or rather to obtain a more profound understanding is equivalent to ‘learning-in-relation’ [7]. ST in the form of cyber-physical objects helps create a relational understanding between knowledge fields, especially in the realms of CT and STEAM. U-Design provides hand-sized magnetic modules and triangular panels for children to construct knowledge, solve problems, or tell stories in a physical environment, where they have access to manipulate parts simultaneously and rearrange their tangible algorithms as needed.
3 Research Questions
Throughout the research, design, and usability study of U-Design, the following questions directed the study:
1) What materials and technologies can be used to design a cyber-physical learning suite that adapts and reshapes to offer different configurations?
2) To what extent is this modular, multi-sensor design suite usable?
4 Design Process
The design process was built upon the effective attributes of past iterations while correcting problem areas and optimizing for potential growth. Project development consisted of a four-phase approach:
Phase 1 – Research Activities (July 1 – September 1, 2021).
Phase 2 – Design Activities (September 1 – December 31, 2021).
Phase 3 – Prototype Deployment (December 31, 2021 – April 1, 2022).
Phase 4 – Final Delivery, Usability Study, and Publication Writing (April 1 – June 30, 2022).
During Phase 1, the design process for U-Design started with constructing a framework containing parameters and observation tools to precisely examine child education and the areas in which it can improve. The development of the project’s methodology began with an extensive review of educational theories, specifically looking at how spatial reasoning relates to computational thinking and whether any pedagogical design practices could be applied to the prototype design. The research team conducted studies of existing cyber-physical technologies and how they could be integrated into educational spaces. At the same time, attention was given to understanding early learners’ needs and how learners engage with interactive learning technologies and environments. After two months of in-depth literature review, the research team transitioned to the design activities of ideation, sketch storming, digital prototyping, and low-fidelity model fabrication during Phase 2. The research team analyzed past prototype iterations A-1, A-2, and B-1 (Fig. 3), along with their corresponding empirical studies, to inform the design of Prototype B-2 (the current prototype, i.e., U-Design) and to formulate a list of design priorities and potential improvements.
Fig. 3. Progressive Design Iterations over the Past Seven Years (Prototypes A-1, A-2 & B-1) and the Current Prototype (Prototype B-2: U-Design). Photograph by Author
To take on design priorities more effectively, the team created new prototypes and user implementations during Phase 2 of development. Sketches were also used to develop new ideas before creating digital models. Afterward, students used Computer-Aided Design (CAD) models to visualize and communicate ideas between members in group discussions and evaluations. Overall, the undergraduate research assistants were given three weeks to formulate the shape of, and connections between, the cyber-physical panels and modules. Crucially, giving the undergraduate students design control of the project not only supported their development but also allowed them to bring fresh perspectives with their competence as knowledgeable performers and designers [1]. Through in-depth discussion and model simulations, two design alternatives (Models A and B) were chosen as having the most potential for the final prototype iteration. Students were then split into two teams (each team consisting of one Architecture and one Mechatronics student) to collaborate on low-fidelity constructions for each of the two designs. The goal was to lay out the key functional and structural aspects needed for the finalization of high-fidelity prototypes in Phase 3. Both completed designs can be seen below, with one consisting of interconnecting slots and the other having interlocking teeth covered in Velcro (Figs. 4 and 5).
Fig. 4. Design A CAD Model (LEFT), Design A Physical Prototype (RIGHT). Photograph by Author
Fig. 5. Design B CAD Model (LEFT), Design B Physical Prototype (RIGHT). Photograph by Author
Lastly, the research team performed usability studies with a decision matrix detailed below (Tables 1 and 2). Evaluations were approached by examining key usability characteristics that increase child engagement and interest. Following group discussions, it was decided that Design B showed the most promise by maximizing the freedom of the user’s creative expression and providing stable panel connections that let children build larger structures. This design formed the basis for high-fidelity fabrication in Phase 3. During Phase 3, the entire research team focused on the development of a single refined version of the U-Design prototype with high-grade materials (Fig. 6). Module functionality was also addressed extensively during Phase 3. Each module was designed by focusing on a key component that would facilitate children’s learning and playful interaction with the U-Design panels. A range of small, portable microcontrollers (e.g., Arduinos), sensors, and actuators were used in the modules for different input/output activities. One example of this is the reading of a temperature sensor plugged into an output module or changing the color of display lights by inserting an LED module into one of the panels. A total of twelve high-fidelity panels were constructed with eight unique sensor modules. A few of the finished prototypes can be seen below with examples of the modules.
Fig. 6. High-Fidelity U-Design Final Prototype. Photograph by Author
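The input/output behaviour of the modules described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the class name Panel, the handler functions, and the module identifiers "temperature" and "led" are hypothetical and chosen for illustration; the paper does not describe the actual firmware running on the microcontrollers.

```python
from typing import Callable, Dict, Optional, Union


# Hypothetical module behaviours. The names and output strings are
# illustrative assumptions, not the U-Design firmware.
def read_temperature(celsius: float) -> str:
    """Simulate an input module reporting a temperature reading."""
    return f"temperature: {celsius:.1f} C"


def set_led_color(color: str) -> str:
    """Simulate an output module changing the display light color."""
    return f"LED set to {color}"


MODULE_HANDLERS: Dict[str, Callable[..., str]] = {
    "temperature": read_temperature,
    "led": set_led_color,
}


class Panel:
    """A simulated panel with a single module slot (assumed model)."""

    def __init__(self) -> None:
        self.module: Optional[str] = None

    def insert(self, module: str) -> None:
        # Only known module types can be plugged into the slot.
        if module not in MODULE_HANDLERS:
            raise ValueError(f"unknown module: {module}")
        self.module = module

    def activate(self, value: Union[float, str]) -> str:
        # Dispatch to whichever module currently occupies the slot.
        if self.module is None:
            raise RuntimeError("no module inserted")
        return MODULE_HANDLERS[self.module](value)


panel = Panel()
panel.insert("led")
print(panel.activate("blue"))
```

Swapping the inserted module changes what the same panel does, which mirrors the interchangeable, programmable inserts described for the physical suite.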
5 Results
To reiterate, the focus of the U-Design prototyping is summarized in the following research questions: What materials and technologies can be used to design a cyber-physical learning suite that adapts and reshapes to offer different configurations? To what extent is this modular, multi-sensor design suite usable? It is often better to evaluate the effectiveness of complex projects throughout the creative process instead of at the end because of the large number of constituent design parameters. Furthermore, it may be easier to address small issues within the sub-components before these are integrated into a final system. This is a foundational principle in hierarchical design. To that end, the research questions were evaluated in a staggered approach throughout prototype construction, at the end of Phase 2, and during Phase 4. While constructing the panels and modules, the research members held regular informal and formal discussions in which peers reviewed each other’s progress. Multiple design parameters, such as user feedback options and module implementations, were altered during these discussions to enhance overall usability. At the end of Phase 2, low-fidelity prototypes were completed and evaluated by more rigorous testing standards. Each team member gave individual scoring for the two design alternatives (Designs A and B – Figs. 4 and 5) on a 1 to 10 scale (10 being a perfect score). These scores were then organized into a weighted decision matrix shown below (Tables 1 and 2). Decision matrices are commonly used tools in the Engineering field to make project decisions with team members. Design evaluation standards were given different weights depending on their relative importance. It is important to note that this study only showed the most usable structure and overall design clarity during this stage. The in-depth evaluation of specific sub-components was carried out during Phase 4.

Table 1. Usability Result Sheet for Prototype A, 1 (Worst) to 10 (Best) Scale.
Team Members | Efficacy (Ease of user understanding) | Usability (Ease of use) | Design Clarity (Did we meet design goals?) | Technicality (How complicated?)
Participant 1 | 8 | 7 | 9 | 7
Participant 2 | 6 | 7 | 8 | 7
Participant 3 | 8 | 6 | 9 | 6
Participant 4 | 7 | 6 | 8 | 8
Participant 5 | 8 | 6 | 8 | 6
TOTAL = Importance Factor*SUM | 0.3*37 = 11.1 | 0.5*32 = 16 | 0.3*42 = 12.6 | 0.2*34 = 6.8
SUM TOTAL: 46.5

Table 2. Usability Result Sheet for Prototype B, 1 (Worst) to 10 (Best) Scale
Team Members | Efficacy (Ease of user understanding) | Usability (Ease of use) | Design Clarity (Did we meet design goals?) | Technicality (How complicated?)
Participant 1 | 8 | 7 | 9 | 8
Participant 2 | 8 | 7 | 8 | 7
Participant 3 | 9 | 6 | 9 | 9
Participant 4 | 8 | 7 | 9 | 8
Participant 5 | 8 | 7 | 8 | 8
TOTAL = Importance Factor*SUM | 0.3*41 = 12.3 | 0.5*34 = 17 | 0.3*43 = 12.9 | 0.2*40 = 8
SUM TOTAL: 50.2
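
The weighted totals reported in Tables 1 and 2 can be reproduced with a short script. The sketch below takes the participant scores and importance factors directly from the two tables; the variable and function names are illustrative choices, not part of the study.

```python
# Importance factors and scores transcribed from Tables 1 and 2
# (Participants 1-5, in order). Names are illustrative only.
WEIGHTS = {
    "Efficacy": 0.3,
    "Usability": 0.5,
    "Design Clarity": 0.3,
    "Technicality": 0.2,
}

PROTOTYPE_A = {
    "Efficacy": [8, 6, 8, 7, 8],
    "Usability": [7, 7, 6, 6, 6],
    "Design Clarity": [9, 8, 9, 8, 8],
    "Technicality": [7, 7, 6, 8, 6],
}

PROTOTYPE_B = {
    "Efficacy": [8, 8, 9, 8, 8],
    "Usability": [7, 7, 6, 7, 7],
    "Design Clarity": [9, 8, 9, 9, 8],
    "Technicality": [8, 7, 9, 8, 8],
}


def weighted_total(scores, weights):
    """Sum each criterion's scores, scale by its importance factor, add up."""
    total = sum(weights[name] * sum(values) for name, values in scores.items())
    # Round to one decimal place to match the precision used in the tables.
    return round(total, 1)


print(weighted_total(PROTOTYPE_A, WEIGHTS))  # 46.5
print(weighted_total(PROTOTYPE_B, WEIGHTS))  # 50.2
```

Note that the importance factors add up to 1.3 rather than 1, so the totals are meaningful only as a relative comparison between designs scored with the same weights.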
A comparison of Tables 1 and 2 indicates why Prototype B was selected: it was rated higher than Prototype A in terms of Efficacy, Usability, Design Clarity, and Technicality. The research team used the observations from Phase 2 usability studies to develop a Research-through-Design (RtD) methodology [19] represented in Phase 4 heuristic evaluations. These major design factors were selected from Jakob Nielsen’s heuristics [12] based on their comparative relevance. Seven heuristics were assessed during the last phase of the study: visibility of system status, match between the system and the real world, user control and freedom, consistency and standards, error prevention, aesthetic and minimalist design, and help and documentation. An example evaluation sheet completed by each research team member can be seen below with the seven heuristics (see Fig. 7). Severity (mean) ratings range from zero to four, with zero representing no usability problem and four representing a major issue that needs to be addressed. Most team members gave similar responses regarding the effective design attributes and the problem areas. The most successful design factors mentioned were effective error prevention, extensive user feedback, and a large amount of creative freedom for instructors and children. The high-fidelity prototype allows for superb reconfigurability with diverse microcontroller modules and with applications to many STEAM subjects. Powering on the panels and inserting modules both cause internal illumination, showing users that the model is working properly. The major issues described were the unreliability of electrical solder joints, the lack of user instructions, the high weight of the MDF panel core, and tight panel connections. Late into Phase 3, several solder joints began to fail, causing unstable power delivery to modules. After troubleshooting, the research team found that the low-gauge, solid-core wires were transmitting too much stress to the joints.
Perhaps the most notable issue was the high weight of the MDF core within the panels. In earlier prototypes and low-fidelity U-Design models, a lightweight foam core was used for the structure. The foam board proved not durable enough for rough play with children, and the team attempted the use of an MDF skeleton core with the unnecessary sections carved out. While carving out areas into triangular truss formations made the frame much lighter, the weight still caused issues when panels were stacked too high. An additional usability problem was discovered with the tight panel interconnections and slots for modules. In order to build structures with panels, interconnections had to have tight tolerances to prevent loss of reconfigurability. Despite extensive ideation and modeling of different connections, this has proven problematic during the final stages of Phase 3 and Phase 4. As part of the evaluative process, team members also discussed possible solutions during future iterations. The most probable steps forward would be testing new core materials like high-density foam or lighter-weight wood frames such as balsa wood in addition to braided core electrical wiring and smoother, pressure-fit connections with plastic panels. Overall, the final prototype solved many issues from the past designs and met the research goals through a user-friendly and highly functional construction suite.
Fig. 7. An Example of a Heuristic Evaluation Result Sheet for the Final Prototype
6 Future Work
Looking forward to future iterations, the usability issues will be improved, and U-Design will be tested with 10–12-year-old children in a classroom setting. The results of the study will inform the design of a room-size prototype where children will live inside the structure to solve a problem or tell a story. Additionally, elementary students will be asked to help the research team as co-designers to assist in designing and evaluating future prototypes. It is expected that the children who have access to cyber-physical-spatial learning environments will express themselves in various modes of communication and will experience enhanced learning [6] and cognitive engagement with their surrounding world [9].
7 Significance for Creativity and Cognition Research
For the larger SAI community, U-Design is a design exemplar characterized as a case of Research-through-Design, focused on a tangible, interactive learning suite that, in a novel way, extends digital learning to the dimension of space. As computing becomes ever more ubiquitous in our everyday lives, notably in education, it will inevitably occupy the physical spaces we live and learn in, and increasingly converge with them to construct a spatial-interactive learning environment, a next frontier for SAI researchers.
References
1. Bereiter, C., Scardamalia, M.: The Mind’s Pen: Explorations in the Culture of School Learning. Prentice-Hall, Englewood Cliffs, NJ (1993)
2. Casasola, M., Wei, W.S., Suh, D.D., Donskoy, P., Ransom, A.: Children’s exposure to spatial language promotes their spatial thinking. J. Exp. Psychol. Gen. 149, 1116–1136 (2020). https://doi.org/10.1037/xge0000699
3. Computer Science Teachers Association (CSTA): K-12 Computer Science Standards. https://www.doe.k12.de.us/cms/lib/DE01922744/Centricity/Domain/176/CSTA%20Computer%20Science%20Standards%20Revised%202017.pdf (2017). Retrieved 22 Jan 2022
4. Holbert, N., Berland, M., Kafai, Y.B. (eds.): Designing Constructionist Futures: The Art, Theory, and Practice of Learning Designs. The MIT Press (2020). https://doi.org/10.7551/mitpress/12091.001.0001
5. ISTE: Computational Thinking Competencies. https://www.iste.org/standards/iste-standards-for-computational-thinking (2021). Retrieved 22 Jan 2022
6. Jewitt, C.: Cognitive engagement with the world: a developmental-cognitive perspective. Eur. J. Dev. Psychol. 9(1), 3–17 (2012)
7. Jordan, J.V., Kaplan, A.G., Stiver, I.P., Surrey, J.L., Miller, J.B.: Women’s Growth in Connection: Writings from the Stone Center. Guilford Press, New York (1991)
8. Kafai, Y., Resnick, M.: Constructionism in Practice: Designing, Thinking, and Learning in a Digital World. Lawrence Erlbaum Associates, Mahwah, NJ (1996)
9. Kozma, R.: Rethinking the human-computer interface: Cognitive engagement with the world. IEEE Trans. Learn. Technol. 8(4), 271–288 (2015)
10. Lave, J.: Legitimate peripheral participation. Paper presented at the annual meeting of the American Education Research Association, Chicago, IL (1992)
11. Lee, M.G.: Teaching computational thinking in early elementary. https://www.csteachers.org/Stories/teaching-computational-thinking-in-early-elementary (2019). Retrieved 22 Jan 2022
12. Nielsen, J.: Usability inspection methods. In: Proceedings of the Conference on Human Factors in Computing Systems (CHI’94), pp. 413–414 (1994). https://doi.org/10.1145/259963.260531
13. Office of Science and Technology Policy: White House summit on computer science, cybersecurity, and innovation for our future. President Barack Obama, The White House, Washington D.C. (2015)
14. Sabelli, N.: Constructionism: a new opportunity for elementary science education. National Science Foundation DRL Division of Research on Learning in Formal and Informal Settings, pp. 193–206 (2008). https://nsf.gov/awardsearch/showAward?AWD_ID=8751190. Retrieved 22 Jan 2022
15. de Souza, C.S.: The Semiotic Engineering of Human-Computer Interaction. The MIT Press, Massachusetts (2005). https://doi.org/10.7551/mitpress/6175.001.0001
16. Soleimani, A., Green, K.E., Herro, D., Walker, I.D.: A tangible, story-construction process employing spatial, computational-thinking. In: Proceedings of the ACM Conference on Interaction Design and Children (IDC’16), pp. 157–166 (2016). https://doi.org/10.1145/2930674.2930703
17. Uttal, D.H., Cohen, C.A.: Spatial thinking and STEM education: when, why, and how? Psychol. Learn. Motiv. 57, 147–181 (2012). https://doi.org/10.1016/B978-0-12-394293-7.00004-2
18. Wing, J.M.: Computational thinking. Commun. ACM 49(3), 33–35 (2006)
19. Zimmerman, J., Forlizzi, J., Evenson, S.: Research through design as a method for interaction design research in HCI. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI’07), pp. 493–502 (2007). https://dl.acm.org/doi/10.1145/1240624.1240704
Towards the Establishment of E-Assessment at the University of Mauritius
Abdool Qaiyum Mohabuth
University of Mauritius, Reduit, Mauritius
[email protected]
Abstract. There has been a paradigm shift from traditional learning to online learning in Universities, and this has opened access to online evaluation. While the adoption of online examination has been rather slow in many Universities for years, the Covid-19 pandemic has pushed most institutions to move urgently towards the implementation of e-assessment. Sanitary restrictions imposed by the authorities have compelled Universities to consider e-assessment as an alternative method to evaluate students’ achievement. The University of Mauritius has faced challenges similar to those of other Universities in implementing e-assessment on its campus. It has been compelled to make prompt investments and move quickly towards e-assessment. This study assessed the issues faced by students and academics in adopting e-assessment. It investigated the features of e-assessment platforms and took into consideration the needs of the University before coming up with a framework for the adoption of e-assessment. Both quantitative and qualitative approaches guided the study. Questionnaires were used to gather data about the knowledge and practices of students in terms of e-assessment. The research was triangulated by qualitative methods, where students and academics were interviewed to raise further issues regarding the usage of e-assessment, besides validating and confirming some of the facts gathered during the quantitative stage. Semi-structured and focus group interviews were conducted. Findings led to the development of an appropriate framework which caters for features prior to, during and after the assessment. The framework makes provision to authenticate assessees and includes automated monitoring, which is a necessary feature for dealing with the challenging issue of academic dishonesty in terms of cheating. It also presents features for script submission and e-marking facilities to cover the whole e-assessment process. The framework put forward is affordable and may be easily adopted by Universities. The study also reveals that e-assessment has a promising future and is not considered a tool to be used only during pandemics or other emergencies.
Keywords: Online Examination · e-Assessment · Academic Dishonesty · Online Learning · e-Platforms
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): SAI 2023, LNNS 711, pp. 886–904, 2023. https://doi.org/10.1007/978-3-031-37717-4_58
1 Introduction
The Covid-19 pandemic has certainly caused major disruption in Universities worldwide. The University of Mauritius has not been spared and has faced quite a number of challenges in continuing its mission even when its doors were closed. While the decision taken by management to promote online learning, so that students could continue and complete their studies, has proved successful, the issue of assessing students’ achievement was complex. Assigning course work in lieu of written examination was workable for the ad hoc situation. Academics have used various assessment methods to evaluate their students. The University has taken some steps towards e-assessment and tried it with first-year undergraduate students. This has been very challenging for academics and students, giving rise to concern over a number of issues which this study tries to unveil. Pandemic or not, the introduction of electronic examination is more than ever a must for the University. There is an urgent need to move towards the computerisation of examinations. The use of electronic examination may overcome many of the challenges faced by the University, in particular by ensuring that students can be assessed even when they are off campus. With intakes increasing every year, coupled with the complexity of holding examinations in the traditional way during the present pandemic, it is high time for the University to adopt new assessment methods.
Assessment provides observable evidence of learning and sits at the heart of the learning process. It demonstrates understanding of the curriculum and student progress. Achievement of students cannot be validated without assessment. The current pandemic situation has rendered the task of setting traditional assessment extremely complex. Confinement, followed by restrictions on the number of people at gatherings coupled with social distancing, has prevented Universities from continuing with their traditional assessment methods. This has caused much disruption, and many of them have had no choice but to postpone their evaluation process. Since the health situation has not been improving much, Universities have been compelled to look for alternative evaluation methods.
Engaging students fully in online learning and finding ways to conduct assessments online have been the principal preoccupation of Universities [1]. Seeking out the best practices for conducting online assessment is still among the primary objectives of Universities [2]. The transition from traditional assessment to online evaluation, despite its shortcomings, has enabled Faculties at the University of Mauritius to evaluate their students. It has allowed academics to set their assessments online and proceed with the evaluation once students had submitted their work. However, this sudden transition has not been without consequences [3]. Some academics claimed that they had to spend more time learning the features of the e-assessment platform before they were able to set their assignments. Many students experienced problems in working out their assessments, and others had major issues at the time of submission. Academics were also not sure whether all students had managed to use the chosen e-platform effectively. Some were unable to confirm whether students had eventually submitted their assessment. There were cases where students claimed that they had duly made their submission, but academics could not find the uploaded files. This caused a number of inconveniences, including the postponement of some Boards of Examination. There have even been cases where results were approved and then had to be reviewed, with the Board of Examination called again. Besides, it is still unknown how students have lived the experience of e-assessment, as no feedback has been obtained from them. There is a need to investigate whether they have appreciated the use of e-assessment, what impact it has had on their performance, and whether it has improved their learning. These avenues need to be researched well so that the University of Mauritius can make a cautious move from traditional to e-assessment methods. The aim of this research study is therefore to explore the avenues for making the University shift from the traditional paper-pen examination methods to the establishment of e-assessment for all its courses. To this end, the following objectives are formulated:
• Identify the issues faced by students and academics in adopting e-assessment;
• Explore the capabilities of e-assessment platforms for use at the University;
• Assess the suitability of the e-assessment methods for the different Faculties;
• Present an appropriate framework for the use of e-assessment on the campus.
These will help in meeting the common big challenge Universities face: being able to introduce effective and reliable e-assessment systems in their curriculum [3].
2 Literature Review
2.1 Concept of Assessment
Assessment is a major element in the educational system as it provides useful information about the extent to which the learning outcomes have been achieved [4]. Assessment is often considered a core component of higher education. The study [5] claimed that assessment plays a crucial role in the teaching and learning process in any University. Assessment reflects what students consider important and how much effort they devote to it [6]. It needs to be genuine so as to improve the quality of teaching, learning and academic programmes at educational institutions [7]. Proving what students have grasped is the role of any educational organisation, and this is the purpose assessments serve [8]. Assessment can also be interpreted as a process whereby students’ responses are recorded with the aim of assigning them a grade [9]. It plays a crucial role in following up students’ abilities [10]. Assessment is performed to determine the current status of learning. In order to better understand students’ learning in a course, an academic assesses learning through observation and measurement. There are a number of conventional examination strategies that are used in higher education institutions for assessing academic advancement, for instance, paper-pencil-based examinations, presentations, assignments, e-assessments and many more [11]. For both academics and students, assessment provides evidence of what has been taught. Students in higher education are assessed in order to integrate them into the learning process and provide evidence of their comprehension. Assessment also focuses the attention of both students and academics on what is most important [12]. It has also been argued that students only genuinely engage with their course materials when they are faced with examinations and evaluations. Assessment also helps academics to know how well students grasp the concepts and skills they want their students to understand and demonstrate.
2.2 E-Assessment
E-assessment is defined as a method where digital technology is used for any assessment-related activity [13]. It is the conduct of a formal exam through the web or the intranet. Functionally, an e-exam can be provided using a dedicated system. E-assessment can be classified into four groups: (1) Diagnostic assessment, which aims at gauging the knowledge of students at the start of a course; (2) Formative assessment, which aims at clarifying the learning achieved and identifying needs for additional coaching during the implementation of a course; (3) Summative assessment, which provides the final grades at the end of the course; (4) Integrative assessment, which influences students’ future learning based on feedback received from academics, other students or through self-evaluation [14]. With the advent of web technologies, students’ abilities and skills can be thoroughly tested through e-assessment. E-assessment basically operates on an end-to-end electronic basis whereby Information and Communications Technologies are used for the presentation of the assessment activity and the recording of responses [15]. Accordingly, there are two main types of e-assessment that are commonly used: (1) Web-based delivery, where assessment tasks are carried out online [16, 17]; (2) Download delivery, where the assessment e-files are downloaded onto the students’ computers in the respective assessment centres at the given date and time. They are then divulged when the student attends the assessment in offline e-assessment mode. With this method, strict security measures must be enforced to prevent assessment leaks [18]. In Universities, improving the quality of students’ learning experience remains one of the greatest challenges. It is thought that e-assessment can provide some help with this [19]. However, it should be noted that the effectiveness of an e-assessment system depends on a number of factors, namely examination conditions, standardisation, mode of examination administration, security and cost, amongst others [20].
2.3 Challenges/Pitfalls of e-Assessment
In a cross-sectional study carried out by [21] in the Philippines, the use of e-assessments was positively correlated with students’ levels of anxiety caused by the pandemic.
Some of the challenges recognised as linked to the perceived anxiety levels include the availability of the necessary technological tools, adeptness with and unfamiliarity with the innovative educational technology, and the appropriateness of the type of e-assessment to be delivered, be it diagnostic, formative or summative. Some of the technological limitations that students experienced were the unavailability of a good Wi-Fi connection and thus having to rely on mobile data. Some students did not possess a laptop and had to borrow one for the purpose of assessments. A recent paper from Italy reported that students experienced a lot of frustration while accessing campus platforms due to high traffic load, and that system crashes while students were submitting their work added to the list of stressors [22]. Past research has demonstrated that many students experience different levels of anxiety in online assessments. One study shows that 30% of University students feel very stressed, 56% feel slightly stressed, while only 14% do not feel stressed at all [23]. It has also been deduced that students who are highly anxious during the test preparation phase are more likely to perform poorly during their assessments [24]. Another key challenge of e-assessment is academic dishonesty, as students will continuously devise ways to cheat in exams [25]. This is discussed further in the next section. Besides, performance data from single tests that included the variables time and score were analysed in a study carried out by [26]. The authors found that, in the absence of proctoring, the time taken to finish an exam increases significantly and the online test results are inflated, as students attempt to find the answers. Universities have been facing many challenges in the wake of the pandemic, and the implementation and adoption of e-assessment is one of them. The vulnerability of online assessment is indeed a salient concern. Research is still being done in this area to ensure that exams are carried out as securely as possible and that attempts at cheating among test-takers are mitigated. In addition, a significant budget is required for e-assessment to be implemented. Personal computers, high-speed internet, a continuous and steady power supply and cameras in assessment rooms are all expensive. Lack of sufficient technical infrastructure, possible failure of equipment and lack of quality are also challenges that should be considered carefully [27]. Figure 1 illustrates the design considerations and challenges academics face when migrating to online testing. These design considerations include testing continuity within the course, exam access in terms of availability of logistics, and exam security. Training matters a lot for using an e-assessment platform efficiently, coupled with access to appropriate technology.
Fig. 1. Design Considerations when Switching to Online Assessment
2.4 Academic Dishonesty in e-Assessment
One of the most pertinent issues with e-assessment is that of cheating. While well-established procedures are in place to deal with cheating in traditional assessment, such is not yet the case for online examination. Academic dishonesty is described as illegitimate behaviour while attempting an assessment. The research [28] recognised digital academic dishonesty as academic offenses that are committed using digital technologies. Cheaters, according to [29], are people who employ ways that violate the honour code in order to gain credit without having to learn anything. Approximately 60% of students admitted to having cheated in online assessment. This leads to unfair student evaluations, as determining whether a student’s work is genuine is difficult in online settings [30]. The reasons behind cheating in online assessments may be: (1) the possibility of not being caught, which makes students believe that cheating is easier on an online test; 73.6% of the respondents in a survey of business students perceived that it is easier to cheat in an online than in a traditional assessment [31]; (2) the ease of access to Google and other search engines during online exams that are not proctored [32]; (3) the possibility of using other smart devices which might go unnoticed, such as checking answers with friends using WhatsApp on mobile phones, so that answers may be searched even in a proctored exam; (4) the pressure on students brought about by the pandemic, which has led to an increase in academic misconduct [33]; (5) the limited experience invigilators have in devising tactics to deal with academic misconduct during online assessment [34]. Grading students incorrectly affects Universities’ credibility and may even make employers raise questions about the competency of graduates [35]. It causes students to adopt negative habits of not learning, wasting the significant amount of time and resources invested in obtaining a high-quality education [36]. Universities have started to address the issue of academic dishonesty by first considering why students cheat. This entails examining student attitudes and views that lead to assessment integrity violations. Several scholars have conducted systematic reviews of student and instructor perceptions of online cheating activities [37–39].
Identification of precise causes is crucial since it reveals the problem’s fundamental cause and, as a result, helps in the formulation of an acceptable remedy. Furthermore, contextual considerations play a role, as several studies [40, 41] show that students find it simpler to cheat online and that the majority of them participate in dishonest activity during their examinations. The second question that Universities tend to address is how students cheat. With the advancement of technology, new cheating methods are frequently devised by students, making it critical to discover these methods in order to locate security flaws in the assessment process. Plagiarism, unauthorised materials and collusion are among the loopholes that have been found in the assessment security process. The author of [42] provided a list of recommendations that could be applied to conducting assessments online in higher education. Firstly, it has generally been advised that modules should be assessed continuously using a diversity of assessment methods rather than having the whole weightage of the module rest on only one final assessment. To vary the mode of assessing students virtually, one suggestion is to conduct oral tests using video conferencing tools. These could be recorded and retained as evidence. It has also been highly recommended to refrain from using questions that have generic memorised answers, and instead to incorporate reflective questions that require application of knowledge. The strategy to adopt for organising the online exams would be devised depending on the size of the group of students. For an intermediate-size group of roughly 15–40 students, the suggested option is to divide the group into smaller groups of around 10 students and to monitor them through videoconference using another device such as a mobile or tablet, so that the working environment of the students and their identity can be checked. Academic dishonesty is a widespread problem that is gaining momentum, and scholars are still researching it.
3 Research Methodology
The quality of a research study is reinforced by the application of both quantitative and qualitative methods [43]. Both methods were used for this research study. A survey in the form of a questionnaire was conducted to explore the knowledge that students at the different Faculties/Centres possessed about e-assessment. The use, resource facilities and practical issues involved were also explored. The questionnaire was designed using Google Forms, which allows the integration of different types of questions such as multiple selection, scaling, check boxes, pull-down selection and grids of several options. The questionnaire was divided into six sections: knowledge, usage, resource facilitation, Covid-19 context, practical issues and demographics. These allowed the core knowledge and current practices of e-assessment to be unveiled and were targeted at meeting the research objectives of the study. Stratified sampling was applied, involving the students at the different Faculties/Centres. The quantitative investigation further helped to decipher the information gathered about differences that may exist in e-assessment practices at the different Faculties/Centres. This was followed by qualitative methods, where academics were interviewed to raise further issues regarding e-assessment, besides validating and confirming some of the facts gathered during the quantitative stage. In addition, focus group interviews were used for the students, although the questionnaire had already captured valuable data required for the research. Focus group interviews were used for data collection because group interaction generally enables participants to express views that might not be disclosed in an individual interview, and helps the researcher to understand how the participants feel and what they think about an issue. Group interaction also helps the students to remember events, so that more opinions on e-assessment issues can be gathered.
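The stratified sampling step can be sketched as follows. The paper does not state how the sample was allocated across Faculties/Centres, so proportional allocation is assumed here; the stratum names, their sizes and the function `stratified_sample` are hypothetical and serve only as an illustration.

```python
import random


def stratified_sample(population, sample_size, seed=0):
    """Draw a proportionally allocated random sample from each stratum.

    population: dict mapping stratum name -> list of member identifiers.
    Proportional allocation is an assumption, not the study's stated method.
    """
    rng = random.Random(seed)
    total = sum(len(members) for members in population.values())
    sample = {}
    for stratum, members in population.items():
        # Share of the sample proportional to the stratum's size,
        # with at least one member drawn from every non-empty stratum.
        k = max(1, round(sample_size * len(members) / total)) if members else 0
        sample[stratum] = rng.sample(members, min(k, len(members)))
    return sample


# Hypothetical strata; names and sizes are illustrative only.
population = {
    "Faculty A": [f"A{i}" for i in range(600)],
    "Faculty B": [f"B{i}" for i in range(300)],
    "Centre C": [f"C{i}" for i in range(100)],
}
sample = stratified_sample(population, sample_size=100)
sizes = {name: len(ids) for name, ids in sample.items()}
print(sizes)
```

Sampling within each stratum separately keeps every Faculty/Centre represented, which is what allows practices to be compared across them afterwards.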
4 Results and Discussions A total of 331 students responded to the survey. Five questionnaires were rejected and were not taken into consideration for the analysis, as they were either partially completed or almost blank. Cleaning of data was performed. A total of 37.1% (n = 121) male and 62.9% (n = 205) female responded to the survey. 4.1 Knowledge About e-Assessment Students are seen to be quite familiar with e-assessment with 73.3% stating that they have either moderate, some or very good knowledge of e-assessment as illustrated in Fig. 2. Most students are aware about the terms online exams (78.2%) as compared to other terms associated with e-assessment, Computer-mediated assessment (14.4%), Locked up Browser (9.5%) and Proctored Exam is only 5.2%. As regards to the use of e-assessment
Towards the Establishment of E-Assessment
893
Fig. 2. Knowledge of about e-Assessment
platform, 75.6% of the student has experienced an e-assessment platform. The most popular one being Google classroom (86.4%), Turnitin (78.2%), Moodle (49.3%) while Canvas, Blackboard and ColCampus are less popular representing less than 20%. These are quite obvious as even well before the pandemic students were being instructed to make use of the Turnitin and Moodle platform. Besides, the type of online assessment which the students experienced more was Class tests (n-192), followed by Assignment (n = 189) then Mini-projects (n = 85) as illustrated in Fig. 3. In fact, academics have been using the e-assessments platform with their students mostly for class tests and assignments. Most students on the campus have been going through these. Academics have tried these online assessments for their validity as they provide the facility to test really what they want to test and present the results as aligned with the learning goals. They have been able to make fair assessment, distinguish between those students who deserve passing and others who do not. As such academics have been able to mark their class tests and assignments in a consistent way and the judgments or the grades were meaningful. On the other hand, portfolios, case studies, open and oral exam in online version have been less popular among students due to the fact that academics themselves were not enough convinced about the facilities these platform offer. As regards to the type of questions students have experienced with online assessment, Multiple Choice questions topped the list (n = 264), this is followed by Structured (n = 163) then Fill in the blank (n = 129) as shown in Fig. 4. Indeed, students want their test to be fair. By fairness, they mean the questions need to be clear where they may be able to figure out what the question is asking even if they can’t answer it. Multiple choice questions allow academics to bring up these expectations that students have. 
In addition, it allows students to have meaningful feedback about why the other answers are not correct and contribute much to the learning. In [44] author confirmed that Multiple Choice questions presents an important component of student learning outcomes making
894
A. Q. Mohabuth
250 200 150 100 50 0
Fig. 3. Assessment Instrument for Online Exam
Fig. 4. Type of Questions Preferred for e-Assessment
them enjoy the learning, allowing them to engage in the course and even rate the quality of teaching. 4.2 Usage of e-Assessment The practical use of e-assessment was assessed by using eight Likert items to determine the ease at which students handle e-assessment. A 5-point scale was used from Strongly Agree = 5 to Strongly Disagree = 1. Table 1 illustrates the mean, standard deviation (SD), skewness and kurtosis of the responses. It was observed that students are quite comfortable with e-assessment usage. The number of students who find it hard to focus on the questions when doing an online exam was quite low with a mean of 2.86. Same was noticed for students who find themselves more comfortable in paper-pen mode of
assessment than in e-assessment (mean = 2.20). This means that students do not consider e-assessment arduous. Agreement that e-assessment is not appropriate for their subject area is also low (mean = 2.57). This further confirms recent research asserting that online assessment cuts across borders and is widespread across various disciplines [46, 47, 48]. As regards the adoption of malpractices with e-assessment, students have diverse opinions on the issue (mean = 2.98). Similarly, some students do not feel that online exams have a serious impact on health and safety, while others do (mean = 3.12). This may not be associated with e-assessment alone; other factors may be contributing, such as the Covid situation bringing along additional stresses and fears. On the other hand, there is agreement as regards the use of cameras during online exams: most students are agreeable to it (mean = 3.98). Students were in fact constantly under the eye of invigilators under traditional assessment, so being monitored by cameras does not bother them much.

Table 1. Distributive Statistics for ease of Use of e-Assessment among Students

Items | Mean | SD | Skewness | Kurtosis
I find it hard to concentrate on the questions when doing an online exam | 2.86 | 1.186 | 0.125 | −0.814
I am more comfortable in paper-pen mode of assessment than in e-assessment | 2.20 | 1.126 | 0.642 | −0.325
I prefer to write on paper and upload my work afterwards rather than typing my answers | 3.79 | 1.194 | −0.696 | −0.475
E-assessment is inappropriate for my subject area | 2.57 | 1.385 | −0.020 | −1.106
E-assessment should only be of short duration | 3.83 | 0.799 | 0.062 | −0.883
There is less scope to adopt malpractices in e-assessment than with paper-based assessments | 2.98 | 0.962 | 0.085 | −0.009
I feel there are serious health and safety issues associated with online exams | 3.12 | 1.174 | −0.068 | −0.784
It is imperative to make use of camera during online assessment | 3.98 | 1.661 | −1.238 | 0.296
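The four statistics reported in Table 1 can be computed from the raw Likert responses to each item. The sketch below is illustrative only: the paper does not state which software or estimator was used, so the moment-based skewness and excess kurtosis shown here are an assumption (statistical packages often apply a small-sample correction and would give slightly different values). The function name `describe_likert` and the sample responses are hypothetical.

```python
from statistics import mean, stdev


def describe_likert(responses):
    """Return mean, sample SD, skewness and excess kurtosis of Likert scores."""
    n = len(responses)
    m = mean(responses)
    # Central moments about the mean (population form, divisor n).
    m2 = sum((x - m) ** 2 for x in responses) / n
    m3 = sum((x - m) ** 3 for x in responses) / n
    m4 = sum((x - m) ** 4 for x in responses) / n
    if m2 == 0:
        raise ValueError("all responses are identical; shape is undefined")
    return {
        "mean": m,
        "sd": stdev(responses),          # sample SD, divisor n - 1
        "skewness": m3 / m2 ** 1.5,      # > 0 means a longer tail towards 5
        "kurtosis": m4 / m2 ** 2 - 3.0,  # excess kurtosis, 0 for a normal curve
    }


# Hypothetical responses to one item (5 = Strongly Agree ... 1 = Strongly Disagree).
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5]
print(describe_likert(sample))
```

Read this way, a mean below 3 indicates that respondents lean towards disagreement with the item, and a negative skewness indicates that responses cluster at the agreement end of the scale.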
4.3 Resource Facilitation

Students expressed high concern as regards technical issues associated with e-assessment (mean = 4.26), as shown in Table 2. Power cuts and loss of network connection can cause a lot of disruption, and unfortunately neither students nor academics have control over these issues. They can happen at any time during an online exam. Universities are fully aware of these problems and try their best to minimise the impact whenever a student experiences them. Students are at no time penalised for these situations, which are beyond their control. Most students who took part
in the survey did not agree that their University lacked adequate resources for conducting e-assessment (mean = 2.81). In fact, the University promptly put in place the necessary infrastructure for online exams to take place during the pandemic. The necessary logistics were provided to the staff and even to the students. The University has attended to the hardship that some students may encounter and has even put laptops at the disposal of students who cannot afford one. Besides, many students stated that they have no issue investing in tools such as a laptop, smartphone, camera, scanner or Wi-Fi network to take part in e-assessment (mean = 3.77). This shows clearly that the students are acting in good faith and have been collaborating to help the University initiate the paradigm shift from traditional assessment to e-assessment. However, most students agreed that it is imperative to have the appropriate training to make use of e-assessment. This is understandable, as students have different backgrounds: except for those in the IT fields, they need to be coached to make efficient use of the features of e-assessment platforms. The University has indeed not neglected these issues and has provided the necessary training to all its academic staff, who have been exposed to the different features of e-assessment platforms. A workshop on e-assessment with an international expert was even organised for academics. Academics have thereafter transferred the knowledge to their students.

Table 2. Distributive Statistics for Facilities in the Environment

Items | Mean | SD | Skewness | Kurtosis
Technical problems such as power cut, loss of network connection make online exams impractical | 4.26 | 0.971 | −1.449 | 1.833
My university does not have the adequate resources/infrastructure to carry out e-assessment | 2.81 | 1.044 | 0.012 | −0.199
I have no issue to invest in tools to take part in e-assessment (laptop, smartphone, camera, scanner, Wi-Fi network) | 3.17 | 1.180 | −0.314 | −0.608
It is imperative to have the appropriate training to make use of e-assessment | 3.77 | 1.002 | −0.608 | −0.067
4.4 The Covid-19 Context

E-assessment systems existed well before the pandemic, but they were never a priority for most Universities. Only those Universities that were offering most of their courses online or by distance learning were considering e-assessment platforms. It is no secret that the Covid-19 crisis has given all Universities a wake-up call to revamp their assessment strategies. There is a relationship between Covid-19 and e-assessment, as the pandemic has been the triggering factor for Universities, but will e-assessment have a future or is it just being used as a fifth tyre in the current context? Students do not consider e-assessment a fifth tyre, as shown in Table 3. Very few students considered that e-assessment should only be used during the pandemic situation (mean = 2.37).
The mean is also very low (mean = 1.96) for those who would have preferred the traditional mode of assessment had there not been Covid-19. This clearly demonstrates that students are willing to shift towards e-assessment. Students are well aware that technology represents their future and would surely not miss the opportunity to take advantage of what e-assessment has to offer. Students felt that the Covid-19 pandemic has compelled them to adapt to e-assessment practices (mean = 3.86). Certainly, the pandemic situation has been the prime factor in making most students discover the facilities of e-assessment platforms. Most students would never have experienced them, even though they may have heard about them before. Students seem to have diverse opinions about the difficulty of using e-assessment tools during the COVID-19 pandemic (mean = 3.20). This may be due to other factors: some students might already have had access to the necessary logistics while others may not. It may also be that some students have been able to discuss the features with their friends, while others have not been able to do so during the pandemic.

Table 3. Distributive Statistics for e-Assessment in the Covid-19 Context

Items | Mean | SD | Skewness | Kurtosis
If it were not for COVID-19, I would have preferred paper-based mode of assessment | 1.96 | 1.040 | 1.034 | 0.503
E-assessment should only be used during the COVID-19 pandemic | 2.37 | 1.122 | 0.581 | −0.434
The COVID-19 pandemic compelled me to adapt to e-assessment practices | 3.86 | 0.957 | −0.931 | 0.896
It has not been difficult for me to use e-assessment tools during the COVID-19 pandemic | 3.20 | 1.117 | −0.248 | −0.464
4.5 Practical Issues

It was found that all students have access to online exams. The mean obtained at each Faculty is quite high, at nearly four and above, which demonstrates that there is no major problem as regards the platforms used by the University for e-assessment. The statement that online exams are time-saving received an average rating, with means of around three, which may be because some students take into account that they do not need to travel to the University for assessment. This issue is taken further in the qualitative part of the research to obtain clearer views. Students at all Faculties do not associate online assessment with e-learning, which means that they consider e-assessment still feasible even if they are having only face-to-face lectures. E-learning may facilitate the adoption of e-assessment in Universities, but they are different entities and not interdependent. Students rather agree that there is potential for getting immediate feedback from their online assessment and that this could enhance their learning experience. Students express concern about the issue of cheating in online exams. All the Faculties
have rated this issue as high, with means ranging from 3.68 to 4.14. Indeed, this is a primary issue which many Universities are still working on. As regards privacy and recording during online exams, the means obtained at the different Faculties are low, which indicates that students have no apprehension about these. They are aware that they need to be visible for the assessment, and home privacy and recording of the exam event do not matter much to them, in that these fall within the requirements for ensuring the viability of e-assessment. Most Faculties gave a mean of less than three as regards the reliability of invigilation in the e-assessments they have taken part in. This indicates that there is a loophole, and further enquiries on it were taken up with the students in the interviews, which are discussed in the subsequent section on qualitative analysis. The mean for whether e-assessment reduces stress is quite low, indicating that examinations remain stressful even if students are at home, where they are more comfortable.

4.6 Qualitative Research

Interviews were held with academics and administrative officers to gain insight into the hurdles faced and their points of view regarding e-assessment. The findings were combined and the resulting trend is reported in the succeeding sections. While most academics admit that traditional assessment has well-established procedures because it has been used for years, they are not against the use of e-assessment. They claimed that e-assessment is still at the experimental stage and that the procedures need to be more structured. Most of them believe that e-assessment may be of help and should not be discontinued when the pandemic situation is over. They are thankful to management for facilitating the implementation of e-assessment on the campus by providing the necessary training and workshops. Most academics found that e-assessment has been very helpful for conducting tests for their course work.
They do not have to print test papers, find an appropriate room, especially for large groups, or seek help from colleagues for invigilation. The main issue, as expected, is academic dishonesty, which many regard as a deterrent to replacing traditional face-to-face assessment. In addition, a focus group interview was held with students. Almost all participants reacted favourably to a proctored exam system. They believe that it might be a good system to deter dishonest behaviour, the prime pitfall of the current e-assessment system. By so doing, a fairer examination might be implemented.

4.7 The Framework

From the research findings presented above, it is crucial to develop a framework that will ensure not only that learning achievement can be assessed fairly, but also that trust is built among academics and students, where the process of authentication is defined and academic dishonesty is mitigated. The following framework, illustrated in Fig. 5, is proposed. The framework takes into account the processes to put in place before, during and after the e-assessment. Students will first log in using a username and a password onto the examination platform, which may be Zoom, Google Meet, Webex, Teams or any other meeting app, depending on the number of students to be hosted. The number of participants for the Zoom platform may go up to a thousand, but many Universities
Fig. 5. The Proposed Framework
generally subscribe to packages that can handle up to 300 participants per session. The screen gallery to view the participants may show up to 50 per page. Google Meet has educational subscriptions that can handle 100 to 500 participants per session, while Webex offers hosting for up to 1000 participants per session. Once the students are on the respective platform, they need to switch on their front camera and present themselves to the online invigilator for authentication purposes. Identity verification may be done by capturing a picture of each student. In fact, Universities need to implement a uniform exam policy requiring the use of a camera to capture the student’s photo, which allows verification of the student’s ID and ensures that someone else is not taking the test in place of the student. The University generally has the profile of each student in its student record system. A facial recognition process will then match the captured image to the saved profile. This ensures the authenticity of the students sitting for the e-assessment. Off-the-shelf software is available on the market that may be acquired and easily integrated with the e-assessment platform to match the current image with the stored image and so identify examinees. Google Classroom may be used by academics to upload the question paper for their assessment. Once authentication is complete, the online invigilator provides the Google Classroom link to the students by sharing it through the chat window on the examination platform. As soon as all students have accessed the link and downloaded the paper, the examination may start. During the examination, the online invigilator may retrieve the attendance of the students from the authentication software. Invigilation may take place through constant monitoring by
the online invigilator, who views the grid of cameras presented through the e-assessment platform. If the online invigilator happens to detect a case of cheating, the focus will be on that particular student, thereby neglecting other students. In fact, the findings of this study revealed a lot about the vulnerabilities of the existing system as regards academic dishonesty. Students as well as academics pointed out very clearly the weaknesses of the system in identifying cheating cases. This was fully discussed in both the quantitative and qualitative sections above. Browser lockdown software may be used, but cheating may still take place, as such software does not detect what is happening around the student. Students may still use their mobile phones, look at notes stuck on walls, or keep books or materials around that are not visible within the camera coverage area. The issue of academic dishonesty may be mitigated by integrating artificial intelligence (AI) features capable of performing automatic monitoring and detecting any cheating attempt. AI software that can monitor students’ behaviour during e-assessment is available and can be integrated with the examination platform. Such software is more powerful than lockdown browsers, as it constantly monitors students’ behaviour and may detect the use of mobile phones and voices, flagging off coughing, sneezing, etc. It provides timestamped recordings, and attempts to cheat may be detected instantly and reported to the online invigilator for necessary action. The recordings may also be viewed after the examination for confirmation purposes. Honorlock, Proctortrack and Top Hat are among the popular proctoring software packages that offer monitoring through AI features and authentication facilities. As soon as the examination is over, students take pictures of or scan their scripts and compile them into a single PDF file. An allocated time is assigned for this exercise; generally 15–20 min suffice.
Once students have uploaded their answer scripts, the invigilators verify the turned-in papers. Any script sent after the allocated time will not be considered. Once all the scripts have been submitted, the online invigilator releases the students and ends the online meeting. The Exam Administrative Officer may acknowledge receipt of the scripts and share them with the respective examiners. The latter may access their respective batch for marking. Examiners may download the scripts and mark them using annotations in a PDF application such as Adobe Acrobat. The use of a graphics tablet will facilitate the task. Google Drive may be used to share the marked scripts with moderators for the moderation exercise. Final marks are then shared with the Exam Administrative Officer for filing purposes and for sampling by the External Examiner if so required. The same examination process as for traditional assessment may then be followed as per the approved examination regulations for Exam Boards, Faculty Boards, Exam Committee, Senate, etc.
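Two of the operational rules in the framework lend themselves to a small worked example: how many gallery pages an invigilator has to page through, and whether an uploaded script falls within the allocated upload time. The sketch below is illustrative only and is not part of the proposed framework's implementation. The function names `gallery_pages` and `script_accepted` and the example dates are hypothetical, and the 50-per-page gallery and 20-minute window are simply the figures quoted in the text.

```python
import math
from datetime import datetime, timedelta


def gallery_pages(participants, tiles_per_page=50):
    """Gallery pages the invigilator must page through to see every camera."""
    return math.ceil(participants / tiles_per_page)


def script_accepted(exam_end, submitted_at, upload_minutes=20):
    """True if the script arrived no later than the allocated upload time."""
    deadline = exam_end + timedelta(minutes=upload_minutes)
    return submitted_at <= deadline


# Hypothetical session: 300 students, exam ends at 12:00.
exam_end = datetime(2023, 1, 10, 12, 0)
print(gallery_pages(300))                                        # 6
print(script_accepted(exam_end, datetime(2023, 1, 10, 12, 15)))  # True
print(script_accepted(exam_end, datetime(2023, 1, 10, 12, 25)))  # False
```

With a 300-participant session, the invigilator has six pages of video tiles to cycle through, which is consistent with the case made above for AI-assisted monitoring.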
5 Conclusion

The study assessed the level of knowledge that IT students possess about e-assessment; its usage, the resources needed and various practical issues were investigated in depth. It was found that students are quite knowledgeable about e-assessment practices, as more than 70% claimed that they have experienced a number of platforms associated with e-assessment, particularly Google Classroom, Google Meet, Zoom, Turnitin, Moodle, etc., although they were not very familiar with some terms such as lockdown browser and proctored exam. As regards the type of assessments, students have been experiencing mainly Class tests, Assignments and Mini-projects as compared to Portfolios,
Case studies and Open and Oral exams. Multiple choice questions and structured questions were the most preferred types of questions for the students. Academics, for their part, have been able to try these online assessments for their validity, as they provide the facility to test what they really want to test and to present the results in alignment with the learning goals. They have been able to make fair assessments and to distinguish between those students who deserve to pass and those who do not. Students were found to be quite comfortable with the use of e-assessment. Very few students reported that they find it hard to focus on questions in an online exam. Most students confirmed that e-assessment is applicable in their subject area. However, there was diverse opinion as regards the use of cameras, malpractices associated with online assessment and impact on health and safety. Regarding resource facilitation, the majority of the students believe that the University has adequate resources and can put in place further resources for conducting e-assessment. Many students also reported that they did not face issues in investing in tools such as a laptop, smartphone, camera, scanner or Wi-Fi network to take part in e-assessment. However, most students claimed that it is imperative to have the appropriate training to make use of e-assessment effectively. Both students and academics expressed concern about academic dishonesty and network connection problems. Students suggested that e-assessment should not last more than two hours to prevent an impact on health. Both parties agree that e-assessment has been very helpful during the Covid-19 pandemic and that it is the pandemic that triggered the introduction of e-assessment on the campus. They also share the conviction that e-assessment should not die down after the pandemic and that there is a bright future for e-assessment, although they believe that e-assessment will not replace traditional face-to-face assessment.
References

1. Hosseini, M.M., Egodawatte, G., Ruzgar, N.S.: Online assessment in a business department during COVID-19: challenges and practices. Int. J. Manage. Educ. 19, 100556 (2021). https://doi.org/10.1016/j.ijme.2021.100556
2. Abd Elgalil, H.M., El-Hakam FE-Z, A., Farrag, I.M., Abdelmohsen, S.R., Elkolaly, H.: Undergraduate students’ perceptions of online assessment during Covid-19 pandemic at Faculty of Medicine for Girls, Al-Azhar University, Cairo, Egypt. Innovations in Education and Teaching International, pp. 1–11 (2022). https://doi.org/10.1080/14703297.2022.2037450
3. Mukhtar, K., Javed, K., Arooj, M., Sethi, A.: Advantages, limitations and recommendations for online learning during COVID-19 pandemic era. Pakistan J. Med. Sci. (2020). https://doi.org/10.12669/pjms.36.covid19-s4.2785
4. Al-Qdah, M., Ababneh, I.: Comparing online and paper exams: performances and perceptions of Saudi students. Int. J. Inf. Educ. Technol. 7, 106–109 (2017). https://doi.org/10.18178/ijiet.2017.7.2.850
5. Appiah, M., van Tonder, F.: Students’ perceptions of E-assessment at a higher education institution. In: 2019 5th International Conference on Computing Engineering and Design (ICCED) (2019). https://doi.org/10.1109/icced46541.2019.9161088
6. Jacsó, P.: The JISC academic database assessment tool – virtues and vices. Online Inf. Rev. 34, 806–814 (2010). https://doi.org/10.1108/14684521011084636
7. Capacho, J.: Assessment of student learning in virtual spaces, using orders of complexity in levels of thinking. Turkish Online J. Distance Educ. 18, 179–201 (2017). https://doi.org/10.17718/tojde.306568
8. Buchanan, E.A.: Online assessment in high education: strategies to systematically evaluate student learning. In: Howard, C., Schenk, K.D., Discenza, R. (eds.) Distance Learning and University Effectiveness: Changing Educational Paradigms for Online Learning, pp. 163–176. IGI Global (2004). https://doi.org/10.4018/978-1-59140-178-0.ch008
9. Brink, R., Lautenbach, G.: Electronic assessment in higher education. Educ. Stud. 37, 503–512 (2011). https://doi.org/10.1080/03055698.2010.539733
10. McLean, H.: This is the way to teach: insights from academics and students about assessment that supports learning. Assess. Eval. High. Educ. 43, 1228–1240 (2018). https://doi.org/10.1080/02602938.2018.1446508
11. Nwosu, J.: Comparative assessment of e-exam and e-marking integration in selected universities in Ogun State: undergraduate student’s perspectives. In: International Conference on Education, Development and Innovation, pp. 137–152 (2017)
12. Boud, D., Falchikov, N.: Aligning assessment with long-term learning. Assess. Eval. High. Educ. 31, 399–413 (2006). https://doi.org/10.1080/02602930600679050
13. Rout, G., Patnaik, S.: A case study on E-examination in universities of Odisha. Int. J. Comput. Commun. Technol. 8(1), 12–20 (2017). https://doi.org/10.47893/ijcct.2017.1392
14. Crisp, G.: Assessment. In: Marshall, S. (ed.) A Handbook for Teaching and Learning in Higher Education: Enhancing Academic Practice, pp. 61–71. Routledge (2019). https://doi.org/10.4324/9780429259500-5
15. Soffer, T., Kahan, T., Livne, E.: E-assessment of online academic courses via students’ activities and perceptions. Stud. Educ. Eval. 54, 83–93 (2017). https://doi.org/10.1016/j.stueduc.2016.10.001
16. Algahtani, H., Shirah, B., Subahi, A., et al.: Effectiveness and needs assessment of faculty development programme for medical education: experience from Saudi Arabia. Sultan Qaboos Univ. Med. J. [SQUMJ] 20, 83 (2020). https://doi.org/10.18295/squmj.2020.20.01.012
17. Naidu, S.: E-learning: A guidebook of principles, procedures and practices. Commonwealth Educational Media Centre for Asia (CEMCA) (2006). https://doi.org/10.56059/11599/53
18. Howarth, P.: The opportunities and challenges faced in utilizing e-Based assessment (2015). http://www.educationalrc.org/oldconf/old/pdf/Paul%20Howarth%20-%20Beirut%20Presentation.pdf. Accessed 6 Jul 2021
19. Huda, S., Kabir, M., Siddiq, T.: E-Assessment in higher education: students’ perspective. Int. J. Educ. Dev. Using Inf. Commun. Technol. 16, 250–258 (2020)
20. Kuyoro, S., Maminor, G., Kanu, R., Akande, O.: The design and implementation of a computer based testing system. History 5: 6 (2016)
21. Dy, E., Tan, A., Errabo, D.: Students’s Perceptions and Anxieties towards e-Assessment: implications for online classroom delivery. In: IEEE International Conference on Educational Technology (ICET), pp. 191–195 (2021). https://doi.org/10.1109/ICET52293.2021.9563138
22. Favale, T., Soro, F., Trevisan, M., et al.: Campus traffic and e-learning during COVID-19 pandemic. Comput. Netw. 176, 107290 (2020). https://doi.org/10.1016/j.comnet.2020.107290
23. Fawaz, M., Samaha, A.: E-learning: depression, anxiety, and stress symptomatology among Lebanese university students during COVID-19 quarantine. Nurs. Forum 56, 52–57 (2020). https://doi.org/10.1111/nuf.12521
24. Yusefzadeh, H., Amirzadeh Iranagh, J., Nabilou, B.: The effect of study preparation on test anxiety and performance: a quasi-experimental study. Adv. Med. Educ. Pract. 10, 245–251 (2019). https://doi.org/10.2147/amep.s192053
25. Blumenfeldwitz, J.: ULPT: need to cheat on a proctored, online exam? Make a cheat sheet and put it on your laptop screen (2020). https://www.reddit.com/r/UnethicalLifeProTips/comments/ckf3fu/ulpt_need_to_cheat_on_a_proctored_online_exam/. Accessed 12 Dec 2021
26. Alessio, H.M., Malay, N., Maurer, K., et al.: Interaction of proctoring and student major on online test performance. Int. Rev. Res. Open Distributed Learn. 19(5), 165–185 (2018). https://doi.org/10.19173/irrodl.v19i5.3698
27. Ndibalema, P.: Online assessment in the era of digital natives in higher education institutions. Int. J. Technol. Educ. 4(3), 443–463 (2021). https://doi.org/10.46328/ijte.89
28. Garg, M., Goel, A.: A systematic literature review on online assessment security: current challenges and integrity strategies. Comput. Secur. 113, 102544 (2022). https://doi.org/10.1016/j.cose.2021.102544
29. Alexandron, G., Yoo, L.Y., Ruipérez-Valiente, J.A., Lee, S., Pritchard, D.E.: Are MOOC learning analytics results trustworthy? With fake learners, they might not be! Int. J. Artif. Intell. Educ. 29(4), 484–506 (2019). https://doi.org/10.1007/s40593-019-00183-1
30. Gamage, K.A.A., Silva, E.K., Gunawardhana, N.: Online delivery and assessment during COVID-19: safeguarding academic integrity. Educ. Sci. 10, 301 (2020). https://doi.org/10.3390/educsci10110301
31. King, C., Guyette, R., Piotrowski, C.: Online exams and cheating: an empirical analysis of business students’ views. J. Educ. Online (2009). https://doi.org/10.9743/jeo.2009.1.5
32. Bilen, E., Matros, A.: Online cheating amid COVID-19. J. Econ. Behav. Organ. 182, 196–211 (2021). https://doi.org/10.1016/j.jebo.2020.12.004
33. Senoran, H.: More students cheating during online classes, universities say. CTV News (2020). https://kitchener.ctvnews.ca/more-students-cheating-during-online-classes-universities-say-1.5234890. Accessed 12 Feb 2021
34. Adedoyin, O., Soykan, E.: Covid-19 Pandemic and online learning: the challenges and opportunities. Interactive learning environments, pp. 1–13 (2020)
35. Amigud, A., Pell, D.J.: When academic integrity rules should not apply: a survey of academic staff. Assess. Eval. High. Educ. 46, 928–942 (2020). https://doi.org/10.1080/02602938.2020.1826900
36. Xiong, Y., Suen, H.K.: Assessment approaches in massive open online courses: possibilities, challenges and future directions. Int. Rev. Educ. 64(2), 241–263 (2018). https://doi.org/10.1007/s11159-018-9710-5
37. Adzima, K.: Examining online cheating in higher education using traditional classroom cheating as a guide. Electron. J. E-Learning 18, 476–493 (2020)
38. Dawson, P.: E-cheating. In: Dawson, P. (ed.) Defending Assessment Security in a Digital World: Preventing E-Cheating and Supporting Academic Integrity in Higher Education, pp. 1–18. Routledge, Abingdon (2020). https://doi.org/10.4324/9780429324178-1
39. Butler-Henderson, K., Crawford, J.: A systematic review of online examinations: a pedagogical innovation for scalable authentication and integrity. Comput. Educ. 159, 104024 (2020). https://doi.org/10.1016/j.compedu.2020.104024
40. Dendir, S., Maxwell, R.S.: Cheating in online courses: evidence from online proctoring. Comput. Hum. Behav. Rep. 2, 100033 (2020). https://doi.org/10.1016/j.chbr.2020.100033
41. Peled, Y., Eshet, Y., Barczyk, C., Grinautski, K.: Predictors of academic dishonesty among undergraduate students in online and face-to-face courses. Comput. Educ. 131, 49–59 (2019). https://doi.org/10.1016/j.compedu.2018.05.012
42. García-Peñalvo, F.J., Corell, A., Abella-García, V., Grande-de-Prado, M.: Recommendations for mandatory online assessment in higher education during the COVID-19 pandemic. In: Burgos, D., Tlili, A., Tabacco, A. (eds.) Radical Solutions for Education in a Crisis Context. Lecture Notes in Educational Technology, pp. 85–98. Springer, Singapore (2021). https://doi.org/10.1007/978-981-15-7869-4_6
43. Polit, D.F., Beck, C.T.: Generalization in quantitative and qualitative research: myths and strategies. Int. J. Nurs. Stud. 47, 1451–1458 (2010). https://doi.org/10.1016/j.ijnurstu.2010.06.004
44. Whisenhunt, B.L., Cathey, C.L., Hudson, D.L., Needy, L.M.: Maximizing learning while minimizing cheating: new evidence and advice for online multiple-choice exams. Scholarsh. Teach. Learn. Psychol. 8, 140–153 (2022). https://doi.org/10.1037/stl0000242
45. Grouse, F., Malthaner, A., Hannon, K.: Digital education in the disciplines (2022)
46. Iqbal, S.A., Ashiq, M., Rehman, S.U., et al.: Students’ perceptions and experiences of online education in Pakistani universities and higher education institutes during COVID-19. Educ. Sci. 12, 166 (2022). https://doi.org/10.3390/educsci12030166
47. Bidonde, J., Meneses-Echavez, J.F., Asare, B., et al.: Developing a tool to assess the skills to perform a health technology assessment. BMC Med. Res. Methodol. 22, 78 (2022). https://doi.org/10.1186/s12874-022-01562-4
Inquiring Minds Want to Know What HBCU Students Say About a STEM Master Course Model

D’Nita Andrews Graham
Norfolk State University, Norfolk, VA 23504, USA
[email protected]
Abstract. Historically Black Colleges and Universities (HBCUs) have a long history of providing quality educational experiences to underrepresented minority students (URMs). Despite the increase in quality assurance initiatives at many of the nation’s colleges and universities, HBCUs collectively continue to lag behind non-HBCUs in the development and implementation of STEM courses that are Quality Matters (QM) certified. This research paper presents the experiences of students in the only QM certified STEM course that utilizes a master course model at Norfolk State University, an HBCU. To help instructors, instructional designers, and administrators understand the impact of design standards from students’ viewpoints, an anonymous 20-item survey was developed and administered to students to gain insight into their experiences and perspectives. The findings indicated that the students in sections that utilized the QM certified master course model expressed favorable opinions of this approach, and they reported generally higher levels of satisfaction, engagement, and learning effectiveness than the students in other sections that did not utilize the QM certified model.

Keywords: STEM · Online · Master Course · HBCU · Quality Matters · Student Experiences · Higher Education
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): SAI 2023, LNNS 711, pp. 905–918, 2023. https://doi.org/10.1007/978-3-031-37717-4_59

1 Introduction

There is minimal research that has examined the perceptions of HBCU students about a master course design model of a STEM course. Academic leaders can better support and implement best practices when they understand student perceptions about the master course design model. This research paper presents the student survey results of a STEM course utilizing the Quality Matters (QM) master course design model at Norfolk State University. Models make possible a systematic approach to design in that they intentionally lead designers to balance considerations of varied critical factors [1]. Models incorporate both theoretical and empirical research in related fields [2, 3]. There have been few prior publications giving voice to students about a QM based master course design model for STEM courses at HBCUs. The process for constructing and validating a QM HBCU STEM master course model as reported in this study, and its implications for improvement of course-level practice and educational innovation, contribute to the growing literature on this meaningful approach to learning. The remainder of this paper is organized as follows. Section 2 provides a literature review of the relevant subject domain. Section 3 describes the methodology used for this research and project implementation. Section 4 summarizes the results of the research. Sections 5 and 6 present a discussion and conclusion.
2 Literature Review
Approximately one-third of higher education students were taking at least one online course before the nationwide move to remote learning [4]. COVID-19 has changed the way education systems communicate and learn. Online higher education faculty are reviewing their perceptions about the new culture of learning and teaching. This transformation has caused researchers to examine online learning and online course design. According to [5], the COVID-19 pandemic elevated the urgency of providing quality online learning. The assumption that faculty, who have discipline expertise, are sufficiently trained and skilled in instructional design has been challenged [6]. Instructional design is an important component and requires specific attention in online courses [7–9]. Enhancing higher education institutions’ quality assurance processes requires faculty exposure to and application of course design standards, as well as course quality review [10]. Faculty’s active participation in quality assurance processes has been found to be a positive experience related to receptivity of feedback from internal quality assurance review [11, 12]. Taylor et al. [13] suggest administrators should review the challenges faculty might experience in “decision fatigue of the observed.”
Over the past 25 years, researchers and practitioners have developed a substantial body of knowledge about what constitutes a quality online course. This knowledge has been collected into processes and rubrics such as the OPEN SUNY OSCQR Course Design Review and the Quality Matters program and rubrics. When applied to online course design, these peer-validated instruments can help ensure that instructors and students receive a high-quality experience [14]. A predictable structure with clear expectations is one of the most significant factors associated with a positive online experience. When criteria such as accessible design, copyright compliance, instructional design, universal design, teacher presence, mobile and cybersecurity awareness, and many other key elements are included in a rubric, faculty can achieve intentional design.
2.1 Theoretical Framework
To understand students’ perceptions about a master course design model, Oliver’s expectancy disconfirmation theory (EDT) [15] was used. Expectancy disconfirmation theory holds that a person’s level of satisfaction arises from the learning process: the learner’s interpretation of the experience creates a difference between expected and perceived performance. In this study, the EDT was used to determine students’ level of satisfaction.
Inquiring Minds Want to Know What HBCU Students
2.2 Why Use a Master Course Model?
Best practices related to quality in design, quality in teaching, and content management have proliferated. A team of subject matter experts and instructional designers can design the master course design model. The model commonly utilizes a developmental course shell to store all the materials for a course design in the learning management system on which the course will be taught (e.g., Blackboard, Canvas, Desire2Learn). With the goal of maximizing student access to the materials and learning outcomes, the master course model can present the materials in a manner that aligns with the institution’s branding standards, course design, accessibility, and quality standards. A master course model provides the ability to design and develop a course that can be taught by multiple individuals teaching different sections of the same course, whether they are visiting professors, adjunct professors, graduate teaching assistants, or full-time faculty members. Instructors have flexibility to modify and adapt the course to fit personal preferences. The master course design approach is about faculty collaborating, sharing, and weighing in on all aspects of the course. One of the most critical factors of a master course design model is the structure and instructional design that affects the overall experience of students. Instructional design and delivery are common components of the success of the learning experience. Several studies have explored the efficiency of instructional design [16, 17].
2.3 Benefits for Instructors
Instructors at an HBCU, with its unique legacy, can benefit from a master course model because many of them are multicultural educators who create a culturally sensitive classroom to support the diverse student body they teach. The course population comprises different peoples and societies.
To ensure the consistency and integrity of any given course, regardless of delivery format and instructional personnel, its instructional outcomes, materials, activities, and assessments should be maintained from one offering to the next, and revisions should be made through intentional collaboration with the course design team [18]. Developing, organizing, and maintaining course content can be labor intensive and time consuming. Using a master course design, instructors can collectively save time and effort, and thus be able to focus more on facilitating student learning, personalizing instruction, and scaffolding student learning. Sharing is the heart of multi-faculty master course design.
2.4 Benefits for Students
There are many benefits when students are provided with an opportunity to be engaged in a learner-centered approach to course design. A course that is mapped out with learning objectives that describe measurable outcomes, and with assessments that measure the stated learning objectives, can help students develop a strong foundational understanding. Accessibility is another key component, including providing content in multiple modes for visually impaired students and captioning for hearing-impaired students. Students who use assistive technologies or need to locate a menu item within the structure of the master course design will already know where to access the information because they learned its location at the beginning. Having items with the same name and in the same location facilitates learning and enhances recall. It also facilitates ease of access for the students. Students benefit when educators intentionally design courses that allow students to engage and collaborate in meaningful and relevant ways. Design, management, and communication skills are relevant for future employment [19]. The student’s experience with consistency usually leads to course satisfaction.
2.5 Intentional Design
The importance of student satisfaction and intentional design is perhaps best highlighted by two of the leading quality assurance frameworks for online learning. The first is the Quality Matters (QM) rubric [20], one of the most widely adopted sets of best practices for the faculty-driven peer-review process used to ensure the quality of online and blended course design. The Sixth Edition of the QM Higher Education (QM HE) Rubric is organized by eight General Standards: (a) Course Overview and Introduction, (b) Learning Objectives, (c) Assessment and Measurement, (d) Instructional Materials, (e) Learning Activities and Learner Interaction, (f) Course Technology, (g) Learner Support, and (h) Accessibility and Usability. Organized under the eight General Standards are 42 Specific Review Standards (SRS). Each standard on the QM HE Rubric has an assigned value of 3 points (essential), 2 points (very important), or 1 point (important). However, the standards are not the rubric, as each standard must be considered in conjunction with its assigned annotation, which provides examples and detailed guidance for faculty peer reviewers. The second is the Online Learning Consortium’s (OLC) 5 Pillars of Quality Framework, which consists of learning effectiveness, scale, access, faculty satisfaction, and student satisfaction [21, 22].
The OLC’s website currently describes the student satisfaction pillar as “students are pleased with their experiences in learning online, including interaction with instructors and peers, learning outcomes that match expectations, services, and orientation” [23].
2.6 Students’ Perspectives
Students’ perspectives provide invaluable, first-hand insights into their experiences and expectations [24]. The quality of online courses and students’ satisfaction with them are important since students’ perceptions have a direct impact on their learning and motivation [25]. Although student perceptions of quality and satisfaction are associated with QM-certified courses, research on student perceptions of how QM standards impact their learning and engagement in online QM-certified courses is limited. Therefore, it is important to understand students’ perceptions of how master course designs impact their learning and engagement. With the massive growth in the number of university students taking online courses, course quality in terms of improving student engagement and learning is an important goal for HBCUs [4]. The student perspective is especially important when new teaching approaches are used and when new technologies are being introduced [26–28]. With the renewed interest in “active” education in general [29–32] and the flipped classroom approach in particular [33–39], along with extraordinary shifts in technology, the student perspective of course design is profoundly important.
2.7 Course Design Impact
Designing for impact can increase students’ interaction and improve students’ overall learning in online courses [40]. Numerous researchers have found that the design of the online course establishes the structure for the development of quality interactions, meaningful social connections, and the nurturing of the learning community [41–43]. The study in [44] examined the relationships among online course design features, course organization and presentation, learning objectives and assessment, and interpersonal interaction and technology. It found that design features influenced student performance and that interaction affected student grades. According to [45] and [46], improved online course design has been shown to contribute to improved student learning. Course design can impact learners’ persistence in the online learning environment (OLE) [43, 47, 48]. Intuitive design can make the achievement of learning objectives simple and foster optimal performance [48], whereas poor design can impose artificial barriers to attaining learning objectives and can cause frustration for learners [49].
3 Methodology
3.1 Context
Norfolk State University is a Historically Black College and University (HBCU) in Norfolk, Virginia. The Computer Science Department offers two service courses with approximately one thousand enrollments per semester. The computer science general education courses offer 32 sections each academic year (20 in the Fall, 11 in the Spring, and 1 in the Summer), with most of the classes having 40–45 students per class regardless of the delivery mode: face-to-face (f2f), hybrid, or online. The computer science general education courses have been identified as gateway courses at many institutions, including Norfolk State University. Therefore, the courses need to provide high standards of instruction that contribute to student success, consistency, and persistence. One of the service courses, Advanced Computer Concepts, is the only Quality Matters certified course at the University. The course is student centered and uses a master course template of learning outcomes, learning activities, materials and resources, assignments and assessments, and summary. The course utilizes a learning management system (LMS) for getting started, announcements, course information, course content, discussion, communication, learner tools, university services, grading, and other engagement and interaction. After a comprehensive literature review, an instrument was created to measure students’ perceptions about taking the Advanced Computer Concepts course at the university. Items were designed to capture students’ overall acceptance of the master course design as well as what worked well and feedback on diversity, inclusion, and equity in the course. Demographic information was gathered to determine the effects of gender, age, classification, and years of online experience on students’ overall acceptance of the master course design.
3.2 Participants
The participants for the study were anonymous student volunteers who were enrolled in the four QM certified sections and the other four non-QM certified sections during the 2021 spring and fall semesters. The age of the participants ranged from 18 years old to greater than 50 years old. The University’s latest Fact Book shows a headcount of 5,2045; 84% of its students are Black, 3.8% are White, and 12.2% are classified as other (includes international and unknown). Sixty-six percent of the students are female, and 34% are male. In addition to serving a high proportion of students of color, the University enrolls many first-generation college students.
3.3 Instrument Development
To investigate the perspectives of students, a questionnaire with a Likert scale (Rating: 1 = Strongly Disagree, 2 = Disagree, 3 = Neither Agree nor Disagree, 4 = Agree, 5 = Strongly Agree) was developed for participants to rate their experiences utilizing a STEM master course model for the Computer Literacy course at NSU. Participation in the study was voluntary, and all data were stored confidentially. None of the participants’ identities were exposed. The survey instrument was developed to assess students’ most meaningful and relevant learning experiences. There were four demographic questions and 16 questions about the participants’ perspective of course expectations; accessibility; diversity, inclusion, and equity; and collaboration and interaction in the master course model.
To ensure the survey focused on the needs of the students and for continuous improvement purposes, three open-ended questions were asked of all participants: (a) “What did you like best about the course?” (b) “What would you suggest changing about the master course?” and (c) “Do you have any comments or feedback that you would like to add?” The descriptive narratives provided by the students were essential, as they conveyed more detailed information regarding the students’ perceptions of the impact of the course and their voices on implementation and processes for continuous improvement.
3.4 Data Collection
Data for the project were collected in spring 2021 and fall 2021. The study was approved by the University’s Institutional Review Board (IRB). Data were collected from Group 1 (QM Participants) and from Group 2 (Non-QM Participants) using a survey. The surveys were distributed online using the university Learning Management System (LMS). These efforts resulted in a total of 306 participants across Group 1 (n = 207) and Group 2 (n = 99). A majority of participants self-identified as female: 76% from Group 1 and 82% from Group 2. Students’ ages ranging from under 20 to 59 years were represented in the survey results. As exhibited in Fig. 1, more than 56% of the participants self-identified in age range 20–29 for both groups. Of the participants that self-identified their classification, juniors comprised the largest share at 36% in Group 1, and freshmen at 41% in Group 2.
[Figure: four pie charts. Panels: Quality Matters Course Age Range; Non-Quality Matters Course Age Range; Quality Matters Classification; Non-Quality Matters Classification. Age categories: Under 20, 20–29, 30–39, 40–49. Classification categories: Freshmen, Sophomore, Junior, Senior.]
Fig. 1. Demographic Information of Sample
3.5 Data Analysis
For data analysis, descriptive statistics (averages, i.e., means) were derived using Microsoft Excel after initial data inputting, cleaning, and coding (where needed) were completed. The answers to the open-ended questions were collected and studied verbatim.
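The kind of descriptive summary described here can be sketched in a few lines of Python. This is only an illustration, not the procedure used in the study (which relied on Microsoft Excel): the response values are hypothetical, the group labels and the `summarize` helper are names introduced for the example, and treating ratings of 4 or 5 as "agreement" is an assumption about how a percentage could be derived from the 5-point scale.

```python
from statistics import mean

# Hypothetical ratings on the 5-point Likert scale
# (1 = Strongly Disagree ... 5 = Strongly Agree). NOT study data.
responses = {
    "group_1": [5, 4, 4, 3, 5, 2, 4],
    "group_2": [4, 3, 2, 4, 5, 3, 1],
}


def summarize(scores):
    """Return sample size, mean rating, and percent of ratings that are 4 or 5.

    Counting 4 and 5 as agreement is an assumption made for illustration.
    """
    agree = sum(1 for s in scores if s >= 4)
    return {
        "n": len(scores),
        "mean": round(mean(scores), 2),
        "pct_agree": round(100 * agree / len(scores)),
    }


summary = {group: summarize(scores) for group, scores in responses.items()}
for group, stats in summary.items():
    print(group, stats)
```

Reporting both the mean and the share of agreeing responses keeps the summary readable when the two groups differ in size, as they do in this study (n = 207 and n = 99).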
4 Results
It is important to consider and understand how HBCU students perceive and describe their learning experiences with or without a QM certified course. Considering the results as seen in Table 1, it is recommended to use the experiences of students in preparing a master course model design that targets continuous improvement.

Table 1. Student Perceptions of a Master Course Model Design
Questions | QM (Group 1) Impact | Non-QM (Group 2) Impact
1. Was course easy to follow | 86% | 79%
2. Course met my expectations | 81% | 64%
3. Enjoyed the course | 65% | 52%
4. Prefer to have all the contents in one place and scroll to each component for that week | 82% | 69%
5. Prefer to click in and out of the learning activities, materials and resources, and assessment weekly folders | 67% | 64%
6. Was able to meet and work with other students | 29% | 34%
7. There were enough audio and video in the course | 74% | 68%
8. There were enough assessments in the course | 88% | 77%
9. The accessibility features were appropriate for the course | 85% | 77%
10. The course included diversity, inclusion, and equity | 67% | 56%
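
The group comparison in Table 1 can be made explicit as percentage-point differences. The sketch below is illustrative only: the percentages are transcribed from Table 1, the short item labels are abbreviations introduced here, and a simple difference says nothing about statistical significance, which the study does not report.

```python
# Percentages transcribed from Table 1 as (QM Group 1, Non-QM Group 2).
# Item labels are shortened for this example.
table_1 = {
    "Easy to follow": (86, 79),
    "Met expectations": (81, 64),
    "Enjoyed the course": (65, 52),
    "Prefer all contents in one place": (82, 69),
    "Prefer clicking in and out of folders": (67, 64),
    "Able to meet and work with other students": (29, 34),
    "Enough audio and video": (74, 68),
    "Enough assessments": (88, 77),
    "Accessibility features appropriate": (85, 77),
    "Included diversity, inclusion, and equity": (67, 56),
}

# Positive gap: the QM group reported the higher percentage.
gaps = {item: qm - non_qm for item, (qm, non_qm) in table_1.items()}

largest_gap_item = max(gaps, key=gaps.get)
items_favoring_non_qm = [item for item, gap in gaps.items() if gap < 0]

for item, gap in sorted(gaps.items(), key=lambda pair: pair[1], reverse=True):
    print(f"{gap:+3d} points  {item}")
```

Read this way, the QM sections report higher percentages on nine of the ten items, with the widest gap on whether the course met expectations; the one exception is the item on meeting and working with other students.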
To allow students to share rich and detailed narratives, students responded individually to three open-ended questions: (1) What did you like best about the master course? (2) What would you suggest changing about the master course? (3) Do you have any other comments or feedback that you would like to add? The survey responses will guide the implementation strategy for the QM recertification and help improve and enhance the learning experiences for the students. The responses to the open-ended questions served as a foundation in understanding how the master course model affects the attitudes, perspectives, and practices of HBCU students. The majority of the students mentioned that they preferred to have weekly modules with all the contents in one place, allowing them to scroll to each component for that specific week. Listed below are the responses given by students for the open-ended questions from the survey:
“What I liked best about the course was the weekly set-up. It was organized and the flow was good. I was able to browse through the weeks and find the work that needs completion without having to scramble through tabs and different section to find things.”
“I like the communication for this semester, I also liked that the assignments were broke up week by week, on blackboard and being able to located everything in one place for that week.”
“I like the willingness of my professor to help me whenever I needed him the most, the openness of this course and everything easy to find and navigate through in one place. There were just enough assignments, there weren’t too many and there weren’t too few.”
“This course design on blackboard was setup perfectly to accommodate online students. Great job!”
The majority of the students noted that no changes to the course were needed, while a few students raised concerns about items that were already provided in the course, as follows:
“I would change absolutely nothing about the course. I personally enjoyed SAM it was fun and engaging, MindTap, Read Aloud, Respondus, and I loved the mobile app to get my eBook, flashcards, quizzes, grades, and due date reminders. I was able to know my grade immediately and being able to have more than one attempt was helpful and grateful.”
“I do not think any changes should be made to the pre-design course. I wouldn’t change anything. I started off in fear and now I am savvier than before.”
“Thanks to the setup of the instruction of the course.”
“I wouldn’t change anything, everything was lined up in the correct spots and it allowed us to find all the material helpful and easy. Great course and the way it was designed allowed us to be successful in any way!!”
“Relying on Respondus was hard. Get rid of Respondus. Mine never worked which made it hard to take tests. Too many issues with it. This class wasn’t very engaging at all. More interactivity and creativity in this class could make it more enjoyable. Not being able to receive second chances on work.”
“Should be much easier than what it is. Minimize the exam length and maybe extended deadlines.”
“I think extra credit should be given to students who did try all semester and being able to receive second chances on work.”
“The policy on late work, less assignment, extra credit, second chances and no software for proctoring exams.”
Most of the additional comments were positive remarks about the skills students gained and the design of the course.
“I retained a lot of information throughout this class. Moreover, I’ve definitely obtained skills that will help me in the future.”
“I think everyone should take this course so they can learn how to use different technology skills.”
“I really enjoyed this class, and it has strengthened my use of the computer and PowerPoint presentation.”
“Overall, this is a great course with a great learning environment for me as a first-time online student.”
“I enjoyed learning so much in just one course.”
The surveys revealed that more than half of the students in Group 1 and Group 2 reported that the instructional strategies and course design features were effective and helped students learn online and face-to-face. Specifically, video demonstrations, engaging activities, course design features, consistent structure, accessibility, a read-aloud electronic book, various resources and learning activities, and a mobile app were found to be effective. Survey responses that pertained to teaching or learning resources were noted and subjected to additional review. Concerns cited by more than one student within each course were recognized as issues to be addressed. Based on the responses to the open-ended questions, modifications needed for continuous improvement and for preparation for the Quality Matters recertification in spring 2022 were implemented.
5 Discussion
This study captures the voices of HBCU students on instructional strategies and course design, and its results can guide instructors, course designers, and students across all modalities of teaching and learning. The results suggested how course design and consistent structure can affect student success. Student feedback indicated room for improvement. Ease of following the course was among the highest-rated items for both Group 1 (QM) and Group 2 (Non-QM). Both groups of students preferred to have all the content in one folder for that week instead of clicking in and out of weekly learning activities folders, materials and resources folders, and assessment folders. When reflecting on what they liked best about the master course, students focused mostly on the design and organization, which caused them to be motivated and satisfied with the overall course. The students recognized the importance of an effective course structure and related that structure to the ease of accessing content. Interest in content is directly linked to motivation and, in turn, affects student learning. This finding is consistent with several studies that emphasize how students are more motivated by what they perceive as interesting content or content related to their jobs [50, 51]. In addition to content, the respondents also pointed out that their online classes provided them with all the resources they needed to succeed. The results indicated that one primary attribute of the students’ experience was that the master course model provided them with a comfortable learning environment. This is consistent with findings of previous studies (e.g., [52–54]) that also indicated that convenience and flexibility were key features of students’ perceptions.
Another essential attribute, according to the students, was that the master course model provided clear instructions on how to locate the various components of the course. This is consistent with [55], which indicated that quality of course design, clear instructions on how to get started in a course and where to find various course components, and design features that required students to interact with others were ranked high. These researchers concluded that instructors should make sure that the course design features that facilitate interaction are relevant, appropriate, and well-structured. The study also examines HBCU students’ experiences of a master course model in the specific context of the global COVID-19 lockdown. This study is among the first to contribute to the initial body of knowledge in that area. It is significant that the data collected on students’ perceptions validated theory through practice. The results held irrespective of demographic differences related to gender, undergraduate status, age, and online experience. This study was started in hopes of finding ways to improve both the design and instruction of the courses and to examine students’ perception of quality. When discussing students’ perceptions of quality, there is little clarity about the actual range of concepts because no integrated empirical study exists comparing major factors found throughout the literature [28]. Rather, there are practitioner-generated lists of micro-competencies, such as those of the Quality Matters consortium for higher education [20], or broad frameworks encompassing many aspects of quality beyond teaching [56]. While checklists are useful for practitioners and accreditation processes, they do not provide robust, theoretical bases for scholarly development [28].
6 Conclusion
The role of a master course design model in the learning process and its potential effect on HBCU students were examined. The study was limited by drawing data from only one HBCU. Incorporating more HBCU students’ voices and gathering feedback can assist in creating high-quality learning experiences for students of color. The study achieved its aim and objectives, with feedback indicating that students were generally interested in and satisfied with the master course STEM design model. The students thought that the course was user friendly, well organized, and stress free, with all course information ready on the first day of class. The continuous improvement processes for the course earned the QM Recertification Certified Badge. As a result of this achievement, Norfolk State University (as an institution) is recognized for 2022 QM-Certified Courses. It is increasingly essential for HBCUs to evaluate, validate, and communicate their online education quality to students. The realities of the COVID-19 pandemic allowed HBCUs to evaluate and implement more quality assurance systems for online education. The University recognized that HBCU students deserve quality online learning experiences and that processes and resources should be in place to support them. The COVID-19 pandemic has changed education forever, and many HBCUs will have online learning as the new norm going forward. Therefore, it is important to understand how students perceive the master course model. Research suggests that well-designed online courses correlate positively with student attainment of learning outcomes, increased student–content interaction, ease of course navigation, a decrease in student questions regarding course expectations, and higher student satisfaction [57]. Further research will advance an understanding of the overall experiences of students at HBCUs and of strategies that will broaden participation and improve the master course model. Data collection from more HBCUs across the nation will provide a more comprehensive understanding of student experiences in the STEM master course model.
References
1. Daly, S.R., Adams, R.S., Bodner, G.M.: What does it mean to design? A qualitative investigation of design professionals’ experiences. J. Eng. Educ. 101(2), 187–219 (2012)
2. Dick, W.: A model for the systematic design of instruction. In: Instructional design: International Perspectives, pp. 361–369. Lawrence Erlbaum Associates, Mahwah (1997)
3. Smith, K.M., Boling, E.: What do we make of design? Design as a concept in educational technology. Educ. Technol. 49(4), 3–17 (2009)
4. Allen, I.E., Seaman, J.E., Seaman, J.: Grade increase: Tracking distance education in the United States. Babson Survey Research Group (2018)
5. Means, B., et al.: Suddenly online: A national survey of undergraduates during the COVID-19 pandemic. Digital Promise (2020)
6. Kearns, L.R., Mancilla, R.: The impact of Quality Matters professional development on teaching across delivery formats. Am. J. Distance Educ. 31(3), 185–197 (2017). https://doi.org/10.1080/08923647.2017.1301145
7. Kamenetskiy, M.: Evaluating faculty perceptions of teaching practices in online asynchronous courses: An action research study. ProQuest Dissertations and Theses Global (2016)
8. Kennedy, A.: Faculty perceptions of the usefulness of and participation in professional development for online teaching: An analysis of faculty development and online teaching satisfaction. ProQuest Dissertations and Theses Global (2015)
9. Moore, M.G.: The theory of transactional distance. In: Moore, M.G., Diehl, W.C. (eds.) Handbook of distance education, 4th edn., pp. 32–46. Routledge, New York (2019)
10. Adair, D., Shattuck, K.: Ensuring quality while creating and innovating. In: Huntemann, N.B., Linder, K.E. (eds.) The Business of Innovating Online: Practical Tips and Advice From Industry Leaders, pp. 97–112. Routledge, New York (2023). https://doi.org/10.4324/9781003447641-8
11. Bazluki, M., Gyabak, K., Uderman, B.: Instructor feedback on a formal online course quality assurance review process. Online J. Distance Learn. Adm. 21(2) (2018)
12. McNeal, L., Gray, J.: A new spin on quality: broadening online course reviews through coaching and slow thinking. Online J. Distance Learn. Adm. 22(4) (2019)
13. Taylor, C., Roehrich, H., Grabanski, J.: External factors that impact online instructor performance: a study measuring the impact of decision fatigue & quality matters recognition of courses on online instructor evaluation. Online J. Distance Learn. Adm. 21(3) (2018)
14. Cavanagh, T.: The importance of intentional online program design. The evolllution: A Modern Campus Illumination (2020)
15. Oliver, R.L.: A cognitive model of the antecedents and consequences of satisfaction decisions. J. Mark. Res. 17, 46–49 (1980)
16. Bozarth, J., Chapman, D.D., LaMonica, L.: Preparing for distance learning: designing an online student orientation course. Educ. Technol. Soc. 7(1), 87–106 (2004)
17. Wegner, S., Holloway, K., Garton, E.: The effects of internet-based instruction on student learning. J. Asynchronous Learn. 3(2), 1–9 (1999)
18. Darr, K.: Why Use Master Shells to Manage Online Courses. https://teachonline.asu.edu/2018/02/use-master-shells-manage-online-courses/. 23 Feb 2018
19. Reiser, R.A.: A history of instructional design and technology, Part I. Educ. Tech. Res. Dev. 49(1), 53–64 (2018)
20. Quality Matters: Specific Review Standards from the QM Higher Education Rubric, Sixth edn. Quality Matters (2020). https://www.qualitymatters.org/sites/default/files/PDFs/StandardsfromtheQMHigherEducationRubric.pdf
21. Baldwin, S., Ching, Y.H., Hsu, Y.C.: Online course design in higher education: a review of national and statewide evaluation instruments. TechTrends 62(1), 46–57 (2018)
22. Moore, J.C.: Elements of Quality: The Sloan-C Framework. Sloan Center for Online Education, Needham, MA (2002)
23. Sloan Consortium: The 5 Pillars. https://onlinelearningconsortium.org/5-pillars/ (2018)
24. Dawson, P., et al.: What makes for effective feedback: staff and student perspectives. Assess. Eval. High. Educ. 44(1), 25–36 (2019)
25. Davies, R.S., Howell, S.L., Petrie, J.A.: A review of trends in distance education scholarship at research universities in North America, 1998–2007. Int. Rev. Res. Open Distance Learn. 11(3), 42–56 (2010)
26. Arthur, L.: From performativity to professionalism: lecturers’ responses to student feedback. Teach. High. Educ. 14(4), 441–454 (2009)
27. Crews, T., Butterfield, J.: Data for flipped classroom design: using student feedback to identify the best components from online and face-to-face classes. High. Educ. Stud. 4(3), 38–47 (2014)
28. Van Wart, M., Ni, A., Ready, D., Shayo, C., Court, J.: Factors leading to online learner satisfaction. Bus. Educ. Innov. J. 12(1), 15–24 (2020)
29. Arruabarrena, R., Sánchez, A., Blanco, J.M., et al.: Integration of good practices of active methodologies with the reuse of student-generated content. Int. J. Educ. Technol. High Educ. 16, 10 (2019)
30. Kay, R., MacDonald, T., DiGiuseppe, M.: A comparison of lecture-based, active, and flipped classroom teaching approaches in higher education. J. Comput. High. Educ. 31, 449–471 (2019)
31. Nouri, J.: The flipped classroom: For active, effective and increased learning – Especially for low achievers. Int. J. Educ. Technol. High. Educ. 13, 33 (2016)
32. Vlachopoulos, D., Makri, A.: The effect of games and simulations on higher education: a systematic literature review. Int. J. Educ. Technol. High. Educ. 14, 22 (2017)
33. Flores, Ò., del-Arco, I., Silva, P.: The flipped classroom model at the university: analysis based on professors’ and students’ assessment in the educational field. Int. J. Educ. Technol. High. Educ. 13, 21 (2016)
34.
Gong, D., Yang, H.H., Cai, J.: Exploring the key influencing factors on college students’ computational thinking skills through flipped-classroom instruction. Int. J. Educ. Technol. High. Educ. 17, 19 (2020) 35. Lundin, M., Bergviken Rensfeldt, A., Hillman, T., Lantz-Andersson, A., Peterson, L.: Higher education dominance and siloed knowledge: a systematic review of flipped classroom research. Int. J. Educ. Technol. High. Educ. 15, 20 (2018) 36. Maycock, K.W.: Chalk and talk versus flipped learning: a case study. J. Comput. Assist. Learn. 35, 121–126 (2019) 37. McGivney-Burelle, J.: McGivney-Flipping Calculus. PRIMUS Problem. Res. Issues Math. Undergraduate Stud. 23(5), 477–486 (2013) 38. O’Flaherty, J., Phillips, C.: The use of flipped classrooms in higher education: a scoping review. Internet High. Educ. 25, 85–95 (2015) 39. Tucker, B.: The flipped classroom. Educ. Next 12(1), 82–83 (2012) 40. Knapp, B., Paull, J.: Measuring the impact on learner engagement in the redesigned blended course using Quality Matters Standards. In: 2013 Quality Matters Conference, Naperville, TN (2013) 41. Conceicao, S., Lehman, R.: Persistence model for online student retention. In: Conceicao, S., Lehman, R. (eds.) Proceedings of EdMedia 2013--World Conference on Educational Media and Technology (2013) 42. He, Y.: Universal design for learning in an online teacher education course: enhancing learners’ confidence to teach online. MERLOT J. Online Learn. Teach. 10(2), 283–298 (2014) 43. Park, C.L., Perry, B., Edwards, M.: Minimizing attrition: strategies for assisting students who are at risk of withdrawal. Innov. Educ. Teach. Int. 48(1), 37–47 (2011)
918
D. A. Graham
44. Jaggars, S.S., Xu, D.: How do online course design features influence student performance? Comput. Educ. 95, 270–284 (2016) 45. Harkness, S.S.: Program administration and implementation of an online learning initiative at a Historically Black College University: a case study [Webinar]. In: EDUCAUSE/Quality Matters Online and Blended Learning: Institutional Case Studies on Implementing a Quality Assurance Program and Designing Research on Effective Practice Webinar Series (2014) 46. Bogle, L., Sc Day, D., Matthews, K Swan: The power of a collaborative, collegial approach to improving online teaching and learning. In: Shattuck, K. (ed.) Assuring Quality in Online Education: Practices and Processes at the Teaching, Resource, and Program Levels, pp. 110– 123. Routledge, New York (2023). https://doi.org/10.4324/9781003443124-10 47. Bekele, T.A.: Motivation and satisfaction in internet-supported learning environments: a review. J. Educ. Technol. Soc. 13(2), 116–127 (2010) 48. Poll, K., Widen, J., Weller, S.: Six instructional best practices for online engagement and retention. J. Online Doctoral Educ. 1(1), 56–72 (2014) 49. Hart, C.: Factors associated with student persistence in an online program of study: a review of the literature. J. Interact. Online Learn. 11(1), 19–42 (2012) 50. Barron, K.E., Hulleman, C.S.: Expectancy-value-cost model of motivation. In: International Encyclopedia of Social and Behavioral Sciences, vol. 8, pp. 503–509. Elsevier (2015) 51. Hulleman, C.S., Barron, K.E., Kosovich, J.J., Lazowski, R.A.: Student motivation: current theories, constructs, and interventions within an expectancy-value framework. In: Lipnevich, A.A., Preckel, F., Roberts, R.D. (eds.) Psychosocial skills and school systems in the 21st century. TSSHE, pp. 241–278. Springer, Cham (2016). https://doi.org/10.1007/978-3-31928606-8_10 52. 
Skordia-Worrall, J., Haghparast-Bidgoli, H., Batura, N.: Learning online: a case study exploring student perceptions and experience of a course in economic evaluation. J. Teach. Learn. High. Educ. 27(3), 413–422 (2015) 53. Harris, P.E.: Perceptions of online versus face-to-face learning of educational leadership graduate students. Eur. J. Educ. Sci. 1(1), 30–37 (2014) 54. Perreault, H., Waldman, L., Alexander, M., Zhao, J.: Graduate business students perceptions of online learning: a five-year comparison. Delta Pi Epsilon J. 50(3), 164–179 (2008) 55. Hixon, E., Buckenmeyer, J., Barczyk, C.: Closing the feedback loop: hearing the student voice in course quality. Qual. Approaches High. Educ. 6(1), 26–31 (2015) 56. Open & Distant Learning Quality Council. ODLQC standards (2012) 57. Alizadeh, M., Mehran, P., Koguchi, I., Takemura, H.: Evaluating a blended course for Japanese learners of English: why quality matters. Int. J. Educ. Technol. High. Educ. 16(1), 1–21 (2019)
On the Use of Blogging in the Classroom of English for Specific Purposes in Times of COVID-19 to Promote Written Skills: A Collaborative Approach

Ana Ibañez Moreno

UNED University, Paseo Senda del Rey 7, 28024 Madrid, Spain
[email protected]
Abstract. This study presents the results of a blogging project carried out at UNED, the National Spanish University of Distance Education, in the course English for Tourism II of the Degree in Tourism, during the lockdown in Spain in March–May 2020. This course, aimed at students who already have a B1 level, is intended to help them reach a B2 level. UNED has traditionally been based on blended learning, but because of the lockdown the course was taught fully online that year. Blogging was implemented as an extra activity. Students were invited to write an entry for the course blog, created with the tool Blogger. They also had to comment on each other’s posts, and there was a phase of peer review. Both a quantitative and a qualitative analysis were carried out of their final exams and of a mixed-type final post-questionnaire. Quantitative results show that the rate of success among the students who took part in this project was significantly higher than among those who did not. Qualitative results show a rise in the students’ motivation. Thus, blogging proved to be an interesting resource to promote extra language practice in a natural and motivating way, and participants reported that it was a good learning support during the lockdown period.

Keywords: COVID-19 Pandemic · English for Specific Purposes · Blogging
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): SAI 2023, LNNS 711, pp. 919–939, 2023. https://doi.org/10.1007/978-3-031-37717-4_60

1 Introduction

The use of the new information and communication technologies (ICT) in the foreign language classroom has been widely addressed in the last two decades. However, the demand for digital tools grew sharply with the coronavirus pandemic in 2020, which kept practically the whole world locked down at home. This paper is thus framed within the use of digital technologies in English as a Foreign Language (EFL) learning and in the context of the coronavirus pandemic.

The start of the 2020s has been marked by the COVID-19 crisis, and 2020 will probably go down in history as the year that changed the lives of most inhabitants of the world. The whole world was affected by a pandemic which made most countries go through a period of lockdown to cope with the negative impact of this virus on health. One of the consequences of this crisis is that it made us reflect on human life on this planet. By the end of 2020, online communication had grown dramatically. According to [11], there are no definitive data on the crisis caused by this devastating pandemic. What is certain, however, is that it changed how people communicate. Regarding education, the closure of schools and universities all over the world affected education systems, which came to rely on online learning.

Technologies in the foreign language classroom are not something new. In fact, as early as the 1990s [15] defined computer-assisted language learning (hereafter CALL) as a way of using computers to teach and learn languages: “The search for and study of applications of the computer in language teaching and learning” (p. 1). Since the beginning of the 21st century, digital competences have been considered part of present-day literacy, especially when it comes to language learning [10]. What the pandemic did was to accelerate the use of all kinds of digital resources and activities, as well as social media technology, in e-learning [1].

Providing online education and training that meets students’ new needs in a motivating and engaging way has been the great challenge, even for institutions such as UNED, which was founded in 1972 and is based on distance learning. Nowadays, UNED is the biggest university in Spain, with learning centers all over the world and almost 300,000 students. Its educational model is blended learning. This means that, even though UNED students develop digital skills and computer literacy, such as the ability to use computer applications for language learning and for communication [25], they can also rely on onsite activities such as weekly mentoring sessions and the final exam.
Therefore, as already mentioned, the COVID-19 crisis was a great challenge, and new modes of teaching and evaluation had to be developed and put into practice in record time. In this context, this research emerged out of the need to attend to the growing demand of English for Tourism students to practice their written English production at home. More specifically, it presents the results of a blogging project that was carried out in the course Inglés II para Turismo (English for Tourism II), from the Degree in Tourism, during the lockdown in Spain in March, April and May 2020. This course, aimed at students who already have a B1 level (according to the Common European Framework of Reference for Languages, hereafter CEFRL, [7, 8]), is intended to help them reach a B2 level. These students were in the second year of the Degree in Tourism at the time of the experiment. The course is taught during the second semester. Therefore, in the academic year 2019–2020 it coincided with the period of full lockdown that Spain went through because of the COVID-19 pandemic, from March 14 to the beginning of June 2020. That semester, then, the course was exclusively online. Consequently, blogging was offered as an extra activity so that students could practice their writing skills with a social component. My aim was to see whether this extra activity had a positive impact on their achievement. The questions posed here are therefore: (1) Is blogging a good activity for online students of ESP to promote their writing skills? and (2) Is blogging a good activity to motivate online students of ESP? In this regard, [5] have shown that positive attitudes towards online learning have positive effects on learning achievement.
As early as the 1990s, [4, 22] and [30], among others, pointed out the close interrelation between the how and the what of learning. In this sense, the successful implementation of CALL programs depends heavily on the ability to engage students, and blogging has been used here as a tool for this purpose. Although blogging in the EFL classroom has been widely addressed ([19–21, 23], etc.), its use in online courses has not yet been studied in depth. Among the few works available, [13] did focus on online blogging, exploring Indonesian students’ perceptions of this resource, with highly positive results. Here, the aim is to continue this line of work and provide further evidence of its benefits. To answer the research questions posed above, a quasi-experimental study was carried out.

This paper is organized as follows. After the theoretical framework is described, the methodological aspects of this work are outlined. The discussion section then provides an analysis of the data obtained. Finally, the last section presents the conclusions, together with the limitations of this piece of research and some suggestions for future research.
2 Theoretical Framework

The term CALL dates to the previous century, when [15] defined Computer-Assisted Language Learning (CALL) as “the search for and study of applications of the computer in language teaching and learning” (p. 1). As mentioned in [26], CALL is an umbrella term for all technology-based ‘learning objects’ for second languages, which range from dedicated software to elements coming from seemingly distant domains. In line with this idea, [26] points out that the word computer is itself used as an umbrella term, since it also covers several other terms that have been coined to talk about the use of technologies in the language classroom:

Although the name includes computers, the term CALL embraces any application of ICTs to teaching and learning foreign languages. Two different terms such as CALI (Computer-Assisted Language Instruction) and CAI (Computer-Assisted Instruction) were used before CALL in the early 1980s [9]. Around the early 1990s, alternative terms such as TELL (Technology-Enhanced Language Learning) also emerged.

We are in the phase of the so-called Integrative CALL, according to [29]. This phase is described by [25] as follows:

• The aim of the last phase of CALL was to overcome the obstacles of language learning and teaching, and therefore to optimize the opportunities for integrating new technologies in language classrooms. Different educators and scholars tried to find more integrative ways of teaching instead of structure-based ones. Therefore, task-based approaches came into vogue which attempted to integrate learners in more authentic environments. Fortunately, developments and advances in technology provided these opportunities. In the mid-1990s, multimedia computers and the World Wide Web (WWW) were the basis of integrative CALL. Nowadays, it is very easy for all of the learners to click a mouse to access a plethora of multimedia resources on the Internet (pp. 257–258).

Many authors have pointed out the advantages of CALL, such as the use of multimodal and authentic material, the use of the internet and its connection to higher motivation, collaborative work, and the development of digital skills ([3, 29], etc.). In this sense, there are some studies on the relationship between teachers’ attitudes and their acceptance of technology. For example, [2] carried out a study which revealed that teachers strongly agreed with the positive impact of CALL programs: computers save time and effort, motivate students, and enhance their learning. In [5], a study confirmed the need for positive attitudes among teachers in order to motivate students in blended language learning courses to make use of the technologies available. Along these lines, several studies identify students’ positive attitudes towards CALL as a key factor in learning, as [26] reports:

• A number of studies investigated the students’ attitudes towards CALL (Heflin, Shewmaker, & Nguyen, 2017; Lin, Warschauer, & Blake, 2016; Lintunen, Mutta, & Pelttari, 2017; Pinto-Llorente, Sánchez-Gómez, García-Peñalvo, & Casillas-Martín, 2016; Wright, 2017). (p. 1841)

As stated by [26], previous research has also addressed the way in which teachers are implementing CALL tools in their classrooms. Moreover, according to many studies, the use of ICTs in the teaching and learning of languages leads to more student interaction and more collaborative work and teamwork ([14, 17, 28]), and provides students with very beneficial experiences. Another theoretical underpinning of this paper is, then, collaborative language learning.
The study [27] points out that it is the most efficient way to achieve learning:

• Since the second half of the 20th century, a number of learning theories emerging from Social Constructivism (Vygotsky, 1978) support the notion that knowledge and skills, including language and verbal communication, develop largely through student interaction. The concept of collaborative learning (Dillenbourg, 1999), which dates back to times before computers existed, has developed and diversified greatly with general pedagogies like problem-based learning, project-based learning, and specifically computer-based learning paradigms like Computer Mediated Communication, Computer Supported Collaborative Learning, and Collaborative Networked Learning, to name but a few. All of the above share the underlying idea that engaging together in a common activity is the most efficient way to achieve a shared learning goal.

CALL appears in this scenario, then, as a great opportunity to develop digital literacy skills such as the ones proposed by [10], reproduced in (1) below, in combination with language learning, for example by looking for possibilities to use the language in real situations and thus being better prepared for forthcoming real experiences:

1. Critical Thinking & Problem Solving
2. Collaboration & Communication
3. Creativity & Imagination
4. Citizenship
5. Student Leadership and Personal Development

Another important idea underlying online learning is that of students as protagonists of their own learning. This role has been emphasized by new learning approaches. It is even more relevant nowadays, when teachers’ digital literacy has become essential and has posed the greatest challenge of all [24]. As these authors suggest, the COVID-19 outbreak opened a new scenario in which teachers had to develop adequate digital literacy to teach online and to implement a current and innovative educational model, where they facilitate learning from a position of content expertise and let students become actors in their own learning. This is, thus, the role that was adopted by the teacher-researcher in the present study.

Finally, as regards blogging in the classroom to promote written production, this research aims to shed light on this area in an online environment and within a specific context of lockdown. Previous research includes [6], which focuses on motivation as a key aspect of blogging; [18], which deals with blogging in the same field of English for Specific Purposes (ESP) for adults, in this case English for Professional Purposes; and [20], which analyses blogging in English in Spanish compulsory secondary education. The novelty of the present study lies in the context in which blogging was applied, namely the lockdown period caused by the global pandemic.

As for the methodological approaches to online language teaching, [27] points to constructivism as an essential one. The authors remark that, since the second half of the 20th century, several learning theories emerging from Vygotsky’s Social Constructivism (in the 1970s) support the notion that knowledge and skills, including language and verbal communication, develop largely through student interaction. They also mention Dillenbourg’s essential concept of collaborative learning, from the 1990s, which has developed and diversified greatly.
Going further, [20] remarks that the two approaches that Web 2.0 uses and combines are Constructivism and the Communicative approach:

• The two methodological approaches, upon which Web 2.0 is based within the language teaching field, are social Constructivism (Bruner, 1996) and the Communicative approach (Richards & Rodgers, 2009; Nunan, 2005), where interaction is the key aspect in language learning. Task-based Learning (Willis & Willis, 2013; Salzmann, 2014) is included within the communicative approach since this is a teaching method based upon the use of communicative and interactive tasks as central units to plan and impart the contents of the subject (p. 1).

In line with [20], and within the communicative approach, we followed a task-based approach to achieve our objectives, as described by [12] and [16], who consider a task to be a communicative activity that takes place in the real world and whose goal is to achieve a specific learning objective. In this method, a task makes use of natural language and must emulate natural settings. Consequently, in the present blogging project the task students had to carry out was to write a post for the blog and to comment on their peers’ posts. Thus, they had to behave as real bloggers for a while.
3 Method

3.1 Type of Study

The design of this piece of research is quasi-experimental, given that the control and the experimental groups were not randomly assigned. Instead, the subjects of the experimental group were those who voluntarily participated in the blogging project. The criterion for their selection is therefore their willingness to participate. Likewise, the subjects selected for the control group are an equivalent selection of students who did not participate in the project. The independent variable under study was the blogging activity, and the dependent variable was the students’ results. It was therefore a non-equivalent groups design, with a potential confounding variable: attitude towards the course and willingness to do extra work, given that volunteering for more practice requires motivation. Thus, to safeguard the internal validity of the study, the sample of subjects in the control group was restricted to those who did not participate in the blogging project but who participated as actively as the experimental group in the other tasks and in the forums, which can be related to similar attitudinal factors. Moreover, there was another optional task in this course, an oral one, and the subjects of the control group were selected from among those who did that task. In this way, by selecting a comparison group that matches the treatment group, the members of both groups had comparable values for the potential confounders but different values of the independent variable.

3.2 Participants

First, it is important to state that this work complies with APA ethical standards in the treatment of its human sample and in the description of the details of that treatment (APA Certification of Compliance with APA Ethical Principles). Accordingly, the participants’ consent to the use of their data for this research was obtained, and their anonymity has been fully preserved.
The sample for this study consists of 70 students of the course English for Tourism II of the Degree in Tourism of UNED. This course enables them to reach a B2 level, according to the CEFRL [8]. This means that, regarding written reception, they will be able to: “…read with a large degree of independence, adapting style and speed of reading to different texts and purposes, and using appropriate reference sources selectively. Has a broad active reading vocabulary but may experience some difficulty with low-frequency idioms” (p. 60). With respect to written production, they aim to be able to: “write clear, detailed texts on a variety of subjects related to his/her field of interest, synthesising and evaluating information and arguments from a number of sources” ([8]; p. 75).

The students were divided into two similar groups: the control group (CG), composed of 35 students who did not participate in the blogging project but who participated actively in the other course tasks and, in particular, in another optional task (an oral task, which was graded but did not count towards the final mark); and the experimental group (EG), composed of the 35 students who volunteered for the blogging activity. The total number of students enrolled in the course was 291. Thus, a non-probability procedure, convenience sampling, was used, since participants were selected based on their willingness to take part. As already mentioned, to limit bias in the results and to keep the sample as homogeneous as possible, the control group was selected based on a positive attitude towards the course, so that confounding variables would not interfere.

3.3 Data Collection Instruments

Four instruments were used to collect the data: (1) participant observation (through the students’ activity in the forum and in the blog: https://gramaticainglesaturismo.blogspot.com/), (2) a writing task for the blog, (3) the final exam, to evaluate their final achievement, and (4) a mixed-type post-questionnaire, with both open and closed questions, which was administered to the subjects in the experimental group and which aimed to capture their rating of the blogging activities and of their own learning.

3.4 Procedures

The project was carried out during the second semester of the 2019–2020 academic year, which in Spain coincided with the lockdown period, from March until June 2020. More specifically, the project started in April 2020 and lasted four weeks. Students had two weeks to send their post for the blog (http://gramaticainglesaturismo.blogspot.com/), and after this period they could peer-review other posts and comment on the published ones. The blog has existed since 2012, but before this project it was used as a support to put students in contact with natural language settings: the teaching team posted articles, and students could comment on them. In 2020 we started this blogging project, in which students were invited to participate by sending a post (with pictures or videos). The call for participation and the instructions, as announced in the forum, are reproduced in Fig. 1:
Fig. 1. Call for Participation in the Blog
The teacher-researcher corrected the texts produced by the students, returned them in the forum with feedback, and then posted them in the blog. The teacher then sent the link to each post in the forum. The posts were published under the authors’ names with their consent. In fact, some students showed their enthusiasm about seeing their work in the blog and gaining some public visibility. Fellow students, as well as the general public, could then comment on the blog directly. To motivate students, the teacher reported the number of visits to the blog every week, so that students could have an idea of the most popular posts. An example is provided in Fig. 2:
Fig. 2. Example of Interaction in the Blog Thread
Students could only send one post per person, due to the high number of students enrolled in the course (289). By the deadline, April 17, 27 students had sent their posts. Some excerpts (without corrections) are given below:

Amsterdam was supposed to be only two days, but the memories remain, by Student X.

I was turning 21 years old the day I put my first step in Amsterdam. That was the first time I was travelling completely alone. It was me, my phone with downloaded maps of the city in it, a selfie stick, and gloves, because it was December of 2019. I was just starting to see myself as an actual traveller, I always tried to be a traveller, not a tourist, and that was my 10th country since I moved to Europe from South America, and it took me about two years to get there. I always said that I wanted to be old enough, mature enough and free enough to live the Amsterdam experience. I arrived by train to Amsterdam Centraal at 9.30 from Gare de Bruxelles-Nord. My hostel was only 8 blocks away from the station so that wasn’t difficult to accomplish but from that moment already everything felt so comfortable. Side note number 1: 10 h later I’ll realize that everything in Amsterdam is 8 or 10 blocks away from everything. (…).

Travel Apps, by Student Y.

I am a techy person, I am always looking for new gadgets, inventions and different ways to use technology. There’s something all of us could use in our trips and it doesn’t sound Greek to us, I am talking about APPs. If you are a “traditional” traveler who loves to fold and save the city map in your backpack you could think “new brush sweeps clean but old broom knows all the corners” maybe reading this post you could give a chance to APPs and new technology. Here you have an APP list which very interesting (that I have already used, and they are highly recommended): #1 TravelSafe. This APP is not available in all the countries, but it is very useful. You can find a list of all the emergency numbers and embassies’ contacts. Anyway, don’t forget travel insurance to protect you against illness, theft, injury or cancellations. (…).

Because some students were still willing to participate, a new conversation was created, with a proposal for peer review. Students posted their texts there and other classmates could revise them. In this second forum, eight more students added their posts and comments, but these posts were neither revised by the teacher-researcher nor published in the blog. Instead, they were peer-reviewed by other classmates.
4 Discussion

In this section we examine different aspects of the blogging project, which provide data (both qualitative and quantitative) on the students’ progress and attainment. The data are combined to triangulate the study and support its validity. Thus, a correlation can be established between the students’ participation in the blogging project and their exam results. The areas under analysis are (1) participation in the forum and (2) the number of posts made by the students and their peer review, which can be correlated with (3) exam results and (4) data obtained from a subsequent post-questionnaire.

As for point (1), participation in the forum, Fig. 3 shows the number of messages posted in the general forum of the course. As can be seen there, 101 messages were sent to the Blog of the course thread, where the project was proposed. As for the second thread, a total of 34 messages were sent to the conversation, from eight volunteer participants posting new texts and another 11 commenting on and peer-reviewing them. This may also be evidence of how the situation caused by the COVID-19 pandemic affected student involvement in this online activity. It would explain why participation in the forums rose in general, as shown in Table 1, where the number of messages can be compared with that of the previous academic year, 2018–2019.

As can be observed, participation in 2020 was more than twice as high as in the year before (460 messages versus 195). In any case, what can be seen in 2020 is that participation in the two forum threads related to the blogging activity was much higher than in the other conversation threads. As can be observed in Fig. 3, most messages had to do with questions related to three elements, which were of paramount importance in this course: (1) the blog of the course, (2) the distance evaluation tasks (PECs in Spanish, Pruebas de Evaluación Continua), and (3) the final exams. The last two topics had to do with resolving doubts about how to do the continuous assessment task and the new online exam, which had to be created because of the pandemic and the subsequent lockdown.
A. I. Moreno
Fig. 3. Conversation Threads in the General Forum
Table 1. Student Participation in the Forums in 2019 and 2020 (Number of Messages Posted)

| Year | General forum | Student forum | Forum units 1–5 | Forum units 6–11 | TOTAL |
| --- | --- | --- | --- | --- | --- |
| 2019 | 168 | 1 | 12 | 14 | 195 |
| 2020 | 438 | 5 | 9 | 8 | 460 |
Thus, we may consider this a sign of autonomous learning, since it is related to problem-solving attitudes. Regarding point (2), the posts made by the students in the thread of the blogging project, those who sent their texts before the 17th of April (the deadline set for posting in the blog) wrote about the topics given below in Table 2. As can be seen, 27 students published their posts, adding pictures, and they all agreed to have their texts published in the blog under their real names. Moreover, some of them showed enthusiasm for seeing their texts on a public website. As already mentioned, to promote their motivation to work on this task in a natural way, they were informed several times of the number of visits their posts had received. In this sense, the most visited post, which also received the most comments, was New York and Dating: a real-life story and how to take advantage of the opportunities while you are traveling, with 357 visits by the end of the project. Another part of the task was to revise each other’s posts in the forum, and to comment on them in the blog itself, in order to work on interaction
On the Use of Blogging in the Classroom of English for Specific Purposes
Table 2. Titles of the 27 Posts Written by the Students by Thematic Area

Places to visit
• Seven countries in 30 days, Istanbul
• Amsterdam was supposed to be only two days, but the memories remain
• Panamá, the undiscovered paradise
• A tour through the medieval quarter of Vitoria
• My first trip to France: Strasbourg
• Italy getaway
• Cadiz’s white hill towns route
• My trip to Bologna
• A Thailand experience
• A trip to South America
• My first time in Italy
• Córdoba
• Lugo: an ancient place where you can eat for free
• My trip to Vietnam
• Our trip to Switzerland
• My trip in Castilla la Mancha and Lisbon

Opinions and essays
• It’s time to dream!
• Travel apps
• COVID-19, Tourism and Sustainability
• “Classical” tourism in the 21st century
• Traveling without moving

Personal experiences
• A childhood anecdote; My last trip home
• I stay in Mexico City
• To infinity, and beyond
• New York and Dating: a real-life story and how to take advantage of the opportunities while you are traveling
and natural language use. In this case, students preferred to comment on the posts in the blog itself, and they used the forum to peer-review each other. Table 3 below shows the number of comments in the blog for the commented posts:

Table 3. Comments made by Students to the Posts in the Blog

| Most commented posts | Comments |
| --- | --- |
| COVID-19, Tourism and Sustainability | 5 |
| Traveling without moving | 2 |
| New York and dating | 2 |
| To infinity and beyond | 1 |
| Travel apps | 1 |
| Córdoba | 1 |
| Total | 12 |
Interestingly, although most students wrote about places to visit (personal travel experiences, except for the post Seven countries in 30 days, Istanbul, which was the result
of desktop research, as the student himself reckoned), the most read and commented posts included other topics, related to opinions, anecdotes, or personal reflections. As for the peer review exercise, it started after the deadline to send posts to be published in the blog, and this did not have any time limitation. In the new forum thread, students could send “posts” and their peers could correct them. It is quite striking to see that they were more interactive in this practice, in a more “private” context. Eight students posted their texts there, and they all obtained feedback from other classmates, with a total of 11 peer reviews (some of them were revised by two people). An example of the students’ comments and positive interaction is provided in Fig. 4 below:
Fig. 4. Example of Students’ Peer Review
This shows a positive collaborative attitude, and it improved the students’ interaction and the course atmosphere. With regard to point (3), the final exam, it is used here as a measurable post-task which can give an indication of the students’ actual improvement thanks to the blogging activities. In this regard, Table 4 below shows the mean of the marks obtained by the experimental and the control group, both in June and in September. A column also indicates the number of students who ultimately did not take the exam, and consequently did not pass the course. As can be observed, the mean mark obtained by the experimental group in June is noticeably higher than that obtained by the control group. In September, just two students from the experimental group took the exam; both had already taken it in June, and both failed. As regards the control group, of the five students who did not take the exam in June, two took it in September and passed it with high marks (8,4/10 and 7/10), which is why the mean obtained in September is higher in the control group. Additionally, Table 5 below shows, more specifically, the marks obtained by all the students under analysis in June:
Table 4. Results of the Final Exams

| Group | Exam mark in June (mean) | Exam mark in September (mean) | Drop-out rate |
| --- | --- | --- | --- |
| Control group (CG) | 6,83 | 5,85 | 3 |
| Experimental group (EG) | 7,56 | 4 | 0 |
Table 5. Final Exams in June. Comparison between the Control and the Experimental Group

| Group | Absent | Fail (0–4,99) | C (5–6,9) | B (7–8,9) | A (9–9,9) | C. Laude (10) |
| --- | --- | --- | --- | --- | --- | --- |
| EG | 0 (0%) | 2 (5,71%) | 7 (20%) | 20 (57,14%) | 6 (17,14%) | 1 (2,86%) |
| CG | 3 (8,57%) | 2 (5,71%) | 10 (28,57%) | 17 (48,57%) | 3 (8,57%) | 0 (0%) |
This table shows clear evidence that the blogging activity led to an actual improvement in the written skills of the participants. We see that the EG obtained higher marks than the CG, even though the number of failed exams is equal, and low, in both cases. The graphs below, in Figs. 5 and 6, illustrate the differences more clearly:
Fig. 5. Mark Distribution of the Experimental Group in June 2020 (chart “Marks of the EG in June”; categories: Absent, Fail, C, B, A, Cum Laude)
Fig. 6. Mark Distribution of the Control Group in June 2020 (chart “Marks in the CG in June”; categories: Absent, Fail, C, B, A, Cum Laude)
As can be seen, even if the marks are quite similar, excellence is higher in the EG, with even one cum laude and double the number of A’s. If we compare these results with those obtained for the whole population, the 291 students, the differences are much greater; Table 6 below shows the data:

Table 6. Final Exams in June and September of the Whole Class (Population)

| Session | Absent | Fail (0–4,99) | C (5–6,9) | B (7–8,9) | A (9–9,9) | C. Laude (10) |
| --- | --- | --- | --- | --- | --- | --- |
| June | 69 (23,71%) | 37 (12,71%) | 71 (24,4%) | 86 (29,55%) | 18 (6,18%) | 3 (1,03%) |
| September | 65 (22,33%) | 18 (6,18%) | 20 (6,87%) | 3 (1,03%) | 0 (0%) | 0 (0%) |
Below, the graph in Fig. 7 illustrates the differences:
Fig. 7. Comparison of the Experimental Group Marks and the Whole Classroom (chart “Comparison of the EG and the whole classroom”; vertical axis from 0 to 60; categories: Absent, Fail, C, B, A, Cum Laude; series: June (population), June (EG))
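The comparison drawn from Tables 5 and 6 can be retraced with a short calculation. The sketch below is only illustrative: the names `BANDS`, `JUNE_COUNTS`, `share_at_or_above` and `summarize` are hypothetical, the counts are copied from the tables as printed, and each share is computed over the sum of the printed counts of its row, which need not coincide with the group sizes behind the printed percentages.

```python
# Grade bands in ascending order, as used in Tables 5 and 6.
BANDS = ["Absent", "Fail", "C", "B", "A", "Cum Laude"]

# June counts per band, copied from Table 5 (EG, CG) and Table 6 (whole class).
JUNE_COUNTS = {
    "EG": [0, 2, 7, 20, 6, 1],
    "CG": [3, 2, 10, 17, 3, 0],
    "Whole class": [69, 37, 71, 86, 18, 3],
}


def share_at_or_above(counts, band):
    """Percentage of students whose mark falls in `band` or a higher band.

    Assumption: the denominator is the sum of the counts in the row.
    """
    start = BANDS.index(band)
    return 100.0 * sum(counts[start:]) / sum(counts)


def summarize(counts_by_group, band="B"):
    """Share at or above `band` for every group, rounded to two decimals."""
    return {
        group: round(share_at_or_above(counts, band), 2)
        for group, counts in counts_by_group.items()
    }


if __name__ == "__main__":
    for group, share in summarize(JUNE_COUNTS).items():
        print(f"{group}: B or higher = {share}%")
```

Calling `summarize(JUNE_COUNTS, band="A")` gives the same comparison restricted to the top two bands, which is the "excellence" contrast discussed above.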
Finally, as for (4), data obtained from a post-questionnaire, this area of analysis consisted of a mixed-type questionnaire, with both closed and open questions. More specifically, it included ten closed 5-point Likert scale questions (where 1 meant not at all and 5 meant I am very happy with the activity or I totally agree) and one final optional open question where the participants could add any comments they wanted to make. It was delivered to the students who volunteered to participate in the blogging task, that is, the experimental group, in order to analyse their degree of satisfaction with it and also their awareness of their own learning process. The questions are reproduced below:
4.1 Written Skills
Please select the option that fits your ideas: 1. Nothing; 2. A bit, not enough; 3. Enough; 4. Quite a lot, I am satisfied; 5. A lot, I am very happy with the activity.
1. With this task I think I have worked my writing skills
2. With this task I think I have improved my writing skills
3. With this task I think I have worked my reading comprehension skills
4. With this task I think I have developed my reading comprehension skills
5. I have learned practical and useful extra vocabulary and expressions
4.2 Collaborative Work (Please, Answer Only if You Did Some Peer Review)
Please select the option that fits your ideas: 1. Not at all; 2. A bit, not enough; 3. Enough; 4. Quite a lot, I am satisfied; 5. A lot, I am very happy with the activity.
1. To correct my classmates has helped me reflect about my own learning and English
2. Collaborative work has helped me work on my own learning process
4.3 Methodology
Please select the option that fits your ideas: 1. Not at all, I totally disagree; 2. A bit, not enough; 3. Enough; 4. Quite a lot; 5. A lot, I totally agree.
1. The use of authentic materials such as a blog has motivated me to learn English
2. I have had to be creative, and this has helped me in my own learning process
3. Blogging was a good way to foster my motivation to practice my writing skills
4.4 Additional Comments (Your Opinion is Very Useful and Valuable to Us; It Will Help Us Improve!)
As can be observed, the questionnaire was divided into four sections, which were aimed at obtaining different types of data. Section A sought to establish how the participants assessed their own progress as regards written skills, Section B how they evaluated their collaborative experience, and Section C how they viewed the blog as a tool in the task-based classroom; finally, Section D was left open for non-directed comments, to obtain honest opinions on the overall experience. In some cases, the questionnaire was designed with similar questions with slight variations, to assess the validity of the responses. Thus, questions 1 and 2 differ in only one verb (worked versus improved), and so do questions 3 and 4 (worked versus developed). Given that this part of the project was optional, only 13 students filled in the post-questionnaire. The results, though, exceeded the author’s expectations. Appendix 1 includes the students’ responses to all 10 closed questions, most of them being 4 or 5. Below, Figs. 8, 9 and 10 illustrate the positive results, showing the answers to questions 1, 9 and 10. Only two questions, 4 and 7, obtained a response of 3 on the scale (one each). Therefore, in the light of these responses we can state that participants were satisfied with the overall task and with their performance in it. As for the open question, it was meant to confirm the results of the closed questions and to obtain any extra information that students would like to share. All the respondents added comments, although one of them was just “none” and another one was just “ok”, so they cannot be considered relevant for our purposes here. In (4) all the other comments are reproduced:
1. It was very interesting to read my classmates’ compositions
2. The blog and also correcting my classmates’ writings helped me to improve my writing skills
3. I like how the teacher motivates us all to do all the exercises.
4. I think they have been very useful and motivating activities for us
Fig. 8. Responses to Question 1 in the Post Questionnaire
Fig. 9. Responses to Question 9 in the Post Questionnaire
Fig. 10. Responses to Question 10 in the Post Questionnaire
5. I have liked me enough this task I have learned vocabulary and a describe cities
6. For sure, the best way to learn english and at the same time discover some new life experiences. I loved it!!
7. I definitely encourage this kind of activities as part of the course. I do believe it helps students and motivates us to keep on improving our writing, reading skills. Thank you very much.
8. Thank very much for your great work helping us to improve:)
9. Thank you for asking us for own opinion, and of course, thank you for your help
10. It was fun learn from classmates experiences
These comments clearly show that the respondents were motivated and satisfied with the task, because they found it interesting, felt that being involved was a great experience, and felt supported by the teacher.
5 Conclusion
In light of the results presented in the Discussion section above, we can confirm that blogging is a motivating activity that contributes to promoting written skills (especially production skills) in the learning of ESP in an online environment. In this sense, this study expands the work done by [18] by supporting the hypothesis that blogging is a very useful and successful resource to enhance collaborative written production in EFL in online settings, and to motivate students (as in [6]). In the context in which it took place, in which everything had to be done online and face-to-face relationships in general had to be suspended, this tool has proven to be a very interesting support for students to work collaboratively and not feel alone in their learning process. Thus, it has proven to be a good resource to help students in a difficult context where face-to-face lessons had to be cancelled, not only because of the writing of the post itself, but also because of the interaction in the forum and in the blog with other students, which also shows that collaborative approaches (as in [27]) in online settings give good results. This study has, nonetheless, some limitations: the duration of the task was very brief, and students could only send one post per person. Even if they later had the opportunity to comment on their peers’ posts, we consider this experiment a pilot study, to be replicated in more depth within a longer project, which will be carried out in the near future in order to assess the validity of the results presented here.
Appendix 1
Students’ Responses to the Closed Questions of the Post-Questionnaire

| Question | Responses |
| --- | --- |
| 1. With this task I think I have worked my writing skills | 4, 5, 5, 4, 5, 5, 5, 4, 5, 5, 5, 4, 5 |
| 2. With this task I think I have improved my writing skills | 4, 5, 5, 4, 5, 5, 5, 5, 5, 5, 5, 4, 5 |
| 3. With this task I think I have worked my reading comprehension skills | 4, 5, 5, 5, 5, 4, 4, 5, 5, 5, 5, 4, 4 |
| 4. With this task I think I have developed my reading comprehension skills | 5, 4, 5, 5, 5, 3, 5, 5, 5, 5, 5, 4, 4 |
| 5. I have learned practical and useful extra vocabulary and expressions | 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 4 |
| 6. To correct my classmates has helped me reflect about my own learning and English | 4, 5, 5, 5, 4, 4, 5, 5, 5, 4 |
| 7. Collaborative work has helped me work on my own learning process | 4, 5, 5, 5, 3, 4, 5, 5, 5, 4 |
| 8. The use of authentic materials such as a blog has motivated me to learn English | 5, 5, 5, 5, 5, 5, 4, 5, 5, 5, 5, 4, 4 |
| 9. I have had to be creative and this has helped me in my own learning process | 5, 5, 5, 5, 5, 4, 5, 5, 5, 5, 4, 4, 4 |
| 10. Blogging was a good way to foster my motivation to practice my writing skills | 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4 |
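The Likert responses in Appendix 1 lend themselves to a quick descriptive summary. The following is a minimal sketch rather than the author's analysis: `RESPONSES`, `describe` and `questions_with_score` are hypothetical names, and the assignment of each run of values to a question assumes that the value columns of the appendix appear in question order (1 to 10), which is consistent with the statement that only questions 4 and 7 received a response of 3.

```python
# Responses per closed question, transcribed from Appendix 1.
# Assumption: the ten runs of values appear in question order (1 to 10).
RESPONSES = {
    1: [4, 5, 5, 4, 5, 5, 5, 4, 5, 5, 5, 4, 5],
    2: [4, 5, 5, 4, 5, 5, 5, 5, 5, 5, 5, 4, 5],
    3: [4, 5, 5, 5, 5, 4, 4, 5, 5, 5, 5, 4, 4],
    4: [5, 4, 5, 5, 5, 3, 5, 5, 5, 5, 5, 4, 4],
    5: [4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 4],
    6: [4, 5, 5, 5, 4, 4, 5, 5, 5, 4],
    7: [4, 5, 5, 5, 3, 4, 5, 5, 5, 4],
    8: [5, 5, 5, 5, 5, 5, 4, 5, 5, 5, 5, 4, 4],
    9: [5, 5, 5, 5, 5, 4, 5, 5, 5, 5, 4, 4, 4],
    10: [4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4],
}


def describe(values):
    """Number of responses, mean score, and share of responses rated 4 or 5."""
    n = len(values)
    return {
        "n": n,
        "mean": round(sum(values) / n, 2),
        "pct_4_or_5": round(100.0 * sum(v >= 4 for v in values) / n, 1),
    }


def questions_with_score(responses, score):
    """Question numbers that received at least one response equal to `score`."""
    return [q for q, values in responses.items() if score in values]


if __name__ == "__main__":
    for question, values in RESPONSES.items():
        print(question, describe(values))
    print("Questions with a response of 3:", questions_with_score(RESPONSES, 3))
```

Questions 6 and 7 have fewer responses because they were to be answered only by students who had done some peer review.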
References
1. Ali, W.: Online and remote learning in higher education institutes: a necessity in light of COVID-19 pandemic. High. Educ. Stud. 10, 16–25 (2020)
2. Bordbar, F.: English teachers’ attitudes toward computer-assisted language learning. IJLS 4(3), 179–206 (2010)
3. Cabrini Simões, L.: An overview on the use of new technologies in English language teaching. Acta Scientiarum Hum. Soc. Sci. 29(1), 31–34 (2007)
4. Candy, M.: Self-direction for Lifelong learning. Jossey Bass, Los Angeles (1991)
5. Chirimbu, S., Tafazoli, D.: Blended learning: bridging the motivational gap in ESP purposes. In: 10th International Scientific Conference eLearning and software Education. Bucharest (2014)
6. Carney, N.: Language study through blog exchanges. Wireless Ready Symposium: Podcasting Education and Mobile Assisted Language Learning, 16–25 (2010)
7. Council of Europe: Common European Framework of Reference for languages: learning, teaching, and assessment. Cambridge University Press, Cambridge (2001)
8. Council of Europe: Common European Framework of Reference for Languages: Learning, Teaching, and Assessment. Companion Volume with New Descriptors. Cambridge University Press, Strasbourg (2018)
9. Davies, G., Higgins, J.: Computers, language and language learning. CILT, London (1983)
10. Dudeney, G., Hockly, N., Pegrum, M.: Digital Literacies. Research and Resources in Language Teaching. 2nd edn. Routledge, New York (2022)
11. Dushime, C., Hashemipour, S.: The psychological effect of the COVID-19 pandemic in Turkey and the world at the context of political psychology. Avrasya Sosyal ve Ekonomi Araştırmaları Dergisi 7(6), 75–86 (2020)
12. Ellis, R.: Task-based Language Learning and Teaching. Oxford University Press, Oxford (2003)
13. Fithriani, R., Rafida, T., Siahaan, A.: Integrating online blogging into EFL writing instruction: exploring students’ perceptions. In: Advances in Social Science, Education and Humanities Research (ASSEHR), volume 188. UNNES International Conference on English Language Teaching, Literature and Translation (ELTLT 2018), pp. 87–90. Atlantis Press, Indonesia (2019)
14. Kukulska-Hulme, A.: Group leadership in online collaborative learning. In: Howard, C., Boettcher, J.V., Justice, J., Schenk, K.D., Rogers, P-L., Berg, G.A. (eds.) Encyclopedia of Distance Learning, pp. 975–983. IGI-GLOBAL, Hershey, PA, USA (2005)
15. Levy, M.: CALL: Context and Conceptualization. Oxford University Press, Oxford (1997)
16. Littlewood, W.: The task-based approach: some questions and suggestions. ELT J. 58(4), 319–326 (2004)
17. Maina, E.M., Oboko, R.O., Waiganjo, P.W.: Using machine learning techniques to support group formation in an online collaborative learning environment. Int. J. Intell. Syst. Appl. 9(3), 26–33 (2017)
18. Martín Monje, E.: Interactive materials, collaborative work and web 2.0 in the context of English for specific purposes. In: Talaván, N., Martín Monje, E., Palazón, F. (eds.) Technological Innovation in the Teaching and Processing of LSPs. Proceedings of TISLID’10, pp. 101–114. UNED University Press, Madrid (2011)
19. Milliner, B.: Class Blogging in the EFL Classroom. Front. Lang. Teach. 6, 1–11 (2015)
20. Montaner-Villalba, S.: Written expression in English for specific purposes through blogging and cooperative learning. J. Teach. English Specific Acad. Purposes 8(3), 171–186 (2020)
21. Muhtia, A., Drajati, N.A.: Incorporating blogging into an EFL writing course: an action research. Issues Lang. Stud. 6(2), 31–44 (2017)
22. Rogers, E.M.: Diffusion of innovations. The Free Press, New York (1995)
23. Said, N.E., et al.: Blogging to enhance writing skills: a survey of students’ perception and attitude. Asian Soc. Sci. 9(16), 95–101 (2013)
24. Sánchez-Cruzado, C., Santiago Campión, R., Sánchez-Compaña, M.T.: Teacher digital literacy: the indisputable challenge after COVID-19. Sustainability 13(4), 1858 (2021). https://doi.org/10.3390/su13041858
25. Tafazoli, D.: Review of computer-assisted language learning: history, merits & barriers. In: Coombe, C., Khan, R. (eds.) Best Practice in ELT: Voices from the Classroom, pp. 255–265. TESOL Arabia Press, Dubai (2015)
26. Tafazoli, D., Gómez Parra, M.E., Huertas Abril, C.A.: A cross-cultural qualitative study on students’ attitudes towards computer-assisted language learning. Qualitative Rep. 25(5), 1841–1855 (2020)
27. Talaván, N., Ibáñez Moreno, A., Bárcena, E.: Exploring collaborative reverse subtitling for the enhancement of written production activities in English as a second language. ReCALL 29(1), 39–58 (2017)
28. Vázquez Cano, E., Sevillano García, M.L.: Educadores en red. Elaboración y edición de materiales audiovisuales para la enseñanza. Ediciones académicas-UNED University Press, Madrid (2011)
29. Warschauer, M., Kern, R.: Network-Based Language Teaching: Concepts and Practice. Cambridge University Press, Cambridge (2000)
30. Wenden, A.: Learner Strategies for Learner Autonomy. Prentice Hall, London (1998)
Using Virtual Reality Learning Environments to Improve Success for Online Students
Evelyn R. Sowells-Boone
North Carolina A&T State University, Greensboro, NC, USA
[email protected]
Abstract. Dramatic changes have occurred in the postsecondary educational landscape since the COVID-19 virus compelled educators and researchers to innovate in order to continue serving learners in safe ways. As a result, the Metaverse has become increasingly popular. Virtual Reality (VR) platforms that provide students with immersive experiences in the Metaverse have been documented to improve learning outcomes via engaging simulations. Even beyond its potential for more experiential learning, the Metaverse landscape offers unprecedented opportunities to improve the quality of online STEM education. As such, our university has launched a new initiative to study the impact of virtual reality in education.
Keywords: Virtual Reality · Cyberlearning
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): SAI 2023, LNNS 711, pp. 940–947, 2023. https://doi.org/10.1007/978-3-031-37717-4_61
1 Introduction
1.1 Background
While Virtual Reality (VR) as a teaching and learning modality has existed for close to 50 years, its implementation and adoption have only accelerated through the past decade. The first digital VR system was implemented as a training flight simulator for the U.S. Air Force in 1966 [1]. Since then, cost, logistics, and advancements in immersive technology have facilitated the growth of augmented learning and immersive VR, particularly as it relates to teaching and learning in the following areas: simulations and facilitation of practical skills (“training”), distance learning, and access to limited resources [2]. VR can be defined as “the sum of the hardware and software systems that seek to perfect an all-inclusive, sensory illusion of being present in another environment” [3]. Put most simply, VR operates by engaging users through head-mounted displays and works through immersion. According to Freina & Ott [4], immersion is the engagement of a user in a virtual environment in which they are surrounded “with images, sound, or other stimuli” and develop a sense of “being” in the task environment. As a teaching and learning modality, VR is grounded in the theory of constructivism, where the learner actively constructs knowledge through their own subjective representations and understanding of reality [5]. Constructivist learning strategies include situated and experiential learning whereby the user engages in a process of learning by doing. This theory is undergirded by the notion that learning is a process that involves
active construction by the learner and not passive acquisition of information. This calls into question conceptions of learning as “the mind as a container waiting to be filled”; constructivist theory posits, on the other hand, that a learner’s mind actively seeks out knowledge to satisfy its curiosity and makes connections between previous experiences and information to create new knowledge and associations [6]. The underlying notion that learners produce knowledge and form meaning based upon their experiences provides a helpful framework for understanding the mechanisms for why VR modalities work to deeply engage learners. In short, VR works to promote student learning by creating and offering immersive learning environments that directly expose learners to the material being studied. This direct experience of the metaverse world allows, according to constructivist theory, for the learner to derive meaning [7]. Attempts to better understand the potential value of VR in enhancing student learning via visualization and interaction have yielded encouraging results. As a training modality and simulator, VR appears to be highly effective in preparing practitioners and improving performance across many different disciplines [8–10]. Whitmer, Ullman, & Johnson [11] examined whether training in VR leads to performance improvements on the real-world task relative to a no-training condition and an active training control. Their results suggest that there are significant benefits to training in VR. Further, these researchers posit that the act of physically simulating the task during a VR training may not be the sole driving force behind a user’s learning. They cite related research [12, 13] that supports the notion that a process of “rich encoding” can complement VR learning to allow users to transfer their knowledge from the VR realm into the real world. 
This encoding can happen as users are afforded opportunities to make mistakes and errors – and thus learn from such faults – in the VR world that they would not be able to make in reality. In attempts to understand the mechanisms for why VR is effective in preparing users in training contexts and practical skill development, researchers have found that VR can engage learners in deeper and more meaningful ways, for longer durations than more traditional modalities. In their examination of Virtual Reality Learning Environments (VRLE) and the shift from Web-based and more conventional multimedia to more immersive and interactive VR learning environments, Huang, Rauch & Liaw [14] found that students who engaged with VR experienced increased time-on-task. This study corroborated earlier findings by Johnson [15], who examined an immersive learning environment for children ages 6–10 deemed the “NICE” project (Narrative Immersive Constructionist/Collaborative Environments). The NICE project provided an engaging setting where children could construct and cultivate simple virtual ecosystems, collaborate via networks with other remotely-located children, and create stories from their interactions in the real and virtual world. Observations of fifty-two youth participants while they were interacting with both each other and the virtual environment revealed that users had very little difficulty learning how to use the interface technology and remained engaged throughout multiple sessions and even after hours of participation. Researchers have also found that students who engage with VR for the purpose of learning content or developing skills enjoy their learning more [10, 16, 17] and experience deeper learning engagement and longer-term retention of knowledge [14]. As it helps to improve motivation, VR has also been shown to increase user enthusiasm for learning in both higher education [18] and K-12 [19].
Previous studies have shown that VR technology and platforms can increase accessibility in many postsecondary fields and offer the potential for greater inclusion [20]. A recent case study from Morehouse College, see Table 1 below, demonstrated that the use of virtual reality in online settings can significantly increase undergraduate success and engagement [21].

Table 1. Morehouse College Student Achievement, Engagement, Satisfaction Comparing Traditional, Online, and VR Metaversity Instruction 2021–2022

| Item of Study | Traditional | Online | Metaversity (VR) |
| --- | --- | --- | --- |
| Student Achievement: Final Grade Avg. (%) | 78 | 81 | 85 |
| Student Engagement 1: Essay Grade Average (%) | 80 | 78 | 90 |
| Student Engagement 2: Presentation Morale (%) | 98 | 90 | 100 |
| Student Satisfaction: Attendance Rate (%) | 80 | 88 | 90 |
| Average | 84 | 84.25 | 91.25 |
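The "Average" row of Table 1 is the unweighted mean of the four indicators in each column, which can be checked with a few lines. This is only an illustrative sketch; the names `SCORES` and `column_averages` are hypothetical, and the values are copied from the table.

```python
# Indicator values per instruction mode, copied from Table 1 in the order:
# final grade, essay grade, presentation morale, attendance rate (all in %).
SCORES = {
    "Traditional": [78, 80, 98, 80],
    "Online": [81, 78, 90, 88],
    "Metaversity (VR)": [85, 90, 100, 90],
}


def column_averages(scores):
    """Unweighted mean of the indicators for each instruction mode."""
    return {mode: sum(values) / len(values) for mode, values in scores.items()}


if __name__ == "__main__":
    for mode, average in column_averages(SCORES).items():
        print(f"{mode}: {average}")
```

Because the mean is unweighted, the four indicators contribute equally even though they measure different things (grades, morale, attendance), so the row is best read as a rough summary rather than a single performance score.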
2 Implementation
Given these promising results, our university will pilot immersive VR environments to improve student success for online students. This project will build on these initial findings of positive outcomes with data-driven results that focus not only on student achievement, engagement, and satisfaction, as in previous studies, but more specifically on retention and engagement. Additionally, capabilities unique to VR offer striking new possibilities for inclusion and diversity, such as the opportunity for students to create avatars with the appearance and abilities of their choice. We expect that the scalability of VR offers an additional benefit to any positive impacts on inclusion and equity that emerge from this study, as instructors can serve far more students through online instruction than through onsite classrooms.
3 Pilot Objectives and Methodology
Per the literature and anecdotal evidence from many STEM colleagues, students often either avoid majoring in STEM fields or change their majors from a STEM to a non-STEM major because they feel they cannot be successful in the STEM major. (In other words, they may be interested in STEM, but they feel there are significant barriers to their future success.) This feeling can come from a variety of factors, most often lack of preparation coupled with lack of self-efficacy and STEM identity (as a student and as a future STEM professional). This proposed project will address ways to overcome those barriers by providing support that shows participants that they can indeed be successful in STEM and gives them opportunities to explore and learn about options in STEM areas they may not be familiar with. This project is an expansion of our proven Recruit-Support-Connect
(RSC) program that increased enrollment, retention and degrees awarded to women in CoST in 2011–2013. Our three objectives for innovation in instruction and curriculum development, focused on increasing the production of high achieving underserved students who obtain STEM degrees, pursue STEM careers, and increase diversity in STEM, are outlined below:
Objective 1: Enrich online technology course offerings with virtual reality technology to bolster attraction and persistence by re-designing course materials to be used in immersive cyberlearning environments. Educational researchers are in constant pursuit of learning how to effectively transfer knowledge to the next generation, and currently cyberlearning is paramount. This is our quest, and several steps are required to implement this project. First, North Carolina A&T will contract for the creation of an online “metaversity” campus, with customized VR spaces designed to mirror the most iconic spaces on the institution’s campus as well as stock spaces that provide experiential environments ranging from lab classrooms to art history museums. The university will order 100 VR headsets to serve the students projected to enroll in the program. A library of over 7,000 modeled learning objects is available for interactive learning. Professors and educators will receive accelerated training in VR with a certificate development program for teaching in virtual environments. Virtual labs and teaching materials will be created and redesigned to enable an expedited and minimal path of learning for computing students as compared to the traditional teaching assets. The goals of this process are to facilitate teaching and learning and to minimize concern for cost, time, and risks as compared to the traditional labs and teaching materials. 
We will take advantage of pre-existing virtual assets as well as newly created customized virtual assets to recreate the physical assets in the strategically designed virtual reality environment. These virtual assets will mimic the operation of physical hardware and software processes to enable efficient transfer of knowledge from the physical world to the virtual environment. Once the assets are deployed, feedback will be gathered in real time during a testing and pre-analysis phase. The feedback gathered will be used to initiate modifications and updates before the initial launch to the online/distance learning students. This proposed cyclic human-computer interaction design process ensures that we maintain a framework for iterated improvements.
Objective 2: Strengthen online student engagement across all demographic groups by taking advantage of identity-flexible aspects of virtual reality technology along with experiential learning techniques and opportunities for online community building to increase efficacy and persistence. The pilot will use student-designed avatars to encourage self-expression, decrease social anxiety, and foster active participation; gamify teaching concepts to enhance experiential learning in low-stakes environments; and create virtual spaces for learning and socializing that offer diverse role models and encourage cross-cultural interactions. Social Skills Deficits can develop early in a person’s life and manifest alongside mental health disorders, with the origin of these deficits possibly ranging from genetics to traumatic events to physiological dysfunction, for example [22, 23]. People who are diagnosed with pervasive developmental disorders such as Social Anxiety, Autism, Public Speaking Anxiety, and Social Phobia exhibit Social Skills Deficits as common attributes
amongst them [23–25]. Recently, mental health treatment and clinical research have seen a rise in the use of VR, including practical clinical application as a therapeutic intervention to improve social inclusion [26–29]. The baseline educational pedagogy and methodologies have not kept up with the current needs of students, and this has resulted in a lack of interest, enthusiasm, and participation. The current generation of students has grown up interacting with high-end technology from an early age [30]. These students have difficulty attending to and focusing on sessions where the lesson is delivered in the traditional manner, for example by lecturing. Gamification is a novel teaching approach that combines game design elements with technology to teach concepts [31–33]. The popularity of educational games as teaching and learning aids in STEM subjects has grown sharply over the past decade or so. This technology-oriented pedagogical concept fosters active student engagement and reinforces independent learning while building problem-solving and critical thinking skills [30]. Students are involved in real-time activities in gamified VR/AR environments. The VR/AR environment also enables students to learn by experimenting, thanks to the low-stakes nature of the technology implementation. Applying gamification to teaching concepts in VR/AR environments yields an increase in learning and knowledge retention. VR enables another form of self-expression or presentation by allowing users to merge their physical body with the digital representation of their chosen avatar [34]. This selective self-expression or presentation is shown to be a vital instrument by which VR users construct, perceive, and experience their digital self [34]. As such, users tend to exhibit the phenomenon known as the Proteus effect, conforming to their avatar and its mental make-up.
This act of conforming influences their behavior, perception, and cognition, which can be manifested by choosing an alternate social identity [35, 36] or an avatar with limitless abilities or unrestricted expression in the digital world [37, 38]. By this mechanism, a student can create an avatar that gives them the freedom to self-express or present themselves in a manner that reinforces self-efficacy in an equitable and inclusive social, virtual environment.
Objective 3: Investigate the relationship between student perceptions of subject proficiency and levels of engagement and learning outcomes in the online VR courses. We will accomplish this by gathering and analyzing data on the relationship between perceptions of proficiency and underrepresented students’ success and future intention to pursue careers in STEM fields. We anticipate that a data set on the relationship between perceptions of subject proficiency and student success and career plans will inform postsecondary educators and strategists about future best practices. NC A&T is well positioned to lead this initiative. The Metaverse Convergence Engineering Laboratory’s focus is to address complex challenges faced by online communities by: 1) advancing and translating knowledge and discovery to create technologies, systems, processes and models; and 2) examining the intersection of the engineering metaverse with social, political, artistic, health and economic factors. The Metaverse is a 3D, virtual and immersive space that supports design, development, ideation and creation. The foundation of the lab is convergence engineering, which focuses on unifying multiple disciplinary inputs to give rise to radically new concepts and concrete outputs.
4 Conclusion
This pilot proposes a strategic initiative that cultivates STEM talent from an underutilized resource, online students, using an immersive virtual reality platform to foster inclusivity in STEM education. Studying the experiences, challenges, and triumphs of the pilot will advance future recruiters’ and researchers’ knowledge regarding which initiatives offer the most effective results for improving online STEM education. This pilot’s intellectual merit is in its use of cutting-edge strategies aimed at increasing the number of students in STEM thereby promoting innovation leading to economic growth while relieving the forecast US shortage of skilled STEM professionals. Additionally, the pilot will impact society by enhancing online infrastructure for research and education and improving faculty expertise and competitiveness. Understanding and addressing the fundamental barriers that online students face will contribute to decreasing the educational achievement gaps among US populations. This project will thus broaden immediate participation by attracting and guiding online students successfully into the STEM workforce and increase future participation in the STEM workforce via the lessons learned from this project.
References
1. Page, R.L.: Brief History of Flight Simulation. Proceedings of the SimTecT 2000, pp. 1–11 (2000). https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.132.5428&rep=rep1&type=pdf
2. Radianti, J., Majchrzak, T.A., Fromm, J., Wohlgenannt, I.: A systematic review of immersive virtual reality applications for higher education: design elements, lessons learned, and research agenda. Comput. Educ. 147, 103778 (2020). https://doi.org/10.1016/j.compedu.2019.103778
3. Biocca, F., Delaney, B.: Immersive virtual reality technology. Commun. Age Virtual Reality 15(32), 10–5555 (1995)
4. Freina, L., Ott, M.: A literature review on immersive virtual reality in education: state of the art and perspectives. Int. Sci. Conf. Elearning Softw. Educ. 1(133), 10–1007 (2015)
5. Fosnot, C.T.: Constructivism: Theory, Perspectives, and Practice. Teachers College Press (2013)
6. Schcolnik, M., Kol, S., Abarbanel, J.: Constructivism in theory and in practice. English Teach. Forum 44(4), 12–20 (2006)
7. Driscoll, M.: Psychology of Learning for Instruction. Allyn & Bacon, Boston (2000)
8. Rose, F.D., Attree, E.A., Brooks, B.M., Parslow, D.M., Penn, P.R.: Training in virtual environments: transfer to real world tasks and equivalence to real task training. Ergonomics 43(4), 494–511 (2000)
9. Seymour, N.E., et al.: Virtual reality training improves operating room performance: results of a randomized, double-blinded study. Ann. Surg. 236(4), 458–464 (2002)
10. Langley, A., et al.: Establishing the usability of a virtual training system for assembly operations within the automotive industry. Hum. Factors Ergon. Manuf. Serv. Ind. 26(6), 667–679 (2016)
11. Whitmer, D.E., Ullman, D., Johnson, C.I.: Virtual reality training improves real-world performance on a speeded task. Proc. Hum. Factors Ergon. Soc. Annu. Meet. 63(1), 1218–1222 (2019)
12. Bossard, C., Kermarrec, G., Buche, C., Tisseau, J.: Transfer of learning in virtual environments: a new challenge? Virtual Reality 12(3), 151–161 (2008)
13. Lemole, G.M., Jr., Banerjee, P.P., Luciano, C., Neckrysh, S., Charbel, F.T.: Virtual reality in neurosurgical education: part-task ventriculostomy simulation with dynamic visual and haptic feedback. Neurosurgery 61(1), 142–149 (2007)
14. Huang, H.M., Rauch, U., Liaw, S.S.: Investigating learners’ attitudes toward virtual reality learning environments: based on a constructivist approach. Comput. Educ. 55(3), 1171–1182 (2010)
15. Johnson, A., Roussos, M., Leigh, J., Vasilakis, C., Barnes, C., Moher, T.: The NICE project: learning together in a virtual world. In: Proceedings. IEEE 1998 Virtual Reality Annual International Symposium (Cat. No. 98CB36180), pp. 176–183. IEEE (1998)
16. Apostolellis, P., Bowman, D.A.: Evaluating the effects of orchestrated, game-based learning in virtual environments for informal education. In: ACE’14: Proceedings of the 11th Conference on Advances in Computer Entertainment Technology, pp. 1–10. ACM (2014). https://doi.org/10.1145/2663806.2663821
17. Ferracani, A., Pezzatini, D., Del Bimbo, A.: A natural and immersive virtual interface for the surgical safety checklist training. In: Proceedings of the 2014 ACM international workshop on serious games, pp. 27–32 (2014)
18. Sattar, M.U., Palaniappan, S., Lokman, A., Hassan, A., Shah, N., Riaz, Z.: Effects of virtual reality training on medical students’ learning motivation and competency. Pak. J. Med. Sci. 35(3), 852–857 (2019)
19. Hsu, Y.: Exploring the learning motivation and effectiveness of applying virtual reality to high school mathematics. Univ. J. Educ. Res. 8(2), 438–444 (2020)
20. Parong, J., Mayer, R.E.: Learning science in immersive virtual reality. J. Educ. Psychol. 110(6), 785–797 (2018). https://doi.org/10.1037/edu0000241
21. Clark, T., Hamilton, O., Morris, M., Vereen, E.: Transforming undergraduate education in the sciences and humanities with virtual reality: the case at Morehouse College. In: ICERI2021. Proceedings, ATLANTA, GA, USA, p. 3467 (2021)
22. Sülter, R.E., Ketelaar, P.E., Lange, W.G.: SpeakApp-Kids! Virtual reality training to reduce fear of public speaking in children–A proof of concept. Comput. Educ. 178, 104384 (2022)
23. Wang, X., Young, G.W., Mc Guckin, C., Smolic, A.: A systematic review of virtual reality interventions for children with social skills deficits. In: 2021 IEEE International Conference on Engineering, Technology & Education (TALE), pp. 436–443. IEEE (2021). https://doi.org/10.1109/TALE52509.2021.9678808
24. Beidel, D.C., Alfano, C.A., Kofler, M.J., Rao, P.A., Scharfstein, L., Sarver, N.W.: The impact of social skills training for social anxiety disorder: a randomized controlled trial. J. Anxiety Disord. 28(8), 908–918 (2014)
25. Hagopian, L.P., Kuhn, D.E., Strother, G.E., Van Houten, R.: Targeting social skills deficits in an adolescent with pervasive developmental disorder. J. Appl. Behav. Anal. 42(4), 907–911 (2009)
26. Carl, E., Stein, A.T., Levihn-Coon, A., Pogue, J.R., Rothbaum, B., Emmelkamp, P., et al.: Virtual reality exposure therapy for anxiety and related disorders: a meta-analysis of randomized controlled trials. J. Anxiety Disord. 61, 27–36 (2019)
27. Maples-Keller, J.L., Bunnell, B.E., Kim, S.J., Rothbaum, B.O.: The use of virtual reality technology in the treatment of anxiety and other psychiatric disorders. Harv. Rev. Psychiatry 25(3), 103 (2017)
28. Stendal, K., Balandin, S., Molka-Danielsen, J.: Virtual worlds: a new opportunity for people with lifelong disability? J. Intellect. Dev. Disabil. 36(1), 80–83 (2011)
29. Morina, N., Ijntema, H., Meyerbröker, K., Emmelkamp, P.M.G.: Can virtual reality exposure therapy gains be generalized to real-life? A meta-analysis of studies applying behavioral assessments. Behav. Res. Therapy 74, 18–24 (2015)
30. Zhao, D., Muntean, C.H., Chis, A.E., Rozinaj, G., Muntean, G.M.: Game-based learning: enhancing student experience, knowledge gain, and usability in higher education programming courses. IEEE Trans. Educ. 65(4), 502–513 (2022). https://doi.org/10.1109/TE.2021.3136914
31. Saxena, M., Mishra, D.: Gamification and gen Z in higher education: a systematic review of literature. Int. J. Inf. Commun. Technol. 17, 1–22 (2021). https://doi.org/10.4018/IJICTE.20211001.oa10
32. Landers, R.N.: Developing a theory of gamified learning: linking serious games and gamification of learning. Simul. Gaming 45(6), 752–768 (2015). https://doi.org/10.1177/1046878114563660
33. Menin, A., Torchelsen, R., Nedel, L.: An analysis of VR technology used in immersive simulations with a serious game perspective. IEEE Comput. Graphics Appl. 38(2), 57–73 (2018). https://doi.org/10.1109/MCG.2018.021951633
34. Freeman, G., Maloney, D.: Body, avatar, and me: the presentation and perception of self in social virtual reality. Proc. ACM Hum.-Comput. Interaction 4(CSCW3), 1–27 (2021). https://doi.org/10.1145/3432938
35. Ratan, R., Beyea, D., Li, B.J., Graciano, L.: Avatar characteristics induce users’ behavioral conformity with small-to-medium effect sizes: a meta-analysis of the proteus effect. Media Psychol. 23(5), 651–675 (2020). https://doi.org/10.1080/15213269.2019.1623698
36. Stets, J.E., Burke, P.J.: Identity theory and social identity theory. Soc. Psychol. Q. 63(3), 224 (2000). https://doi.org/10.2307/2695870
37. Bessière, K., Seay, A.F., Kiesler, S.: The ideal elf: Identity exploration in World of Warcraft. Cyberpsychol. Behav. 10(4), 530–535 (2007)
38. Kim, C., Lee, S.-G., Kang, M.: I became an attractive person in the virtual world: Users’ identification with virtual communities and avatars. Comput. Hum. Behav. 28(5), 1663–1669 (2012)
A RabbitMQ-Based Framework to Deal with Naval Sensor Systems Design Complexity
Paul Quentel1,2(B), Yvon Kermarrec1, Pierre Le Berre2, Ludovic Grivault2, and Laurent Savy2
1 IMT Atlantique, Lab-STICC, 29238 Brest, France
2 Thales Defence Mission Systems, Brest and Elancourt, France
Abstract. Naval sensor systems are complex due to their nature, the services they need to fulfil under rigorous constraints, and the information they require to be aware of their environment. Sensors are among the sources of the information that is collected, exchanged and synthesised, and their integration into a sensor systems architecture presents numerous challenges. In this context, it is difficult to visualise these systems as a whole, and we lack methods to abstract their complexity. We introduce an approach to evaluate different architectural concepts and their impact on the communication network. This paper presents our methodology, which involves: 1) an operational simulation, which brings us closer to real use cases; 2) a benchmark involving a middleware to simulate, in an abstract manner, communications between naval platforms, the operational environment, and sensors; 3) a network monitoring tool to experiment with new mechanisms that might be implemented in the final architecture.
Keywords: Communication Architecture · Sensor Network · Multi-Function Sensors · Naval Sensor Systems · RabbitMQ · Prometheus
1 Introduction
The evolution of the naval defence context requires a significant modification of sensor systems architecture to overcome future threats. For about ten years, the French Ministry of Defence (MoD) has been running R&D investigations and programs to enhance the operational capacities of surface ship combat systems. Nowadays, naval sensor systems cooperate and each platform shares its data with the others through the Combat Management System (CMS). Sensor management and tracking are performed autonomously and locally at the CMS level in each naval platform. Each surface ship keeps a local tactical situation within the CMS thanks to its sensors and a global tactical situation thanks to exchanges with other platforms through Tactical Data Links.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): SAI 2023, LNNS 711, pp. 948–959, 2023. https://doi.org/10.1007/978-3-031-37717-4_62
In 2021, the “Veille Coopérative Navale” (VCN), a new information system financed by the French MoD, proved its effectiveness in operations. The VCN project, led by France and the Netherlands, improves the tactical situation thanks to exchanges between platforms and local data fusion based on raw radar data. The Combat Cloud is a wide-ranging meshed network where data are shared between users, platforms and nodes (e.g. sensors, effectors). The U.S. Department of Defense (DoD) introduced the concept of “Network Centric Warfare”, the forerunner of the Combat Cloud, in the early 2000s [10]. The automation of the Combat Cloud remains a major challenge. In this context, the cloud is characterised by low latency, wide bandwidth and high resiliency.
1.1 Context and Issues
In the context of naval collaborative combat, we need a new naval sensor system architecture that will associate sensors from different platforms to carry out collaborative actions. The new architecture will have to enable collaboration between different networked sensors while taking advantage of data exchanges to accelerate data processing and enhance threat engagement. Multi-platform collaboration is a key point: it makes it possible to offer new sensor services (e.g. collaborative geolocation) that exist only thanks to these exchanges between naval platforms. The most important need is to obtain information about the enemy faster, in order to counter new threats that can undermine the fleet defence in various conditions (e.g. saturation, jamming, communication loss, material destruction). Numerous architectural issues appear in this context, such as sensor networking, sensor management, resource optimisation, improvement of data processing to increase data workability, communications bandwidth expansion, etc. The technical challenges are thus to develop sensor architectures that shorten decision time, improve the reliability of sensor systems and scale with a changing number of sensors. By means of scenarios, we aim to compare architectures to determine the most suitable one in terms of performance and adaptability against major threats.
1.2 Application and Industrial Contexts
Preliminary investigations, mandated by the French MoD, are under way on the cooperative subject; sharing radar plots between frigates is the main concern within the VCN. However, current systems have well-known limitations, as the quantity of data to be exchanged is growing faster than the communication bandwidth. Systems need to exchange ever-increasing data volumes without having the necessary bandwidth to do so. Currently, data coming from sensors integrated into the platform or from external platforms are transmitted to the combat system through a centralised architecture. A decision has to be taken at sensor level on what should be transmitted in order not to overload and saturate the communication channel. These
raw data could be aggregated and filtered in the architecture, or the channel bandwidth could be increased. The information exchanged within our systems must be processed in due time under the aforementioned constraints on bandwidth and latency, notably to engage future major threats (e.g. hypervelocity missiles). Furthermore, the definition of the architecture must answer the needs of the system users. The architecture must be resilient, scalable and modular, i.e. the system must be able to work even in degraded mode, to be extended to a larger heterogeneous fleet and to include new components. In addition, one major issue for naval sensor systems is to shorten the decision-making process [11] known as the OODA loop (Observe, Orient, Decide and Act). This problem can be partially answered by proposing a networked sensor system architecture with increased autonomy and resiliency. We have to make hypotheses on future sensors and on computing technologies in order to be able to design the architecture.
1.3 Objectives
In this paper, we outline an approach to evaluate architectural concepts using RabbitMQ and monitoring tools that we describe later. The proposed framework simulates the communication network of the system and allows architectures to be compared in order to choose the most efficient one in terms of resiliency, latency and throughput. The final aim of this work is to design and develop an architecture for future naval sensor systems; that architecture is not presented in this paper. The paper is organised as follows: the first section introduced the application context and industrial needs of our research. The second section covers the related work, to show the potential of our approach. The third section presents our contributions to the design of a new architecture for naval sensor systems. The fourth section provides the initial results that we have obtained. Finally, the fifth and last section summarises this paper and outlines upcoming work.
2 Related Work
Numerous teams have investigated architectures for terrestrial, aerial or naval military systems. Their research activities are directed towards combat cloud concepts or collaborative warfare. A US Navy approach, called CEC (Cooperative Engagement Capability) [1], aims at increasing the performance of the battle fleet in response to threats. Combat systems share sensor data associated with tracks, quickly and precisely, so that the group of ships, aircraft and ground units located on the battleground can act as a whole. The CEC enables radar data to be exchanged between platforms, so tactical situations are shared and synchronised. In addition, the system allows a coordinated engagement of the target between platforms. The CEC is interfaced with the platform CMS and integrated on
combat systems of rank Aegis and LHA, for example (Destroyer and Landing Helicopter Dock). The concepts presented in CEC are close to the French ones described in Sect. 1. On the aeronautical collaborative combat side, the FCAS (Future Combat Air System) is a joint project between the French, German and Spanish air forces, which is expected to emerge by 2040 [16]. The FCAS system of systems contains a combat cloud that connects airborne platforms and enables collaborative combat. A PhD research project on multi-agent-based architectures [7] addressed multi-sensor systems embedded on an airborne platform. The suggested architecture takes advantage of software agents: autonomous entities capable of communicating and making decisions in a software environment. Each agent corresponds to an object from the theatre of operations and proposes actions for sensors to perform; a scheduler then plans the actions of each sensor over time. This architecture works only on a single platform and its own sensors. Our future architecture will address these limitations by managing sensors while extending the scope to multi-platform and multi-domain (i.e. air, land, sea) needs. The CESMO project team (Cooperative Electronic Support Measures Operations) highlighted the lack of standards and interfaces in NATO forces [9]. The aim of CESMO is to enable cooperation between ESM sensors, while ensuring low bandwidth usage, in order to obtain better localisation of radar emitters. CESMO is mainly concerned with data sharing between allied platforms, and the decision is taken centrally on the main platform. In CESMO, data are exploited locally and sensors are not controlled by the system, whereas our research targets a distributed architecture with autonomous sensors. The optimisation of the bandwidth proposed in this project is a major constraint for our research.
The European project CAMELOT (C2 Advanced Multi-domain Environment and Live Observation Technologies) [3] proposes a distributed architecture aiming to control European borderlands against illegal immigration or drug smuggling. In this regard, the CAMELOT project team reviewed and classified existing architectures according to chosen criteria [15]. The motivations behind the choice of the new architecture concern the capacity to command and control several UxVs (Unmanned Vehicles) and sensors to deliver complex services. Furthermore, standardisation allows the straightforward integration of new modules or services. This architecture uses a middleware that lets different modules, services, assets or tasks interact using a publish-subscribe paradigm. The middleware provides facilities and services such as scalability, modularity, reliability, security and distribution, to mention a few. The report includes a study of the different middleware with respect to the architecture needs, after which the CAMELOT project chose to integrate the middleware RabbitMQ. Mali et al. [12] pointed out that the current implementation of RabbitMQ defines and configures the queues and the consumers when the application is
launched; as a consequence, the system does not scale dynamically when more messages are sent to already filled queues. These authors present a software tool that monitors messages between cloud components; this tool enables auto-scaling when problems are detected in a smart home system. They chose RabbitMQ for several reasons, such as its support for multiple messaging protocols, message queuing and delivery acknowledgement. In the proposed architecture, a service is responsible for the functionalities introduced by an end device; each service sends data on the publisher/producer side, and the consumers listen on a queue where the messages are stored. Furthermore, the paper provides a test environment using Zabbix as a monitoring tool; the authors tried different values for parameters such as prefetch or the number of consumers to determine the optimal configuration parameters and the limits of the system. In our context, we have selected RabbitMQ and monitoring tools to evaluate the network load while developing new concepts for naval combat dealing with limited access to bandwidth. As this state of the art shows, most investigations in the domain of collaborative combat focus on sharing data between platforms in order to have a better view of the battlefield. As far as we know, testing the network of architectural concepts in the military domain has never been done before. The framework presented in this paper is a precursor of upcoming work that will propose exchanging data between software agents in a distributed architecture; that work will be presented in a future paper.
3 Contributions
Our investigation is composed of three sub-works. The first step was to extend a proprietary Thales software tool called Sonia and to interface it with another software tool, the STK (Sensor Tasking). The second step was to use and experiment with the middleware RabbitMQ [6] as well as a performance testing tool [2]. The final step was to develop a benchmark using RabbitMQ. These three sub-works address the original objectives of our research: to abstract the complexity of our systems in order to design our future architecture and evaluate it.
3.1 Framework and Methodology
Our team has developed a simulator named Sonia. This software provides a simulated environment that represents the theatre of operations. In this environment, actors like naval or aerial platforms can be added and the systems can evolve and adapt to the operational context. Each platform can enable or disable its sensors and effectors; it is also possible to move the platform in space and so to create scenarios that can be replayed and analysed afterwards. The major benefit of this simulator for our work is that it lets us retrieve sensor data when a target is being detected and tracked. The software STK (for Sensor Tasking) is built on an agent-based architecture [17]; it relies on Ludovic Grivault’s doctoral thesis [7], extending its concepts to a distributed architecture instead of a centralised one. The algorithms from the
software propose to activate sensors over brief intervals, and sensor resources are then reserved. An agent represents an object (e.g. a projectile, an aircraft, a warship) in the theatre of operations. The internal memory of an agent gathers various information about one object, such as its position or its speed. Moreover, all agents take decisions about the tasks that the sensors will perform; in turn, sensors give feedback to agents about the objects they track, and the tactical situation is completed or updated in a loop. These software applications need to be interfaced, as the platforms from the simulator Sonia could use the intelligence of the agents to schedule the sensor tasks. The middleware ZMQ (Zero Message Queue) [8] is used to exchange messages between these two entities: a platform in the simulator can send sensor tracks to the STK, after which agents are created or updated for each track received. The agents can then send the platforms the sensor scheduling needed to accomplish the mission. Currently, the STK framework is based on a centralised architecture: only one platform contains agents and performs the computation to establish the sensor tasking of the entire fleet. Sensor tasking refers to the management of a set of sensors (e.g. radar, electronic warfare, optronic) in order to collect the data necessary to accomplish the global mission. Nevertheless, the centralised architecture has a few specific limitations with respect to the requirements of the naval domain (e.g. platforms are distant, the network must be resilient to communication loss). We therefore face complexity problems that restrain us when proposing new architectural concepts. The architecture needs to scale up and to be reliable. For these reasons, we have decided to work on a benchmark using RabbitMQ [6].
3.2 Functionalities and Requirements of Our Benchmark
The benchmark with RabbitMQ will add a level of abstraction to our concepts from a network point of view. The benchmark will evolve from a simple program to a software tool with new features. Our model will be defined so that the platforms exchange data through messages. Later, users will be able to modify parameters, and specific scenarios will be designed to show the impacts of the suggested architecture concepts. In the model, we want to configure and adapt different parameters such as the size of messages and the latency between distinct producers and consumers. Then, we would like to define which producer sends messages to which consumer. Finally, we want to integrate additional facilities, among them: simulating communication or packet losses and delays; adding or removing nodes during simulation; modifying latency or throughput on communication links.
3.3 RabbitMQ
RabbitMQ [6] is a Message-Oriented Middleware (MOM), which is used within distributed architectures to let services communicate and cooperate. This MOM uses the AMQP standard (Advanced Message Queuing Protocol), where messages are sent asynchronously from a producer to a consumer through queues and exchanges; these messages are routed to receiving queues thanks to binding and routing keys. RabbitMQ brings important properties and services for the architecture, such as:
– Modularity: services or other heterogeneous networked entities can be interconnected with the addition of a serialisation protocol like JSON;
– Scalability: nodes can be added or removed dynamically and the middleware deals with thousands of nodes;
– Quality of Service (QoS): queue size can be modified, the Time-To-Live (TTL) of a message can be changed as well as the priority, the messages can be acknowledged;
– Interface and management: a management user interface or a metric exporter (e.g. metrics can be messages per second, the number of producers or consumers) can be enabled through plug-ins;
– Fault tolerance and failures, since nodes can fail or messages and queues can be lost;
– Secured communications.
Furthermore, RabbitMQ proposes APIs (Application Programming Interfaces) for different programming languages such as C#, Java, PHP or Python; this permits the integration of RabbitMQ in legacy software and heterogeneous environments. RabbitMQ is one MOM among others and is widely used in numerous platforms (e.g. OpenStack, Red Hat). In [13], the authors compare four MOMs for communication in distributed architectures: AMQP (RabbitMQ), Kafka, MQTT and ZeroMQ. Multiple features are compared, such as QoS, security, the standardisation of the MOM or the transport protocols used. They consider RabbitMQ a very balanced MOM with the highest flexibility and many functionalities; ZeroMQ stands out from all other MOMs by its performance but is harder to implement. Furthermore, RabbitMQ provides mechanisms of persistence and message retention. Monitoring can be achieved by integrating Grafana and Prometheus in a RabbitMQ-based architecture.
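To make the Quality of Service properties listed above more concrete (bounded queue size, message TTL, message priority), the following is a minimal sketch of a toy in-memory queue. It is an illustrative model only and not RabbitMQ's implementation: the class name `ToyQueue`, its methods, the explicit `now` clock argument and the policy of refusing a publish when the queue is full are all assumptions made for this example.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class _Entry:
    sort_key: tuple                            # (-priority, arrival order)
    payload: bytes = field(compare=False)
    expires_at: float = field(compare=False)


class ToyQueue:
    """Hypothetical in-memory queue; NOT RabbitMQ's implementation."""

    def __init__(self, max_length: int):
        self.max_length = max_length
        self._heap = []
        self._arrival = itertools.count()

    def _drop_expired(self, now: float) -> None:
        # Discard every message whose time-to-live has elapsed.
        self._heap = [e for e in self._heap if e.expires_at > now]
        heapq.heapify(self._heap)

    def publish(self, payload: bytes, now: float, ttl: float,
                priority: int = 0) -> bool:
        """Store a message; return False if the queue is full (assumed policy)."""
        self._drop_expired(now)
        if len(self._heap) >= self.max_length:
            return False
        key = (-priority, next(self._arrival))  # high priority first, then FIFO
        heapq.heappush(self._heap, _Entry(key, payload, now + ttl))
        return True

    def consume(self, now: float):
        """Return the highest-priority live message, or None if there is none."""
        self._drop_expired(now)
        if not self._heap:
            return None
        return heapq.heappop(self._heap).payload
```

Passing the clock value explicitly keeps the sketch deterministic, which is convenient when replaying a scenario; a real broker applies such policies on the server side and exposes them through queue and message properties.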
This last characteristic makes it possible to evaluate and to tune our architectural solutions.
3.4 A Throughput Testing Tool: PerfTest
PerfTest [2] is a tool developed in Java for testing the throughput performance of RabbitMQ. It generates load and traffic under numerous configurations and parameters such as the number of consumers, the number of producers, the message sending and receiving rates, the queue size or the size (in bytes) of a message. It is also possible to generate random load and to modify message publishing rates over defined time intervals. The tool allows users to export Prometheus metrics, as the RabbitMQ plugin does. The exported metrics are data related to latency, non-routed messages,
the total of published, rejected or consumed messages, but also metrics about memory or CPU usage. The tests that we conducted with this tool led to the following analysis: PerfTest was created for testing network load, which is not exactly appropriate for our needs, and it lacks flexibility. The tool reports a global latency and a global throughput. Moreover, each message has the same size. These points bring us to the following conclusion: PerfTest does not meet our requirements and is too limited with regard to our objectives.
3.5 RabbitMQ-Based Framework
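[Figure 1: producers (P) publish to an exchange with the routing keys Test.critical, Test.common and Test.independent; the exchange routes to the critical queue (binding key *.critical, consumer C1), the common queue (*.common, consumer C2), the common+critical queue (*.critical and *.common, consumer C3) and the independent queue (*.independent, consumer C4). Two groups of producers are marked N=5.]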
Fig. 1. Example of Configuration with 11 Producers, 4 Consumers and 3 Different Topics
Given the reasons listed previously, we decided to develop our own framework, which fits our needs. Our framework relies on RabbitMQ, Prometheus and Grafana. Firstly, we define a configuration that describes the consumers, the producers and the exchanges between them. Then, we export real-time information, called metrics, about the consumers, the producers and the data they exchange to a Prometheus server. Finally, Grafana queries and displays data from this server. The developed framework allows starting many consumers or producers, which are both programs; their numbers are defined in a configuration file. We chose to use topics as routing rules: the producers send messages to a router-like entity called an exchange, which receives the messages before routing them to one or more corresponding queues. The queue is a buffer that stores messages ready
to be delivered to the consumers. In the messaging model of RabbitMQ, the producers use routing keys while the consumers use binding keys. Figure 1 shows a configuration example with different routing and binding keys. For instance, five producers publish messages with the routing key “Test.critical”; these messages are duplicated into every matching queue (i.e. the queues bound with the binding key “*.critical”). Consequently, consumers one and three receive the same messages from the aforementioned producers.
Experimental Setup
We started the development of the framework with a classic Java communication application using the Eclipse IDE. The program sends a message from a single producer to a single consumer. We added features afterwards, the purpose being to get more relevant information and metrics for the monitoring of a combat network. The second step was to use additional facilities and services of the MOM (e.g. sending several messages from N producers to M consumers). The third step of our development was to configure the size of the messages and the frequency of their delivery. Afterwards, we collected metrics and traces and analysed them as graphs with Prometheus and Grafana. Together, they present network data graphically inside dashboards (see Fig. 2). In our configuration, the Grafana and Prometheus servers are launched locally from executable files (note: this also works well using Docker images). The example configuration shown in Fig. 1 can be modified using a CSV file. The device used to run our tests is a Windows 10 Pro laptop with 16 GB of RAM and an Intel Core i7-6820HQ processor. The library used to implement the AMQP protocol is amqp-client-5.7.1, and the metrics are created with the micrometer-core-1.7.3 library, which contains built-in support for Prometheus.
Prometheus
Prometheus [14] is a network monitoring and alerting software. It registers metrics in a real-time database and provides a query language, PromQL, to query the data in this database. Prometheus offers four types of metrics: counter, gauge, histogram and summary. Furthermore, it is possible to use regular expressions (regex) to modify outgoing metrics, to collect values over time in a vector and to use aggregation operators (e.g. sum, min, max) as well as other functions.
Grafana
Grafana [4] is an analytics and monitoring software that displays graphs within dashboards, with data coming from a time-series database (in our case, the data are stored in a Prometheus server). Grafana queries this database, which is updated every few seconds. Besides, Grafana makes it possible to
create personal dashboards to fit user needs by showing time graphs, gauges or histograms to highlight the data. Variables are available in Grafana, so the same dashboard can be used for many producers or consumers; combined with regex, the dashboard still fits when the configuration is modified. Thus, the scalability of the framework is preserved, as modifications to the RabbitMQ configuration have no impact on the dashboard.
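The topic routing used in the configuration of Fig. 1 can be illustrated with a small, library-free sketch. It only mimics the matching rule of an AMQP topic exchange (`*` stands for exactly one word, `#` for zero or more words) and is not the Java code of the framework; the queue names and the helper functions `topic_matches` and `route` are hypothetical.

```python
from typing import Dict, List


def topic_matches(binding_key: str, routing_key: str) -> bool:
    """AMQP-style topic match: '*' is exactly one word, '#' is zero or more words."""

    def match(pattern: List[str], words: List[str]) -> bool:
        if not pattern:
            return not words
        head, rest = pattern[0], pattern[1:]
        if head == "#":
            # '#' may absorb any number of words, including none
            return any(match(rest, words[i:]) for i in range(len(words) + 1))
        if not words:
            return False
        return (head == "*" or head == words[0]) and match(rest, words[1:])

    return match(binding_key.split("."), routing_key.split("."))


# Queue name -> binding keys, following Fig. 1 (queue names are illustrative)
BINDINGS: Dict[str, List[str]] = {
    "critical": ["*.critical"],
    "common": ["*.common"],
    "common+critical": ["*.critical", "*.common"],
    "independent": ["*.independent"],
}


def route(routing_key: str) -> List[str]:
    """Return the queues that receive a copy of a message with this routing key."""
    return [queue for queue, keys in BINDINGS.items()
            if any(topic_matches(key, routing_key) for key in keys)]


for key in ("Test.critical", "Test.common", "Test.independent"):
    print(key, "->", route(key))
```

A message published with `Test.critical` is copied into two queues here, which is why consumers one and three see the same messages.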
4 Results
At this stage of the study, we produce metrics and dashboards which aggregate various communication information. These results will allow us to compare and evaluate architectures according to several criteria. For instance, we can take the simple case of a centralised architecture where a single consumer represents the leading naval platform and a variable number of producers represent the allied fleet sensors. We can then compare this architecture with another one in which the consumers are more numerous, representing a distributed or decentralised architecture. From the Grafana dashboards, we expect to analyse details about the traffic (e.g. throughput, congestion, latency). As depicted in Fig. 2, general information coming from the RabbitMQ servers is displayed; the configuration used there follows the one presented in Fig. 1. We can observe the average throughput of all producers (i.e. the incoming average) as well as that of all consumers (i.e. the outgoing average). The consumer throughput is higher because messages are duplicated and sent to several queues.
Fig. 2. Dashboard with General Information from RabbitMQ
Fig. 3. Graph with the throughput of one Consumer
In Fig. 3, a time graph provides throughput information about one consumer. In this example, the information is updated every five seconds; the instantaneous throughput is shown in green and the average throughput in yellow. More results were displayed, such as the number of messages effectively sent and correctly received for each producer. For example, we can observe that producer number one sent X messages to consumer number one and Y messages to consumer number two; adding these values gives the number of messages sent by the producer. The throughput of each consumer and producer is displayed in a table. Finally, two graphs were displayed thanks to Grafana plugins: they show producers and consumers as nodes (or vertices), and the number of messages received by a consumer from a producer defines the links between the nodes (or edges).
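To make the two curves of Fig. 3 concrete, the sketch below derives an instantaneous and a running-average throughput from successive readings of a cumulative message counter. The five-second interval comes from the text; the function name `throughputs` and the sample values are hypothetical, and the framework itself obtains these curves through Prometheus and Grafana rather than through code like this.

```python
from typing import List, Tuple


def throughputs(counter_samples: List[int], interval_s: float) -> List[Tuple[float, float]]:
    """For each sample after the first, return (instantaneous, average) in msg/s.

    counter_samples: cumulative number of messages, read every interval_s seconds.
    """
    result = []
    first = counter_samples[0]
    for i in range(1, len(counter_samples)):
        # Rate over the last interval only
        instantaneous = (counter_samples[i] - counter_samples[i - 1]) / interval_s
        # Rate since the first reading
        average = (counter_samples[i] - first) / (i * interval_s)
        result.append((instantaneous, average))
    return result


samples = [0, 500, 900, 1500]  # hypothetical cumulative message counts
for inst, avg in throughputs(samples, interval_s=5.0):
    print(f"instantaneous={inst:.1f} msg/s, average={avg:.1f} msg/s")
```

The instantaneous value reacts to each interval, while the average smooths the bursts, which matches the qualitative behaviour of the green and yellow curves described above.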
5 Conclusion and Perspectives
Our studies aim to abstract the architectural complexity of naval sensor systems by proposing a method for evaluating architectures using a RabbitMQ-based framework. For this purpose, we introduced the context of our research and the upcoming issues, then we highlighted the work done in the field. The first contributions were presented:
– The interfacing of two proprietary software products;
– Performance measurements using the PerfTest tool from RabbitMQ;
– The development of a RabbitMQ-based framework supporting many possible configurations between producers and consumers;
– First results obtained by displaying metrics with Prometheus and Grafana, and the ability to compare and evaluate architectures under various criteria.
The RabbitMQ-based framework showed us that it is possible to exchange data through messages between programs and that we can highlight the behaviour of the network system using Grafana. For future work, we will integrate our consumers and producers in a multi-agent system-based architecture (MAS) [17] extended from the concepts presented in Sect. 3.1. Our work is set in a context where many naval platforms have to cooperate, so data exchange is a major issue. Different organisations are possible for the agents [5], and some rules (e.g. message sending frequency, optimised packet length) have to be followed to optimise the bandwidth use. Depending on how agents are situated and how they communicate, the complexity increases as the size of the fleet grows, and we need to point out the limits. Then, RabbitMQ will be used to exchange data between agents that are physically located on distinct platforms; the agents need to cooperate, so they produce and consume messages from one another. RabbitMQ, combined with the Grafana and Prometheus monitoring tools, will produce
valuable insights that will help us to analyse and evaluate these architectures. We will focus on the limits of the proposed architecture, mostly from the network point of view, in order to refine our concepts of agents.
References
1. The cooperative engagement capability. Johns Hopkins APL Technical Digest, Volume 16, Number 4 (1995)
2. RabbitMQ perftest (2022). https://rabbitmq.github.io/rabbitmq-perf-test/stable/htmlsingle/
3. TEKEVER ASDS. Specifications (2018). https://www.camelot-project.eu/results
4. Chakraborty, M., Kundan, A.P.: Grafana. In: Monitoring Cloud-Native Applications, pp. 187–240. Springer, Apress, Berkeley, CA (2021)
5. Dorri, A., Kanhere, S.S., Jurdak, R.: Multi-agent systems: a survey. IEEE Access 6, 28573–28593 (2018)
6. Dossot, D.: RabbitMQ Essentials. Packt Publishing Ltd., Birmingham (2014)
7. Grivault, L.: Architecture multi-agent pour la conception et l’ordonnancement de systèmes multi-senseur embarqués sur plateformes aéroportées. Ph.D. thesis, Sorbonne université, December 2018
8. Hintjens, P.: ZeroMQ: Messaging for Many Applications. O’Reilly Media, Inc., Sebastopol (2013)
9. Johnsen, F.T., Hafsøe, T., Skjervold, E., Rose, K., Lund, K., Nordbotten, N.A.: Multinett II: SOA and XML security experiments with cooperative ESM operations (CESMO). FFI rapport, 2344:2009, 2008
10. Johnson, B.W., Green, J.M.: Naval network-centric sensor resource management. 7th ICCRTS; Naval Postgraduate School: Monterey, CA, USA (2002)
11. Jérôme, B., Mickaël, U.: The armed forces’ current and future needs of radio frequencies: a strategic issue for France. Annales des Mines, March 2020
12. Matić, M., Ivanović, S., Antić, M., Papp, I.: Health monitoring and auto-scaling RabbitMQ queues within the smart home system. In: 2019 IEEE 9th International Conference on Consumer Electronics (ICCE-Berlin), pp. 380–384. IEEE (2019)
13. Sommer, P., Schellroth, F., Fischer, M., Schlechtendahl, J.: Message-oriented middleware for industrial production systems. In: 2018 IEEE 14th International Conference on Automation Science and Engineering (CASE), pp. 1217–1223. IEEE (2018)
14. Turnbull, J.: Monitoring with Prometheus. Turnbull Press (2018)
15. UPVLC. Architecture and data model CAMELOT (2018). https://www.camelot-project.eu/results
16. Vogel, D.: Future combat air system: too big to fail; differing perceptions and high complexity jeopardise success of strategic armament project (2021)
17. Wooldridge, M.: An Introduction to Multiagent Systems. Wiley, Hoboken (2009)
A Novel Method of Automatic Modulation Classification with an Optimised 1D DBSCAN
Bill Gavin1,2(B), Edward Ball1,2, and Tiantai Deng1,2
1 The University of Sheffield, Sheffield S10 2TN, UK
[email protected]
2 Department of Electronic and Electrical Engineering, Mappin Building, Mappin Street, Sheffield S1 3JD, UK
Abstract. Automatic modulation classification is an important goal in the creation of a system which can adapt to and demodulate a range of wireless modulation schemes. In this paper, a new method of automatic modulation classification using clustering is proposed. Constellation points are decomposed into arguments and absolute values. These data are then clustered to find characteristic features of the incoming data, which are used to determine the modulation scheme with a machine learning based classifier. The algorithm shows a classification accuracy comparable with the state-of-the-art technologies at 20 dB SNR, with lower computational complexity and comparable execution speed. The computational complexity is further improved with an optimisation for unidimensional clustering, which achieves a minimum speed-up of $10^2$ by reducing the computational complexity from $O(n^2)$ to $O(n \log n)$.
Keywords: DBSCAN · Clustering · Cognitive Radio · Automatic Modulation Classification
1 Introduction
Cognitive Radio (CR) is an emerging area of communications whose aim is to create a device capable of sensing and adapting to its conditions. Part of this adaptation is automatically modifying hardware for the demodulation of different modulation schemes. Achieving this will allow the creation of general-purpose radios which can demodulate a range of signals, with numerous civilian and military applications such as dynamically reconfiguring to the best wireless channel in a busy area and the interception of messages on a battlefield. To achieve this, the modulation scheme of the incoming signal must be identified so that the receiver hardware can be altered to support the data. Numerous solutions to this problem have been proposed in the literature, from ensemble-based classifiers [1] to CNN and LSTM deep learning models [2,3]. In general, the more complex the machine learning (ML) classifier, the better the classification accuracy achieved; however, for deployment in a hardware system it is advantageous for the system to feature low latency, power consumption and chip area. In this paper, a new method of tackling the modulation classification problem using clustering for feature extraction is proposed, which aims to:
– Match or beat the classification accuracy of the state-of-the-art automatic classifiers
– Exhibit lower power cost, hardware cost and calculation delay than the state-of-the-art
Should this work meet these goals, a significant step towards a general-purpose and universal demodulator will have been taken, and this work can be applied to the realisation of a low-power and small-area ML demodulator. The rest of this paper is organised as follows: Sect. 2 provides an overview of the state-of-the-art and outlines the importance of this work; Sect. 3 gives an introduction to constellation clustering, then describes in detail the algorithm used to prepare data for classification as well as the optimisations made to reduce complexity; Sect. 4 gives an overview of the results achieved; and finally Sects. 5 and 6 provide a conclusion and an overview of the future of this work.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): SAI 2023, LNNS 711, pp. 960–967, 2023. https://doi.org/10.1007/978-3-031-37717-4_63
2 Literature Review
The state-of-the-art ML modulation classifiers, such as those found in work by Doan et al. [2] and by Ke and Vikalo [3], employ complex neural networks, a CNN and an LSTM respectively. Therefore, any hardware implementation of these models will draw a large amount of power and utilise a large amount of resources on an FPGA. This can be seen from work such as that by S. Wang et al. [7], which is the state-of-the-art LSTM implementation and still utilises over 50% of a Stratix V. Similar implementation sizes for a CNN are reported by D. T. Kwadjo et al. [8]. Furthermore, a major issue with the CNN strategy in [2] is its use of images: the creation and handling of these images requires significant resources, which adds further chip-space requirements and delay to the calculation time. From these examples it is clear that, for a classifier to be implemented alongside demodulators for the purpose of CR, a more efficient method is required. Therefore, this work proposes a new method of tackling this problem with a modified clustering algorithm.
3 Methodology
3.1 PSK and QAM Constellation Clustering
Time-series RF waves are decomposed and represented as two waves known as I (in-phase) and Q (quadrature), which together encode the amplitude and phase of the original wave. The IQ point pairs can then be plotted in 2D space as a complex number $Z$, which is formed as in Eq. 1.
$$Z = I + jQ \tag{1}$$
Modulation schemes which utilise changes in phase and amplitude are represented by clusters of points throughout the 2D plane as the I and Q values change to represent different data symbols; this forms a particular pattern known as a constellation diagram. Finding the number of constellation points via clustering can be used to determine the order of modulation: for example, BPSK and QPSK have 2 and 4 constellation points respectively, and determining this information makes classification of each order trivial. However, different modulation schemes may have equal orders of modulation, which makes this method ineffective on its own; therefore, a method of numerically representing the positioning of the constellation points is required. This can be achieved by decomposing the complex constellation points into arguments and absolute values. This provides two benefits: it gives more information about the positioning of the constellation points, and it allows a less computationally complex clustering algorithm to be employed, providing savings in power and hardware usage and speeding up calculation times. This is discussed in Sect. 3.2.
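As a rough illustration of this decomposition, the sketch below counts the distinct absolute values $|Z|$ and arguments $\arg Z$ of ideal, noise-free constellations. The unit-circle PSK points, the 16QAM grid on odd integers, the rounding tolerance and the 16PSK entry (added only to show two schemes of equal order) are assumptions made for the example; the paper's own processing of measured, noisy data is done by clustering and was written in MATLAB.

```python
import cmath
from typing import Dict, List, Tuple


def psk(order: int) -> List[complex]:
    """Ideal M-PSK points on the unit circle (assumed layout)."""
    return [cmath.exp(2j * cmath.pi * k / order) for k in range(order)]


def qam16() -> List[complex]:
    """Ideal square 16QAM points on the grid {-3, -1, 1, 3} (assumed layout)."""
    levels = (-3, -1, 1, 3)
    return [complex(i, q) for i in levels for q in levels]


def feature_counts(points: List[complex], decimals: int = 6) -> Tuple[int, int]:
    """Return (number of distinct |Z|, number of distinct arg Z)."""
    magnitudes = {round(abs(z), decimals) for z in points}
    arguments = {round(cmath.phase(z), decimals) for z in points}
    return len(magnitudes), len(arguments)


constellations: Dict[str, List[complex]] = {
    "BPSK": psk(2),
    "QPSK": psk(4),
    "8PSK": psk(8),
    "16QAM": qam16(),
    "16PSK": psk(16),  # not part of the paper's dataset; same order as 16QAM
}

for name, points in constellations.items():
    n_abs, n_arg = feature_counts(points)
    print(f"{name}: {n_abs} distinct absolute values, {n_arg} distinct arguments")
```

Under these assumptions 16QAM and 16PSK both have 16 constellation points, yet the pair of counts differs, which is the extra information the decomposition is meant to provide.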
Fig. 1. Absolute and Arguments of Constellation Points
3.2 DBSCAN
As can be seen in Fig. 1, the number of distinct arguments and absolute values can be used to discern between the two modulation schemes; finding the number of groups of these raw values can be performed with a clustering algorithm. Most clustering algorithms such as KNN and K-Means require an input of the number
of clusters which are required and group data based upon this parameter [6]. We propose the clustering algorithm Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [4] for this problem, as it forms an arbitrary number of clusters by using density to cluster points in space. Only two parameters are required to achieve this: the minimum number of spatially near points needed to constitute a cluster ($N$), and the maximum distance between two points for them to be considered part of the same cluster ($\epsilon$). The algorithm operates as follows:
1. A point $p$ is randomly selected from dataset $D$
2. The absolute distance from $p$ to each point in $D$ is determined to see if it is within distance $\epsilon$ of $p$
3. Points within $\epsilon$ are labelled as part of cluster $X$ if the number of points closer than $\epsilon$ is greater than $N$
4. For each newly clustered point, check if any points are within $\epsilon$; if new points are found they are added to cluster $X$
5. Repeat step 4 until all points of cluster $X$ are found
6. Repeat steps 1–5 until all points are clustered
7. If data-points do not have a number of points nearby greater than $N$ they are labelled as outliers.
DBSCAN is designed for multidimensional data and has a computational complexity of $O(n^2)$, owing to the requirement to determine the distance to each point in the dataset for each point in the dataset. When working with one-dimensional data, as in this case, it is advantageous to sort the data and apply a modified algorithm in a similar manner to the work by Bartosz Meglicki [5]. By sorting the data, only the distance to the next point in the array needs to be calculated to determine whether that point belongs to the same cluster. The minimum cluster points value $N$ can be implemented by incrementing a variable for each point in a chain whose distance to the subsequent value is no larger than $\epsilon$. Should this variable be below the desired $N$, the whole chain is registered as outliers. Making this optimisation reduces the complexity to that of the selected sorting algorithm, which in this case was $O(n \log n)$ for quick sort. From the simulation results, implementing the clustering algorithm in this manner provided a speed-up of $10^2$ at 100 data-points, which grew to $10^3$ at 10000 data-points; a graph of the calculation times of the unoptimised and optimised DBSCAN algorithms can be seen in Fig. 2.
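The sorted one-dimensional variant described above can be sketched as follows. This is an illustrative reading of the description, not the authors' MATLAB implementation: the function name `dbscan_1d`, the label convention (-1 for outliers) and the example values are assumptions, and Python's built-in sort stands in for the quick sort mentioned in the text.

```python
from typing import List, Sequence

OUTLIER = -1  # assumed label for points that do not belong to any cluster


def dbscan_1d(values: Sequence[float], eps: float, min_pts: int) -> List[int]:
    """Cluster 1D data by sorting and chaining neighbours closer than eps.

    A chain of consecutive sorted points whose gaps are all <= eps becomes a
    cluster if it holds at least min_pts points; otherwise it is marked as outliers.
    Labels are returned in the original order of `values`.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])  # O(n log n)
    labels = [OUTLIER] * len(values)
    next_cluster = 0
    start = 0  # position in `order` where the current chain begins
    for pos in range(1, len(order) + 1):
        # The chain ends at the last point or when the gap to the next point exceeds eps
        chain_ends = (pos == len(order)
                      or values[order[pos]] - values[order[pos - 1]] > eps)
        if chain_ends:
            if pos - start >= min_pts:
                for idx in order[start:pos]:
                    labels[idx] = next_cluster
                next_cluster += 1
            start = pos
    return labels


# Hypothetical absolute values: two dense groups and one stray point
data = [5.05, 0.1, 9.0, 0.0, 5.0, 0.2, 5.1]
print(dbscan_1d(data, eps=0.3, min_pts=3))
```

After the sort, a single linear pass is enough, so the sort dominates the cost; this is the source of the $O(n \log n)$ complexity quoted above.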
4 Results
Data for this simulation was generated using the Rohde & Schwarz SMW100A and captured with a Keysight N9030B PXA signal analyser; waves modulated with BPSK, QPSK, 8PSK and 16QAM were created at SNRs ranging from 30 dB to 3 dB. The signal analyser was configured to the same carrier frequency as the signal source, but was not in carrier phase lock. The code for this system was written in MATLAB R2021b and executed on an AMD Ryzen 5 3600. A multilayer perceptron (MLP) with a single hidden layer was used for classification.
Fig. 2. 1D Optimised and Unoptimised DBSCAN Execution Time
The MLP was trained using 5-fold cross validation and regularised to reduce complexity. The system was tested across an SNR range from 3 dB to 20 dB; it was found that 100% accuracy was maintained down to 15 dB, after which the performance began to deteriorate: at 10 dB the system showed 85% accuracy and at 3 dB the performance dropped to 80%. While performance did drop at lower SNRs, the minimum classification accuracy still remained above 75%. An example of a confusion matrix at 16 dB is shown in Fig. 3, and the complete results alongside a comparison to other work are shown in Fig. 4. This performance beats the high-SNR accuracy reported in the literature for the LSTM model in [3] but fails to match the CNN in [2] below 18 dB SNR; as the SNR decreases, the results show that the LSTM and CNN display greater noise immunity, as accuracies of above 90% and 100% are maintained at 5 dB SNR. Across all SNRs this work achieves greater performance than the ensemble statistical classifier. A comparison of the execution time with the CNN and LSTM implementations from [3] is found in Table 1. The table shows that this work matches the execution time of the CNN presented in [3] and beats the LSTM on a comparable CPU. It is expected that if this system were implemented outside of MATLAB, as was done in [3], further gains could be made. This work achieves better accuracy at SNRs above 8 dB whilst executing in a comparable time; therefore, on low-noise data the system presented in this work is advantageous in every way. Paper [2] gives no indication of execution speed, but we predict the time to be slower owing to the need to format the data into images before classification.
Fig. 3. Confusion Matrix at 16dB, Class 1 through 4 is BPSK, QPSK, 8PSK, and 16QAM Respectively
Fig. 4. Graph of Classification Accuracy of this Work and the Best from the Literature
Table 1. Table to Compare Execution Times of this Work and the State-of-the-Art
Design       Execution time (ms)   CPU
This work    0.659                 AMD Ryzen 5 3600
LSTM [3]     1.151                 Intel i7-8700K
CNN [3]      0.634                 Intel i7-8700K
5 Conclusion
In this paper, a novel method of using the DBSCAN clustering algorithm to extract information about a modulation scheme from a constellation diagram is presented. With the absolute values and the arguments of the constellation points, a classifier could be trained which achieves the best performance for a system without deep learning models and matches the performance of deep learning models at SNRs above 15 dB. One-dimensional clustering modifications are shown to provide a significant speed-up in calculation time and allow the system to at least match the speed of the fastest classifiers found in the literature whilst significantly reducing the complexity.
6 Future Work
Following this work, the presented algorithm will be implemented on an FPGA platform, then rigorously tested and optimised to maximise classification accuracy as well as minimise implementation size and power consumption. Once this is completed, it is expected that the system will be deployed alongside an array of demodulators and be used to control the flow of RF data through the array, to achieve a cognitive radio capable of sensing the modulation scheme of incoming data and reacting to and demodulating any type of RF data.
References
1. Saharia, D., Boruah, M.R., Pathak, N.K., Sarma, N.: An ensemble based modulation recognition using feature extraction. In: International Conference on Intelligent Technologies (CONIT) 2021, pp. 1–6 (2021). https://doi.org/10.1109/CONIT51480.2021.9498547
2. Doan, V.-S., Huynh-The, T., Hua, C.-H., Pham, Q.-V., Kim, D.-S.: Learning constellation map with deep CNN for accurate modulation recognition. In: GLOBECOM 2020–2020 IEEE Global Communications Conference, 2020, pp. 1–6. https://doi.org/10.1109/GLOBECOM42002.2020.9348129
3. Ke, Z., Vikalo, H.: Real-time radio technology and modulation classification via an LSTM auto-encoder. IEEE Trans. Wireless Commun. 21(1), 370–382 (2022). https://doi.org/10.1109/TWC.2021.3095855
4. Ester, M., Kriegel, H.-P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis, E., Han, J., Fayyad, U.M. (eds.) Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pp. 226–231. AAAI Press (1996)
5. Meglicki, B.: Linear time DBSCAN for sorted 1D data and laser range scan segmentation (2021)
6. Yesilbudak, M., Sagiroglu, S., Colak, I.: A novel implementation of kNN classifier based on multi-tupled meteorological input data for wind power prediction. Energy Conversion Manage. 135, 434–444 (2017). Deng, T., Crookes, D., Siddiqui, F., Woods, R.: A new real-time FPGA-based implementation of K-means clustering for images. In: Intelligent Computing and Internet of Things, pp. 468–477. Springer, Singapore, 2018
7. Wang, S., et al.: Acceleration of LSTM with structured pruning method on FPGA. IEEE Access 7, 62930–62937 (2019). https://doi.org/10.1109/ACCESS.2019.2917312
8. Kwadjo, D.T., Mbongue, J.M., Bobda, C.: Performance exploration on preimplemented CNN hardware accelerator on FPGA. In: International Conference on Field-Programmable Technology (ICFPT) 2020, 298–299 (2020). https://doi.org/10.1109/ICFPT51103.2020.00055
A Secure Information Transmission Scheme for the Cluster Blockchain of the Internet of Vehicles
Hua Yi Lin1(B), Meng-Yen Hsieh2, and Kuan-Ching Li2
1 Department of Information Management, China University of Technology, Taipei, Taiwan
[email protected]
2 Department of Computer Science and Information Engineering, Providence University, Taichung, Taiwan
Abstract. In recent years, with the popularity of blockchain, many of its applications have gradually been introduced into the Internet of Vehicles (IoV) and cloud computing. Because the IoV and blockchain are both decentralized, their architectures suit each other well. However, blockchain is a flat framework. This study considers how to introduce a cluster blockchain framework to improve the efficiency of blockchain when the Internet of Vehicles forms a cluster or group architecture. In addition, the data of the Internet of Vehicles and cloud computing are exposed to the public network, which raises the question of how to protect transmitted data from being altered. This study proposes a communication protocol based on the Elliptic Curve Digital Signature Algorithm (ECDSA) to protect the transmitted data of Merkle tree nodes. The ECDSA digital signature is used to sign the transaction block to ensure the security of data transmission in the Internet of Vehicles.
Keywords: Internet of Vehicles · Cloud Computing · Elliptic Curve Digital Signature
1 Introduction
With the rise of autonomous vehicles and 5G, the Internet of Vehicles in the cloud is no longer out of reach. At present, the Internet of Vehicles consists of the following components: vehicle-to-infrastructure (V2I), vehicle-to-roadside (V2R), vehicle-to-pedestrian (V2P), vehicle-to-vehicle (V2V), vehicle-to-group (V2G), vehicle-to-network (V2N) and vehicle-to-everything (V2X) [1, 2]. To improve the value-added applications of the Internet of Vehicles, the roadside devices send the vehicle information to the cloud service platform of the travel control center, which analyses it and outputs value-added information. The cloud service platform includes a master server, which assigns and coordinates work, and several Mapper/Reducer servers, which perform map/reduce operations. When a vehicle is moving on the road, the information collected along the road is transmitted to the neighboring vehicles, and it is also sent to the cloud service classifier of the back-end cloud service platform through roadside units (RSUs) and Internet routers. The cloud service classifier classifies the request according to the type of service and then transmits it to the corresponding cloud service master server. When the master server receives the request, it selects the mappers and reducers that will perform the map/reduce operation. In the near future, as the IoV infrastructure gradually matures, a more complete V2X networking architecture will be achieved.
Blockchain is a peer-to-peer decentralized database formed by many connected blocks [3, 4]. Recently, blockchain has become a major trend through cryptocurrencies. Unlike traditional centralized databases, which concentrate data in the storage of a single server, the blockchain distributes the transaction information among the many transaction nodes within the blocks. This avoids centralization in a single storage space, meets the demand for decentralization and avoids a single point of failure. In addition, the blockchain has the following characteristics: 1. anonymity; 2. tamper resistance; 3. data consistency; and 4. data transparency. Therefore, this study intends to improve the data security mechanism of the cloud IoV by using the characteristics of blockchain. We propose the concept of secure information transmission for the decentralized cluster blockchain, and we avoid security vulnerabilities of the IoV by means of the ECDSA digital signature. The remainder of this paper is structured as follows. Section 2 presents blockchain with the ECDSA digital signature. Section 3 describes the proposed secure transmission framework for the cluster blockchain of the Internet of Vehicles. Section 4 presents the security analysis. Finally, conclusions and future work are presented in Sect. 5.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): SAI 2023, LNNS 711, pp. 968–979, 2023. https://doi.org/10.1007/978-3-031-37717-4_64
2 Blockchain with ECDSA Digital Signature
In the last few years, decentralized blockchain technology has become mainstream for cryptocurrencies. The decentralized blockchain architecture is very different from the traditional centralized database architecture. Blockchain uses a peer-to-peer decentralized storage architecture that stores data in objects called nodes, instead of in a single main storage, to avoid a single point of failure. A single block is the main element of a blockchain, which combines many blocks. Every block is composed of a block head and a block body. Figure 1 describes the detailed construction of the block. The block head contains the following fields:
1. Previous block hash: this prev_hash field is calculated from the previous block head [5].
2. Nonce: it represents the difficulty of the workload algorithm and the number of times the workload algorithm is run.
3. Time stamp: the time at which this block was generated.
4. Merkle tree root: the hash value of the current block body, computed by the Merkle tree algorithm.
The Merkle tree located in the block body is the set of all transaction records. It maps the transaction nodes at the leaves to their non-leaf parent nodes through hash operations, and finally forms the Merkle tree structure and produces the Merkle tree root node. In
970
H. Y. Lin et al.
other words, the root node is the hash value of all the transactions, the child nodes are all the transaction records, and the block body is the overall transaction information, including the following features and fields. Besides, each block is linked in order. 1. Genesis block: When a blockchain is created, the system first establishes a genesis block [6]. This block is located at the first position of the blockchain, and the next block points to the genesis block header hash via the previous_hash field in the header [7]. Similarly, the previous_hash field of an ordinary block saves the hash value of the previous block header. Besides, the previous hash field of the genesis block is null. 2. Transaction: Each transaction is stored in the leaf node of the block body [5]. In order to ensure the security and non-repudiation of the transaction information, this study introduces a customized transaction block, which includes the head and body of the transaction block [8, 9]. The name of each field is shown in Fig. 2, and the detailed function of the field is described as follows. The proposed customized transaction block contains the transaction header and transaction body. The transaction header includes the following fields. 1. Transaction Number: It represents the transaction number [10]. 2. Time stamp of transaction: The time of generating the transaction block. 3. Serial number of transaction block: This represents the serial number of the generated transaction block. The transaction body includes the following fields. 1. Vehicle ID: The identity of the vehicle ID. 2. MAC add of OBU: OBU mac address on the vehicle. 3. Time stamp of record: The generating time of this transaction record.
Fig. 1. The Diagram of the Internal Block Structure
4. Transmitted plain data: The plain data sent out by the vehicle.
5. Serial number of the transmitted plain data: The serial number of the transmitted plain data.
6. Type of service: The service requested by the vehicle.
Fig. 2. The Internal Structure Diagram of a Transaction Block
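The block head and the customized transaction block described above can be sketched as plain data structures. This is only an illustrative sketch: the class names, field names, field types and the JSON serialization are assumptions made here, not the layout used by the authors.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class TransactionHeader:
    transaction_number: int
    timestamp: float          # time the transaction block was generated
    serial_number: int        # serial number of the transaction block


@dataclass
class TransactionBody:
    vehicle_id: str
    obu_mac: str              # MAC address of the on-board unit (OBU)
    record_timestamp: float   # time this transaction record was generated
    plain_data: str
    data_serial_number: int
    type_of_service: str


@dataclass
class TransactionBlock:
    header: TransactionHeader
    body: TransactionBody

    def to_bytes(self) -> bytes:
        # Assumed serialization: JSON with sorted keys, so equal blocks give equal bytes.
        return json.dumps(asdict(self), sort_keys=True).encode()


@dataclass
class BlockHeader:
    prev_hash: Optional[str]  # None (null) for the genesis block
    nonce: int
    timestamp: float
    merkle_root: str

    def hash(self) -> str:
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()


# The genesis block has no predecessor; the next block stores the genesis header hash.
genesis = BlockHeader(prev_hash=None, nonce=0, timestamp=0.0, merkle_root="00" * 32)
block1 = BlockHeader(prev_hash=genesis.hash(), nonce=0, timestamp=1.0, merkle_root="11" * 32)
```

Changing any field of `genesis` changes `genesis.hash()`, so `block1.prev_hash` would no longer match; this is the linking property the text relies on.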
Because the Internet of vehicles contains a large number of vehicles and changes rapidly, we propose a cluster blockchain that handles the transaction security of adjacent vehicles and improves on the efficiency of the original blockchain structure. In addition, we use the ECDSA digital signature algorithm to secure the transaction information of the vehicle and to authenticate the trader. ECDSA ensures the integrity of the transmitted message, confirms the identity of the sender, and prevents repudiation of transactions. ECDSA combines ECC (elliptic curve cryptosystem) with DSA (digital signature algorithm). Its signature mode is similar to that of DSA; the main difference is that the signature is computed with an elliptic curve cryptographic algorithm. The great advantage of ECDSA is that it requires a shorter key length for the same level of security and is faster than RSA or other asymmetric cryptosystems.
3 The Security Transmission Scheme for the Cluster Blockchain
When the system starts, each vehicle registers to obtain the digital signature of its OBU. For convenience, we take the vehicles $V_A$, $V_B$, $V_C$ and $V_D$ as examples to illustrate the operation of the system. First, vehicle $V_A$ signs the transaction block $TD_A$ of the data to be transmitted to obtain $\mathrm{Sign}_{V_A}(TD_A)$, hashes this signature with SHA256 to obtain $H(\mathrm{Sign}_{V_A}(TD_A))$, and generates a transaction in $\mathrm{block}_{11}$. Likewise, when vehicles $V_B$ to $V_D$ transmit data, the same signature and hash operations generate $H(\mathrm{Sign}_{V_B}(TD_B))$, $H(\mathrm{Sign}_{V_C}(TD_C))$ and $H(\mathrm{Sign}_{V_D}(TD_D))$, respectively. Then the hashes of paired neighboring vehicles are hashed again to obtain $H(H(\mathrm{Sign}_{V_A}(TD_A)) + H(\mathrm{Sign}_{V_B}(TD_B)))$ and $H(H(\mathrm{Sign}_{V_C}(TD_C)) + H(\mathrm{Sign}_{V_D}(TD_D)))$. Finally, the Merkle tree root is computed as
$H\big(H(H(\mathrm{Sign}_{V_A}(TD_A)) + H(\mathrm{Sign}_{V_B}(TD_B))) + H(H(\mathrm{Sign}_{V_C}(TD_C)) + H(\mathrm{Sign}_{V_D}(TD_D)))\big)$.
When the complete block has been generated, $\mathrm{block}_{11}$ is added to $\mathrm{chain}_1$. Following similar steps, $\mathrm{block}_{21}, \mathrm{block}_{22}, \ldots, \mathrm{block}_{2N}$ are added to $\mathrm{chain}_2$, and $\mathrm{block}_{N1}, \mathrm{block}_{N2}, \ldots, \mathrm{block}_{NN}$ are added to $\mathrm{chain}_N$. In this way, the transactions and blocks generated by vehicles belonging to different cluster groups join $\mathrm{chain}_1$ to $\mathrm{chain}_N$, respectively, as shown in Fig. 3. Finally, the whole cluster blockchain is linked together in the cloud, as shown in Fig. 4. Managing the blockchain through this hierarchical cluster architecture reduces the operation time of each blockchain and shortens the length of each regional blockchain, thereby improving the service efficiency of the blockchain.
Fig. 3. The Architecture of the Cluster Blockchain
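The Merkle tree root construction for the four vehicles can be sketched as follows. The sketch assumes that "+" in the expressions above denotes byte concatenation and that an odd node at any level is paired with a copy of itself; neither detail is stated in the text. The byte strings stand in for the real signatures $\mathrm{Sign}_{V_A}(TD_A)$ to $\mathrm{Sign}_{V_D}(TD_D)$.

```python
import hashlib


def H(data: bytes) -> bytes:
    """SHA256 digest, the hash function adopted in this study."""
    return hashlib.sha256(data).digest()


def merkle_root(signed_blocks: list[bytes]) -> bytes:
    """Hash each signed transaction block, then hash pairs upward until one root remains."""
    if not signed_blocks:
        raise ValueError("at least one signed transaction block is required")
    level = [H(s) for s in signed_blocks]
    while len(level) > 1:
        if len(level) % 2:                 # assumption: duplicate the last node on odd levels
            level.append(level[-1])
        level = [H(level[k] + level[k + 1]) for k in range(0, len(level), 2)]
    return level[0]


# Placeholder byte strings standing in for the four signatures (hypothetical values).
signed = [b"Sign_VA(TD_A)", b"Sign_VB(TD_B)", b"Sign_VC(TD_C)", b"Sign_VD(TD_D)"]
root = merkle_root(signed)
print(root.hex())
```

For four leaves the function evaluates exactly the nested expression given in the text, and changing any one signature changes the root.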
ECDSA Signature Procedures
To protect the data transmitted in the blockchain, this study adopts ECDSA [4, 6, 8] to secure transaction blocks. The following phases describe the operation of ECDSA in this study, as shown in Table 1.
Initial Phase:
S1. Initially, the OBU of the vehicle picks a generator $T$ and an elliptic curve $E_{GF(p)}(e, f)$ with order $n = |E_{GF(p)}(e, f)| + 1$. Here $n$ is the number of points on the curve, including the point at infinity.
Fig. 4. The Architecture of the Cluster Blockchain System
S2. Subsequently, the OBU of the vehicle chooses a private key $K$ and a point $T = (X_T, Y_T)$, where $1 \le K \le n-1$ and $n$ is the order of $T$. The OBU then calculates the public key $PK = K \times T = K \times (X_T, Y_T)$ using the generator $T$. The system represents the public key as $(e, f, p, n, T, PK)$.
Signature Phase:
S3. The OBU picks a random integer $i$ with $1 \le i \le n-1$ and computes the point $O = (X_O, Y_O) = i \times T = i \times (X_T, Y_T)$.
S4. The OBU takes the transaction block $TD$ to be transmitted, together with the coordinates of the point $O$, as inputs and computes $f = \mathrm{Hash}(TD)$ using SHA256. (From this step on, $f$ denotes the hash value, not the curve coefficient $f$ of S1.)
S5. $G = X_O \bmod n$.
S6. $S = (K \times G + f) \times i^{-1} \bmod n$.
S7. $(TD, G, S)$ is the signature result. If $G$ or $S$ equals 0, the system returns to S3 and draws a new random integer $i$, repeating until the signature is completed with nonzero $G$ and $S$.
Verification Phase:
S8. When the receiver obtains the transaction block $TD$ with the signature result $(TD, G, S)$, the receiver computes $f = \mathrm{Hash}(TD)$, $M = S^{-1} \bmod n$, $V_1 = f \times M \bmod n$, $V_2 = G \times M \bmod n$, and $W = (X_w, Y_w) = V_1 \times T + V_2 \times PK$.
S9. Subsequently, the receiver verifies whether $G$ equals $X_w$.
S10. If $G = X_w$, the receiver accepts the signature; otherwise it discards the received signature.
Secure Phase
When vehicle $V_A$ wants to send data to the cloud, it must first perform the above ECDSA digital signature procedure on the transaction block to obtain $(TD_A, G, S)$. For convenience, we write the signature result $(TD_A, G, S)$ as $\mathrm{Sign}_{V_A}(TD_A)$. The information protection procedure of this study is as follows. Vehicles $V_A$ to $V_D$ belong to the cluster of $\mathrm{block}_{11}$, as shown in Fig. 4; we omit the subscript 11 for ease of explanation. Vehicles $V_A$ to $V_D$ want to transmit their OBU data to the cloud. The following steps give the detailed procedure.
Table 1. The ECDSA Digital Signature Procedure on Blockchain
Registration phase:
(1) Each vehicle is registered to obtain the digital signature of the OBU.
Initial phase:
(2) Initially, the OBU of the vehicle picks a generator $T$ and an elliptic curve $E_{GF(p)}(e, f)$ with order $n = |E_{GF(p)}(e, f)| + 1$. Here $n$ is the number of points on the curve, including the point at infinity.
(3) Subsequently, the OBU of the vehicle chooses a private key $K$ and a point $T = (X_T, Y_T)$, where $1 \le K \le n-1$ and $n$ is the order of $T$. The OBU calculates the public key $PK = K \times T = K \times (X_T, Y_T)$ using the generator $T$. The system represents the public key as $(e, f, p, n, T, PK)$.
Signature phase:
(4) The OBU picks a random integer $i$ with $1 \le i \le n-1$ and computes the point $O = (X_O, Y_O) = i \times T = i \times (X_T, Y_T)$.
(5) The OBU takes the transaction block $TD$ to be transmitted, together with the coordinates of the point $O$, as inputs and computes $f = \mathrm{Hash}(TD)$ using SHA256.
(6) $G = X_O \bmod n$.
(7) $S = (K \times G + f) \times i^{-1} \bmod n$.
(8) $(TD, G, S)$ is the signature result. If $G$ or $S$ equals 0, the system returns to (4) and draws a new random integer $i$, repeating until the signature is completed with nonzero $G$ and $S$.
Verify phase:
(9) When the receiver obtains the transaction block $TD$ with the signature result $(TD, G, S)$, the receiver computes $f = \mathrm{Hash}(TD)$, $M = S^{-1} \bmod n$, $V_1 = f \times M \bmod n$, $V_2 = G \times M \bmod n$, and $W = (X_w, Y_w) = V_1 \times T + V_2 \times PK$.
(10) Subsequently, the receiver verifies whether $G$ equals $X_w$.
(11) If $G = X_w$, the receiver accepts the signature; otherwise it discards the received signature.
S1. Vehicles $V_A$ to $V_D$ first sign their entire transaction blocks with the ECDSA protocol to obtain $\mathrm{Sign}_{V_A}(TD_A)$ to $\mathrm{Sign}_{V_D}(TD_D)$.
S2. Vehicles $V_A$ to $V_D$ hash $\mathrm{Sign}_{V_A}(TD_A)$ to $\mathrm{Sign}_{V_D}(TD_D)$ pairwise with SHA256 to obtain the Merkle tree root $H\big(H(H(\mathrm{Sign}_{V_A}(TD_A)) + H(\mathrm{Sign}_{V_B}(TD_B))) + H(H(\mathrm{Sign}_{V_C}(TD_C)) + H(\mathrm{Sign}_{V_D}(TD_D)))\big)$. The Merkle tree root value is then written into the head of $\mathrm{block}_{11}$, and the remaining parameters (time stamp, nonce, previous hash and block hash) are obtained to complete all data fields of $\mathrm{block}_{11}$.
S3. By similar procedures, this study obtains $\mathrm{block}_{12}, \mathrm{block}_{13}, \ldots, \mathrm{block}_{1N}$.
S4. The above procedures are likewise executed for the vehicles of the other clusters to obtain $\mathrm{block}_{22}, \mathrm{block}_{23}, \ldots, \mathrm{block}_{2N}$ and $\mathrm{block}_{32}, \mathrm{block}_{33}, \ldots, \mathrm{block}_{3N}$.
S5. Subsequently, $RSU_1$, $RSU_2$ and $RSU_3$ individually cascade these blocks to form $\mathrm{chain}_1$, $\mathrm{chain}_2$ and $\mathrm{chain}_3$.
S6. The road side units $RSU_1$ to $RSU_3$ thus obtain the cluster blockchains $\mathrm{chain}_1$, $\mathrm{chain}_2$ and $\mathrm{chain}_3$ for their cluster group vehicles and transmit them to the cloud server through the secure routing protocol. The cloud server then cascades $\mathrm{chain}_1$, $\mathrm{chain}_2$ and $\mathrm{chain}_3$ to obtain the cluster blockchain of the entire system, as shown in Fig. 3.
S7. After receiving the data of a block, the cloud service platform uses the hash function to check the ECDSA digital signature results of the transaction blocks. The verification process is as follows:
Phase 1. First, the Merkle tree root hash value is recomputed as $H^*$ and compared with the received root hash value $H$. If $H^*$ is not equal to $H$, the data has been modified during transmission.
Phase 2. Then the recomputed $H^*[H(\mathrm{Sign}_{V_A}(TD_A)) + H(\mathrm{Sign}_{V_B}(TD_B))]$ is compared with the received one. If they are not equal, then $\mathrm{Sign}_{V_A}(TD_A)$ or $\mathrm{Sign}_{V_B}(TD_B)$ has been modified. Otherwise, $H^*[H(\mathrm{Sign}_{V_C}(TD_C)) + H(\mathrm{Sign}_{V_D}(TD_D))]$ is the value that differs, which means that $\mathrm{Sign}_{V_C}(TD_C)$ or $\mathrm{Sign}_{V_D}(TD_D)$ has been modified.
Phase 3. Assume that $H^*[H(\mathrm{Sign}_{V_A}(TD_A)) + H(\mathrm{Sign}_{V_B}(TD_B))]$ is not equal to $H[H(\mathrm{Sign}_{V_A}(TD_A)) + H(\mathrm{Sign}_{V_B}(TD_B))]$. The study then compares $H^*(\mathrm{Sign}_{V_A}(TD_A))$ and $H^*(\mathrm{Sign}_{V_B}(TD_B))$ with the received $H(\mathrm{Sign}_{V_A}(TD_A))$ and $H(\mathrm{Sign}_{V_B}(TD_B))$, and can immediately detect whether the modified transaction block is $\mathrm{Sign}_{V_A}(TD_A)$ or $\mathrm{Sign}_{V_B}(TD_B)$. In the other case, the modified $\mathrm{Sign}_{V_C}(TD_C)$ or $\mathrm{Sign}_{V_D}(TD_D)$ is detected in the same way.
The whole verification process only needs to recompute the hash value of the Merkle tree root and of its child nodes and to compare each recomputed value with the received one. The node whose data has been changed can then be detected. The number of comparison stages required is $\log_2 N$, where $N$ is the number of vehicles.
Finally, the cloud platform obtains the transmitted plain data $TD_A$ to $TD_D$ with the signature results $(TD_A, G, S)$ to $(TD_D, G, S)$. For simplicity, this study uses $TD_A$ as the example. When the cloud platform receives the transmitted data, it computes $f = \mathrm{Hash}(TD_A)$, $M = S^{-1} \bmod n$, $V_1 = f \times M \bmod n$, $V_2 = G \times M \bmod n$, and $W = (X_w, Y_w) = V_1 \times T + V_2 \times PK$. Subsequently, the receiver verifies whether $G$ equals $X_w$. If $G = X_w$, the receiver accepts the signature; otherwise it discards the received signature. The cloud platform then performs the requested service according to the type of service (TOS) field of the transaction block body.
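The signature and verification phases can be sketched in pure Python with the symbols of this section ($T$, $K$, $PK$, $i$, $G$, $S$, $M$, $V_1$, $V_2$, $W$). The paper does not name its curve, so the secp256k1 parameters below are an assumption chosen only to make the sketch runnable; the function names are also illustrative. The check at the end compares $G$ with $X_w \bmod n$, the usual form of the test $G = X_w$. This is a teaching sketch, not a hardened implementation.

```python
import hashlib
import secrets

# Assumed curve: secp256k1, y^2 = x^3 + CURVE_E*x + CURVE_F over GF(p).
p = 2**256 - 2**32 - 977
CURVE_E, CURVE_F = 0, 7
T = (0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798,
     0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8)
n = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141


def point_add(P, Q):
    """Add two curve points; None represents the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None
    if P == Q:
        lam = (3 * x1 * x1 + CURVE_E) * pow((2 * y1) % p, -1, p) % p
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)


def scalar_mult(k, P):
    """Compute k x P by double-and-add."""
    result, addend = None, P
    while k:
        if k & 1:
            result = point_add(result, addend)
        addend = point_add(addend, addend)
        k >>= 1
    return result


def hash_to_int(td: bytes) -> int:
    """f = Hash(TD) with SHA256, reduced modulo n."""
    return int.from_bytes(hashlib.sha256(td).digest(), "big") % n


def keygen():
    K = secrets.randbelow(n - 1) + 1          # private key, 1 <= K <= n-1
    return K, scalar_mult(K, T)               # PK = K x T


def sign(td: bytes, K: int):
    f = hash_to_int(td)
    while True:
        i = secrets.randbelow(n - 1) + 1      # fresh random integer for every signature
        O = scalar_mult(i, T)
        G = O[0] % n
        S = (K * G + f) * pow(i, -1, n) % n
        if G != 0 and S != 0:                 # otherwise draw a new i
            return G, S


def verify(td: bytes, sig, PK) -> bool:
    G, S = sig
    if not (1 <= G < n and 1 <= S < n):
        return False
    f = hash_to_int(td)
    M = pow(S, -1, n)
    V1, V2 = f * M % n, G * M % n
    W = point_add(scalar_mult(V1, T), scalar_mult(V2, PK))
    return W is not None and W[0] % n == G


K, PK = keygen()
signature = sign(b"TD_A", K)
print(verify(b"TD_A", signature, PK))
```

The modular inverses use `pow(x, -1, m)`, which requires Python 3.8 or later. Verification works because $V_1 \times T + V_2 \times PK = (f + K \times G) \times S^{-1} \times T = i \times T = O$.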
4 Secure Analyses
This section shows how the proposed methods improve the information security and computing efficiency of IoV. This study evaluates the following items:
1. Data integrity: Each vehicle in the blockchain uses ECDSA to sign the transmitted transaction block and calculates the HMAC and Merkle tree root of each transaction through the hash function. If any modification occurs during transmission, the transaction node that has been changed can be detected by comparison. Since the previous hash field points to the hash value of the previous block, if the previous block is modified, the subsequent block must also be changed. This is a difficult task, which ensures the integrity of the information.
2. Decentralization prevents a single point of failure: The blockchain distributes the transaction data across the blocks and stores it in each vehicle instead of in a centralized server storage [9, 10], and the vehicles back up the blockchain data for each other, so a single point of failure is avoided.
3. Data transparency: Since the data of the blocks is stored and maintained by all vehicles, any vehicle has open access to the transaction data of the blockchain, which achieves data transparency [11, 12].
4. Performance improvement: Because the data is protected by hash operations, when a transaction node is changed, comparing the Merkle tree root makes it possible to determine whether the change lies under the left or the right child node. Proceeding in the same way down the tree, it takes at most $\log_2 N$ stages to find the changed data, where $N$ is the number of vehicles in the cluster.
5. Ensure trader identity: The identity of the sender is confirmed by the ECDSA digital signature on the transmitted data. Because the signature can be verified only with the sender's public key, a successful verification confirms the sender's identity [13]. In addition, if the recomputed hash value differs from the received hash value, the receiver can immediately detect that the transmitted data has been modified.
6. Blockchains cannot be modified: Changing a transaction record changes the Merkle tree root in the block header and breaks the integrity of the blockchain, because the hash field of every subsequent cascaded block header would have to be changed accordingly [14, 15]. The previous-hash value stored in the next block would have to be adjusted as well. Therefore, anyone who wanted to modify the transaction records of a block would need to change all subsequent blocks, which is virtually impossible.
7. Generating time of the Merkle tree root: This study adopts SHA256 to compute the hash value of each transaction block and evaluates the time needed to generate the Merkle tree root. We compare MD5 and SHA1 with SHA256 under the same data size and number of executions, as shown in Fig. 5. SHA1 performs better than SHA256; however, SHA256 is more secure than SHA1. Therefore, this study adopts SHA256 as the hash function to secure the cluster blockchain.
In summary, since the blockchain is fast and non-reversible, any change to the data during transmission is detected immediately, which makes the scheme highly appropriate for the decentralized Internet of vehicles.
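The top-down comparison of Phases 1 to 3 and item 4 can be sketched as a walk from the Merkle tree root to the changed leaf. The sketch assumes that the number of leaves is a power of two, that "+" is byte concatenation, that exactly one leaf has been modified, and that the received hash values of all tree levels are available to the verifier; the function names are illustrative.

```python
import hashlib


def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(leaves: list[bytes]) -> list[list[bytes]]:
    """Return every tree level: leaf hashes first, the single root last."""
    count = len(leaves)
    if count == 0 or count & (count - 1):
        raise ValueError("this sketch assumes a power-of-two number of leaves")
    level = [H(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        level = [H(level[k] + level[k + 1]) for k in range(0, len(level), 2)]
        levels.append(level)
    return levels


def locate_modified(received, recomputed):
    """Follow the mismatching child from the root down; return the changed leaf index."""
    if received[-1][0] == recomputed[-1][0]:
        return None                          # Phase 1: roots agree, nothing was modified
    idx = 0
    for depth in range(len(received) - 2, -1, -1):
        left = 2 * idx
        # One comparison per level: if the left child matches, the change is on the right.
        idx = left if received[depth][left] != recomputed[depth][left] else left + 1
    return idx


original = [b"Sign_VA(TD_A)", b"Sign_VB(TD_B)", b"Sign_VC(TD_C)", b"Sign_VD(TD_D)"]
tampered = list(original)
tampered[2] = b"Sign_VC(TD_C) modified in transit"

received_levels = build_levels(original)     # hash values H sent with the block
recomputed_levels = build_levels(tampered)   # hash values H* recomputed by the cloud
print(locate_modified(received_levels, recomputed_levels))
```

After the root comparison, the loop descends one level per iteration, so $N$ leaves need $\log_2 N$ further comparisons, matching the bound stated in the text.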
Fig. 5. The Comparison of Generating Time for the Merkle Tree Root
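A comparison of this kind can be reproduced in outline with the standard `hashlib` module. The payload size and the number of rounds below are arbitrary assumptions, not the settings behind Fig. 5, and the measured times depend on the machine, so no particular ordering of the three algorithms is implied by the sketch itself.

```python
import hashlib
import time


def time_hash(name: str, payload: bytes, rounds: int) -> float:
    """Return the seconds needed to hash `payload` `rounds` times with algorithm `name`."""
    start = time.perf_counter()
    for _ in range(rounds):
        hashlib.new(name, payload).digest()
    return time.perf_counter() - start


payload = b"x" * 1024          # assumed 1 KiB transaction block
rounds = 10_000                # assumed number of executions
results = {name: time_hash(name, payload, rounds) for name in ("md5", "sha1", "sha256")}

for name, seconds in results.items():
    print(f"{name}: {seconds:.4f} s for {rounds} hashes")
```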
5 Conclusions and Future Work
With the development of the Internet of vehicles (IoV), security has become increasingly important. This study proposes a hierarchical cluster blockchain architecture that concentrates the communication of adjacent vehicles in a cluster to form a cluster blockchain, which improves on the operation speed of a flat blockchain. In addition, we propose a new transaction block structure combined with the ECDSA digital signature to secure the vehicle transaction information inside the cluster. This confirms the identity of the vehicle that transmits the data and prevents repudiation. Furthermore, during the transmission of transaction data, the integrity of the signed data is protected through SHA256, so that modification in transit is detected. The result is a more secure information transmission scheme for the cluster blockchain of the Internet of vehicles.
Acknowledgments. This paper is supported by the Ministry of Science and Technology (MOST), Taiwan, under grants MOST 111-2221-E-163-002.
References
1. Wei, L., Weihong, H., Jing, L., Ke, Z., Li, K.-C., Zhang, D.: Deep reinforcement learning for resource protection and real-time detection in IoT environment. IEEE Internet of Things J. 7(7), 6392–6401 (2020)
2. Liang, W., Fan, Y., Li, K.-C., Zhang, D., Gaudiot, J.-L.: Secure data storage and recovery in industrial Blockchain network environments. IEEE Trans. Ind. Inform. 16(10), 6543–6552 (2020)
3. Liang, W., Li, K.-C., Long, J., Kui, X., Zomaya, A.Y.: An industrial network intrusion detection algorithm based on multifeature data clustering optimization model. IEEE Trans. Ind. Inform. 16(3), 2063–2071 (2020). https://doi.org/10.1109/TII.2019.2946791
4. Lin, H.Y.: Integrate the hierarchical cluster elliptic curve key agreement with multiple secure data transfer modes into wireless sensor networks. Connection Sci. 34(1), 274–300 (2022)
5. Liang, W., Tang, M., Long, J., Peng, X., Xu, J., Li, K.-C.: A secure fabric blockchain-based data transmission technique for industrial internet-of-things. IEEE Trans. Ind. Inform. 15(6), 3582–3592 (2019)
6. Lin, H.Y., Hsieh, M.-Y.: A dynamic key management and secure data transfer based on m-tree structure with multi-level security framework for Internet of vehicles. Connection Sci. 34(1), 1089–1118 (2022)
7. Zhang, Q., Ding, Q., Zhu, J., Li, D.: Blockchain empowered reliable federated learning by worker selection: a trustworthy reputation evaluation method. In: 2021 IEEE Wireless Communications and Networking Conference Workshops (WCNCW), Nanjing, China, pp. 1–6 (2021)
8. Lin, H.Y., Hsieh, M.Y., Li, K.C.: Flexible group key management and secure data transmission in mobile device communications using elliptic curve Diffie-Hellman cryptographic system. Int. J. Comput. Sci. Eng. 12(1), 47 (2016)
9. Zhang, L., Xu, J.: Blockchain-based anonymous authentication for traffic reporting in VANETs. Connection Sci. 34(1), 1038–1065 (2022)
10. Zheng, J., Wang, X., Yang, Q., Xiao, W., Sun, Y., Liang, W.: A blockchain-based lightweight authentication and key agreement scheme for internet of vehicles. Connection Sci. 34(1), 1430–1453 (2022)
11. Amritesh, K., Debasis, D.: EIoVChain: Towards authentication and secure communication based Blockchain for Internet of Vehicles (IoV). In: IEEE International Conference on Blockchain, Melbourne, Australia (2021)
12. Khaleel, M., Bilal, S.: A blockchain model for secure communications in internet of vehicles. In: IEEE/ACS 17th International Conference on Computer Systems and Applications (AICCSA), Antalya, Turkey (2020)
13. Yu, S., Lee, J., Park, K., Das, A.K., Park, Y.: IoV-SMAP: secure and efficient message authentication protocol for IoV in smart city environment. IEEE Access 8, 167875–167886 (2020)
14. Alladi, T., Chamola, V., Sahu, N., Venkatesh, V., Goyal, A., Guizani, M.: A comprehensive survey on the applications of blockchain for securing vehicular networks. IEEE Commun. Surv. Tutorials 24(2), 1212–1239 (2022)
15. Farooque, A., Arun, B., Neeraj, P., Sneha, K., Dhafer, A., Shrikant, T.: A framework for secured dissemination of messages in internet of vehicle (IoV) using Blockchain approach. In: IEEE International Conference on Mobile Networks and Wireless Communications (ICMNWC), Tumkur, Karnataka, India (2021)
Techniques for Long-Term Service Prioritization and Optimization Using IEEE 802.11 Technologies (Email and HTTP)
Ali Mohd Ali1(B), Mohammad R. Hassan1, Ahmed Abu-Khadrah2, and Ahmad Al-Qerem3
1 Communications and Computer Engineering Department, Faculty of Engineering, Al-Ahliyya Amman University, Amman 19328, Jordan
[email protected]
2 College of Computing & Informatics, Saudi Electronic University, Riyadh, Saudi Arabia
3 Department of Computer Science, Faculty of Information Technology, Zarqa University, Zarqa 13132, Jordan
Abstract. The article presents several valuable insights. Firstly, it elucidates the applicable IEEE technologies and network architecture for HTTP and Email services. Additionally, it proposes a framework and algorithm to evaluate network performance and determine the optimal network configuration based on the available technologies. The suggested approach takes into account multiple parameters, such as spatial distribution and the number of nodes, and aims to provide the most efficient network configuration, one that results in superior service quality and overall network performance. We record each application's service quality metrics, which provides accurate numerical results for categorizing and identifying the most effective technologies in terms of overall performance; this enables further performance enhancements and the construction of a computational algorithm model. Our empirical results support the conclusions of the study and show that the proposed algorithm is effective.
Keywords: E-mail · HTTP · BSS · ESS · Ad hoc · QoS · IEEE Technologies
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): SAI 2023, LNNS 711, pp. 980–999, 2023. https://doi.org/10.1007/978-3-031-37717-4_65
1 Introduction
Advances in computer and Internet technology have increased the use of online technologies in recent applications such as online commerce and medical research. WLANs and mobile networks have long been the primary wireless communication methods, and they have become even more popular because of their high throughput, ease of setup, and affordability. The media access control (MAC) protocol used in Wi-Fi networks is based on the IEEE 802.11 standard. It allows people all over the world to communicate with each other by sharing documents, images, and videos, regardless of the distance between them. Using a WLAN as a transmission channel, one can run various services and applications [1]. However, it can be challenging to determine the most suitable physical layer technology to use because of the numerous options available. A newer IEEE technology is not guaranteed to provide the best performance in industrial communication networks, because its behaviour differs from that of earlier technologies. Therefore, a detailed investigation of the different technologies is necessary before deciding on the most appropriate one, as stated in [2]. Internet services such as social media, electronic mail, and data transmission, in addition to voice over wireless networks, can affect wireless connectivity. Classic data such as news, text messaging, and file transfers were shared successfully over the Internet's architecture. High-quality digital content has become more prevalent over the last few years, and user behavior has changed along with it. This has led to issues with network efficiency and usability, and, together with the flexibility, accessibility, and digital media capabilities of the IEEE 802.11 standard, it has made Wi-Fi the leading technology in the industry. Applications for viewing and distributing digital video have benefited from the expansion of platforms like YouTube and Netflix. The quality of the user experience can suffer if the latency, jitter, and throughput of these applications are not properly managed [3]. In WLANs where multiple applications are running, it is important to discuss and quantify a wide range of factors that influence network performance. Nevertheless, for real-time multimedia applications, QoS provisioning is often only a matter of best effort. As far as we know, no previous research has compared the QoS metrics of various IEEE 802.11 technologies over the Internet to identify the most suitable technology standard for both infrastructure and independent network designs, which is the subject of this paper. Additionally, incorporating QoS elements such as latency, jitter, and packet loss into real-time networks is viewed as a significant challenge.
Because competing IEEE 802.11 standards are available, it is essential to evaluate the options rationally before settling on a single solution for widespread deployment. In addition, the effectiveness of IEEE technologies used in real-time industrial communication systems is not always assured for newer technologies such as 802.11n compared with older ones such as 802.11g. Therefore, our study examines and recommends the best technology or technologies and network structure to the user, which avoids the waste of resources and the problems associated with selecting technologies at random and then redesigning the entire configuration. Our focus is on exploring mixed network topologies to find the most advanced technology and the optimal network design across various technologies. These topologies involve a variety of services, a greater number of nodes, and IEEE technologies, each at specified percentages. In our earlier research [4–7], we looked specifically at setting up Internet applications as a standalone service and studied how network performance was affected by node distribution (circular, random, or uniform) for each IEEE 802.11 standard. That work considered all three types of node distribution and found that each had an impact on network performance. To determine whether this new method of distributing channel access improves the performance of Wireless LAN 802.11, an evaluation will be performed using the OPNET simulator and will include all of the relevant parameters of Wireless LAN 802.11e. Using the same network architecture, Wei et al. [8] compared the speeds of the HTTP and FTP protocols for five users. To ensure reliable numerical outcomes, the QoS metrics were customized for each application.
A number of efforts have evaluated the QoS metrics of applications configured with IEEE technologies. The remainder of the article is organized as follows. The literature review is presented in Sect. 2. The fundamentals and principles of IEEE physical layer technologies are introduced in Sect. 3. Mathematical calculations and an explanation of the proposed algorithm are presented in Sect. 4. Section 5 provides an in-depth analysis and evaluation of the findings, while Sects. 6 and 7 provide a comparative analysis and conclusions, respectively.
2 Literature Review
In this section, the proposed approach is briefly contrasted with existing algorithms. By measuring parameters such as throughput, jitter, and end-to-end latency, recent methodologies [9], [10], and [11] determined the ideal network design and deployed their models using 3, 9, and 18 nodes, 20 nodes, and 2 nodes, respectively. However, their procedures have been validated only on BSS network designs. Our work studies the performance of all six IEEE 802.11 technologies with regard to the distribution of nodes (random, uniform, or circular), whereas recent studies, including those cited in [12], have not explored this area. Consequently, there is a lack of literature that evaluates the quality-of-service metrics of different IEEE 802.11 systems. This gap makes it difficult to select the best infrastructure and independent network architecture model, which is the problem addressed in this paper. Adding QoS characteristics such as latency, jitter, and packet loss to best-effort networks is also a significant task. Evaluating and implementing technologies objectively is just as crucial as using a wide range of IEEE 802.11 standards. In addition, the proliferation of available alternatives has made it difficult to state definitively which network architecture should be employed for a given set of wireless network resources in order to guarantee the best possible service quality. For this reason, this investigation provides a high-level study that advises the user on the most efficient technology or standard and network architecture. When the signal-to-interference-plus-noise ratio (SINR) is insufficient, a device with higher capabilities can automatically switch its physical layer (PHY) to a mode with lower throughput and improved robustness. Under poor conditions, devices can switch on their own, for example from 11g to 11b. However, a device cannot automatically upgrade from 11g to 11e. If, for instance, 8 Mbps of upload bandwidth is needed but the devices can only support 4 Mbps, either the devices themselves must be upgraded or the video quality must be lowered in order to make the most efficient use of the available bandwidth. This research is therefore useful and makes a significant contribution, because it optimizes the network and preserves its resources in a cost-effective manner.
3 General Description of IEEE Standard 802.11
3.1 The Underlying Framework of IEEE Networks
The basic building block of an 802.11 WLAN is the basic service set (BSS). In a BSS, a central coordinator, the access point (AP), controls the wireless nodes in the network. All stations within range of the AP can communicate with each other. An extended service set (ESS) is made up of several BSSs and is used for infrastructure purposes. APs are necessary to create infrastructure networks. In contrast, an independent BSS (IBSS) network consists of a decentralized collection of nodes that function independently of one another, without an AP [13, 14]. The primary distinctions between the IEEE WLAN standards are summarized in Table 1.
Table 1. Brief Description of the 802.11 WLAN Standard
Application             802.11   802.11a    802.11b         802.11g    802.11n
Data rates (Mbps)       1, 2     up to 54   1, 2, 5.5, 11   up to 54   up to 600
Frequency band (GHz)    2.4      5          2.4             2.4        2.4 & 5
3.2 QoS and Importance Performance Measurements
The quality of service of multi-service applications is defined by their performance metrics. Table 2 lists the primary QoS requirements of each application and the acceptable thresholds defined in [15]. The performance of best-effort applications is directly affected by the following QoS measurements:
• Packet end-to-end delay (sec): the time a data or voice packet takes to travel across the network from node A to node B.
• Page response time (sec): the time a page takes to load completely, including all inline objects.
• Throughput (bit/sec): the average number of bits delivered from the source to the receiver per second.
• Traffic sent (packet/sec) and traffic received (packet/sec): used to determine how much data is lost after being sent over the network from a source device.
For each application parameter there is an importance coefficient (ICP) that quantifies the parameter's relative importance in determining the overall service quality. Table 2 shows the importance level and the cut-off value of each QoS metric for each application. In the simulation, the importance levels are mapped to the weights H = 1, M = 0.5, and L = 0.1.
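One way to picture how thresholds and importance weights could be combined is a weighted share of satisfied metrics. This is a hypothetical illustration only: the scoring rule, the function name `qos_score` and the metric keys are assumptions made here and are not the algorithm of Sect. 4. The thresholds and importance levels are those listed for Email and HTTP in Table 2, and the weights are H = 1, M = 0.5, and L = 0.1.

```python
# Importance weights from the text: H = 1, M = 0.5, L = 0.1.
IMPORTANCE = {"H": 1.0, "M": 0.5, "L": 0.1}

# metric -> (threshold, importance level, True if smaller measured values are better)
REQUIREMENTS = {
    "Email": {
        "delay_or_response_s": (1.0, "L", True),
        "throughput_kbps": (30.0, "L", False),
        "packet_loss_pct": (10.0, "L", True),
    },
    "HTTP": {
        "delay_or_response_s": (1.0, "M", True),
        "throughput_kbps": (30.0, "L", False),
        "packet_loss_pct": (10.0, "L", True),
    },
}


def qos_score(application: str, measured: dict) -> float:
    """Assumed rule: importance-weighted fraction of metrics that meet their threshold."""
    total = satisfied = 0.0
    for metric, (threshold, level, lower_is_better) in REQUIREMENTS[application].items():
        weight = IMPORTANCE[level]
        value = measured[metric]
        meets = value <= threshold if lower_is_better else value >= threshold
        total += weight
        if meets:
            satisfied += weight
    return satisfied / total


# Hypothetical measurements for one simulated scenario.
sample = {"delay_or_response_s": 0.4, "throughput_kbps": 45.0, "packet_loss_pct": 2.0}
print(qos_score("HTTP", sample))
```

Under this assumed rule, a scenario that misses the HTTP response-time threshold loses more score than one that misses the throughput threshold, because the response time carries the medium weight.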
4 Network Protocol and Architecture Selection using the Proposed Algorithm

4.1 Systematic Techniques Designed

All scenarios in this paper are built and analysed using the OPNET simulation model [16]. OPNET Modeler makes it possible to investigate network communication, facilities, architectures, and protocols quickly. With the OPNET simulation,
A. M. Ali et al. Table 2. Values of Threshold Importance for Best-Effort Service Provisioning
| Parameters | Email: Threshold (Th) / Importance (I) | HTTP: Threshold (Th) / Importance (I) |
|---|---|---|
| Delay/Response time (sec) | 1 / L | 1 / M |
| Throughput (kbps) | 30 / L | 30 / L |
| Packet Loss Rate (%) | 10 / L | 10 / L |
we have taken into account inputs from two primary sources for this algorithm: user configuration and technical specifications (standards). The user configuration details the network's size and spatial distribution. The technology specifications lay out the architecture and technology of the physical layer. The upper portion of Fig. 1 defines these variables. Network architectures specify two ways of connecting wireless nodes: with access points (BSS and ESS) or without them (IBSS). The user also specifies the required size of the network (1–5, 6–10, 11–20, 21–40, or 41–65 nodes) and the spatial distribution of the nodes (circular, random, or uniform). A variety of scenarios can be constructed with the IEEE 802.11 technologies, which are described in IEEE MAC Technologies. Some examples of these implementations are depicted in Fig. 2(a), Fig. 2(b), and Fig. 2(c). The IEEE 802.11a, 11b, 11g, 11e, and 11n standards and technologies were implemented. See Tables 3 and 4 for a rundown of the simulated protocols and multi-service application configurations. According to studies [17, 18], networks with up to 65 nodes are feasible. The five node groups listed above were found to be sufficient for this study because, in larger networks, the quality of service suffers even at low traffic volumes owing to the limited bandwidth available.

4.2 Computational System Structure

Phase II, shown in the lower section of Fig. 1, presents the system calculations and the mathematical model. The inputs to the algorithm are the cumulative distribution function (CDF) and the QoS threshold values for each application. The number of performance metrics met in each case is then calculated mathematically. The following definitions are needed to describe the calculations and results for each of the scenarios above.
QoS Performance Metric (QPM): For each performance criterion n, as illustrated in Fig. 3, the QPM is the value obtained by evaluating the CDF F(n) at the QoS metric Parameter Threshold Value (PTV), as expressed in (1).

\[ \mathrm{QPM}_n = F(\mathrm{ptv}) \tag{1} \]

QoS Fitness Metric (QFM): The QFM weights the QPM of each QoS metric parameter by its importance coefficient ICP (H = 1, M = 0.5, or L = 0.1), as expressed in (2).

\[ \mathrm{QFM}_n = \mathrm{QPM}_n \cdot \mathrm{ICP} \tag{2} \]
Fig. 1. Proposed Algorithm Flowchart.
Summing the QFMs over the n QoS metric parameters of the application (delay, jitter, throughput, and losses) gives the Application Fitness Metric (AFM) for the j-th IEEE 802.11 technology, as expressed in (3).

\[ \mathrm{AFM}_j = \sum_{n=1}^{4} \mathrm{QFM}_n \tag{3} \]

Table 3. Transmission Model Parameters via Email

| Parameters | Values |
|---|---|
| Send Inter-arrival Time (sec) | exponential (360) |
| Receive Inter-arrival Time (sec) | exponential (360) |
| E-Mail Size (bytes) | 20000 |
| Symbolic Server Name | Email Server |
| Types of service (TOS) | Best Effort |

Table 4. Transmission Model Parameters via HTTP

| Parameters | Values |
|---|---|
| HTTP Specification | HTTP 1.1 |
| Page Interval Time (sec) | Exponential (60) |
| Types of service (TOS) | Best Effort |
Each AFM-based evaluation of a network design ranks the six IEEE technologies. As stated before, the OPNET Modeler simulation produces a CDF F(n) [19], which is then evaluated at the PTV of each QoS metric parameter for every application:
1. If ptv ∈ F(n) (the threshold falls within the range of the distribution): the QPM equals the value of the CDF at the PTV for the given n. The QPM is multiplied by its ICP weight to yield the QFM, and all QFMs are added to give the AFM used to rank the IEEE 802.11 technologies.
2. If ptv > F(n) (the threshold lies above the range of the distribution): the QPM has a value of 1, and the QFM is produced from it.
3. If ptv < F(n) (the threshold lies below the range of the distribution): the QPM is zero, and the QFM is produced from it.

Aside from the packet loss metric, all QoS metrics are computed as described above. In OPNET Modeler, the packet loss parameter is a Boolean value (either 0.0 or 1.0) representing packet acceptance or rejection, whereas this study requires a measurable loss rate. A MATLAB script was therefore written to determine the percentage of packets lost by each application; packet loss is evaluated separately for each application. The packet loss rate ω_i for an application on node i is defined as the number of packets lost k_i relative to the total number of packets ρ_i sent by that node, expressed as a percentage in (4).

\[ \omega_i = \frac{k_i}{\rho_i} \times 100\% \tag{4} \]
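As a minimal sketch of Eqs. (1)–(4), the Python snippet below evaluates an empirical CDF at a threshold to obtain the QPM, weights it by the ICP to obtain the QFM, and sums the QFMs to obtain the AFM. The function names, the empirical CDF built from raw samples, and the sample values are illustrative assumptions; the study itself takes the CDF from OPNET Modeler and computes packet loss in MATLAB. The sketch applies Eq. (1) as written, which suits metrics where smaller values are better (delay, loss).

```python
from bisect import bisect_right

# Importance coefficients from the text: H = 1, M = 0.5, L = 0.1.
ICP = {"H": 1.0, "M": 0.5, "L": 0.1}


def qpm(samples, ptv):
    """Eq. (1): empirical CDF of the samples evaluated at the threshold ptv.

    Returns 1.0 when ptv is above every sample and 0.0 when it is below
    every sample, which matches cases 2 and 3 in the text.
    """
    ordered = sorted(samples)
    return bisect_right(ordered, ptv) / len(ordered)


def qfm(samples, ptv, importance):
    """Eq. (2): QPM weighted by the importance coefficient."""
    return qpm(samples, ptv) * ICP[importance]


def afm(metrics):
    """Eq. (3): sum of QFMs over (samples, ptv, importance) triples."""
    return sum(qfm(s, ptv, imp) for s, ptv, imp in metrics)


def packet_loss_rate(lost, sent):
    """Eq. (4): packets lost as a percentage of packets sent by a node."""
    return lost / sent * 100.0


# Hypothetical measurements for one technology (not data from the study).
delay_s = [0.2, 0.4, 0.6, 0.8, 1.2]   # response times in seconds
loss_pct = [2.0, 5.0, 8.0, 12.0]      # per-node packet loss rates in percent

score = afm([(delay_s, 1.0, "M"), (loss_pct, 10.0, "L")])
print(round(score, 3))
```

Repeating the calculation for each technology and spatial distribution, then ranking by AFM, would mirror the selection step of the flowchart in Fig. 1.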
Fig. 2. Network Architectures for E-mail and HTTP Across three Spatial Distributions (a) BSS, (b) ESS, (c) IBSS
5 Result Analysis Review

Based on the generated table, this section describes the options available to the client (user) as given by the algorithm output. The preferences indicate that the best technological performance can be achieved with any of the three network designs. The results for best-effort services are divided into two parts (HTTP and E-mail). The room dimensions used for all of the modelling and simulations range from 2 × 3 m to 10 × 14 m. Depending on whether an AP exists, either the generic flowchart or the IBSS flowchart is presented in place of the original results tables.
• If there is at least one AP in the network, the proposed algorithm shown in Fig. 1 applies and its output is shown in Fig. 4 and Fig. 6. Both infrastructure architectures (ESS and BSS) are covered by this scenario.
Fig. 3. A Quality Performance Measurement Approach to Responses
• If the network is set up without access points, the proposed flowchart in Fig. 1 applies and the IBSS results are shown in Fig. 5 and Fig. 7.

Both findings depend on the network's size, which can range from one to sixty-five nodes.

5.1 HTTP Performance

Both flowcharts produce a similar set of results, which can be summarized for the five groups of nodes in a network.
1. Fig. 4 presents the generic flowchart, which indicates that both BSS and ESS architectures offer the most effective output for all six technologies in the first, second, third, and fourth categories. For IBSS, Fig. 5 shows limited options for clients in the first group of nodes. Specifically, 802.11 is the optimal choice for circular distribution, and 802.11b suits random distribution because its packet loss performance metric is marginally higher, at roughly 0.698. The second group of nodes has several options in IBSS. First, 802.11 is the most suitable technology for all spatial distributions. Second, if the network uses random or circular distribution only, both 802.11g and 802.11b are suitable.
2. BSS and ESS offer several options for the third category (where 20 ≥ N > 10). In BSS, all six technologies are effective for all three spatial distributions. In ESS, the IEEE 802.11 family (802.11, 11a, 11b, and 11g) is the preferred choice because of its extensive geographic coverage. As shown in the IBSS flowchart in Fig. 5, all IEEE technologies perform well for the third and fourth categories.
3. In the generic flowchart, ESS is the best architecture for the fifth group (large network), where 65 ≥ N > 40. The best option is IEEE 802.11b with a uniform configuration. When using the IBSS flowchart, however, the user is presented with
a wide variety of decision points from which to select. To begin, 802.11g's optimal configuration is exclusively circular distribution. Figure 5 illustrates that the randomly configured 802.11b technology is the second-best option.
Fig. 4. Generic Proposed Algorithm for HTTP. [Figure panels: BSS efficiency and ESS efficiency for 5, 10, 20, 40, and 65 HTTP nodes. Each panel plots the IEEE 802.11 technologies (802.11, 802.11a, 802.11b, 802.11g, 802.11e, 802.11n) with one series per spatial distribution (C, U, R).]
Fig. 5. The Findings for HTTP from IBSS. [Figure panels: IBSS efficiency for 5, 10, 20, 40, and 65 HTTP nodes. Each panel plots the IEEE 802.11 technologies (802.11, 802.11a, 802.11b, 802.11g, 802.11e, 802.11n) with one series per spatial distribution (C, U, R).]
5.2 E-Mail Performance

1. Fig. 6 illustrates that for categories 2 and 3, where 10 ≥ N > 5 and 20 ≥ N > 10, the BSS and ESS architectures deliver superior outcomes across all spatial arrangements when four technologies are applied, specifically 802.11a, 11g, 11e, and 11n. In the first category, the circular distribution performs well with both 11a and 11g in the BSS architecture, while in the ESS both 11e and 11n perform well if configured randomly, and 11a if configured uniformly.
2. In both the fourth (40 ≥ N > 20) and fifth (65 ≥ N > 40) categories, the ESS is the superior architecture for these large networks. In the fourth category, the client has a choice between 802.11a and 802.11g, and in the fifth category, the client can pick from 802.11a, 11g, 11e, or 11n (as shown in Fig. 6). Both sets of choices suit all possible spatial distributions.
Fig. 6. Generic Proposed Algorithm for E-mail. [Figure panels: BSS efficiency and ESS efficiency for 5, 10, 20, 40, and 65 Email nodes. Each panel plots the IEEE 802.11 technologies (802.11, 802.11a, 802.11b, 802.11g, 802.11e, 802.11n) with one series per spatial distribution (C, U, R).]
Fig. 7. IBSS Email Results. [Figure panels: IBSS efficiency for 5, 10, 20, 40, and 65 Email nodes. Each panel plots the IEEE 802.11 technologies (802.11, 802.11a, 802.11b, 802.11g, 802.11e, 802.11n) with one series per spatial distribution (C, U, R).]
3. As can be seen in Fig. 7, IEEE 802.11a performs best for every distribution pattern in the first four categories of the IBSS flowchart, while all technologies perform equally well in the fifth category.
6 Comparative Analysis

We briefly contrast our approach with several other algorithms described in [12, 20–22], and [23]. Table 5 compares and summarizes the following characteristics: QoS metric parameters, number of nodes, network architecture, IEEE technology, and simulation model.

Table 5. Assessments of the Proposed Method in Comparison to Other Approaches Described in the Literature

| Reference | Approach | QoS metric parameters | Number of nodes | Network Architecture | IEEE Technology | Simulation model |
|---|---|---|---|---|---|---|
| [20] | To determine the best wireless network design for real-time applications, an adaptive QoS technique is proposed | Average delay time; Jitter; Packet loss; Throughput | NA | 3G, WLAN, WiMAX | 802.3, 802.11, 802.16 | NS-2 |
| [21] | Consider the 36 Mbps 802.11a scenario and how well the EDCA 802.11e protocol supports quality of service | Average delay (of voice grows to 46 ms, and the video reaches 130 ms for 45 stations); Queue size (for voice and video traffic remains below 0.4 packets for 35 stations) | 5–45 | BSS | 802.11e | Möbius™ |
| [22] | Examine how well VoIP functions over 802.11e and 11g wireless networks | End-to-end delay (has been reduced significantly over 802.11e); Jitter (appears to be within ITU-T tolerance in all cases); Throughput (high throughput using 5 frames per packet in G.729) | 3–15 | ESS | 802.11g | OPNET |
| [12] | The effect of the RTS and fragmentation thresholds on network performance is evaluated. Moreover, the network's speed was measured via a variety of media access control (MAC) access techniques and compared to standard values | Jitter; End-to-end delay; Throughput | 10 | IBSS | 802.11e, 802.11g | OPNET |
| [23] | Three types of internet-based traffic with potential usage in smart city applications are analyzed for their effectiveness: Voice over Internet Protocol, HyperText Transfer Protocol, and File Transfer Protocol | Throughput; End-to-end delay; Jitter; Average time in FIFO queue | 25–100 | IBSS | 802.11b | EXata 3.1 |
| Present study | The best network architecture can be determined by comparing the HTTP and Email metrics of various IEEE 802.11 technologies | Packet loss; Jitter; Delay; Throughput | 1–65 | BSS, ESS, IBSS | 802.11, 802.11a, 802.11b, 802.11g, 802.11e, 802.11n | OPNET |
In contrast to the approaches discussed above, this article proposes a new method for evaluating network performance that can determine the optimal configuration for three distinct network architectures: Basic Service Set (BSS), Extended Service Set (ESS), and Independent Basic Service Set (IBSS). The suggested method has been tested with six IEEE standards (802.11, 802.11a, 802.11b, 802.11g, 802.11e, and 802.11n) for the HTTP and E-mail applications, with network sizes ranging from 1 to 65 nodes.
7 Conclusion

This research aimed to identify the optimal network architecture among BSS, ESS, and IBSS by creating a novel algorithm for evaluating HTTP and E-mail applications over various IEEE 802.11 technologies and spatial distributions. E-mail applications, which suffer significant packet loss and delay when using an ESS network, are shown to prefer a network with a high density of workstations/nodes. In addition, all spatial patterns can make use of nearly all IEEE technologies. Using 802.11a technology, IBSS is also capable of running efficiently in networks of essentially any size. As the number of nodes for both E-mail and HTTP grows to 65, the performance of the BSS begins to suffer. Furthermore, the HTTP results show that IBSS performs well in medium-sized networks across all technologies, while the E-mail results show that IBSS with 802.11a, using OFDM modulation at 5 GHz, is well suited.
References

1. Stallings, W.: Wireless Communications & Networks, 2nd edn. Pearson (2016)
2. Yang, H., Cheng, L.: Bounding Network-Induced Delays of Wireless PRP Infrastructure for Industrial Control Systems. In: ICC IEEE International Conference on Communications (ICC), pp. 1–7. IEEE (2019)
3. Coronado, E., Villalón, J., Garrido, A.: Improvements to Multimedia Content Delivery over IEEE 802.11 Networks. In: 2020 IEEE/IFIP Network Operations and Management Symposium. IEEE (2020). https://doi.org/10.1109/NOMS47738.2020.9110424
4. Ali, M., Dhimish, A.M., Alsmadi, M., Mather, P.: Algorithmic identification of the best WLAN protocol and network architecture for Internet-Based applications. J. Inf. Knowl. Manag. 19(01), 2040011 (2020). https://doi.org/10.1142/S0219649220400110
5. Ali, M., Dhimish, A.M., Alsmadi, M., Mather, P.: WLAN protocol and network architecture selection for Real-time applications. Int. J. Adv. Comput. Eng. Netw. (IJACEN) 7(11), 8–14 (2019)
6. Ali, M., Dhimish, A.M., Glover, I.: WLAN protocol and network architecture identification for service mix applications. Int. J. Adv. Comput. Eng. Netw. (IJACEN) 8(2), 24–30 (2020)
7. Mohd Ali, A., Dhimish, M., Alsmadi, M.M., Mather, P.: An algorithmic approach to identify the optimum network architecture and WLAN protocol for VoIP application. Wireless Pers. Commun. 119(4), 3013–3035 (2021). https://doi.org/10.1007/s11277-021-08383-6
8. Wei, P., Hong, Z., Shi, M.: Performance analysis of HTTP and FTP based on OPNET. In: 2016 IEEE/ACIS 15th International Conference on Computer and Information Science (ICIS), pp. 1–4. IEEE
9. Farej, Z., Jasim, M.: Performance evaluation of the IEEE 802.11n random topology WLAN with QoS application. International Journal of Electrical and Computer Engineering 10(2), 1924 (2020). https://doi.org/10.11591/ijece.v10i2.pp1924-1934
10. Genc, E., Del Carpio, L.: Wi-Fi QoS enhancements for downlink operations in industrial automation using TSN. In: 15th IEEE International Workshop on Factory Communication Systems (WFCS). IEEE (2019). https://doi.org/10.1109/WFCS.2019.8757992
11. Khiat, A., Bahnasse, A., EL Khail, M., Bakkoury, J.: Wi-Fi and WiMax QoS performance analysis on High-Level traffic using OPNET modeler. Pertanika J. Sci. Technol. 25(4) (2017)
12. Refaet, A., Ahmed, M., Aish, Q., Jasim, A.: VoIP performance evaluation and capacity estimation using different QoS mechanisms. In: 3rd International Conference on Sustainable Engineering Techniques (ICSET 2020). IOP Publishing (2020). https://doi.org/10.1088/1757899X/881/1/012146
13. IEEE Std 802.11ac: IEEE Standard for Information technology-Specific requirements-Part 11: WLAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications-Amendment 4: Enhancements for Very High Throughput for Operation in Bands below 6 GHz, Standards 802.11ac (2013). Babich, F., Comisso, M., Crismani, A.: Considerations on the multiplexing and diversity tradeoff in IEEE 802.11 networks. IET Communications 8(9), 1551–1559 (2014). https://doi.org/10.1049/iet-com.2013.0741
14. Mohd Ali, A., Dhimish, M., Mather, P.: Optimization and Selection of Best Sustainable Services in Various IEEE 802.11 Technologies. J. Green Eng. 11(1), 20–439 (2021). Ghiata, N., Marcu, M.: Measurement methods for QoS in VoIP review. In: Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT), 2011 3rd International Congress on, pp. 1–6. IEEE (2011)
15. Zaidi, T., Nand Dwivedi, N.: Voice Packet Performance Estimation through Step Network Using OPNET. In: 3rd International Conference on Computing, Communication and Security (ICCCS). IEEE (2018). https://doi.org/10.1109/CCCS.2018.8586812
16. Riverbed (2018). Retrieved from Riverbed Web site: https://www.riverbed.com/gb/index.html
17. Sammour, I., Chalhoub, G.: Evaluation of Rate Adaptation Algorithms in IEEE 802.11 Networks. Electronics 9(9), 1436 (2020). https://doi.org/10.3390/electronics9091436
18. Lawal, I., Mu'azu, A.: A new distributed model comparison to enhance VoIP QoS performance over WLAN and WiMAX network. Ilorin J. Comput. Sci. Inf. Technol. 3(1) (2020)
19. Yates, R.D., Goodman, D.J.: Probability and stochastic processes: a friendly introduction for electrical and computer engineers. John Wiley & Sons (2014)
20. Chen, J.-L., Liu, S.-W., Wu, S.-L., Chen, M.-C.: Cross-layer and cognitive QoS management system for next-generation networking. Int. J. Commun. Syst. 24(9), 1150–1162 (2011)
21. Perez, S., Facchini, H., Dantiacq, A., Cangemi, G., Campos, J.: An evaluation of QoS for intensive video traffic over 802.11e WLANs. In: 2015 International Conference on Electronics, Communications and Computers (CONIELECOMP), pp. 8–15. IEEE (2015)
22. AlAlawi, K., Al-Aqrabi, H.: Quality of service evaluation of VoIP over wireless networks. GCC Conference and Exhibition (GCCCE), pp. 1–6. IEEE (2015)
23. Jabbar, W., Ismail, M., Nordin, R.: On the performance of the current MANET routing protocols for VoIP, HTTP, and FTP applications. J. Comput. Netw. Commun. 2014 (2014). https://doi.org/10.1155/2014/154983
An Improved WRR Scheduling Algorithm for MANETs

Mukakanya Abel Muwumba, Odongo Steven Eyobu, and John Ngubiri

College of Computing and Information Sciences, Makerere University, Kampala, Uganda
[email protected], [email protected]
Abstract. Analytical Weighted Round Robin (WRR) schedulers implemented in the classical M/G/1 queue system are rare in Mobile Ad-Hoc Networks (MANETs), because these networks possess unique properties (mainly the dynamic topology) that make it a challenge to design models for the provision of Quality of Service (QoS) to multimedia applications such as voice and video. To address the problem of guaranteeing QoS for multimedia services in converged Internet Protocol (IP) networks, some scholars proposed a mathematical model of a WRR service strategy that prioritizes voice queue packets. However, by giving priority to voice queue packets, this type of scheduling is likely to starve packets in the low-priority queues (video queue packets), depending on the service distribution. We study and enhance the existing WRR service strategy and then propose Improved Weighted Round Robin (IWRR) models in the M/G/1 queue system under varying workload distributions. The proposed IWRR Algorithm is based on the existing WRR Algorithm and uses the technique of computing the partial average waiting times of the small/large voice/video packets. The main metrics measured were conditional mean response time and slowdown. The numerical results show that video packets perform poorly compared to voice packets in the EWRR Algorithm. We also compared the performance of the WRR Algorithms experimentally under two service distributions. The numerical results revealed that the IWRR exhibited superior performance when transmitting video packets.

Keywords: M/G/1 · Quality of Service · Video and Voice

1 Introduction
In the recent past, innovators, researchers, and scholars in wireless networks have developed a viable technology, popularly referred to as MANETs, that is able to provide connectivity even when wireless infrastructure is nonexistent or disabled. MANETs are the category of wireless networks that are infrastructure-less and do not require base stations [18]. Figure 1 shows a MANET with no fixed base stations, in which every node must cooperate in forwarding packets in the network [8].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): SAI 2023, LNNS 711, pp. 1000–1021, 2023. https://doi.org/10.1007/978-3-031-37717-4_66
WRR Scheduling Algorithm
Fig. 1. Example of Mobile Ad-Hoc Network
MANETs have unique characteristics that make them popular:
(i) Infrastructure-less nature - MANETs are formed through collaboration between independent peer-to-peer nodes that communicate with other nodes for a particular purpose [1].
(ii) Easy and rapid deployment - MANET technology has several advantages over infrastructure-based wireless networks, including ease of deployment, speed of deployment, and decreased dependence on a fixed infrastructure. MANETs are becoming popular because they can form a network instantly without fixed base stations or system administration [13].
(iii) Bandwidth constraints and variable link capacity - MANET nodes are connected by wireless links that have much smaller bandwidth than wired links [1].
(iv) Multi-hop communication - A message from a source node to its destination is transmitted via multiple nodes because of the limited transmission radius [24]. Within a MANET, every node acts as a router and forwards packets from other nodes to facilitate multi-hop routing [7].
(v) Constrained resources (light-weight terminals) - A large proportion of MANET nodes are small hand-held devices with limited power (battery operated), processing capabilities, and storage capacities. Notable examples range from laptops, smart phones, and Personal Digital Assistants (PDAs) to cell phones.
(vi) Short-range connectivity - MANETs depend on Radio Frequency (RF) or Infrared (IR) technology for connectivity, both of which are generally used for short-range communications.

Despite the attractive applications and characteristics of MANETs, several challenges and issues must be studied carefully before wide commercial deployment. They are listed below [22,23]:
(i) Limited bandwidth - Wireless links continue to have significantly lower capacity than infrastructure networks.
(ii) Dynamic topology - Dynamic topology membership may disturb
the trust relationship among nodes.
(iii) Time-varying wireless link characteristics - The terminals communicate via a channel that is subject to fading, noise, interference, and path loss, and that has low bandwidth compared to wired networks.
(iv) Battery constraints and power management - The nodes that constitute these networks have restrictions on the power source in order to maintain the portability, size, and weight of the device.
(v) QoS provision - The transmission of multimedia requires high bandwidth, low delay, high Packet Delivery Ratio (PDR), and high reliability. Transmitting real-time video content such as video streaming or video conferencing over a MANET is a challenging task, because multimedia applications are delay sensitive and require an acceptable level of QoS.

This study is motivated by the challenge MANETs face in satisfactorily guaranteeing QoS to multimedia traffic. In many practical situations there are not adequate resources to guarantee QoS to user requests (which can take the form of packets, applications, jobs, or customers) in communication networks or computer systems. This undesirable situation can result in large user response times. The response time (a.k.a. flow time or sojourn time) of a user request is the time from when the request arrives until it completes service. Normally, response time is viewed as waiting time plus service time. The question is: "How can we reduce user response times in MANETs without purchasing additional resources such as Random Access Memory (RAM), Central Processing Units (CPUs), and servers, which cost money?" Simply serving user requests in the right order is one way of reducing mean delay by an order of magnitude, and this is what is commonly referred to as scheduling [10]. Scheduling algorithms are vital to MANETs in providing QoS to the different user requests. Some of the earlier solutions for MANETs are presented in [3,4,9,11,17]. In particular, Hottmar et al. [12] proposed a mathematical model of a WRR service strategy to address the problem of guaranteeing QoS for multimedia services in converged IP networks. Our study is closely related to Hottmar's WRR service strategy. We propose an IWRR Algorithm based on the existing WRR Algorithm. The main metrics measured were conditional mean response time and slowdown. The numerical results show that video packets perform poorly compared to voice packets in the EWRR Algorithm. We also compared the performance of the WRR Algorithms experimentally under the Exponential and BP distributions. The numerical results revealed that the proposed IWRR exhibited superior performance when transmitting video packets. The main contribution of this study is the proposed IWRR Algorithm, which uses the technique of computing the partial average waiting times of the small/large voice/video packets and in turn shortens the conditional average response time and slowdown.

The rest of the paper is organized as follows. Section 2 presents related work. Section 3 provides a brief overview of the WRR scheduling algorithms. Section 4 presents the results. The conclusion and future work are presented in Sect. 5.
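To make the response time and slowdown metrics used in this introduction concrete, the sketch below computes them for an M/G/1 queue served in First Come First Served order, using the Pollaczek-Khinchine formula for the mean waiting time. FCFS is chosen here only as a familiar baseline; it is not the IWRR model of this paper, and the function names and parameter values are illustrative assumptions.

```python
def mg1_fcfs_mean_wait(arrival_rate, mean_service, second_moment_service):
    """Mean waiting time in an M/G/1 FCFS queue (Pollaczek-Khinchine)."""
    load = arrival_rate * mean_service
    if not 0 <= load < 1:
        raise ValueError("the queue is stable only for load < 1")
    return arrival_rate * second_moment_service / (2 * (1 - load))


def conditional_response_time(size, mean_wait):
    """Mean response time of a request of the given size: waiting plus service."""
    return mean_wait + size


def conditional_slowdown(size, mean_wait):
    """Slowdown: response time normalized by the request's own service time."""
    return conditional_response_time(size, mean_wait) / size


# Exponential service with mean 1, so E[S] = 1 and E[S^2] = 2 (assumed values).
wait = mg1_fcfs_mean_wait(arrival_rate=0.5, mean_service=1.0,
                          second_moment_service=2.0)
for size in (0.5, 2.0):
    print(size, conditional_response_time(size, wait),
          conditional_slowdown(size, wait))
```

Under FCFS the waiting time does not depend on the request size, so small requests see a larger slowdown than large ones; conditional metrics are what expose this kind of size-dependent effect.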
2 Related Work
In this section we introduce a few schedulers that have been modeled in an M/G/1 queue system [10,21], because the EWRR and IWRR Algorithms are based on queueing theory. Mor Harchol-Balter [10] compared the mean response time under different policies (First Come First Served, Processor Sharing, Shortest Job First, Preemptive Shortest Job First, and Shortest Remaining Processing Time) as a function of load for an M/G/1 queue under the Weibull job size distribution and showed that the conditional mean response time increased with load. One difference between our study and that of Harchol-Balter is that we considered the exponential and BP distributions instead of the Weibull job size distribution in the analysis of the adopted WRR Algorithm. Rai and Okopa [21] proposed SWAP models and evaluated the scheduling policy using workloads under two service distributions. The numerical results obtained from the derived models reveal that SWAP approximates Shortest Job First (SJF) better for heavy-tailed workloads than for exponentially distributed workloads. The results also showed that SWAP performs significantly better than the First Come First Served (FCFS) and Processor Sharing (PS) policies regardless of the workload distribution. Although the solution used the M/G/1 queue system, it did not focus on applications in MANETs.

We now briefly discuss the Weighted Fair Queueing (WFQ) and WRR queue scheduling algorithms, because they are widely employed for implementing differentiated services among multiple classes. We present the WFQ Algorithm and some earlier solutions because it is closely related to WRR. WFQ is a sophisticated algorithm designed by Demers et al. [6], building on the work of Nagle [18]. The WFQ Algorithm partitions the available bandwidth among queues of traffic based on their weights: the bandwidth for each service depends on the weight assigned to its queue and not on the number of packets. In other words, it can guarantee each class a bandwidth share proportional to its assigned weight, but at the cost of greatly increased implementation complexity. The weight for each packet is calculated by multiplying the packet size by the inverse of the weight of the associated queue. WFQ is based on the notion of system virtual time [6]. The WFQ scheduler assigns a start tag and a finish tag to each arriving packet and serves packets in increasing order of their finish tags. Although WFQ-like schedulers are very popular, very few analytical results are available in the literature [2]; we mention only a few of them here. The Idealized Wireless Fair Queueing (IWFQ) algorithm, proposed in [14], is one of the earliest representative packet scheduling algorithms for wireless access networks and handles the location-dependent burst errors of wireless links. The difference between IWFQ and WFQ is that when the selected packet is predicted to be in a bad link state, it is not transmitted and the packet with the next smallest virtual finish time is selected instead. The process repeats until the scheduler finds a packet in a good link state. One limitation of IWFQ is that it does not consider the delay/jitter requirements of real-time applications.
M. A. Muwumba et al.
Amongst the other earlier solutions were WFQ [19] and its variant Worst-case Fair Weighted Fair Queueing (WF2Q) [5], which have good delay and fairness properties but high implementation complexity. A localized and fully distributed fair queueing model that provides measurable and effective solutions to the queueing problems faced in MANETs was proposed in [15]. We now briefly introduce WRR scheduling, as this Algorithm is the base of our study. The WRR Algorithm can achieve service differentiation similar to that of WFQ and is a simpler scheme to implement, with similar performance. The WRR Algorithm offers numerous advantages [12]: the scheduling strategy operates in hardware and hence works at high-speed interfaces at the core and edges of the network; the WRR strategy guarantees network resources to all the service classes, which limits bandwidth starvation; WRR provides strict control over the proportion of output port bandwidth allocated to each user; and the traffic service classification provides equitable management and more stability for network applications. The WRR strategy ensures that different priorities are assigned to different queues. This is accomplished by a fair selection interval amongst active queues with minimal delay and jitter [11]. The Algorithm maintains a weighting coefficient for each queue to determine the proportion of bytes of data the system is required to deliver from the queue before it proceeds to the next queue. For a given scheduling turn in each queue, packets are sent until the transmitted number of bytes exceeds the bandwidth allocated by the queue's weighting coefficient, or the queue is empty [3,4]. The process then proceeds to the next queue. When a queue is empty, the WRR strategy transmits packets from the next queue that has packets ready to send. Four scheduling Algorithms, namely Strict Preference (SP), Round Robin (RR), WRR and WFQ, were selected and investigated in mobile ad hoc networks [16].
The results showed differences among the Algorithms in network performance metrics such as end-to-end delay and peak queue size. WRR outperformed the others regarding end-to-end delay and was also reported as the best scheduling algorithm regarding peak queue size. However, WRR registered the worst average queue time. J. Gautam et al. [9] proposed an Algorithm that improves QoS by focusing on three parameters: packet drop ratio, throughput and time delay. Hottmar and Adamec's [12] mathematical model was aimed at solving the challenge of QoS in converged IP networks. Unfortunately, starvation of video packets in favor of voice packets was observed. We summarize some of the important points on the limitations and the metrics studied in the previous solutions in Table 1. Our work is closely related to the Hottmar et al. [12] strategy. In order to close the gap, we enhance the existing WRR model in the M/G/1 queue system and then propose an Improved WRR Algorithm that utilizes the technique of computing the partial mean waiting times of the small/large voice/video packets under two service distributions.
WRR Scheduling Algorithm
Table 1. Important Points on the Limitations of the Previous Solutions

Previous studies                        Metric studied                     Limitation
Mor Harchol-Balter [10]                 Mean response time                 Focused on scheduling policies
Rai and Okopa [21]                      Mean response time and slowdown    The models were not validated
H. F. Mahdi et al. [16]                 Queue average time                 Poor performance of WRR
C. Bennett and H. Zhang [5]             Delay                              High implementation complexity
A. K. Parekh and R. G. Gallager [19]    Delay                              High implementation complexity
Hottmar and Adamec [12]                 Average delay                      Starvation of video packets

3 The Weighted Round Robin Scheduling Algorithms
3.1 Overview
Different classes/users require different bandwidth allocations; hence the WRR Algorithm allows several packets to be processed in a given queue each time that queue receives a scheduling turn. The number of packets to receive service in each scheduling turn is determined by the weight of the queue. The weight may be a percentage of the interface bandwidth, thereby reflecting the service differences between the queues and the traffic classes assigned to those queues, or the number of timeslots assigned to the packets, or the number of processors allocated. In the WRR procedure, packets are classified into queues and then allocated a proportion of the bandwidth. The service order that is followed is Round Robin, as shown in Fig. 2. The scheduler serves the high-priority packets, and the packets with lower bandwidth allocations are also serviced. The starvation challenge is overcome by guaranteeing that all users can access a minimum bandwidth allocation [16].
Fig. 2. The WRR Scheduling Algorithm
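The round-robin service order with per-queue weights described above can be illustrated with a minimal Python sketch. It is an illustration only, not the authors' Matlab implementation: the function name, the dictionary layout and the sample packets are hypothetical, and the weight is assumed to be the number of packets served per scheduling turn.

```python
from collections import deque


def weighted_round_robin(queues, weights):
    """Return the service order of a simple packet-count WRR scheduler.

    queues  -- dict: queue name -> list of packets (hypothetical layout)
    weights -- dict: queue name -> packets served per scheduling turn
    """
    if any(weights[name] < 1 for name in queues):
        raise ValueError("every queue needs a weight of at least 1")
    pending = {name: deque(packets) for name, packets in queues.items()}
    order = []
    # Keep cycling over the queues until every queue is empty.
    while any(pending.values()):
        for name, queue in pending.items():
            # Serve at most `weight` packets, then move to the next queue.
            for _ in range(weights[name]):
                if not queue:
                    break  # empty queue: the turn passes on immediately
                order.append((name, queue.popleft()))
    return order


# Illustrative traffic: voice gets two packets per turn, video gets one.
served = weighted_round_robin(
    {"voice": ["v1", "v2", "v3"], "video": ["d1", "d2", "d3", "d4"]},
    {"voice": 2, "video": 1},
)
```

Because every queue with a positive weight receives a turn in each cycle, no queue is starved completely, which is the minimum-allocation guarantee mentioned above.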
The WRR Algorithm is not aware of the true sizes of the packets in the buffers that are to be scheduled. The queues and scheduling are generally optimized for mean packet size. However, the sizes are all just estimates and have no true meaning with regard to the actual traffic mix in each queue. This operation of
WRR is both an advantage and a challenge. Unlike WFQ, which has to transform bits into bandwidth scheduling, WRR does not have complex resources that require state computation; hence it is fairly simple to implement. The result is a solution well suited for handling a large number of flows and sessions, making WRR something of a core QoS solution that can deal with large volumes of traffic and with congestion. The drawback of WRR is that it is blind when it comes to bandwidth allocation and suffers severe limitations when scheduling variable-sized packets.
3.2 Performance Metrics, Mathematical Notations and Expressions
We used the conditional average response time and slowdown as the main performance metrics in the analysis. Some mathematical notations and expressions in Rai and Okopa [21] were used. In Table 2, we present the mathematical notations and expressions that were used in the study.

Table 2. Mathematical Notations and Expressions

Description                                                         Notation
Average packet arrival rate                                         $\lambda$
Load for voice packets                                              $\rho_{x_{vO}}$
Load for video packets                                              $\rho_{x_{vI}}$
Total load for all packets in the system                            $\rho$
Second moment of voice packets of the service-time distribution     $\overline{x_{vO}^2}$
Second moment of video packets of the service-time distribution     $\overline{x_{vI}^2}$
Conditional average response time                                   $T(x)$
Conditional average response time of voice packets                  $T(x_{vO})$
Conditional average response time of video packets                  $T(x_{vI})$
Average waiting time of voice packets                               $W(x_{vO})$
Average waiting time of video packets                               $W(x_{vI})$
Probability density function (pdf)                                  $f(x)$
Cumulative distribution function (cdf)                              $F(x)$
Survival function                                                   $F^c(x) = 1 - F(x)$

3.3 Adopting the Weighted Round Robin Algorithm into MANETs
We start from the well-known point of view, the general relationship for the mean waiting time of the existing WRR Strategy [12] as given in Eq. 1.
\[ W_i = \frac{(BR - BR_i)\, E(X_i)}{BR_i \cdot 2(1 - \rho_{iS}) \cdot BR} \tag{1} \]
We indicate the following changes: The first adoption is that we assume two queues, i.e., queue 1 for voice and queue 2 for video packets. Based on Eq. 1, $W_i$ is the average waiting time of packets of the ith queue; $BR$, the overall transfer capacity of the output interface, is replaced with a continuous random variable x (which also represents packet size); depending on the traffic type and the service-time distribution, $E[X_i]$, the average size of the packets, is replaced with the second moments $\overline{x_{vO}^2}$ and $\overline{x_{vI}^2}$ of voice and video packets of the service-time distribution; $BR_i$, the proportionate ratio sizes out of the overall capacity $BR$, are replaced with the weights ω1 = 7808 or ω2 = 1712 respectively; and the load coefficient of processed traffic, $\rho_{iS}$, is replaced with $\rho_{x_{vO}}$ or $\rho_{x_{vI}}$. (ii) The second adoption is that we consider an M/G/1 queue.
Fig. 3. A Typical M/G/1 Queue
The M/G/1 queue is shown in Fig. 3, and we assume that the user requests (voice and video packets) arrive according to a Poisson process with mean rate λ packets per second. Fig. 3 also depicts user requests as large and small boxes, implying that some voice and video packets are large and others are small, so that their service demands differ. We use a continuous random variable x that has an exponential distribution with the probability density function (pdf) [20,21] given as follows:
\[ f(x) = \mu e^{-\mu x}, \quad x \ge 0, \ \mu > 0. \]
i. We use the exponential distribution to get expressions for the second moment of voice packets of size less than or equal to $x_{vO}$ and of video packets of size less than or equal to $x_{vI}$, respectively:
\[ \overline{x^2}_{x_{vO}} = \int_0^{x_{vO}} t^2 \mu e^{-\mu t}\, dt \tag{2} \]
\[ \overline{x^2}_{x_{vI}} = \int_0^{x_{vI}} t^2 \mu e^{-\mu t}\, dt \tag{3} \]
The load $\rho_{x_{vO}}$ associated with voice packets of sizes less than or equal to $x_{vO}$ is given as:
\[ \rho_{x_{vO}} = \lambda \int_0^{x_{vO}} t f(t)\, dt \tag{4} \]
Likewise, the load $\rho_{x_{vI}}$ associated with video packets of sizes less than or equal to $x_{vI}$ is given as:
\[ \rho_{x_{vI}} = \lambda \int_0^{x_{vI}} t f(t)\, dt \tag{5} \]
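For the exponential distribution, the truncated integrals in Eqs. 2 to 5 have closed forms obtained by integration by parts. The Python sketch below evaluates them and cross-checks one value against a simple midpoint rule. The function names are hypothetical, and the service rate μ = 1/1800 is an assumption chosen only so that λ/μ equals the system load of 0.9 used later in the paper; it is not a value stated by the authors.

```python
import math


def exp_partial_second_moment(a, mu):
    """Integral of t^2 * mu * exp(-mu*t) over [0, a] (Eqs. 2 and 3)."""
    tail = math.exp(-mu * a) * (a**2 + 2.0 * a / mu + 2.0 / mu**2)
    return 2.0 / mu**2 - tail


def exp_partial_load(a, lam, mu):
    """lam * integral of t * mu * exp(-mu*t) over [0, a] (Eqs. 4 and 5)."""
    return lam * (1.0 / mu - math.exp(-mu * a) * (a + 1.0 / mu))


def midpoint_integral(g, lo, hi, steps=20_000):
    """Plain midpoint rule, used only as a numerical cross-check."""
    h = (hi - lo) / steps
    return h * sum(g(lo + (i + 0.5) * h) for i in range(steps))


lam = 1.0 / 2000.0   # arrival rate for the exponential case
mu = 1.0 / 1800.0    # ASSUMPTION: chosen so that lam / mu = 0.9
a = 2500.0           # illustrative upper limit (packet-size threshold)

second_moment = exp_partial_second_moment(a, mu)
load_small = exp_partial_load(a, lam, mu)
numeric = midpoint_integral(lambda t: t * t * mu * math.exp(-mu * t), 0.0, a)
```

As the upper limit grows, the two functions approach the familiar full-range values 2/μ² and λ/μ.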
ii. We also use a continuous random variable x that has the BP distribution, denoted in short form as BP$(k, L, \alpha)$, where k and L are the minimum and the maximum packet sizes and α is the power law constant. We write the pdf of the BP distribution [20,21] as follows:
\[ f(x) = \frac{\alpha k^{\alpha}}{1 - (k/L)^{\alpha}}\, x^{-\alpha-1}, \quad k \le x \le L, \ 0 \le \alpha \le 2. \]
For voice packets of sizes less than or equal to $x_{vO}$ and video packets of sizes less than or equal to $x_{vI}$, the second moments under the BP distribution are given respectively by:
\[ \overline{x^2}_{x_{vO}} = \int_k^{x_{vO}} t^2 f(t)\, dt \tag{6} \]
\[ \overline{x^2}_{x_{vI}} = \int_k^{x_{vI}} t^2 f(t)\, dt \tag{7} \]
Note: the load is $\rho = \lambda \overline{x}$.
The load $\rho_{x_{vO}}$ for voice packets of sizes less than or equal to $x_{vO}$ under the BP distribution is given as:
\[ \rho_{x_{vO}} = \lambda \int_k^{x_{vO}} t f(t)\, dt \tag{8} \]
The load $\rho_{x_{vI}}$ for video packets of sizes less than or equal to $x_{vI}$ under the BP distribution is given as:
\[ \rho_{x_{vI}} = \lambda \int_k^{x_{vI}} t f(t)\, dt \tag{9} \]
iii. The third adoption is that we use the Hottmar et al. [12] configuration parameters, i.e., $E[X_1]$ = 1712 bit and $E[X_2]$ = 7808 bit, to determine the weights assigned to the queues. We find it appropriate to determine the queue weights using these parameters instead of the guaranteed proportions of the overall capacity. It follows that the expressions for the average waiting times of voice and video packets respectively are given by:
\[ W(x_{vO}) = \frac{x - \omega_1 \overline{x_{vO}^2}}{\omega_1 \cdot 2(1 - \rho_{x_{vO}})} \tag{10} \]
and
\[ W(x_{vI}) = \frac{x - \omega_2 \overline{x_{vI}^2}}{\omega_2 \cdot 2(1 - \rho_{x_{vI}})} \tag{11} \]
iv. The fourth adoption is that we assume the residence time is $\frac{x}{1 - \rho_{x_v}}$. Quite often the conditional average response time consists of two components, namely the waiting time and the residence time. It follows that the expressions for the conditional average response times of voice and video respectively are given as:
\[ T(vO) = \frac{x}{1 - \rho_{x_{vO}}} + \frac{x - \omega_1 \overline{x_{vO}^2}}{\omega_1 \cdot 2(1 - \rho_{x_{vO}})} \tag{12} \]
and
\[ T(vI) = \frac{x}{1 - \rho_{x_{vI}}} + \frac{x - \omega_2 \overline{x_{vI}^2}}{\omega_2 \cdot 2(1 - \rho_{x_{vI}})} \tag{13} \]
We summarize the EWRR Scheduling in pseudo code Algorithm 1.
Algorithm 1 The Pseudo Code of EWRR
Require: Consider an M/G/1 queue system
  Classify incoming traffic into queue 1 (voice) and queue 2 (video);
  if traffic in the queues is voice or video then
    Find the second moment of queue 1 and queue 2 packets;
    Find the load due to queue 1 and queue 2 packets;
    Determine the weights assigned to each queue;
    Determine the average waiting time of each queue;
    for each queue do
      Compute the cond. average response time and slowdown;
    end for
  end if
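The first two steps of Algorithm 1 (second moment and load) can also be evaluated in closed form under the BP distribution, because the integrand in Eqs. 6 to 9 is a pure power of t. The Python sketch below is illustrative only: the function name is hypothetical, and the parameter values are the ones listed in the paper's experiment table (BP(10, 5000, 1.1), λ = 0.0124, threshold 1526.7) used here merely as sample inputs.

```python
def bp_partial_moment(n, a, k, L, alpha):
    """Integral of t^n * f(t) over [k, a] for the BP(k, L, alpha) pdf.

    n = 2 gives the truncated second moment (Eqs. 6 and 7);
    n = 1, multiplied by the arrival rate, gives the load (Eqs. 8 and 9).
    """
    if not k <= a <= L:
        raise ValueError("upper limit must lie in [k, L]")
    if n == alpha:
        raise ValueError("n == alpha needs the logarithmic form, not covered here")
    # Normalising constant of the pdf f(t) = c * t^(-alpha - 1).
    c = alpha * k**alpha / (1.0 - (k / L) ** alpha)
    # Antiderivative of t^(n - alpha - 1) is t^(n - alpha) / (n - alpha).
    return c * (a ** (n - alpha) - k ** (n - alpha)) / (n - alpha)


k, L, alpha = 10.0, 5000.0, 1.1   # BP parameters from the experiment table
lam = 0.0124                      # arrival rate listed for the BP case
x_small = 1526.7                  # sample upper limit (threshold value)

second_small = bp_partial_moment(2, x_small, k, L, alpha)
load_small = lam * bp_partial_moment(1, x_small, k, L, alpha)
```

Setting n = 0 and a = L returns 1, which is a quick way to confirm that the normalising constant is right.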
3.4 The IWRR Algorithm
The reasons for improving the EWRR stem from the following deficiencies in WRR. According to J. Gautam et al. [9], WRR behaves like a blind scheduling policy because it is ignorant of how different packet lengths are scheduled; the scheduling strategy does not favor queues that have different sized packets. The proposed IWRR Algorithm is based on the EWRR Algorithm with the aim of addressing the above stated deficiencies. We indicate the following improvements to the enhanced Hottmar WRR Algorithm: i. The first change is that besides assuming the two queues, queue 1 for voice and queue 2 for video packets, within each queue the packets are classified into small and large packets. ii. The second change is that we use an M/G/1 system where the arrival rate is λ and x is a continuous random variable of the service-time distribution, to get the mean waiting time for the small voice packets of sizes less than or
equal to $x_{vO}$, and video packets of size less than or equal to $x_{vI}$, respectively, as given by the Pollaczek-Khinchine (PK) formula:
\[ E[W(x_{vO})] = \frac{\lambda \overline{x_{vO}^2}}{2(1 - \rho_{x_{vO}})} \tag{14} \]
\[ E[W(x_{vI})] = \frac{\lambda \overline{x_{vI}^2}}{2(1 - \rho_{x_{vI}})} \tag{15} \]
where $\overline{x_{vO}^2}$ and $\overline{x_{vI}^2}$ are the second moments of voice and video packets of the service-time distribution.
iii. The third change is that the average size of a large voice and video packet under the exponential and the BP distributions is derived as follows. Since $\rho_{x_L} = \rho - \rho_x$, and since $\rho_{x_{LvO}} = \lambda x_{LvO}$ and $\rho_{x_{LvI}} = \lambda x_{LvI}$, we obtain
\[ x_{LvO} = \frac{1}{\lambda}(\rho - \rho_{x_{vO}}), \qquad x_{LvI} = \frac{1}{\lambda}(\rho - \rho_{x_{vI}}). \]
iv. The fourth change is that we let the survival function (or reliability function) $F^c(x_{th}) = 1 - F(x_{th})$, the probability that a large video or voice packet is found in the queue, be multiplied by the average size of a large voice packet, $x_{LvO}$, or video packet, $x_{LvI}$, to get the workload due to the large packets. Hence, the expressions for the average waiting time for the large packets (voice and video) respectively are:
\[ W(x_{vO}) = x_{LvO} \cdot F^c(x_{th}) \tag{16} \]
\[ W(x_{vI}) = x_{LvI} \cdot F^c(x_{th}) \tag{17} \]
v. The fifth change is that we add the residence time $\frac{x}{1 - \rho_{x_{vO}}}$ and Eqs. 14 and 16 together. We do the same with the residence time $\frac{x}{1 - \rho_{x_{vI}}}$ for Eqs. 15 and 17 and then perform some mathematical manipulations on the combined equations as well as assigning the queue weights. The aim of this operation is to get the expressions for the conditional average response time of the voice and video packets. We use the configuration parameters for queues 1 and 2 in Sect. 3.3 to determine the queue weights (ω1 and ω2) to be assigned.
Therefore, the expressions for the conditional average response time of the voice and video packets respectively are:
\[ T(vO) = \frac{x}{1 - \rho_{x_{vO}}} + \frac{x - \omega_1 \left( \lambda \overline{x_{vO}^2} + 2(1 - \rho_{x_{vO}})\, x_{LvO} \cdot F^c(x_{th}) \right)}{\omega_1 \cdot 2(1 - \rho_{x_{vO}})} \tag{18} \]
and
\[ T(vI) = \frac{x}{1 - \rho_{x_{vI}}} + \frac{x - \omega_2 \left( \lambda \overline{x_{vI}^2} + 2(1 - \rho_{x_{vI}})\, x_{LvI} \cdot F^c(x_{th}) \right)}{\omega_2 \cdot 2(1 - \rho_{x_{vI}})} \tag{19} \]
Note: The main aim behind changes (i) to (v) is to enable us to compute the partial average waiting times of the small and large voice or video packets. These partial average waiting times are added together to get a resulting average waiting time, which is subtracted from the continuous random variable x as shown in expressions 18 and 19 to obtain the average waiting time of voice and video packets. It is this operation that is responsible for the reduction in conditional average response time and slowdown for the proposed IWRR scheduling Algorithm, as indicated in the results in Sect. 4. We summarize the proposed IWRR Scheduling in pseudo code in Algorithm 2.
Algorithm 2 The Pseudo Code of the IWRR
Require: Consider an M/G/1 queue system
  Classify incoming traffic into queue 1 (voice) and queue 2 (video);
  Classify traffic in each queue into small and large packets;
  if packet size in the queue is small or large then
    Find the second moment of each packet size;
    Find the load due to each packet size;
    Determine the weights assigned to each queue;
    Determine the average waiting time due to each packet size for each queue;
    for each queue do
      Compute the cond. average response time and slowdown for each queue;
    end for
  end if
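The partial waiting-time terms that Algorithm 2 combines can be sketched in a few lines of Python. The sketch transcribes only Eqs. 14 to 17 and the large-packet mean size from change (iii); the function names and the sample numbers are hypothetical, and the sanity check uses a textbook M/M/1 queue rather than any value from the paper.

```python
def pk_waiting_time(lam, second_moment, rho):
    """Pollaczek-Khinchine mean waiting time (form of Eqs. 14 and 15)."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("load must satisfy 0 <= rho < 1")
    return lam * second_moment / (2.0 * (1.0 - rho))


def mean_large_packet_size(lam, rho_total, rho_small):
    """Average large-packet size, (rho - rho_x) / lambda, from change (iii)."""
    return (rho_total - rho_small) / lam


def large_packet_waiting_time(mean_large_size, survival_at_threshold):
    """Workload due to large packets (form of Eqs. 16 and 17)."""
    return mean_large_size * survival_at_threshold


# Sanity check with an M/M/1 queue: lam = 0.5, mu = 1, so the second
# moment is 2 / mu^2 = 2 and rho = 0.5; the known mean wait is
# rho / (mu * (1 - rho)) = 1.
mm1_wait = pk_waiting_time(0.5, 2.0, 0.5)

# Illustrative (hypothetical) split of a total load of 0.9.
x_large = mean_large_packet_size(lam=0.5, rho_total=0.9, rho_small=0.4)
w_large = large_packet_waiting_time(x_large, survival_at_threshold=0.25)
```

Keeping the small-packet and large-packet terms as separate functions mirrors the way the IWRR computes two partial waiting times before combining them.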
4 Numerical Results
This section presents the performance evaluation of the EWRR and IWRR Algorithms under the exponential and BP distributions. In this analysis we first show the weakness of the enhanced WRR Algorithm in terms of starving video packets. The study then evaluates the performance of the WRR Algorithms with the goal of depicting the performance gains of the IWRR.
4.1 Experimental Set Up and Simulation Software
The WRR scheduling Algorithms were implemented in Matlab version R2021a installed on a Windows operating system platform. The experiments were run on a Dell Latitude 3520 laptop with the following specifications: System: Microsoft Windows 11 Pro, Version 22H2; Processor: 11th Gen Intel(R) Core(TM) i3-1115G4 @ 3.00 GHz 2.19 GHz; and 4.00 GB (3.74 GB usable) of RAM. The advantage of implementing the Algorithms in Matlab is that it allows for fast numerical performance evaluation of the Algorithms under varying workloads. We used the WRR scheduling Algorithms developed in Sect. 3 to obtain the Matlab codes for the study. During the experimental set up, we specified the number of iterations the Algorithms were expected to execute. The exponential and BP distributions were used to generate the workloads. We assumed the following typical values: 300,000 iterations; mean packet arrival rates, λ, of 1/2000 and 0.0124, and threshold values, $x_{th}$, of 1526.7 and 2500 for the exponential and BP distributions respectively; and queue weights (ω1 = 7808 and ω2 = 1712), as shown in Table 3. These parameters were chosen for illustration purposes; they can be varied depending on the user requests. The corresponding values of conditional mean response times and slowdowns were obtained and plotted against the packet sizes as shown next.

Table 3. Parameters for the Experiments

Parameter                                  Values
Average packet arrival rate, λ for exp     1/2000
Average packet arrival rate, λ for BP      0.0124
Queue weights (ω1 and ω2)                  7808 and 1712
System load, ρ                             0.9
Range of values for x BP                   x=10:0.01:300,000;
Range of values for x exp                  x=0:0.1:300,000;
BP (k, L, α)                               BP (10, 5000, 1.1)
Threshold value, xth for BP                1526.7
Threshold value, xth for exp               2500
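Workloads of the kind described above can be generated by inverse-transform sampling. The following Python sketch is a stand-in for illustration and is not the authors' Matlab code: the function names are hypothetical, the BP quantile is derived from the BP cdf F(x) = (1 − (k/x)^α)/(1 − (k/L)^α), and the exponential mean of 1800 is an assumption (ρ/λ = 0.9 × 2000).

```python
import random


def bp_quantile(u, k, L, alpha):
    """Inverse cdf of BP(k, L, alpha) for u in [0, 1]."""
    c = 1.0 - (k / L) ** alpha
    return k / (1.0 - u * c) ** (1.0 / alpha)


def sample_bp(n, k, L, alpha, rng):
    """Draw n BP-distributed packet sizes by inverse-transform sampling."""
    return [bp_quantile(rng.random(), k, L, alpha) for _ in range(n)]


rng = random.Random(0)  # fixed seed so the run is repeatable
bp_sizes = sample_bp(10_000, 10.0, 5000.0, 1.1, rng)

# Exponential sizes: expovariate takes the rate, i.e. 1 / mean.
# ASSUMPTION: mean size 1800, chosen so that lambda * mean = 0.9.
exp_sizes = [rng.expovariate(1.0 / 1800.0) for _ in range(10_000)]
```

With α = 1.1 most BP samples fall close to the lower bound k, while a few reach sizes near L, which is the heavy-tailed behaviour the BP workload is meant to capture.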
4.2 Evaluation of the EWRR Algorithm
Figure 4 shows T(x) vs x under the exponential distribution for the EWRR scheduling Algorithm when ρ is 0.9. We observe that T(x) increases rapidly as the video packet size grows, but there is only a gentle increase of T(x) with the voice packet size. The result reveals that video packets are starved in favor of voice packets, which is a weakness of the EWRR scheduling Algorithm. This result also confirms what previous solutions such as Hottmar et al. [12] found, namely that the size of the weight has a direct relationship with the average waiting time.
In Fig. 5(a) and 5(b) we show the performance of voice and video packets for EWRR Algorithm in terms of T(x) Vs load, ρ when x is 20000 bytes and 70000 bytes respectively under the exponential distribution. We observe that with fixed packet sizes, the conditional mean response time grows most rapidly in the direction of the growth of load while a gentle increase is realized for
Fig. 4. T(x) Vs x at ρ = 0.9 for Exponential Distribution
Fig. 5. T(x) Vs ρ for Exponential Distribution
the voice packet size. The trend of the graphs obtained shows a close relationship with those of previous solutions such as Mor Harchol-Balter [10]. The results obtained further reveal that video packets are starved in favor of voice packets. Figure 6 shows S(x) vs x under the exponential distribution for the EWRR scheduling Algorithm when ρ is 0.9. We observe that S(x) increases with packet size. We note that S(x) for video packets is higher than that of voice packets, so the performance of video packets is poorer than that of voice packets. This trend is expected since S(x) is the ratio of the conditional average response time to the packet size.
Fig. 6. S(x) Vs x at ρ = 0.9 for Exponential Workloads
Fig. 7. T(x) Vs x at ρ = 0.9 for BP (10, 5 × 10^3, 1.1)
Figure 7 shows T(x) vs x under the BP (10, 5 × 10^3, 1.1) for the EWRR Scheduling Algorithm when ρ is 0.9. The results reveal a higher T(x) for video packets compared to voice packets. This is a clear indication that video is being starved in favor of voice. The difference is much more pronounced when the packet sizes are large. We further note that for the same packet size, T(x) for video packets is still higher than that of voice packets. This result again confirms the finding of Hottmar et al. [12] that the size of the weight has a relationship with the average waiting time.
Fig. 8. S(x) Vs x at ρ = 0.9 for BP (10, 5 × 10^3, 1.1)
In Fig. 8 we present the result for S(x) vs x under the BP (10, 5 × 10^3, 1.1) for the EWRR Scheduling Algorithm when ρ is 0.9. We observe that video packets registered a lower S(x) compared to that of voice packets. Recall that S(x) is the ratio T(x)/x. The results show that under the EWRR Scheduling Algorithm, the video packets perform poorly in terms of T(x) and S(x) under the two service distributions at high system load. We can therefore conclude that video packets are starved under the EWRR strategy and that this Algorithm is not scalable.
4.3 Evaluation of the Algorithms Under Exponential Distribution
In Fig. 9(a) and Fig. 9(b), the performance of voice and video packets under the EWRR and IWRR Scheduling Algorithms when ρ is 0.9 is compared in terms of T(x) vs x under the exponential distribution. We note that T(x) increases with packet size. The graph of T(x) for voice packets under the EWRR scheduling Algorithm is steeper than that under the IWRR scheduling Algorithm. From the result, we conclude that the IWRR Algorithm outperforms EWRR in scheduling voice and video packets. This is because the EWRR scheduling Algorithm is unaware of the packet sizes in the queues, and the mean waiting time is computed as one combined entity for the packets. The IWRR scheduling Algorithm, in contrast, is aware of the sizes: the voice and video packet sizes are known in advance because the traffic within each queue is classified into small and large packets, and therefore the average waiting time is computed as two entities that are later combined. In Fig. 10 we compare the performance of voice and video packets for the IWRR scheduling Algorithm, presenting T(x) vs x under the exponential distribution when ρ is 0.9. The results show that the performance of voice and video packets is very close or nearly the same. This improvement in the performance of video packets, at the cost of no additional resources, indicates that the IWRR scheduling Algorithm is scalable. Hottmar et al. [12] point out that one advantage of classifying traffic by service class is that it promotes more equitable management and increased stability for network applications rather than the use of priorities or preferences,
Fig. 9. T(x) Vs x at ρ = 0.9 for Voice and Video Packets for Exponential Workloads
and this benefit is clearly observed from this result. The study by J. Gautam et al. [9] on the WRR Algorithm revealed that when one queue has mostly small packets while another has mostly big packets, more bandwidth is allocated to the queue with big packets. On the other hand, if less bandwidth is allocated to the queue with big packets, then the big packets are starved in favor of small packets. Also, services that have very strict demands on delay and jitter can be affected by the scheduling order of other queues because WRR offers no priority levels in its scheduling. However, our result seems to contradict the claim of J. Gautam et al. From this result, we conclude that the IWRR scheduling Algorithm is superior to the EWRR scheduling Algorithm. In Fig. 11, we present the results of the IWRR scheduling Algorithm for S(x) vs x when ρ is 0.9 for voice and video packets under exponentially distributed workloads. We observe that there is no significant variation in performance across all classes of packets. The result reveals that at high system load no significant performance degradation occurs for voice and video packets. Therefore, we conclude that the IWRR scheduling Algorithm is scalable under exponentially distributed workloads.
Fig. 10. T(x) Vs x at ρ = 0.9 for Exponential Distribution
4.4 Evaluation of the Algorithms Under Heavy Tailed Distribution
In Fig. 12(a) and Fig. 12(b), the performance of voice and video packets under the EWRR and IWRR Scheduling Algorithms is compared in terms of T(x) vs x when ρ is 0.9 under heavy-tailed workloads. As expected, T(x) increases with packet size. We observe better performance of video and voice packets for the IWRR Algorithm compared to the EWRR scheduling Algorithm. We note that there is performance degradation for voice and video packets in the EWRR scheduling Algorithm. This is attributed to the fact that the IWRR Algorithm is aware of the packet size and utilizes a technique of computing the mean waiting times of small and large packets as separate components, which are later combined into one component. The sum of these partial average waiting times is higher than in the EWRR Scheduling Algorithm. The resulting average waiting time is obtained as the difference between the continuous random variable x and the average waiting times of small plus large packets, as shown in Eqs. 18 and 19. The resulting average waiting time for the voice and video packets is therefore lower in the IWRR Algorithm. A reduction in average waiting time results in a lower T(x) for voice and video packets in the IWRR Algorithm. We further note a slight disparity for the video packets compared to voice packets. One strong aspect of the IWRR Algorithm is that it can take bursty traffic
Fig. 11. S(x) Vs x at ρ = 0.9 for Exponential Workloads
Fig. 12. T(x) vs x at ρ = 0.9 for BP (10, 5 × 10^3, 1.1) for Voice and Video Packets
Fig. 13. T(x) vs x at ρ = 0.9 for BP (10, 5 × 10^3, 1.1)
that conforms to the exponential and BP distributions. By splitting the average waiting times of the small and large packets in this Algorithm, we build the capability of handling bursty traffic into our analytical model. Figure 13 shows the performance comparison of voice and video packets for the IWRR scheduling Algorithm. The results for T(x) vs x under the BP (10, 5 × 10^3, 1.1) when ρ is
0.9 are presented. We note that the disparity in performance between video and voice packets is small. This result clearly indicates that the IWRR Algorithm can take on bursty traffic in conformity with exponential and heavy-tailed workloads. The results of the IWRR scheduling Algorithm for S(x) vs x when ρ is 0.9 for voice and video packets under the BP (10, 5 × 10^3, 1.1) are presented in Fig. 14. We again observe that there is no significant variation in performance for both voice and video packets.
Fig. 14. S(x) vs x at ρ = 0.9 for BP (10, 5 × 10^3, 1.1)
5 Conclusion and Future Work
The study proposed an IWRR Algorithm that utilizes the technique of computing the partial average waiting times of the small/large voice/video packets. The aim of the technique is to shorten the conditional mean response times and slowdowns of the voice and video packets. The proposed IWRR Algorithm performs better than the EWRR Algorithm in terms of conditional mean response time and slowdown. The main idea behind shortening the conditional mean response times and slowdowns of the voice and video packets is to reduce the starvation of low priority packets. From the numerical results we conclude that the aim and objectives of the study were achieved. Finally, in our future work, we intend to optimize the performance of the scheduler.
Acknowledgment. This work was funded by the Government of Uganda through the Makerere University Research and Innovation Fund (Grant Number: MAK-RIF Round 4, 2022/23). Special thanks also go to the Management of the Uganda Business and Technical Examinations Board (UBTEB) for the generous support which made it possible to disseminate these research findings at the Computing Conference held in June 2023 in London, UK.
References 1. Al-Bahadili, H.: An optimized scheduling scheme in OFDMA WiMax networks (2012) 2. Mahmood, D.A., Horv´ ath, G.: A simple approximation for the response times in the two-class weighted fair queueing system. In: Thomas, N., Forshaw, M. (eds.) ASMTA 2017. LNCS, vol. 10378, pp. 125–137. Springer, Cham (2017). https:// doi.org/10.1007/978-3-319-61428-1 9 3. Orda, A.: Routing with end-to-end QoS guarantees in broadband networks. Proc. IEEE/ACM Trans. Network. 7, 365–374 (1999) 4. Orda, A., Sprintson, A.: Precomputation schemes for QoS routing. Proc. IEEE/ACM Trans. Network. 11, 578–591 (2003) 5. Bennett, C.R., Zhang, H.: Wf2q: worst-case fair weighted fair queueing. In: Proceedings of IEEE INFOCOM’96, pp. 120–128, March 1996 6. Demers, A., Keshav, S., Shenker, S.: Analysis and simulation of a fair queueing algorithm (1989) 7. Dhar, S.: MANET: applications, issues, and challenges for the future. Int. J. Bus. Data Commun. Network. (IJBDCN) I, 66–92 (2005) 8. Fahad, A.M., Alani, S., Mahmood, S.N., Fahad, N.M.: NS2 based performance comparison study between DSR and AODV protocols. Int. J. Adv. Trends Comput. Sci. Eng. 8, 379–393 (2019) 9. Gautam, J., Divyalakshmi, R.R., Ishwarya, B.M., Aashika, K.S.: Efficient traffic scheduling and congestion control mechanism in wireless networks. Int. J. Sci. Res. Dev. 7 (2019) 10. Harchol-Balter, M.: Queueing disciplines. In: Wiley Encyclopedia of Operations Research and Management Science, April 2009 11. Chaskar, H.M., Madhow, U.: Fair scheduling with tunable latency: a round Robin approach. In: Proceedings of the IEEE Global Telecommunication Conference (GLOBECOM 99), pp. 1328–1333, December 1999 12. Hottmar, V., Adamec, B.: Analytical model of a weighted round robin service system. J. Electr. Comput. Eng. 2012 (2012) 13. Loo, J., Mauri, J.L., Ortiz, J.H.: Mobile Ad Hoc Networks: Current Status and Future Trends. CRC Press, Boca Raton (2016) 14. Lu, S., Bharghavan, V.: Fair scheduling in wireless packet networks. Proc. 
IEEE/ACM Trans. Network. 7, 473–489 (1999) 15. Luo, H., Medvedev, P., Cheng, J., Lu, S.: A self-coordinating approach to distributed fair queueing in ad hoc wireless networks (2001) 16. Mahdi, H.F., Alwan, M.H., Al-Bander, B., Sameen, A.Z.: A comparision of node detection algorithms over wireless sensor network. Int. J. Interact. Mob. Technol. 16(3) (2022) 17. Mohammed, A., et al.: Weighted round robin scheduling algorithms in mobile ad hoc network. In: 3rd International Congress on Human-Computer Interaction, Optimization and Robotic Application (2021) 18. Nagle, J.: SIGCOMM Comput. Commun. Rev. 14(4), 61 (1984) 19. Parekh, A.K., Gallager, R.G.: A generalized processor sharing approach to flow control in integrated services networks. Proc. IEEE/ACM Trans. Network. 2 (1994) 20. Rai, I.A.: QoS Support in Edge Routers. Ph.D. thesis, Paris Telcom, France (2004) 21. Rai, I.A., Okopa, M.: Modeling and evaluation of SWAP scheduling policy under varying job size distributions. In: Proceedings of the Tenth International Conference on Networks (2011)
22. Raja, M.L., Babooi, C.D.S.S.: An overview of MANET: applications, attacks and challenges. Int. J. Comput. Sci. Mob. Comput. (IJCSMC) 3, 408–417 (2014) 23. Sedrati, M.B.: Multipath routing to improve quality of service for video streaming over mobile ad hoc networks. Wirel. Person. Commun. 99, 999–1013 (2018). Springer, US 24. Taniar, D.: Mobile Computing: Concepts, Methodologies, Tools, and Applications: Concepts, Methodologies, Tools, and Applications, vol. I. IGI Global, Hershey (2008)
Blockchain Network Analysis: A Comparative Study of Decentralized Banks
Yufan Zhang1,2, Zichao Chen1,2, Yutong Sun1,2, Yulin Liu3(B), and Luyao Zhang1,2(B)
1 Duke Kunshan University, Suzhou 215316, Jiangsu, China [email protected]
2 Data Science Research Center and Social Science Division, Duke Kunshan University, Suzhou, China
3 SciEcon CIC, London WC2H 9JQ, UK [email protected]
Abstract. Decentralized finance (DeFi) is known for its unique mechanism design, which applies smart contracts to facilitate peer-to-peer transactions. The decentralized bank is a typical DeFi application. Ideally, a decentralized bank should be decentralized in its transactions. However, many recent studies have found that decentralized banks have not achieved a significant degree of decentralization. This research conducts a comparative study among mainstream decentralized banks. We apply core-periphery network feature analysis to the transaction data from four decentralized banks: Liquity, Aave, MakerDao, and Compound. We extract six features and compare the banks’ levels of decentralization cross-sectionally. According to the analysis results, we find that: 1) MakerDao and Compound are more decentralized in their transactions than Aave and Liquity. 2) Although decentralized banking transactions are supposed to be decentralized, the data show that the four banks’ primary external transaction core addresses are exchanges such as Huobi, Coinbase, and Binance. We also discuss four design features that might affect network decentralization. Our research contributes to the literature at the interface of decentralized finance, financial technology (Fintech), and social network analysis and inspires future protocol designs to live up to the promise of decentralized finance for a truly peer-to-peer transaction network. We release our code on GitHub and data on Harvard Dataverse as open source for future research.
Keywords: Blockchain · Social Network Analysis · Decentralized Finance · Ethereum · Decentralized Bank · Stablecoins
The corresponding author Luyao Zhang is supported by the National Science Foundation China on the project entitled “Trust Mechanism Design on Blockchain: An Interdisciplinary Approach of Game Theory, Reinforcement Learning, and Human-AI Interactions.” (Grant No. 12201266). Yutong Sun is supported by the Summer Research Scholar (SRS) program 2022 under Prof. Luyao Zhang’s project entitled “Trust Mechanism Design: Blockchain for Social Good” at Duke Kunshan University. Yufan Zhang and Zichao Chen are supported by the Social Science Divisional Chair’s Discretionary Fund for undergraduate research as the Teaching and Research Assistants of Prof. Luyao Zhang at Duke Kunshan University. Yufan Zhang, Zichao Chen, Yutong Sun, and Luyao Zhang are also with SciEcon CIC, a not-for-profit organization aiming at cultivating interdisciplinary research of both profound insights and practical impacts in the United Kingdom. Yulin Liu is also with Shiku Foundation and Bochsler Finance, Switzerland. We thank the anonymous referees at Computing Conference for their professional and thoughtful comments.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): SAI 2023, LNNS 711, pp. 1022–1042, 2023. https://doi.org/10.1007/978-3-031-37717-4_67
1
Introduction
Blockchain technology is notable for its security, transparency, and reliability worldwide [15,48]. DeFi (Decentralized Finance) is one important blockchain application, with a market value of over ten billion U.S. dollars [15]. According to Werner et al., DeFi is a peer-to-peer financial system [42]. DeFi has great potential to replace traditional centralized finance with the help of blockchain technology by using tamper-proof smart contracts to verify peer-to-peer transactions [27]. Among the various DeFi programs, a class of lending protocols plays a role similar to that of traditional banks. That is, decentralized banks provide lending and borrowing of on-chain assets, facilitated through protocols for loanable funds (PLFs) [42]. PLFs can create distributed ledger-based marketplaces for loanable funds of crypto assets by pooling deposited funds in smart contracts [42]. Many decentralized banking platforms have emerged in recent years, including Aave, Compound, MakerDao, and Liquity [15]. How do we measure the quality of a decentralized banking platform? According to the study “SoK: Blockchain Decentralization” [46], decentralized transactions on decentralized banking platforms are important not only because of their financial connotation but also because blockchains with centralized transactions can be easily manipulated by a few individuals [46], which also threatens blockchain security. Existing literature shows that network characteristics, which are indicators of trading network structures in decentralized markets, significantly affect market outcomes and performance, such as liquidity and volatility [13,36,41]. Furthermore, a more decentralized network can significantly predict higher returns and lower volatility for the associated DeFi tokens [3]. There is still a lack of sufficient research on blockchain decentralization, and decentralization measurement should include multiple dimensions.
The paper “SoK: Blockchain Decentralization” designed a taxonomy to analyze blockchain decentralization in five dimensions: consensus, network, governance, wealth, and transactions, but its authors found a lack of studies from the transactional perspective [46]. For decentralized banks, this gap was partly filled by a recent study in which Ao et al. (2022) [3] applied social network analysis to Aave’s user blockchain transaction data to capture the degree of decentralization, network dynamics, and economic performance of Aave. They found that the AAVE token transaction network has a distinct core-periphery structure, with multiple network features in a decentralized dynamic state. However, Aave does not represent all decentralized banks. Moreover, the relationship between the mechanism design of DeFi protocols and their degrees of decentralization has not
been well studied. Table 1 compares decentralized bank designs in terms of governance, airdrop, loan before the deposit, and stablecoins, where ‘Y’ is short for ‘Yes’ and ‘N’ is short for ‘No.’ For example, Aave, Compound, and MakerDao each have a decentralized autonomous organization (DAO). In contrast, Liquity has an ungoverned protocol that represents a more decentralized mechanism [9]. In addition, Liquity airdrops the native token LQTY to the lenders of the asset pool [9]. Before answering how these design mechanisms affect the degree of decentralization of decentralized banks, we need to first compare the degree of decentralization of the platforms and the patterns of their network characteristics. Therefore, our study applies social network analysis [3] to blockchain transaction network data from leading decentralized banks, including Liquity, Aave, MakerDao, and Compound, and aims to answer the following research questions (RQ).
– RQ1 on network decentralization and dynamics: How does network decentralization vary over time and across different decentralized bank protocols?
– RQ2 on the core-periphery network: What are the core components of the transaction network in each decentralized bank?
We complete the core-periphery analysis and characterization of the transaction network using the LIP algorithm [32]. Our research compares the network features and transaction decentralization of four major decentralized bank protocols. Because our comparison covers multiple platforms, the quantity of data in our study is tens of times larger than that in the previous study [3]. To handle this data volume, we introduce the LIP algorithm, a faster core-periphery analysis algorithm, in our study [32]. Our results (R) reveal that
– R1: MakerDao and Compound are more decentralized in transactions than Aave and Liquity.
– R2: The largest externally owned address cores for LQTY, LUSD, AAVE, Dai, and COMP are centralized exchanges such as Huobi, Coinbase, and Binance.
The rest of the paper is organized as follows. Section 1.1 addresses the related literature. Section 2 introduces the data and methods. Section 3 presents the results. Section 4 concludes and discusses future research.
Table 1. Mechanism design comparison between four decentralized banks
Platforms   Governance   Airdrop   Loan before the deposit   Stablecoins
Compound    Y            Y         N                         N
Aave        Y            N         N                         N
MakerDao    Y            N         Y                         Y, Dai
Liquity     N            Y         Y                         Y, LUSD
Fig. 1. Literature review flowchart
1.1
Literature
As shown in Fig. 1, our research contributes to four lines of literature: decentralized finance (DeFi), the mechanism design of decentralized banks, network analysis in finance, and blockchain network analysis.
Decentralized Finance. Our research contributes to the decentralized bank literature in the DeFi area. DeFi, decentralized finance, is one of the most discussed emerging technological evolutions in global finance [45]. DeFi refers to an alternative financial infrastructure built on top of the Ethereum blockchain [38]. It uses smart contracts on blockchain to replicate the existing financial services in a more open, interoperable, and transparent way [38]. DeFi brings a brand new trust mechanism to this digital era, which implies a move from trust in banks or states to trust in algorithms and encryption software [4]. Within decentralized finance, decentralized banks account for a large share [46]. According to Ao et al. (2022) [3], decentralized banks differ from centralized banks in two aspects: 1) they replace centralized credit assessments with coded collateral evaluation [26], and 2) they employ smart contracts to execute asset management automatically [5]. We apply network analysis to several decentralized bank transaction networks [3,6,11], including Aave, Liquity, Compound, and MakerDao. Our research contributes to the study of decentralized banks by comparing decentralized network features across several of them.
DAO, Airdrop, Stablecoins, Loan and Deposit Mechanism Design. Centralized exchanges, such as Bitfinex or Poloniex, are trust-based systems and have many limitations. In contrast, decentralized banks facilitate a much more convenient loan experience, according to the white paper of Compound, a famous decentralized bank established in 2016 [19]. Decentralized banks have brought
customers a brand new mechanism for lending, borrowing, and depositing. In terms of platform design, different decentralized banks have different mechanisms. These mechanisms are divided into four aspects: Governance, Airdrop, Loan and Deposit, and Stablecoins. In Table 1, we provide the different mechanisms used by the different platforms. In the governance mechanism, decentralized banks such as Compound, Aave, and MakerDao all have a decentralized autonomous organization (DAO). Token holders can use their tokens to participate in community governance [16]. However, Liquity designs a completely autonomous DeFi protocol without a DAO or centralized governance. This innovative mechanism may help it to be more decentralized and transparent [9]. Second, in the airdrop incentive mechanism, Compound and Liquity deliver reward tokens to users for liquidity mining [9,19]. The airdrop reward may help the decentralized bank obtain better liquidity and attract more users [43]. Third, in a traditional depositing scheme, people can borrow an asset only if someone has deposited it in a liquidity pool [28]. However, Liquity and MakerDao devised a new mechanism that allows customers to borrow an asset before any deposits [9]. This loan-before-the-deposit mechanism of MakerDao and Liquity is also tied to the fourth aspect, stablecoin design. Stablecoins are one type of decentralized finance application [27] intended to remedy cryptocurrencies’ excess volatility. Stablecoin development has gone through three main periods: fiat-backed, crypto-backed, and algorithmic stablecoins [8]. A major role of stablecoins is to provide security and stability to investors. Research has found that stablecoins can act as a safe haven against volatile cryptocurrencies such as Bitcoin [19]. MakerDao introduced the Dai stablecoin, and Liquity introduced the LUSD stablecoin [9]. Liquity designed a hard and soft peg mechanism for the LUSD stablecoin.
Our paper contributes to the literature on decentralized bank mechanism design, including Governance, Airdrop, Loan and Deposit, and Stablecoins, by exploring and analyzing the potential influence of these innovative mechanism designs on the transaction networks of four decentralized banks.
Network Analysis in Finance. Applying social network analysis to financial markets became popular after the financial crisis of 2008 to 2009 [3]. Many studies have found that it is important to explore and evaluate financial network structures to identify systemic risks. Banks that are too central, for example, may cause a chain reaction in which their failure destroys the wider financial system [7,44]. Cong et al. (2022) [20] introduced the network analysis method to the decentralized financial system and analyzed the Ethereum financial network. By applying network analysis methods to several decentralized banks and comparing the results, our paper further characterizes the transaction networks and possible influences of decentralized banks, contributing to the field of decentralized finance.
Network Analysis on Blockchain. As a newly emerged field that attracts great attention, blockchain is undergoing very rapid development [15]. A blockchain is not
only a distributed ledger but also a network of transactions [39]. Each account address in the blockchain can be thought of as a network node. Somin et al. (2018) [39] began network analysis research on ERC20 tokens trading on the Ethereum blockchain and demonstrated strong power-law (rich-get-richer) properties in the network. Jiang and Liu (2021) [29] then extended network analysis to the NFT area, where they analyzed the CryptoKitties transaction network. Cong et al. (2022) [20] provide a network analysis of decentralized finance on Ethereum. Ao et al. (2022) analyze the Aave decentralized bank transaction network in more detail using the core-periphery method and network feature analysis [3]. Our paper contributes to the field of blockchain network analysis by delving further into existing blockchain-based analysis and exploring the transaction networks of decentralized banks in a more detailed manner.
2
Data and Methods
Data and Code Availability. We release our code on GitHub (footnote 1) and our data on Harvard Dataverse [51] as open source for future research. Figure 2 depicts the data science pipeline of our study. Our study has higher automation and computational efficiency than earlier studies [3], enabling cross-sectional comparisons.
Fig. 2. Blockchain network analysis methodology flowchart
1 https://github.com/SciEcon/Blockchain-Network-Analysis.
2.1
Data Source and Preprocessing
The data for blockchain network analysis are derived from the transaction records of 5 tokens from 4 decentralized finance protocols: LUSD and LQTY of Liquity [9], AAVE of Aave [1], Dai of MakerDao [34], and COMP of Compound [19]. These transaction records were obtained from the BigQuery dataset of the Ethereum blockchain via the BigQuery integration with the Kaggle kernel [17]. The genesis dates of the tokens are summarized in Table 2. Specifically, the data cover the transaction records from the genesis date of each token to July 12, 2022. Transactions whose from address or to address is the Ethereum null address, which is often associated with token-related events such as genesis, mints, or burns [23], were filtered out. We also summarize the total transaction value involved and the number of addresses in Table 2. Figures 10, 11, 12, 13 and 9 in the Appendix visualize the change in daily transaction value and the number of addresses over time. In addition, we built undirected daily transaction networks in which the nodes represent Ethereum addresses and the edges represent the entire daily transaction volume between two addresses, weighted by the transaction values. In other words, we aggregate the transaction values between two addresses without regard to direction so that the transactions between two addresses are not counted repeatedly in the core-periphery structure analysis.
Table 2. Queried data of the DeFi tokens for blockchain network analysis.
Token   DeFi Protocol   Genesis Date   Duration (Day)   Total Transaction Value (Wei)   Number of Addresses
LUSD    Liquity         2021-04-05     464              2.823 × 10^28                   12,249
LQTY    Liquity         2021-04-05     464              4.147 × 10^26                   19,009
AAVE    Aave            2020-10-02     649              3.511 × 10^26                   371,122
Dai     MakerDao        2019-11-13     968              1.295 × 10^30                   1,769,138
COMP    Compound        2020-03-04     793              1.4823 × 10^26                  567,133
2.2
Network Feature Extraction
We extracted 4 network features for all the daily transaction networks built as described in Sect. 2.1. In detail, the network features include the number of components, the relative size of the largest component, the modularity score [37], and the standard deviation in the degree centrality. These network features are computed using the Python NetworkX [22] algorithms. According to the conceptual framework, the network features can characterize the difference between more centralized and more decentralized networks and therefore quantify the decentralization of a specific transaction network. For instance, in a ‘more centralized’ network, in which more vertices are connected to several central vertices, the relative size of the largest component will become larger compared to the decentralized ones.
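The network construction of Sect. 2.1 and three of the four features above can be sketched as follows. This is a minimal illustration rather than the authors' code: it uses only the Python standard library instead of NetworkX, the function names and toy transfers are hypothetical, skipping self-transfers and using the population standard deviation are assumptions, and the modularity score is omitted because it depends on a community partition that this section does not specify.

```python
from collections import defaultdict
from statistics import pstdev

NULL_ADDRESS = "0x0000000000000000000000000000000000000000"


def build_daily_network(transfers):
    """Aggregate (sender, receiver, value) records into undirected weighted edges."""
    edges = defaultdict(int)
    for sender, receiver, value in transfers:
        # Drop mint/burn-style records that touch the null address (Sect. 2.1).
        if NULL_ADDRESS in (sender, receiver):
            continue
        # Assumption: self-transfers are skipped; the paper does not discuss them.
        if sender == receiver:
            continue
        # Sorting the pair ignores direction, so A->B and B->A share one edge.
        edges[tuple(sorted((sender, receiver)))] += value
    return dict(edges)


def connected_components(adj):
    """Return the connected components of an adjacency dict as sets of nodes."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components


def network_features(edges):
    """Compute three of the Sect. 2.2 features for one daily network."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    adj = dict(adj)
    n = len(adj)
    if n == 0:
        return None  # no transactions on this day
    components = connected_components(adj)
    # Degree centrality as degree / (n - 1), the normalisation NetworkX uses.
    centrality = [len(neighbours) / (n - 1) for neighbours in adj.values()]
    return {
        "num_components": len(components),
        "largest_component_ratio": max(len(c) for c in components) / n,
        # Assumption: population standard deviation.
        "degree_centrality_std": pstdev(centrality),
    }


# Toy day of transfers (hypothetical addresses and values).
transfers = [
    ("0xa", "0xb", 5), ("0xb", "0xa", 3), ("0xb", "0xc", 2),
    ("0xd", "0xe", 1), (NULL_ADDRESS, "0xa", 100),
]
edges = build_daily_network(transfers)
features = network_features(edges)
print(edges)
print(features)
```

In the toy example the two opposite transfers between 0xa and 0xb collapse into a single edge of weight 8, and the mint-like record from the null address is dropped, which mirrors the aggregation and filtering rules described in Sect. 2.1.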
2.3
Core-Periphery Structure Detection
There are two more features for blockchain network analysis that require detecting the core-periphery structure of the daily transaction network. The core-periphery structure refers to a fundamental network pattern that categorizes network nodes [25]. Specifically, the network nodes are classified into two categories: “core” nodes, which are densely connected, and “periphery” nodes, which are weakly connected. Transaction networks with a more significant core-periphery structure are more decentralized than those with a less significant structure [3]. There are several algorithms for detecting the core-periphery structure of a given network [12,14,21,31,32]. By evaluating the statistical significance of the core-periphery structure in a transaction network, we can construct two additional features, the number of detected core nodes and the average degree of the core nodes, for the blockchain network analysis. The core-periphery structure detection and statistical analysis are conducted using the LIP algorithm [32] in the Python cpnet library [30]. Together with the four extracted features, the two new network features are formally defined in Table 3.
Table 3. Definition of the extracted network features. ↑ means that a higher value indicates a more decentralized network, and ↓ means that a higher value indicates a less decentralized network.
Feature: Definition
The number of components ↑: The number of connected subnetworks that are not part of any larger connected network given a transaction network
The relative size of the largest component ↓: The number of nodes in the largest component divided by the total number of nodes
Modularity score ↑: The fraction of the edges that fall within the given groups minus the expected fraction if edges were randomly distributed
The standard deviation of the degree centrality ↓: The standard deviation of the degree of each node in a given transaction network
The number of detected core nodes ↑: The number of nodes that are detected as “core” by the LIP algorithm
The average degree of the core nodes ↓: The average degree of the nodes that are detected as “core” by the LIP algorithm
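The last two rows of Table 3 can be illustrated with a short sketch. It assumes that a core-periphery detector has already labelled each node as core or periphery (the paper uses the LIP algorithm in cpnet; the detection step itself is not reproduced here), and the function name, the label dictionary, and the toy network are hypothetical.

```python
def core_features(adjacency, is_core):
    """Derive the two core-periphery features of Table 3.

    adjacency: dict mapping each node to the set of its neighbours.
    is_core:   dict mapping each node to a truthy value if it was
               detected as a core node, and a falsy value otherwise.
    """
    core_nodes = [node for node, flag in is_core.items() if flag]
    if not core_nodes:
        # No core detected on this day.
        return {"num_core_nodes": 0, "avg_core_degree": 0.0}
    total_degree = sum(len(adjacency[node]) for node in core_nodes)
    return {
        "num_core_nodes": len(core_nodes),
        "avg_core_degree": total_degree / len(core_nodes),
    }


# Toy network: hub "h" linked to "a", "b", "c", plus one edge between "a" and "b".
adjacency = {
    "h": {"a", "b", "c"},
    "a": {"h", "b"},
    "b": {"h", "a"},
    "c": {"h"},
}
# Hypothetical detector output: "h" and "a" are core, the rest periphery.
is_core = {"h": 1, "a": 1, "b": 0, "c": 0}
result = core_features(adjacency, is_core)
print(result)
```

Here the hub has degree 3 and the second core node has degree 2, so the average degree of the core nodes is 2.5 with two core nodes detected.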
3
Results
3.1
Network Feature Dynamics and Correlation
Given the calculated network features for each transaction network of each token, we explore how the decentralization of the transaction network for the five DeFi currencies evolves over time using the six network features introduced in Sect. 2.
The time series plots of the network features for each token are illustrated in Figs. 3, 4, 5, 6 and 7. By examining the relationship between these dynamic features and the degree of decentralization, we first validated the conceptual framework introduced in Table 3. A cross-sectional comparison of the dynamic features of the four platforms reveals that COMP from Compound and Dai from MakerDao have a substantially higher degree of decentralization than AAVE from Aave and LQTY/LUSD from Liquity.
In addition to calculating the network dynamics, we measured the correlations between network features to better highlight their relationship with network decentralization. Figures 14, 15, 16, 17 and 18 in the Appendix depict the feature correlations for the LUSD, LQTY, AAVE, Dai, and COMP tokens, respectively. Through the correlation heatmaps, we can determine the degree of correlation between each platform’s network features. The darker the square, the stronger the correlation. The greater the degree of correlation between network features moving in the same direction, the more effective and rational the network analysis for that bank’s network. AAVE and Dai have a larger correlation degree in the heatmap comparison, whereas LQTY and LUSD of the Liquity platform have a comparatively low correlation degree. This may suggest that Liquity’s existing trading network is immature and that its features are not yet distinct. From the cross-sectional comparison of the five tokens’ network features, we highlight the following significant results.
Number of Components. In this comparison, we discovered that both of the earliest established platforms, MakerDao and Compound, experienced a peak in the number of components around January 2021, followed by a continuous decline. Next, the plots demonstrate that there is a significant difference in the number of components between AAVE and LQTY and that the number for AAVE is much greater than that for LQTY. This indicates that Liquity’s trading network may be simpler than those of competing platforms.
Modularity Score. The smaller the modularity score, the more centralized the market. In the cross-sectional comparison, both LQTY and LUSD had lower values than the tokens of the two earlier platforms, Compound and MakerDao. This may suggest that Liquity’s trading network is less decentralized than those of the two earlier platforms.
Standard Deviation of Degree Centrality. Based on these data, we find that LQTY and LUSD have a greater standard deviation of degree centrality than the tokens of other platforms. Both Compound and MakerDao, two older decentralized banks, have rather low values in this category. The values of LQTY and LUSD also exhibit an upward trend. The lower the value, the less centralized the network. This conclusion is consistent with the modularity finding, and it matches the results of the other network dynamic features as well. It suggests that the Liquity platform is currently more centralized than the other platforms.
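The feature correlations discussed above (Figs. 14 to 18) can be sketched as a pairwise correlation matrix over the daily feature series. The text does not state which correlation coefficient was used, so the Pearson coefficient below is an assumption, and the feature names and values are toy data, not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined for a constant series
    return cov / sqrt(var_x * var_y)


def correlation_matrix(series):
    """Pairwise correlations between named feature time series."""
    names = list(series)
    return {a: {b: pearson(series[a], series[b]) for b in names} for a in names}


# Toy daily series for one token (hypothetical values).
daily_features = {
    "num_components": [1, 2, 3, 4],
    "modularity": [2, 4, 6, 8],
    "degree_centrality_std": [4, 3, 2, 1],
}
corr = correlation_matrix(daily_features)
for name, row in corr.items():
    print(name, row)
```

Under the convention of Table 3, two "↑" features that rise together (here the number of components and the modularity score) correlate positively, while a "↑" and a "↓" feature are expected to correlate negatively when the network is consistently becoming more or less decentralized.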
Fig. 3. Time-series plots of network features of the AAVE token.
Fig. 4. Time-series plots of network features of the Dai token.
Fig. 5. Time-series plots of network features of the COMP token.
Fig. 6. Time-series plots of network features of the LUSD token.
Fig. 7. Time-series plots of network features of the LQTY token.
3.2
Core-Periphery Structure Comparison Between Contract Addresses and Externally Owned Addresses
In addition to the network feature analysis, we conducted further exploration of the detected core-periphery structure for the transaction networks of each token. As introduced in Sect. 2, we conducted the core-periphery structure analysis on the daily transaction networks for the DeFi tokens, LUSD, LQTY, AAVE, Dai, and COMP. Using the Python Web3 package [35], we subsequently queried the real types of addresses detected as core nodes. To further investigate the decentralization of a DeFi token, we further contrasted the core-periphery structure study results in terms of the address types. On the Ethereum blockchain, there are two types of addresses: contract addresses (CA) and externally owned addresses (EOA). The former represents the executable smart contract on the blockchain, whereas the latter consists of user accounts. We extracted the unique addresses that were detected as core nodes for at least one day within the time span of the data source. Figure 8 demonstrates the distribution of the days for them being core nodes in terms of the address type. Moreover, we investigated the detailed address information via Etherscan.io [23] of the outlier addresses of the distribution, which are also annotated in Fig. 8.
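The address typing and the counting of core days can be sketched as follows. On Ethereum, an externally owned address has no deployed bytecode while a contract address does, so the type can be inferred from the code stored at the address. The sketch takes the lookup as a parameter so that it runs without a node connection; with web3.py the lookup would typically be `w3.eth.get_code`, but how the authors performed the query is not stated, and the function names, addresses, and bytecode below are hypothetical.

```python
from collections import Counter


def address_type(code):
    """Return 'CA' if bytecode is deployed at the address, else 'EOA'."""
    return "CA" if len(code) > 0 else "EOA"


def core_days_by_type(daily_core_nodes, get_code):
    """Count on how many days each address was a core node and attach its type.

    daily_core_nodes: dict mapping a day label to the core addresses of that day.
    get_code:         callable returning the bytecode (bytes) stored at an address.
    """
    days = Counter()
    for core_nodes in daily_core_nodes.values():
        # set() guards against an address listed twice on the same day.
        days.update(set(core_nodes))
    return {
        address: (address_type(get_code(address)), count)
        for address, count in days.items()
    }


# Toy data: hypothetical addresses and bytecode standing in for a node query.
fake_chain = {"0xA": b"\x60\x80", "0xB": b""}
daily_core_nodes = {
    "2021-01-01": ["0xA", "0xB"],
    "2021-01-02": ["0xA"],
}
summary = core_days_by_type(daily_core_nodes, fake_chain.__getitem__)
print(summary)
```

Grouping the resulting counts by type gives the two distributions of core days (CA versus EOA) that the comparison in this section is based on.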
Fig. 8. Distribution of the number of core days for EOAs and CAs of the five tokens with the annotated address information of the outliers.
We observe that the CA outliers with the most core days are the token contracts created by the DeFi protocol developers. For instance, the CA with the highest number of core days in the AAVE transaction network is Aave: Staked Aave (642 days), which is the token contract of the AAVE token. Other CA outliers are decentralized cryptocurrency exchanges built on Ethereum using smart contracts. For instance, Uniswap [40], an automated liquidity protocol powered by smart contracts that enables peer-to-peer market making, appears as an outlier in all five token transaction networks. Another decentralized cryptocurrency exchange that appears as a CA outlier for all five tokens is AirSwap [2], which also enables peer-to-peer trading of Ethereum tokens. The outliers among EOAs are mostly centralized cryptocurrency exchanges. The most obvious examples are Coinbase [18] and Binance [10], both of which are famous exchanges where users can trade cryptocurrencies. Given the vast variety of trading tools and supporting services for users to earn interest [10], centralized cryptocurrency exchanges with a high volume of transactions have gained immense appeal, and a large number of transactions occur on these EOAs. However, the centralized exchanges bring high centralization to the decentralized bank transaction networks. Table 4 summarizes the dates on which the four DeFi protocols were first listed on these exchanges.
Table 4. The first date that the DeFi protocols were listed on the exchanges, Coinbase and Binance
Protocol   Coinbase            Binance
Liquity    January 12, 2022    Not yet
Aave       December 14, 2020   October 16, 2020
Compound   June 23, 2020       June 25, 2020
DAO        May 23, 2019        July 23, 2020
4
Conclusion and Future Research
According to Cong et al. (2022) [20], the degree of decentralization and the stability of the trading network are both important factors in building trust for decentralized banks and increasing the inclusiveness of the decentralized bank platform. We therefore conducted a comparative study of four major decentralized banks, Liquity, Aave, MakerDao, and Compound, evaluating transaction decentralization using social network analysis. We made two major findings. First, the largest externally owned address cores for LQTY, LUSD, AAVE, Dai, and COMP mainly include exchanges such as Huobi, Coinbase, and Binance. Second, MakerDao and Compound are more decentralized in trading than Aave and Liquity. Future research can further study the connection between protocol designs and the decentralization level. For example, the higher level of centralization of LQTY and LUSD may be due to three reasons. First, as the Liquity platform was established only recently, there may be fewer users on the platform. Second, as LQTY has not yet been listed on some exchanges, such as Binance, the token may be less well known. This may lead to distrust of the Liquity platform by other decentralized bank participants, resulting in fewer addresses participating in the trading network. Third, Liquity is designed as a non-governance system, which may leave the platform without loyal users actively interacting in the network. How would the internal design features of governance, airdrop, loan before the deposit, and stablecoins and external events such as blockchain mechanism upgrades [33,49] and sentiments on the blockchain ecosystem [24,47] affect network decentralization and other desired properties [50]? Our study provides a direction for future exploration of transaction network analysis and mechanism design for decentralized banks.
A
Appendix
Fig. 9. Time series plots of the daily transaction value (Wei) and the number of addresses of the COMP token.
Fig. 10. Time series plots of the daily transaction value (Wei) and the number of addresses of the LUSD token.
Fig. 11. Time series plots of the daily transaction value (Wei) and the number of addresses of the LQTY token.
Fig. 12. Time series plots of the daily transaction value (Wei) and the number of addresses of the AAVE token.
Fig. 13. Time series plots of the daily transaction value (Wei) and the number of addresses of the Dai token.
Fig. 14. Correlation heatmap of network features of the LUSD token.
Fig. 15. Correlation heatmap of network features of the LQTY token.
Fig. 16. Correlation heatmap of network features of the AAVE token.
Fig. 17. Correlation heatmap of network features of the Dai token.
Fig. 18. Correlation heatmap of network features of the COMP token.
Blockchain Network Analysis: A Comparative Study of Decentralized Banks
Evaluating Self-supervised Transfer Performance in Grape Detection

Michael Woodson and Jane Zhang
California Polytechnic State University, San Luis Obispo, CA 93407, USA
{mwoodson,jzhang}@calpoly.edu
Abstract. Advances in computer vision have produced promising research and applications in precision agriculture. In particular, Deep Learning has rapidly improved the state of the art in object detection and segmentation, both of which are vital to crop monitoring and yield forecasting. In most object detection problems, transfer learning is the established paradigm for applying Deep Learning systems to downstream tasks. This is particularly important in agricultural vision applications, where available data is scarce relative to the large amounts that deep learning systems require. Recent advances in Self-Supervised Learning have generated pretraining methods that approach the transfer performance of supervised pretraining on a variety of downstream tasks. To demonstrate the impact of Self-Supervised Learning in agriculture, this paper evaluates the transfer performance of one self-supervised method, BYOL, on grape cluster detection. By comparing BYOL with supervised pretraining on a Faster R-CNN architecture, this work demonstrates that Self-Supervised Learning is competitive with supervised pretraining methods in agricultural applications, showing its promise in advancing precision agriculture toward more accurate and robust solutions.
Keywords: Object Detection · Self-Supervised Learning · Agriculture

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): SAI 2023, LNNS 711, pp. 1043–1057, 2023. https://doi.org/10.1007/978-3-031-37717-4_68

1 Introduction

People across the world rely on the availability of fresh foods to feed themselves and their families. Whether it be farming processes that ensure the health and abundance of natural fruits and vegetables, or practices that provide dairy and meats, agriculture has remained an integral component of our lives. The future of agriculture faces many challenges in addressing the needs of a rapidly growing world population: climate change, increasing demand for energy, resource shortages, and labor shortages [26]. These challenges have given rise to Precision Agriculture, which utilizes technology to improve the efficiency of agricultural processes. Yield forecasting is one important process in agriculture. Yield forecasting is a technique where farmers obtain early crop counts to estimate the expected
yield at harvest. This helps farmers prepare for harvest, providing data to inform decisions in packaging, marketing, and labor requirements [37]. However, yield forecasting remains manual: workers sample a subset of the vineyard, obtain fruit counts, and extrapolate these numbers to the entire vineyard, taking historical yields, weather, and field measurements into account [37]. This manual process introduces a variety of biases and errors into the eventual yield forecast.

To improve the accuracy and speed of forecasting, research has focused on applying computer vision to crop detection. With a system that can automatically detect and count crops, one could deploy an image acquisition system throughout a field to obtain accurate counts early in the harvest cycle. Before deep learning gained popularity, classical vision approaches were applied to various crop counting tasks. However, classical techniques were challenged by the natural environmental conditions of agricultural applications, such as dynamic lighting, weather patterns, and heavy occlusion. Deep learning, combined with the practice of transfer learning, has helped address such challenges and advance the state of the art in agricultural tasks such as crop detection and phenotyping.

While transfer learning is effective in agriculture, many systems are constrained by the current paradigm of supervised pretraining. Supervised pretraining requires very large image databases, such as ImageNet [11], that are specially curated and extensively annotated. Labeling such a dataset comes at great cost, as these datasets contain millions of images. This high cost restricts the domain of learnable features, since labeled data comprises a small subset of all possible images that networks could learn from. Furthermore, obtaining labeled data becomes much more difficult across different imaging modalities.
When considering the potential for agricultural applications to leverage multi-modal data, it is important to explore training methods that obviate the need for ground-truth labels, broadening the scope of learnable features that transfer well to downstream tasks such as crop detection. Such considerations have given rise to a class of pretraining called Self-Supervised Learning, where features can be learned from data without class labels.

Self-Supervised Learning (SSL) has shown success in Natural Language Processing (NLP), where networks can learn word embeddings by completing tasks using unlabeled data, such as the masked word prediction task used in BERT [12]. These word embeddings carry great semantic value in the feature space, transferring well to other language tasks. The success of SSL in NLP has motivated vision-based training approaches that can learn from image data itself, rather than relying on annotations that describe the images. Recent self-supervised techniques have approached, or even surpassed, the success of supervised pretraining for image classification tasks, while displaying an ability to transfer accurately to a variety of downstream tasks such as dense prediction, few-shot recognition, and object detection [16]. While SSL has been shown to perform well on various imaging benchmarks, the transfer performance of self-supervised methods has not, to the best of our knowledge, been analyzed in an agricultural field setting.
Self-supervised Grape Detection
Given the early success of Self-Supervised Learning on various benchmark datasets, it is important to analyze its impact on real-world applications such as crop detection. This work's contributions are as follows:

– We evaluated the transfer performance of BYOL (Bootstrap Your Own Latent) [19], a state-of-the-art SSL pretraining method, on grape cluster detection.
– We compared BYOL transfer performance to supervised pretraining methods, particularly pretraining on ImageNet image classification and COCO [29] object detection.

By demonstrating whether SSL can compete with supervised pretraining in an important agricultural task, we hope to illustrate its effectiveness as a powerful feature generator, while illuminating its potential to learn features that are unconstrained by annotated datasets.

The paper is outlined as follows: Sect. 2 reviews the literature on fruit detection and yield forecasting. Sect. 3 provides background on Self-Supervised Learning. Sect. 4 outlines the experimental methods used to compare BYOL with supervised pretraining, and Sect. 5 presents the results of the experiment. Finally, Sect. 6 offers concluding remarks.
2 Related Work
A variety of research exists in the area of crop detection and phenotyping. One category of research focuses on hand-crafted features to characterize and classify crops. Color-based segmentation can be used to identify fruit pixels [31,38,40], though such approaches struggle to generalize across environments, lighting conditions, and ripening stages. Texture and SIFT features can help improve crop detection when crop color blends in with the surrounding environment [2,30,37,39], but such approaches often rely on artificial illumination, which restricts widespread use.

More recent approaches utilize deep learning for both crop detection and segmentation tasks. Fruit segmentation maps can be generated by applying networks in a sliding-window fashion [7,33]. Segmentation maps cannot count fruit by themselves, so some work introduced a second stage that learns to count fruit for each segmentation mask [8,20]. As an alternative to segmentation, counting applications can employ object detection frameworks to identify separate crop instances. Research has used both single-shot detectors [5,47] and region-proposal networks for fruit detection [3,18,45], with the Faster R-CNN [41] seeing greater success at localizing crop instances. For some crops, localizing crop instances for yield forecasting is ambiguous; one could approach grape detection, for example, by identifying grape clusters or individual grape berries, which presents a trade-off between yield accuracy and annotation costs. At the cost of greater annotation demands, [48] counts individual grape berries by treating detection as a semantic segmentation task, assigning pixels to background, berry, and berry-edge classes, then using connected components to count the individual berries. Alternatively, [10] adapted crowd-counting networks to grape berry counting, using dot annotations to train a network.
Although deep learning has improved the accuracy and robustness of fruit detection systems, it still struggles to cope with occlusion caused by overlapping crops and dense canopies. RGB-D cameras, LIDAR, and photogrammetry techniques can alleviate occlusion effects [13,22,27,42,46], but deep learning systems tend to operate on RGB image data for object detection, as networks are often pretrained on ImageNet-like data before transfer learning. Likewise, while agricultural vision often features multispectral data such as IR [17,23,43] to counteract difficult conditions such as direct sunlight exposure, supervised transfer learning is limited in providing pre-learned representations across various input modalities.

Meanwhile, Self-Supervised Learning has shown success in discovering semantically strong feature embeddings without labels. Networks pretrained with SSL can outperform ImageNet supervised transfer performance on a variety of datasets, including Cars [25], Flowers [35], and Food101 [4]; SSL performs especially well in a semi-supervised setting [6,9,19,49]. This is important for agriculture, where multispectral annotations are sparse. Furthermore, SSL can be applied to 3-D data, learning representations that are invariant to 3-D input formats [50] or finding representations that associate 3-D objects with 2-D renderings [1]. Such models can transfer to tasks such as 3-D object detection or 3-D segmentation, potentially allowing agricultural systems to acquire 3-D scene understanding.
3 Background

3.1 Self-supervised Learning
Self-Supervised Learning (SSL) is a subset of unsupervised learning that aims to identify useful structures in data without requiring ground-truth annotations to guide the learning process. Self-supervised methods learn by forming annotations from the data itself, then optimizing through a learning task that utilizes these annotations. SSL approaches can be further divided into pretext methods and contrastive methods, which are explained in the sections below. SSL has recently gained momentum in vision, where SSL can effectively pretrain networks on large datasets without labels, uncovering useful feature representations in the process. This approach allows models to develop a strong semantic understanding of a wide variety of data subjects and modalities, even extending well to long-tailed distributions [50].

3.2 Pretext Tasks
Pretext tasks attempt to learn from data by solving tasks associated with the input image. A parallel can be found in NLP: BERT [12], for example, would accept input sentences with random words masked out, and the pretext task involved training a classifier to predict the missing words. By solving this pretext task, the encoder network generated features that effectively modeled language semantics.
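As a minimal sketch of how a pretext task forms its own annotations, the snippet below masks random words in a sentence and keeps the hidden words as labels. The function name `make_masked_example`, the `[MASK]` placeholder string, and the masking rate are illustrative assumptions and do not reproduce BERT's actual tokenization or training recipe.

```python
import random

MASK = "[MASK]"  # placeholder string, chosen here for illustration only


def make_masked_example(words, mask_rate=0.15, rng=None):
    """Return (masked_words, labels) built from the input alone.

    labels maps each masked position to the word that was hidden there,
    so the supervision signal comes from the data itself.
    """
    rng = rng or random.Random()
    n_mask = max(1, round(mask_rate * len(words)))
    positions = sorted(rng.sample(range(len(words)), n_mask))
    masked = list(words)
    labels = {}
    for pos in positions:
        labels[pos] = masked[pos]  # remember the hidden word as the target
        masked[pos] = MASK         # hide it from the model input
    return masked, labels


sentence = "grape clusters ripen late in the growing season".split()
masked, labels = make_masked_example(sentence, rng=random.Random(0))
print(masked)
print(labels)
```

A classifier trained to recover `labels` from `masked` never sees a human annotation, which is the property that pretext tasks in vision try to reproduce with image patches.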
Self-supervised pretext tasks in vision often involve modifying an input image such that annotations can be formed automatically. For example, [14] extracted a random pair of patches from an image, fed the patches into a siamese network, then predicted the patch configuration using a softmax classifier. By predicting the relative position of one patch to another (from a set of known patch configurations), [14] was able to learn features that transferred well to object detection. Various other pretext tasks were designed to improve transfer performance on common image benchmarks. For example, [36] attempted to solve jigsaw puzzles by splitting input images into patches and permuting the generated patches. This work expanded upon [14] by feeding each patch individually into a siamese-ennead network, generating 9 feature vectors that were combined and classified as one of 64 possible permutations. The work by [15] expanded upon the previous works by combining multiple pretext tasks when training a ResNet-101-v2 [21]. This was achieved by using separate network heads, one for each task, and updating the weights of a shared network trunk. By combining gradients from a diverse set of pretext tasks such as image colorization, patch relative-position prediction, exemplar learning, and motion segmentation, [15] found that transfer performance improved compared to using a single pretext task, demonstrating that features from separate tasks are complementary.

However, pretext methods suffer from several problems. First, transfer performance largely depends on choosing a pretext task that bears some association to the downstream application. Furthermore, combining pretext tasks can complicate the network architecture and training procedure [15]. Lastly, pretext learning generates features that degrade in quality in the final layers of the network. This occurs because network layers are optimized to solve a particular pretext task rather than to discover an optimal feature representation.
Therefore, contrastive methods have recently gained the most attention.

3.3 Contrastive Learning
Rather than solving pretext tasks, contrastive methods attempt to discover feature embeddings that are invariant to image transformations. For example, if an image were input into a network, the final-layer features should be highly similar to the features produced by an augmentation of that same image. This idea forms the basis of contrastive methods, which aim to learn representations that are invariant to low-level transformations, thus capturing image semantics. Contrastive methods operate as follows: given an image $I$ and an augmented version of that image $I^t$, a network is optimized to produce a pair of feature vectors $v_I = f(I, \theta)$ and $v_{I^t} = f(I^t, \theta)$ such that $v_I \approx v_{I^t}$. Network parameters are optimized by minimizing the cost function (1) [34], which has the effect of maximizing agreement between the two feature embeddings:

$$\hat{\theta} = \arg\min_{\theta} \; -\log\left[\langle v_I, v_{I^t} \rangle\right], \tag{1}$$

where $\langle \cdot, \cdot \rangle$ is often the cosine similarity. This process, however, leaves a network prone to trivial solutions, where all joint embeddings are identical for every input pair. Therefore, contrastive
methods introduce negative samples into the learning process, jointly maximizing the similarity between positive examples and minimizing the similarity between negative samples. Since images are unlabeled, a negative sample is any image $I'$ that is different from the original image $I$. Therefore, contrastive learning, in general, processes an image $I$, a transformed version of that image $I^t$ (which acts as a positive sample), and a set of negative samples $I' \in D_N = \{I_1, \ldots, I_N\}$. By introducing a set of negative samples into the training objective, many contrastive methods formulate a variant of the following loss [9]:

$$L_\theta = -\log \frac{\exp\left(\langle v_I, v_{I^t} \rangle / \tau\right)}{\exp\left(\langle v_I, v_{I^t} \rangle / \tau\right) + \sum_{I' \in D_N} \exp\left(\langle v_{I'}, v_{I^t} \rangle / \tau\right)}, \tag{2}$$
where $D_N$ is a set of randomly selected negative samples from the training set, $\langle v_I, v_{I^t} \rangle$ is the cosine similarity $v_I \cdot v_{I^t} / (\lVert v_I \rVert \lVert v_{I^t} \rVert)$, and $\tau$ is a temperature value that can be set as a hyperparameter. Using the above framework, contrastive learning methods approach ImageNet supervised pretraining on a variety of downstream tasks [34]. Furthermore, unlike pretext features, contrastive features do not degrade in the deepest layers of a network. A downside to contrastive learning, however, is that effective learning requires a large quantity of negative samples. Because many negative samples make poor training examples, methods must increase the batch size to provide a sufficient pool of useful negatives. For example, [9] used up to 16384 negative samples per positive image pair.

While contrastive methods have shown great promise in generating useful features in a transfer setting, their expensive training requirements have motivated recent methods to avoid negative samples. BYOL [19], the SSL method evaluated in this work, established a training procedure that requires only positive image pairs. This was achieved by employing an online network and a target network. Like other contrastive methods, each network received an augmented version of the same image. However, while the online and target networks were identical in structure, they had different sets of weights. Specifically, the target weights $\xi$ were an exponential moving average of the online network weights $\theta$. Therefore, while the online network received weight updates through its gradient, the target network avoided gradient updates via a stop-gradient and instead updated its weights as follows [19]:

$$\xi \leftarrow \tau \xi + (1 - \tau)\theta, \tag{3}$$
where $\tau \in [0, 1]$ is a target decay weight. Since the networks generate different feature vectors as a result of their differing weights, the online network also contains an additional prediction module $q_\theta$, which attempts to map the online feature to the target network feature. The BYOL architecture uses a normalized Euclidean distance between the output vectors as the loss metric, represented as

$$L_{\theta,\xi}\left(q_\theta(\bar{z}_\theta), \bar{z}_\xi\right) = \lVert q_\theta(\bar{z}_\theta) - \bar{z}_\xi \rVert_2^2 = 2 - 2\, q_\theta(\bar{z}_\theta) \cdot \bar{z}_\xi, \tag{4}$$
where $z_\theta$ and $z_\xi$ are the joint embeddings produced by the network pair, and $\bar{z}_\theta = z_\theta / \lVert z_\theta \rVert_2$ and $\bar{z}_\xi = z_\xi / \lVert z_\xi \rVert_2$ are normalized to unit length [19].
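The objectives in this section can be sketched in a few lines of plain Python. The snippet is an illustration under stated assumptions, not the implementation used in the paper: vectors are ordinary lists rather than network outputs, the function names are hypothetical, negatives are compared against the augmented view's embedding, and the BYOL loss normalizes both the prediction and the target so that the squared distance equals 2 minus twice the cosine similarity. Because the text uses τ for both the temperature and the decay weight, the two are named `temperature` and `decay` here.

```python
import math


def cosine(u, v):
    """Cosine similarity u.v / (||u|| ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(v_i, v_it, negatives, temperature=0.1):
    """Loss with negatives: -log of the softmax weight of the positive pair."""
    pos = math.exp(cosine(v_i, v_it) / temperature)
    neg = sum(math.exp(cosine(v_n, v_it) / temperature) for v_n in negatives)
    return -math.log(pos / (pos + neg))


def ema_update(target, online, decay=0.99):
    """Target update: decay * target + (1 - decay) * online, per weight."""
    return [decay * t + (1.0 - decay) * o for t, o in zip(target, online)]


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def byol_loss(prediction, target_embedding):
    """Squared distance between unit vectors, equal to 2 - 2 * cosine."""
    p, z = normalize(prediction), normalize(target_embedding)
    return sum((a - b) ** 2 for a, b in zip(p, z))


v_i, v_it = [1.0, 0.0], [0.9, 0.1]
negatives = [[0.0, 1.0], [-1.0, 0.2]]
print(contrastive_loss(v_i, v_it, negatives))
print(ema_update([0.0, 0.0], [1.0, 1.0], decay=0.9))
print(byol_loss([2.0, 0.0], [1.0, 0.0]))
```

Note that `byol_loss` needs no negatives at all, which is the property that removes the large-batch requirement discussed above.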
4 Methods and Materials

4.1 Data
The Embrapa Wine Grape Instance Segmentation Dataset (WGISD) [44], provided by Santos et al. [45], was used for both training and evaluation. This dataset contains 300 images of grapes captured at the Guaspari Winery in Espírito Santo do Pinhal, São Paulo, Brazil. The dataset consists of five different grape varieties: Chardonnay, Cabernet Franc, Cabernet Sauvignon, Sauvignon Blanc, and Syrah. Examples of the grape varieties can be seen in Fig. 1. The different grape varieties offer a diverse set of visual characteristics, differing in color, size, and compactness. To prevent network bias toward any grape variety, the dataset contains near-equal examples of each variety.

The dataset comes with bounding box and mask annotations. While box annotations are provided with every image in the dataset, mask annotations were applied to a random subset of 110 images. This amounts to a total of 4431 boxes and 2020 binary masks in the dataset. To maximize the amount of data available for training and testing, we focus on grape cluster detection using bounding boxes, ignoring the mask segmentation data. The ground-truth bounding boxes were manually annotated by Santos et al. [45].

In order to evaluate the performance of the network, the dataset was partitioned into training and test sets. The training partition contains 80% of the dataset, while 20% was reserved for testing. Additionally, 20% of the training data was reserved for validation. The data was partitioned such that both the training and test sets were balanced with respect to grape varieties. The exact details of how data was partitioned can be seen in Table 1; we followed the recommendations provided by [45]. The grape varieties are provided for illustrative purposes, as evaluation was performed over the entire test set without analyzing performance on specific varieties.

4.2 Training Procedure
To study the current capabilities of SSL transfer performance on grape detection, we selected two supervised pretraining methods as points of comparison. First, we used weights initialized from ImageNet supervised pretraining. Additionally, since we evaluated SSL on an object detection task, we also used weights initialized from supervised pretraining on COCO object detection. As part of this work, we aimed to follow the procedure and build upon the results of [45], as we used the same dataset. As a result, like [45], we treat grape detection as an object detection task, where grape clusters must be localized with bounding boxes. When considering grape detection for the purposes of yield forecasting, it is also sensible to count individual grape berries in an image.
Fig. 1. Different Grape Varieties: (a) Chardonnay, (b) Cabernet Sauvignon, (c) Sauvignon Blanc, (d) Cabernet Franc, (e) Syrah

Table 1. Dataset Partitions by Variety

Split      Variety             Images  Boxed Bunches
Train/Val  Chardonnay              50            660
Train/Val  Cabernet Franc          55            910
Train/Val  Cabernet Sauvignon      48            532
Train/Val  Sauvignon Blanc         51           1033
Train/Val  Syrah                   38            446
Test       Chardonnay              15            180
Test       Cabernet Franc          10            159
Test       Cabernet Sauvignon       9            111
Test       Sauvignon Blanc         14            283
Test       Syrah                   10            117
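As a quick consistency check, the per-variety counts in Table 1 can be summed and compared with the dataset totals quoted in Sect. 4.1 (300 images and 4431 boxes). The dictionaries below simply copy the table; the variable and function names are illustrative.

```python
# (images, boxed bunches) per variety, copied from Table 1
train_val_split = {
    "Chardonnay": (50, 660),
    "Cabernet Franc": (55, 910),
    "Cabernet Sauvignon": (48, 532),
    "Sauvignon Blanc": (51, 1033),
    "Syrah": (38, 446),
}
test_split = {
    "Chardonnay": (15, 180),
    "Cabernet Franc": (10, 159),
    "Cabernet Sauvignon": (9, 111),
    "Sauvignon Blanc": (14, 283),
    "Syrah": (10, 117),
}


def totals(split):
    """Sum images and boxes over all varieties in one split."""
    images = sum(n_img for n_img, _ in split.values())
    boxes = sum(n_box for _, n_box in split.values())
    return images, boxes


train_images, train_boxes = totals(train_val_split)
test_images, test_boxes = totals(test_split)
print(train_images + test_images)  # 300 images in total
print(train_boxes + test_boxes)    # 4431 boxes in total
print(round(test_images / (train_images + test_images), 3))  # 0.193
```

The test partition therefore holds 58 of 300 images, slightly under the nominal 20%, which is what one would expect when the split is balanced per variety rather than drawn from the pooled images.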
However, we reserve alternative approaches for future research, as we first aim to compare BYOL to a reference experiment utilizing supervised pretraining. We also used the same set of augmentations for training as [45]: random pixel dropout, additive Gaussian noise, Gaussian blur, contrast enhancement, and horizontal flipping. A random subset of augmentations was applied to each image batch during training. A Faster R-CNN with a ResNet-50 backbone and an FPN [28] was used for this experiment. We used the AdamW optimizer [32] and performed a hyperparameter search over the following hyperparameters: the learning rate, the decoupled weight decay factor in AdamW, and the NMS threshold used in the Faster R-CNN. The hyperparameter search was conducted using Population Based
Training [24]. All other hyperparameters were set to the defaults established in the torchvision library.

While we designed an experiment to compare the transfer performance of three pretraining methods (Supervised ImageNet, Supervised COCO, and BYOL), we also considered the effect of freezing different ResNet layers. More specifically, we looked at three finetuning scenarios. First, we froze all 5 ResNet layers, only updating FPN and Faster R-CNN weights. Next, we finetuned only the last ResNet layer. Finally, we finetuned the entire backbone. These scenarios were termed “Freeze 5”, “Freeze 4”, and “Freeze 0”, respectively, corresponding to the number of ResNet layers that were frozen during training. It is worth mentioning that COCO pretraining was the only method that initialized FPN weights; otherwise, both the FPN and Faster R-CNN weights were trained from scratch while finetuning.

Lastly, while evaluation was reported on the WGISD test partition using COCO AP metrics, we also analyzed performance on a separate grape dataset: the CR2 dataset [10]. Since the CR2 dataset lacks bounding-box annotations (it includes dot annotations instead), performance was observed by visual inspection only. However, introducing this dataset helped illustrate the ability of each pretraining method and each finetuning scenario to generalize to a new variety (Teroldego) in a different vineyard. Detection on the CR2 dataset is presented in Sect. 5.
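The three finetuning scenarios can be sketched as a mapping from the number of frozen backbone stages to the stages that still receive gradient updates. The stage names below follow torchvision's ResNet naming, and counting the stem (`conv1`) as the first of the five "layers" is an assumption about how the paper counts them; the helper `trainable_stages` is hypothetical.

```python
# ResNet-50 stages ordered from input to output. Counting the stem as one
# of the five "layers" is an assumption, not something the paper states.
RESNET_STAGES = ["conv1", "layer1", "layer2", "layer3", "layer4"]


def trainable_stages(num_frozen):
    """Return the stages left trainable when the first num_frozen are frozen."""
    if not 0 <= num_frozen <= len(RESNET_STAGES):
        raise ValueError("num_frozen must be between 0 and 5")
    return RESNET_STAGES[num_frozen:]


for name, frozen in [("Freeze 5", 5), ("Freeze 4", 4), ("Freeze 0", 0)]:
    print(name, "->", trainable_stages(frozen) or "backbone fully frozen")
```

In torchvision's detection builders such as `fasterrcnn_resnet50_fpn`, the related `trainable_backbone_layers` argument counts trainable stages from the output side, so under this reading “Freeze 4” would correspond to a value of 1. Whether the authors configured freezing this way is not stated in the text.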
5 Results
Numerical results for all pretraining methods and finetuning scenarios can be seen in Table 2. COCO AP ranged from 0.404 to 0.551 and AP50 ranged from 0.808 to 0.886, with BYOL achieving 0.505 AP and 0.877 AP50 . A network trained from scratch was added for reference. In general, BYOL was competitive with supervised pretraining methods, even outperforming ImageNet supervised pretraining for all finetuning scenarios. COCO supervised pretraining consistently performed the best. This is possibly due to the fact that COCO pretraining involves object detection, which matches the downstream task at hand. Furthermore, COCO pretraining initialized FPN weights, while BYOL and ImageNet supervised pretraining do not. However, when considering the potential to effortlessly add training data for BYOL (and SSL methods in general) compared to supervised counterparts, the competitive results are promising. These results are an improvement over [45], which reported a 0.71 AP50 . This improvement is likely because [45] trained with the subset of WGISD containing mask annotations. One particularly interesting case is the relative performance between “Freeze 5” and “Freeze 4” scenarios. SSL methods employing pretext tasks observe a phenomenon where deep layers end up optimizing for the task being solved, resulting in poor transfer performance. By introducing “Freeze 4”, we could test whether downstream applications benefited from updating final-layer features. However, in general, finetuning the final ResNet layer had little effect on the
Table 2. Transfer Performance of All Pretraining Methods. Results are Provided using COCO AP, which Computes Average Precision over a Range of IoU Thresholds. Additionally, AP50 is Provided, which Represents AP at a Single IoU Threshold of 0.5. A Network Trained from Scratch is Reported for Reference. Supervised COCO Pretraining is most Accurate, while BYOL is the Second Most Accurate. Finetuning the Entire Network Improves Performance in every Case

Pretraining Method     Frozen Layers   AP      AP50
Supervised ImageNet    5               0.404   0.808
                       4               0.416   0.812
                       0               0.492   0.862
Supervised COCO        5               0.453   0.840
                       4               0.440   0.817
                       0               0.551   0.886
BYOL                   5               0.433   0.832
                       4               0.446   0.834
                       0               0.505   0.877
From Scratch           0               0.337   0.747
performance for grape detection, indicating that SSL features were well-suited for downstream tasks. Image results for each pretraining method can be seen in Fig. 2. Solid bounding boxes indicate true positives, while dashed boxes are false positives. Only bounding boxes with confidence score > 0.7 are displayed, with an IoU threshold of 0.5. Consistent with the numerical results, image results demonstrate that each pretraining method is competitive with the others. At times, detection accuracy suffers when cluster boundaries are ambiguous. Without in-field or 3-D data, cluster boundaries are difficult to discern, even for human annotators. Figure 3 demonstrates the effect of freezing different backbone layers with a BYOL-trained network. These results suggest that one should finetune an entire network for best performance on grape detection, as freezing layers produced images with overlapping bounding-box placement. Lastly, Fig. 4 displays BYOL transfer performance on an unseen dataset, CR2. Despite the distinct visual characteristics compared to WGISD, performance actually generalized well, even detecting overlapping clusters. Similar to detection on WGISD, CR2 performance appeared to improve when the entire backbone was finetuned during training.
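The effect of finetuning the entire backbone can be quantified directly from the AP column of Table 2. The short sketch below is only a convenience calculation on the reported numbers; the dictionary layout and function names are illustrative and not part of the original experiments.

```python
# COCO AP values copied from Table 2 (frozen layers -> AP).
TABLE2_AP = {
    "Supervised ImageNet": {5: 0.404, 4: 0.416, 0: 0.492},
    "Supervised COCO": {5: 0.453, 4: 0.440, 0: 0.551},
    "BYOL": {5: 0.433, 4: 0.446, 0: 0.505},
}
FROM_SCRATCH_AP = 0.337  # network trained from scratch, no frozen layers


def gain_from_full_finetuning(ap_by_frozen):
    """AP gained by finetuning the whole backbone instead of freezing it."""
    return round(ap_by_frozen[0] - ap_by_frozen[5], 3)


def gain_over_scratch(ap_by_frozen):
    """AP gained by pretraining, compared with training from scratch."""
    return round(ap_by_frozen[0] - FROM_SCRATCH_AP, 3)


if __name__ == "__main__":
    for method, ap in TABLE2_AP.items():
        print(method, gain_from_full_finetuning(ap), gain_over_scratch(ap))
```

For every pretraining method, "Freeze 0" yields a higher AP than "Freeze 5", which matches the observation that finetuning the entire network improves performance in every case.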
Fig. 2. Image Results for Networks Initialized via Different Pretraining Methods. True Positives (Solid Bounding Box) and False Positives (Dashed Bounding Box) are Marked. Detections have Confidence Score > 0.7 and an IoU Threshold of 0.5. a Network Pretrained with ImageNet Supervised Classification. b Network Pretrained with BYOL SSL Method. c Network Pretrained with COCO Supervised Object Detection
Fig. 3. Cluster Detection Performance for a BYOL-Trained Network when Freezing Different Layers. True Positives (Solid Bounding Box) and False Positives (Dashed Bounding Box) are Marked. a Freeze 5 b Freeze 4 c Freeze 0
Fig. 4. BYOL Transfer Performance on the CR2 Dataset. Detections have Confidence Score > 0.7. The CR2 Dataset was not seen during Training and does not Contain Ground-Truth Bounding-Box Annotations, so Evaluation is by Inspection only
5.1 Future Work
Research should explore training methods where Self-Supervised Learning is applied to datasets outside of ImageNet. Since SSL is not constrained to annotated data, datasets can increase in size, potentially allowing models to grow as well and carry greater learning capacity. Such systems can be applied to agriculture to continue improving upon the state of the art. Furthermore, SSL should be applied across data modalities, with applications in agriculture leveraging new features for more dynamic vision systems.
6 Conclusion
This work studied the performance capabilities of BYOL, a Self-Supervised Learning method, when transferred to an agricultural application such as grape cluster detection. BYOL transfer performance was compared to supervised pretraining methods using a Faster R-CNN with a ResNet-50 FPN backbone. Specifically, BYOL was compared with supervised pretraining on ImageNet and COCO. BYOL transfer performance was competitive with other pretraining methods, illustrating its future potential in learning features across data modalities, allowing deep learning to make progress in agricultural applications ranging from yield forecasting to crop phenotyping.
References
1. Afham, M., Dissanayake, I., Dissanayake, D., Dharmasiri, A., Thilakarathna, K., Rodrigo, R.: CrossPoint: self-supervised cross-modal contrastive learning for 3D point cloud understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9902–9912 (2022)
2. Aquino, A., Millan, B., Diago, M.-P., Tardaguila, J.: Automated early yield prediction in vineyards from on-the-go image acquisition. Comput. Electron. Agric. 144, 26–36 (2018)
3. Bargoti, S., Underwood, J.: Deep fruit detection in orchards. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3626–3633. IEEE (2017)
4. Bossard, L., Guillaumin, M., Van Gool, L.: Food-101 – mining discriminative components with random forests. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8694, pp. 446–461. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10599-4_29
5. Bresilla, K., Perulli, G.D., Boini, A., Morandi, B., Grappadelli, L.C., Manfrini, L.: Single-shot convolution neural networks for real-time fruit detection within the tree. Front. Plant Sci. 10, 611 (2019)
6. Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., Joulin, A.: Unsupervised learning of visual features by contrasting cluster assignments. Adv. Neural Inf. Process. Syst. 33, 9912–9924 (2020)
7. Cecotti, H., Rivera, A., Farhadloo, M., Pedroza, M.A.: Grape detection with convolutional neural networks. Expert Syst. Appl. 159, 113588 (2020)
8. Chen, S.W., et al.: Counting apples and oranges with deep learning: a data-driven approach. IEEE Robot. Autom. Lett. 2(2), 781–788 (2017)
9. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
10. Coviello, L., Cristoforetti, M., Jurman, G., Furlanello, C.: GBCNet: in-field grape berries counting for yield estimation by dilated CNNs. Appl. Sci. 10(14), 4870 (2020)
11. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)
12. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
13. Dey, D., Mummert, L., Sukthankar, R.: Classification of plant structures from uncalibrated image sequences. In: 2012 IEEE Workshop on the Applications of Computer Vision (WACV), pp. 329–336. IEEE (2012)
14. Doersch, C., Gupta, A., Efros, A.A.: Unsupervised visual representation learning by context prediction. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1422–1430 (2015)
15. Doersch, C., Zisserman, A.: Multi-task self-supervised visual learning. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2051–2060 (2017)
16. Ericsson, L., Gouk, H., Hospedales, T.M.: How well do self-supervised models transfer? In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5414–5423 (2021)
17. Feng, J., Zeng, L., He, L.: Apple fruit recognition algorithm based on multi-spectral dynamic image analysis. Sensors 19(4), 949 (2019)
18. Ge, Y., Xiong, Y., From, P.J.: Instance segmentation and localization of strawberries in farm conditions for automatic fruit harvesting. IFAC-PapersOnLine 52(30), 294–299 (2019)
19. Grill, J.-B., et al.: Bootstrap your own latent - a new approach to self-supervised learning. Adv. Neural Inf. Process. Syst. 33, 21271–21284 (2020)
20. Häni, N., Roy, P., Isler, V.: A comparative study of fruit detection and counting methods for yield mapping in apple orchards. J. Field Robot. 37(2), 263–282 (2020)
21. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 630–645. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_38
22. Herrero-Huerta, M., González-Aguilera, D., Rodriguez-Gonzalvez, P., Hernández-López, D.: Vineyard yield estimation by automatic 3D bunch modelling in field conditions. Comput. Electron. Agric. 110, 17–26 (2015)
23. Hung, C., Nieto, J., Taylor, Z., Underwood, J., Sukkarieh, S.: Orchard fruit segmentation using multi-spectral feature learning. In: 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5314–5320. IEEE (2013)
24. Jaderberg, M., et al.: Population based training of neural networks. arXiv preprint arXiv:1711.09846 (2017)
25. Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3D object representations for fine-grained categorization. In: 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia (2013)
26. VAN WOENSEL Lieve. Precision-agriculture and the future of farming in Europe. https://policycommons.net/artifacts/1996735/precision/2748500/ (2016). Accessed 15 April 2022
27. Lin, G., Tang, Y., Zou, X., Xiong, J., Fang, Y.: Color-, depth-, and shape-based 3d fruit detection. Precis. Agric. 21(1), 1–17 (2020)
28. Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125 (2017)
29. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
30. Liu, S., Cossell, S., Tang, J., Dunn, G., Whitty, M.: A computer vision system for early stage grape yield estimation based on shoot detection. Comput. Electron. Agric. 137, 88–101 (2017)
31. Liu, S., Whitty, M., Cossell, S.: Automatic grape bunch detection in vineyards for precise yield estimation. In: 2015 14th IAPR International Conference on Machine Vision Applications (MVA), pp. 238–241. IEEE (2015)
32. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
33. Marani, R., Milella, A., Petitti, A., Reina, G.: Deep neural networks for grape bunch segmentation in natural images from a consumer-grade camera. Precis. Agric. 22(2), 387–413 (2021)
34. Misra, I., van der Maaten, L.: Self-supervised learning of pretext-invariant representations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6707–6717 (2020)
35. Nilsback, M.-E., Zisserman, A.: Automated flower classification over a large number of classes. In: 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729. IEEE (2008)
36. Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9910, pp. 69–84. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46466-4_5
37. Nuske, S., Wilshusen, K., Achar, S., Yoder, L., Narasimhan, S., Singh, S.: Automated visual yield estimation in vineyards. J. Field Robot. 31(5), 837–860 (2014)
38. Palacios, F., Diago, M.P., Tardaguila, J.: A non-invasive method based on computer vision for grapevine cluster compactness assessment using a mobile sensing platform under field conditions. Sensors 19(17), 3799 (2019)
39. Pothen, Z.S., Nuske, S.: Texture-based fruit detection via images using the smooth patterns on the fruit. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 5171–5176. IEEE (2016)
40. Reis, M.J.C.S., et al.: Automatic detection of bunches of grapes in natural environment from color images. J. Appl. Logic 10(4), 285–290 (2012)
41. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, vol. 28 (2015)
42. Roy, P., Isler, V.: Surveying apple orchards with a monocular vision system. In: 2016 IEEE International Conference on Automation Science and Engineering (CASE), pp. 916–921. IEEE (2016)
43. Sa, I., Ge, Z., Dayoub, F., Upcroft, B., Perez, T., McCool, C.: DeepFruits: a fruit detection system using deep neural networks. Sensors 16(8), 1222 (2016)
44. Santos, T.T., Buiani, M.: Embrapa wine grape instance segmentation dataset embrapa wgisd. https://github.com/charlespwd/project-title (2019)
45. Santos, T.T., de Souza, L.L., dos Santos, A.A., Avila, S.: Grape detection, segmentation, and tracking using deep neural networks and three-dimensional association. Comput. Electron. Agric. 170, 105247 (2020)
46. Santos, T.T., Bassoi, L.H., Oldoni, H., Martins, R.L.: Automatic grape bunch detection in vineyards based on affordable 3D phenotyping using a consumer webcam. In: CONGRESSO BRASILEIRO DE AGROINFORMÁTICA, 11, 2017, Campinas. Ciência de (2017)
47. Wang, Z., Walsh, K., Koirala, A.: Mango fruit load estimation using a video based mangoYOLO-kalman filter-hungarian algorithm method. Sensors 19(12), 2742 (2019)
48. Zabawa, L., et al.: Detection of single grapevine berries in images using fully convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0 (2019)
49. Zbontar, J., Jing, L., Misra, I., LeCun, Y., Deny, S.: Barlow twins: self-supervised learning via redundancy reduction. In: International Conference on Machine Learning, pp. 12310–12320. PMLR (2021)
50. Zhang, Z., Girdhar, R., Joulin, A., Misra, I.: Self-supervised pretraining of 3D features on any point-cloud. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10252–10263 (2021)
HINAY: A Mobile Application for Real-Time Traffic Sign Detection
Daniel Pete M. Aguilar and Reginald Neil C. Recario
University of the Philippines Los Baños, College, Batong Malake, 4031 Los Baños, Laguna, Philippines
{dmaguilar2,rcrecario}@up.edu.ph
Abstract. Road traffic accidents are one of the leading causes of death globally. These accidents mainly involve driver error and poor knowledge of traffic signage. This study examined computer vision models for real-time detection in terms of accuracy and speed, and implemented an Android mobile application for real-time traffic sign detection with voice alerts using Java and TensorFlow. The application includes a machine learning model that was trained on original (imbalanced) and augmented (balanced) datasets. Four pre-trained SSD MobileNet object detection models from the TensorFlow 2 Detection Model Zoo were evaluated for overall accuracy and for speed in terms of frames per second and inference time. The study found that models trained on the balanced dataset performed better than those trained on the imbalanced dataset. A higher number of training steps also improved the accuracy of the object detection models. Lastly, SSD MobileNet v2 320 × 320 provided the best trade-off between real-time traffic sign detection accuracy (78.5%) and average inference time (50 ms).
Keywords: Traffic Sign Detection · Computer Vision · SSD · MobileNet · Mobile Application
1 Introduction
Although motorization enhanced many societies, the reaped benefits came with a price. Road traffic accidents (RTAs) are one of the leading causes of death worldwide. According to the World Health Organization (WHO), an estimated 1.3 million road traffic injuries are reported every year due to road traffic accidents [1]. Although the rate of road accidents decreased in high-income countries in recent decades, the burden of injuries caused by RTAs in societal and economic costs is rising, especially in developing countries. In 2019, the World Health Organization (WHO) reported road injury as the tenth leading cause of death for upper and lower-middle-income countries next to Stomach Cancer and Diabetes Mellitus, respectively. For developing countries like India, 85% of all deaths and 90% of disability-adjusted life years were lost due to road traffic injuries [2,3].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): SAI 2023, LNNS 711, pp. 1058–1078, 2023. https://doi.org/10.1007/978-3-031-37717-4_69
In the Philippines, particularly in the main metropolitan area, an average of 178 road crash cases per day were reported in the Metro Manila Accident Reporting and Analysis System (MMARAS) of the Metropolitan Manila Development Authority (MMDA) 2020. Some of the risk factors in most accidents include driver error and overspeeding [4]. The government recognizes the alarming number of road crash cases. To address this, the Philippines under former President Benigno Aquino III developed the Action Plan of the Decade of Action for Road Safety 2011–2020 on March 11, 2011, and a revision was released for 2017–2022, adopting a vision of zero road traffic deaths, or at least a 20% decrease in the rate by the year 2022. The United Nations Road Safety Collaboration initiated this plan. The action plan focused on the following: (1) road safety management, (2) safe roads and mobility, (3) safe vehicles, (4) safe road users, and (5) post-crash care [5]. Traffic signs help reduce driver errors. They help maintain order and provide information to drivers along thoroughfares. They have two main goals: (1) to control traffic flow, and (2) to warn drivers of hazards [6]. One of the requirements for getting a driver’s license is being familiar with common traffic signs. The Land Transportation Office (LTO) even implemented a stricter policy on issuing driver’s licenses, one requirement of which is passing a 15-hour lecture about traffic rules, signs, and regulations [7]. It is a driver’s responsibility to recognize and decode signs on the road quickly. Although many may recognize traffic signs, it is not necessarily true that they understand the meaning of each sign. In 2011, the top violations of motorists in the Philippines included violations of no U-turn, no loading/unloading, and speed limit signs [8]. 
In addition to the government’s efforts, scientists and researchers have been continuously developing technologies related to Advanced Driver-Assistance Systems (ADAS) to help drivers in terms of workload control in the vehicle and provide warnings on possible hazards on the road. The development of computing machines and computer vision drove the demand for driving-related technologies that gather and process information necessary for safe driving. Some ADAS applications include automatic emergency braking, parking assist, pedestrian detection, driver drowsiness detection, and surround view. Traffic sign recognition algorithms became an essential part of ADAS. However, such systems are limited and only available in smart cars that cost a fortune, like the Audi A8, Mercedes-Benz SL-Class, Volvo S60, and others [9]. Unstable weather conditions also pose a challenge, since they lessen the efficiency and accuracy of a Traffic Sign Detection and Recognition (TSDR) system. Furthermore, more technological advancements paved the way for computer vision and artificial intelligence development, specifically deep neural networks. A deep neural network is a multi-layered Artificial Neural Network (ANN) that aims to imitate or simplify how the human brain works [10]. One example is the Convolutional Neural Network (CNN). It is a popular and widely used approach in image classification, computer vision, and natural language processing. In addition, CNN is an excellent tool for handling large datasets that involve images and solving machine learning problems [11]. In relation to the
TSDR system, considering how large the dataset is and the complexity of traffic sign classification, the CNN as the classifier has the best trade-off between performance speed and classification accuracy on traffic signs against known algorithms such as K-nearest neighbors (KNN), Support Vector Machine (SVM), and Multilayer Perceptron (MLPC) given that the testing was done only in daylight condition using an embedded system using a microcomputer with an interface camera, display, and speaker [12]. However, this system is costly and time-consuming to assemble, assuming that you have no knowledge of the individual parts and how they work. Therefore, implementing a TSDR system that can run on a mobile device would be more accessible and considerably cheap. In addition, drivers who currently have Android mobile phones can easily access the application and mount their phones inside their vehicles.
2 Objectives
This study aims to implement traffic sign detection in a mobile application. Specifically, the study (1) collected traffic sign image data from the Google Street View (GSV) database in a daytime setting from December 2021 to May 2022, (2) performed data augmentation on the collected data, (3) retrained four pre-trained Single-Shot MultiBox Detection (SSD) MobileNet object detection models from the TensorFlow 2 Detection Model Zoo and converted them into TensorFlow 2 Lite models, (4) developed a mobile application for the model based on the TensorFlow 2 Lite Object Detection Android Demo App with the help of the TensorFlow Object Detection API [14], (5) provided a speech output with information regarding the traffic sign, and (6) measured the performance and accuracy of the object detection models as well as the mobile application. Transfer learning was applied for the creation of the object detection model using TensorFlow. The object detection model was selected over a simple classification model since it already includes object localization, for locating and bounding detected objects in a specific frame, and object classification, for classifying the bounded objects [15]. A classification model would require a separate detector mechanism that captures and locates a certain object in order to function properly and accurately. Considering that mobile devices nowadays are not that powerful yet, combining localization and classification into one model greatly benefits the mobile application. Consequently, the model was converted into a TensorFlow 2 Lite model to comply with the hardware and software capabilities of a mobile device. The converted model was applied and tested in a mobile application developed on top of the TensorFlow 2 Lite Object Detection Android Demo App [13], which uses the TensorFlow Object Detection API [14]. The traffic signs to be considered for detection and recognition are shown in Fig. 1. 
The unknown class pertains only to other circular traffic signs.
Fig. 1. Traffic Signs to be Detected in the Application.
3 Materials and Methods
The TSDR mobile application was developed using TensorFlow, Android Studio, and Java. It was tested on an Android mobile device with a low to mid-range camera module and processor. Figure 2 shows the detailed design and process of the application.
Fig. 2. Process Flow of the Traffic Sign Detection System in Mobile Application.
The input data for the mobile application was sourced from its live camera feed. Each frame from the feed was pre-processed by resizing the image to the input size of the particular object detection model. The pre-processed frame underwent a traffic sign localization process wherein detected traffic signs were located and bounded by a box. If localization was successful, the bounded region was classified against the six classes defined in the model. The mobile application notified the user when the classification process was successfully carried out. The notification took the form of a speech output from the mobile device.
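The last step of this flow, deciding which detection to announce for a frame, can be sketched as follows. This is a minimal illustration rather than the application's actual code (which is written in Java): the `Detection` type, the `choose_alert` function, and the `min_score` cut-off of 0.5 are all assumptions introduced here. It speaks the highest-confidence prediction, in line with the application design described in the Mobile Application subsection.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Detection(NamedTuple):
    """One detected traffic sign in a frame (hypothetical structure)."""
    label: str                              # predicted traffic sign class
    score: float                            # model confidence in [0, 1]
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def choose_alert(detections: Sequence[Detection],
                 min_score: float = 0.5) -> Optional[str]:
    """Return the label to speak for one frame, or None if nothing qualifies.

    min_score is an assumed cut-off; the text does not state the value used.
    """
    confident = [d for d in detections if d.score >= min_score]
    if not confident:
        return None
    # Announce only the single most confident detection in the frame.
    return max(confident, key=lambda d: d.score).label


if __name__ == "__main__":
    frame = [
        Detection("No Parking", 0.62, (10, 10, 60, 60)),
        Detection("No U-Turn", 0.91, (100, 20, 150, 70)),
    ]
    print(choose_alert(frame))
```

Returning `None` when no detection clears the cut-off keeps the speech output silent for frames without a confident traffic sign.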
3.1 Data Set
The signs seen in Fig. 1 were used for the retraining of the pre-trained object detection models. The pre-trained models were trained on the Microsoft Common Objects in Context 2017 dataset, or MS COCO 2017 dataset [16], which includes 80 different classes with 2.5 million labeled instances in 328,000 images. The dataset used for the retraining was manually sourced from the Google Street View (GSV) database. Figure 3 shows sample images from the GSV database.
Fig. 3. Description of Images from Left to Right, Top to Bottom: (1) No Parking, (2) No U-Turn, (3) No Overtaking (belongs to the Unknown Class Defined in Fig. 1), (4) 60 kph Speed Limit, (5) No Right Turn, and (6) No Left Turn
The traffic sign images for the dataset were collected from the three metropolitan areas in the Philippines, namely Metro Manila, Metro Cebu, and Metro Davao. Two versions of the dataset were created for the training and validation of the model: original and augmented. A Python package called Augmentor [17] was used for the data augmentation. The image augmentation functions that were utilized were random brightness, random distortion, and perspective skewing. Random brightness used a probability of 1, a minimum factor of 0.3, and a maximum factor of 1.2. Random distortion used a probability of 0.9, a grid width of 4, a grid height of 4, and a magnitude of 8. Lastly, perspective skewing used a probability of 0.8. Each image was labeled manually using a graphical tool in Python called LabelImg [18]. In addition, an image may contain one or more traffic signs. The original dataset contained 527 images with 673 labeled traffic signs. Its distribution of traffic sign classes was uneven, since the signs vary across locations. Figure 4 shows the frequency distribution of traffic signs in the original dataset. To solve this problem, an augmented dataset was used. This resulted in 1,033 images with 1,200 labeled traffic signs and an even distribution of traffic sign classes. The datasets were then split into two sets: the training set and the validation set contained 80% and 20% of the data, respectively. The validation set was used for the evaluation of the model for every 1,000 steps
Fig. 4. Original Dataset Frequency Distribution
or checkpoint. The test data used for the model testing came from road traffic videos posted online, while the test data for mobile application testing came from the actual road traffic footage taken using the camera of a mobile device.
3.2 Object Detection
Machine learning models have a lot of different variations depending on their use case. Classification models or CNN models are among the most popular models used in the field of Artificial Intelligence and Computer Science. CNNs became popular because they automate feature extraction through the convolution of images with filters in a series of layers. The way a CNN works is inspired by the biology of the human eye. Images are taken as input and turned into smaller pieces of information, or convolutions. Features such as edges, shapes, and colors are extracted from each convolved image. The extracted features are processed at a higher level and classified based on the information gathered. Moreover, CNN models have been used in different systems and devices, including mobile phones. However, classification tasks on mobile devices require lightweight and low-power models in order to meet the hardware constraints. In this case, the MobileNet CNN architecture was utilized in the application [19]. This architecture uses depthwise separable convolutions, which aim to reduce the number of computations in the first few layers. A depthwise separable convolution is a factored version of the standard convolution, consisting of a depthwise convolution and a pointwise convolution. Pointwise convolution is a 1 × 1 convolutional filter applied to the result of the depthwise filters and the channels of an input image. The architecture also includes two hyper-parameters, namely the width multiplier and the resolution multiplier. The width multiplier determines the number of depthwise convolution filters, or the width of the network, while the resolution multiplier scales the resolution of the input image. This study also focused on SSD models since YOLO models are not available in the TensorFlow 2 Detection Model Zoo [20] and SSD models were the only models convertible to a TensorFlow 2 Lite model at the time of the study. The models were compared in terms of accuracy and performance. Table 1 shows the involved models along with their performance evaluation results on the MS COCO 2017 dataset [16].
Table 1. SSD MobileNet Models on TensorFlow 2 Detection Model Zoo [20]
Model Name                             Speed (ms)   COCO mAP   Output
SSD MobileNet V2 320 × 320             19.00        20.20      Boxes
SSD MobileNet V2 FPNLite 320 × 320     22.00        22.20      Boxes
SSD MobileNet V2 FPNLite 640 × 640     39.00        28.20      Boxes
SSD MobileNet V1 FPN 640 × 640         48.00        29.10      Boxes

3.3 Model Training
The training of the models was done using Google Colaboratory (Google Colab) [21]. The runtime was hosted on the cloud, with a GPU set as the hardware accelerator to speed up the training process. The data were stored and accessed via Google Drive, and the version of TensorFlow used for the training process was 2.0. The training method was transfer learning, since the models used in the process are pre-trained on the MS COCO 2017 dataset [16]. The models listed in Table 1 were retrained on the traffic sign datasets that were manually collected and labeled for this study. Each model was trained on two types of datasets, original and augmented. The datasets were randomly split into train and validation sets based on their hashed value. Only the training steps, the batch size, and the learning rate base were modified in the configuration file of the model for the training process. The setup for the model training included variations in training steps and learning rate base. The training steps tested on each model were 10k, 15k, and 20k, and the learning rate bases tested were 0.07, 0.08, and 0.09. Other parameters in the model training configuration file were left unmodified. After the training and validation, the saved model was converted into a TensorFlow 2 Lite model for mobile inference.
3.4 Mobile Application
The mobile application was developed on an Android device. It was done with the use of Android Studio 4.2.2 [22], Java 8, and TensorFlow 2 Object Detection
API [14]. The compile Software Development Kit (SDK) version used for the mobile application was API level 30 (Android 11), and the minimum SDK version was 21. The saved TensorFlow 2 Lite models from the training were imported into the mobile application for mobile inference. The application was designed to handle more than one traffic sign in a frame and to produce a speech output for the prediction with the highest confidence level. It also displays the confidence level and generates bounding boxes for the detected objects, labeled with the initials of the traffic sign classes shown in Fig. 1. Only the best-performing version of each SSD MobileNet model was tested in the application for the mobile performance comparison.
3.5 Performance Evaluation
Object detection evaluation is not as straightforward as evaluating a classification model. Since object detection involves both object localization and classification, the model makes a prediction for each detected localized area, and performance is measured on those predictions. In this case, True Positive (TP) was considered a correct detection, False Positive (FP) an incorrect detection, and False Negative (FN) a missed detection. True Negative (TN) was not utilized in the evaluation process since it covers the regions where there are no objects and the model predicted a non-detection. In short, it pertains to the image background against which the objects appear. Another metric necessary for the object detection model is the Intersection over Union (IoU). This metric measures the degree of overlap between the prediction made by the model and the ground truth where the object to be detected resides [23]. The computation of IoU is illustrated in Fig. 5.
Fig. 5. Intersection over Union Diagram [23].
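As a minimal sketch of the IoU computation illustrated in Fig. 5, the function below assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max) in pixels. The box format, the function name, and the example boxes are assumptions made for illustration and are not taken from the study.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle, if the boxes overlap at all.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped at zero so disjoint boxes give no intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical boxes: the prediction covers half of the ground truth.
ground_truth = (0, 0, 10, 10)
prediction = (5, 0, 15, 10)
print(iou(ground_truth, prediction))
```

In this example the intersection is 50 square pixels and the union is 150, so the IoU is one third.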
The IoU was applied in the study through thresholding [23]. A threshold value is defined, and comparing the IoU of a prediction against this value determines whether the prediction is counted as a TP, FP, or FN. Figure 6 demonstrates the process.
Fig. 6. IoU Thresholding Classification of TP, FP, and FN [23].
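The thresholding rule can be sketched as follows. This is a simplified illustration under stated assumptions: each ground-truth object is paired with at most one prediction of the same class, a missing prediction is represented by `None`, and the default threshold of 0.5 is only an example value. The function name is hypothetical.

```python
def classify_detection(iou_value, threshold=0.5):
    """Label one ground-truth/prediction pairing as 'TP', 'FP', or 'FN'.

    iou_value is None when the ground-truth object received no prediction.
    """
    if iou_value is None:
        return "FN"  # missed detection
    if iou_value >= threshold:
        return "TP"  # overlap is large enough: correct detection
    return "FP"      # overlap is too small: incorrect detection


# Hypothetical IoU values for three ground-truth objects.
for value in (0.82, 0.31, None):
    print(value, classify_detection(value))
```

Raising the threshold (for example to 0.75) turns loosely localized predictions from TP into FP, which is why stricter IoU thresholds generally yield lower scores.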
Furthermore, the study used IoU with thresholding as the basis for computing Precision and Recall. According to [24], Precision is the number of correct detections as a proportion of all detections, whereas Recall is the number of correct detections as a proportion of all ground truths. Precision and Recall are computed as:

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

Both metrics range from zero to one, and a model with high Precision and high Recall is considered a good model. These metrics are general and cover a lot of ground in model evaluation. For a more specific evaluation, the study used the MS COCO detection metrics, which incorporate all the object detection metrics discussed.
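A short sketch of the two formulas, using hypothetical counts chosen only for illustration (they are not results from the study). The function name and the zero-denominator convention are assumptions.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from detection counts.

    A value of 0.0 is returned when the corresponding denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Hypothetical counts: 8 correct, 2 incorrect, and 4 missed detections.
precision, recall = precision_recall(tp=8, fp=2, fn=4)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

With these counts, 8 of the 10 detections are correct and 8 of the 12 ground-truth signs are found.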
Fig. 7. Some Metrics Utilized in Characterizing the Performance of an Object Detector on COCO [16].
As seen in Fig. 7, Precision and Recall were used in computing the Average Precision (AP) for each class in the model. The results in each class were used to compute the mean Average Precision (mAP) of the model. In MS COCO detection metrics, there is no distinction between AP and mAP. However, the study used mAP for the overall model evaluation metric in order to avoid confusion
in regard to the terminologies. Moreover, mAP was computed at different IoU values. The primary challenge metric uses IoU thresholds from 0.5 to 0.95 with a step of 0.05, denoted mAP@[IoU = 0.5:0.95]. According to [16], taking the mean of AP per class over different IoU thresholds benefits object detection models with better localization. The PASCAL VOC metric [25], in which mAP is computed at an IoU threshold of 0.5, is included in the MS COCO detection metrics. As a stricter metric, mAP at an IoU threshold of 0.75 was also used. mAP was also computed across three object scales: small, medium, and large. Small objects cover an area of less than 32 × 32 pixels, medium objects cover an area between 32 × 32 and 96 × 96 pixels, and large objects cover an area greater than 96 × 96 pixels. The performance of the mobile application, on the other hand, was measured in terms of detection and classification accuracy using a confusion matrix. Two sets of test data were considered. The first consists of road traffic videos posted online. The second is actual road traffic footage taken live with the camera of the mobile device.
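The threshold range and the scale buckets described above can be sketched in a few lines. This is an illustrative sketch, not the evaluation code used in the study: the function names are hypothetical, and the handling of areas exactly on a boundary is an assumption.

```python
# IoU thresholds of the primary challenge metric: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def mean_over_thresholds(ap_by_threshold):
    """Average the AP values measured at each of the ten IoU thresholds."""
    total = sum(ap_by_threshold[t] for t in IOU_THRESHOLDS)
    return total / len(IOU_THRESHOLDS)


def size_category(width, height):
    """Assign a bounding box to the small, medium, or large bucket by pixel area."""
    area = width * height
    if area < 32 * 32:
        return "small"
    if area <= 96 * 96:  # boundary handling is an assumption
        return "medium"
    return "large"


print(IOU_THRESHOLDS)
print(size_category(20, 20), size_category(50, 50), size_category(100, 100))
```

The PASCAL VOC metric and the strict metric correspond to reading the AP at the single thresholds 0.5 and 0.75 instead of averaging over all ten.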
4 Results and Discussion
The models involved in the training process were evaluated in different setups. The first was through the validation data, which was used to evaluate each model at 10k, 15k, and 20k training steps under three learning rate base settings: 0.07, 0.08, and 0.09.

4.1 mAP at Different IoUs
Running the validation data against the trained models at each specified checkpoint and on each type of dataset produced the results shown in Table 2. In terms of the dataset, the models trained on augmented (balanced) data generally achieved greater mAP scores than the models trained on original (imbalanced) data. The margin between the mAP scores on the original and augmented datasets varies by model. The SSD MobileNet V2 320 × 320 model recorded the lowest margin between the two datasets, with an average mAP score difference of 1% to 7%. On the other hand, the model that benefited most from the augmented data is the SSD MobileNet V2 FPNLite 320 × 320 model, which recorded an average mAP score difference of 1% to 13% between the original and augmented datasets. Moreover, the highest mAP scores were recorded at 20k training steps, which suggests that the models would benefit from more training steps. For the learning rate base, the result varies for each model at different numbers of training steps. For a more consistent and specific comparison, the data at 20k steps were analyzed. For SSD MobileNet V2 320 × 320 and SSD
Table 2. (Primary Challenge Metric) Detection mAP (%) at Different IoU Values Tested against Various Setups, Involving Training Steps (10,000, 15,000, and 20,000) and Learning Rate Base (0.07, 0.08, 0.09) with Original (O) or Imbalanced Dataset and Augmented (A) or Balanced Dataset on Four SSD MobileNet Models. Values are mAP@[IoU = 0.5:0.95] (%).

| Model | Learning Rate Base | 10,000 O | 10,000 A | 15,000 O | 15,000 A | 20,000 O | 20,000 A |
|---|---|---|---|---|---|---|---|
| SSD MobileNet V2 320 × 320 | 0.07 | 65.30 | 66.20 | 62.20 | 70.70 | 68.40 | 71.10 |
| SSD MobileNet V2 320 × 320 | 0.08 | 63.30 | 66.80 | 63.30 | 68.00 | 67.30 | 65.70 |
| SSD MobileNet V2 320 × 320 | 0.09 | 64.40 | 62.70 | 62.80 | 67.80 | 66.60 | 68.70 |
| SSD MobileNet V2 FPNLite 320 × 320 | 0.07 | 62.50 | 72.70 | 60.50 | 73.40 | 61.60 | 74.40 |
| SSD MobileNet V2 FPNLite 320 × 320 | 0.08 | 59.70 | 60.70 | 63.50 | 74.40 | 62.30 | 74.80 |
| SSD MobileNet V2 FPNLite 320 × 320 | 0.09 | 62.90 | 75.50 | 60.70 | 72.10 | 63.30 | 75.70 |
| SSD MobileNet V2 FPNLite 640 × 640 | 0.07 | 74.20 | 81.70 | 76.10 | 79.50 | 76.50 | 82.80 |
| SSD MobileNet V2 FPNLite 640 × 640 | 0.08 | 71.60 | 73.10 | 76.80 | 81.60 | 76.10 | 76.80 |
| SSD MobileNet V2 FPNLite 640 × 640 | 0.09 | 77.70 | 80.20 | 73.60 | 81.30 | 75.30 | 80.20 |
| SSD MobileNet V1 FPN 640 × 640 | 0.07 | 78.30 | 80.00 | 78.80 | 82.90 | 79.50 | 83.00 |
| SSD MobileNet V1 FPN 640 × 640 | 0.08 | 76.40 | 81.70 | 75.40 | 83.30 | 76.80 | 83.50 |
| SSD MobileNet V1 FPN 640 × 640 | 0.09 | 76.50 | 82.20 | 75.80 | 83.90 | 77.70 | 83.90 |
MobileNet V2 FPNLite 640 × 640, the learning rate base of 0.07 generated the highest mAP scores on both original and augmented data. For SSD MobileNet V2 FPNLite 320 × 320, the highest mAP scores were produced with a learning rate base of 0.09 on both original and augmented data. Lastly, SSD MobileNet V1 FPN 640 × 640 generated different results on original and augmented data: the best mAP for original data came from the learning rate base of 0.07, while the best mAP for augmented data came from the learning rate base of 0.09. Tables 3 and 4 display the mAP scores of the models evaluated using the PASCAL VOC metric [25] and the strict COCO detection metric, respectively. In general, the models recorded higher mAP scores on these metrics than on the primary challenge metric. This is consistent with the PASCAL VOC metric being the most lenient, in terms of IoU threshold, of the three mAP IoU metrics in the COCO detection metrics. SSD MobileNet V1 FPN 640 × 640 and SSD MobileNet V2 FPNLite 640 × 640 achieved better results than the other models in
Table 3. (PASCAL VOC Metric) Detection mAP (%) at 0.5 IoU Value Tested against Various Setups, Involving Training Steps (10,000, 15,000, and 20,000) and Learning Rate Base (0.07, 0.08, 0.09) with Original (O) or Imbalanced Dataset and Augmented (A) or Balanced Dataset on Four SSD MobileNet Models. Values are mAP@[0.5] (%).

| Model | Learning Rate Base | 10,000 O | 10,000 A | 15,000 O | 15,000 A | 20,000 O | 20,000 A |
|---|---|---|---|---|---|---|---|
| SSD MobileNet V2 320 × 320 | 0.07 | 86.20 | 92.70 | 83.50 | 93.60 | 86.80 | 94.10 |
| SSD MobileNet V2 320 × 320 | 0.08 | 83.30 | 92.50 | 85.30 | 91.90 | 87.20 | 92.80 |
| SSD MobileNet V2 320 × 320 | 0.09 | 86.20 | 85.20 | 86.00 | 93.70 | 88.30 | 94.40 |
| SSD MobileNet V2 FPNLite 320 × 320 | 0.07 | 75.70 | 90.50 | 73.80 | 90.10 | 74.40 | 92.40 |
| SSD MobileNet V2 FPNLite 320 × 320 | 0.08 | 74.20 | 78.90 | 77.50 | 91.70 | 75.90 | 92.70 |
| SSD MobileNet V2 FPNLite 320 × 320 | 0.09 | 77.10 | 92.20 | 74.90 | 89.50 | 77.90 | 92.20 |
| SSD MobileNet V2 FPNLite 640 × 640 | 0.07 | 93.30 | 97.60 | 92.00 | 96.60 | 92.60 | 97.40 |
| SSD MobileNet V2 FPNLite 640 × 640 | 0.08 | 87.10 | 91.70 | 94.00 | 97.00 | 91.80 | 93.30 |
| SSD MobileNet V2 FPNLite 640 × 640 | 0.09 | 93.70 | 97.00 | 91.90 | 96.70 | 90.70 | 97.30 |
| SSD MobileNet V1 FPN 640 × 640 | 0.07 | 92.00 | 94.50 | 93.50 | 95.90 | 94.50 | 95.70 |
| SSD MobileNet V1 FPN 640 × 640 | 0.08 | 91.40 | 96.50 | 90.70 | 96.50 | 91.60 | 96.90 |
| SSD MobileNet V1 FPN 640 × 640 | 0.09 | 91.70 | 96.10 | 92.50 | 96.70 | 92.30 | 97.30 |
the PASCAL VOC metric and the strict metric, with an average mAP score of more than 90% across all setups.

4.2 mAP Across Various Scales
The models were evaluated at the different traffic sign scales present in the image data using mAP@[IoU = 0.5:0.95]. The three scales considered are small, medium, and large. For small-scale traffic signs, none of the models performed very well, which is understandable given that these images are comparable to faraway traffic signs (Table 5). SSD MobileNet V1 FPN 640 × 640 recorded high mAP scores across all setups relative to the other models, although its average was still less than 80%. On the other hand, the models with an input image size of 320 × 320 performed poorly, with an average of less than 60% across all setups. In addition, there are instances where the mAP score of a model trained with imbalanced data is higher than that of the same model trained with balanced data, particularly for SSD MobileNet V2 320 × 320. Moreover, the models generated better mAP scores on medium-scale traffic signs than on small-scale traffic signs (Table 6).
Table 4. (Strict Metric) Detection mAP (%) at 0.75 IoU Value Tested against Various Setups, Involving Training Steps (10,000, 15,000, and 20,000) and Learning Rate Base (0.07, 0.08, 0.09) with Original (O) or Imbalanced Dataset and Augmented (A) or Balanced Dataset on Four SSD MobileNet Models. Values are mAP@[0.75] (%).

| Model | Learning Rate Base | 10,000 O | 10,000 A | 15,000 O | 15,000 A | 20,000 O | 20,000 A |
|---|---|---|---|---|---|---|---|
| SSD MobileNet V2 320 × 320 | 0.07 | 78.90 | 76.90 | 77.10 | 87.50 | 81.90 | 83.30 |
| SSD MobileNet V2 320 × 320 | 0.08 | 76.30 | 83.60 | 81.00 | 80.70 | 81.20 | 79.20 |
| SSD MobileNet V2 320 × 320 | 0.09 | 82.70 | 74.90 | 78.70 | 82.50 | 84.10 | 82.90 |
| SSD MobileNet V2 FPNLite 320 × 320 | 0.07 | 72.00 | 84.80 | 72.30 | 85.50 | 72.40 | 88.50 |
| SSD MobileNet V2 FPNLite 320 × 320 | 0.08 | 70.90 | 70.50 | 72.50 | 83.30 | 72.20 | 87.50 |
| SSD MobileNet V2 FPNLite 320 × 320 | 0.09 | 74.30 | 87.10 | 72.80 | 83.50 | 74.50 | 86.40 |
| SSD MobileNet V2 FPNLite 640 × 640 | 0.07 | 88.30 | 95.90 | 91.30 | 95.30 | 91.50 | 95.60 |
| SSD MobileNet V2 FPNLite 640 × 640 | 0.08 | 85.80 | 89.80 | 90.90 | 95.00 | 90.00 | 92.60 |
| SSD MobileNet V2 FPNLite 640 × 640 | 0.09 | 93.10 | 94.70 | 89.70 | 95.70 | 89.40 | 95.30 |
| SSD MobileNet V1 FPN 640 × 640 | 0.07 | 91.30 | 94.00 | 92.00 | 94.70 | 92.90 | 93.90 |
| SSD MobileNet V1 FPN 640 × 640 | 0.08 | 89.60 | 95.20 | 89.00 | 95.80 | 90.20 | 96.00 |
| SSD MobileNet V1 FPN 640 × 640 | 0.09 | 90.40 | 95.20 | 90.10 | 96.40 | 91.20 | 96.40 |
Specifically, the models with an input image size of 320 × 320 recorded percentage increases between 20% and 50% in their mAP results from small-scale to medium-scale traffic signs across all setups. The models with a 640 × 640 input image size, on the other hand, registered percentage increases between 8% and 36%. For large-scale traffic signs, the mAP scores were higher still than for the two previous scales (Table 7), and the percentage increases relative to the two previous scales also improved. However, SSD MobileNet V2 FPNLite 320 × 320 showed poor performance in detecting large traffic signs when trained on the original (imbalanced) dataset. In most setups, it scored even lower than its mAP scores on small traffic signs (Table 5). Taken together, the results on different scales show that larger traffic sign images can be detected with higher accuracy than smaller ones. In practice, large traffic signs correspond to signs that are near the human eye or camera device. In this case, the models are capable of accurately detecting traffic signs at short distances.
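The percentage increases quoted above can be checked against the tables with a short calculation. The sketch below assumes that "percentage increase" means the relative change from the small-scale score to the medium-scale score, and it uses two entries copied from Tables 5 and 6 (augmented data, 20k steps). The function name is hypothetical.

```python
def percentage_increase(old, new):
    """Relative change from old to new, expressed in percent."""
    return (new - old) / old * 100.0


# (small-scale mAP from Table 5, medium-scale mAP from Table 6), augmented data, 20k steps.
examples = {
    "SSD MobileNet V2 320 x 320, LR base 0.07": (57.10, 77.40),
    "SSD MobileNet V1 FPN 640 x 640, LR base 0.09": (77.30, 86.90),
}

for name, (small, medium) in examples.items():
    print(f"{name}: {percentage_increase(small, medium):.1f}% increase")
```

Both values fall inside the ranges stated for their input sizes: roughly 35.6% for the 320 × 320 model and roughly 12.4% for the 640 × 640 model.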
Table 5. Detection mAP (%) at Different IoU Values for Small Traffic Signs Tested against Various Setups, Involving Training Steps (10,000, 15,000, and 20,000) and Learning Rate Base (0.07, 0.08, 0.09) with Original (O) or Imbalanced Dataset and Augmented (A) or Balanced Dataset on Four SSD MobileNet Models. Values are mAP@[0.5:0.95] (%) for Small Traffic Signs (area < 32²).

| Model | Learning Rate Base | 10,000 O | 10,000 A | 15,000 O | 15,000 A | 20,000 O | 20,000 A |
|---|---|---|---|---|---|---|---|
| SSD MobileNet V2 320 × 320 | 0.07 | 57.50 | 52.00 | 49.60 | 56.00 | 59.90 | 57.10 |
| SSD MobileNet V2 320 × 320 | 0.08 | 54.90 | 49.80 | 55.50 | 55.60 | 55.70 | 48.40 |
| SSD MobileNet V2 320 × 320 | 0.09 | 56.50 | 47.80 | 56.20 | 52.80 | 56.60 | 55.40 |
| SSD MobileNet V2 FPNLite 320 × 320 | 0.07 | 54.70 | 55.60 | 50.70 | 57.40 | 53.40 | 58.40 |
| SSD MobileNet V2 FPNLite 320 × 320 | 0.08 | 53.80 | 39.20 | 54.30 | 60.60 | 58.20 | 58.40 |
| SSD MobileNet V2 FPNLite 320 × 320 | 0.09 | 53.90 | 55.80 | 53.20 | 55.80 | 54.70 | 60.40 |
| SSD MobileNet V2 FPNLite 640 × 640 | 0.07 | 65.60 | 75.30 | 71.00 | 71.30 | 68.30 | 76.50 |
| SSD MobileNet V2 FPNLite 640 × 640 | 0.08 | 64.30 | 58.40 | 70.20 | 74.20 | 70.10 | 66.20 |
| SSD MobileNet V2 FPNLite 640 × 640 | 0.09 | 71.00 | 75.10 | 72.20 | 75.00 | 71.20 | 73.40 |
| SSD MobileNet V1 FPN 640 × 640 | 0.07 | 71.00 | 70.60 | 74.80 | 75.10 | 76.70 | 73.60 |
| SSD MobileNet V1 FPN 640 × 640 | 0.08 | 73.10 | 73.20 | 67.60 | 72.40 | 69.50 | 74.80 |
| SSD MobileNet V1 FPN 640 × 640 | 0.09 | 70.20 | 75.20 | 70.10 | 76.70 | 73.50 | 77.30 |
4.3 Road Traffic Evaluation Setup
The models evaluated using the validation data were analyzed and compared in order to select the best version of each model. The mAP results on the primary challenge metric were used for the selection (Table 2). For each model, the version with the highest mAP score across dataset type, number of training steps, and learning rate base was chosen for the road traffic evaluation. The selected models are SSD MobileNet V2 320 × 320 trained on augmented data with 20k steps and a 0.07 learning rate base, SSD MobileNet V2 FPNLite 320 × 320 trained on augmented data with 20k steps and a 0.09 learning rate base, SSD MobileNet V2 FPNLite 640 × 640 trained on augmented data with 20k steps and a 0.07 learning rate base, and SSD MobileNet V1 FPN 640 × 640 trained on augmented data with 20k steps and a 0.09 learning rate base. Online road traffic videos were sourced from the video-sharing website YouTube. Three road traffic videos recorded in Metro Manila, Philippines were downloaded from the website and trimmed to fit into a single 30-min video. This video was then used as input to each chosen version of the models. Each frame of the video was processed
Table 6. Detection mAP (%) at Different IoU Values for Medium Traffic Signs Tested against Various Setups, Involving Training Steps (10,000, 15,000, and 20,000) and Learning Rate Base (0.07, 0.08, 0.09) with Original (O) or Imbalanced Dataset and Augmented (A) or Balanced Dataset on Four SSD MobileNet Models. Values are mAP@[0.5:0.95] (%) for Medium Traffic Signs (32² < area < 96²).

| Model | Learning Rate Base | 10,000 O | 10,000 A | 15,000 O | 15,000 A | 20,000 O | 20,000 A |
|---|---|---|---|---|---|---|---|
| SSD MobileNet V2 320 × 320 | 0.07 | 71.20 | 73.00 | 70.50 | 77.40 | 76.20 | 77.40 |
| SSD MobileNet V2 320 × 320 | 0.08 | 70.00 | 73.30 | 68.70 | 74.90 | 74.40 | 73.00 |
| SSD MobileNet V2 320 × 320 | 0.09 | 70.10 | 70.40 | 69.40 | 74.60 | 74.20 | 74.70 |
| SSD MobileNet V2 FPNLite 320 × 320 | 0.07 | 73.20 | 79.80 | 69.40 | 81.70 | 71.00 | 81.80 |
| SSD MobileNet V2 FPNLite 320 × 320 | 0.08 | 69.70 | 72.70 | 73.00 | 81.50 | 71.90 | 82.60 |
| SSD MobileNet V2 FPNLite 320 × 320 | 0.09 | 75.00 | 84.00 | 72.50 | 79.40 | 74.60 | 82.80 |
| SSD MobileNet V2 FPNLite 640 × 640 | 0.07 | 81.20 | 84.90 | 81.60 | 83.70 | 82.80 | 85.90 |
| SSD MobileNet V2 FPNLite 640 × 640 | 0.08 | 77.90 | 79.20 | 83.00 | 84.60 | 82.30 | 81.40 |
| SSD MobileNet V2 FPNLite 640 × 640 | 0.09 | 84.40 | 83.00 | 78.00 | 84.60 | 79.10 | 83.60 |
| SSD MobileNet V1 FPN 640 × 640 | 0.07 | 83.70 | 84.30 | 83.40 | 86.70 | 84.10 | 87.00 |
| SSD MobileNet V1 FPN 640 × 640 | 0.08 | 82.60 | 85.00 | 80.30 | 87.60 | 82.70 | 87.00 |
| SSD MobileNet V1 FPN 640 × 640 | 0.09 | 82.40 | 85.80 | 82.00 | 86.80 | 84.20 | 86.90 |
and analyzed by the saved models. The output is a video file in Audio Video Interleave (AVI) format, which shows the frames per second (FPS) achieved by the model while processing the video, the bounding boxes of the detected traffic signs, and the confidence level for the class of each detected traffic sign. The evaluation process was done in Jupyter Notebook, using several packages such as Pillow version 9.1.1, OpenCV version 4.5.5, NumPy version 1.21.6, Matplotlib version 3.5.2, and the TensorFlow 2 Object Detection API [14]. For the live road traffic footage evaluation, the data was captured using the camera of a mobile device, specifically a OnePlus Nord CE 2 5G. It has a MediaTek Dimensity 900 processor [26] and eight gigabytes of Random-Access Memory (RAM). The processor is an octa-core processor and is considered relatively capable since it was only released in 2021 [27]. The device has a 64-megapixel main camera sensor that can record up to 4K video at 30 FPS. It can also record 1080p and 720p video at 30 FPS and 60 FPS, respectively. The mobile device was mounted on the rearview mirror inside the vehicle using a phone holder. Figure 8 shows the actual setup of the mobile device inside the vehicle.
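Returning to the version selection described at the start of this subsection, the rule "pick the setup with the highest primary-challenge mAP" can be sketched for one model. The scores are copied from the SSD MobileNet V2 320 × 320 rows of Table 2. The dictionary layout and the function name are assumptions for illustration, and ties (which this model does not have) would need an explicit tie-breaking rule.

```python
# mAP@[IoU = 0.5:0.95] (%) for SSD MobileNet V2 320 x 320, copied from Table 2.
# Keys are (training steps, learning rate base, dataset); O = original, A = augmented.
TABLE2_V2_320 = {
    (10000, 0.07, "O"): 65.30, (10000, 0.07, "A"): 66.20,
    (10000, 0.08, "O"): 63.30, (10000, 0.08, "A"): 66.80,
    (10000, 0.09, "O"): 64.40, (10000, 0.09, "A"): 62.70,
    (15000, 0.07, "O"): 62.20, (15000, 0.07, "A"): 70.70,
    (15000, 0.08, "O"): 63.30, (15000, 0.08, "A"): 68.00,
    (15000, 0.09, "O"): 62.80, (15000, 0.09, "A"): 67.80,
    (20000, 0.07, "O"): 68.40, (20000, 0.07, "A"): 71.10,
    (20000, 0.08, "O"): 67.30, (20000, 0.08, "A"): 65.70,
    (20000, 0.09, "O"): 66.60, (20000, 0.09, "A"): 68.70,
}


def best_version(scores):
    """Return the (setup, score) pair with the highest mAP."""
    return max(scores.items(), key=lambda item: item[1])


setup, score = best_version(TABLE2_V2_320)
print(setup, score)
```

For this model the rule selects augmented data, 20k steps, and a learning rate base of 0.07, which matches the version chosen for the road traffic evaluation.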
Table 7. Detection mAP (%) at Different IoU Values for Large Traffic Signs Tested against Various Setups, Involving Training Steps (10,000, 15,000, and 20,000) and Learning Rate Base (0.07, 0.08, 0.09) with Original (O) or Imbalanced Dataset and Augmented (A) or Balanced Dataset on Four SSD MobileNet Models. Values are mAP@[0.5:0.95] (%) for Large Traffic Signs (area > 96²).

| Model | Learning Rate Base | 10,000 O | 10,000 A | 15,000 O | 15,000 A | 20,000 O | 20,000 A |
|---|---|---|---|---|---|---|---|
| SSD MobileNet V2 320 × 320 | 0.07 | 85.00 | 88.60 | 75.00 | 89.00 | 80.00 | 88.60 |
| SSD MobileNet V2 320 × 320 | 0.08 | 80.00 | 83.40 | 80.00 | 85.00 | 85.00 | 88.00 |
| SSD MobileNet V2 320 × 320 | 0.09 | 70.00 | 86.30 | 65.00 | 87.20 | 75.00 | 83.70 |
| SSD MobileNet V2 FPNLite 320 × 320 | 0.07 | 45.00 | 86.60 | 90.00 | 87.20 | 62.50 | 86.20 |
| SSD MobileNet V2 FPNLite 320 × 320 | 0.08 | 40.00 | 62.50 | 42.50 | 86.10 | 45.00 | 89.50 |
| SSD MobileNet V2 FPNLite 320 × 320 | 0.09 | 60.00 | 83.10 | 62.50 | 88.40 | 42.50 | 87.80 |
| SSD MobileNet V2 FPNLite 640 × 640 | 0.07 | 85.00 | 84.20 | 90.00 | 80.20 | 90.00 | 89.10 |
| SSD MobileNet V2 FPNLite 640 × 640 | 0.08 | 90.00 | 82.50 | 85.00 | 90.90 | 85.00 | 89.60 |
| SSD MobileNet V2 FPNLite 640 × 640 | 0.09 | 80.00 | 88.90 | 75.00 | 84.70 | 95.00 | 85.60 |
| SSD MobileNet V1 FPN 640 × 640 | 0.07 | 85.00 | 71.50 | 90.00 | 89.30 | 85.00 | 90.50 |
| SSD MobileNet V1 FPN 640 × 640 | 0.08 | 90.00 | 91.50 | 90.00 | 92.70 | 85.00 | 91.10 |
| SSD MobileNet V1 FPN 640 × 640 | 0.09 | 90.00 | 84.20 | 85.00 | 93.50 | 90.00 | 90.90 |
4.4 Online Road Traffic Video Test
The 30-min video edited for the testing contains a variety of traffic scenarios. It includes driving at high speed, at 40–60 kilometers per hour (KPH), on expressways and highways, driving at slow speed, at 10–30 KPH, when approaching an intersection or traffic, and remaining stationary when the vehicle stops at an intersection with traffic lights. All of the models were tested on the same video with the same setup. As Table 8 shows, SSD MobileNet V2 FPNLite 640 × 640 recorded the highest accuracy score, 50.7%, while producing the second-lowest average FPS in the output video. SSD MobileNet V2 FPNLite 320 × 320 recorded the lowest accuracy score of 33.3%, despite having an additional FPNLite architecture for object detection, and the second-highest average FPS. SSD MobileNet V2 320 × 320 produced more balanced results in terms of accuracy and speed: it recorded an accuracy score of 43.5% and 23 FPS, the highest FPS among the four models. The data showed that only the No Parking (NP) sign recorded zero correct detections despite having 16 actual instances. The reason behind this behavior lies in the distance of the NP signs
Fig. 8. Mobile Device Setup for Actual Road Testing

Table 8. SSD MobileNet Models with their Accuracy Score and Average FPS Results from the Online Video Test

| Model Name | Training Steps | Learning Rate Base | Accuracy Score (%) | Average FPS |
|---|---|---|---|---|
| SSD MobileNet V2 320 × 320 | 20,000 | 0.07 | 43.50 | 23.00 |
| SSD MobileNet V2 FPNLite 320 × 320 | 20,000 | 0.09 | 33.50 | 19.00 |
| SSD MobileNet V2 FPNLite 640 × 640 | 20,000 | 0.07 | 50.70 | 10.00 |
| SSD MobileNet V1 FPN 640 × 640 | 20,000 | 0.09 | 46.20 | 4.00 |
in the video test. Most of the NP signs contained in the video were small in size. This means that the model struggles to detect NP signs at distances of 6–10 meters. The SSD MobileNet V2 320 × 320 model performed poorly in detecting small traffic signs (Table 5), hence it misclassified most of its predictions in the online video test. Other contributing factors include poor lighting conditions and old road traffic signs.

4.5 Actual Road Test
The actual road test for the models running on the mobile application was done on the route shown in Fig. 9. The route stretches for about three kilometers, and the test covered a round trip along the route for a total distance of 6 km. The speed throughout the test was between 5 and 30 KPH. Not all traffic signs were considered in the actual road test, since the selected route contains neither 60 speed limit signs nor no right turn signs. In the test results, SSD MobileNet V1 FPN 640 × 640 recorded a low accuracy score of 3.9% and a slow average inference time of 2600 ms (Table 9). This behavior is expected, since the FPN architecture does additional processing on the input frames. In comparison, the models with the FPNLite architecture produced more respectable results for real-time object detection, with average inference times of 120 ms and 450 ms (Table 9). However, the FPNLite model with a 640 × 640 input image size performed poorly since it processes a larger image resolution. Lastly, the model that performed the best in terms of accuracy and speed is the SSD
Fig. 9. Route of the Actual Road Test on Google Maps

Table 9. SSD MobileNet Models with their Accuracy Score and Average Inference Time Results from the Actual Road Test

| Model Name | Training Steps | Learning Rate Base | Accuracy Score (%) | Average Inference Time (ms) |
|---|---|---|---|---|
| SSD MobileNet V2 320 × 320 | 20,000 | 0.07 | 78.50 | 50.00 |
| SSD MobileNet V2 FPNLite 320 × 320 | 20,000 | 0.09 | 42.10 | 120.00 |
| SSD MobileNet V2 FPNLite 640 × 640 | 20,000 | 0.07 | 9.60 | 450.00 |
| SSD MobileNet V1 FPN 640 × 640 | 20,000 | 0.09 | 3.90 | 2,600.00 |
MobileNet V2 320 × 320. It achieved an accuracy score of 78.5% and an average inference time of 50 ms. Sixty Speed Limit (SSL) and No Right Turn (NRT) signs are not present on the route shown in Fig. 9. SSL signs are mostly located on expressways, while NRT signs are scarce along the actual road route.
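To relate the inference times in Table 9 to a frame rate, the average latency can be converted to an upper bound on frames per second. This is a rough sketch that assumes frames are processed one at a time and ignores camera capture, pre-processing, and rendering overhead, so real throughput on the device would be lower. The function name is hypothetical.

```python
# Average inference time (ms) per model, copied from Table 9.
INFERENCE_MS = {
    "SSD MobileNet V2 320 x 320": 50.0,
    "SSD MobileNet V2 FPNLite 320 x 320": 120.0,
    "SSD MobileNet V2 FPNLite 640 x 640": 450.0,
    "SSD MobileNet V1 FPN 640 x 640": 2600.0,
}


def frames_per_second(inference_ms):
    """Upper bound on throughput when each frame takes inference_ms milliseconds."""
    return 1000.0 / inference_ms


for model, latency in INFERENCE_MS.items():
    print(f"{model}: about {frames_per_second(latency):.2f} FPS")
```

Under these assumptions, 50 ms per frame corresponds to at most 20 FPS, whereas 2600 ms per frame corresponds to well under one frame per second.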
5 Conclusion and Future Work
This study aimed to implement traffic sign detection in an Android mobile application by collecting image data from the Google Street View database and performing data augmentation, retraining four pre-trained SSD MobileNet models from the TensorFlow 2 Detection Model Zoo and converting them to TensorFlow Lite models for mobile inference, providing voice alerts for the detected traffic signs, and measuring the performance and accuracy of the models in different setups. The results of the validation testing showed that models trained with augmented and balanced data performed better compared to models
trained with original and imbalanced data. They also showed that a higher number of training steps improved the accuracy of the object detection models. The models with additional architectures such as FPN and FPNLite showed high accuracy scores in the validation test and the online video test. However, in the online video test they performed poorly in terms of average FPS, which makes them poor candidates for real-time traffic sign detection. The actual road test backed up this claim, as the FPN and FPNLite models struggled in terms of both accuracy and speed. In contrast, SSD MobileNet V2 320 × 320 performed slightly lower than the other models in the validation tests since it only uses the SSD and MobileNet architectures. However, the online video test showed that the model could keep up with the accuracy scores of the other models while producing the best average FPS among the four models. The result of the actual road test of the SSD MobileNet V2 320 × 320 model further supports the claim that it has the best trade-off between accuracy and speed. The model achieved a 78.5% accuracy score and the lowest average inference time, 50 ms, of all the models compared. A mobile application for real-time traffic sign detection with voice alerts for detected traffic signs was developed using Android Studio 4.2.2, Java 8, and the TensorFlow 2 Object Detection API. The TensorFlow 2 models were converted to TensorFlow 2 Lite models to handle mobile inference in a real-time environment. Overall, SSD MobileNet V2 320 × 320 is the best choice among the four models for real-time traffic sign detection. The accuracy of the model can still be improved by training it on a larger and more diverse image dataset than the one used in this study. In terms of speed, mobile phones with a faster processor and a better camera module can improve the traffic sign detection inference time of the model.
References

1. World Health Organization: Road traffic injuries (2021). https://www.who.int/news-room/fact-sheets/detail/road-traffic-injuries
2. Gopalakrishnan, S.: A public health perspective of road traffic accidents. Family Med. Primary Care 1(2), 144–150 (2012). https://doi.org/10.4103/2249-4863.104987
3. Nantulya, V.M., Reich, M.R.: The neglected epidemic: road traffic injuries in developing countries. BMJ (Clin. Res. Ed.) 324(7346), 1139–1141 (2002). https://doi.org/10.1136/bmj.324.7346.1139
4. Metropolitan Manila Development Authority: Metro Manila accident reporting and analysis system. Technical report, Metropolitan Manila Development Authority, Metro Manila, Quezon City (2020)
5. World Health Organization: New WHO report highlights progress, but cites need for more actions to tackle road safety in the Philippines (2021). https://www.who.int/news-room/fact-sheets/detail/road-traffic-injuries
6. Aguilar, M.: Road signs: your key to responsible driving, May 2015. https://www.autoindustriya.com/features/road-signs-your-key-to-responsible-driving.html
7. Manila Bulletin: LTO to implement stricter policy on issuance of driver’s license (2020). https://mb.com.ph/2020/02/24/lto-to-implement-stricter-policyon-issuance-of-drivers-license/
8. Chan, J., Gonzalez, P., Perez, E.: Designing traffic signs: a case study on driver reading patterns and behavior. In: 16th Philippine Computing Science Congress, Puerto Princesa, Palawan, Philippines, March 2016, pp. 1–4 (2016). https://doi.org/10.1016/j.ergon.2018.01.011
9. Fan, Y., Zhang, W.: Traffic sign detection and classification for advanced driver assistant systems. In: 2015 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pp. 1335–1339 (2015). https://doi.org/10.1109/FSKD.2015.7382137
10. Vanneschi, L., Castelli, M.: Multilayer perceptrons. In: Ranganathan, S., Gribskov, M., Nakai, K., Schönbach, C. (eds.) Encyclopedia of Bioinformatics and Computational Biology, vol. 1, pp. 612–620. Academic Press, Oxford (2019). https://www.sciencedirect.com/science/article/pii/B9780128096338203397
11. Albawi, S., Mohammed, T., Al-Zawi, S.: Understanding of a convolutional neural network. In: International Conference on Engineering and Technology (ICET) 2017, pp. 1–6 (2017). https://doi.org/10.1109/ICEngTechnol.2017.8308186
12. Santos, A., Abu, P.A., Oppus, C., Reyes, R.: Real-time traffic sign detection and recognition system for assistive driving. Adv. Sci. Technol. Eng. Syst. J. 5(4), 600–611 (2020). https://doi.org/10.25046/aj050471
13. Google Brain: Tensorflow lite object detection android demo app. https://github.com/tensorflow/examples/tree/master/lite/examples/object
14. Huang, J., et al.: Speed/accuracy trade-offs for modern convolutional object detectors. CoRR, abs/1611.10012 (2016). http://arxiv.org/abs/1611.10012
15. Trivedi, S.: Object detection - a quick read (2020). https://medium.com/visionwizard/object-detection-4bf3edadf07f
16. Lin, T.-Y., et al.: Microsoft COCO: Common Objects in Context (2014). https://arxiv.org/abs/1405.0312
17. Bloice, M.D., Roth, P.M., Holzinger, A.: Biomedical image augmentation using Augmentor. Bioinformatics 35(21), 4522–4524 (2019). https://doi.org/10.1093/bioinformatics/btz259
18. Tzutalin: Labelimg (2015). https://github.com/tzutalin/labelImg
19. Howard, A.G., et al.: Mobilenets: efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861 (2017). http://arxiv.org/abs/1704.04861
20. TensorFlow: Tensorflow 2 detection model zoo (2022). https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md
21. Google: Google Colaboratory (2022). https://github.com/googlecolab
22. Google and JetBrains: Android Studio (2022). https://developer.android.com/studio
23. Koech, K.E.: Object Detection Metrics with worked example (2020). https://towardsdatascience.com/on-object-detection-metrics-with-worked-example216f173ed31e
24. Buckland, M., Gey, F.: The relationship between recall and precision. J. Am. Soc. Inf. Sci. 45(1), 12–19 (1994). https://asistdl.onlinelibrary.wiley.com/doi/abs/10.1002/%28SICI%291097-4571%28199401%2945%3A1%3C12%3A%3AAID-ASI2%3E3.0.CO%3B2-L
25. Everingham, M., Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (VOC) challenge. Int. J. Comput. Vision 88(2), 303–338 (2010). https://doi.org/10.1007/s11263-009-0275-4
26. Mediatek: Mediatek dimensity 900 (2022). https://www.mediatek.com/products/smartphones-2/mediatek-dimensity-900
27. Hinum, K.: Mediatek dimensity 900 (2021). https://www.notebookcheck.net/MediaTek-Dimensity-900-Processor-Benchmarks-and-Specs.566227.0.html
Challenges of the Creation of a Dataset for Vision Based Human Hand Action Recognition in Industrial Assembly

Fabian Sturm1(B), Elke Hergenroether2, Julian Reinhardt1, Petar Smilevski Vojnovikj1, and Melanie Siegel2

1 Bosch Rexroth AG, Lise-Meitner-Strasse 4, 89081 Ulm, Germany
[email protected]
2 University of Applied Sciences Darmstadt, Schoefferstraße 3, 64295 Darmstadt, Germany

Abstract. This work presents the Industrial Hand Action Dataset V1, an industrial assembly dataset consisting of 12 classes with 459,180 images in the basic version and 2,295,900 images after spatial augmentation. Compared to other freely available datasets tested, it has an above-average duration and, in addition, meets the technical and legal requirements for industrial assembly lines. Furthermore, the dataset contains occlusions, hand-object interaction, and various fine-grained human hand actions for industrial assembly tasks that were not found in combination in the examined datasets. The recorded ground truth assembly classes were selected after extensive observation of real-world use cases. A Gated Transformer Network, a state-of-the-art model from the transformer domain, was adapted and, with 18,269,959 trainable parameters, reached a test accuracy of 86.25% before hyperparameter tuning, showing that it is possible to train sequential deep learning models with this dataset.

Keywords: Human Action Recognition · Assembly Lines · Dataset · Manufacturing · Assistance Systems · Transformers

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): SAI 2023, LNNS 711, pp. 1079–1098, 2023. https://doi.org/10.1007/978-3-031-37717-4_70

1 Introduction

The full automation of production processes in industrial assembly lines has shown that humans cannot be completely replaced by machines. This is mainly for monetary reasons, such as high fixed maintenance costs, and because some work tasks are too complex for robotics and other machines and can currently only be solved economically by humans. The disadvantage for humans, resulting from the increasing variance of products and the desired flexibility of the worker in assembly, is their susceptibility to errors, which increases with the workload. This is caused primarily by decreasing concentration during long shifts, inattention, or distraction. Increasing demands also mean that assembly workers need more and more knowledge to assemble products correctly and therefore have less time for training and familiarization, resulting in a lack of experience. These initially undetected assembly errors then continue through the production process until they are noticed only in quality control or even when the product is in operation at the customer’s site, with unpleasant consequences for production, such as an increased reject rate or damage to the company’s image due to the low quality of the products. To counteract these weaknesses of manual assembly, ensure high product quality, and minimize the error rate, an intelligent assistance system is urgently needed that recognizes the actions of the assembler by checking the assembly steps and provides direct feedback so that the assembler can immediately correct a mistake. Such a smart assistance system can be provided by deep learning approaches via the visual recognition of human actions. However, deep learning requires examples to train the network architectures appropriate for the task, so that they can find patterns and correlations in the data for a final classification. Since industrial assembly is performed by hand, the datasets studied in this work are mostly based on the movement of human hands and the interaction between tools and objects, recorded in the red, green, blue color model (RGB) or in the red, green, blue color model with a corresponding depth model (RGB-D). Based on the weaknesses of the presented datasets for industrial usability, a specially created dataset is presented that meets the industrial requirements, and it is shown that this dataset can be used for training a deep learning model. The requirements placed on the dataset in this context are to recreate an industrial scenario in great detail while taking into account legal data protection regulations. Therefore, the focus is exclusively on hands. Furthermore, to be considered as ground truth for the training of a deep learning model, the dataset should contain occlusions of the hands, e.g., by bigger parts, clear fine-grained actions, and object interactions, and it should contain no non-industrial disturbances from the environment, such as reaching for a smartphone or similar objects that are not part of the assembly process. Additionally, since the focus is on human hands and in order to avoid feature extraction problems caused by the environment, the video sequences are preprocessed for model training into features extracted by a hand pose estimation approach, so that the network focuses only on the behavior of a skeleton of the human hands.

The remaining work is structured as follows. In Sect. 2, open-source datasets for Human Action Recognition (HAR) and gesture control that are also part of a scientific publication are examined. Section 3 deals with the analysis of these datasets and explains the usability aspects for the desired industrial application. In Sect. 4 the recording setup is described, before the created “Industrial Hand Action Dataset V1” is introduced in Sect. 5. Section 6 explains the network architecture used to test trainability on the dataset, followed by a description of the training environment. In Sect. 7 the results of the creation of the dataset are examined, followed by planned future tasks and a final conclusion in Sect. 8.
Dataset for Human Hand Action Recognition
2 Related Datasets
Table 1. Human Action Recognition Dataset Comparison
Dataset | Classes | Activity | Environment | GDPR Constraints | View
IPN Hand [3] | 14 | Human Computer Interaction | Open Space | Yes | Third Person
EgoHands+ [1] | 16 | Playing Games | Office/Courtyard/Family Room | Yes | First Person
EYTH (EgoYouTubeHands) [10] | X | Hand Object Interaction | Daily Life | Yes | First/Third Person
Ikea Assembly Dataset [2] | 4 | Assembly | Office/Family Room | Yes | Third Person
20bn-something-something-v2 [6] | 174 | Hand Object Interaction | Daily Life | No | First Person
Cambridge Hand Gesture Dataset [11] | 9 | Gestures | Clean Background | No | First Person
EGTEA Gaze+ [12] | 106 | Cooking | Kitchen | No | First Person
MPII Cooking 2 [20] | 59 | Cooking | Kitchen | Yes | First Person
YouCook 2 [27] | 89 | Cooking | Kitchen | Yes | Third Person
COIN [24] | 180 | Hand Object Interaction | Daily Life | Yes | First/Third Person
Drive&Act [17] | 83 | Driving | Driver Cabin | Yes | Third Person
WorkingHands [22] | 13 | Assembly | Workbench | No | First Person
ChaLearn Iso/ConGD [7] | 249 | Gestures | Open Space | Yes | Third Person
LTTM Senz3D [15,16] | 11 | Gestures | Open Space | Yes | Third Person
Multi-Modal Hand Activity Video Dataset [21] | 15 | Hand Object Interaction | Workbench | No | First Person
EgoGesture Dataset [26] | 83 | Gestures | Indoor/Outdoor | No | First Person
Gun71 [19] | 71 | Grasping | Family Room | No | First Person
Human actions often appear in data recorded by visual sensors, more specifically RGB or RGB-D cameras, in the form of image or video data [14]. One reason for this visual type of data acquisition is that such data are rich in features that can be used to remove ambiguities after preprocessing. As the existing literature shows, extensive research has been done in the last decade, especially in the field of video classification, as the important works for HAR in [4,5,9] prove. A classical approach to HAR is to extract features from individual 2D frames of RGB videos and use these features to train models to classify human actions. For this purpose, datasets already exist that provide models with the necessary examples to recognize specific human actions. The examined datasets contain human assembly tasks [2,22], hand gestures for human computer interaction [3], cooking tasks [23,27], which resemble Human-Object Interaction (HOI) in the preparation of the ingredients, and HOI in tasks like playing games [1]. Furthermore, it is important to mention that these publicly available datasets were created under specific environmental conditions, such as in free environments [10,24], in closed rooms [2,19], or under laboratory conditions [11,21]. In the following, datasets are presented that are especially suited for industrial and real-world use cases. They are divided into the domains of assembly tasks (Sect. 2.1), gestures (Sect. 2.2), hand-object interactions (Sect. 2.3), cooking (Sect. 2.4), and daily tasks (Sect. 2.5). All of the following information is summarized in Table 1 and given in more detail, with the technical specifications, in Table 5 in the Appendix. Although the assignment of some datasets to a domain is not entirely clear-cut, an attempt has been made to separate them as far as possible. The goal of the following subsections is to gain a better understanding of fine-grained human actions, with a focus on industrial environments.
2.1 Assembly Datasets
The Ikea Assembly Dataset [2] provides a wide variety of assembly types and time scales. In this dataset, furniture is assembled from the same type of components in several ways. It consists of natural as well as unusual human poses that are visually very similar and contain fine-grained HOIs. Data were collected from three different camera views to capture the complete human body, objects, and self-occlusions with multiple sensor modalities, including color, depth, and surface normals. Comparable to the Ikea Assembly Dataset is the WorkingHands dataset [22], which was built for manufacturing tasks. In contrast to the Ikea Assembly Dataset, it focuses only on the hands and the interaction between tools and parts. To obtain a wide range of examples, the WorkingHands dataset mixes real and synthetic interaction data: not only the tools but also the hands were created synthetically, with the goal of segmenting hands and tools during interaction. Similar manufacturing HOI is provided by the Multi-Modal Hand Activity Video Dataset [21], in which thermal and RGB-D cameras are used for pixel-wise segmentation of hands from objects. Besides datasets that were created only for assembly or manufacturing tasks, human gesture recognition datasets can also be taken into account for recognizing human actions, especially if the hands are the only recorded body part.
2.2 Gestures Datasets
An early approach is the ChaLearn Iso/ConGD dataset [7], consisting of 54,000 hand and arm gestures recorded with an RGB-D camera for 249 classes in 22,535 manually labeled videos from a third-person view. The videos are organized into batches of 100 gestures belonging to a small gesture vocabulary of 8 to 12 gestures recorded by the same user. In addition, there is a subset of augmented batches in which the horizontal position of the user is randomly shifted or scaled. A similarly recorded dataset is LTTM Senz3D [15,16], in which 4 different people each performed 11 different gestures, repeated 30 times, for a total of 1,320 samples. A more recent approach, and one of the leading datasets for human gesture recognition, is the IPN Hand dataset [3], which was made for human computer interaction via touchless screens. It was recorded from a third-person view in open spaces and contains 4,000 clips of RGB data, a large number compared to the other datasets inspected in this work. The dataset offers static as well as dynamic gestures with considerable variation, cluttered backgrounds, strong and weak illumination conditions, and static and dynamic background environments. However, gesture recognition datasets do not focus only on clean hand gestures, as is the case with the Cambridge Hand Gesture Dataset [11]; they may also include gestures from scenes with HOI.
2.3 Human Object Interaction Datasets
The EgoGesture dataset [26] consists of 83 different static and dynamic egocentric gestures, focused on interaction with wearable devices. Over 24,000 video sequences were recorded in six different indoor and outdoor scenes with different backgrounds and lighting, so the data is highly diverse. The goal was to segment gestures in continuous data and thus to be able to evaluate different approaches for gesture detection and classification in numerous ways. An instructional video dataset for daily life tasks comparable to assembly instructions is the COIN dataset [24]. Its creators had a goal similar to that of the EgoGesture creators, namely to build a collection of various real-world activities, but from YouTube videos. The dataset consists of 11,827 videos related to 180 instructional tasks from several domains, and each step is connected to a label. In contrast to other instructional video datasets, it is organized in a three-level semantic structure [24]. Gestures and assembly tasks are obvious choices for describing human behavior in certain scenes, but the interaction between humans, hands, and hand-held objects also has to be considered. The EgoHands dataset [1] shows individuals from an egocentric view in 48 different videos of dynamic HOI while playing games, with pixel-level ground-truth annotations for 4,800 frames and ground-truth segmentation masks for over 15,000 hands. Besides HOI, the dataset includes more realistic and challenging social situations in which, unlike in the other datasets, multiple sets of hands appear in the view. In contrast to the previous datasets, the 20bn-something-something-v2 dataset [6] describes each action with a natural language template instead of a fixed data structure for the labels, which makes the description of human actions more fine-grained [6]. With 108,499 video clips and 174 classes, it is one of the largest datasets currently available. The recordings of crowd workers are mostly based on HOI. However, not only tasks like playing games can serve as examples of human actions. Human hand movements during cooking, which are very similar to movements in industrial assembly lines, have also been recorded.
2.4 Cooking Datasets
A good example is the large EGTEA Gaze+ dataset, an extended version of the former GTEA dataset [12]. It was created from egocentric views with wearable cameras during cooking tasks. As a result, very specific, fine-grained human actions were used for segmentation and classification. Similar to this dataset with fine-grained cooking tasks are MPII Cooking 2 [20], which was recorded live from a front view, and YouCook 2 [27], a collection of cooking videos from several views taken from YouTube, with the goal of differentiating between fine-grained body motions and the categories of the tasks.
2.5 Daily Tasks Datasets
The EgoYouTubeHands (EYTH) dataset [10] is a collection of daily task videos from YouTube, recorded from an egocentric view. The goal was to create a “hands-in-the-wild” dataset for detection and segmentation. According to the authors, existing hand action datasets are limited by their laboratory settings and therefore cannot be used in open-world scenarios. The goal of this dataset is to detect any hand, especially in first-person videos recorded in unconstrained daily settings. Specialized on grasps is the Grasp Understanding dataset, also known as Gun71 [19], with 12,000 RGB-D images of hand-object manipulations in typical household scenes, recorded with a chest-mounted RGB-D camera from 5 to 6 views each. A different kind of use case for HAR was pursued with the Drive&Act dataset [17], whose goal is to recognize fine-grained human behavior inside the vehicle cabin. In contrast to the other datasets, Drive&Act was recorded from 6 different angles with RGB-D and infrared (RGB-D IR) information, aiming to capture the location, the objects the passenger interacts with, and two types of labels for the actual action. With more than 9.6 million images, this is the largest dataset presented in this paper.
3 Weaknesses of Existing Datasets for Industrial Assembly Lines and Real World Applications
The computer vision approach brings some problems that are only partly covered by the existing HAR datasets from Sect. 2. These concern technical aspects as well as aspects specific to industrial environments.
3.1 Technical Weaknesses
Occlusions. A technical aspect that mainly affects recognition and the preceding training is the occlusion of the hands and work steps by objects, more precisely by components that obscure significant recognition features of the hand. Since this can only happen during HOI, it is not covered by IPN Hand [3], the Cambridge Hand Gesture Dataset [11], ChaLearn Iso/ConGD [7], LTTM Senz3D [15,16], and the EgoGesture Dataset [26].
Depth Data. The small distance between the work surface and the worker's hands, which makes hand detection difficult, must also be taken into account and is not fully captured in the datasets studied. Many HAR approaches use depth data, which should provide the necessary spatial information about human behavior as input to a deep learning network; such data are included in the Ikea Assembly Dataset [2], the WorkingHands Dataset [22], ChaLearn Iso/ConGD [7], LTTM Senz3D [15,16], and Gun71 [19]. However, the data were mostly recorded in open spaces, or in front of a green screen as in the Cambridge Hand Gesture Dataset [11], rather than in industrial or unsupervised conditions with a lot of noise in the foreground and background [3,7,15,16], which is a weakness when scaling the trained network. Such clutter in the environment, caused by tools or parts that need to be assembled, can negatively affect the depth data. Also, the short distance between the depth camera, the worker, and the work surface can result in very noisy images [8].
Bias. Furthermore, the classes and tasks of the test person are sometimes unclear, as in the EYTH dataset [10], or subject to considerable environmental bias, as in the YouCook 2 dataset [27]. This bias can also be caused by more than one person in the camera view, as in the EgoHands+ dataset, where several people appear [1].
3.2 Legal Requirements
In addition to the technical requirements that need to be considered in order to create a dataset suitable for industrial use, the non-technical aspects, more precisely the legal requirements, also need to be taken into account. In countries of the European Union that have agreed to the General Data Protection Regulation (GDPR), or in Germany to the “DSG-neu” [18], it is necessary to observe the data protection regulations and the personal rights of employees. In countries with works councils, the council also strictly controls what happens to the data when employees are filmed on company premises. The reason is that in most cases the recording and tracking of personal data is not permitted under the above-mentioned data protection laws, or only under very strict conditions. This includes the prohibition of recording personal data such as the badge with its badge number, the name of the employee on a name tag, the face, or other information that allows the unique identification of the employee. Furthermore, in countries like Germany, it is not permitted that the work task performed can be traced back to the respective employee [18]. Constraints under these data protection laws apply to several datasets in which the face or the complete body of the subjects is visible, as in the IPN Hand dataset [3], EgoHands+ [1], and EYTH [10], and to datasets in which behavior in the environment beyond the actual task is also recorded, as in the Ikea Assembly dataset [2] or the Drive&Act dataset [17].
3.3 Conclusion of the Weak Points
Since none of the freely available existing datasets fully meets the general working conditions and privacy laws from Sect. 3.1 and Sect. 3.2, namely IPN Hand [3], EgoHands+ [1], EYTH [10], Ikea Assembly Dataset [2], 20bn-something-something-v2 [6], Cambridge Hand Gesture Dataset [11], EGTEA Gaze+ [12], MPII Cooking 2 [20], YouCook 2 [27], COIN [24], Drive&Act [17], WorkingHands [22], ChaLearn Iso/ConGD [7], LTTM Senz3D [15,16], Multi-Modal Hand Activity Video Dataset [21], EgoGesture Dataset [26], and Gun71 [19], it is urgently necessary to create a specific dataset for human hand action recognition in industrial assembly lines. This dataset must meet all the technical aspects mentioned, such as fine-grained actions, HOI, occlusion by parts, no depth data, and no distortions due to environmental disturbances or other influences that could unnaturally affect the respective action sequence, as well as legal requirements such as the absence of GDPR-critical data. It should be noted that many more datasets exist than those mentioned so far, but the listed datasets were judged the most relevant for comparison with industrial assembly use cases.
4 Assembly Recording Setup
In order to obtain an industrial ground truth dataset, the first step is to create a controllable environment specifically for labeling assembly sequences. At the same time, the environment must be as close as possible to the real-world use case to prove scalability in a real scenario. Therefore, a PiBoy DMG¹ (see Fig. 2) is assembled as an example product, whose assembly procedure is very similar to the real-world scenario under consideration. In this use case, the product consists of eight different parts (see Table 2), which are stored in labeled bins on the worktop and need to be assembled in a special mount below the bins on the table (see Fig. 3). All parts are assembled by hand, and the assembly process is complete as soon as the assembled product is removed from the camera's field of view. The camera is mounted 0.90 m above the worktop with a frontal view onto it. As mentioned in Sect. 3, it is mandatory for the recording that the camera captures no personal information of the assembler. Therefore, the view of the camera is restricted to the worktop, more precisely to the hands and wrists. The recording was made with a UI-326xCP-C camera² from IDS at a resolution of 1936×1216 in RGB format at 24 fps, and each frame is stored as a .png file. The labeling and cutting of the different actions is performed by the assembler with the help of QR codes, created and read with OpenCV, which are shown to the camera on a second screen (see Fig. 1). The QR codes contain the information about the particular action being performed. Each work task is started by clicking on the instruction screen, located slightly to the right of the worker's field of view (see Fig. 1), and during the recording each frame is saved in an ascending folder structure associated with the label. When the task is finished, the assembler stops the saving of the relevant sequence to the respective class by clicking on the instruction screen again. Between work tasks, the information about the next work step on the instruction screen is also labeled with a QR code so that such sequences can be omitted from the final dataset.
Table 2. Part Labels
Part Label | Product Part
Part1 | Front Housing
Part2 | Raspberry Pi
Part3 | HDMI Connector
Part4 | Cooling Fan
Part5 | Back Housing
Part6 | Screws
Part7 | Battery
Part8 | Battery Cover
Fig. 1. Recording Test Bench Front
Fig. 2. Example Product PiBoyDMG
Fig. 3. Recording Test Bench Mount
¹ https://experimentalpi.com/PiBoyDMGKit p 18.html
² https://en.ids-imaging.com/store/ui-3260cp-rev-2.html
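The click-to-start, click-to-stop labeling of this section can be sketched as a small state machine. The following is a minimal sketch under stated assumptions, not the authors' recording software: the class name SequenceRecorder, the folder layout <label>/seq_0001/frame_000001.png, and the label strings are hypothetical, and the decoding of the QR code (done with OpenCV in the paper) is replaced by passing the decoded label in directly.

```python
from pathlib import Path
from typing import Optional


class SequenceRecorder:
    """Routes frames to per-label folders between a start and a stop click.

    Hypothetical illustration: names and folder layout are assumptions.
    """

    def __init__(self, root: Path) -> None:
        self.root = root
        self.active_label: Optional[str] = None
        self.sequence_dir: Optional[Path] = None
        self.frame_index = 0

    def click(self, decoded_label: str) -> None:
        """Toggle recording: the first click starts a sequence, the next stops it."""
        if self.active_label is None:
            label_dir = self.root / decoded_label
            label_dir.mkdir(parents=True, exist_ok=True)
            # Ascending sequence number: one more than the existing sequences.
            next_id = len([p for p in label_dir.iterdir() if p.is_dir()]) + 1
            self.sequence_dir = label_dir / f"seq_{next_id:04d}"
            self.sequence_dir.mkdir()
            self.active_label = decoded_label
            self.frame_index = 0
        else:
            self.active_label = None
            self.sequence_dir = None

    def on_frame(self, png_bytes: bytes) -> Optional[Path]:
        """Save one encoded frame if a task is being recorded, else drop it."""
        if self.sequence_dir is None:
            return None  # frames between tasks are omitted from the dataset
        self.frame_index += 1
        path = self.sequence_dir / f"frame_{self.frame_index:06d}.png"
        path.write_bytes(png_bytes)
        return path
```

In such a design, every frame that arrives between two clicks ends up in one numbered sequence folder under its class label, while frames that show only the instruction screen between tasks are never written to disk.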
5 Industrial Hand Action Dataset V1
This first version of the dataset consists of 12 classes, more precisely actions (see Table 3). Ten actions must be performed by the assembler by hand, and two actions must be performed with a screwdriver, which is stored to the right of the mounting point (see Fig. 3). The tasks were selected according to how frequently they are performed in a real-world assembly scenario, after detailed observation of a real-world assembly line. It was found that such real-world actions mainly consist of gripping, which is required in nearly all classes to fetch the respective part, placing (Assembly Step1,2,4,5,6,7,8,10,11,12), plugging (Assembly Step3,5,9), and screwing (Assembly Step7,8) operations. A total of 459,180 images were stored for further use. The frames are unevenly distributed across the classes, depending on the duration of each task, between approximately 5 and 15 s, as performed by each worker on the test bench. The positions of the bins holding the components for the demo assembly process are changed periodically between products to increase the variance of the movements. As usual in industrial environments, the hands of the worker are also partly occluded by bigger parts, e.g., Part1, Part4, and Part5. An example can be seen in Fig. 5.
For better scalability and higher variance in the dataset, the frames were additionally spatially augmented before the extraction of the keypoints, to introduce more variation in handedness and direction of movement. Each sequence was augmented five times, for the full duration of the task, by random spatial augmentations: vertical flipping, which increases the share of left- and right-handed examples (see Fig. 6a), horizontal flipping (Fig. 6b), and random rotation by 90° (Fig. 6c), each applied with a probability of 50%. This yields 2,295,900 frames for the final model training; see Fig. 11 in the Appendix for the final class distribution. If a task in a sequence is shorter than the full duration, the remaining frames of the sequence were zero-padded for later training on the skeleton information of the hands.
Table 3. Classes and Corresponding Assembly Actions
Class | Assembly Action
Assembly Step1 | Place Housing (Part1) in Mounting Bracket
Assembly Step2 | Insert Raspberry Pi (Part2) in Middle of Housing (Position1)
Assembly Step3 | Insert HDMI Connector (Part3) in Housing Connectors on the Right (Position2)
Assembly Step4 | Place Fan (Part4) on top (Position3) of Raspberry Pi (Part2)
Assembly Step5 | Connect Cable with Fan
Assembly Step6 | Close Housing (Part5) (Position4)
Assembly Step7 | Screw (Part6) in Left Center Housing Screw Hole (Position5)
Assembly Step8 | Screw (Part6) in Right Center Housing Screw Hole (Position6)
Assembly Step9 | Plug in Battery Cable
Assembly Step10 | Place Battery (Part7) into Battery Compartment (Position7)
Assembly Step11 | Place Battery Cover (Part8) onto Battery (Position8)
Assembly Step12 | Place Finished Product within the Marked Area to the Left
Fig. 4. Basic Class Distribution
Fig. 5. Occlusion
Fig. 6. Spatial Augmentations
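The spatial augmentation described in this section can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' pipeline: frames are represented as nested lists so that the example needs only the standard library, one random decision per transform is drawn for each augmented copy and applied to all of its frames, and all function names are hypothetical. The last line reproduces the frame count stated in the text, 459,180 × 5 = 2,295,900.

```python
import random
from typing import List

Frame = List[List[int]]  # a frame as rows of pixel values (assumption for illustration)


def flip_vertical(frame: Frame) -> Frame:
    """Reverse the row order (top row becomes bottom row)."""
    return frame[::-1]


def flip_horizontal(frame: Frame) -> Frame:
    """Reverse every row (left column becomes right column)."""
    return [row[::-1] for row in frame]


def rotate_90(frame: Frame) -> Frame:
    """Rotate the frame by 90 degrees clockwise."""
    return [list(col) for col in zip(*frame[::-1])]


def augment_sequence(frames: List[Frame], copies: int = 5, p: float = 0.5,
                     seed: int = 0) -> List[List[Frame]]:
    """Return `copies` augmented versions of one sequence.

    Each transform is switched on with probability `p`, once per copy, so
    all frames of a copy share the same spatial transform.
    """
    rng = random.Random(seed)
    transforms = (flip_vertical, flip_horizontal, rotate_90)
    result = []
    for _ in range(copies):
        chosen = [t for t in transforms if rng.random() < p]
        copy = []
        for frame in frames:
            for transform in chosen:
                frame = transform(frame)
            copy.append(frame)
        result.append(copy)
    return result


sequence = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # two tiny 2x2 "frames"
augmented = augment_sequence(sequence)
total_frames = 459_180 * 5  # recorded frames times five augmented copies
```

Drawing the transform once per copy keeps the motion within a sequence consistent; drawing it per frame instead would break the temporal continuity that a sequence classifier relies on.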
6 Model Training
In order to confirm that the generated dataset enables the training of a deep learning model, the architecture as well as the training environment are explained in the following sections.
6.1 Model Training Architecture
Since this work focuses on fine-grained human hand actions, the network architecture is separated into two parts. The first part extracts the skeleton features of the hand that are relevant for human hand actions (see Fig. 7a). More specifically, a keypoint detector detects the human hands and estimates the hand pose; in this case, Google's MediaPipe Hands solution is used [25]. The output is a concatenation per hand, in this case two hands, with 21 keypoints each (see Fig. 7b). Since the keypoint information is provided as 3D world coordinates x, y, and z, this yields an array of 2 × 21 × 3 = 126 data points per frame. This array is the input to the second part of the architecture, which classifies the sequential correlation of the previously extracted keypoints with an adapted Gated Transformer Network [13]. This architecture is based on a two-tower Transformer in which the encoders of the two towers capture step-wise and channel-wise attention, respectively. To merge the encoded features of the two towers, a learnable weighted concatenation is used as a gate before the final fully connected layers. With this approach, the authors of [13] achieved state-of-the-art results on 13 multivariate time series classification tasks in the domain of Natural Language Processing, but also HAR [13]. Consequently, this work belongs not only to the area of HAR but also to the area of multivariate time series classification.
Fig. 7. Hand Detector
6.2 Model Training Environment
The model was created in PyTorch and stacked on top of Google's MediaPipe Hands framework. Training and hyperparameter tuning were done in Microsoft Azure on a STANDARD NC6 instance with 6 vCPUs and 56 GiB of memory. The final model training was done on a GPU corresponding to half a K80 card with 12 GiB of memory, with a maximum of 24 data disks and 1 NIC, and took 1 h and 55 min.
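As a sketch of the feature layout described in Sect. 6.1, the code below flattens two hands with 21 keypoints and three coordinates each into one 126-value vector per frame and zero-pads a sequence to a fixed length, as described for shorter tasks in Sect. 5. It is an illustration under stated assumptions, not the authors' preprocessing: the function names, the left-then-right ordering of the hands, the use of zeros for an undetected hand, and the target length are all hypothetical, and the keypoints are passed in as plain tuples instead of being read from MediaPipe Hands.

```python
from typing import List, Optional, Sequence, Tuple

Keypoint = Tuple[float, float, float]  # (x, y, z)
NUM_HANDS = 2
KEYPOINTS_PER_HAND = 21
FEATURES_PER_FRAME = NUM_HANDS * KEYPOINTS_PER_HAND * 3  # 126


def frame_features(left: Optional[Sequence[Keypoint]],
                   right: Optional[Sequence[Keypoint]]) -> List[float]:
    """Concatenate both hands into one flat vector of 126 values.

    A hand that was not detected (None) is filled with zeros (assumption).
    """
    features: List[float] = []
    for hand in (left, right):
        if hand is None:
            features.extend([0.0] * (KEYPOINTS_PER_HAND * 3))
            continue
        if len(hand) != KEYPOINTS_PER_HAND:
            raise ValueError("expected 21 keypoints per hand")
        for x, y, z in hand:
            features.extend((x, y, z))
    return features


def pad_sequence(frames: List[List[float]], target_len: int) -> List[List[float]]:
    """Append all-zero frames until the sequence has `target_len` frames."""
    if len(frames) > target_len:
        raise ValueError("sequence is longer than target_len")
    padding = [[0.0] * FEATURES_PER_FRAME for _ in range(target_len - len(frames))]
    return frames + padding


hand = [(0.1, 0.2, 0.3)] * KEYPOINTS_PER_HAND
sequence = [frame_features(hand, None), frame_features(hand, hand)]
padded = pad_sequence(sequence, target_len=4)
```

A padded sequence of this form is a multivariate time series with 126 channels per time step, which is the kind of input a time series classifier such as the Gated Transformer Network operates on.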
7 Results and Examination of the Suitability
Table 4. Industrial Hand Action Dataset V1
Hand-Object Interaction | Classes | Frames | Resolution | FPS | Activity | Environment | Views | GDPR Constraints
Yes | 12 | 2,295,900 | 1936×1216 | 24 | Industrial Assembly | Industrial | 1 | No
The workbench in Fig. 1, created specifically for the dataset “Industrial Hand Action Dataset V1” (see Table 4) to simulate industrial assembly tasks, is ideally suited for recording the desired work steps and can easily be extended in the future with further modifications and recording devices. Due to the clear and unambiguous structure of the product to be assembled, it is possible to create a suitable ground truth of the respective tasks without disturbance from the background environment. The model trained on this dataset is prone to overfitting because of the size of the network, with 18,269,959 trainable parameters compared with 2,295,900 frames of 126 data points each; this will be investigated in more detail in later experiments. Nevertheless, the first training results were promising, with a test accuracy of 86.25% before hyperparameter tuning and a validation accuracy of 94.74%, as can be seen in Fig. 8. Occlusions in the recorded fine-grained motions, especially during the assembly of bigger parts, are particularly relevant to the recorded dataset and help make the trained model more robust to unforeseen events. This robustness was further supported by spatial augmentation. As already mentioned, the distribution of the 12 classes is based on the duration of the respective work steps. Fig. 4 shows that pick-and-place classes like Assembly Step1,2,10,12 were recorded for significantly shorter durations than, e.g., the assembly of the HDMI connector in Assembly Step3 or the screwing work in Assembly Step7 and Assembly Step8. Further, Fig. 9 shows that most of the tasks are clearly distinguishable from each other. There are overlaps between the screwing movements, which are performed at similar positions and with the same objects. An example from a final version of a recorded sequence is presented in Fig. 10. In a future version of this dataset, more industrial conditions such as gloves and more fine-grained assembly tasks will be included. For higher scalability, multiple camera views are also planned. Further, a broader diversity of assembly tasks with different skin tones and genders will be recorded. Additionally, results of the model training will be presented, and the dataset will be published to the scientific community.
Fig. 8. Train-Validation Curve
Fig. 9. Confusion Matrix
Fig. 10. Example Assembly Step 1
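The overfitting concern raised in this section rests on the ratio between model size and data volume, which can be checked with a few lines of arithmetic. The figures are taken from the text; the variable names and the choice to relate parameters to frames and to individual input values are assumptions made for illustration only, not an analysis performed by the authors.

```python
TRAINABLE_PARAMETERS = 18_269_959
FRAMES = 2_295_900
VALUES_PER_FRAME = 126  # 2 hands x 21 keypoints x 3 coordinates

input_values = FRAMES * VALUES_PER_FRAME
params_per_frame = TRAINABLE_PARAMETERS / FRAMES
values_per_param = input_values / TRAINABLE_PARAMETERS

print(f"input values:         {input_values:,}")
print(f"parameters per frame: {params_per_frame:.2f}")
print(f"values per parameter: {values_per_param:.2f}")
```

Since the frames of one task form a single training sequence and the augmented copies are derived from the same recordings, the number of independent training examples is smaller than the raw frame count suggests.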
8 Conclusion
The availability of a dataset that reflects the real world is an important criterion when investigating the scalability of a deep learning model. Scalability is a fundamental evaluation characteristic, especially in industrial applications. In this work, different image and video datasets similar to industrial assembly applications were investigated, and their technical weaknesses as well as legal complications with respect to privacy regulations were listed. Subsequently, a custom-built dataset corresponding to the presented aspects was introduced, consisting of 459,180 recorded frames, and 2,295,900 frames after augmentation, unevenly distributed over 12 typical industrial fine-grained human hand action classes during assembly. The experiments show that it is possible to train a sequential model with this dataset and achieve promising results. Before the dataset can be published, further experiments need to be performed, e.g., with more variance in sequence length and further variation in hyperparameters and model architectures, to classify sequences and to avoid overfitting. Since the dataset consists of various fine-grained human actions that occur daily in an industrial environment, it helps to test and apply deep learning research methodologies to real-world problems, especially in industrial assembly tasks.
Appendix
Fig. 11. Augmented Class Distribution
Table 5. Technical Comparison of Human Action Recognition Datasets
Dataset | Hand-Object Interaction | Modalities | Classes | Clips/Videos | Frames | Resolution | FPS | Gestures/Actions | Hand Pairs | Views
IPN Hand [3] | No | RGB | 14 | 4,000 | 800 | 640×480 | 30 | 13 | 1 | 1
EgoHands+ [1] | Yes | RGB | 16 | 48 | 130 | 1280×720 | 30 | 4 | 2 | 1
EYTH (EgoYouTubeHands) [10] | Yes | RGB | X | X | 1,290 | X | X | X | X | 1
Ikea Assembly Dataset [2] | Yes | RGB-D | 4 | 371 | 3,046,977 | X | 24 | 33 | 1 | 3
20bn-something-something-v2 [6] | Yes | RGB | 174 | 108,499 | X | X | X | 174 | 1 | 1
Cambridge Hand Gesture Dataset [11] | No | RGB | 9 | 900 | X | X | X | 9 | 0.5 | 1
EGTEA Gaze+ [12] | Yes | RGB | 106 | 86 | 14,000 | 1280×960 | 24 | 32 | 1 | 1
MPII Cooking 2 [20] | Yes | RGB | 59 | 273 | 2,881,616 | 1624×1224 | 29.4 | 59 | 1 | 1
YouCook 2 [27] | Yes | RGB | 89 | 2,000 | X | X | X | X | 1 | 1
COIN [24] | Yes | RGB | 180 | 11,827 | X | X | X | 83 | 1 | 1
Drive&Act [17] | Yes | RGB-D/Skeleton | 83 | X | ~9.6 million | 1280×1024 | X | X | 1 | 6
WorkingHands [22] | Yes | RGB-D | 13 | X | X | 1920×1080 | 7 | 13 | 1 | 1
ChaLearn Iso/ConGD [7] | No | RGB-D | 249 | 22,535 | X | X | X | 47,833 | 1 | 1
LTTM Senz3D [15,16] | No | RGB-D | 11 | 1,320 | X | X | X | 11 | 0.5 | 1
Multi-Modal Hand Activity Video Dataset [21] | Yes | Thermal/RGB-D | 15 | 790 | 401,765 | X | X | X | X | 1
EgoGesture Dataset [26] | No | RGB-D | 83 | 24,000 | 2,953,224 | 320×240 | X | 24,161 | 1 | 1
Gun71 [19] | Yes | RGB-D | 71 | X | 12 | X | X | X | X | 5–6
X = Information not specified in reference
References
1. Bambach, S., Lee, S., Crandall, D.J., Yu, C.: Lending a hand: detecting hands and recognizing activities in complex egocentric interactions. In: 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1949–1957 (2015)
2. Ben-Shabat, Y., et al.: The IKEA ASM dataset: understanding people assembling furniture through actions, objects and pose. CoRR abs/2007.00394 (2020)
3. Benitez-Garcia, G., Olivares-Mercado, J., Sanchez-Perez, G., Yanai, K.: IPN hand: a video dataset and benchmark for real-time continuous hand gesture recognition. CoRR abs/2005.02134 (2020)
4. Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the kinetics dataset. CoRR abs/1705.07750 (2017)
5. Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. CoRR abs/1604.06573 (2016)
6. Goyal, R., et al.: The “something something” video database for learning and evaluating visual common sense. CoRR abs/1706.04261 (2017)
7. Guyon, I., Athitsos, V., Jangyodsuk, P., Escalante, H.J.: The Chalearn gesture dataset (CGD 2011). Mach. Vis. Appl. 25(8), 1929–1951 (2014)
8. Jauch, C., Denecke, J., Huber, M.: Generating a hand pose data set for vision based manual assembly assistance systems. Electron. Imaging 119–1(01), 2021 (2021)
9. Ji, S., Wei, X., Yang, M., Kai, Yu.: 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013)
10. Khan, A.U., Borji, A.: Analysis of hand segmentation in the wild. CoRR abs/1803.03317 (2018)
11. Kim, T.-K., Wong, S.-F., Cipolla, R.: Tensor canonical correlation analysis for action classification. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007)
12. Li, Y., Liu, M., Rehg, J.M.: In the eye of the beholder: gaze and actions in first person video (2020)
13. Liu, M., et al.: Gated transformer networks for multivariate time series classification. CoRR abs/2103.14438 (2021)
14. Lopes, A., Souza, R., Pedrini, H.: A survey on RGB-D datasets. CoRR abs/2201.05761 (2022)
15. Marin, G., Dominio, F., Zanuttigh, P.: Hand gesture recognition with leap motion and kinect devices. In: 2014 IEEE International Conference on Image Processing (ICIP), pp. 1565–1569 (2014)
16. Marin, G., Dominio, F., Zanuttigh, P.: Hand gesture recognition with jointly calibrated leap motion and depth sensor. Multimedia Tools Appl. 75(22), 14991–15015 (2016)
17. Martin, M., et al.: Drive&Act: a multi-modal dataset for fine-grained driver behavior recognition in autonomous vehicles. In: The IEEE International Conference on Computer Vision (ICCV), vol. 10 (2019)
18. Plath, K.-U., editor. Bundesdatenschutzgesetz (BDSG): in der Fassung der Bekanntmachung vom 14. Januar 2003 (BGBl. I S. 66), zuletzt geändert durch Gesetz vom 14. August 2009 (BGBl. I S. 2814), pp. XV–LXI. Verlag Dr. Otto Schmidt (2012)
19. Rogez, G., Supancic, J.S., Ramanan, D.: Understanding everyday hands in action from RGB-D images. In: 2015 IEEE International Conference on Computer Vision (ICCV), pp. 3889–3897 (2015)
F. Sturm et al.
20. Rohrbach, M., et al.: Recognizing fine-grained and composite activities using hand-centric features and script data. Int. J. Comput. Vis. 119(3), 346–373 (2015)
21. Chi, H.-G., Kim, S.: First-person view hand segmentation of multi-modal hand activity video dataset. BMVC 2020 (2020)
22. Shilkrot, R., Narasimhaswamy, S., Vazir, S., Hoai, M.: Workinghands: a hand-tool assembly dataset for image segmentation and activity mining. In: BMVC (2019)
23. Singh, B., Marks, T.K., Jones, M., Tuzel, O., Shao, M.: A multi-stream bidirectional recurrent neural network for fine-grained action detection. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1961–1970 (2016)
24. Tang, Y., et al.: COIN: a large-scale dataset for comprehensive instructional video analysis. CoRR abs/1903.02874 (2019)
25. Zhang, F., et al.: Mediapipe hands: On-device real-time hand tracking. CoRR abs/2006.10214 (2020)
26. Zhang, Y., Cao, C., Cheng, J., Hanqing, L.: EgoGesture: a new dataset and benchmark for egocentric hand gesture recognition. IEEE Trans. Multimedia 20(5), 1038–1050 (2018)
27. Zhou, L., Xu, C., Corso, J.J.: Towards automatic learning of procedures from web instructional videos (2017)
A Novel Features Selection Model for Fire Detection and Fire Circumstances Recognition by Considering Fire Texture: MIC-RF-RFE
Jittarin Jetwiriyanon2, Ziheng Feng1, and Kanoksak Wattanachote1,2(B)
1 School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, China
2 Graduate School of Business and Advanced Technology Management, Assumption University, Bangkok, Thailand
[email protected]
Abstract. In the past decades, several computer vision-based fire detection techniques have been developed using color, shape, and motion features. However, which of these features are the most significant has never been discussed. In the real world, fire is not only orange: flames also appear in other shades, such as green and blue. We therefore propose a hybrid fire features selection model that considers fire textures. The model, named MIC-RF-RFE, combines the maximal information coefficient (MIC) with random forests recursive feature elimination (RF-RFE) to discover the most significant features for fire detection. The selected features were then used for two recognition purposes: fire detection, even for unconventional fire colors, and recognition of four fire circumstance states. Video footage of both fire and non-fire scenes was collected, and various observed features were extracted from it to build our training and test datasets. Our experiments demonstrated that fire texture detection with the proposed fire pattern recognition algorithms, which unify color pattern and motion pattern, not only significantly increased the accuracy of fire detection but also identified fire circumstances.
Keywords: Fire Texture · Features Selection · Maximal Information Coefficient · Random Forests · Recursive Feature Elimination · Motion Pattern
1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): SAI 2023, LNNS 711, pp. 1099–1115, 2023. https://doi.org/10.1007/978-3-031-37717-4_71
Fire is an incident that can cause severe damage to life and property within a few minutes. As recently reported in [1], two people died and 11 others were
injured in a fire on June 27th, 2022 at a shophouse near Sampeng market on Ratchawong Road in Bangkok’s Samphanthawong district. It was caused by a power transformer near the building that exploded at around 11.25 a.m., triggering a fire on the second floor of the building. The fire quickly burnt several flammable products inside, sending plumes of smoke into the sky before fire engines and crews were deployed to combat it. The fire was brought under control about one hour after it began. A total of six shophouses burnt down and 11 people, mostly volunteers fighting the fire, were injured. This incident suggests that a system able to inform firefighters and volunteers of the state of the fire circumstances while they are fighting the fire could reduce the number of injuries and deaths.
In another major incident, Southwest China was hit by forest fires [2]. The fire engulfed around 15 hectares of forest in Sichuan province on Saturday, March 30th, 2019. It was concentrated at an altitude of about 3,800 m in complex terrain with steep valleys and inconvenient transportation and communication. On Sunday afternoon, 689 firefighters were sent to combat the fire. Unfortunately, a sudden change of wind direction instantly formed a huge fireball, which engulfed the firefighters and other people in the area. Thirty people, including 27 firefighters and three locals, were confirmed dead while fighting the fire that broke out in Liangshan Yi mountain, Muli County, Sichuan province. As this report shows, information about the direction and motion of fire and smoke under the influence of wind is important. Such information can be analyzed to estimate the velocity, direction, and motion of a fire, which in turn indicate the fire circumstances. The direction, motion, and velocity of fire can be extracted from fire texture [3,4].
As the incidents above show, fires, whose main causes include human error and system failure, result in severe loss of life and property. This has motivated many researchers to develop new technologies for fire detection. Today, fire detection approaches fall into two broad categories: traditional fire alarms and vision-based surveillance systems. Traditional fire alarm systems are based on sensors that require close proximity for activation, such as infrared and optical sensors. Numerous vision-based surveillance systems have therefore been developed by researchers [5–9] to overcome the limitations of traditional approaches. Despite their advantages, several challenges remain in fire detection, e.g. speed and motion detection, fire source analytics, and maintaining a well-agreed trade-off with detection accuracy. The appearance of fire and smoke can be interpreted to provide knowledge of the current situation. For instance, the amount of fire and smoke can be analyzed to explain the circumstances or stages of a fire incident, not only in the early stage of a fire but also while it is in progress. Many researchers have made efforts to address these aspects by considering color and motion features for fire detection. However, these challenges remain open. As mentioned above, we are interested in both the color appearance and the motion characteristics of flame. Flame colors are able to inform us not only
the temperature of fire but also chemical compounds. The natural colors of flames range from red to orange, yellow, and white to blue. Each dominant color in a flame carries important information about the fire temperature, as described in [10,11]. For example, 500 g of Copper Sulfate with 1 lb of Sodium Borate (Borax) produces a green flame, 500 g of Sodium Chloride produces a yellow flame, and 100 g of Potassium Chloride produces a purple flame. The motion characteristics of fire and smoke have been studied in [3,4,8,9]. However, the characteristics of fire, fire-smoke, and smoke have never been leveraged to develop an intelligent fire detection system in which the interpretation of fire circumstances plays a major role. One significant approach is dynamic texture analysis, which defines the periodic motion pattern of a dynamic texture and uses this motion analysis to extract datasets from videos. We combine it with modern supervised machine learning classification algorithms, suited to our datasets, for features selection and fire stage detection, in order to find the significant features that yield more accurate predictions of fire stages. Our experiments therefore demonstrate a new combination of machine learning methods forming a significant features selection model for fire detection in different stages.
The remaining parts of this article are organized as follows. Section 2 presents the related works. Section 3 introduces the main contributions of our proposed methods. Section 4 contains experimental results. Section 5 compares the fire detection results obtained using our selected features with other state-of-the-art fire detection methods. Finally, Sect. 6 is devoted to conclusions and directions for our future work.
2 Related Works
2.1 Motion Vector and Features
The Farnebäck technique was selected to compute the motion vector field. For dataset preparation, we emphasized the surface block (SB) parameter of the Farnebäck method. SBs were used to generate the optical flow, represented as a motion vector field [4], where the size of the SB affects the vector properties in the optical flow field, such as the speed of the motion vectors, the number of vectors, and motion vector coherency. A small SB size leads to a large number of vectors over the spatial domain of the dynamic texture of interest. For instance, small SBs in a large spatial domain yield a large number of vectors in a video frame. Vector length (Vt) conveys the speed of the vectors in a particular SB area. More than eighty features, covering both color and motion properties, were extracted from the motion vector field on the texture of interest (TOI) and were used to find the most significant features. The definitions of the 89 features in our dataset are available at https://github.com/JittarinJet/Fire Detection 2022 Material/blob/main/89 Features Definition.pdf.
2.2 Features Selection Using MIC
In our first experiment, we addressed the difficulty of selecting the best features by computing the Maximal Information Coefficient (MIC) with Python code built on the minepy tools, using alpha=0.6 and c=15. The first data set was extracted from 18 video clips and contains 7,882 frames of data records: 3,934 frames of smoke data and 3,948 frames of fire data. Because the data set is large [12], it is not feasible to choose features one by one, and hand-picking groups of good features is also very difficult. We therefore use the MIC algorithm, which identifies relationships between pairs of variables in large data sets effectively. It partitions the x-y plane (Cartesian coordinates) into a grid, treats each grid cell as one cluster, and computes the information coefficient, i.e. the mutual information (1).
$$I(X, Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \tag{1}$$
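As an illustration of Eq. (1) evaluated on a grid, the following is a minimal pure-Python sketch. It is not the minepy implementation: the function names `grid_mutual_information` and `approximate_mic` are hypothetical, the grid uses equal-width bins (the real MIC algorithm searches over optimized partitions), and the grid-size budget `max(n ** alpha, 4)` is an assumption taken from the usual description of MIC rather than from this paper.

```python
import math
from collections import Counter


def grid_mutual_information(xs, ys, nx, ny):
    """Mutual information (in bits) of paired samples on an nx-by-ny equal-width grid."""
    def bin_index(value, lo, hi, bins):
        if hi == lo:
            return 0  # a constant variable falls into a single bin
        # Clamp the maximum value into the last bin.
        return min(int((value - lo) / (hi - lo) * bins), bins - 1)

    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    cells = [(bin_index(x, x_lo, x_hi, nx), bin_index(y, y_lo, y_hi, ny))
             for x, y in zip(xs, ys)]
    n = len(cells)
    joint = Counter(cells)                # counts for p(x, y)
    px = Counter(i for i, _ in cells)     # counts for p(x)
    py = Counter(j for _, j in cells)     # counts for p(y)
    mi = 0.0
    for (i, j), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((px[i] / n) * (py[j] / n)))
    return mi


def approximate_mic(xs, ys, alpha=0.6):
    """Largest normalized grid mutual information over small grids (simplified)."""
    budget = max(len(xs) ** alpha, 4)     # assumed bound on nx * ny
    best = 0.0
    for nx in range(2, int(budget) + 1):
        for ny in range(2, int(budget) + 1):
            if nx * ny > budget:
                break
            score = grid_mutual_information(xs, ys, nx, ny) / math.log2(min(nx, ny))
            best = max(best, score)
    return best


if __name__ == "__main__":
    feature = list(range(64))
    print(approximate_mic(feature, feature))      # perfectly dependent pair
    print(approximate_mic(feature, [5] * 64))     # constant second variable
```

Dividing by log2 min(nx, ny) keeps each score in [0, 1], so values from grids of different sizes are comparable before the maximum is taken.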
The maximum value over these partitions is then taken as the MIC value of the two features. The MIC algorithm has been used to study the relationships among diverse variables. MIC captures a wide range of associations, both functional and non-functional, as shown in (2).
$$\mathrm{MIC}(l_i, l_s) = \max_{n_x \times n_y} \frac{\max\left(I(l_i, l_s)\right)}{\log_2 \min(n_x, n_y)} \tag{2}$$
0.05) or unknown words (t(48) = 1.305, p > 0.05) during the pretest. However, the
H. Xiao et al.
improvement scores of the EAG and the CAG students showed a significant difference in both training words (t(48) = −3.544, p < 0.05) and unknown words (t(48) = −5.311, p < 0.05). Therefore, using ASR for pronunciation training was effective even for advanced level students, especially for the pronunciation in unknown words with an improvement of about 7.91% (with the mean difference being 5.696 out of 72 points), while only with an improvement of about 4.47% (with the mean difference being 1.608 out of 36 points) in training words.

Table 3. Comparison of the EAG and the CAG on Scores of Target Vowels in the Pretest and on Improvement Scores

| Measure | EAG (n = 25) M | EAG (n = 25) SD | CAG (n = 25) M | CAG (n = 25) SD | MD | t (df) | Sig. (2-tailed) |
|---|---|---|---|---|---|---|---|
| Pronunciation Score of Target Vowels in training words in the pretest | 30.92 | 2.008 | 30.92 | 1.372 | 0.000 | 0.000 (42.401) | 1.000 |
| Pronunciation Score of Target Vowels in unknown words in the pretest | 49.96 | 4.598 | 51.568 | 4.103 | 1.608 | 1.305 (48) | 0.198 |
| Improvement Score of Target Vowels in training words | 2.152 | 1.538 | 0.544 | 1.668 | −1.608 | −3.544 (48) | 0.001 |
| Improvement Score of Target Vowels in unknown words | 5.296 | 4.194 | −0.400 | 3.342 | −5.696 | −5.311 (48) | 0.000 |

Note: EAG: experimental group with advanced English pronunciation level; CAG: control group with advanced English pronunciation level.
A further question was whether there is a significant difference in the degree of improvement between students of different levels of phonetic proficiency. To answer this third research question, we compared pronunciation improvement scores between the two EG groups, the EAG and the EIG, who were divided into advanced and intermediate levels in the pre-screening test. Two Independent-Samples t Tests were used on target vowel improvement scores (obtained by subtracting the pretest scores from the posttest scores) in training words and unknown words of the EAG and the EIG. The results are presented in Table 4 and Table 5, respectively.

Table 4. Comparison of the EAG and the EIG on Improvement Scores of Target Vowels in Training Words

| Measure | EAG (n = 25) M | EAG (n = 25) SD | EIG (n = 25) M | EIG (n = 25) SD | MD | t(48) | Sig. (2-tailed) |
|---|---|---|---|---|---|---|---|
| Improvement Score of Target Vowels | 2.152 | 1.538 | 3.272 | 2.152 | −1.12 | −2.117 | 0.039 |

Note: EAG: experimental group with advanced English pronunciation level; EIG: experimental group with intermediate English pronunciation level.

The Effect of ASR Apps on Monophthong Pronunciation

As can be seen from Table 4, the target vowel improvement scores of the EAG students in training words differed significantly from those of the EIG students after using ASR for pronunciation training (t(48) = −2.117, p < 0.05). The EAG improved the average score by 2.152 points out of 36, while the EIG improved by an average of 3.272 points. The EIG’s average improvement score was significantly higher than the EAG’s, with a mean difference of 1.12 out of 36 points. We can conclude that ASR-based pronunciation training improved the vowel pronunciation accuracy in training words of intermediate level students more effectively, and to a greater extent, than that of advanced level students, by about 3.1% (with the mean difference being 1.12 out of 36 points).

Table 5. Comparison of the EAG and the EIG on Improvement Scores of Target Vowels in Unknown Words

| Measure | EAG (n = 25) M | EAG (n = 25) SD | EIG (n = 25) M | EIG (n = 25) SD | MD | t(48) | Sig. (2-tailed) |
|---|---|---|---|---|---|---|---|
| Improvement Score of Target Vowels | 5.296 | 4.194 | 7.296 | 4.478 | −2 | −1.63 | 0.11 |

Note: EAG: experimental group with advanced English pronunciation level; EIG: experimental group with intermediate English pronunciation level.
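The t values reported for the EAG and EIG comparison can be recomputed from the group means, standard deviations and sample sizes alone. The sketch below assumes the equal-variance (pooled) form of the independent-samples t test, which is consistent with the reported 48 degrees of freedom; the function name `pooled_t` is hypothetical and is not taken from the study.

```python
import math


def pooled_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Independent-samples t statistic with pooled variance; returns (t, df)."""
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / standard_error, df


# Training words: EAG (M = 2.152, SD = 1.538) vs. EIG (M = 3.272, SD = 2.152)
t_train, df = pooled_t(2.152, 1.538, 25, 3.272, 2.152, 25)

# Unknown words: EAG (M = 5.296, SD = 4.194) vs. EIG (M = 7.296, SD = 4.478)
t_unknown, _ = pooled_t(5.296, 4.194, 25, 7.296, 4.478, 25)

# Mean difference in training words as a share of the 36-point scale
share_train = 100 * 1.12 / 36

print(round(t_train, 3), round(t_unknown, 2), df, round(share_train, 1))
```

Working from summary statistics is enough here because both groups have the same size, so the pooled variance is simply the mean of the two squared standard deviations.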
In order to see whether this effect is also reflected in pseudowords and new words, consider Table 5. This table shows that after the EAG students used ASR for pronunciation training, the target vowel improvement score in unknown words was not significantly different from that of the EIG students (t(48) = −1.63, p > 0.05). From the two group means, the improvement score of the EIG (M = 7.296) was higher than that of the EAG (M = 5.296). The difference between the two mean scores was two points on a 72-point test. It can be seen that in unknown words, ASR-based pronunciation training was not significantly more effective for intermediate level students than for advanced level students. To sum up the above analysis, for the advanced level students in the experimental group, this training method could effectively improve their pronunciation, and showed no ceiling effects, particularly in the improvement of vowels in unknown words. Compared with intermediate level students, the improvement of target vowel pronunciation accuracy in training words was still smaller, showing a significant difference.
5.4 Error Analysis of Each Monophthong of the EG in the Tests
In this section, more details of the scores and errors for each monophthong in the pretest and posttest will be reported. Data in this analysis came from the experimental group only, as we are interested in how ASR-based phonetic training can effectively improve students’ production of each target vowel.
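Table 6. Descriptive Statistics of the EG Students’ Scores of Each Target Vowel in the Pretest and the Posttest

| Test | Vowel | N | Mean (out of 18 points) | Std. Deviation | Maximum | Minimum | Range |
|---|---|---|---|---|---|---|---|
| Pretest | /æ/ | 50 | 12.496 | 2.691 | 16.80 | 6.20 | 10.60 |
| Pretest | /ɛ/ | 50 | 12.288 | 2.577 | 16.60 | 5.20 | 11.40 |
| Pretest | /ɪ/ | 50 | 12.896 | 1.613 | 15.60 | 8.80 | 6.80 |
| Pretest | /i/ | 50 | 11.520 | 2.988 | 17.00 | 5.00 | 12.00 |
| Pretest | /u/ | 50 | 13.492 | 2.176 | 17.60 | 6.60 | 11.00 |
| Pretest | /ʊ/ | 50 | 10.972 | 2.096 | 14.60 | 6.40 | 8.20 |
| Posttest | /æ/ | 50 | 14.724 | 2.528 | 18.00 | 6.00 | 12.00 |
| Posttest | /ɛ/ | 50 | 13.984 | 1.864 | 17.40 | 10.20 | 7.20 |
| Posttest | /ɪ/ | 50 | 13.612 | 1.937 | 17.00 | 9.20 | 7.80 |
| Posttest | /i/ | 50 | 13.632 | 2.526 | 17.60 | 8.60 | 9.00 |
| Posttest | /u/ | 50 | 14.712 | 2.007 | 18.00 | 8.60 | 9.40 |
| Posttest | /ʊ/ | 50 | 12.008 | 1.907 | 17.80 | 8.20 | 9.60 |

Note: EG: experimental group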
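The per-vowel improvement discussed next is the posttest mean minus the pretest mean. A minimal sketch of that arithmetic follows, using the means from Table 6 in the table's row order; the variable names are illustrative only.

```python
# Mean scores (out of 18 points) per target vowel, in the row order of Table 6
pretest = {"æ": 12.496, "ɛ": 12.288, "ɪ": 12.896, "i": 11.520, "u": 13.492, "ʊ": 10.972}
posttest = {"æ": 14.724, "ɛ": 13.984, "ɪ": 13.612, "i": 13.632, "u": 14.712, "ʊ": 12.008}

# Mean difference (MD) = posttest mean - pretest mean
improvement = {vowel: round(posttest[vowel] - pretest[vowel], 3) for vowel in pretest}

# Vowels ordered from largest to smallest improvement
ranked = sorted(improvement, key=improvement.get, reverse=True)

for vowel in ranked:
    print(f"/{vowel}/ MD = {improvement[vowel]}")
```

Rounding to three decimals matches the precision of the reported means and avoids floating-point noise in the differences.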
According to Table 6, the average score of the monophthong /u/ obtained by the EG participants was almost the highest among the six vowels both in the pretest and posttest and the improvement score was relatively high (MD = 1.22, obtained by subtracting pretest mean score of the vowel from the posttest). The average scores of /æ/, /ɛ/, and /ɪ/ obtained by the participants in the pretest were very similar, also being relatively high among the six monophthongs. However, in the posttest, there was a gap in the accuracy of these three vowels, with the pronunciation of /æ/ being improved more (MD = 2.228), followed by /ɛ/ (MD = 1.696). The score of the vowel /ɪ/ rose relatively little (MD = 0.716). Furthermore, the students’ vowels /i/ and /ʊ/ got lower average scores in the pretest, but the vowel /i/ performed slightly better. In the posttest, the mean score of vowel /i/ also increased more (MD = 2.112), while the pronunciation of /ʊ/ received less improvement (MD = 1.036).
Before discussing the specific errors of each vowel that often occurred in the tests, we observed and concluded from the participants’ errors (marked by the three raters) that they frequently made three kinds of errors. One was pure pronunciation errors, caused by students’ incorrect mouth shape and pronunciation position, so this kind of error usually occurred between similar vowel contrasts. For instance, when pronouncing /ɛ/, if the mouth is wide open without strict control, the pronunciation will be close to /æ/. It is worth noting that this type of error led to pronouncing a vowel close to another vowel, rather than completely replacing it by another vowel. The second type was category errors, due to misunderstanding of the vowel pronunciation rules or misrecognizing a word, and this kind of error frequently appeared in pseudowords and new words. For example, recognizing rym as the word rhyme led participants to pronounce the target
vowel /ɪ/ as /aɪ/. Another example was the misunderstanding of the target vowel in grid as /i/ instead of /ɪ/. Such cases were not classified as pure pronunciation errors because the target vowel was completely replaced by the contrasting vowel /i/. In addition, human raters could obtain information about the student’s ability to distinguish between the pronunciation of /i/ and /ɪ/ from the pronunciations of other words. For further explanation, if a student pronounces fix and disagree correctly, but mispronounces grid, it can be determined that this is a category error. The third type of error was stress errors, which only occurred in disyllabic or (especially) multisyllabic words and were often accompanied by category errors. For example, the word secateurs was often pronounced by participants as /sɪˈkætrz/ rather than /ˈsɛktrz/. We classified this kind of error as stress error because placing word stress in the wrong position caused errors in target vowel pronunciations in these cases. In a nutshell, participants’ inaccurate pronunciations can be grouped into these three kinds of common errors.
For specific common errors of each vowel, first, with regard to the monophthong /u/, students usually did not make pure pronunciation errors, except in cartoonist. Nearly one-half of students pronounced it close to /ʊ/ in this word. When encountering pseudowords or new words, it was a common mistake to pronounce this target vowel as another vowel, that is, to make category errors. In the words droofer and dootive, nearly one-third of students pronounced it as /ɔ/; in fluke, it was pronounced as /ʊ/. Otherwise, in festoon, most students made stress errors and pronounced it as //. In addition, half of the students did not pronounce the vowel /u/ in linuent. Among these common mistakes, after five weeks of training, the monophthong /u/’s pronunciations in the words cartoonist, droofer and dootive improved.
Regarding the vowel /æ/, in the pretest, there were plenty of pure pronunciation errors which were pronounced close to the vowel /ɛ/. Also, a small number of students’ pronunciations of this vowel were similar to the vowel /ʌ/. However, when confronting pseudowords or new words, students seldom made category errors, pronouncing the vowel /æ/ as another vowel. Only nearly one-third of students tended to pronounce the target vowel as /e/ in scap and traddle and nearly one-fifth of students tended to pronounce /ɑ/ in enhaster. In the posttest, the vowel /æ/ was less often pronounced close to the vowel /ɛ/ or /ʌ/ and there was no obvious tendency to pronounce it as /e/ in the word scap.
As for the vowel /ɛ/, most of the errors in the pretest resulted from being pronounced close to vowel /æ/ as well. In addition, students often made category errors when encountering an unfamiliar word. For example, nearly one-third of students tended to pronounce it as // in the word slen. Nearly one-fifth pronounced /ɛ/ in theg as // and one-fifth as /ɪ/. Nearly one in two students pronounced it as /i/ in jealous and one in four pronounced the vowel in elk as /ɪ/. Besides, about one fourth of the students made stress errors in secateurs, pronouncing it as /sɪˈkætrz/ rather than /ˈsɛktrz/. Then, in the posttest, for the training words, the EG students confused the vowel /ɛ/ less with the pronunciation of the vowel /æ/. However, there were still many cases of pronouncing it close to /æ/ for unknown words. Besides, the pronunciations of jealous and elk improved. The vowel in theg was not pronounced as //, but tended to be similar to the vowel /æ/. There was no obvious improvement for other above-mentioned problems.
About the fourth vowel /ɪ/, most of the mispronunciations made by the EG students in the pretest were category errors. It was replaced by the long vowel /i/ when encountering words with uncertain pronunciations, including in the words syllable, rym, quim, vermin, trission, eponym and grid. The vowel in rym was also often pronounced as /aɪ/. In addition, when most students were faced with the words fenistic and inertia, although the target vowels were pronounced as /ɪ/, errors were still made because of the stress. In the posttest, the students’ pronunciations of the three words rym, vermin and grid improved.
When looking at the monophthong /i/, in the words disagree and outary, nearly a quarter of students pronounced it close to /e/, making pure pronunciation errors. What’s more, participants often pronounced this vowel as the vowel /ɪ/ in the pretest. Whether it was in a known word, like each, even, fleet, or an unknown word, such as pareeky, crief, weal, pleat, easel, most students made category errors in the pronunciations of the target vowels. For breap, about a quarter of students pronounced it as /ɛ/, and a quarter pronounced it as /e/. For stress errors, in the word manatee, most students placed the stress on the last syllable instead of the second one. Among the errors mentioned, the target vowels in the words disagree and outary were less often pronounced as /e/, and the pronunciations of even, easel, crief and manatee improved in the posttest.
As for the pronunciation of the vowel /ʊ/, there were many students who often made category errors by pronouncing it as the long vowel /u/, including in the words understood, push, wull, hoodwind, bulletin, bushel, and Pul. Or, it was pronounced as the diphthong /oʊ/ in tonsfully and bistful. When encountering unknown words, students also pronounced it as /ɑ/, as in athorrook and crook, or /ʌ/, as in sput.
After training, the students’ pronunciation scores of target vowels in the words understood, hoodwind, sput, crook, and athorrook increased. Overall, the mistakes participants made in the tests were relatively regular and they could be related to several factors, which deserve further discussion.
5.5 Results and Analysis of the Questionnaire
In the follow-up questionnaire survey, we set up seven open-ended questions to collect the EG students’ feedback, in order to understand their attitudes and opinions on this training method after using ASR for five weeks. The first two questions addressed the advantages and disadvantages of the Baidu Fanyi App. As for the advantages, the students mainly mentioned that the application can allow them to imitate the sample recording and give a feedback score in real time. Also, the pronunciation provided is by a real person and the pronunciation is accurate. Another main comment was that this app can provide learning resources and corrections by each phoneme, which is helpful for phonetic learning. Therefore, students believed that the above advantages made them more aware of their pronunciation mistakes in training. However, the students scored the average accuracy of speech recognition of the app at about 6.5 points out of 10, which was highly correlated with the tested error detection rate of the ASR app during the pilot study mentioned in the former chapter. They reported the main disadvantage of the app is that the accuracy of recognition was not very high, such as sometimes recognizing gem as “Jim”, limb as “Linda” or rill as “real” even if they pronounced correctly. It is difficult to recognize some uncommon
words, so it forced students to pronounce words repeatedly in the process of using, which made them feel a little irritated, resulting in a certain negative impact on their enthusiasm. The second part of the questionnaire mainly compared this training method with the feedback given by teachers and peers. When it came to comparing with teachers’ feedback, students mainly thought that ASR-based training is more convenient and flexible. They could practice repeatedly without worrying about the communication restrictions of being with teachers. Regarding the disadvantages, the same issue of relatively low detection accuracy came up again. The accuracy of recognizing pronunciation is not high enough and the feedback given is not accurate enough. When compared with the feedback given by peers, about 63% of the EG students thought they preferred to use ASR for pronunciation practice, because it is more convenient and the feedback provided is more objective. The remaining 37% of the students thought that practicing with their classmates or friends makes the learning situation more realistic. The last three questions were mainly about the overall evaluation and impression of ASR-based training. When students were asked whether they felt their pronunciation improved, about 79% of them said they felt they made some progress. Some students said that the distinction between long vowels and short vowels was clearer to them. Some students said that they corrected the wrong pronunciations of some words and some students said that their consonant pronunciation also improved to a certain extent. However, about whether they would continue using ASR for pronunciation training in the future and whether they would recommend it to others, nearly half of the students held a negative attitude. The reason was the same as the disadvantage alluded to, mainly that the ASR recognition is not accurate enough. 
The other half of the students thought that they could use this application in their spare time, holding a positive attitude and believing they would continue using it and could recommend it to others.
6 Discussion and Conclusion
6.1 Discussion
6.1.1 Discussion on Pronunciation Improvement and Generalization
First, from the results of the training words part, we concluded that after a period of training, the student’s mean score improved by about 4.22%. In just five weeks of training, the students’ pronunciation improvement was visible. This is consistent with studies that invited college students over the past five years [20; 17; 23; 21; 11]. Therefore, we can say that this ASR app is effective for training the pronunciation of English learners whose native language is Mandarin. It also correlates with Guskaroska’s (2019) [12] study on four pairs of vowel contrasts. It was proved that the use of ASR for pronunciation training by Macedonian EFL learners led to an average improvement of 6.71%. It is similar to but slightly higher than the result obtained in this study. In addition, we compared the improvement scores of the two groups of advanced level students and found that the advanced level students who used ASR for training also showed improvement. However, comparing the improvement degree of students with intermediate level and advanced level, it was found that for students with intermediate level who used ASR
for pronunciation training, the accuracy of the target vowels in training words improved more, with a significant difference. In other words, for the pronunciation in training words, the improvement effect of using ASR in training was better for intermediate level students, but it was also effective for advanced level students. Meanwhile, for the unknown words part, the overall improvement of the EG students was about 7.2%. In the pretest, students’ target vowel accuracy rate for this part of the words (62.5%, about 45 points out of 72) was much lower than for familiar words (80.6%, about 29 points out of 36), but the improvement effect for this part was greater than the improvement of the target vowels’ pronunciation in training words. This shows that when students encountered unknown words, they did have some difficulties in recognizing and reading them. After practicing and familiarizing with some common pronunciation rules, they could read pseudowords and new words more accurately. As Ehri and his colleague found, the establishment of phonemic awareness is beneficial for reading and word spelling (Ehri, et al., 2001) [8]. It can be seen that the students’ ability to read words indeed improved, behind which was the imperceptible input of letter combinations and the pronunciation rules during training. Besides, the comparison of the improvement degree of students with an intermediate level and advanced level showed that there was no significant difference in the improvement degree of target vowels in unknown words when using ASR for pronunciation training. Although students with an intermediate level improved more, the difference did not reach a significant level, proving that this training method had no ceiling effect for freshmen in the university. Even though their starting point of the target vowel pronunciation level in familiar words was relatively high, there still was effective improvement for both the advanced level and intermediate level. 
Therefore, for the pronunciation training with ASR, the improvement effect was the same regardless of the level of students when encountering pseudowords and new words. According to the above findings, the participants’ pronunciation accuracy of the monophthongs improved in both training words and unknown words, but the degree of improvement was not very high, for which there may be several reasons. First, the training of the EG students only lasted five weeks. It was difficult to greatly improve the students’ incorrect pronunciation and cultivate phonemic awareness in such a short period of time. Second, the accuracy and error detection rate of ASR still need improvement, based on the feedback of the EG students in the questionnaire as well as the results measured in the pilot study. This may also have limited the effectiveness of ASR for pronunciation improvement. Third, because students conducted pronunciation training in private, although they were required to submit training evidence on time every day, it cannot be ruled out that some students did not complete the task seriously. However, in any case, the above results indicate that ASR can help students improve the pronunciation accuracy of target monophthongs and, to a certain extent, help them to be aware of and consolidate some grapheme-to-phoneme correspondences.

6.1.2 Discussion on Specific Errors of Monophthongs

As regards the improvement of each specific monophthong of the EG students, /æ/ and /i/ showed the most improvement, followed by /ɛ/ and /u/. The improvement effect for /ʊ/ and /ɪ/ was smaller. In the posttest, the accuracy of the four vowels /ʊ/, /i/, /ɪ/, /ɛ/
The Effect of ASR Apps on Monophthong Pronunciation
was slightly worse compared to /æ/ and /u/, especially for the vowel /ʊ/. This is in some agreement with Li (2020), who found that /ʊ/, /ɛ/, /æ/ and /ɪ/ should be difficult to learn for Mandarin-speaking English learners. Regarding the common type of target vowel error in training words, we found that pure pronunciation errors of target vowels in these training words were directly caused by English intralingual differences, including pronouncing /æ/, /ɛ/ and /ʌ/ close to each other, as mentioned in the last chapter. Students could not correctly grasp the articulation position and mouth shape of these vowels, resulting in one vowel sounding similar to another. The study by Weng (2021) suggested that the incomplete correspondence between Mandarin and English phonemes caused subjects to be confused, such as /i/ and /ɪ/ in English vowels with (i)/i/ in Mandarin, /ɛ/, /æ/, /ʌ/ with (a)/a/ in Mandarin, and English /ʊ/, /u/ with (u)/u/ in Mandarin. This echoes the findings of the present experiment. After learning English for more than ten years, students did not use (a)/a/ in Mandarin to replace the similar phonemes in English, namely /ɛ/, /æ/ and /ʌ/, but they still had a hard time handling these subtly similar sounds in English due to the incomplete correspondence between Mandarin and English. Since Mandarin does not have such pairs of vowel contrasts, which is quite different from English, students may be less familiar with such subtle intralingual differences. This is in line with Wardhaugh’s (1970) [26] Contrastive Analysis Hypothesis, which argues that the language errors of second language learners can be attributed to language transfer. This also supports the moderate version of the CAH proposed by Oller and Ziahosseiny (1970) [22]: not only interlingual differences but also subtle intralingual differences have a negative transfer effect on language learning.
Furthermore, by scoring the vowel in each word produced by the students, we can confirm that the students knew the corresponding target vowels in the training words, but they could not control the accurate articulation position and length of the vowels, resulting in inaccurate though not entirely wrong pronunciations. For example, students pronounced /æ/ similar to /ɛ/ because their mouth opening was too small and the back half of the tongue did not stretch up. The students pronounced /æ/ similar to /ʌ/ because the tongue was too relaxed, the tongue tip did not gently touch the back of the bottom front teeth and the back half of the tongue did not stretch up. However, in daily English teaching, many teachers cannot teach students the specific differences between these similar phonemes in such detail, so English intralingual differences directly cause many pronunciation problems for English learners whose native language is Mandarin. Combining the analysis of students’ common error types in pronunciation with the improvement results in the training words part, we find that ASR was effective in helping students correct the common errors in these pairs of contrasts. Among these, the pronunciation of /æ/ and /ɛ/ became better distinguished and improved. In follow-up questionnaires, some students also reported that the distinction between long vowels and short vowels became clearer after training. When we specifically analyzed the common error types of target vowel pronunciation in unknown words, we found that students’ errors were relatively more diverse (with more category errors and stress errors) and that they were more influenced by learners’ experience and cognition in the process of learning the second language, that is, their familiarity with grapheme-to-phoneme correspondences. When students encountered
new words or pseudowords, they sometimes pronounced the target vowels as other vowels. For instance, some students pronounced the target vowels of dootive and droofer as /ɔ/. Since students could not associate “oo” with the correct English pronunciation rules when they encountered these words, they pronounced the target monophthong incorrectly. Sometimes, learners infer and summarize the rules of second language pronunciation based on their own learning experience, and there is a difference between the learner’s cognition and the actual language system, as proposed by Corder (1967) [3]. For example, some students pronounced the target vowels in slen and theg as /ə/. Students may realize in the process of learning English that the letter “e” is sometimes pronounced as /ə/, but in fact this happens only in unstressed syllables. This cognitive bias caused students to mispronounce words they did not know. Besides, it is noteworthy that the confusion between /ʊ/ and /u/ was not symmetrical in this experiment. When the target vowel was /u/, students usually did not mispronounce it as /ʊ/. Conversely, when the target vowel was /ʊ/, students sometimes mispronounced it close to or as /u/. Students seemed to have a preference for the pronunciation of /u/, which may be because they had encountered /u/ more often than /ʊ/. All of this is related to the fact that their phonemic awareness is not accurate and flexible enough. However, the results of this experiment showed that the use of ASR for pronunciation training helped to cultivate students’ phonemic awareness and correct category errors to a certain extent. For example, students’ pronunciation of target vowels in dootive and droofer shifted to the correct /u/ sound after training. In sput, some students pronounced the target vowel as /ʌ/, as in cut, in the pretest, but pronounced it as /ʊ/ after training.
For another example, in the pseudoword scap, some students pronounced the target vowel as /e/ in the pretest, but could pronounce it as /æ/ after training. Since this same situation did not improve in another pseudoword, traddle, the training intervention (only the selected six vowels were included in the experiment and no /e/ was included) could be ruled out. Similarly, in the pseudoword enhaster, some students still pronounced the target vowel as /ɑ/ after training. Again, no training intervention occurred in the posttest, though the training session only targeted the six selected monophthongs. It can be seen that after training, students were more likely to associate letters with their common pronunciation rules, and performed phonemic manipulation tasks with adjacent letters more accurately in some cases. According to Adams’s (1990) [1] description of phonemic awareness from the perspective of ability, this shows that students’ English phonemic awareness increased. However, the above discussion also shows that the use of ASR was not highly effective, since it only made students more familiar with the grapheme-to-phoneme correspondence of the target vowels in some words. This may be because the training time was insufficient. In addition, another common error type, pure pronunciation errors, also often appeared in unknown words, brought about directly by the intralingual differences mentioned before. The subtle distinctions between similar vowels were not fully grasped. After the training, the students’ pure pronunciation errors of the three pairs of vowels in unknown words also improved slightly, especially for the pronunciation of /æ/. It was more accurate and less often pronounced close to /ɛ/ or /ʌ/, which is consistent with some of the findings in training words. This shows once again
that the pronunciation improvement effect of ASR on students can be generalized to pseudowords and new words.

6.2 Implications

Combining the results of the follow-up questionnaire and the findings of the experiment, this study has implications for English pronunciation teaching. First, the experimental results show that ASR can help to improve learners’ English pronunciation of monophthongs, especially by reducing pure pronunciation errors resulting from not mastering subtle differences in the pairs of phoneme contrasts and by helping students improve phonemic awareness to some extent. Second, feedback from the EG participants in the questionnaire indicates that using ASR in pronunciation training can help them improve their pronunciation to a certain extent and that this training method is very convenient. However, participants said that this particular ASR app is not accurate enough and lacks real interactivity. Taking into account the advantages and disadvantages of using the ASR app for pronunciation training, and considering that teachers do not have enough teaching time in class to attend to all students, we believe that ASR can be used to assist teaching, but it cannot completely replace the role of teachers in pronunciation instruction. In addition to teachers’ guidance, students can have an application from which they receive timely feedback and which can be used for free anytime and anywhere, allowing them to discover their pronunciation problems in time and imitate the correct pronunciation. Nonetheless, the participants reported that because this application cannot recognize vowels and words with a high accuracy rate, they sometimes felt frustrated during practice. For that reason, teachers should choose materials of appropriate difficulty for the students when using this method to assist teaching and should closely evaluate the particular applications employed for accuracy and convenience.
6.3 Limitations and Directions for Future Research

In addition to the implications for English pronunciation teaching, this small project also uncovers directions for further research. First of all, a limitation of this study is the use of the pseudowords bistful and tonsfully in the tests, because in words like beautifully, the “u” may be a schwa or completely absent. For future research, we recommend avoiding words with such potentially confusing structures. In addition, there were variables that could not be controlled during the experiment, such as extra learning by the participants, which may have slightly affected the results. What is more, recall that Li (2016) [16] found that training with disyllabic stimuli was much more effective for second language learners’ learning of Mandarin tones, with all possible permissible syllabic structures included. For future research, a comparison of the pronunciation improvement of target vowels across syllabic structures or numbers of syllables can be explored in more detail. Furthermore, with the implementation of the country’s Double Reduction Policy⁵, off-campus courses like Phonics for children are no longer encouraged, so it is worth exploring whether this ASR-based training method is effective for younger learners in cultivating phonemic awareness, in order to make better use of it in school teaching.

5 Website: http://www.gov.cn/zhengce/2021-07/24/content_5627132.htm.
6.4 Conclusions

This study explored the role of ASR in English pronunciation training by investigating three aspects, namely, the extent to which ASR can promote pronunciation improvement of six target monophthongs, whether and to what extent pronunciation improvement can be generalized to pseudowords as well as new words, and whether there is a difference in effectiveness between learners with different pronunciation levels. The findings show that participants in the experimental group improved their target vowel scores by about 4.22%, especially by reducing pure pronunciation errors caused by subtle intralingual differences in the pairs of phoneme contrasts. The results also showed that the improvement effect of ASR in English pronunciation training can be generalized to new words and pseudowords, with an improvement of about 7.2%, helping students to correct some category errors in the pronunciation of unknown words and develop phonemic awareness to a certain extent. In addition, this pronunciation training method was effective for university freshmen with an advanced pronunciation level, without an obvious ceiling effect, particularly in generalizing to pseudowords and new words. However, the improvement effect in training words was slightly better for intermediate level learners. Although ASR still needs improvement in the accuracy of speech recognition, participants reported that using ASR for pronunciation training is very convenient. Consequently, we believe that under the guidance of teachers, ASR programs can be used to assist in English pronunciation training to make up for the limitations of the current EFL classroom environment. This study opens up more possibilities for future research on using ASR in EFL settings with limited native input.
Appendices

Appendix A: Test Material

Part 1 Training Words
fix, sat, understood, ruler, plastic, print, syllable, festival, happen, each, cartoonist, push, excellent, minute, imagine, finally, move, even, jealous, could, suitable, never, kitchen, fleet, ten, bookshop, softly, input, let, loosen, exactly, act, hospital, disagree, too, neighborhood

Part 2 Pseudo Words
slude, outary, camic, fletter, athorrook, pareeky, crief, tonsfully, cheedle, droofer, maticker, vumly, dective, fenistic, breap, scap, bistful, popectic, trission, wull, linuent, sput, traddle, slen, blin, actity, zinner, dootive, chaxed, rollook, empity, trule, enhaster, rym, shoomering, theg

Part 3 New Words
servitude, gastric, dexter, hoodwind, vermin, weal, grievances, bulletin, quim, inglenook, hex, manatee, pleat, hebetic, fatuously, eponym, crook, whack, easel, elk, rictal, bushel, bendlet, slue, festoon, grid, Pul, secateurs, banyan, inertia, acapnia, fluke, elude, vat, prudential, perky
Appendix B: Questionnaire

关于使用 “百度翻译app” 练习英语发音的评价性调查问卷
Evaluative Questionnaire on Practicing English Pronunciation with “Baidu Fanyi App”

说明: 本次问卷的回答是针对使用“百度翻译app”进行英语发音练习后的评价调查。请根据自身实际情况进行较为详细的回答。调查结果仅作为实验分析的依据并严格保密。感谢您的参与!
Note: This questionnaire is an evaluation survey after using “Baidu Fanyi App” to practice English pronunciation. Please give detailed answers based on your actual experience. The results are only used as a basis for experimental analysis and are strictly confidential. Thank you for your participation!

1. 使用 “百度翻译app” 练习英语发音, 有什么优点或者特色功能?这些是否对您的英语发音有帮助?
What are the advantages or special functions of using “Baidu Fanyi App” to practice English pronunciation? Did these help you improve your English pronunciation?

2. 使用 “百度翻译app” 练习英语发音后, 请您为语音识别准确率打分 (10分满分), 您认为使用过程有什么不便的地方?这是否给您的发音练习带来消极影响?
Please score the recognition accuracy of “Baidu Fanyi App” (out of 10 points). Is there any inconvenience while using it to practice? Does that negatively affect your pronunciation practice?

3. 与老师给您发音反馈相比, 使用 “百度翻译app” 获取发音反馈有什么优点和缺点?
What are the advantages and disadvantages of using “Baidu Fanyi App” to get pronunciation feedback, compared with getting feedback from teachers?

4. 您认为(A. 使用 “百度翻译app” 进行发音练习) 和(B. 与同学组队互相练习发音), 哪个更好, 为什么?
Which do you think is better, (A. using “Baidu Fanyi App” for pronunciation practice) or (B. practicing pronunciation with your peers), and why?

5. 您认为在训练期间使用 “百度翻译app” 的语音识别技术辅助发音练习后, 总的来说, 您的英语发音是否有提高?具体体现在哪方面?
Do you think your English pronunciation improved in general after using the automatic speech recognition technology of “Baidu Fanyi App” to practice? What specific aspect was improved?

6. 在实验过后, 您未来是否会继续使用 “百度翻译app” 进行英语发音练习?为什么?
After the experiment, will you continue using “Baidu Fanyi App” to practice English pronunciation in the future? Why?

7. 您是否会推荐其他人使用 “百度翻译app” 进行英语发音练习?
Would you recommend others to use “Baidu Fanyi App” for English pronunciation practice?
References

1. Adams, M.J.: Beginning to read: thinking and learning. Psychol. Rev. 65, 197–208 (1990)
2. Ashwell, T., Elam, J.R.: How accurately can the Google web speech API recognize and transcribe Japanese L2 English learners’ oral production? JALT CALL J. 13(1), 59–76 (2017)
3. Corder, S.P.: The significance of learners’ errors. Int. Rev. Appl. Linguist. Lang. Teach. 5, 161–170 (1967)
4. Cucchiarini, C., Neri, A., De Wet, F., Strik, H.: ASR-based pronunciation training: scoring accuracy and pedagogical effectiveness of a system for Dutch L2 learners. In: Proceedings of the International Speech Communication Association Conference, pp. 2181–2184 (2007)
5. Demenko, G., Wagner, A., Cylwik, N.: The use of speech technology in foreign language pronunciation training. Arch. Acoust. 35(3), 309–329 (2010)
6. Derwing, T.M., Munro, M.J., Carbonaro, M.: Does popular speech recognition software work with ESL speech? TESOL Q. 34, 592–603 (2000)
7. Egan, J.P.: Articulation testing methods. Laryngoscope 58, 955–991 (1948)
8. Ehri, L.C., Nunes, S.R., Willows, D.M., Schuster, B.V., Yaghoub-Zadeh, Z., Shanahan, T.: Phonemic awareness instruction helps children learn to read: evidence from the National Reading Panel’s meta-analysis. Read. Res. Q. 36(3), 250–287 (2001)
9. Elimat, A.K., AbuSeileek, A.F.: Automatic speech recognition technology as an effective means for teaching pronunciation. JALT CALL J. 10(1), 21–47 (2014)
10. Eskenazi, M.: Using automatic speech processing for foreign language pronunciation tutoring: some issues and a prototype. Lang. Learn. Technol. 2(2), 62–76 (1999)
11. Evers, K., Chen, S.: Effects of automatic speech recognition software on pronunciation for adults with different learning styles. J. Educ. Comput. Res. 59(4), 669–685 (2021)
12. Guskaroska, A.: ASR as a tool for providing feedback for vowel pronunciation practice. Graduate Theses and Dissertations, Iowa State University, Iowa (2019)
13. Hatch, E., Lazaraton, A.: The research manual: research design and statistics for applied linguistics. Newbury House Publishers, New York (1991)
14. Hide, O., Van de Poel, K.: Interlanguage phonology: implications for a remedial pronunciation course for Chinese learners of English. Antwerp Pap. Linguist. 100, 17–46 (2002)
15. Kim, I.: Automatic speech recognition: reliability and pedagogical implications for teaching pronunciation. Educ. Technol. Soc. 9(1), 322–334 (2006)
16. Li, Y.: Effects of high variability phonetic training on monosyllabic and disyllabic Mandarin Chinese tones for L2 Chinese learners. Doctoral dissertation, University of Kansas, Kansas (2016)
17. Li, M., Han, M., Chen, Z., Mo, Y., Chen, X., Liu, X.: Improving English pronunciation via automatic speech recognition technology. In: 2017 International Symposium on Educational Technology, pp. 224–228 (2017)
18. Liakin, D., Cardoso, W., Liakina, N.: Learning L2 pronunciation with a mobile speech recognizer: French /y/. The CALICO J. 32(1), 1–25 (2015)
19. Neri, A., Mich, O., Gerosa, M., Giuliani, D.: The effectiveness of computer assisted pronunciation training for foreign language learning by children. Comput. Assist. Lang. Learn. 21(5), 393–408 (2008)
20. McCrocklin, S.M.: Pronunciation learner autonomy: the potential of automatic speech recognition. System 57, 25–42 (2016)
21. Mroz, A.P.: Noticing gaps in intelligibility through Automatic Speech Recognition (ASR): impact on accuracy and proficiency. Paper presented at 2018 Computer-Assisted Language Instruction Consortium (CALICO) Conference, Urbana, IL, United States (2018)
22. Oller, J.W., Ziahosseiny, S.M.: The contrastive analysis hypothesis and spelling errors. Lang. Learn. 20, 183–189 (1970)
23. Sidgi, L.F., Shaari, A.J.: The usefulness of automatic speech recognition (ASR) eyespeak software in improving Iraqi EFL students’ pronunciation. Adv. Lang. Literary Stud. 8(1), 221–226 (2017)
24. Sun, H.I.: Effects of the ASR-embedded dictionary app use on college students in EFL pronunciation class. J. Res. Curriculum Instr. 22(6), 400–413 (2018)
25. Thomson, R.I.: Computer assisted pronunciation training: targeting second language vowel perception improves pronunciation. The CALICO J. 28(3), 744–765 (2011)
26. Wardhaugh, R.: The contrastive analysis hypothesis. TESOL Q. 123–130 (1970)
27. Wang, Y.H., Young, S.S.C.: Effectiveness of feedback for enhancing English pronunciation in an ASR-based CALL system. J. Comput. Assist. Learn. 31(6), 493–504 (2015)
28. Zhang, J., Duan, R., Cao, W., Xie, Y.: A preliminary study on ASR-based detection of Mandarin mispronunciation by Japanese learners. In: Proceedings of the International Speech Communication Association Conference, pp. 1478–1481 (2014)
29. Liu, X.: Phonetic transfer in second language acquisition from Mandarin to English vowels (in Chinese). Master Dissertation. East China Normal University, Shanghai (2020)
30. Lin, X.: An empirical study on the effect of ASR technology on high school students’ pronunciation improvement (in Chinese). Master Dissertation. Fujian Normal University, Fujian (2014)
31. Lin, L.: An empirical study on the effect of ASR-assisted pronunciation training software on phoneme teaching to non-English majors (in Chinese). Master Dissertation. Shandong University, Shandong (2012)
32. Wang, H.Y., Heuven, V.: Acoustic analysis of English vowels: a study of English vowel generation in Chinese, Dutch and American English contexts (in Chinese). Foreign Lang. Literatures 4, 226–236 (2013)
33. Weng, X.Y.: An analysis of English vowel pronunciation errors in high school students (in Chinese). Master Dissertation. Shanghai Normal University, Shanghai (2021)
34. Xu, W.H.: A study on the effect of ASR-based online software on phoneme learning of Chinese university students. Master Dissertation. Hunan University, Hunan (2010)
IoT Secure Cloud Enabled Model for Soil Nutrition Monitoring and Fertilizer Suggestion for Agricultural Industry of Sri Lanka

U. H. D. Thinura Nethpiya Ariyaratne, V. Diyon Yasaswin Vitharana, L. H. Don Ranul Deelaka, H. M. Sumudu Maduranga Herath, Anuradha Jayakody, and Narmada Gamage

Sri Lanka Institute of Information Technology, Malabe, Sri Lanka
[email protected], {anuradha.j,narmada.g}@sliit.lk
Abstract. During 2021, the Sri Lankan government suddenly banned the use and import of pesticides and chemical fertilizers for all crops cultivated within the country. Because of this unexpected decision, most of the farmers in the country are experiencing numerous problems. Crop yields have suffered considerably because the plants are not getting the required amount of nutrients from the fertilizers provided by the government. Problems with the harvest can arise if the soil type, pH value, nutrient levels, and the crop that the farmers are planning to cultivate are not considered when preparing the fertilizer. As a result, farmers are complaining about the quality, cost, and lower efficiency of the fertilizers provided by the government. Therefore, there is a need to produce a high-quality, highly efficient, low-cost fertilizer that is suitable for the crop that the farmers are planning to cultivate, considering all the factors mentioned above. Furthermore, the current fertilizer suggestion system within the country needs to be upgraded, as it relies only on the expert knowledge of agricultural officers and past paper-based data. This case study develops a cloud-based, portable, highly efficient, time-saving, and low-power-consuming IoT solution called “FertiHelp” to address this matter. The suggested system obtains the levels of the soil nutrients (Nitrogen, Phosphorus, and Potassium), the pH value, humidity, moisture, and electroconductivity of the soil from sensors, and suggests a fertilizer that can be used for the soil. The farmer is also provided with recommendations for other crops that could be grown in the same soil.

Keywords: pH Value · Fertilizer · Upgrade · Low-cost · IoT Fertilizer Suggestion · Crop Recommendations · Nitrogen · Phosphorus · Potassium · Electroconductivity
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): SAI 2023, LNNS 711, pp. 1434–1449, 2023. https://doi.org/10.1007/978-3-031-37717-4_94
FertiHelp
1 Introduction
Traditionally, Sri Lanka has been mainly an agricultural nation. The rest of the world referred to Sri Lanka as the granary of the East, a reputation it earned through its successful harvests. Sri Lanka’s culinary creations have been exported all over the world. Every Sri Lankan was aware that our forebears practiced a form of organic and natural farming and that they endeavored to maintain a close connection with both nature and learning throughout their history. However, at this point in time, the vast majority of Sri Lankan farmers, who use non-organic agricultural systems, do not have that type of link. It is unfortunate that they are unable to attain the level of agricultural produce and expertise that our forebears had. The population of the nation is to blame for this situation, along with a lack of information about the environment and the characteristics of the soil. In addition, contemporary individuals are interacting with scientific and technological developments.

This applied case study proposes a system called “FertiHelp” that integrates scientific research, technological know-how, and agricultural domain experience. Farmers, agricultural officers, and department heads may boost crop output by using it to make suitable decisions for their activities, obtain an idea of their region’s soil nutrition level and condition, and choose which types of fertilizers suit their fields. The state of the climate, the average temperature, and the qualities of the soil are the primary environmental elements that influence crop output. This case study is therefore based mainly on those elements. While doing this case study, we conducted research to determine which factors affect the growth of plants, what the nutrition and condition of the soil are, and what previous research on fertilizer suggestion has revealed. We also talked to agricultural experts to find out what they currently consider when recommending fertilizers and how they recommend them.

The system suggests a fertilizer according to the soil quality level, and a second mechanism uses the fertilizer and the tested field to suggest a suitable crop. These mechanisms assist farmers in identifying the appropriate crop and growing it without wasting time or money. They also use weather conditions when suggesting the crop and fertilizer because, in Sri Lanka, different crops are suitable for different areas. The system therefore checks the location and the weather conditions before suggesting a crop to farmers.

“FertiHelp” sends data obtained from sensor nodes in the hardware device to the cloud. When there is no internet connection, the data is saved to internal storage and transferred to the cloud once an internet connection is established. Cloud services are highly adaptable and scalable and also provide useful data streams for analytics while maintaining better levels of security. The system can also grow either horizontally or vertically, depending on the requirements. We present some preliminary findings for paddy fields in a particular district in this article. The system allows for large-scale data processing on real-time observation streams from many sources, including sensor networks, weather forecasting services, and so on. As “FertiHelp” is based on the IoT and the cloud, it is imperative that security be considered in both. IoT security is a branch of information technology that focuses on protecting the connected devices and networks that make up the IoT. Beyond that, once data is transferred and saved in the cloud architecture, it is crucial to consider cloud security as well. Fig. 1 illustrates this overall idea.

Fig. 1. Picture Diagram of FertiHelp

1.1 Literature Survey
According to the crop recommendation and fertilizer purchase system, the suggestion is based on the nitrogen (N), phosphorus (P), and potassium (K) levels of the soil. This study leads to the creation of several techniques that can aid in the establishment of an effective predictive algorithm. This system recommends crops as well as fertilizers to farmers in order to increase agricultural productivity. It also enables farmers to order the recommended fertilizers directly from the mobile application. The viability of the soil is determined by its NPK levels. The letters “N”, “P”, and “K” signify the soil’s nitrogen, phosphorus, and potassium levels, respectively. The quality of the soil may be predicted from its NPK level. Crops might have somewhat yellowish leaves if the soil contains lower nitrogen levels, whereas plants would have greenish leaves if nitrogen levels are intermediate or high. The phosphorus concentration in the soil affects the plant’s reproductive system and predicts the development of harvests and blooms. The potassium level is crucial for the plant’s overall growth, as it indicates how sturdy a crop’s roots will be. The authors compared the current values to their data set, which was compiled from a large amount of historical agricultural data. Both results were compared, and the farmer was then advised to choose the crop with the greatest number of points. When proposing the plant to the farmers, this approach uses the random forest algorithm [1].

According to the study “An overview of the Internet of Things (IoT) and data analytics in agriculture: benefits and challenges”, there are constant efforts in agriculture to abandon conventional practices in favor of innovative methods. By employing these new technologies, we can reduce the waste of resources and boost the efficiency of operations to meet the increased food demands that could result from the world’s rapid population growth. Utilizing IoT in agriculture is all about equipping farmers with decision-making tools and automated systems that integrate supply, knowledge, and services to boost production, quality, and revenue.
Recent research on IoT in agricultural systems focuses mostly on the challenges and limits of conducting large-scale agriculture and food distribution chain trials. The preceding study conducted an exhaustive evaluation of IoT in food production, comprising a review of published papers, white papers, and current solutions. An IoT ecosystem for agriculture has four essential components: IoT devices, communication technologies, the Internet, and data storage and computation. The study discusses the application of IoT and data analytics (DA) and how they facilitate smart agriculture. It also examines the benefits, downsides, unresolved issues, emerging trends, and future opportunities [2].
According to the study on using IoT to determine soil pH and recommend fertilizer, soil pH is a significant determinant of crop output. The pH of the soil is a measure of the activity of hydronium ions (often written H+) present in the soil. Evaluation of soil nutrients is essential for a successful harvest, since soil of adequate quality is ideal for producing a rich harvest. The authors relied on soil pH as a reliable predictor of nutrient content. As soil acidity varies, so does the solubility of some metallic ions. It is the concentration of these elements in the soil, rather than the acidity itself, that has a substantial effect on plant growth. The objective of regulating soil pH is therefore not to achieve a certain pH value but to adjust the acidity to a level where no toxic metals are present and nutrient availability is optimal. Typically, this occurs when the soil pH is between 5.8 and 6.5; however, certain crops require more acidic soil [3].
U. H. D. Thinura Nethpiya Ariyaratne et al.
The research on an IoT cloud-enabled model for safe and smart agriculture environments presents a secure cloud-based IoT solution that leverages AWS authentication and authorization. Encryption, unique licenses, and strong cryptographic keys safeguard the shared IoT storage. Using the Mobile Cloud Computing Simulator (MCCSim), the authors build a simulation platform to examine the edge layer’s impact on critical parameters. Authorization and authentication are applied at each security level, for edge computing at both the physical and cloud levels. IoT devices automatically look for a local, accessible edge rather than the enterprise cloud, which speeds up cloud decision-making and minimizes latency. AWSIoTAC includes edge computing. Security between tiers relies on certificates, policies, public and private keys, and encrypted transfer methods. The model has three layers: an AWS cloud layer that provides cloud services and support, an edge layer that caches data, and a physical layer that stores IoT device data. Physical-layer data is routed straight to the edge device for modification and analysis, which saves time and energy until a connection to the AWS cloud is necessary. Every AWS IoT device must authenticate itself using an X.509 device certificate, public and private keys, and the AWS IoT Core key, which provides reliability. Greengrass Core, like AWS IoT Core, has a certificate, a private key, and an AWS IoT Core key. All data flows are carried and protected as MQTT messages over the MQTT protocol [4].
The work on secure smart agriculture monitoring through isolation discusses the agriculture sector and its compelling need for cutting-edge smart farming technology. Farmers are increasingly using sensors to monitor their crops, the soil, and the weather. The acquired information is sent to a central cloud platform, where it is processed and assessed. The results are returned to the farmers so that they can improve their farming techniques.
The same devices, sensors, and actuators will be used by other stakeholders, such as early warning systems for catastrophic events, to provide efficient real-time management. An agrometeorology machine-to-machine (M2M) system can be employed in precision agriculture to control crop diseases, provide alerts, and make weather forecasts. The intended system includes two subsystems, the server subsystem and the telemetry subsystem. In this study, ADCON RTUs transmit the data, which is then shown in Grafana dashboards. As future work, the authors plan to subject the system to a simulated denial-of-service (DoS) attack. Their primary focus will be on using DoS attacks to restrict the amount of data that can be transferred between the RTUs and the server, and to stop a legitimate user, such as a farmer, from accessing the data kept on the server [5].
According to the research on remote sensing and controlling of greenhouse agriculture parameters based on IoT, limited understanding of farm parameters and of emerging innovations is a major challenge in modern agriculture. In the past, growers did not tailor growing conditions to individual plants and instead applied the same conditions to all plants. Technological change in agriculture makes it possible to grow plants in uncommon natural settings, which helps obtain more produce with less compost. Precision agriculture in greenhouses for plant growth has become popular due to
less expensive technologies that let farmers increase productivity. The greenhouse is a transparent, house-like structure that controls temperature, humidity, light, and other factors for optimum plant development. Precision agriculture detects, measures, and responds: it senses the greenhouse climate, sends the data to the cloud, and then takes action based on the data. The precision agriculture framework is improving thanks to IoT Wireless Sensor Networks (WSN). Inconsistent greenhouse climate conditions affect plant growth and yield, so it is important to monitor CO2, soil moisture, temperature, light, and other factors. This issue can be addressed by applying IoT to precision agriculture, integrating precise control of greenhouse parameters such as temperature range, water flow, and light radiation for good plant growth [6].
2 Methodology
As shown in Fig. 2, our secure IoT cloud model for soil nutrition monitoring and fertilizer suggestion, branded “FertiHelp”, has three parts: the IoT device, the AWS cloud infrastructure, and the website. The IoT device collects all the soil data. It is portable and power-saving, and it contains the sensors needed to obtain accurate soil data values. These values are first saved to the SD card for safekeeping. As soon as the IoT device is connected to the cloud platform via the internet, it sends all the collected data to the cloud for processing. Once the data reach the FertiHelp AWS cloud infrastructure, several validation checks are performed on the data packet, and if they succeed, the packet is split into device data and soil data. Device data are redirected to AWS Timestream, a cloud database for time series data. Soil data are redirected to AWS DynamoDB, a NoSQL database. The soil data are then fetched by another function that performs data enrichment, which consists of suggesting the best fertilizer and plant for the recorded soil conditions. After these suggestions are made, they are sent to the database, and the database records are updated to include the enriched values. All of these updates complete within a few seconds. Apart from the AWS services mentioned above, we used several other services to implement FertiHelp; these are discussed in detail later in the paper. We also created a website through which users interact with the system. Soil data and device data can each be viewed on the website in real time. The website is also hosted in the AWS cloud, and API calls are used to fetch data from the soil database and the device database when users request it. The backend of the website is developed using NodeJS, and the frontend application is developed using React.
Fig. 2. High-Level Architecture Diagram
2.1 Implementation of IoT Device for Data Collection
Plant growth and health depend on many factors. Among them, environmental conditions and soil properties have the greatest impact on crop production. We measure the primary nutrients nitrogen (N), phosphorus (P), and potassium (K), as well as soil pH, soil temperature, moisture, and electrical conductivity. Serial I/O is used for communication between each sensor and the ESP8266 board. A predefined hexadecimal offset is sent to the sensor, which returns the output for that offset, and the relevant soil data values are then extracted from that output [9]. To capture the climate conditions, we used the DHT11 sensor, which records the actual temperature and humidity of the soil sample’s surroundings. The location, date, and time of the soil sample are also recorded because they are needed to create the nutrition level map, which helps compare past and future nutrition levels of a given area. We also use GPS data to obtain the exact location of our IoT devices throughout the vicinity. When the soil data are collected, they are shown immediately to the data collecting officer on the LCD display for verification. If the officer is satisfied, the officer presses the data push button, and the collected data are pushed to the cloud and also stored on the SD card as a backup [7,8]. This process is represented in Fig. 3. We used the Arduino IDE to integrate and program all the hardware components. To establish the connection between the IoT devices and the AWS cloud platform, we used the AWS IoT Core service. In IoT Core, we created a new entity called a “thing” for our IoT device. The service can then generate a unique certificate and private key and attach policies for that thing. Those certificates, private keys, and policies help create secure communication between the device and the cloud.
Fig. 3. IoT Device Architecture of FertiHelp
2.2 Implementation of the Cloud Architecture
Amazon Web Services (AWS) is the cloud service provider used to build the “FertiHelp” project. The FertiHelp cloud architecture uses serverless technology because it offers several advantages over typical cloud-based or server-centric systems. Serverless infrastructures give developers increased scalability, better flexibility, and a shorter time to delivery, all at a lower cost. With serverless infrastructures, users do not need to worry about licensing, hosting, or administering backend systems. As shown in Fig. 4, the system uses several web services to manage and process the collected data and to interact with the upstream, which is the IoT device, and the downstream, which is the website and its users. These services include AWS IoT Core, AWS Lambda, and AWS SQS queues, among others. Their usage is discussed further later in the paper. The whole cloud architecture of “FertiHelp” can be divided into four main parts:
1. Soil data management system
2. Device data management system
3. Fertilizer and plant suggestion system management
4. Website management system
When the IoT device sends a data packet in JSON format to the AWS IoT Core service via the MQTT protocol, an AWS Lambda function is triggered. This function validates the data received from the IoT device and, if the validations succeed, separates the soil data from the device data. From this point on, soil data and device data are handled separately.
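The validation and filtering step described above can be sketched in plain Python. This is only an illustrative sketch under stated assumptions: the paper does not list the packet schema, so every field name below (`device_id`, `timestamp`, `nitrogen`, `uptime_s`, and so on) is hypothetical, and the functions stand in for logic that would run inside the Lambda function.

```python
import json

# Hypothetical field names; the paper does not publish the packet schema.
SOIL_FIELDS = ("nitrogen", "phosphorus", "potassium", "ph",
               "soil_temperature", "moisture", "ec")
DEVICE_FIELDS = ("gps", "sd_free_bytes", "uptime_s", "connected")
REQUIRED_FIELDS = ("device_id", "timestamp")


def validate_packet(raw: str) -> dict:
    """Parse a JSON packet and run basic checks; raise ValueError on failure."""
    try:
        packet = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("packet is not valid JSON") from exc
    if not isinstance(packet, dict):
        raise ValueError("packet must be a JSON object")
    for key in REQUIRED_FIELDS:
        if key not in packet:
            raise ValueError(f"missing required field: {key}")
    for key in SOIL_FIELDS:
        value = packet.get(key)
        # bool is a subclass of int, so it is rejected explicitly.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if key in packet and not is_number:
            raise ValueError(f"soil field {key} must be numeric")
    return packet


def split_packet(packet: dict) -> tuple[dict, dict]:
    """Separate soil readings from device status; both keep the common keys."""
    common = {key: packet[key] for key in REQUIRED_FIELDS}
    soil = {**common, **{k: packet[k] for k in SOIL_FIELDS if k in packet}}
    device = {**common, **{k: packet[k] for k in DEVICE_FIELDS if k in packet}}
    return soil, device
```

In such a design, the soil dictionary would be forwarded to the soil queue and the device dictionary to the time-series store, while a packet that raises `ValueError` would be dropped or logged.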
Fig. 4. FertiHelp Cloud Architecture Diagram
The subsystems mentioned above are discussed below to give a better understanding of how cloud services are used in FertiHelp.
Soil Data Management System. Soil data are placed in an AWS SQS queue. This is a first-in-first-out (FIFO) queue, used to limit the rate at which packets are sent to the database at any given time. The queue collects all the soil data, and another Lambda function is configured to pick up these requests one by one and save the data to our soil database, which is created using AWS DynamoDB. This process is represented in Fig. 5.
Fig. 5. FertiHelp Cloud Architecture Diagram - Soil Data Management System
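A minimal sketch of the queue consumer described above is given here. It assumes the standard shape of an SQS-triggered Lambda event, a `Records` list whose entries carry the message text in `body`. The database write is passed in as a plain callable (`save_item`, a hypothetical name) so the sketch runs without AWS credentials; in a deployment it would be replaced by a DynamoDB write.

```python
import json
from typing import Callable


def make_soil_writer(save_item: Callable[[dict], None]):
    """Build a Lambda-style handler that stores each queued soil record.

    save_item is a stand-in for the database write (hypothetical name).
    """
    def handler(event: dict, context=None) -> dict:
        saved = 0
        # Records arrive in queue order; each body is one JSON soil record.
        for record in event.get("Records", []):
            item = json.loads(record["body"])
            save_item(item)
            saved += 1
        return {"saved": saved}

    return handler
```

For example, `make_soil_writer(table.append)` with `table = []` yields a handler that appends each decoded record to an in-memory list, which is enough to exercise the flow locally.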
Device Data Management System. As shown in Fig. 6, device status data are sent directly to our device database, a time-series database created using the AWS Timestream service. This database is configured to save detailed information about the IoT device, such as the device ID, current GPS location, available storage on the SD card, device uptime, and connection status. These data are sent at a high frequency, so the volume is large, and storing all of it in a typical database would require a larger database. In AWS Timestream, we therefore created a data retention policy that removes stale data and keeps only the latest available information for each device, which saves space and makes data retrieval more efficient.
Fig. 6. FertiHelp Cloud Architecture Diagram - Device Data Management System
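The effect of the retention policy described above can be illustrated with an in-memory sketch. This is not the Timestream API and makes no claim about how the authors configured it; it only shows, under assumed record fields (`device_id`, `time` in seconds), what "drop stale data and keep the latest status per device" means.

```python
def apply_retention(records: list[dict], now: float, max_age_s: float) -> list[dict]:
    """Keep only records whose age does not exceed the retention window."""
    return [r for r in records if now - r["time"] <= max_age_s]


def latest_per_device(records: list[dict]) -> dict[str, dict]:
    """Return the most recent status record for each device ID."""
    latest: dict[str, dict] = {}
    for record in records:
        current = latest.get(record["device_id"])
        if current is None or record["time"] > current["time"]:
            latest[record["device_id"]] = record
    return latest
```

A dashboard query would then read from `latest_per_device(apply_retention(...))` rather than from the full history.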
Fertilizer and Plant Suggestion System Management. Figure 7 shows how the soil fertilizer and plant suggestion system is hosted in the cloud architecture. When a soil data entry is added to the soil database, a Lambda function is triggered. The new entry is fetched from the database and placed into an AWS SQS FIFO queue. This queue contains the raw soil data exactly as received from the IoT device. The raw data are sent one by one to the fertilizer and plant suggestion system, which decides the most suitable fertilizer for the soil and determines which other plants could be grown in it. The suggestion system itself also runs in a Lambda function. It draws on the printed reports produced by the Agricultural Department of Sri Lanka and outputs the best fertilizer and plant. After the soil data have been enriched by the suggestion system, they are placed in another AWS SQS queue. This queue holds all the enriched soil data, which are saved sequentially to the soil database by a data picker function created using AWS Lambda. The data picker function updates the corresponding database record with the new output of the fertilizer and plant suggestion system.
Fig. 7. FertiHelp Cloud Architecture Diagram - Soil Fertilizer and Plant Suggestion Management System
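The two-queue enrichment flow described above can be sketched with in-memory queues. This is a simplified illustration, not the deployed implementation: `collections.deque` stands in for the SQS FIFO queues, a dictionary keyed by `(device_id, timestamp)` stands in for the soil database, and `suggest` is a hypothetical callable representing the suggestion function.

```python
from collections import deque
from typing import Callable


def run_enrichment(db: dict, raw_queue: deque,
                   suggest: Callable[[dict], dict]) -> None:
    """Drain the raw queue, enrich each record, then update the database."""
    enriched_queue: deque = deque()

    # Stage 1: the suggestion function consumes raw records in FIFO order.
    while raw_queue:
        record = raw_queue.popleft()
        enriched_queue.append({**record, **suggest(record)})

    # Stage 2: the data picker writes enriched values onto the stored record.
    while enriched_queue:
        record = enriched_queue.popleft()
        key = (record["device_id"], record["timestamp"])
        db.setdefault(key, {}).update(record)
```

Keeping the two stages separate mirrors the design in the text: the suggestion step can be slow or fail without blocking the writes that are already queued.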
Website Management System. The official website of “FertiHelp” is developed using a NodeJS backend and a React frontend. The backend needs to communicate with AWS DynamoDB to obtain the soil information and with the AWS Timestream database to obtain the device information. API gateways are required for these interactions between the website and the cloud databases. Furthermore, the whole web application needs to be hosted on the AWS platform, and the backend must be connected to the frontend through the platform itself. We use the AWS Amplify service to host the web application, as it can create a container and deploy it by automatically provisioning an environment according to the requirements. Separate Lambda functions fetch soil data and device data from the respective databases; these Lambda functions are triggered by the API gateways to bridge the gap between the website and the databases. When users interact with the website, all requests pass through the AWS Application Load Balancer, which helps us manage user traffic efficiently and increases the security of our application. This process is represented in Fig. 8 [10].
Fig. 8. FertiHelp Cloud Architecture Diagram - Website Management System
2.3 Implementation of the Fertilizer and Plant Suggestion System
In the plant suggestion system, we pre-analyzed the soil data using historical data and created a machine learning model that helps suggest a plant suitable for the tested soil area.
Develop the Pre-analysis. As represented in Fig. 9, in the pre-analysis stage we implemented a function that reads and processes data stored in a CSV file. This CSV file contains crops and the nutrition values each crop requires. Inside that function, we created a boxplot to find outliers in the data sets. After finding the outliers, we defined the minimum and maximum values for each nutrition level. The function can also exclude all null values in the data sets. This helps generate accurately analyzed data, which are stored in DynamoDB.
Fig. 9. FertiHelp Plant Suggestion Pre-Analysis Block Diagram
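The pre-analysis steps above (dropping null values, finding boxplot outliers, and deriving a minimum and maximum per nutrition level) can be sketched as follows. The paper does not state which outlier rule was used, so the conventional boxplot rule of 1.5 times the interquartile range is an assumption here, and the function name is hypothetical. Only the Python standard library is used.

```python
import statistics


def nutrient_range(values: list) -> tuple[float, float]:
    """Return (min, max) of a nutrient column after cleaning.

    Nulls are dropped, then values outside the boxplot whiskers
    (1.5 * IQR beyond the quartiles, an assumed rule) are excluded.
    """
    clean = [v for v in values if v is not None]
    q1, _median, q3 = statistics.quantiles(clean, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    kept = [v for v in clean if lower <= v <= upper]
    return min(kept), max(kept)
```

Applied per crop and per nutrient column of the CSV file, this yields the range table that the processing stage can later filter against.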
Develop the Processing Stage. This stage is shown in Fig. 10. Here, we created a Lambda function that fetches the collected soil data from DynamoDB. The function retrieves, through the SQS queue, the latest raw soil data collected by our IoT device. The collected data are stored in separate objects, which are then passed as parameters to the machine learning function. In that function, we filter on each soil nutrition level and find the relevant columns and rows in the database. Using the filtered data, the function defines a nutrition range and computes the median values for the relevant data set. From the finalized data set, the function finds the suitable plant and provides it as output to the database. That value is stored in the database under the combination of device ID, time, and device location. Finally, behind the API gateway we implemented another Lambda function that fetches from the database all the suggestions related to a given device ID. Using those data, the website retrieves the suggested plants for the area selected by the end user.
Fig. 10. FertiHelp Plant Suggestion Process Stage Block Diagram
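One way to read the processing stage above is as a range filter followed by a closeness ranking. The sketch below is an interpretation, not the authors' model: it assumes each crop profile stores a `(min, median, max)` triple per nutrient, keeps crops whose ranges contain every reading, and ranks them by normalized distance to the medians. The crop names and numbers in the usage example are made up for illustration only.

```python
from typing import Optional

# profile: {crop: {nutrient: (min, median, max)}}; structure is an assumption.
Profile = dict[str, dict[str, tuple[float, float, float]]]


def suggest_plant(reading: dict[str, float], profiles: Profile) -> Optional[str]:
    """Return the crop whose nutrient ranges best match the soil reading."""
    candidates = []
    for crop, ranges in profiles.items():
        # Filter: every measured nutrient must fall inside the crop's range.
        in_range = all(low <= reading[name] <= high
                       for name, (low, _med, high) in ranges.items())
        if not in_range:
            continue
        # Rank: total distance to the medians, scaled by each range width.
        score = sum(abs(reading[name] - med) / ((high - low) or 1.0)
                    for name, (low, med, high) in ranges.items())
        candidates.append((score, crop))
    return min(candidates)[1] if candidates else None
```

A call might look like `suggest_plant({"N": 82, "P": 46, "K": 41}, profiles)`, where `profiles` holds the ranges produced by the pre-analysis stage. Returning `None` when no crop matches leaves it to the caller to decide whether to relax the ranges.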
2.4 Implementation of the Website
In our scenario, the IoT device takes data from the tested soil and transfers it to the cloud to generate more information for the end user. The end users, farmers and agricultural officers, need an easy-to-use platform for generating reports or providing information to the farmer of the tested area. A dynamic website is the most suitable solution for the Sri Lankan agricultural office, so we created a dynamic website that provides several functions for end users.
3 Workflow Accuracy, Justifications and Future Work
The case study in this applied research introduces a prototype IoT device that can be used to overcome the previously mentioned constraints. The proposed prototype is designed as a fully cloud-native approach in which the end user has minimal involvement in hardware handling. Given that the proposed solution recommends not only the best fertilizer for the soil but also the best crop, there may be technology-influenced, biased recommendations. Because the fertilizer suggestion system relies on data from the Agricultural Department of Sri Lanka and is not linked to any other data source, it will not provide the most up-to-date recommendations. This limitation is noted as a qualification of the system’s main output in the current implementation. Connecting the proposed solution to other current data sources or to a collaborative suggestion system could improve this in future work. In addition, the web application that comes with the proposed system is currently the only way to interact with the collected data, which can be regarded as an implementation restriction.
4 Conclusion
Agriculture is the mainstay of Sri Lankan society. Many issues raised recently in the Sri Lankan agriculture sector have not yet been addressed. A major one is that there is currently no solution in Sri Lanka that identifies in real time which fertilizer should be used or which plant can be grown according to the soil nutrient levels. Under the current system, farmers have to take a soil sample and hand it over to a laboratory to obtain the fertility levels of the soil, and the results take a long time to arrive. FertiHelp offers a satisfactory answer to these problems faced by farmers in Sri Lanka. It measures soil nutrition and condition using various soil factors, uploads the collected data sets to the cloud, processes those data through a machine learning model, and provides the best-matching fertilizer for the relevant areas. It also suggests other plants that could be grown. In this study, we provide that information accurately and efficiently by obtaining raw soil data from the IoT device and processing it with the help of the cloud and the data sets we have obtained. The main objective of the research is to help improve farmers’ yearly crop production, provide better-quality fertilizer for the field, and suggest the best plant for their fields.
References
1. Shinde, M., Ekbote, K., Ghorpade, S., Pawar, S., Mone, S.: Crop recommendation and fertilizer purchase system. Int. J. Comput. Sci. Inf. Technol. 7, 665–667 (2016)
2. Elijah, O., Rahman, T., Orikumhi, I., Leow, C., Hindia, M.: An overview of Internet of Things (IoT) and data analytics in agriculture: benefits and challenges. IEEE Internet Things J. 5, 3758–3773 (2018)
3. Basak, S., Ratan, M., Roy, J.: IoT in determining soil PH and fertilizer suggestion. (Daffodil International University, 2018)
4. Tawalbeh, M., Quwaider, M., Lo’ai, A.: IoT cloud enabeled model for safe and smart agriculture environment. In: 2021 12th International Conference on Information and Communication Systems (ICICS), pp. 279–284 (2021)
5. Suciu, G., Istrate, C., Diu, M.: Secure smart agriculture monitoring technique through isolation. In: 2019 Global IoT Summit (GIoTS), pp. 1–5 (2019)
6. Pallavi, J., Mallapur, K.: Bendigeri Remote Sensing and Controlling of Greenhouse Agriculture Parameters based on IoT. (Vishwakarma Institute of Technology, 2017)
7. Charles, M.S.: T. Factors Affecting Yield of Crops. (IEEE, 2019)
8. Bendre, M., Thool, R., Thool, V.: Big data in precision agriculture: Weather forecasting for future farming. IEEE (2016)
9. Chun, U.: A nutrient recommendation system for soil fertilization based on evolutionary computation. IEEE (2021)
10. Husamuddin, M., Qayyum, M.: Internet of things: a study on security and privacy threats. In: 2017 2nd International Conference On Anti-Cyber Crimes (ICACC), pp. 93–97 (2017)
Modeling Internet-of-Things (IoT) Behavior for Enforcing Security and Privacy Policies
Anubhav Gupta1, Daniel Campos2, Parth Ganeriwala2(B), Siddhartha Bhattacharyya2, TJ OConnor2, and Adolf Dcosta2
1 University of British Columbia, British Columbia V1V 3C8, Canada
2 Florida Institute of Technology, Melbourne, FL 32901, USA
[email protected]
Abstract. The availability and usage of the Internet of Things (IoT) have grown significantly over the last decade. This growth in ubiquitous computing has enabled continuous observation, decision-making, and execution of actions to improve the livelihood of millions. However, IoT also has an increased cybersecurity risk by making the user vulnerable to attacks from the outside or disclosing information without the user’s knowledge. As a result, it becomes essential to understand the behavior of IoT devices and then model the behavior at a higher level of abstraction that enables automated reasoning, reducing the gap between policy requirements and its analysis. Towards this end, we have developed our proposed framework, which combines policy requirements guided structured experimentation performed on the IoT devices, followed by modeling to store the knowledge for analysis. We leverage the experimental findings to formulate the relationship of IoT devices in the model. The creation of this model enables automated analysis to identify if there is a potential violation of security or privacy. We demonstrate our approach by investigating the behavior of IoT devices and performing analysis to check if the IoT devices execute risky cybersecurity behavior.
Keywords: Ontology · IoT · Cybersecurity · Privacy · Policy Requirements
1 Introduction
With the innovations in communication technologies, hardware designs, software systems, and artificial intelligence, the Internet of Things (IoT) has enabled applications that benefit day-to-day activities. As a result, IoT has become an essential part of our daily lives. It is further envisioned that the future of the internet will consist of heterogeneously connected devices that will further extend the borders of the world with physical entities and virtual components [35]. Although IoT has enhanced our capabilities, it has also increased our vulnerability to a wide variety of cybersecurity attacks. One of the primary reasons IoT devices are more vulnerable is that they operate on wireless networks, which makes it easier for attackers to obtain confidential information by eavesdropping. IoT devices also mostly communicate information without an operator monitoring or observing the data being sent. Given the nature of IoT devices, complex security methodologies are hard to implement [29].
Advancements in IoT-related technologies have increased the availability and affordability of IoT devices. These devices are being used for home automation, fitness, or daily activities by people who are unaware of their vulnerabilities or of the associated privacy and security issues. As a result, it is critical to identify the vulnerabilities posed by IoT. With regard to the private data accessed by IoT devices, the types of data accessed, the types of unauthorized data access possible, and the alarming behavior of IoT devices in collecting data, either consumers lack awareness of the vulnerable behavior these devices execute, or vendors collect information without the consumers’ knowledge. This becomes critical when IoT devices are used for defense or safety-critical operations [18].
One of the major challenges in this area of research is identifying and extracting the information about IoT devices and their composition that is relevant to cybersecurity, so that it can be represented as well-defined, realistic concepts that facilitate computer-based analysis. Towards this end, we discuss our approach, which investigated the creation of a framework that elaborates the process of abstracting important information to perform automated analysis.
In this paper, related work is described in Sect. 2, the research background is discussed in Sect. 3, and our proposed methodology is presented in Sect. 4. Section 5 discusses the policy requirements that guide the experimentation and the subsequent modeling. Section 6 explains the experimental framework. Section 7 discusses the actual model, followed by the automated analysis performed in Sect. 8. Finally, the results are discussed in Sect. 9, with the conclusion in Sect. 10.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): SAI 2023, LNNS 711, pp. 1451–1473, 2023. https://doi.org/10.1007/978-3-031-37717-4_95
2 Related Work
With the exponential increase in the use of domain knowledge applications that require extensive communication of information between systems, models are needed to grasp the structural complexity of these systems and the semantic relationships between their data and devices. To represent these models, ontologies are developed that express data as concepts and formally specify the relationships between them. Ontologies enable the reuse of an agreed-upon collection of domain knowledge [24]. Li et al. [2] established the social attributes of things, evaluated the functionality of relations, which are among the essential social attributes in IoT, and used a super network architecture to portray the complex relationships among physical objects. They incorporated an ontology-based method to represent the relations of objects based on these relations and the relation architecture. Ye et al. [39] presented a top-level ontology model for representing domain knowledge. They introduced concepts for common semantics in information at different levels to support the communication, re-use, and sharing of ontologies between systems. Their work was restricted by the lack of a reasoner to implement the rules it generated. Rahman et al. [28] present a lightweight ontology with dynamic semantics that tackles semantic interoperability difficulties in different IoT implementations. By including only the most frequently used IoT concepts, the ontology significantly decreases query response time and computational resource requirements compared with existing heavyweight, complicated ontologies. Their proposed ontology is an extension of IoT-Lite [21] and an abstraction of the SSN ontology, and it is suitable for real-time IoT environments. The proposed methodology automatically attaches a new node under a certain cluster based on the similarity index, making it suitable for a dynamic environment. However, they have not designed a lightweight generic ontology with dynamic semantics, which has been left as future work.
The Internet of Things (IoT) is an integral part of daily activities. IoT systems are composed of smart devices that interact and transmit data without the need for human involvement. Because of the rapid development and autonomous nature of IoT systems, these devices are exposed and vulnerable to significant risks. Behavior capturing and verification procedures are therefore required to show the trust level of these devices [1]. Wang et al. [37] observed that existing research focuses primarily on device modeling, without considering how information is accessed and used. They discussed the development of a robust description ontology for knowledge representation in the IoT domain and explored how it can be applied to facilitate tasks such as service discovery, testing, and dynamic composition. However, they did not bring information security services into their ontology. Hachem et al.
[16] have focused on addressing the challenges posed by the scalable, heterogeneous nature of IoT devices and have integrated these concepts to formulate a set of ontologies that give a robust description of the devices as well as their functional relationships. They have also modeled the domain of physics, which is the foundation of the IoT, since it allows the approximation and estimation of functionality often represented by things. This approach, however, needed a modeling language that was not too detailed in order to implement their ontology for evaluation. SIoT [9] was developed as an approach to model the social relationships among IoT devices and to let humans discover, select, and use objects and their services in IoT, bringing forward human trust analysis. De et al. [12] presented a semantic modeling approach for different components in an IoT framework. Their model could be integrated into the IoT framework by using automated association mechanisms with physical entities, so that the data could be discovered using semantic search and reasoning mechanisms. Koorapati et al. [20] proposed a method to draw actionable insights through operational analytics. While most IoT research focuses on data from IoT services and sensors, they focus on the variety of metadata that makes up the end-to-end IoT ecosystem. Their study shows how to represent an end-to-end IoT ecosystem using a semantic-based approach such as ontologies.
The SensorData Ontology developed in [38] is built on the Sensor Model Language (SensorML) specifications. The primary focus of SensorML is to provide a robust and semantically tied means of defining processes and processing components. Although the language provides an appropriate modeling approach, it is too detailed for specific use-case ontologies to be modeled and evaluated. Sommestad et al. [31] have developed a cyber security modeling language (CySeMoL) for security threat analysis, which assesses the probability of attacks on modeled systems. It does not assess confidentiality attacks, as confidentiality is of lesser importance in industrial control systems than in IoT. Thus, there is a need to introduce a model-based formal language for modeling IoT systems for information security and privacy analysis.
3 Background
3.1 IoT Vulnerabilities
With the widespread use of IoT devices in both homeland and deployed settings, a range of IoT security threats are relevant. For example, Fitbit fitness trackers can provide geolocation data to foes [18], and WiFi cameras can send unauthorized video from secure areas [11]. The state of IoT is rife with vulnerabilities and poorly designed devices and ecosystems [13,14,19,22,25,26,36]. As such, military, commercial, and civilian users need to be aware of these threats and be equipped with the necessary knowledge, tools, or features to combat them. To this end, research is needed to understand the threat that IoT devices pose and to identify cyber approaches across the full spectrum of operations that can mitigate these threats. This challenge demands further study to understand the interaction among IoT devices and generate cybersecurity defenses, both tactical and strategic, to ensure the secure and effective use of IoT devices. Given the nature of these vulnerabilities, the zero trust philosophy works well in the IoT environment. Significant research efforts have focused on perimeter protection, which is based on the philosophy of trusting insiders and distrusting outsiders of a network, in developing cybersecurity solutions such as firewalls. The philosophy of zero trust, in contrast, focuses on “Never Trust, Always Verify”. Recent research in cybersecurity has emphasized the need to design policies to detect and prevent insider threats in cyber infrastructure. Schultz [30] defines insider attackers as individuals who have been assigned privileges and the authority to execute their responsibilities, but who use that authority to damage the system or steal sensitive data. It is therefore critical to design the system or software with a zero trust philosophy. The behavior of each device can then be checked to identify the needed policies.
3.2 IoT Vulnerability Lifecycle
Rapid firmware development without regard for the software development lifecycle is often the cause of vulnerabilities in IoT. With the relatively low cost of IoT devices, vendors prioritize rapidly delivering a product to market
Modeling Internet-of-Things (IoT) Behavior
1455
and selling subscription-based services over software maintenance and development [33]. A 2019 study by Parker Thompson examined over three million IoT binaries and found that vendors are failing to implement basic hardening features, including decades-old best practices [33]. Our experiences echo this concern. In the course of our experiments, we identified and reported six Common Vulnerabilities and Exposures (CVEs) for the Geeni and NightOwl branded WiFi camera doorbells [3–8]. In each case, we informed the vendor of the vulnerability and its root cause, and offered countermeasures to protect against future attacks. Geeni mitigated a single vulnerability with a firmware update [7] and discontinued the product line associated with the three other vulnerabilities [4–6]. NightOwl did not respond to any of our reports about the vulnerabilities in their product lines [3,8]. Although these vulnerabilities remain unpatched, no warning appears in either companion application to inform users that they are using a vulnerable or discontinued product line.
3.3 Ontology
Ontologies can model data as concepts and formally identify the associations among them. We can leverage ontologies to create well-defined concepts that help researchers exchange information about a particular domain. In addition, ontologies provide reusability for an agreed-upon set of domain knowledge [24]. Several languages are used to develop ontologies, for example, Resource Description Framework (RDF), Ontology Inference Layer (OIL), Unified Modeling Language (UML), and Web Ontology Language (OWL) [10,17]. Protégé is a graphical environment that can be used to create an ontology. Protégé also supports the use of the OWL language [23]. Cybersecurity ontologies can be used to detect software vulnerabilities and their relationships. Undercoffer et al. [27] have been working on an Intrusion Detection System (IDS) ontology. The IDS ontology analyzed more than 4000 classes of computer attacks and how these attacks can be performed. The IDS ontology was then extended by Finin et al. [32] to include information integration and to write general rules, and it became the Unified Cybersecurity Ontology (UCO).
4 Proposed Methodology
Research indicates that modeling behavior of IoT is important in order to formulate cybersecurity policies. In this research effort, we come up with a proposed methodological framework (Fig. 1) that helps to model the behavior of IoT for generation of cybersecurity policies. We begin by generating “Policy Requirements” for the IoT devices that can help to increase user trust and privacy. For instance, a user might be interested in knowing if the IoT device used by him/her interacts with a cloud based server which is located in a country outside his/her country of residence. After the requirements have been generated, we go ahead and “Select” IoT devices that will be used for experimentation. We then
develop the experimental framework and perform experiments on the selected IoT devices.
[Figure 1: diagram with the labeled stages Generate Policy Requirements, Select IoT Devices, Develop Experimental Framework, Generate a MindMap, Generate IoT Ontology, Formulate Formal Queries, and Validate Cybersecurity Policies.]
Fig. 1. Proposed Methodological Framework.
The experiments performed on the selected IoT devices aid the process of generating requirements for the Mind Map. A Mind Map is a diagram used to visually organize information. The information gathered from the experiments is transferred to the Mind Map, which organizes it visually and makes it much easier to understand. It lists all the entities along with the relationships and associations between them. Using the entities and relationships, we derive a formal IoT ontology model that supports automated reasoning and inference. These experiments, along with the policy requirements and device selection, help to formulate queries that are executed on the ontology to infer knowledge from the ontological model. The outcome of the query reasoning helps to validate the policies and detect any violations. Our contribution in this research is the formulation of an innovative policy-guided experimental framework to capture IoT interactions. We have also developed a method to label network data to classify IoT security vulnerabilities when identifying the behavior of IoT devices. Furthermore, we demonstrated the modeling of the behavior of IoT devices at a higher level of abstraction, obtained from experimental outcomes that were pertinent to security and privacy policies.
5 Generation of Policy Requirements that Guide the Model Development
One of the main focuses of this research effort is to aid the process of understanding the need for policies that help increase IoT trust and privacy. The aim is to develop a knowledge base that models the behavior of IoT devices for automated analysis, guided by already identified vulnerabilities related to security and privacy. For instance, in 2018, the US military announced that it would reexamine its security policies after fitness tracker data shared on social media revealed bases and patrol routes [18]. Many similar
data privacy breaches have raised serious concerns regarding security policies with respect to IoT. Through this research effort, we aim to develop a policy-guided approach to cybersecurity that helps identify the relevant elements, and the relations among them, that need to be modeled so that possible policy violations can be captured if present in the model. To do so, we begin with the generation of policy requirements, which fall into four categories:
– Interaction: Does the system interact with external systems (systems located outside a defined geographical location)?
– Server Ownership: Who owns the server with which the system is interacting continuously?
– Frequency: How often does the communication/packet sharing happen?
– Size and Type (Secured or Unsecured) of Data: What is the size and type of data that is being shared by the system?
Policies pertaining to “Interaction” primarily focus on the interactions between the system and the outside world. They help to identify whether the system is interacting with any external servers and whether those servers are trustworthy. For instance, a user might be interested in knowing if the IoT device used by him/her interacts with a server located outside his/her country of residence. Similarly, network packet monitoring might be of importance, and knowing the periodicity of communication can reveal important insights about the system. Moreover, knowledge about the “Type and Size of Data” being shared by the device also helps to draw important inferences about the security of the system. Knowing whether the shared data is secured or unsecured is important, since it helps to detect and prevent security vulnerabilities within the system. 
The knowledge about “Server Ownership” is also useful, as the user might be interested in knowing whether the device interacts with a server owned by the device manufacturer or whether the company outsources it to a third-party cloud provider. All these policy requirements not only increase user trust and privacy but also help to identify any violations or security breaches.
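The four requirement categories above can be read as fields of a record plus a handful of checks. The following is a minimal sketch under that reading, not the authors' implementation: the class name, field names, thresholds, and the example host are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointObservation:
    """One observed server endpoint for a device (all names are illustrative)."""
    host: str
    external: bool        # Interaction: outside the defined geographic boundary?
    first_party: bool     # Server Ownership: owned by the device vendor?
    flows_per_day: float  # Frequency: how often communication happens
    bytes_shared: int     # Size of the shared data
    secured: bool         # Type of data: secured (True) or unsecured (False)


def policy_violations(obs, allow_external=False, allow_third_party=True,
                      max_flows_per_day=None, require_secured=True):
    """Return the list of policy categories that this observation violates."""
    found = []
    if obs.external and not allow_external:
        found.append("interaction: external server")
    if not obs.first_party and not allow_third_party:
        found.append("server ownership: third-party server")
    if max_flows_per_day is not None and obs.flows_per_day > max_flows_per_day:
        found.append("frequency: too many flows")
    if require_secured and obs.bytes_shared > 0 and not obs.secured:
        found.append("data: unsecured transfer")
    return found


# Hypothetical observation; the host name is a placeholder, not a real endpoint.
example = EndpointObservation("api.vendor.example", external=True,
                              first_party=True, flows_per_day=60.0,
                              bytes_shared=2048, secured=False)
print(policy_violations(example))
```

Keeping each category as an explicit field makes it straightforward to map the record onto ontology data properties later.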
6 Experiments
Through the experimentation process, this study aims to develop an abstract representation of the behavior of an IoT device that can be represented as an ontology, so that we can build a knowledge base for automated reasoning to guide the generation of secure policies. The IoT Lab at Florida Tech was set up and configured to perform experiments on fifty-six different types of IoT devices. As part of an ongoing experiment to automate device flow labels in Florida Tech’s IoT Lab, IoT traffic was captured over a period of seven days on fifty-six devices. All the communication between the devices inside the lab and the outside world (network traffic entering and exiting the lab) was saved as packet capture files on a monitoring server
while vendor APIs were periodically queried for event labels. For this study, we have particularly chosen Smart Doorbell IoT device because we believe that it will allow us to model an OWL representation which can be further generalized to represent any type of multimedia IoT device.
[Figure 2: pipeline diagram of the experimental framework with the labeled stages PowerConnect Port Mirror, Raw Packet Data, TShark (Packet to JSON), logstash (Processed JSON), and Elasticsearch Database.]
Fig. 2. Packets are Ingested and Processed into a Query-Able Format for Analysis.
Figure 2 above represents the experimental framework. To start with, a port mirror is used to generate packet data in raw format. Since the experiments were performed inside the university, the port mirror copied the packets and allowed for a transparent packet capture, thereby emulating a home environment. This approach allowed us to capture network packets passively and transparently from a setup that used the same network components. For interactive analysis, captured data is fed to Wireshark’s command line processor, TShark (a network protocol analyzer), which converts the captured packet data into JSON objects. These JSON objects are then passed to Logstash (a server-side data processing pipeline in the Elastic Stack), which buffers and post-processes the ingested data. This results in network packet data in the form of processed JSON objects that can be stored inside a database and queried for analysis. Finally, this information is stored in an Elasticsearch database so that it can be analyzed in Kibana (a query and data visualization dashboard for Elasticsearch), which allows us to perform aggregations and visualizations on each device separately and independently.
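To illustrate the "packet to JSON to processed JSON" step of the pipeline, the sketch below flattens one packet object into a small record suitable for indexing. It is an assumption-laden illustration: the sample packet is hand-written to mimic the nested `_source.layers` shape of TShark's JSON output, the exact field names can differ between TShark versions, and `flatten_packet` is a hypothetical helper rather than part of the authors' Logstash configuration.

```python
import json

# Hand-written sample shaped like TShark JSON output (addresses are
# documentation-range placeholders, not captured data).
SAMPLE = json.loads("""
[{"_source": {"layers": {
    "frame": {"frame.len": "74"},
    "ip": {"ip.src": "192.0.2.10", "ip.dst": "198.51.100.7"},
    "tcp": {"tcp.srcport": "51514", "tcp.dstport": "443"}
}}}]
""")


def flatten_packet(packet):
    """Reduce one nested packet object to a flat, query-friendly record."""
    layers = packet["_source"]["layers"]
    # Pick whichever transport layer is present, if any.
    transport = next((p for p in ("tcp", "udp") if p in layers), None)
    ports = layers.get(transport, {}) if transport else {}
    return {
        "length": int(layers["frame"]["frame.len"]),
        "src": layers.get("ip", {}).get("ip.src"),
        "dst": layers.get("ip", {}).get("ip.dst"),
        "transport": transport,
        "dst_port": int(ports[f"{transport}.dstport"]) if transport else None,
    }


records = [flatten_packet(p) for p in SAMPLE]
print(records[0])
```

Flat records of this kind are what make per-device aggregations (flows per endpoint, bytes per endpoint) simple to express in a dashboard.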
[Figure 3: the three doorbells, labeled Geeni Doorpeek, Ring Pro 1st Gen, and NightOwl Video Doorbell.]
Fig. 3. Doorbells used for Experimentation.
For this particular study, three different doorbells (Fig. 3) were selected for analysis based on “availability”, “observed maturity”, and “vendor popularity”. The devices were obtained by purchasing through popular online retailers and
physical stores. To represent major vendors, a Ring Video Doorbell Pro (1st generation) was selected. We also selected a Geeni Doorpeek 1080p Wireless Doorbell and a NightOwl Video Doorbell, as these were widely available. We term NightOwl an “immature” vendor because the experiments revealed that it employs bad practices that are widespread in the industry, possibly due to a lack of experience in the IoT security domain.
6.1 Hardware Analysis
Fig. 4. A Disassembled View of Doorbell to Identify High-Level Hardware Components. Firmware can also be Downloaded when Applicable to Identify Network Flows.
Vendors have a wide variety of hardware components that they can integrate into the final product. Generally, only major components are publicly identified, obscuring common dependencies or features that may exist across multiple devices. For this paper, we disassembled the three doorbells to identify common hardware patterns and components. We discovered that the NightOwl and Geeni doorbells shared the most components. Both utilized a “HiSilicon” based ARM processor and an “SPI-based” flash chip for storing the operating system software. Alongside these components, we discovered a “UART” serial console and a small battery to keep the system online during intermittent power outages. All devices contained a removable camera module and a USB-based WiFi module, as well as a USB port for powering the device without a doorbell transformer.
– Firmware Analysis: During our analysis, we discovered a few calls from the HiSilicon based devices which we could not map to specific features. We previously identified that these devices utilized an ARM processor and an SPI flash chip for storing code. We hypothesized that an SPI programmer could download the device firmware for offline analysis. Furthermore, if no code signing or boot verification was found, we could modify and upload new firmware to perform dynamic analysis on the devices.
To perform the chip read and write, we utilized a CH341a SPI programmer as indicated in Fig. 4. The programmer is available from several online retailers with a non-invasive clip that allows chip interaction without removing components. Firmware was downloaded, then extracted and analyzed using popular tools such as Binwalk and Binary Ninja. Additionally, both devices contained telnet daemons and hard-coded admin credentials for remote management, which we overwrote to allow local access for dynamic analysis. To simplify explanations in the following sections, we define the following categories of requests:
– Demographically Internal - Network requests which stay within a country’s borders.
– Demographically External - Network requests which leave a country’s borders. These requests have legal implications for how data is handled.
– First Party - Any server or endpoint which stays under a vendor’s root domain. For example, Ring subdomains es.ring.com and lpd.ring.com are considered first party.
– Third Party - Any server or endpoint which exists under another company’s root domain. Tuya Smart, a popular IoT platform, hosts its endpoints under its own domain. This approach is cost-effective, but can introduce unintended weaknesses, as many developers either re-use existing code solutions or utilize a turnkey provider to develop the code base for them.
During our analysis, we discovered that all of our HiSilicon based devices stored ARM binaries in an embedded SPI-based flash chip which varied from 4 to 8 MB in size, alongside an open and accessible UART console. SPI-based flash chips use standard read and write interfaces which can be accessed using external hardware. We leveraged this interface and a CH341a SPI programmer to download each device’s firmware. 
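The four request categories defined above lend themselves to a mechanical labeling step. The sketch below is one hypothetical way to do it and is not taken from the paper: `root_domain` uses a deliberately naive "last two labels" rule (it ignores multi-label public suffixes such as co.uk), and the country codes passed to `demographic` are assumed to come from some external geolocation lookup.

```python
def root_domain(hostname):
    """Naive root domain: the last two DNS labels of the hostname."""
    labels = hostname.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def party(hostname, vendor_root):
    """Label an endpoint as first or third party relative to the vendor's root domain."""
    return "first" if root_domain(hostname) == vendor_root else "third"


def demographic(server_country, home_country):
    """Label a request as demographically internal or external."""
    return "internal" if server_country == home_country else "external"


# es.ring.com and lpd.ring.com sit under Ring's root domain.
print(party("es.ring.com", "ring.com"))
# A Tuya endpoint is third party from the point of view of any other vendor.
print(party("a2.tuyaus.com", "ring.com"))
# Country codes here are illustrative inputs, not measured locations.
print(demographic("NL", "US"))
```

A production labeler would likely replace the two-label heuristic with a public suffix list, since vendors can register domains under multi-label suffixes.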
During this process, we discovered that none of the HiSilicon based devices utilized any form of code signing or image verification, which allowed us to explore additional investigative techniques using static and dynamic analysis. Dynamic analysis was performed by modifying the device software directly. Using the SPI programmer, we flashed modified images which gave us unrestricted access to the devices by resetting the root password and enabling telnet. Using this technique, we successfully identified previously unknown services on the NightOwl and Geeni doorbells, which allowed us to map the spontaneous NightOwl UDP connections to P2P brokers and to identify the types of information sent by Geeni requests.
6.2 Generalized Observations
As indicated in Table 1, we discovered variations on APIs contacted, hosting providers used, the use of encryption, and the demographic presence of all three
devices. Most notably, Ring appeared to utilize services that were both demographically internal and first-party in nature. We contrasted this with the Geeni doorbell, where services appeared to be completely third-party in nature, but demographically internal. Night Owl used a mix of first and third party services. We observed similar patterns in the use of hosting providers when first party sources were used. All vendors used Amazon Web Services almost exclusively, but Night Owl included an additional hosting provider for their first party sources.
Table 1. Experimental Observations
Device | First/Third Party | Host/Vendor | Demographic Presence
Ring Video Doorbell Pro | First | Amazon AWS | Internal
Geeni Doorpeek | Third | Tuya, AWS | Internal
Nightowl Doorbell | Mixed | First Party - Amazon AWS, Liquid Web; Third Party - Unknown/Private | Mixed
Ring transmitted the largest amount of data, with Night Owl following closely. These transmissions occurred primarily on event boundaries and were destined for identified CDN endpoints, a pattern consistent with audio and video uploads. In contrast, the Geeni doorbell sent the least data: only small amounts, which could coincide with single-frame uploads. We also observed several time-bounded flows which repeatedly occurred when dealing with update and API endpoints.
6.3 Generalized IoT Environments
For the following sections, specific environment traits are compared with the typical IoT model which separates CDNs, API, and Core servers from users and end devices. For all data tables, DNS and NTP traffic was filtered out. While both are necessary for device function, neither contributes significantly to any of our findings.
Ring Doorbell Pro. The Ring Doorbell recorded a total of 341 event labels over the course of the experiment. Of these events, 33 were generated by physically ringing the doorbell, and 308 were generated by triggering the device motion sensor. Alongside these events, we identified a total of 1,720 flows. 648 flows were performed over UDP connections to what appeared to be CDN server pools. 4 TCP connections were made to lpd.ring.com and were long-lived in nature, mirroring Core Server/idle connection functionality for on-demand events. 419 TCP connections were made to fw.ring.com, performing what we believe were firmware update checks. LPD and FW connections together transferred roughly 31 MB of data in a 7-day period (26.10 MB and 4.66 MB, respectively; Table 2), suggesting that these endpoints carry API data exclusively.
The endpoint es.ring.com was resolved using the DNS alias event-sink-gw.ring.com. We observed the creation of 250 TCP connections which occurred around event boundaries. Only one connection stayed open for more than 3 s, with a large majority closing within 0–1 s. Most data was transmitted to unresolved addresses over 399 TCP and 648 UDP connections, for 1 GB and 5 GB of data respectively. For both categories of data, we observed connection times typically ranging from 10 to 60 s, mirroring typical media upload behavior.
Table 2. Ring Doorbell Experimental Observations.
DNS | Encrypted | Load Balanced | TCP/UDP Flows | Total Data Sent/Received | Purpose
lpd.ring.com | Y | Y | 4/0 | 26.10 MB | API
es.ring.com | Y | Y | 250/0 | 142.78 MB | API
fw.ring.com | Y | Y | 419/0 | 4.66 MB | Updates
N/A | Y | Y | 399/648 | 1.08 GB TCP/5.20 GB UDP | CDN/Unk
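As a quick cross-check of the flow counts reported for the Ring doorbell, the per-endpoint TCP/UDP figures from Table 2 can be summed and compared with the 1,720 flows stated in the text. The dictionary below simply transcribes the table; the variable names are illustrative.

```python
# (TCP flows, UDP flows) per endpoint, transcribed from Table 2.
RING_FLOWS = {
    "lpd.ring.com": (4, 0),
    "es.ring.com": (250, 0),
    "fw.ring.com": (419, 0),
    "N/A (CDN/unknown)": (399, 648),
}

tcp_total = sum(tcp for tcp, _ in RING_FLOWS.values())
udp_total = sum(udp for _, udp in RING_FLOWS.values())

print(tcp_total, udp_total, tcp_total + udp_total)
```

The same per-endpoint layout can be reused for the other two devices, which makes it easy to compare how each vendor splits traffic between API, update, and media endpoints.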
206 IP addresses were returned as possible endpoints for es.ring.com, versus 352 possible endpoints for the CDN-styled endpoints. We contrast this with LPD, which only returned 19 possible endpoints. We hypothesize that a significantly larger pool of media-oriented endpoints exists to handle the larger amount of transmitted media. ES and LPD scale down accordingly, which appears to correlate with the amount of data handled. Table 2 summarizes the behaviors observed during the experimentation process.
NightOwl Doorbell. The NightOwl doorbell recorded 76 total events. Of these events, 29 were generated by physically pressing the button, while 47 were generated by triggering the device motion sensor. While the doorbell press events were close to what we observed with the Ring doorbell, the motion events were drastically reduced. We hypothesize that NightOwl’s decision to include a dedicated motion sensing hardware component caused this reduction. Many motion sensors were triggered by lab lights periodically cycling on and off. PIR based sensors were not activated by these events, mirroring the behavior observed in the NightOwl doorbell, whose motion events corresponded with in-person movement. NightOwl spread connections over 4 first-party domains and 6 unresolved endpoints. Of the 4 first-party domains, both host.nightowl domains are hosted by Liquid Web LLC, while update and cloud-storage are hosted on Amazon Web Services. Of the 6 unresolved endpoints, 5 are used to facilitate P2P streaming of the camera’s video feed. These endpoints are distributed globally, with 2 endpoints located in China, 1 in the Netherlands, and the remainder in the United States. The remaining unknown TCP endpoint hosts an insecure web server used to generate push notification events. The endpoint update.nightowlsuper.com also hosted an insecure web server used to indicate when a software update was available. Over the course of the experiment, the NightOwl doorbell connected twice to this endpoint. The two host.nightowldvr endpoints (host.nightowldvr07.com and host.nightowldvr08.com) maintained long-lived UDP connections which periodically transmitted data. The sent data was not deeply investigated, but very few bytes changed between packets, indicating a lack of, or poorly implemented, encryption. The only encrypted connections were made to cloud-storage.nightowlconnect.com, which was hosted on Amazon Web Services. This endpoint appears to connect exclusively over port 443/TCP, which is commonly used by web-based APIs. We observed 124,126 unique connections to this endpoint, with connection times ranging from 0 s to 2 days and an average of 3 s. Furthermore, almost 1 GB of data was transferred during our experiment, indicating its potential use as a CDN in addition to carrying API data. All other connections were either made over insecure HTTP or used a poor implementation on top of an unknown protocol. Table 3 summarizes the observations and lists the important data flows.
Table 3. NightOwl Doorbell Experimental Observations
DNS | Encrypted | Load Balanced | TCP/UDP Flows | Total Data Sent/Received | Purpose
cloud-storage.nightowlconnect.com | Y | Y | 124126/0 | 916.04 MB | API/CDN
host.nightowldvr07.com | N | N | 0/3 | 69.11 MB | Unk
host.nightowldvr08.com | N | N | 0/3 | 69.12 MB | Unk
update.nightowlsuper.com | N | Y | 2/0 | 2.16 KB | Update
N/A | N | N | 98/15 | 123.03 KB TCP/7.84 KB UDP | PNS/P2P
Geeni Doorpeek Doorbell. The Geeni doorbell recorded a total of 232 events. Of these events, 38 corresponded to physical button presses, and 194 corresponded to motion events. This is in line with the events reported by the Ring doorbell and matches our previous hypothesis that the NightOwl reduction is caused by the hardware sensor. Geeni also reports changing light conditions as motion, bringing its numbers closer to those of Ring, which uses a similar technology. However, Ring still reported roughly 100 additional events that the Geeni doorbell failed to detect (341 versus 232 in total).
Table 4. Geeni Doorpeek Doorbell Experimental Observations
DNS | Encrypted | Load Balanced | TCP/UDP Flows | Data | Purpose
aws3nat.tuyaus.com | Y | Y | 0/3 | 4.20 KB | P2P
aws4nat.tuyaus.com | Y | Y | 0/3 | 3.17 KB | P2P
a2.tuyaus.com | Y | Y | 633/0 | 3.40 MB | API
m2.tuyaus.com | Y | Y | 39/0 | 6.59 MB | API
ty-us-storage30 | Y | Y | 193/0 | 13.22 MB | CDN
N/A | Y | Y | 38/0 | 1.94 MB | CDN
We did not detect any first-party domains for the Geeni doorbell. Instead, we observed the direct use of Tuya endpoints, AWS Storage gateways, and two direct AWS S3 endpoints over 903 TCP flows. Under Tuya, we discovered two endpoints, aws4nat and aws3nat, which are used to assist in P2P applications
like video and audio streaming. These endpoints account for 6 UDP flows out of the 909 total. The 193 TCP connections to ty-us-storage30 and the 38 unknown connections resolve to Amazon Web Services S3 endpoints, which are used to upload and retrieve data in the cloud. We also observe that the sum of both comes out to 231 unique connections, 1 short of our reported event count. This bolsters our theory on their use as CDN endpoints and indicates that the device uploads to these endpoints only on event boundaries. Connections to m2.tuyaus.com are exclusively MQTT, which is traditionally used for long-lived idle connections carrying real-time commands to end devices. This mirrors Ring’s use of lpd.ring.com for holding long-term connections to control devices. Finally, a2.tuyaus.com connects exclusively over port 443, leveraging fast, on-demand connections for small bursts of data over 633 short-lived connections. This is consistent with firmware update checks, event data, or API requests. Table 4 summarizes the experimental observations.
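The arithmetic behind the Geeni observations can be reproduced directly from Table 4: the TCP and UDP flow totals, and the comparison between CDN upload connections and recorded events. The snippet below transcribes the table; the tuple layout and variable names are illustrative choices.

```python
# (purpose, TCP flows, UDP flows) per endpoint, transcribed from Table 4.
GEENI_FLOWS = {
    "aws3nat.tuyaus.com": ("P2P", 0, 3),
    "aws4nat.tuyaus.com": ("P2P", 0, 3),
    "a2.tuyaus.com": ("API", 633, 0),
    "m2.tuyaus.com": ("API", 39, 0),
    "ty-us-storage30": ("CDN", 193, 0),
    "N/A": ("CDN", 38, 0),
}
GEENI_EVENTS = 232  # 38 button presses + 194 motion events

tcp_total = sum(tcp for _, tcp, _ in GEENI_FLOWS.values())
udp_total = sum(udp for _, _, udp in GEENI_FLOWS.values())
cdn_uploads = sum(tcp for purpose, tcp, _ in GEENI_FLOWS.values() if purpose == "CDN")

print(tcp_total, udp_total, cdn_uploads, GEENI_EVENTS - cdn_uploads)
```

The near one-to-one match between CDN connections and events is the basis for reading these endpoints as per-event upload targets.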
7 Model
An ontology framework provides the capability to formally represent different kinds of knowledge. A knowledge framework can be constructed and formalized for a specific ontology by leveraging an ontology language. OWL, endorsed by the W3C, is one of the most recent developments in standard ontology languages. We use Protégé 5.5.0 [23], a free open-source ontology editor and knowledge base construction tool, to develop an IoT framework and use automated reasoning to infer new knowledge from existing knowledge. It consists of different built-in elements such as the Classes tab, Object Properties tab, Data Properties tab, Individuals tab, Rules tab, etc. Protégé’s most important feature is that it can invoke a DL reasoner that provides classification, consistency checking, and policy validation within the knowledge framework. In this work, we use Protégé to formally model and represent the requirements. Graphically listing the requirements through the Mind Map to generate an ontology is left as future work. We further create instances in order to build a knowledge base that can be reasoned over using the HermiT reasoner [15]. The inferences drawn from reasoning assist in validating the need for policies. To begin with, every OWL ontology contains a superclass known as the “Thing” class; all other classes are subclasses of it. The IoT device ontology’s Thing class has five child classes, namely “IoTDevice”, “server”, “softwareFeatures”, “data”, and “hardwareComponents”, as seen in Fig. 5. It is noteworthy that all five classes are disjoint with each other. The “IoTDevice” class is further decomposed into subclasses as represented in Fig. 6. The “smartDoorBell”, “smartDoorLock”, and “smartWatch” classes are child classes that inherit all the features and properties of the “IoTDevice” class. 
We mainly focus on the “smartDoorBell” and create instances with data properties for each of the vendors we experimented with. The data properties that have been generated during the experimentation process are used to formulate the knowledge
Fig. 5. OWL Thing Class, Data Properties, and Object Properties are Omitted.
Fig. 6. OWL IoT Device Class, Data Properties and Object Properties are Omitted.
base within the IoT Ontology. We can then perform automated reasoning on this knowledge base for consistency checking and validation. We try to list all important data properties that we believe will help in drawing insightful inferences. The “IoTDevice” class is disjoint with the “server” class, indicating that a server cannot be an IoT device and vice versa. The “IoTDevice” is associated with the “server” through the “isConnectedTo” object property, indicating that each IoT device is connected to at least one server. The “server” class represents a cloud-based server that provides connectivity to IoT devices for data storage and other types of communication such as encryption-decryption, TLS handshaking, etc. It might be of relevance to know the location of the server with which the IoT device is interacting. Moreover, it is also important to know whether the server is owned by the manufacturing company itself or is outsourced. Every instance of a server has data property attributes such as “encryptedConnection”, “portNumber”, “serverLocation”, “thirdPartyorFirstParty”, “demographicallyExternalorInternal” and “serverType” that define its behavior. Figure 7 represents the “softwareFeatures” class, which is a subclass of “Thing” and is disjoint with the “IoTDevice” class. It represents various software features that are found in smart doorbells. The features have been grouped into three broad categories: “securityFeatures”, “internetFeatures” and “smartFeatures”. Each category is further subdivided, and the subcategories are indicative of the features present in the doorbells. Security features such as “loadBalancing”, “eventEncryption”, etc. make the device and communication secure. Similarly, the Internet features such as “streaming”, “imageRecognition”, etc.
Fig. 7. OWL Software Features Class, Data Properties and Object Properties are Omitted.
indicate the general behavior of the doorbell. Some doorbells incorporate smart features such as “soundRecognition”, “thermalImageRecognition”, etc. Every doorbell has a unique behavior and may contain one or more of these features. The data properties are formulated so that a query can indicate whether a particular doorbell supports a feature. For instance, we can query to find out if the “Ring” doorbell performs “eventEncryption” or “dataEncryption”. The “hardwareComponents” class shown in Fig. 8 is disjoint with the “IoTDevice” class and represents the various hardware components that are present inside the doorbell. The majority of doorbells have a “button”, “microprocessor”, “speaker”, “camera” and “microphone”. Some doorbells, such as smart doorbells, provide additional components such as “heatSenor”, “motionSensor”, “thermalImageRecognitionSensor” and “networkModule”. Depending on the components present inside a doorbell, we have created instances of each and listed the data properties that help to create the knowledge base. Similarly, the “Data” class as shown in Fig. 5 represents the data that is sent and received by the IoT device. It has attributes to store data properties such as the type and size of data, which are essential to infer knowledge and come up with secure policies. The “smartDoorBell” class is connected to the “Data” class through an object property assertion “sendsAndReceives”. In the next section, we discuss the outcomes and results of automated reasoning, which helps to infer and list some interesting findings.
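The class structure described in this section (five mutually disjoint children of “Thing”, with “IoTDevice” and “softwareFeatures” decomposed further) can be mimicked in a few lines of plain Python. This is only a sketch to make the disjointness constraint concrete: it is not OWL, it is not how Protégé stores the ontology, and the helper names are hypothetical.

```python
# Child -> parent links taken from the class names given in the text.
PARENT = {
    "IoTDevice": "Thing", "server": "Thing", "softwareFeatures": "Thing",
    "data": "Thing", "hardwareComponents": "Thing",
    "smartDoorBell": "IoTDevice", "smartDoorLock": "IoTDevice",
    "smartWatch": "IoTDevice",
    "securityFeatures": "softwareFeatures",
    "internetFeatures": "softwareFeatures",
    "smartFeatures": "softwareFeatures",
}
# The five children of Thing are pairwise disjoint.
DISJOINT_ROOTS = {"IoTDevice", "server", "softwareFeatures", "data",
                  "hardwareComponents"}


def ancestors(cls):
    """Superclasses of cls, nearest first, ending at Thing."""
    chain = []
    while cls in PARENT:
        cls = PARENT[cls]
        chain.append(cls)
    return chain


def root_of(cls):
    """The disjoint top-level class that cls falls under, if any."""
    for candidate in [cls] + ancestors(cls):
        if candidate in DISJOINT_ROOTS:
            return candidate
    return None


def is_consistent(asserted_types):
    """An individual may not fall under two different disjoint top-level classes."""
    roots = {root_of(t) for t in asserted_types} - {None}
    return len(roots) <= 1


print(ancestors("smartDoorBell"))
print(is_consistent({"smartDoorBell", "server"}))
```

A DL reasoner performs a far more general version of this check, but the example shows why asserting an individual as both a doorbell and a server is flagged as an inconsistency.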
Fig. 8. OWL Hardware Components Class, Data Properties and Object Properties are Omitted.
8 Automated Reasoning in Protégé
Reasoning within an ontology helps draw inferences and can be achieved using built-in reasoners. In its standard distribution, Protégé includes a variety of reasoners. HermiT is available in Protégé through a reasoner plug-in, which we use for checking ontology consistency. Reasoning helps infer new knowledge from the existing knowledge within an ontology. Classification is one of the most widely known uses of an automated reasoner, and it is invoked as soon as we run the HermiT reasoner in Protégé. The classification process involves three steps. Firstly, the reasoner checks whether there exists a structure that satisfies all axioms of the OWL ontology (consistency). If no such structure exists, it classifies the model as inconsistent and returns a warning. All inconsistencies have to be rectified in order to move forward and perform reasoning. Secondly, for each class name A that occurs in the model, it checks whether there exists a structure that satisfies all the axioms and in which A has an instance, say y (satisfiability). Lastly, for any two class names, say X and Y, that occur in the model, the reasoner tests whether each instance of X is also an instance of Y (subsumption). The results of these tests are shown in the Protégé OWL editor in the form of the inferred class hierarchy.
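The consistency and subsumption tests can be approximated, for intuition only, by the Python sketch below. It is a toy check over explicit subclass and disjointness axioms, not the algorithm HermiT implements. The disjointness pairs follow the class descriptions given earlier, while the axiom that “smartDoorBell” is a subclass of “IoTDevice” is an assumption made for illustration.

```python
# Toy approximation of two reasoner checks; HermiT itself is far more general.
SUBCLASS_OF = {
    "IoTDevice": {"Thing"},
    "server": {"Thing"},
    "softwareFeatures": {"Thing"},
    "hardwareComponents": {"Thing"},
    "smartDoorBell": {"IoTDevice"},  # assumed for illustration
}
DISJOINT = [
    ("IoTDevice", "server"),
    ("IoTDevice", "softwareFeatures"),
    ("IoTDevice", "hardwareComponents"),
]


def superclasses(cls):
    """All classes that subsume cls, including cls itself (transitive closure)."""
    seen, stack = {cls}, [cls]
    while stack:
        for parent in SUBCLASS_OF.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_subsumed_by(x, y):
    """Subsumption test: is every instance of x also an instance of y?"""
    return y in superclasses(x)


def is_consistent(asserted_types):
    """asserted_types maps each individual to the set of classes asserted for it."""
    for types in asserted_types.values():
        inferred = set().union(*(superclasses(t) for t in types))
        if any(a in inferred and b in inferred for a, b in DISJOINT):
            return False
    return True
```

Under these assumptions, an individual asserted to be both a “smartDoorBell” and a “server” makes the model inconsistent, because “IoTDevice” and “server” are disjoint.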
Fig. 9. Inferred Knowledge for Geeni Instance using HermiT Reasoner.
For our model, Fig. 9 graphically depicts the inferred class type for the instance geeni. Even though we do not explicitly specify its type, the reasoner, based on the inherited knowledge, classifies it as a “smartDoorBell”.
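One common way a reasoner arrives at such a type is through the domain and range of an object property. The sketch below illustrates that mechanism in Python. The domain and range chosen for `hasFeature` are assumptions, since the axioms that trigger the inference in our model are not listed here, and `infer_types` is a hypothetical helper rather than a Protégé or HermiT call.

```python
# Assumed axioms: the domain and range of hasFeature are illustrative guesses.
DOMAIN = {"hasFeature": "smartDoorBell"}
RANGE = {"hasFeature": "softwareFeatures"}


def infer_types(assertions):
    """Derive class memberships from (subject, property, object) assertions."""
    inferred = {}
    for subject, prop, obj in assertions:
        if prop in DOMAIN:
            # The subject of the property must belong to its domain class.
            inferred.setdefault(subject, set()).add(DOMAIN[prop])
        if prop in RANGE:
            # The object of the property must belong to its range class.
            inferred.setdefault(obj, set()).add(RANGE[prop])
    return inferred


types = infer_types([("geeni", "hasFeature", "softwareFeature1")])
```

With the single assertion shown, the sketch types geeni as a “smartDoorBell” without any explicit class assertion, which mirrors the behavior described above.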
Fig. 10. Property Assertions and Inferred Property Knowledge for Geeni Instance using HermiT Reasoner.
Similarly, Fig. 10 graphically depicts the property assertions and the inferred properties for the instance geeni. We see the hasFeature softwareFeature1 assertion made for the instance geeni, and the HermiT reasoner automatically infers has softwareFeature1. Furthermore, Fig. 11 details the inferred class type for the instance softwareFeature1. This inference encompasses the IoT device feature properties, which enables query-based analysis to identify potential violations of security or privacy policy requirements. Similar inferences are drawn using the built-in reasoner, which also checks for inconsistencies; these are removed before we proceed with querying. After classification using the HermiT reasoner, we perform automated reasoning on the IoT Ontology framework using SPARQL [34] queries to infer knowledge from the existing knowledge base. SPARQL [34], being a Resource Description Framework (RDF) query language, allows information stored in the ontology in RDF format to be retrieved and manipulated. SPARQL querying allows for more powerful deductive reasoning than OWL alone and provides similarly strong formal guarantees when performing inferences. OWL ontologies can be used with different web query and rule languages such as SPARQL, SWRL, RuleML, etc. We prefer SPARQL over SWRL and other languages because it is used more extensively for automated reasoning and is supported by the latest version of Protégé. SPARQL allows querying of the entire knowledge base, which is a set of “subject-predicate-object” triples. It specifies four query forms for specific purposes: 1. SELECT query, 2. CONSTRUCT query, 3. ASK query and 4. DESCRIBE query.
Fig. 11. Inferred Knowledge for SoftwareFeature1 Instance using HermiT Reasoner.
The “SELECT” query extracts raw values from a SPARQL endpoint, the “CONSTRUCT” query transforms the extracted information into valid RDF, the “ASK” query returns a true/false answer, and the “DESCRIBE” query also extracts an RDF graph from the SPARQL endpoint, but the content of that graph is left to the endpoint maintainer to decide. To query our model we mainly use the “SELECT” and “ASK” queries. The queries are executed against the knowledge base: a “SELECT” query returns a list of tuples of ontology values that satisfy the query, and an “ASK” query returns a Boolean value.
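To make the difference between the “SELECT” and “ASK” forms concrete, the following Python sketch matches triple patterns against a small in-memory set of triples. It is only an illustration of the semantics and not a SPARQL engine; the triples mirror a few rows of Table 5, and the functions `match`, `select` and `ask` are hypothetical names introduced for this example.

```python
# Minimal triple-pattern matcher illustrating SELECT and ASK semantics.
TRIPLES = {
    ("ring", "isConnectedTo", "server2"),
    ("server2", "thirdPartyorFirstParty", "First"),
    ("server2", "serverLocation", "USA"),
    ("nightOwl", "isConnectedTo", "server3"),
    ("server3", "serverLocation", "China"),
}


def match(patterns, binding=None):
    """Yield variable bindings satisfying every pattern; '?x' terms are variables."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in TRIPLES:
        candidate = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, candidate)


def select(variables, patterns):
    """SELECT: return the sorted tuples of values bound to the requested variables."""
    return sorted(tuple(b[v] for v in variables) for b in match(patterns))


def ask(patterns):
    """ASK: return True if at least one solution exists."""
    return next(match(patterns), None) is not None
```

For example, `select(["?name", "?location"], [("?name", "isConnectedTo", "?server"), ("?server", "serverLocation", "?location")])` lists each device with its server location, whereas `ask([("ring", "isConnectedTo", "server2"), ("server2", "thirdPartyorFirstParty", "First")])` only reports whether the pattern holds.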
9 Results
In this section, we present the results of policy requirements that were validated through SPARQL reasoning and querying. As discussed in the previous sections, we began by listing and categorizing the policy requirements, which guided the experimental flow. The experiments discussed in Sect. 6 helped in providing the formal requirements for the ontology. We formally modeled these requirements after identifying the various components, attributes and their associations. After modeling, automated reasoning helped to check for inconsistencies and flaws within the framework, which were rectified. As the last step, the four main policies listed in Sect. 5 were validated through querying. Table 5 lists the different categories of policy requirements along with the SPARQL queries and their outcomes. We first list the policy requirement for “Interaction”, wherein the user might be interested in knowing whether the system interacts with systems that are located outside a defined geographical region. The result of the query lists the names of the countries hosting the servers with which the system interacts and shares information. We then list the formal query for “Server Ownership”, wherein the user might be interested in knowing the company that owns the server. The “ASK” query returns a Boolean value, here indicating that the ring doorbell interacts with a first-party server owned by the company itself. Next, the frequency policy requirement is translated into a formal representation. The result of the query lists the frequency of interaction between the doorbell and the server to which it is connected. For instance, the ring doorbell continuously interacts with
Table 5. Results of SPARQL Query

Requirement: 1. Interaction
Query: SELECT ?name ?server ?Demographics ?Location WHERE { ?name test:isConnectedTo ?server . ?server test:demographicallyExternalorInternal ?Demographics . ?server test:serverLocation ?Location }
Outcome (name, server, Demographics, Location):
1. ring, server2, “Internal”, “USA”
2. geeni, server1, “Both”, “USA, Belgium”
3. nightOwl, server3, “External”, “China”

Requirement: 2. Server Ownership
Query: ASK { test:ring test:isConnectedTo test:server2 . test:server2 test:thirdPartyorFirstParty "First"^^xsd:string }
Outcome: True

Requirement: 3. Frequency
Query: SELECT ?name ?server ?FrequencyOfInteraction WHERE { ?name test:isConnectedTo ?server . ?server test:frequencyofInteraction ?FrequencyOfInteraction }
Outcome (name, server, FrequencyOfInteraction):
1. ring, server2, “continuous”
2. geeni, server1, “periodic”
3. nightOwl, server3, “discrete”

Requirement: 4. Size and Type (Secured or Unsecured) of Data
Query: SELECT ?name ?data ?SizeOfData ?TypeOfData ?Security WHERE { ?name test:sendsAndReceives ?data . ?data test:totalSize ?SizeOfData . ?data test:typeOfData ?TypeOfData . ?data test:dataEncryption ?Security }
Outcome (name, data, SizeOfData, TypeOfData, Security):
1. geeni, dataGeeni, “13.22mb”, “Multimedia”, “Unsecured”
2. ring, dataRing, “26mb”, “JSON”, “Secured”
3. nightOwl, dataNight, “51mb”, “Multimedia”, “Both”
the server to share information. Lastly, a user might be interested in knowing some properties of the data (size, type, and security) that is being shared by the device with the server. In total, we listed around fifty formal queries from different categories. The outcomes of these queries not only help to infer knowledge and identify potential security vulnerabilities, but can also help to validate the existing policies. Identifying security violations and potential vulnerabilities is important because it will help to formulate new policies in the future that are more secure and ensure user privacy. For instance, users might want to restrict interactions with servers that are located outside a defined geographical region. This becomes especially important when the devices are being used for special tasks or by special forces deployed by national agencies. To ensure this, a policy can be formulated that restricts any interaction with the outside world. We believe the proposed formalization and framework will help to bridge the existing knowledge gap and lack of awareness.
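The geographical restriction discussed above can be checked directly against the outcome of the “Interaction” query. The Python sketch below does this for the rows reported in Table 5. The allowed region `{"USA"}` is a hypothetical policy chosen for illustration, and `violations` is an assumed helper name, not part of our framework.

```python
# Outcome rows of the "Interaction" query, copied from Table 5.
ROWS = [
    ("ring", "server2", "Internal", "USA"),
    ("geeni", "server1", "Both", "USA, Belgium"),
    ("nightOwl", "server3", "External", "China"),
]


def violations(rows, allowed_countries):
    """Map each device to the server countries that fall outside the allowed set."""
    flagged = {}
    for name, _server, _demographics, location in rows:
        # A location value may list several countries separated by commas.
        countries = {c.strip() for c in location.split(",")}
        outside = countries - set(allowed_countries)
        if outside:
            flagged[name] = sorted(outside)
    return flagged


# Hypothetical policy: only servers located in the USA are permitted.
result = violations(ROWS, {"USA"})
```

Under this assumed policy the sketch flags geeni (Belgium) and nightOwl (China) and leaves ring unflagged, which matches the “Demographics” column of Table 5.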
10 Conclusion
In this research effort, we hypothesized and formulated a framework that enables capturing relevant information regarding the behavior of IoT devices guided by requirements focusing on capturing security vulnerabilities. We then employed the captured experimental knowledge to develop an ontological representation at a higher level of abstraction to perform automated reasoning. Our effort also demonstrated the integration of experimental knowledge with formal modeling to enable reasoning at a level of abstraction closer to requirements. It includes modeling objects and associations for automated reasoning. The models allow
future research efforts to focus on context-based mitigation of security vulnerabilities by including human-machine interactions.
Acknowledgments. This material is based upon work supported in whole or in part with funding from the Office of Naval Research (ONR) contract #N00014-20-1-2798. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the ONR and/or any agency or entity of the United States Government.
References
1. Ali, J., Khalid, A.S., Yafi, E., Musa, S., Ahmed, W.: Towards a secure behavior modeling for IoT networks using blockchain. CoRR, abs/2001.01841 (2020)
2. Ali, L., Ye, X., Ning, H.: Thing relation modeling in the internet of things. IEEE Access, 5 (2017)
3. Anonymous and Anonymous. CVE-2020-2871. Available from MITRE, CVE-ID CVE-2020-2871, November 24 2020
4. Anonymous and Anonymous. CVE-2020-28998. Available from MITRE, CVE-ID CVE-2020-28998, November 24 2020
5. Anonymous and Anonymous. CVE-2020-28999. Available from MITRE, CVE-ID CVE-2020-28999, November 24 2020
6. Anonymous and Anonymous. CVE-2020-29000. Available from MITRE, CVE-ID CVE-2020-29000, November 24 2020
7. Anonymous and Anonymous. CVE-2020-29001. Available from MITRE, CVE-ID CVE-2020-29001, November 24 2020
8. Anonymous and Anonymous. CVE-2021-31793. Available from MITRE, CVE-ID CVE-2021-31793, March 24 2021
9. Atzori, L., Iera, A., Morabito, G.: SIoT: giving a social structure to the internet of things. 15(11), 1193–1195 (2011)
10. Bechhofer, S.: OWL: Web Ontology Language, pp. 2008–2009. Springer US, Boston (2009)
11. The Conversation. Hackers can access your mobile and laptop cameras and record you - cover them up now (2020)
12. De, S., Barnaghi, P., Bauer, M., Meissner, S.: Service modelling for the internet of things, p. 7 (2011)
13. Dooley, E.: ADT Technician Pleads Guilty to Hacking Home Security Footage, January 2021
14. Fereidooni, H., Frassetto, T., Miettinen, M., Sadeghi, A.-R., Conti, M.: Fitness trackers: fit for health but unfit for security and privacy. In: 2017 IEEE/ACM International Conference on Connected Health: Applications, Systems and Engineering Technologies (CHASE), pp. 19–24. IEEE (2017)
15. Glimm, B., Horrocks, I., Motik, B., Stoilos, G., Wang, Z.: HermiT: an OWL 2 reasoner. J. Autom. Reason. 53, 245–269 (2014)
16. Hachem, S., Teixeira, T., Issarny, V.: Ontologies for the internet of things. In: Proceedings of the 8th Middleware Doctoral Symposium on - MDS 2011, pp. 1–6. ACM Press (2011)
17. Horrocks, I., Patel-Schneider, P.F., Van Harmelen, F.: From SHIQ and RDF to OWL: the making of a web ontology language. Web Semantics Sci. Serv. Agents World Wide Web 1(1), 7–26 (2003)
18. Hsu, J.: Strava data heat maps expose military base locations around the world. WIRED (2018)
19. Janes, B., Crawford, H., OConnor, T.J.: Never ending story: authentication and access control design flaws in shared IoT devices. In: Security and Privacy Workshops (SPW), pp. 104–109. IEEE, San Francisco (2020)
20. Koorapati, K., Pandu, R., Ramesh, P.K., Veeraswamy, S., Narasappa, U.: Towards a unified ontology for IoT fabric with SDDC. J. King Saud Univ. - Comput. Inf. Sci. 34(8, Part B), 6077–6091 (2022)
21. Maria Bermudez-Edo, P.B., Elsaleh, T., Taylor, K.: IoT-lite: a lightweight semantic model for the internet of things. In: International Conferences on Ubiquitous Intelligence & Computing, pp. 1–8 (2016)
22. Mitev, R., Pazii, A., Miettinen, M., Enck, W., Sadeghi, A.-R.: LeakyPick: IoT audio spy detector. In: Annual Computer Security Applications Conference, pp. 694–705 (2020)
23. Musen, M.A.: The Protégé project: a look back and a look forward. AI Matters 1(4), 4–12 (2015)
24. Noy, N.F., McGuinness, D.L., et al.: Ontology development 101: A guide to creating your first ontology (2001)
25. OConnor, T.J., Enck, W., Reaves, B.: Blinded and confused: uncovering systemic flaws in device telemetry for smart-home internet of things. In: ACM Conference on Security and Privacy in Wireless and Mobile Networks (WiSec), pp. 140–150. ACM, Miami (2019)
26. OConnor, T.J., Jesse, D., Camps, D.: Through the spyglass: toward IoT companion app man-in-the-middle attacks. In: Cyber Security Experimentation and Test (CSET), Virtual Event, August 2021. USENIX
27. Pinkston, J., Undercoffer, J., Joshi, A., Finin, T.: A target-centric ontology for intrusion detection. In: Proceeding of the IJCAI-03 Workshop on Ontologies and Distributed Systems. Acapulco, August 9th, Citeseer (2004)
28. Rahman, H., Hussain, I.: A light-weight dynamic ontology for internet of things using machine learning technique. ICT Express 7(3), 355–360 (2021)
29. Roman, R., Zhou, J., Lopez, J.: On the features and challenges of security and privacy in distributed internet of things. Comput. Netw. 57(10), 2266–2279 (2013)
30. Schultz, E.: A framework for understanding and predicting insider attacks. J. Comput. Secur. 21(1), 526–531 (2002)
31. Sommestad, T., Ekstedt, M., Holm, H.: The cyber security modeling language: a tool for assessing the vulnerability of enterprise system architectures. 7(3), 363–373 (2013)
32. Syed, Z., Padia, A., Finin, T., Mathews, L., Joshi, A.: UCO: a unified cybersecurity ontology. In: Workshops at the Thirtieth AAAI Conference on Artificial Intelligence (2016)
33. Thimpson, P.: Binary Hardening in IoT products, August 2019
34. W3C. SPARQL: Query Language (2013)
35. Wang, P., Valerdi, R., Zhou, S., Li, L.: Introduction: advances in IoT research and applications. Inf. Syst. Front. 17(2), 239–241 (2015). https://doi.org/10.1007/s10796-015-9549-2
36. Qi, W., Datta, P., Yang, W., Liu, S., Bates, A., Gunter, C.A.: Charting the attack surface of trigger-action IoT platforms. In: Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pp. 1439–1453 (2019)
37. Wang, W., De, S., Toenjes, R., Reetz, E., Moessner, K.: A comprehensive ontology for knowledge representation in the internet of things. In: 2012 IEEE 11th International Conference on Trust, Security and Privacy in Computing and Communications, pp. 1793–1798. IEEE (2012)
38. Wei, W., Barnaghi, P.: Semantic annotation and reasoning for sensor data. In: Barnaghi, P., Moessner, K., Presser, M., Meissner, S. (eds.) EuroSSC 2009. LNCS, vol. 5741, pp. 66–76. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-04471-7_6
39. Ye, J., Stevenson, G., Dobson, S.: A top-level ontology for smart environments. Pervasive Mob. Comput. 7(3), 359–378 (2011)
Author Index
A Abdulai, Mohammed 1240 Abi Akl, Hanna 808 Abu-Khadrah, Ahmed 980 Afzal, Ismail 34 Agada, Ruth 1240 Aguilar, Daniel Pete M. 1058 Ahmad, Jawad 1216 Ali, Ali Mohd 980 Alkasassbeh, Mouhammd 352 Almseidin, Mohammad 352 Al-Qerem, Ahmad 980 Alyami, Mohammed 69 AlZyoud, Faisal 699 Amira, A. 1333 Anilkumar, Saurav 734 Anyiawe, Orieh Destiny 7 Apeko, Jewel Donkor 1227 Araghi, Tanya Koohpayeh 651 Arai, Kohei 111 Araújo, André 188 Ariza-Colpas, Paola Patricia 586, 598 Ascanio, Ronald Alexander Vacca 598 Ashe, Austin 1250 Athavale, Rishi 16 Azam, Muhammad Awais 613
B Ball, Edward 960 Banik, Shipra 202 Barranco-Gutiérrez, Alejandro Israel 623 Barraza-Castillo, Ramón Iván 623 Bass, Michael 831 Beloff, Natalia 69 Berquedich, Mouna 639 Bhattacharyya, Siddhartha 1450 Bhunu Shava, F. 1345 Biswas, Rajarshi 1116 Boateng, Kwame Osei 86
Bosman, Anna Sergeevna 428 Brown, Dane 307 Brown, Timothy 742 Burdett, Eric 742 But-Aziz, Shariq 586, 598 C Cabaleiro, José C. 1149 Camaya, Tricia 1240 Campos, Daniel 1450 Carvalho, Marco M. 565 Catterall, Stephen 226 Chan, Felix T. S. 289 Chapman, Thomas 1240 Chebak, Ahmed 639 Chen, Zichao 1022 Chinchilla, Leidys del Carmen Contreras 598 Choi, Vince Sing 250 Chrétien, Stéphane 274 Chrysikos, Alexandros 226 Chung, S. H. 289 Cibula, Matej 753 Clement, Mark 742 Comai, Sara 1396 Couto, Henrique 188 D Dafalla, Zubeir Izaruku 1274 Damadi, Saeed 408 Dcosta, Adolf 1450 De Silva, Malithi 307 De Villiers, Johan Pieter 428 del Carmen Contreras Chinchilla, Leidys 586 Deng, Tiantai 1, 960 Dimitrova, Vesna 1288 Dinardo, Keir 1216 Don Ranul Deelaka, L. H 1434
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): SAI 2023, LNNS 711, pp. 1473–1476, 2023. https://doi.org/10.1007/978-3-031-37717-4
du Toit, Jaco 1139 Du Toit, Jaco 1185 Duneld, Martin 161 Duy, Dung Nguyen 684 Dwyer, Catherine 1309 E Ehatisham-ul-Haq, Muhammad 613 Eybers, Sunet 865 Eyobu, Odongo Steven 1000
F F. Pena, Tomás 1149 Fairbank, Michael 322 Fawcett, David 7 Feng, Ziheng 1099 Fernández-Fuentes, Xosé 1149 Finocchi, Jacopo 1396 Fredy, Justin 1263 Frison, Lilli 459 Fugini, Mariagrazia 1396 Fujimoto, Stanley 742 Fumić, Roko 668 G Gamage, Narmada 1434 Gamundani, A. M. 1345 Ganeriwala, Parth 1450 Gashi, Adriatik 472 Gavin, Bill 960 George, Nisha 1116 George, R. 734 Ghafarian, Ahmad 1263 Gjorgjievska Perusheska, Milena 1288 Gomez-Martinez, Meritxell 1378 Gorban, Alexander N. 210 Graham, D’Nita Andrews 905 Grieser, Gunter 472 Grivault, Ludovic 948 Gupta, Anubhav 1450 H Hartmann, A. 1345 Haslhofer, Bernhard 712 Hassan, Mohammad R. 980 He, James K. 147 Henriksson, Aron 161 Hergenroether, Elke 1079 Hergenröther, Elke 459, 472 Hessenauer, Perry 831 Hood, Stephen 742 Horng, Shi-Jinn 340 Hsieh, Meng-Yen 968
I Ibrar, Kainat 613
J Jalali, Anahid 712 Jaleel, M. 1333 Jayakody, Anuradha 1434 Jeremiah, Rolston 1240 Jetwiriyanon, Jittarin 1099 Joshi, Chetan 742 K Kamach, Oulaid 639 Kanyama, M. N. 1345 Kasmire, J. 493 Kaupenjohann, Lukas 459 Kayem, Anne V. D. M. 1167 Kehinde, T. O. 289 Kermarrec, Yvon 948 Kibria, B. M. Golam 202 Kiwelekar, Arvind 398 Kooij, Robert E. 517 Kopecky, Sandra 1309 Kriglstein, Simone 712 L Le Berre, Pierre 948 Lemoudden, Mouad 1216 Lesley, Lutendo 52 Li, Kuan-Ching 968 Li, Xiu 161 Lian, Xing Ming 111 Lin, Hua Yi 968 Liu, Yulin 785, 1022 M Mahamunkar, Geetanjali 398 Malekmohamadi, H. 1333 Malik, Zeeshan Haider 34 Manuja, A. 734 Marik, Radek 753 Mata, Gabe 532
Mathew, A. 734 Matos, Paulo 1361 Mavrogonatou, Lida 147 Megías, David 651 Megyesi, Beáta 161 Meinel, Christoph 1167 Méndez-Gurrola, Iris Iddaly 623 Menner, Marietta 841 Mestre, Maribel Romero 598 Migliorelli, Carolina 1378 Mitarchuk, Volodimir 274 Mnkandla, Ernest 52 Mohabuth, Abdool Qaiyum 886 Morales-Ortega, Roberto-Cesar 586, 598 Moreno, Ana Ibañez 919 Mots’oehli, Moseli 428 Mtshemla, Mkhululi 1185 Murtaza, Fiza 613 Mutz, Marcel 1116 Muwumba, Mukakanya Abel 1000 N Nembhard, Fitzroy D. 565 Netak, Laxman 398 Neumann, Niels M. P. 517 Ngubiri, John 1000 Nguyen, Dinh Cong 684 Nguyen, The Cuong 684 nouri, Erfan 408 Nouri, Jalal 161 Novais, Paulo 1361 Nunoo-Mensah, Henry 86 O OConnor, TJ 1450 Oliveira, Pedro Filipe 1361 Orehovački, Tihomir 668 Orte, Silvia 1378 Ou, Keting 1410 Ouahabi, Nada 639 P Parsons, David 853 Perry, Xiao 365 Phan, Dinh Hung 684 Phillipson, Frank 517 Piñeres-Melo, Marlon Alberto 586, 598 Pirsiavash, Hamed 408
Pisheh Var, Mahrad 322 Podlesny, Nikolai J. 1167 Prasomphan, Sathit 1127 Price, Joseph 742 Q Quentel, Paul 948
R Rafique, Sehrish 613 Ramírez Reyes, Abdiel 623 Ramsey, Austin 7 Rauber, Andreas 712 Ravi, Indrajitrakuraj 226 Rawat, Danda B. 768, 1202 Rawson, Marshall 129 Rawson, Michael G. 129 Recario, Reginald Neil C. 1058 Redrouthu, Sathvik 16 Reinhardt, Julian 1079 Richards, Dwight 1240 Rigby, Robert 226 Rodriguez-Bonilla, Andres Felipe 586, 598 Rodriguez-Bonilla, Ileana 586 Rodriguez-Parra, Diego Armando 586 Rohrer, Tobias 459 Roka, Sanjeev 768 Rosales, Andrea 651 Ros-Freixedes, Laura 1378 Rossi, Elisa 1396 Roth, Yehuda 577 Rumac, Mateo 668 S Sabir, Afsheen 34 Salamea-Palacios, Christian 262 Samonothai, Satayu 1127 Samothrakis, Spyridon 322 Samraj, Andrews 1274 Savy, Laurent 948 Scharf, Katrin 459 Schmeelk, Suzanna 175 Schoeman, Lily 865 Segrera, Daniel 742 Sharrab, Yousef 699 Siddiqui, Bassam 34 Siegel, Melanie 1079 Sistach-Bosch, Laura 1378
Soares, Rendrikson 188 Soleimani, Arash 873 Sorenson, Lawry 742 Sowells-Boone, Evelyn R. 940 Stasinopoulos, Dimitrios 226 Sturm, Fabian 1079 Sumudu Maduranga Herath, H. M 1434 Sun, Yutong 1022 Sureshkumar, S. P. 734 Suriyachay, Earn 1127 Sutton, Oliver 210
T Taghva, Kazem 250 Tam, Le Nhan 684 Tamasri, Jiratchakit 1127 Tarawneh, Monther 699 Thinura Nethpiya Ariyaratne, U. H. D 1434 Thurner-Irmler, Julia 841 Times, Valéria 188 Turner, Carlene Buchanan 1250 Turner, Claude 1227, 1240, 1250 Tyukin, Ivan Y. 210 U Usigbe, Charles 365 V van de Weijer, Jeroen 1410 van der Westhuizen, Carl 1185 Varun, V. V. 734 Velcin, Julien 274 Villar, Sofía S. 147 Vitharana, V Diyon Yasaswin 1434 Voigt, Erik 841 Vojnovikj, Petar Smilevski 1079 von Solms, Basie 1139
W Wang, Hongyan 1410 Wang, Xiaochen 1 Wattanachote, Kanoksak 1099 Werth, Dirk 1116 Wezeman, Robert S. 517 Whitaker, Jessica 1202 White, Martin 69 Wilson, Michael 86 Woodson, Michael 1043 Wu, Yongchao 161 X Xiao, Haiyan 1410 Xuan, Quang Nguyen 684 Y Yan, Feng 1 Yan, Jie 1240 Yasin, Amanullah 613
Z Zegrari, Mourad 639 Zhang, Jane 1043 Zhang, Luyao 785, 1022 Zhang, Yufan 1022 Zhao, Ying 532, 552 Zhou, Charles 532 Zhou, Charles C. 552 Zhuang, Cheng-En 340 Zumba-Narváez, Edison 262 Zumba-Narváez, Fernando 262